Universal Content Fingerprinting (UCFP)

UCFP (Universal Content Fingerprinting) is an open-source framework for generating unique, reproducible, and meaning-aware fingerprints for text, audio, image, video, and document payloads. It unifies exact hashing, perceptual similarity, and semantic embeddings in one pipeline, so developers can match content byte-for-byte, by near-duplicate similarity, or by meaning. The crates described here implement the text pipeline with binary passthrough; the other modalities are on the roadmap. Built in Rust for speed and reliability, UCFP targets use cases such as deduplication, plagiarism detection, data cleaning, content provenance, and multimodal search.

Why UCFP?

Deterministic ingest – strict metadata validation, canonical IDs, and consistent whitespace normalization keep upstream feeds clean.
Reproducible canonical text – Unicode NFKC, lowercasing, punctuation stripping, token offsets, and SHA-256 digests are exposed as standalone helpers.
Perceptual fingerprints – rolling-hash shingles, winnowing, and MinHash signatures make similarity search and near-duplicate detection straightforward.
Semantic embeddings – ufp_semantic turns canonical text into ONNX/API-backed dense vectors with deterministic fallbacks for offline tiers.
Single entry point – the root ucfp crate wires every stage into process_record, process_record_with_perceptual, and process_record_with_semantic, so applications can adopt the full pipeline one call at a time.
Built-in observability – plug in a PipelineMetrics recorder to capture latency and results for ingest, canonical, perceptual, and semantic stages.
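
To make the perceptual bullet concrete, here is a minimal MinHash sketch. It is not UCFP's implementation: `seeded_hash`, `minhash_signature`, and `estimate_jaccard` are hypothetical names, and the standard library's `DefaultHasher` stands in for whatever hash family `ufp_perceptual` actually uses (its algorithm is not guaranteed stable across Rust releases, so it is unsuitable for persisted fingerprints).

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

/// Hash one token under a given seed (illustrative stand-in hash).
fn seeded_hash(token: &str, seed: u64) -> u64 {
    let mut hasher = DefaultHasher::new();
    seed.hash(&mut hasher);
    token.hash(&mut hasher);
    hasher.finish()
}

/// MinHash signature: for each seed, keep the smallest hash over all tokens.
fn minhash_signature(tokens: &HashSet<&str>, num_hashes: usize) -> Vec<u64> {
    (0..num_hashes as u64)
        .map(|seed| {
            tokens
                .iter()
                .map(|token| seeded_hash(token, seed))
                .min()
                .unwrap_or(u64::MAX)
        })
        .collect()
}

/// The fraction of matching slots approximates the Jaccard similarity.
fn estimate_jaccard(a: &[u64], b: &[u64]) -> f64 {
    if a.is_empty() || a.len() != b.len() {
        return 0.0;
    }
    let matches = a.iter().zip(b).filter(|(x, y)| x == y).count();
    matches as f64 / a.len() as f64
}
```

Two documents with the same token set produce identical signatures, while unrelated documents share almost no slots, which is what makes near-duplicate lookup cheap.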

Use cases

Use case	What UCFP contributes	Layers & configs
Dataset deduplication	Deterministic IDs and canonical hashes collapse byte-identical submissions	`ufp_ingest` + `IngestConfig`, `ufp_canonical` SHA-256
Plagiarism detection	Token offsets, shingles, and MinHash detect paraphrased overlaps	`ufp_canonical` tokens, `ufp_perceptual` tuned `k`/`w`
Content provenance	Canonical metadata + perceptual signatures trace assets across feeds, storage, and audits	`ufp_ingest`, `PipelineMetrics`, `PerceptualConfig` seeds
Multimodal search	Canonical text + binary passthrough feed embedding stores and downstream modalities	`IngestPayload::Binary`, canonical helpers, embeddings roadmap
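
As an illustration of the deduplication row, the sketch below collapses records whose canonical text hashes to the same digest. It assumes nothing about UCFP's API: `digest` and `find_duplicates` are hypothetical helpers, and `DefaultHasher` is only a standard-library stand-in for the SHA-256 digest that `ufp_canonical` provides.

```rust
use std::collections::hash_map::{DefaultHasher, Entry};
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Stand-in digest (UCFP exposes SHA-256; this is std-only for the sketch).
fn digest(canonical_text: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    canonical_text.hash(&mut hasher);
    hasher.finish()
}

/// Return (duplicate_id, original_id) pairs, keeping the first id per digest.
fn find_duplicates<'a>(records: &[(&'a str, &str)]) -> Vec<(&'a str, &'a str)> {
    let mut first_seen: HashMap<u64, &'a str> = HashMap::new();
    let mut duplicates = Vec::new();
    for (id, text) in records {
        match first_seen.entry(digest(text)) {
            Entry::Occupied(entry) => duplicates.push((*id, *entry.get())),
            Entry::Vacant(entry) => {
                entry.insert(*id);
            }
        }
    }
    duplicates
}
```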

Quickstart

Prerequisites

Rust 1.76+ (rustup toolchain install stable)
cargo available on your PATH

Build, lint, and test

cargo fmt --all
cargo clippy --all --all-targets -- -D warnings
cargo test --all

Explore the examples

cargo run --package ufp_ingest --example ingest_demo
cargo run --package ufp_ingest --example batch_ingest
cargo run --package ufp_canonical --example demo
cargo run --package ufp_canonical --example helpers
cargo run --package ufp_perceptual --example fingerprint_demo
cargo run --package ufp_semantic --example embed "Doc Title" "Some text to embed"
cargo run --example full_pipeline      # ingest + semantic + perceptual + index
cargo run                              # end-to-end demo on big_text.txt
cargo run --example pipeline_metrics   # observe metrics events

Architecture Overview

UCFP is a layered pipeline:

Ingest (ufp_ingest) – validates metadata, derives deterministic IDs, normalizes text/binary payloads, and emits CanonicalIngestRecord.
Canonical (ufp_canonical) – converts normalized text into lowercase NFKC strings, token streams with byte offsets, and SHA-256 hashes.
Perceptual (ufp_perceptual) – shingles canonical tokens, applies winnowing, and produces MinHash fingerprints tuned by PerceptualConfig.
Semantic (ufp_semantic) – turns canonical text into dense embeddings via ONNX Runtime or remote HTTP APIs, then normalizes the vectors or substitutes deterministic stub vectors, depending on the configured tier.
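
The first two stages can be approximated with the standard library alone. The sketch below is an assumption-laden illustration, not the crates' code: `canonical_text`, `tokenize`, and `SketchToken` are hypothetical names, tokens are split on non-alphanumeric characters, and Unicode NFKC is skipped because it needs an external crate.

```rust
/// Hypothetical token record: text plus byte offsets into the canonical string.
#[derive(Debug, PartialEq)]
struct SketchToken {
    text: String,
    start: usize,
    end: usize,
}

/// Drop control characters, collapse whitespace runs to one space, lowercase.
fn canonical_text(input: &str) -> String {
    input
        .split_whitespace()
        .map(|word| {
            word.chars()
                .filter(|c| !c.is_control())
                .collect::<String>()
                .to_lowercase()
        })
        .filter(|word| !word.is_empty())
        .collect::<Vec<_>>()
        .join(" ")
}

/// Split on non-alphanumeric characters, recording byte offsets per token.
fn tokenize(canonical: &str) -> Vec<SketchToken> {
    let mut tokens = Vec::new();
    let mut start: Option<usize> = None;
    for (idx, ch) in canonical.char_indices() {
        if ch.is_alphanumeric() {
            if start.is_none() {
                start = Some(idx);
            }
        } else if let Some(s) = start.take() {
            tokens.push(SketchToken { text: canonical[s..idx].to_string(), start: s, end: idx });
        }
    }
    if let Some(s) = start {
        tokens.push(SketchToken { text: canonical[s..].to_string(), start: s, end: canonical.len() });
    }
    tokens
}
```

Byte offsets (rather than character indices) let downstream stages slice the canonical string directly when mapping a match back to its source span.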

The root ucfp crate re-exports all public types and orchestrates the stages through:

process_record (ingest + canonicalize),
process_record_with_perceptual (full ingest → canonical → perceptual),
process_record_with_semantic (ingest → canonical → semantic embedding),
process_record_with_*_configs helpers when explicit configuration objects are needed,
process_semantic_document / semanticize_document when you need only the embedding,
big_text_demo for the bundled integration example,
set_pipeline_metrics / PipelineMetrics and set_pipeline_logger for observability hooks.

Layer responsibilities

Layer	Responsibilities	Key types
`ufp_ingest`	Required metadata enforcement, timestamp defaulting, control-character stripping, whitespace normalization, UTF-8 decode	`IngestConfig`, `RawIngestRecord`, `CanonicalIngestRecord`, `CanonicalPayload`
`ufp_canonical`	Unicode normalization, casing/punctuation policies, tokenization with byte offsets, SHA-256 hashing	`CanonicalizeConfig`, `CanonicalizedDocument`, `Token`
`ufp_perceptual`	Rolling-hash shingles, winnowing, MinHash signatures with deterministic seeding and optional parallelism	`PerceptualConfig`, `PerceptualFingerprint`, `WinnowedShingle`, `PerceptualMeta`
`ufp_semantic`	ONNX/API inference, tokenizer lifecycle management, deterministic stub embeddings for offline or “fast” tiers	`SemanticConfig`, `SemanticEmbedding`, `SemanticError`
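
For readers unfamiliar with the perceptual row, this is a minimal sketch of k-token shingling followed by winnowing. The function names are hypothetical, `DefaultHasher` stands in for the rolling hash that `ufp_perceptual` uses, and the tie-breaking rule (keep the rightmost minimum in each window) follows the classic winnowing description rather than anything confirmed by the crate.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hash every run of `k` consecutive tokens (one hash per shingle).
fn shingle_hashes(tokens: &[&str], k: usize) -> Vec<u64> {
    if k == 0 || tokens.len() < k {
        return Vec::new();
    }
    tokens
        .windows(k)
        .map(|shingle| {
            let mut hasher = DefaultHasher::new();
            shingle.hash(&mut hasher);
            hasher.finish()
        })
        .collect()
}

/// Winnowing: keep the rightmost minimum of each window of `w` hashes,
/// recording every selected (position, hash) pair only once.
fn winnow(hashes: &[u64], w: usize) -> Vec<(usize, u64)> {
    if w == 0 || hashes.is_empty() {
        return Vec::new();
    }
    let w = w.min(hashes.len());
    let mut picked: Vec<(usize, u64)> = Vec::new();
    for (offset, window) in hashes.windows(w).enumerate() {
        // Reversing first makes `min_by_key` return the rightmost minimum.
        let (idx, &hash) = window
            .iter()
            .enumerate()
            .rev()
            .min_by_key(|(_, h)| **h)
            .expect("window is never empty");
        let position = offset + idx;
        if picked.last().map(|p| p.0) != Some(position) {
            picked.push((position, hash));
        }
    }
    picked
}
```

Larger `k` makes shingles more specific, and larger `w` keeps fewer hashes per document, so the two knobs trade recall against fingerprint size.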

Documentation map

docs/index.html – workspace-wide architecture overview, diagrams, and glossary.
crates/ufp_ingest/doc/ucfp_ingest.md – ingest invariants, metadata normalization flow, and error taxonomy.
crates/ufp_canonical/doc/ufp_canonical.md – canonical transforms, token semantics, and checksum derivation.
crates/ufp_perceptual/doc/ufp_perceptual.md – shingling/winnowing internals, MinHash tuning guidance, and performance notes.
crates/ufp_semantic/doc/ufp_semantic.md – ONNX/API setup, deterministic stub tiers, and embedding configuration tips.

Config quick reference

Config type	Knobs you probably care about	Default highlights
`IngestConfig`	`default_tenant_id`, `doc_id_namespace`, `strip_control_chars`, `metadata_policy.*`	v1, deterministic namespace UUID, strip-on, policies off
`CanonicalizeConfig`	`normalize_unicode`, `strip_punctuation`, `lowercase`	v1, Unicode NFKC + lowercase, punctuation kept
`PerceptualConfig`	`k`, `w`, `minhash_bands`, `minhash_rows_per_band`, `seed`, `use_parallel`	v1, 9-token shingles, 16x8 MinHash, serial mode

use ucfp::{
    CanonicalizeConfig, IngestConfig, IngestMetadata, IngestPayload, IngestSource, PerceptualConfig,
    RawIngestRecord,
};
use uuid::Uuid;

let ingest_cfg = IngestConfig {
    default_tenant_id: "tenant-acme".into(),
    doc_id_namespace: Uuid::parse_str("3ba60f64-7d5a-11ee-b962-0242ac120002")?,
    strip_control_chars: true,
    ..Default::default()
};

let canonical_cfg = CanonicalizeConfig {
    strip_punctuation: true,
    lowercase: true,
    ..Default::default()
};

let perceptual_cfg = PerceptualConfig {
    k: 7,
    minhash_bands: 32,
    minhash_rows_per_band: 4,
    use_parallel: true,
    ..Default::default()
};

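// `ingest_cfg` is not an argument to this call; explicit ingest configuration
// goes through the process_record_with_*_configs helpers instead.
let (doc, fingerprint) = ucfp::process_record_with_perceptual(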
    RawIngestRecord {
        id: "demo".into(),
        source: IngestSource::RawText,
        metadata: IngestMetadata {
            tenant_id: None,
            doc_id: None,
            received_at: None,
            original_source: None,
            attributes: None,
        },
        payload: Some(IngestPayload::Text("Streamlined config demo".into())),
    },
    &canonical_cfg,
    &perceptual_cfg,
)?;
assert_eq!(doc.canonical_text, "streamlined config demo");
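
The `minhash_bands` and `minhash_rows_per_band` knobs look like the usual LSH banding parameters, although the source does not spell that out. Under that assumption, the sketch below shows how a 128-value signature splits into bands and how the band/row split changes the chance that two documents become candidates. Both helpers are hypothetical.

```rust
/// Split a MinHash signature into `bands` groups of `rows` values each.
/// Returns `None` when the signature length is not exactly bands * rows.
fn split_into_bands(signature: &[u64], bands: usize, rows: usize) -> Option<Vec<&[u64]>> {
    if rows == 0 || signature.len() != bands * rows {
        return None;
    }
    Some(signature.chunks(rows).collect())
}

/// Probability that two documents with Jaccard similarity `similarity`
/// agree on every row of at least one band: 1 - (1 - s^rows)^bands.
fn candidate_probability(similarity: f64, bands: u32, rows: u32) -> f64 {
    1.0 - (1.0 - similarity.powi(rows as i32)).powi(bands as i32)
}
```

At a similarity of 0.5, the default 16x8 layout yields a candidate probability of about 0.06, while 32x4 yields about 0.87, so fewer rows per band favors recall and more rows per band favors precision.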

Pipeline in code

use chrono::{DateTime, Duration, Utc};
use ucfp::{
    CanonicalizeConfig, IngestMetadata, IngestPayload, IngestSource, PerceptualConfig,
    RawIngestRecord, SemanticConfig, process_record_with_perceptual, semanticize_document,
};

let canonical_cfg = CanonicalizeConfig::default();
let perceptual_cfg = PerceptualConfig { k: 5, ..Default::default() };
let record = RawIngestRecord {
    id: "ingest-1".into(),
    source: IngestSource::RawText,
    // Metadata fields are optional, so explicit values are wrapped in `Some`.
    metadata: IngestMetadata {
        tenant_id: Some("tenant".into()),
        doc_id: Some("doc".into()),
        received_at: Some(DateTime::<Utc>::UNIX_EPOCH + Duration::seconds(1_700_000_000)),
        original_source: None,
        attributes: None,
    },
    payload: Some(IngestPayload::Text("  Hello   world  ".into())),
};

let (doc, fingerprint) =
    process_record_with_perceptual(record, &canonical_cfg, &perceptual_cfg)?;
assert_eq!(doc.canonical_text, "hello world");
assert_eq!(fingerprint.meta.k, 5);

let semantic_cfg = SemanticConfig {
    mode: "fast".into(),
    tier: "fast".into(),
    ..Default::default()
};
let embedding = semanticize_document(&doc, &semantic_cfg)?;
assert_eq!(embedding.doc_id, doc.doc_id);

Call process_record_with_semantic(...) to obtain the document and embedding together, or semanticize_document(...) when you already have a canonical document on hand.
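
One way a deterministic stub embedding for an offline or "fast" tier could work is to derive every vector component from a hash of the text and then L2-normalize. This is a sketch under that assumption only; `stub_embedding` is a hypothetical function and is not how `ufp_semantic` is known to build its stubs.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Derive a unit-length pseudo-embedding from the text alone.
/// The same text and dimension count always yield the same vector.
fn stub_embedding(text: &str, dims: usize) -> Vec<f32> {
    let mut vector: Vec<f32> = (0..dims)
        .map(|dim| {
            let mut hasher = DefaultHasher::new();
            text.hash(&mut hasher);
            dim.hash(&mut hasher);
            // Keep the top 24 bits (exact in f32) and map them to [-1, 1).
            let bits = (hasher.finish() >> 40) as f32;
            bits / 8_388_608.0 - 1.0
        })
        .collect();
    let norm = vector.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm > 0.0 {
        for value in &mut vector {
            *value /= norm;
        }
    }
    vector
}
```

Such vectors carry no semantic meaning; they only keep downstream code paths and tests reproducible when no model is available.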

Failures bubble up as PipelineError::Ingest(_), PipelineError::Canonical(_), PipelineError::Perceptual(_), PipelineError::Semantic(_), or PipelineError::NonTextPayload. The CLI binary in src/main.rs invokes big_text_demo and prints the final MinHash signature generated from crates/ufp_canonical/examples/big_text.txt.
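
Callers will usually want to branch on the failing stage. The enum below only mirrors the variant names listed above so the example compiles on its own; the `String` payloads and the `failed_stage` helper are assumptions, and the real `PipelineError` wraps each crate's own error type.

```rust
/// Stand-in for `ucfp::PipelineError` with placeholder payloads.
#[derive(Debug)]
enum SketchPipelineError {
    Ingest(String),
    Canonical(String),
    Perceptual(String),
    Semantic(String),
    NonTextPayload,
}

/// Name the stage that failed, or `None` when the record simply was not text.
fn failed_stage(err: &SketchPipelineError) -> Option<&'static str> {
    match err {
        SketchPipelineError::Ingest(_) => Some("ingest"),
        SketchPipelineError::Canonical(_) => Some("canonical"),
        SketchPipelineError::Perceptual(_) => Some("perceptual"),
        SketchPipelineError::Semantic(_) => Some("semantic"),
        // Binary payloads cannot go through the text-only stages.
        SketchPipelineError::NonTextPayload => None,
    }
}
```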

Metrics & Observability

Hook a recorder into set_pipeline_metrics(...) to track stage-level latency and outcomes, or attach a structured logger via set_pipeline_logger(...). The KeyValueLogger helper emits key/value lines such as:

timestamp="2025-02-10T02:15:01.234Z" stage=ingest status=success latency_us=640 record_id="demo"
timestamp="2025-02-10T02:15:01.241Z" stage=canonical status=success latency_us=488 record_id="demo" doc_id="demo"
timestamp="2025-02-10T02:15:01.245Z" stage=perceptual status=success latency_us=377 record_id="demo" doc_id="demo"
timestamp="2025-02-10T02:15:01.249Z" stage=semantic status=success latency_us=512 record_id="demo" doc_id="demo"
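
If you need to produce or test against lines in this shape, a formatter along these lines reproduces the sample output. `format_event` is a hypothetical helper written for this README, not part of `KeyValueLogger`.

```rust
/// Format one pipeline event as a key/value line.
/// `doc_id` is appended only when the stage has one to report.
fn format_event(
    timestamp: &str,
    stage: &str,
    status: &str,
    latency_us: u64,
    record_id: &str,
    doc_id: Option<&str>,
) -> String {
    let mut line = format!(
        "timestamp=\"{timestamp}\" stage={stage} status={status} latency_us={latency_us} record_id=\"{record_id}\""
    );
    if let Some(doc_id) = doc_id {
        line.push_str(&format!(" doc_id=\"{doc_id}\""));
    }
    line
}
```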

examples/pipeline_metrics.rs wires up both metrics and structured logging. Run it with:

cargo run --example pipeline_metrics
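
The general shape of a stage-level recorder can be sketched without the crate. Everything here is hypothetical: `StageRecorder`, `InMemoryRecorder`, and `timed` are illustrative names, and the real `PipelineMetrics` trait may expose different methods.

```rust
use std::time::{Duration, Instant};

/// Hypothetical recorder interface for stage latency and outcome.
trait StageRecorder {
    fn record(&mut self, stage: &'static str, latency: Duration, ok: bool);
}

/// Collects events in memory, which is handy in tests.
#[derive(Default)]
struct InMemoryRecorder {
    events: Vec<(&'static str, Duration, bool)>,
}

impl StageRecorder for InMemoryRecorder {
    fn record(&mut self, stage: &'static str, latency: Duration, ok: bool) {
        self.events.push((stage, latency, ok));
    }
}

/// Run one stage, then report how long it took and whether it succeeded.
fn timed<T, E>(
    recorder: &mut dyn StageRecorder,
    stage: &'static str,
    work: impl FnOnce() -> Result<T, E>,
) -> Result<T, E> {
    let started = Instant::now();
    let result = work();
    recorder.record(stage, started.elapsed(), result.is_ok());
    result
}
```

Keeping the recorder behind a trait lets the same pipeline code feed an in-memory collector in tests and an exporter in production.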

Workspace Layout

crates/
  ufp_ingest/        # ingest validation and normalization
  ufp_canonical/     # canonical text pipeline
  ufp_perceptual/    # shingling, winnowing, MinHash
  ufp_semantic/      # semantic embeddings (ONNX/API, deterministic stubs)
src/                 # workspace exports + CLI demo
tests/               # integration tests (determinism, errors, pipeline)
docs/                # static documentation site
proto/               # schema sketches and diagrams
examples/            # workspace-level demos (metrics, etc.)

Roadmap

Expand ingest metadata policies and validation rules.
Add storage/search integrations for perceptual fingerprints.
Extend the pipeline with cross-modal canonicalizers, fingerprints, and embedding backends:

Modality	Canonicalizer	Fingerprint	Embedding Model
Text	NFKC + tokenization	MinHash	BGE / E5
Image	DCT normalization	pHash	CLIP / SigLIP
Audio	Mel-spectrogram	Winnowing	SpeechCLIP / Whisper
Video	Keyframes	Scene hashes	VideoCLIP / XCLIP
Document	OCR + layout	Layout graph	LayoutLMv3

Introduce semantic extraction and multi-modality pathways (e.g., text + binary embeddings) feeding the existing canonical/perceptual layers.
Enrich observability with structured logging backends and metrics exporters.

Contributing

We welcome fixes, optimizations, and new modalities. Please read CONTRIBUTING.md for the workflow, required checks (cargo fmt, cargo clippy, cargo test), documentation expectations, and guidance on updating the architecture diagram as the pipeline evolves.
