feat: Add incremental re-hashing API for large models #562

edonadei · 2025-11-02T03:45:47Z

Motivation

This PR implements incremental model re-hashing to solve a critical performance problem when signing large ML models. Currently, when a user makes a small change to a large model (e.g., updating README.md in a 500GB model), the entire model must be re-hashed before re-signing, which can take hours. This makes it impractical to update documentation or configuration files in large models.

This PR adds a Python API that reuses digests from previous signatures for unchanged files, only re-hashing files that were added or modified. For a 500GB model with a 1KB documentation update, this reduces re-hashing time from hours to seconds (~500,000x speedup).

Changes

signing.manifest_from_signature() - Extracts and verifies manifest from existing signature files, ensuring old signatures haven't been tampered with before reusing their hashes
IncrementalSerializer - Core implementation that compares current model state against existing manifest and only re-hashes changed files
Config.use_incremental_serialization() - Integrates incremental serializer into hashing API
sign_incremental() - High-level convenience API that combines verification + extraction + incremental hashing + signing

Design Decision

Following the discussion in #160, this implementation uses a user-driven approach where changed files are specified via the files_to_hash parameter (e.g., from git diff). This was chosen over automatic change detection because:

File metadata (mtime) is unreliable across systems and git operations
Users know what changed via their workflow (git, manual tracking, etc.)
Keeps implementation simple and reliable
Works with any file tracking system

The implementation also reuses the existing Verifier.verify() path instead of manually parsing DSSE envelopes, which:

Eliminates code duplication
Ensures consistent verification logic
Adds security by verifying old signatures before trusting their hashes

Test Coverage:

tests/manifest_test.py: Tests for manifest data structures
tests/_serialization/incremental_test.py: 7 tests for incremental serializer (new files, modified files, deleted files, mixed scenarios, empty manifests)
tests/hashing_config_test.py: Tests for incremental serialization parameter extraction
tests/_signing/signing_test.py: Tests for DSSE payload parsing and verification

Future Work: Based on maintainer feedback, these could be added in follow-up PRs:

CLI support (--incremental flag)
Documentation in README
Git helper utility (get_changed_files_from_git())
Integration tests with large models

Questions for Maintainers:

Is the files_to_hash parameter approach acceptable, or would you prefer a different change detection mechanism? ✅ Looks good
Should CLI support and documentation be in this PR or separate follow-ups? ✅ Will we done in a next PR
Is the breaking change (adding required verification parameters) acceptable for the incremental signing API?

Testing this PR

# Create and sign an initial model
from model_signing import signing
from pathlib import Path

model_dir = Path("test_model")
model_dir.mkdir()
(model_dir / "weights.bin").write_bytes(b"x" * 1000000)  # 1MB file
(model_dir / "README.md").write_text("Version 1")

# Initial sign
signing.Config().use_elliptic_key_signer(
    private_key=Path("test.key")
).sign(model_dir, "model.sig.v1")

# Modify only README
(model_dir / "README.md").write_text("Version 2 - updated docs")

# Sign incrementally - only re-hashes README.md, reuses weights.bin digest
# Must verify the old signature before reusing its hashes
signing.Config().use_elliptic_key_signer(
    private_key=Path("test.key")
).sign_incremental(
    model_dir,
    old_signature_path="model.sig.v1",
    new_signature_path="model.sig.v2",
    identity="[email protected]",
    oidc_issuer="https://example.com/oauth",
    files_to_hash=[model_dir / "README.md"]
)

# Verify the new signature works
from model_signing import verifying
verifying.Config().use_elliptic_key_verifier(
    public_key=Path("test.pub")
).verify(model_dir, "model.sig.v2")

…tures This method enables reading a manifest from a signature file without performing cryptographic verification. This is the foundation for incremental re-hashing, where we need to know what files were previously signed to determine which files need re-hashing. The method: - Reads and parses Sigstore bundle JSON format - Extracts the DSSE envelope payload - Decodes base64-encoded payload - Validates manifest integrity (root digest matches resources) - Returns a Manifest object Includes comprehensive tests covering: - Valid manifest extraction - Rejection of inconsistent manifests - Error handling for missing files, invalid JSON, and missing envelopes Related to issue sigstore#160 - API for incremental model re-hashing Signed-off-by: Emrick Donadei <[email protected]>

Implements the core incremental hashing logic that compares the current model state against an existing manifest and only re-hashes changed files. Key features: - Reuses digests for unchanged files from previous manifest - Hashes new files not in the previous signature - Handles modified files via files_to_hash parameter - Handles file deletions automatically (omits them from new manifest) - Uses same parallel hashing as standard file serializer The algorithm: 1. Scan current model directory for all files 2. Build set of files to rehash from files_to_hash parameter 3. For each current file: - If not in old manifest: hash it (new file) - If in files_to_hash list: hash it (modified file) - Otherwise: reuse digest from old manifest (unchanged) 4. Deleted files are automatically excluded (not on disk) 5. Return manifest with mix of reused and new digests Usage for incremental signing (e.g., 500GB model, 1KB README changed): # Get changed files from git changed = subprocess.check_output(['git', 'diff', '--name-only', 'HEAD']) files_to_hash = [model_path / f for f in changed.decode().split()] # Only re-hash the changed file(s) serializer.serialize(model_path, files_to_hash=files_to_hash) This provides significant performance improvements - only re-hashing the changed 1KB instead of all 500GB. Includes comprehensive tests covering: - No changes: all digests reused - New file added: only new file hashed - Modified file: only modified file re-hashed - File deleted (auto): removed from manifest - File deleted (in files_to_hash): safely ignored - Mixed changes: all scenarios working together Related to issue sigstore#160 - API for incremental model re-hashing Signed-off-by: Emrick Donadei <[email protected]>

Integrates the IncrementalSerializer into the high-level hashing API, making it accessible through the Config class. Usage: # Extract manifest from previous signature old_manifest = Manifest.from_signature(Path("model.sig.old")) # Configure incremental hashing config = hashing.Config().use_incremental_serialization( old_manifest, hashing_algorithm="sha256" ) # Get changed files and hash them changed_files = [model_path / "README.md"] new_manifest = config.hash(model_path, files_to_hash=changed_files) This method follows the same pattern as use_file_serialization() and use_shard_serialization(), providing a consistent API for users. The configuration: - Accepts an existing manifest to compare against - Supports all the same hashing algorithms (SHA256, BLAKE2, BLAKE3) - Supports the same parameters (chunk_size, max_workers, etc.) - Returns Self for method chaining Related to issue sigstore#160 - API for incremental model re-hashing Signed-off-by: Emrick Donadei <[email protected]>

Provides high-level convenience functions for incremental model signing that combine all the pieces: manifest extraction, incremental hashing, and signing. Two levels of API: 1. Simple function API: sign_incremental( model_path="huge-model/", old_signature_path="model.sig.old", new_signature_path="model.sig.new", files_to_hash=["huge-model/README.md"] ) 2. Configurable class API: Config().use_elliptic_key_signer(private_key="key").sign_incremental( model_path="huge-model/", old_signature_path="model.sig.old", new_signature_path="model.sig.new", files_to_hash=["huge-model/README.md"] ) Both APIs: - Extract manifest from old signature automatically - Configure incremental hashing - Hash only changed/new files - Sign the new manifest - Write the new signature Also added set_allow_symlinks() method to IncrementalSerializer to maintain compatibility with the hashing Config class, which calls this method before serialization. This makes it trivial for users to incrementally sign large models where only a few files changed, avoiding hours of re-hashing. Related to issue sigstore#160 - API for incremental model re-hashing Signed-off-by: Emrick Donadei <[email protected]>

edonadei · 2025-11-03T00:30:12Z

@mihaimaruseac if you can take a look at this, I tried to follow up with the last discussions from #160 (from 2024 that's old) in that thread and tried to implement a solution. It's a bit long and probably imperfect, but I'm open to feedback, there's a questions for maintainer sections to start the discussion.

mihaimaruseac · 2025-11-03T15:01:55Z

Amazing! Will take a look this week (JupyterCon)

- Fix SIM118: Use 'key in dict' instead of 'key in dict.keys()' - Fix E501: Break long lines to stay under 80 characters - Fix F401: Remove unused pytest import from incremental_test.py - Fix F401: Remove unused json import from manifest_test.py All critical lint errors resolved. Signed-off-by: Emrick Donadei <[email protected]>

Auto-format code with ruff to match the project's formatting standards: - Adjust line breaking for long expressions - Format function call arguments consistently - Apply consistent parentheses placement No functional changes, only formatting. Signed-off-by: Emrick Donadei <[email protected]>

edonadei · 2025-11-16T02:05:52Z

Ping @mihaimaruseac if you have some time for a review :)

mihaimaruseac · 2025-11-17T19:05:23Z

Sorry for the delay. Looking now

mihaimaruseac

I like the general direction of this, but let's try to reuse existing infrastructure as much as possible, instead of duplicating.

I didn't look at the tests too much, but I noticed we have new hardcoded values. Can we use the existing goldens approach for this? Or use the fixture mechanism?

src/model_signing/_serialization/incremental.py

mihaimaruseac · 2025-11-17T19:24:27Z

src/model_signing/manifest.py

        return self._serialization_type.serialization_parameters
+
+    @classmethod
+    def from_signature(cls, signature_path: pathlib.Path) -> Self:


This should probably call the verification path first and then use

model-transparency/src/model_signing/_signing/signing.py

Line 353 in 992a56b

def verify(self, signature: Signature) -> manifest.Manifest:

to get the manifest directly without duplication.

To double check (I didn't implemented it yet). Do you mean to do something like this?

# in manifest.py def from_signature( cls, signature_path: pathlib.Path, *, identity: str, oidc_issuer: str, use_staging: bool = False, ) -> Self: signature = sign_sigstore.Signature.read(signature_path) verifier = sign_sigstore.Verifier( identity=identity, oidc_issuer=oidc_issuer, use_staging=use_staging, ) return verifier.verify(signature)

Yes, but I think this can be at a higher level, not in manifest.py (as that could create a circular dependency). We could do it in signing.py, which can import verifying.py.

Ok that's what I tried to do! Will refresh the review for you to take a second look at it

I think the current approach works, but you need to fix the linting issues (name shadowing)

mihaimaruseac · 2025-11-17T20:04:36Z

Is the files_to_hash parameter approach acceptable, or would you prefer a different change detection mechanism?

This is perfect. We'll have users specify these in advance, in a clean pipeline, rather than us having to write and maintain the detection logic

Should CLI support and documentation be in this PR or separate follow-ups?

It's ok to do it after.

Would you like me to add a CHANGELOG entry now or wait until the approach is approved?

Same here for later, assuming we collect everything

When the existing manifest uses shard-based serialization, we cannot reuse shard digests for file-based serialization. This commit updates IncrementalSerializer to detect shard-based manifests via the serialization_type property and automatically rehash all files in that case. - Check serialization_type.method == "shards" in __init__ - Rehash all files if existing manifest was shard-based - Add test_sharded_manifest_rehashes_all_files to verify behavior Signed-off-by: Emrick Donadei <[email protected]>

Refactor the rehash decision logic in serialize() into a dedicated _should_rehash_file() helper method. This improves readability and will allow reuse when adding support for incremental hashing of shards in the future. No functional changes - purely code organization improvement. Signed-off-by: Emrick Donadei <[email protected]>

Simplify the incremental serialization API by automatically extracting serialization parameters (hash algorithm, allow_symlinks, ignore_paths) from the existing manifest when not explicitly specified. This eliminates the need for users to manually specify parameters that should match the original manifest, making the API more user-friendly: Before: config.use_incremental_serialization( old_manifest, hashing_algorithm="sha256", allow_symlinks=False ) After: config.use_incremental_serialization(old_manifest) # Automatically uses same parameters as old_manifest - Extract hash_type from manifest._serialization_type - Map digest names to API parameters (blake2b -> blake2) - Handle shard-based manifests (default to sha256) - Add test to verify parameter extraction works correctly Signed-off-by: Emrick Donadei <[email protected]>

Completely rewrite incremental serialization tests to match the established patterns used in file_test.py and other serialization tests: - Add pytest fixtures (hasher_factory, file_serializer) for reusability - Use existing model fixtures from conftest.py (sample_model_folder) - Use test_support helpers (extract_digests_from_manifest, get_first_file) - Follow arrange-act-assert pattern consistently - Simplify assertions using helper functions - Remove all hardcoded fake digest values - Tests are now ~230 lines shorter and more maintainable All 7 tests pass and follow the same style as the rest of the test suite. Signed-off-by: Emrick Donadei <[email protected]>

When using incremental serialization, warn if set_allow_symlinks() is called with a value that differs from the existing manifest. This helps users identify potential inconsistencies in their manifests. The warning is logged but does not fail the operation, allowing users to override settings if intentional while being made aware of the change. Signed-off-by: Emrick Donadei <[email protected]>

Replace Manifest.from_signature() with signing.manifest_from_signature() that verifies signatures before extracting manifests. This addresses reviewer feedback to reuse existing verification logic and adds security to incremental signing by ensuring old signatures are verified before their hashes are reused. Changes: - Add manifest_from_signature() to signing.py that calls Verifier.verify() - Update sign_incremental() to require identity/oidc_issuer parameters for verification of old signatures - Remove Manifest.from_signature() from manifest.py (eliminated code duplication) - Update documentation examples in hashing.py - Remove redundant tests (DSSE parsing already tested in signing_test.py) This is a breaking change for incremental signing API, but improves security by preventing tampering of old signatures. Signed-off-by: Emrick Donadei <[email protected]>

edonadei added 4 commits November 1, 2025 23:58

edonadei force-pushed the main branch from d385199 to a3194d8 Compare November 2, 2025 14:40

edonadei marked this pull request as ready for review November 3, 2025 00:27

edonadei requested review from a team as code owners November 3, 2025 00:27

edonadei added 2 commits November 3, 2025 10:43

mihaimaruseac requested changes Nov 17, 2025

View reviewed changes

edonadei added 3 commits November 17, 2025 20:37

edonadei force-pushed the main branch from 8ec5cb1 to 89ee38e Compare November 18, 2025 02:25

edonadei force-pushed the main branch from 89ee38e to 2aa106a Compare November 18, 2025 02:28

edonadei force-pushed the main branch 2 times, most recently from 399ed5e to 3fc63b0 Compare November 18, 2025 05:38

edonadei force-pushed the main branch from e5c2bd5 to 1ec5791 Compare November 20, 2025 02:51

edonadei requested a review from mihaimaruseac November 20, 2025 03:08

feat: Add incremental re-hashing API for large models #562

Are you sure you want to change the base?

feat: Add incremental re-hashing API for large models #562

Conversation

edonadei commented Nov 2, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Motivation

Changes

Design Decision

Test Coverage:

Future Work: Based on maintainer feedback, these could be added in follow-up PRs:

Questions for Maintainers:

Testing this PR

Uh oh!

edonadei commented Nov 3, 2025

Uh oh!

mihaimaruseac commented Nov 3, 2025

Uh oh!

edonadei commented Nov 16, 2025

Uh oh!

mihaimaruseac commented Nov 17, 2025

Uh oh!

mihaimaruseac left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

mihaimaruseac Nov 17, 2025

Choose a reason for hiding this comment

Uh oh!

edonadei Nov 18, 2025

Choose a reason for hiding this comment

Uh oh!

mihaimaruseac Nov 19, 2025

Choose a reason for hiding this comment

Uh oh!

edonadei Nov 20, 2025

Choose a reason for hiding this comment

Uh oh!

mihaimaruseac Nov 20, 2025

Choose a reason for hiding this comment

Uh oh!

mihaimaruseac commented Nov 17, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

edonadei commented Nov 2, 2025 •

edited

Loading