-
Notifications
You must be signed in to change notification settings - Fork 315
feat: Add image hashing functions with support for 5 algorithms #5229
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
feat: Add image hashing functions with support for 5 algorithms #5229
Conversation
- Add average hash (aHash) for basic deduplication - Add perceptual hash (pHash) using DCT for robust similarity detection - Add difference hash (dHash) for efficient pixel comparison - Add wavelet hash (wHash) for rotation/scaling robustness - Add crop-resistant hash (cHash) for cropping robustness - Implement all functions in Rust for performance - Add comprehensive Python API with detailed docstrings - Include 38 test cases covering edge cases and validation - All functions return 64-character binary strings for consistency
- Add unified image_hash function in daft.functions with algorithm parameter - Support 5 hash algorithms: average, perceptual, difference, wavelet, crop_resistant - Remove redundant individual hash functions from Python API - Keep essential series functions in Rust for internal implementation - Add comprehensive test suite with 8 test cases - Fix clippy warnings and ensure code quality - Align with PR Eventual-Inc#5086 deprecation of namespace methods - Add proper type safety with Literal types for algorithm parameter Closes: Image hashing functionality implementation
- Remove test_image_average_hash.py - Remove test_image_crop_resistant_hash.py - Remove test_image_difference_hash.py - Remove test_image_wavelet_hash.py These files were testing the old individual hash functions that were removed in favor of the unified image_hash function. All test coverage is now provided by the comprehensive test_image_hash.py file.
- Add image_hash function to daft.functions.image module with support for 5 algorithms: * average, perceptual, difference, wavelet, crop_resistant - Fix Rust syntax issue: replace is_multiple_of with modulo operator in ops.rs - Refactor Rust hash functions to eliminate code duplication using helper function - Rename hash method to image_hash in SeriesImageNamespace for clarity - Update all tests to use modern functional API (daft.functions.image_hash) - Fix import issues and ensure proper linting compliance - All 5,926 tests passing with no regressions
…ional API - Remove Series.image.image_hash() deprecated method - Remove Expression.image.image_hash() deprecated method - Keep only daft.functions.image_hash() modern functional API - Aligns with PR Eventual-Inc#5086 direction of moving away from namespaces - Cleaner API with no deprecated methods to maintain - All tests still pass with functional API only
- Replace average hash with simplified, consistent implementation - Update perceptual hash with corrected DCT implementation and median-based thresholding - Improve wavelet hash with proper Haar wavelet transform implementation - Enhance crop-resistant hash with ring-based sampling and bilinear interpolation - All algorithms now produce more accurate and consistent hash values - Maintain 64-bit hash output for all algorithms - All tests passing with improved implementations
- Replace 5 separate hash functions (average_hash, perceptual_hash, etc.) with single image_hash(algorithm) function - Follows established pattern of using algorithm parameter instead of separate functions - Matches Python API design with single parametrized function - Reduces code duplication and improves maintainability - All tests passing with unified function approach
- Add generic compute_hash_for_array helper function - Replace 10 duplicated hash functions with calls to helper - Reduce code from ~450 lines to ~50 lines (90% reduction) - Maintain identical public API and behavior - All existing tests pass, no functional changes This addresses the code duplication issue where ImageArray and FixedShapeImageArray had nearly identical implementations for all 5 hash algorithms (average, perceptual, difference, wavelet, crop_resistant).
- Fix test_image_average_hash_basic to use varied pixel values instead of uniform colors - Uniform colors (all 0s or all 255s) produce all 1s in average hash due to algorithm logic - Use images with mixed pixel values to properly test hash differentiation - All hash tests now pass correctly
- Add fixed seed (42) for deterministic random image generation to prevent flaky tests - Rename test_image_hash_series_api to test_image_hash_algorithms_comprehensive for accuracy - Remove misleading comment about 'series API' since Python only exposes functional API - All image hash tests now pass consistently
- Increase image size from 10x10 to 32x32 to reduce collision probability - Create more distinct image patterns with different pixel values - Replace unrealistic 'all hashes must be unique' assertion with more flexible logic - Test that obviously different images (random vs structured) produce different hashes - Allow for potential hash collisions while ensuring core functionality works - All image hash tests now pass consistently
- Move 'from daft.functions import image_hash' and 'from daft import col' to top imports - Remove 4 instances of inline imports from test functions - Follows Python best practices for import organization
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Greptile Summary
This PR adds comprehensive image hashing functionality to Daft with support for 5 different algorithms: average, perceptual, difference, wavelet, and crop_resistant hashing. The implementation follows established patterns in the Daft codebase by providing a Python API that wraps Rust backend implementations for performance.
The core addition is the image_hash()
function in daft.functions.image
, which accepts an Expression and an optional algorithm parameter (defaulting to "average"). The function uses Literal types for the algorithm parameter to provide better type safety and IDE support, following the codebase's custom style guide. The implementation leverages Daft's existing UDF system, with a new ImageHash
ScalarUDF implemented in Rust that handles the actual hash computations.
The Rust backend implementation in src/daft-image
includes mathematical implementations of all 5 algorithms, with sophisticated techniques like DCT for perceptual hashing, Haar wavelet transforms for wavelet hashing, and ring-based sampling for crop-resistant hashing. All algorithms return consistent 64-character binary strings suitable for deduplication workflows.
The changes integrate cleanly with Daft's existing image processing infrastructure, adding the new functionality to the function registry and maintaining compatibility with both ImageArray
and FixedShapeImageArray
types. The PR includes comprehensive test coverage with 12 tests covering algorithm functionality, edge cases, error handling, and consistency validation.
Confidence score: 4/5
- This PR is generally safe to merge with some areas requiring attention
- Score reflects solid implementation and testing but concerns about complex mathematical operations and potential runtime edge cases
- Pay close attention to the Rust hash computation functions in
src/daft-image/src/ops.rs
Context used:
Rule - Use Literal types instead of str for function parameters that accept a limited set of string values to provide better type safety and IDE support. (link)
10 files reviewed, 2 comments
let algorithm_str = match algorithm { | ||
ImageHashAlgorithm::Average => "average", | ||
ImageHashAlgorithm::Perceptual => "perceptual", | ||
ImageHashAlgorithm::Difference => "difference", | ||
ImageHashAlgorithm::Wavelet => "wavelet", | ||
ImageHashAlgorithm::CropResistant => "crop_resistant", | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
style: redundant enum-to-string conversion after parsing string-to-enum on line 65
let algorithm_str = match algorithm { | |
ImageHashAlgorithm::Average => "average", | |
ImageHashAlgorithm::Perceptual => "perceptual", | |
ImageHashAlgorithm::Difference => "difference", | |
ImageHashAlgorithm::Wavelet => "wavelet", | |
ImageHashAlgorithm::CropResistant => "crop_resistant", | |
}; | |
// Convert enum to string and call the unified image_hash function | |
let algorithm_str = algorithm_str; | |
crate::series::image_hash(input, algorithm_str) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could do let _ = algorithm_str.parse()?;
to maintain the input validation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or alternatively do the validation in Python and pass the enum itself into Rust?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
True that's a good suggestion too. Here's a similar pattern that uses our FromLiteral trait which does a lot of this work for you.
Define the type, implement FromLiteral, then define an Args type with derive(FunctionArgs).
/// Supported codecs for the decode and encode functions.
#[derive(Debug, Clone, Copy, Serialize, Deserialize, PartialEq, Eq, Hash)]
pub enum Codec {
Base64,
Deflate,
Gzip,
Utf8,
Zlib,
}
impl FromLiteral for Codec {
fn try_from_literal(lit: &Literal) -> DaftResult<Self> {
if let Literal::Utf8(s) = lit {
s.parse()
} else {
Err(DaftError::ValueError(format!(
"Expected a string literal, got {:?}",
lit
)))
}
}
}
#[derive(Debug, Clone, Serialize, Deserialize, PartialEq, Eq, Hash, FunctionArgs)]
struct Args<T> {
input: T,
codec: Codec,
}
// usage
fn call(&self, inputs: daft_dsl::functions::FunctionArgs<Series>) -> DaftResult<Series> {
let Args { input, codec } = inputs.try_into()?;
}
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #5229 +/- ##
==========================================
+ Coverage 74.48% 74.50% +0.01%
==========================================
Files 969 970 +1
Lines 124225 124558 +333
==========================================
+ Hits 92535 92803 +268
- Misses 31690 31755 +65
🚀 New features to boost your workflow:
|
hey @codekshitij i added comments on the other PR that got closed. Could you address these issues? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for working on this! IMO my biggest concern is that its very difficult to verify if this is actually working correctly or not. The best way to properly test is to compare against another library like ImageHash that does the same thing. If you could use that for testing, I would feel more comfortable with this PR
|
||
|
||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please run the pre-commit styles on your PR for consistency?
Args: | ||
expr: Expression to compute hash for. | ||
algorithm: The hashing algorithm to use. Options are: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add some details or links or something to explain the types of hashing methods and what their relative strengths and weaknesses are? This would show up on the docs too
let algorithm_str = match algorithm { | ||
ImageHashAlgorithm::Average => "average", | ||
ImageHashAlgorithm::Perceptual => "perceptual", | ||
ImageHashAlgorithm::Difference => "difference", | ||
ImageHashAlgorithm::Wavelet => "wavelet", | ||
ImageHashAlgorithm::CropResistant => "crop_resistant", | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or alternatively do the validation in Python and pass the enum itself into Rust?
assert s.to_pylist()[0].shape[2] == MODE_TO_NUM_CHANNELS[output_mode] | ||
|
||
|
||
def test_image_average_hash_basic(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than testing like this, where we need to understand how the hashing algorithms work under the hood, can you add https://pypi.org/project/ImageHash/ as a testing import and just compare the results with it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, why add tests here and then have a separate test_image_hash.py
file later?
Hey i got all your suggestion and will work on that ASAP. |
Thank you all for the review. And I'll fix it all ASAP. @rchowell @srilman @universalmind303 |
Changes Made
Adds image hashing functionality to Daft with support for 5 algorithms: average, perceptual, difference, wavelet, and crop_resistant.
API Usage
Implementation
daft.functions.image_hash()
functiondaft-image
Related Issues
#4889
Checklist
docs/mkdocs.yml
navigation