-
Notifications
You must be signed in to change notification settings - Fork 315
feat: Extend hash variants for xxhash #5276
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Greptile Overview
Summary
This PR extends xxhash functionality by adding XxHash32
and XxHash3
variants to the HashFunctionKind
enum, addressing inconsistency where minhash used xxhash64 while regular hash used xxhash3. However, the implementation is incomplete and will cause runtime failures.
Critical Issues Found:
- Missing implementations for
XxHash32
andXxHash3
variants in core hashing kernels (src/daft-core/src/kernels/hashing.rs
) - Missing implementations for new variants in minhash function (
src/daft-functions/src/minhash.rs
) - Missing "xxhash" alias that should default to xxhash3 as stated in PR description
- Incorrect import statement for xxh32 functions in hash ops
- Inconsistent default hash function (still using XxHash64 instead of XxHash3)
Positive Changes:
- Proper enum extension in
HashFunctionKind
- Good FromStr parsing for specific variants
- Addresses real inconsistency issue described in #5275
The PR concept is sound but needs significant implementation work to avoid runtime panics when users try the new hash variants.
Confidence Score: 1/5
- This PR has critical implementation gaps that will cause runtime failures
- Score reflects incomplete implementation with missing match arms for new enum variants, incorrect imports, and inconsistent defaults that will cause panics when users attempt to use the new xxhash variants
- Critical attention needed for src/daft-core/src/kernels/hashing.rs and src/daft-functions/src/minhash.rs to implement missing hash variant logic
Important Files Changed
File Analysis
Filename | Score | Overview |
---|---|---|
src/daft-hash/src/lib.rs | 3/5 | Added XxHash32 and XxHash3 variants to HashFunctionKind enum, but missing 'xxhash' alias for xxhash3 in FromStr implementation |
src/daft-core/src/kernels/hashing.rs | 2/5 | Renamed XxHash to XxHash64 but missing implementations for XxHash32 and XxHash3 variants in all hash functions, causing runtime failures |
src/daft-functions/src/hash.rs | 3/5 | Uses wrong default hash function XxHash64 instead of XxHash3 as stated in PR description |
src/daft-functions/src/minhash.rs | 2/5 | Missing implementation for XxHash32 and XxHash3 variants in match statement, will cause runtime errors |
src/daft-core/src/array/ops/hash.rs | 2/5 | Wrong import statement for xxh32 functions and incorrect default hash function |
Sequence Diagram
sequenceDiagram
participant User
participant HashFunction
participant HashFunctionKind
participant KernelHashing
participant XxHashVariant
User->>HashFunction: call hash(data, hash_function="xxhash")
HashFunction->>HashFunctionKind: parse("xxhash")
Note over HashFunctionKind: Missing alias: should map "xxhash" -> XxHash3
HashFunctionKind-->>HashFunction: Error: Invalid hash function
User->>HashFunction: call hash(data, hash_function="xxhash3")
HashFunction->>HashFunctionKind: parse("xxhash3")
HashFunctionKind-->>HashFunction: XxHash3
HashFunction->>KernelHashing: hash(array, seed, XxHash3)
KernelHashing->>XxHashVariant: match XxHash3
Note over XxHashVariant: Missing implementation: only XxHash64 handled
XxHashVariant-->>KernelHashing: Panic/Error
User->>MinHashFunction: minhash(data, hash_function="xxhash32")
MinHashFunction->>HashFunctionKind: parse("xxhash32")
HashFunctionKind-->>MinHashFunction: XxHash32
MinHashFunction->>MinHashFunction: match XxHash32
Note over MinHashFunction: Missing implementation: only XxHash64, MurmurHash3, Sha1
MinHashFunction-->>User: Panic/Error
9 files reviewed, 6 comments
.map(|s| s.parse::<HashFunctionKind>()) | ||
.transpose()? | ||
.unwrap_or(HashFunctionKind::XxHash); | ||
.unwrap_or(HashFunctionKind::XxHash64); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
logic: Default hash function should be XxHash3
instead of XxHash64
to maintain consistency with the PR description that states "default xxhash
will be an alias for xxhash3"
.unwrap_or(HashFunctionKind::XxHash64); | |
.unwrap_or(HashFunctionKind::XxHash3); |
{ | ||
pub fn hash(&self, seed: Option<&UInt64Array>) -> DaftResult<UInt64Array> { | ||
self.hash_with(seed, HashFunctionKind::XxHash) | ||
self.hash_with(seed, HashFunctionKind::XxHash64) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
logic: Default hash function should be XxHash3
instead of XxHash64
to align with PR description
use xxhash_rust::xxh3::{xxh3_64, xxh3_64_with_seed}; | ||
use xxhash_rust::{ | ||
xxh3::{xxh3_64, xxh3_64_with_seed}, | ||
xxh32::{xxh3_32, xxh3_32_with_seed}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
syntax: Incorrect import: xxh3_32
and xxh3_32_with_seed
don't exist. Should be xxh32
and xxh32_with_seed
from the xxh32
module.
xxh32::{xxh3_32, xxh3_32_with_seed}, | |
xxh32::{xxh32, xxh32_with_seed}, |
Changes Made
I noticed when going through minhash that we are inconsistent as to what xxhash algorithm we actually use. For example, in minhash we use xxhash64 and in regular hash we use xxhash3 (which is basically xxhash128). So I'm extending the terminology to specify all 3. The default
xxhash
will be an alias for xxhash3.Note that I'm also adding xxhash32 because thats what the original Spark script uses. However, it does have less throughput and requires some annoying setup.
Related Issues
Closes #5275.
Checklist
docs/mkdocs.yml
navigation