Skip to content

Conversation

srilman
Copy link
Contributor

@srilman srilman commented Sep 24, 2025

Changes Made

I noticed when going through minhash that we are inconsistent as to what xxhash algorithm we actually use. For example, in minhash we use xxhash64 and in regular hash we use xxhash3 (which is basically xxhash128). So I'm extending the terminology to specify all 3. The default xxhash will be an alias for xxhash3.

Note that I'm also adding xxhash32 because thats what the original Spark script uses. However, it does have less throughput and requires some annoying setup.

Related Issues

Closes #5275.

Checklist

  • Documented in API Docs (if applicable)
  • Documented in User Guide (if applicable)
  • If adding a new documentation page, doc is added to docs/mkdocs.yml navigation
  • Documentation builds and is formatted properly (tag @/ccmao1130 for docs review)

@srilman srilman marked this pull request as draft September 24, 2025 21:14
@github-actions github-actions bot added the feat label Sep 24, 2025
Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Greptile Overview

Summary

This PR extends xxhash functionality by adding XxHash32 and XxHash3 variants to the HashFunctionKind enum, addressing inconsistency where minhash used xxhash64 while regular hash used xxhash3. However, the implementation is incomplete and will cause runtime failures.

Critical Issues Found:

  • Missing implementations for XxHash32 and XxHash3 variants in core hashing kernels (src/daft-core/src/kernels/hashing.rs)
  • Missing implementations for new variants in minhash function (src/daft-functions/src/minhash.rs)
  • Missing "xxhash" alias that should default to xxhash3 as stated in PR description
  • Incorrect import statement for xxh32 functions in hash ops
  • Inconsistent default hash function (still using XxHash64 instead of XxHash3)

Positive Changes:

  • Proper enum extension in HashFunctionKind
  • Good FromStr parsing for specific variants
  • Addresses real inconsistency issue described in #5275

The PR concept is sound but needs significant implementation work to avoid runtime panics when users try the new hash variants.

Confidence Score: 1/5

  • This PR has critical implementation gaps that will cause runtime failures
  • Score reflects incomplete implementation with missing match arms for new enum variants, incorrect imports, and inconsistent defaults that will cause panics when users attempt to use the new xxhash variants
  • Critical attention needed for src/daft-core/src/kernels/hashing.rs and src/daft-functions/src/minhash.rs to implement missing hash variant logic

Important Files Changed

File Analysis

Filename        Score        Overview
src/daft-hash/src/lib.rs 3/5 Added XxHash32 and XxHash3 variants to HashFunctionKind enum, but missing 'xxhash' alias for xxhash3 in FromStr implementation
src/daft-core/src/kernels/hashing.rs 2/5 Renamed XxHash to XxHash64 but missing implementations for XxHash32 and XxHash3 variants in all hash functions, causing runtime failures
src/daft-functions/src/hash.rs 3/5 Uses wrong default hash function XxHash64 instead of XxHash3 as stated in PR description
src/daft-functions/src/minhash.rs 2/5 Missing implementation for XxHash32 and XxHash3 variants in match statement, will cause runtime errors
src/daft-core/src/array/ops/hash.rs 2/5 Wrong import statement for xxh32 functions and incorrect default hash function

Sequence Diagram

sequenceDiagram
    participant User
    participant HashFunction
    participant HashFunctionKind
    participant KernelHashing
    participant XxHashVariant

    User->>HashFunction: call hash(data, hash_function="xxhash")
    HashFunction->>HashFunctionKind: parse("xxhash") 
    Note over HashFunctionKind: Missing alias: should map "xxhash" -> XxHash3
    HashFunctionKind-->>HashFunction: Error: Invalid hash function

    User->>HashFunction: call hash(data, hash_function="xxhash3")
    HashFunction->>HashFunctionKind: parse("xxhash3")
    HashFunctionKind-->>HashFunction: XxHash3
    HashFunction->>KernelHashing: hash(array, seed, XxHash3)
    KernelHashing->>XxHashVariant: match XxHash3
    Note over XxHashVariant: Missing implementation: only XxHash64 handled
    XxHashVariant-->>KernelHashing: Panic/Error

    User->>MinHashFunction: minhash(data, hash_function="xxhash32")
    MinHashFunction->>HashFunctionKind: parse("xxhash32")
    HashFunctionKind-->>MinHashFunction: XxHash32
    MinHashFunction->>MinHashFunction: match XxHash32
    Note over MinHashFunction: Missing implementation: only XxHash64, MurmurHash3, Sha1
    MinHashFunction-->>User: Panic/Error
Loading

9 files reviewed, 6 comments

Edit Code Review Bot Settings | Greptile

.map(|s| s.parse::<HashFunctionKind>())
.transpose()?
.unwrap_or(HashFunctionKind::XxHash);
.unwrap_or(HashFunctionKind::XxHash64);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

logic: Default hash function should be XxHash3 instead of XxHash64 to maintain consistency with the PR description that states "default xxhash will be an alias for xxhash3"

Suggested change
.unwrap_or(HashFunctionKind::XxHash64);
.unwrap_or(HashFunctionKind::XxHash3);

{
pub fn hash(&self, seed: Option<&UInt64Array>) -> DaftResult<UInt64Array> {
self.hash_with(seed, HashFunctionKind::XxHash)
self.hash_with(seed, HashFunctionKind::XxHash64)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

logic: Default hash function should be XxHash3 instead of XxHash64 to align with PR description

use xxhash_rust::xxh3::{xxh3_64, xxh3_64_with_seed};
use xxhash_rust::{
xxh3::{xxh3_64, xxh3_64_with_seed},
xxh32::{xxh3_32, xxh3_32_with_seed},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

syntax: Incorrect import: xxh3_32 and xxh3_32_with_seed don't exist. Should be xxh32 and xxh32_with_seed from the xxh32 module.

Suggested change
xxh32::{xxh3_32, xxh3_32_with_seed},
xxh32::{xxh32, xxh32_with_seed},

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support multiple xxhash variants
1 participant