Contributor

@Jay-ju Jay-ju commented Sep 9, 2025

  • Add LMDBConfig class with comprehensive configuration options
  • Update kv_get, kv_batch_get, kv_exists functions to accept KVConfig parameter
  • Add backend validation and error handling for multiple backends
  • Update test cases to use new KVConfig architecture
  • Maintain backward compatibility with legacy URI parameters
  • Provide factory methods for easy configuration creation

This implements the multi-KV backend architecture discussed in PR feedback, enabling support for different storage backends (Lance for AI/ML workloads, LMDB for high-performance caching) through a unified configuration interface.

Changes Made

Related Issues

Checklist

  • Documented in API Docs (if applicable)
  • Documented in User Guide (if applicable)
  • If adding a new documentation page, doc is added to docs/mkdocs.yml navigation
  • Documentation builds and is formatted properly (tag @/ccmao1130 for docs review)

@github-actions github-actions bot added the feat label Sep 9, 2025
Contributor

@greptile-apps greptile-apps bot left a comment


Greptile Summary

This PR implements a comprehensive multi-backend KV (Key-Value) store architecture for Daft, enabling support for different storage backends through a unified configuration interface. The changes introduce:

Core Architecture Changes:

  • New KVConfig class serving as a unified wrapper for different backend configurations (LanceConfig and LMDBConfig)
  • Session-based KV store management following the established pattern used for catalogs and providers
  • Multi-backend support with Lance (optimized for AI/ML workloads) and LMDB (for high-performance caching)

API Extensions:

  • Updated kv_get, kv_batch_get, and kv_exists functions to accept KVConfig parameters
  • New session management functions: attach_kv, detach_kv, set_kv, get_kv, has_kv, and current_kv
  • Factory methods for easy KV store creation (load_kv function)
  • Backward compatibility maintained with legacy URI parameters

Implementation Details:

  • Rust-Python integration through JSON serialization of configuration objects
  • Column projection optimization for Lance backend to reduce data transfer
  • Comprehensive error handling and validation for backend configurations
  • Builder pattern methods for fluent configuration updates
  • Integration with Daft's expression system and function registry
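
The JSON hand-off between the Python and native layers mentioned above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `ToyKVConfig`, `ToyLMDBConfig`, and their fields are hypothetical stand-ins, and the real `KVConfig` and `LMDBConfig` classes may be shaped differently.

```python
from __future__ import annotations

import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ToyLMDBConfig:
    """Hypothetical backend config; field names are illustrative only."""

    path: str
    max_readers: int = 8


@dataclass(frozen=True)
class ToyKVConfig:
    """Hypothetical wrapper that holds at most one backend config."""

    lmdb: ToyLMDBConfig | None = None

    def to_json(self) -> str:
        # Serialize to a JSON string that a native layer could parse.
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, payload: str) -> ToyKVConfig:
        raw = json.loads(payload)
        lmdb = ToyLMDBConfig(**raw["lmdb"]) if raw.get("lmdb") else None
        return cls(lmdb=lmdb)


config = ToyKVConfig(lmdb=ToyLMDBConfig(path="/data/cache"))
round_tripped = ToyKVConfig.from_json(config.to_json())
```

Round-tripping through a plain string keeps the boundary between the two languages simple, at the cost of validating the payload again on the receiving side.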

Dependency Updates:

  • Apache Arrow dependencies upgraded from 54.2.1 to 55.2.0 across multiple crates
  • Chrono updated to 0.4.39
  • New dependencies added: daft-io and serde_json for KV functionality

The architecture allows users to attach multiple KV stores with different backends to a session and switch between them dynamically, providing flexibility for different use cases. The implementation follows established Daft patterns for session-based resource management and maintains API consistency across the framework.
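
As a rough illustration of attaching several stores and switching between them, the toy registry below reuses the function names from the summary (`attach_kv`, `detach_kv`, `set_kv`, `current_kv`). The class, its behavior, and its error types are assumptions made for the sketch and are not Daft's session implementation.

```python
from __future__ import annotations


class ToySession:
    """Hypothetical registry; not Daft's Session implementation."""

    def __init__(self) -> None:
        self._stores: dict[str, object] = {}
        self._current: str | None = None

    def attach_kv(self, store: object, alias: str) -> None:
        if alias in self._stores:
            raise ValueError(f"KV store '{alias}' is already attached")
        self._stores[alias] = store
        # Assumption: the first attached store becomes the default.
        if self._current is None:
            self._current = alias

    def detach_kv(self, alias: str) -> None:
        if alias not in self._stores:
            raise KeyError(alias)
        del self._stores[alias]
        if self._current == alias:
            self._current = None

    def set_kv(self, alias: str) -> None:
        if alias not in self._stores:
            raise KeyError(alias)
        self._current = alias

    def current_kv(self) -> object:
        if self._current is None:
            raise RuntimeError("no KV store is attached to the session")
        return self._stores[self._current]


session = ToySession()
lance_store = {"backend": "lance"}  # plain dicts stand in for real stores
lmdb_store = {"backend": "lmdb"}
session.attach_kv(lance_store, alias="lance")
session.attach_kv(lmdb_store, alias="lmdb")
session.set_kv("lmdb")  # switch the active store
```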

Confidence score: 2/5

  • This PR introduces significant architectural changes but contains multiple critical issues that prevent safe production deployment
  • Score reflects serious implementation gaps including placeholder code, extensive debug logging, missing function definitions, and incomplete integrations
  • Pay close attention to all KV-related files, especially the Rust implementations in src/daft-dsl/src/functions/kv/ and Python bindings

Context used:

Rule - Import statements should be placed at the top of the file rather than inline within functions or methods. (link)

36 files reviewed, no comments


Comment on lines +88 to +93
let logical_type = match time_unit {
    TimeUnit::Second => Some(PrimitiveLogicalType::Integer(IntegerType::Int64)),
    TimeUnit::Millisecond => Some(PrimitiveLogicalType::Integer(IntegerType::Int64)),
    TimeUnit::Microsecond => Some(PrimitiveLogicalType::Integer(IntegerType::Int64)),
    TimeUnit::Nanosecond => Some(PrimitiveLogicalType::Integer(IntegerType::Int64)),
};
Contributor

logic: All Duration time units map to the same Integer(Int64) logical type, which may not preserve time unit information during deserialization. Consider using different logical type annotations or metadata to distinguish between Second, Millisecond, Microsecond, and Nanosecond units.
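
One way to picture the concern, shown in Python only for illustration (the reviewed code is Rust): if every unit is stored as a bare 64-bit integer, the reader cannot tell 1500 seconds from 1500 nanoseconds unless the unit travels alongside the value. The metadata key and helper names below are hypothetical and do not correspond to any Parquet or Arrow API.

```python
from __future__ import annotations

from enum import Enum


class TimeUnit(Enum):
    SECOND = "s"
    MILLISECOND = "ms"
    MICROSECOND = "us"
    NANOSECOND = "ns"


# Hypothetical metadata key; a real file format would define its own.
UNIT_KEY = "duration_unit"


def encode_duration(value: int, unit: TimeUnit) -> tuple[int, dict[str, str]]:
    """Store the raw integer plus a metadata entry naming its unit."""
    return value, {UNIT_KEY: unit.value}


def decode_duration(value: int, metadata: dict[str, str]) -> tuple[int, TimeUnit]:
    """Recover the unit from metadata instead of guessing it."""
    if UNIT_KEY not in metadata:
        raise ValueError("duration unit is missing from metadata")
    return value, TimeUnit(metadata[UNIT_KEY])
```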

Comment on lines 74 to 87
// Try multiple debug approaches
std::fs::write(
    "/tmp/rust_debug.log",
    "DEBUG: KVGetWithConfig::call invoked\n",
)
.ok();
eprintln!("DEBUG: KVGetWithConfig::call invoked");

// Also try writing to a different location
std::fs::write(
    "/workspace/iris_59ecff5f-dd8f-4eb8-a92d-dca852586066/rust_debug.log",
    "DEBUG: KVGetWithConfig::call invoked\n",
)
.ok();
Contributor

logic: Debug logging writes to hardcoded filesystem paths including /tmp and workspace directories. This should be removed before production deployment as it creates security risks and filesystem pollution.

Comment on lines 7 to 10
if TYPE_CHECKING:
    from daft.kv import KVConfig, LMDBConfig

from daft.kv import KVConfig, KVStore, LMDBConfig
Contributor

style: Remove redundant imports in TYPE_CHECKING block since the same classes are imported at runtime

Suggested change
- if TYPE_CHECKING:
-     from daft.kv import KVConfig, LMDBConfig
- from daft.kv import KVConfig, KVStore, LMDBConfig
+ from daft.kv import KVConfig, KVStore, LMDBConfig

Context Used: Rule - Import statements should be placed at the top of the file rather than inline within functions or methods. (link)

daft/kv/lance.py Outdated
Comment on lines 7 to 11
if TYPE_CHECKING:
    from daft.io import IOConfig
    from daft.kv import KVConfig, LanceConfig

from daft.kv import KVConfig, KVStore, LanceConfig
Contributor

style: imports are duplicated between TYPE_CHECKING block and regular imports - LanceConfig and KVConfig appear in both

Suggested change
- if TYPE_CHECKING:
-     from daft.io import IOConfig
-     from daft.kv import KVConfig, LanceConfig
- from daft.kv import KVConfig, KVStore, LanceConfig
+ if TYPE_CHECKING:
+     from daft.io import IOConfig
+     from daft.kv import KVConfig, LanceConfig
+ from daft.kv import KVStore
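
For reference, the general pattern behind both import comments looks roughly like this. Standard-library modules stand in for `daft.kv` and `daft.io` so the sketch runs on its own. A name used only in annotations can live under `TYPE_CHECKING`, while a name used at runtime needs exactly one unconditional import.

```python
from __future__ import annotations

from fractions import Fraction  # used at runtime, so imported unconditionally
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Used only in annotations, so it is never imported at runtime.
    from decimal import Decimal


def to_fraction(value: Decimal) -> Fraction:
    # The annotation above is not evaluated at runtime because of the
    # `from __future__ import annotations` line.
    return Fraction(str(value))
```

Whether a given name from `daft.kv` belongs in the guarded block depends on whether the module body references it outside annotations, which is not visible in the excerpts above.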

Comment on lines 486 to 541
fn to_proto(&self) -> ProtoResult<Self::Message> {
    not_implemented_err!("kv_expr")
}
Contributor

logic: The protobuf serialization is not implemented yet. This will cause runtime errors if KV expressions are serialized before the implementation is completed.

Comment on lines +67 to +164
from daft.expressions import lit
from daft.session import current_kv, get_kv
Contributor

syntax: Import statements should be at module top per style guide

Context Used: Rule - Import statements should be placed at the top of the file rather than inline within functions or methods. (link)

def kv_get(
    row_ids: Expression | str,
    columns: list[str] | str | None = None,
    on_error: Literal["raise", "null"] = "raise",
Contributor

logic: on_error parameter accepted but never used in implementation
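
A minimal sketch of what honoring `on_error` could look like, assuming "raise" means propagate a failed lookup and "null" means substitute `None`. The function `toy_kv_get` and the dict-backed store are hypothetical; the PR's actual semantics for the parameter are not shown in this thread.

```python
from __future__ import annotations

from typing import Literal


def toy_kv_get(
    store: dict[str, str],
    keys: list[str],
    on_error: Literal["raise", "null"] = "raise",
) -> list[str | None]:
    """Look up each key, applying the requested error policy."""
    if on_error not in ("raise", "null"):
        raise ValueError(f"unsupported on_error value: {on_error!r}")
    results: list[str | None] = []
    for key in keys:
        try:
            results.append(store[key])
        except KeyError:
            if on_error == "raise":
                raise
            # on_error == "null": record a missing value and keep going.
            results.append(None)
    return results
```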

Comment on lines 264 to 265
# TODO: Implement proper IOConfig serialization when needed
config_dict["lance"]["io_config"] = None
Contributor

logic: IOConfig serialization is incomplete. io_config is always written as None, so any IOConfig the caller supplies is dropped.

Comment on lines 134 to 135
"""Test that KV functions fail gracefully when no KV store is attached."""
from daft.functions.kv import kv_batch_get, kv_exists, kv_get
Contributor

style: This import placement violates the repository's style guide. Import statements should be at the top of the file rather than inline within functions.

Suggested change
- """Test that KV functions fail gracefully when no KV store is attached."""
- from daft.functions.kv import kv_batch_get, kv_exists, kv_get
+ """Test that KV functions fail gracefully when no KV store is attached."""
+ # Import moved to top of file as per style guide

Context Used: Rule - Import statements should be placed at the top of the file rather than inline within functions or methods. (link)

Comment on lines 56 to 59
parent.add_function(wrap_pyfunction!(python::kv_get_with_config, parent)?)?;
parent.add_function(wrap_pyfunction!(python::kv_batch_get_with_config, parent)?)?;
parent.add_function(wrap_pyfunction!(python::kv_exists_with_config, parent)?)?;
Contributor

logic: These Python function bindings reference functions that don't appear to be implemented in the python module. The python::kv_get_with_config, python::kv_batch_get_with_config, and python::kv_exists_with_config functions need to be defined in src/daft-dsl/src/python.rs or this will cause compilation errors.

}
}

impl FunctionEvaluator for LanceKVExpr {
Contributor

Can we use the newer ScalarFunction instead of FunctionEvaluator?

}
}

impl FunctionEvaluator for KVExpr {
Contributor

Same here. We should use the ScalarFunction trait instead of FunctionEvaluator.

There's a more detailed writeup here on how we want contributors to add new expressions.

Comment on lines 10 to 12
parent.add_fn(KVGetWithConfig);
parent.add_fn(KVBatchGetWithConfig);
parent.add_fn(KVExistsWithConfig);
Contributor

Since these are all *WithConfig, I'd suggest dropping the suffix.

}

/// Initialize KV Store functions in the global function registry
pub fn register_kv_functions() {
Contributor

This appears unused.

Contributor Author

@universalmind303 thank you for your review, I'll check it later

Contributor

I also retagged @rchowell for review as he reviewed the initial PR.

@Jay-ju Jay-ju force-pushed the feature-kv-interface branch 5 times, most recently from 429d845 to 017c08b Compare October 14, 2025 13:46
@codecov

codecov bot commented Oct 14, 2025

Codecov Report

❌ Patch coverage is 34.02633% with 952 lines in your changes missing coverage. Please review.
✅ Project coverage is 70.10%. Comparing base (c113e35) to head (fed04ad).
⚠️ Report is 121 commits behind head on main.

Files with missing lines                       Patch %  Lines
src/daft-dsl/src/functions/kv/memory.rs        0.00%    225 Missing ⚠️
daft/functions/kv.py                           15.67%   199 Missing ⚠️
src/daft-dsl/src/functions/kv/kv_functions.rs  53.46%   114 Missing ⚠️
src/daft-session/src/kv.rs                     31.88%   94 Missing ⚠️
daft/kv/lmdb.py                                0.00%    61 Missing ⚠️
daft/kv/lance.py                               0.00%    54 Missing ⚠️
src/daft-dsl/src/functions/kv/mod.rs           0.00%    53 Missing ⚠️
src/daft-session/src/python.rs                 64.12%   47 Missing ⚠️
daft/kv/__init__.py                            62.06%   33 Missing ⚠️
src/daft-session/src/session.rs                55.00%   27 Missing ⚠️
... and 5 more
Additional details and impacted files


@@            Coverage Diff             @@
##             main    #5170      +/-   ##
==========================================
- Coverage   75.31%   70.10%   -5.22%     
==========================================
  Files         990     1000      +10     
  Lines      124917   127276    +2359     
==========================================
- Hits        94083    89226    -4857     
- Misses      30834    38050    +7216     
Files with missing lines                               Coverage Δ
daft/functions/__init__.py                             100.00% <100.00%> (ø)
daft/io/clickhouse/clickhouse_data_sink.py             97.77% <100.00%> (ø)
src/daft-core/src/array/ops/time.rs                    87.99% <ø> (ø)
src/daft-dsl/src/functions/kv/registry.rs              100.00% <100.00%> (ø)
src/daft-dsl/src/functions/mod.rs                      88.09% <ø> (ø)
src/daft-dsl/src/lib.rs                                100.00% <100.00%> (ø)
...ft-local-execution/src/intermediate_ops/project.rs  100.00% <ø> (ø)
src/daft-parquet/src/stream_reader.rs                  89.18% <100.00%> (+0.12%) ⬆️
src/daft-session/src/options.rs                        66.66% <ø> (ø)
src/lib.rs                                             94.69% <100.00%> (+0.04%) ⬆️
... and 15 more

... and 172 files with indirect coverage changes


Comment on lines 121 to 122
source: str | None = None,
) -> Expression:
Contributor

Suggested change
- source: str | None = None,
- ) -> Expression:
+ *
+ store: str | None = None,
+ ) -> Expression:

Please consistently update the 'store' argument of all kv functions to be a keyword-only argument placed after all other expressions. This would look like:

def kv_put(key: Expression | str, value: Expression | Any, *, store: KvStoreLike | None = None): ...

KvStoreLike = KvStore | str  # kv store instance or reference

If the instance/reference is None, then you resolve the KvStore from the session.

Contributor Author

@rchowell

  • Both str and None are still supported here. The intended flow is to call set_kv(kv_store) first.
  • If store_name (a str) is not passed to kv_put, the session's default kv_store is used.
  • If it is passed, the store is looked up in the session by that name rather than taking a kvstore object directly.
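
The keyword-only `store` argument and the session fallback discussed in this thread can be sketched as follows. This follows the reviewer's proposed shape, where `store` may be an instance or a name. `ToyKVStore`, `ToySession`, and the module-level `SESSION` are hypothetical stand-ins, and `kv_put` here writes into a plain dict rather than building a Daft expression.

```python
from __future__ import annotations

from typing import Any, Union


class ToyKVStore:
    """Hypothetical in-memory store used only for this sketch."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.data: dict[str, Any] = {}


# A store instance, or the name it was registered under.
KVStoreLike = Union[ToyKVStore, str]


class ToySession:
    """Hypothetical session holding named stores and a default."""

    def __init__(self) -> None:
        self.stores: dict[str, ToyKVStore] = {}
        self.default: str | None = None

    def set_kv(self, store: ToyKVStore) -> None:
        self.stores[store.name] = store
        self.default = store.name

    def resolve(self, store: KVStoreLike | None) -> ToyKVStore:
        if isinstance(store, ToyKVStore):
            return store  # an instance is used directly
        name = self.default if store is None else store
        if name is None or name not in self.stores:
            raise LookupError(f"no KV store available for {store!r}")
        return self.stores[name]


SESSION = ToySession()


def kv_put(key: str, value: Any, *, store: KVStoreLike | None = None) -> None:
    # `store` is keyword-only because it follows the bare `*`.
    SESSION.resolve(store).data[key] = value


cache = ToyKVStore("cache")
SESSION.set_kv(cache)
kv_put("a", 1)                 # resolved from the session default
kv_put("b", 2, store="cache")  # resolved from the session by name
```

Passing `store` positionally, as in `kv_put("c", 3, "cache")`, raises a `TypeError`, which is what keeps call sites unambiguous once more expression arguments are added.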

Comment on lines 196 to 198
# Memory backend: handle get through Python UDF and return Struct aggregated across attached stores
if hasattr(kv_store, "backend_type") and kv_store.backend_type == "memory":
    from daft.session import list_kv
Contributor

Is it possible to not have the implementation be aware of any particular implementation? I do not follow right now why the memory store is special-cased.

Contributor Author

Makes sense, I have removed it.
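
A backend-agnostic call path of the kind requested here could look roughly like the sketch below. The `KVBackend` protocol and both toy backends are hypothetical and are not the interfaces defined in the PR. The calling function depends only on a shared method, so no `backend_type` check is needed.

```python
from __future__ import annotations

from typing import Protocol


class KVBackend(Protocol):
    """Hypothetical interface that every backend in this sketch satisfies."""

    def get(self, key: str) -> str | None: ...


class MemoryBackend:
    def __init__(self, data: dict[str, str]) -> None:
        self._data = dict(data)

    def get(self, key: str) -> str | None:
        return self._data.get(key)


class UppercaseBackend:
    """Second toy backend, to show the caller needs no changes."""

    def __init__(self, inner: KVBackend) -> None:
        self._inner = inner

    def get(self, key: str) -> str | None:
        value = self._inner.get(key)
        return None if value is None else value.upper()


def kv_get(backend: KVBackend, key: str) -> str | None:
    # No backend_type checks: any object with a matching get() works.
    return backend.get(key)


memory = MemoryBackend({"k": "v"})
```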

daft/session.py Outdated
Comment on lines 261 to 276
# Keep Python-side registry for aggregation use-cases
self._kv_registry[a] = k
Contributor

Why specifically do we need special python registration of kv-stores?

Contributor Author

it's useless. Remove it

Comment on lines 192 to 203
// Create Lance KV expression with extracted parameters
let lance_expr = LanceKVExpr::batch_get(
    uri_str,
    columns_list,
    final_batch_size,
    on_error_str,
    io_config_opt,
);

// Directly call the Lance implementation
match &lance_expr {
    LanceKVExpr::BatchGet {
Contributor

Does this mean that kv_batch_get_with_config is hardcoded to Lance for now?

Contributor Author

@Jay-ju Jay-ju Oct 17, 2025

Yes. The Lance kvstore will be added in the next PR.
This PR only uses a standalone memstore to verify the process.

@Jay-ju Jay-ju force-pushed the feature-kv-interface branch 8 times, most recently from 09c0f35 to 0676ed1 Compare October 17, 2025 09:17
@Jay-ju Jay-ju force-pushed the feature-kv-interface branch from 0676ed1 to 0b0fc47 Compare October 17, 2025 10:37
@Jay-ju
Contributor Author

Jay-ju commented Oct 17, 2025

@rchowell I've addressed your comments. Please take a look when you have time.

@universalmind303
Contributor

@Jay-ju, It looks like @rchowell approved 🎉. Unfortunately it appears there are now some merge conflicts. If you could get those resolved, we could get this merged and included into the next release.

@madvart
Contributor

madvart commented Nov 25, 2025

@Jay-ju - Thanks for the work on this. Checking to see if you are able to push this through.

4 participants