feat: add `use_process` flag for `@daft.func(...)` #5323
This makes it faster because we don't need to infer the datatype when we know what it is upfront. It also makes the `cast` logic check whether the value already has the target type and skip the cast if so. This speeds up literal creation a lot.
…/lit-optimizations-pt2
Greptile Overview
Summary
This PR adds `use_process` parameter support to the `@daft.func` decorator, bringing feature parity with the existing `@daft.udf` decorator. The change allows users to execute functions decorated with `@daft.func` in separate processes rather than threads, which can provide performance benefits by avoiding Python's Global Interpreter Lock (GIL) and offering better fault isolation.

The implementation is comprehensive and touches multiple layers of the Daft architecture:
- Protobuf Schema: Added `use_process` field to the `RowWiseFn` message definition to enable serialization/deserialization of this configuration across distributed operations
- Python API Layer: Updated both `RowWiseUdf` and `GeneratorUdf` classes to accept and propagate the `use_process` parameter
- Rust Integration: Modified the `row_wise_udf` function signature and `RowWisePyFn` struct to store and preserve the flag throughout expression transformations
- Expression System: Updated expression transformation logic in multiple files to ensure `use_process` settings are preserved during query optimization passes
- Type Definitions: Added proper type annotations to Python stub files for IDE support
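As a rough illustration of what "optional field" and "preserved during transformations" mean in the layers listed above, here is a minimal standard-library sketch. `RowWiseFnSketch`, its fields, and the dictionary encoding are hypothetical stand-ins invented for this example; they are not Daft's protobuf message or Rust struct.

```python
from __future__ import annotations

from dataclasses import dataclass, replace
from typing import Any


@dataclass(frozen=True)
class RowWiseFnSketch:
    """Hypothetical stand-in for a row-wise function node; not Daft's actual type."""

    name: str
    args: tuple[str, ...]
    use_process: bool | None = None  # None means "not specified by the user"

    def to_dict(self) -> dict[str, Any]:
        # Leave the optional field out when it is unset, as an optional wire field would be.
        data: dict[str, Any] = {"name": self.name, "args": list(self.args)}
        if self.use_process is not None:
            data["use_process"] = self.use_process
        return data

    @classmethod
    def from_dict(cls, data: dict[str, Any]) -> "RowWiseFnSketch":
        # Payloads written before the field existed simply lack the key.
        return cls(data["name"], tuple(data["args"]), data.get("use_process"))

    def with_args(self, new_args: tuple[str, ...]) -> "RowWiseFnSketch":
        # replace() copies every other field, so the flag survives the rewrite.
        return replace(self, args=new_args)


node = RowWiseFnSketch("download", ("path",), use_process=True)
round_tripped = RowWiseFnSketch.from_dict(node.to_dict())
rewritten = node.with_args(("path_alias",))
legacy = RowWiseFnSketch.from_dict({"name": "download", "args": ["path"]})
```

The design point is that a rewrite which rebuilds the node from scratch has to copy the flag explicitly, whereas a copy-with-changes helper keeps it by default.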
The change follows the established patterns from the existing `@daft.udf` implementation and maintains backward compatibility by making `use_process` an optional parameter. This ensures that both row-wise and generator functions decorated with `@daft.func` can now leverage the same process isolation benefits that were previously only available to legacy UDFs.
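To make the process isolation benefit concrete, the following sketch uses only the Python standard library and does not involve Daft. It is not how Daft runs UDFs. The helper `run_isolated` is a hypothetical name, and the snippet only shows that a child process has its own interpreter (and so its own GIL) and that an abrupt failure in the child reaches the parent as an exit code rather than a crash.

```python
import os
import subprocess
import sys


def run_isolated(source: str) -> subprocess.CompletedProcess:
    """Run a snippet of Python source in a fresh interpreter process."""
    return subprocess.run(
        [sys.executable, "-c", source],
        capture_output=True,
        text=True,
        check=False,  # a failing child should not raise in the parent
    )


# The child reports its own PID, which differs from the parent's.
healthy = run_isolated("import os; print(os.getpid())")
child_pid = int(healthy.stdout.strip())

# A child that exits abruptly only surfaces as a non-zero return code.
crashed = run_isolated("import os; os._exit(3)")

parent_pid = os.getpid()
```

The parent keeps running after the second call, which is the fault-isolation property that thread-based execution cannot offer.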
Important Files Changed
Changed Files
Filename | Score | Overview |
---|---|---|
src/daft-proto/proto/v1/daft.proto | 5/5 | Added optional use_process field to RowWiseFn protobuf message for serialization support |
src/daft-core/src/series/mod.rs | 5/5 | Removed #[inline] attribute from from_arrow method as optimization cleanup |
src/daft-logical-plan/src/partitioning.rs | 5/5 | Updated clustering spec translation to preserve use_process flag for row-wise functions |
`daft/daft/__init__.pyi` | 5/5 | Added use_process parameter to row_wise_udf function signature in Python stub file |
src/daft-ir/src/proto/functions.rs | 5/5 | Added protobuf serialization/deserialization support for use_process field |
src/daft-dsl/src/functions/python/mod.rs | 4/5 | Extended UDF property extraction to handle both legacy and row-wise UDFs with use_process |
src/daft-dsl/src/expr/mod.rs | 5/5 | Updated expression transformation to preserve use_process flag during optimization |
src/daft-proto/src/generated/daft.v1.rs | 5/5 | Auto-generated protobuf code adding use_process field to RowWiseFn message |
`daft/udf/__init__.py` | 5/5 | Added use_process parameter to @daft.func decorator with proper type annotations |
src/daft-dsl/src/python.rs | 5/5 | Added use_process parameter to row_wise_udf function signature and implementation |
src/daft-logical-plan/src/ops/project.rs | 5/5 | Updated expression replacement logic to preserve use_process during subexpression factoring |
daft/udf/row_wise.py | 4/5 | Modified RowWiseUdf constructor to accept and propagate use_process parameter |
daft/udf/generator.py | 5/5 | Added use_process support to GeneratorUdf class following established patterns |
src/daft-dsl/src/python_udf.rs | 4/5 | Added use_process field to RowWisePyFn struct and related function signatures |
Confidence score: 4/5
- This PR is safe to merge with low risk of breaking existing functionality
- Score reflects comprehensive implementation across multiple layers with consistent patterns, though some files lack active usage of the new flag in execution logic
- Pay close attention to `src/daft-dsl/src/functions/python/mod.rs` and `daft/udf/row_wise.py` for proper UDF property extraction and parameter handling
Sequence Diagram
sequenceDiagram
participant User
participant DaftFuncDecorator as "@daft.func Decorator"
participant PartialUdf as "_PartialUdf"
participant RowWiseUdf as "RowWiseUdf"
participant GeneratorUdf as "GeneratorUdf"
participant PyExpr as "PyExpr"
participant RowWisePyFn as "RowWisePyFn"
User->>DaftFuncDecorator: "@daft.func(use_process=True)"
DaftFuncDecorator->>PartialUdf: create with use_process=True
User->>PartialUdf: call with function
PartialUdf->>PartialUdf: check if generator function
alt Is Generator Function
PartialUdf->>GeneratorUdf: create with use_process=True
GeneratorUdf->>PyExpr: create expression via row_wise_udf
else Is Regular Function
PartialUdf->>RowWiseUdf: create with use_process=True
RowWiseUdf->>PyExpr: create expression via row_wise_udf
end
PyExpr->>RowWisePyFn: create with use_process flag
RowWisePyFn-->>User: return decorated function
Note over User, RowWisePyFn: Function can now be used in DataFrame operations<br/>with process-based execution when use_process=True
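The dispatch in the diagram can be approximated with a small decorator sketch. This is an illustration in plain Python, not Daft's code: `func_sketch` and the `kind` and `use_process` attributes are hypothetical names, and the wrapper just calls the function directly instead of building an expression.

```python
import functools
import inspect
from typing import Any, Callable, Optional


def func_sketch(fn: Optional[Callable[..., Any]] = None, *, use_process: Optional[bool] = None):
    """Hypothetical decorator usable bare or with keyword options."""

    def wrap(target: Callable[..., Any]) -> Callable[..., Any]:
        # Generator functions and regular functions take different branches.
        kind = "generator" if inspect.isgeneratorfunction(target) else "row_wise"

        @functools.wraps(target)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            return target(*args, **kwargs)

        # Record the options on the wrapper so later stages can read them.
        wrapper.kind = kind
        wrapper.use_process = use_process
        return wrapper

    # Called as @func_sketch(...): return the real decorator.
    # Called as @func_sketch: decorate immediately.
    return wrap if fn is None else wrap(fn)


@func_sketch(use_process=True)
def double(x: int) -> int:
    return x * 2


@func_sketch(use_process=True)
def repeat(x: int):
    for _ in range(2):
        yield x


@func_sketch
def plain(x: int) -> int:
    return x
```

Leaving `use_process` as `None` by default means existing call sites that never pass the flag behave as before, which matches the backward-compatibility point in the summary.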
14 files reviewed, no comments
…ual-Inc/Daft into cory/daft-func-process
so this is super weird, but
import daft

@daft.func(return_dtype=bytes, use_process=True)
def download(b: daft.File):
    return b.read(0)

df = daft.from_pydict({"path": ["~/Development/Daft/CLAUDE.md"]})
df = df.with_column("content", download(daft.functions.file(df["path"])))
df = df.select("content")
print(df.collect())
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #5323 +/- ##
==========================================
- Coverage 75.37% 75.37% -0.01%
==========================================
Files 983 983
Lines 123738 123752 +14
==========================================
+ Hits 93270 93275 +5
- Misses 30468 30477 +9
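The figures in the coverage diff can be cross-checked with a few lines of arithmetic. The sketch below recomputes the percentages from the hit and line counts in the report. Flooring to two decimals is an assumption that happens to reproduce the displayed 75.37% and -0.01%; it is not a confirmed description of how Codecov rounds.

```python
import math


def coverage_pct(hits: int, lines: int) -> float:
    """Coverage as a percentage of tracked lines."""
    return 100.0 * hits / lines


def floor2(value: float) -> float:
    """Round down to two decimals (assumed display rule, not confirmed)."""
    return math.floor(value * 100) / 100


# Counts copied from the report: main on the left, #5323 on the right.
before_hits, before_lines = 93270, 123738
after_hits, after_lines = 93275, 123752

before = coverage_pct(before_hits, before_lines)
after = coverage_pct(after_hits, after_lines)
delta = after - before

# Misses are whatever the hits do not cover.
before_misses = before_lines - before_hits
after_misses = after_lines - after_hits
```

Of the 14 lines the patch adds to the tracked total, 5 are hit and 9 are missed, so the raw change in coverage is a small fraction of a percentage point.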
previously mentioned bug was unrelated to the work here.
Changes Made
Adds `use_process` to `daft.func` to allow the same performance benefits as `use_process` for `daft.udf`.
Related Issues