Skip to content

Conversation

universalmind303
Copy link
Contributor

Changes Made

adds use_process to daft.func to allow same performance benefits as use_process for daft.udf.

Related Issues

Checklist

  • Documented in API Docs (if applicable)
  • Documented in User Guide (if applicable)
  • If adding a new documentation page, doc is added to docs/mkdocs.yml navigation
  • Documentation builds and is formatted properly

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Greptile Overview

Summary

This PR adds `use_process` parameter support to the `@daft.func` decorator, bringing feature parity with the existing `@daft.udf` decorator. The change allows users to execute functions decorated with `@daft.func` in separate processes rather than threads, which can provide performance benefits by avoiding Python's Global Interpreter Lock (GIL) and offering better fault isolation.

The implementation is comprehensive and touches multiple layers of the Daft architecture:

  1. Protobuf Schema: Added use_process field to the RowWiseFn message definition to enable serialization/deserialization of this configuration across distributed operations
  2. Python API Layer: Updated both RowWiseUdf and GeneratorUdf classes to accept and propagate the use_process parameter
  3. Rust Integration: Modified the row_wise_udf function signature and RowWisePyFn struct to store and preserve the flag throughout expression transformations
  4. Expression System: Updated expression transformation logic in multiple files to ensure use_process settings are preserved during query optimization passes
  5. Type Definitions: Added proper type annotations to Python stub files for IDE support

The change follows the established patterns from the existing @daft.udf implementation and maintains backward compatibility by making use_process an optional parameter. This ensures that both row-wise and generator functions decorated with @daft.func can now leverage the same process isolation benefits that were previously only available to legacy UDFs.

Important Files Changed

Changed Files
Filename Score Overview
src/daft-proto/proto/v1/daft.proto 5/5 Added optional use_process field to RowWiseFn protobuf message for serialization support
src/daft-core/src/series/mod.rs 5/5 Removed #[inline] attribute from from_arrow method as optimization cleanup
src/daft-logical-plan/src/partitioning.rs 5/5 Updated clustering spec translation to preserve use_process flag for row-wise functions
daft/daft/init.pyi 5/5 Added use_process parameter to row_wise_udf function signature in Python stub file
src/daft-ir/src/proto/functions.rs 5/5 Added protobuf serialization/deserialization support for use_process field
src/daft-dsl/src/functions/python/mod.rs 4/5 Extended UDF property extraction to handle both legacy and row-wise UDFs with use_process
src/daft-dsl/src/expr/mod.rs 5/5 Updated expression transformation to preserve use_process flag during optimization
src/daft-proto/src/generated/daft.v1.rs 5/5 Auto-generated protobuf code adding use_process field to RowWiseFn message
daft/udf/init.py 5/5 Added use_process parameter to @daft.func decorator with proper type annotations
src/daft-dsl/src/python.rs 5/5 Added use_process parameter to row_wise_udf function signature and implementation
src/daft-logical-plan/src/ops/project.rs 5/5 Updated expression replacement logic to preserve use_process during subexpression factoring
daft/udf/row_wise.py 4/5 Modified RowWiseUdf constructor to accept and propagate use_process parameter
daft/udf/generator.py 5/5 Added use_process support to GeneratorUdf class following established patterns
src/daft-dsl/src/python_udf.rs 4/5 Added use_process field to RowWisePyFn struct and related function signatures

Confidence score: 4/5

  • This PR is safe to merge with low risk of breaking existing functionality
  • Score reflects comprehensive implementation across multiple layers with consistent patterns, though some files lack active usage of the new flag in execution logic
  • Pay close attention to src/daft-dsl/src/functions/python/mod.rs and daft/udf/row_wise.py for proper UDF property extraction and parameter handling

Sequence Diagram

sequenceDiagram
    participant User
    participant DaftFuncDecorator as "@daft.func Decorator"
    participant PartialUdf as "_PartialUdf"
    participant RowWiseUdf as "RowWiseUdf"
    participant GeneratorUdf as "GeneratorUdf"
    participant PyExpr as "PyExpr"
    participant RowWisePyFn as "RowWisePyFn"

    User->>DaftFuncDecorator: "@daft.func(use_process=True)"
    DaftFuncDecorator->>PartialUdf: create with use_process=True
    
    User->>PartialUdf: call with function
    PartialUdf->>PartialUdf: check if generator function
    
    alt Is Generator Function
        PartialUdf->>GeneratorUdf: create with use_process=True
        GeneratorUdf->>PyExpr: create expression via row_wise_udf
    else Is Regular Function
        PartialUdf->>RowWiseUdf: create with use_process=True
        RowWiseUdf->>PyExpr: create expression via row_wise_udf
    end
    
    PyExpr->>RowWisePyFn: create with use_process flag
    RowWisePyFn-->>User: return decorated function
    
    Note over User, RowWisePyFn: Function can now be used in DataFrame operations<br/>with process-based execution when use_process=True
Loading

14 files reviewed, no comments

Edit Code Review Agent Settings | Greptile

@universalmind303
Copy link
Contributor Author

so this is super weird, but use_process actually crashes if i use it with a daft.File and I have no idea why 🤔

@daft.func(return_dtype=bytes, use_process=True)
def download(b: daft.File):
    return b.read(0)


df = daft.from_pydict({"path": ["~/Development/Daft/CLAUDE.md"]})


df = df.with_column("content", download(daft.functions.file(df["path"])))
df = df.select("content")
print(df.collect())
raceback (most recent call last):
  File "/Users/corygrinstead/Development/Daft/data/daft1.py", line 20, in <module>
    print(df.collect())
          ^^^^^^^^^^^^
  File "/Users/corygrinstead/Development/Daft/daft/dataframe/dataframe.py", line 4015, in collect
    self._materialize_results()
  File "/Users/corygrinstead/Development/Daft/daft/dataframe/dataframe.py", line 3977, in _materialize_results
    self._result_cache = get_or_create_runner().run(self._builder)
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/corygrinstead/Development/Daft/daft/runners/native_runner.py", line 67, in run
    results = list(self.run_iter(builder))
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/corygrinstead/Development/Daft/daft/runners/native_runner.py", line 106, in run_iter
    for result in results_gen:
  File "/Users/corygrinstead/Development/Daft/daft/execution/native_executor.py", line 44, in <genexpr>
    return (
           ^
  File "/Users/corygrinstead/Development/Daft/daft/execution/udf.py", line 127, in eval_input
    response = self.handle_conn.recv()
               ^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/corygrinstead/.local/share/uv/python/cpython-3.11.13-macos-aarch64-none/lib/python3.11/multiprocessing/connection.py", line 250, in recv
    buf = self._recv_bytes()
          ^^^^^^^^^^^^^^^^^^
  File "/Users/corygrinstead/.local/share/uv/python/cpython-3.11.13-macos-aarch64-none/lib/python3.11/multiprocessing/connection.py", line 430, in _recv_bytes
    buf = self._recv(4)
          ^^^^^^^^^^^^^
  File "/Users/corygrinstead/.local/share/uv/python/cpython-3.11.13-macos-aarch64-none/lib/python3.11/multiprocessing/connection.py", line 399, in _recv
    raise EOFError

Base automatically changed from cory/lit-optimizations-pt2 to main October 2, 2025 22:04
@github-actions github-actions bot added the feat label Oct 2, 2025
Copy link

codecov bot commented Oct 3, 2025

Codecov Report

❌ Patch coverage is 77.27273% with 5 lines in your changes missing coverage. Please review.
✅ Project coverage is 75.37%. Comparing base (0bb0713) to head (09e23a0).
⚠️ Report is 4 commits behind head on main.

Files with missing lines Patch % Lines
src/daft-logical-plan/src/ops/project.rs 0.00% 2 Missing ⚠️
src/daft-logical-plan/src/partitioning.rs 0.00% 2 Missing ⚠️
src/daft-dsl/src/python_udf.rs 66.66% 1 Missing ⚠️
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #5323      +/-   ##
==========================================
- Coverage   75.37%   75.37%   -0.01%     
==========================================
  Files         983      983              
  Lines      123738   123752      +14     
==========================================
+ Hits        93270    93275       +5     
- Misses      30468    30477       +9     
Files with missing lines Coverage Δ
daft/udf/__init__.py 100.00% <100.00%> (ø)
daft/udf/generator.py 93.18% <100.00%> (+0.15%) ⬆️
daft/udf/row_wise.py 90.00% <100.00%> (+0.16%) ⬆️
src/daft-dsl/src/expr/mod.rs 79.78% <100.00%> (+0.03%) ⬆️
src/daft-dsl/src/functions/python/mod.rs 72.84% <100.00%> (ø)
src/daft-dsl/src/python.rs 80.51% <100.00%> (+0.09%) ⬆️
src/daft-dsl/src/python_udf.rs 83.80% <66.66%> (-3.25%) ⬇️
src/daft-logical-plan/src/ops/project.rs 57.78% <0.00%> (-0.24%) ⬇️
src/daft-logical-plan/src/partitioning.rs 46.38% <0.00%> (-0.29%) ⬇️

... and 2 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@universalmind303
Copy link
Contributor Author

previously mentioned bug was unrelated to the work here.

@universalmind303 universalmind303 merged commit cd6ebd5 into main Oct 6, 2025
45 checks passed
@universalmind303 universalmind303 deleted the cory/daft-func-process branch October 6, 2025 16:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant