feat: make morsel size configurable to prevent it from being too big or small #5250
Conversation
Greptile Summary
This PR introduces configurable morsel size bounds to the Daft execution engine through two new optional configuration parameters: `morsel_size_lower_bound` and `morsel_size_upper_bound`. The implementation allows users to control the size of data morsels (chunks) during execution, preventing performance issues from overly large morsels (memory pressure) or overly small morsels (processing overhead).
The changes span multiple layers of the architecture:
- **Configuration Layer**: New fields added to `DaftExecutionConfig` in Rust, with corresponding Python bindings, supporting both programmatic configuration and environment variables (`DAFT_MORSEL_SIZE_LOWER_BOUND`, `DAFT_MORSEL_SIZE_UPPER_BOUND`)
- **Pipeline Integration**: The execution configuration is systematically threaded through all pipeline nodes (`BlockingSinkNode`, `StreamingSinkNode`, `IntermediateNode`) via `cfg.clone()` calls to enable access to the new bounds
- **Dispatcher Logic**: The core `get_morsel_size_bounds()` helper function implements precedence logic where global config bounds override operator-specific `MorselSizeRequirement` settings when both are present (see the sketch after this list)
- **Python API**: The `daft.context.set_execution_config()` method now accepts the new parameters, with validation to ensure the lower bound does not exceed the upper bound
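A minimal sketch of the precedence rule described in the Dispatcher Logic item above; the real `get_morsel_size_bounds()` signature is not shown in this PR, so the types here are assumptions:

```rust
// Assumed shape: each global config bound, when set, overrides the
// corresponding operator-specific bound; otherwise the operator's
// requirement is kept as-is.
fn get_morsel_size_bounds(
    cfg_lower: Option<usize>,
    cfg_upper: Option<usize>,
    operator_bounds: (usize, usize), // operator-specific (lower, upper)
) -> (usize, usize) {
    (
        cfg_lower.unwrap_or(operator_bounds.0),
        cfg_upper.unwrap_or(operator_bounds.1),
    )
}

fn main() {
    // A global upper bound of 10k overrides the operator's ~128k default.
    assert_eq!(get_morsel_size_bounds(None, Some(10_000), (0, 131_072)), (0, 10_000));
}
```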
The feature maintains backward compatibility by making both bounds optional (defaulting to `None`) and falling back to existing behavior when they are not configured. The implementation follows established patterns in the codebase for configuration management and includes comprehensive test coverage.
Confidence score: 3/5
- This PR introduces significant complexity with potential performance implications due to extensive config cloning throughout the pipeline
- Score reflects concerns about the hard-coded test assertion and incomplete PR description, though the core implementation appears sound
- Pay close attention to the test file and dispatcher logic implementation for potential runtime issues
11 files reviewed, 5 comments
```rust
let morsel_size_lower_bound_var = "DAFT_MORSEL_SIZE_LOWER_BOUND";
if let Ok(val) = std::env::var(morsel_size_lower_bound_var) {
    match val.parse::<usize>() {
        Ok(parsed) => cfg.morsel_size_lower_bound = Some(parsed),
```
style: Consider validating that `parsed` > 0 to prevent a zero lower bound, which could cause issues in buffer logic
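As a hedged illustration of this suggestion, a drop-in variant of the `match` above that rejects zero; the warning messages are assumptions, not the PR's actual error handling:

```rust
// Reject zero so downstream buffer logic never sees an empty lower bound.
match val.parse::<usize>() {
    Ok(parsed) if parsed > 0 => cfg.morsel_size_lower_bound = Some(parsed),
    Ok(_) => eprintln!("{morsel_size_lower_bound_var} must be > 0; ignoring"),
    Err(e) => eprintln!("failed to parse {morsel_size_lower_bound_var}: {e}"),
}
```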
```rust
let morsel_size_upper_bound_var = "DAFT_MORSEL_SIZE_UPPER_BOUND";
if let Ok(val) = std::env::var(morsel_size_upper_bound_var) {
    match val.parse::<usize>() {
        Ok(parsed) => cfg.morsel_size_upper_bound = Some(parsed),
```
style: Should validate that `parsed` > 0, and potentially ensure the upper bound is reasonable (not too large) to prevent memory issues
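Following the same idea, a sketch of cross-checking the two parsed bounds, mirroring the lower-must-not-exceed-upper validation that the summary attributes to `daft.context.set_execution_config()`; whether the env-var path performs this check is an assumption:

```rust
// Cross-check the two env-derived bounds so the lower bound can never
// exceed the upper bound (assumed behavior, not taken from the PR).
if let (Some(lo), Some(hi)) = (cfg.morsel_size_lower_bound, cfg.morsel_size_upper_bound) {
    assert!(
        lo <= hi,
        "DAFT_MORSEL_SIZE_LOWER_BOUND ({lo}) must not exceed DAFT_MORSEL_SIZE_UPPER_BOUND ({hi})"
    );
}
```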
Codecov Report

❌ Patch coverage is …

Additional details and impacted files:

```diff
@@            Coverage Diff             @@
##             main    #5250      +/-   ##
==========================================
+ Coverage   74.24%   74.62%   +0.38%
==========================================
  Files         973      973
  Lines      125213   124658     -555
==========================================
+ Hits        92959    93030      +71
+ Misses      32254    31628     -626
```
@colin-ho When you have time, please help me review the changes in this part. The main issue is that the current sink operator cannot control the size of the written data, so this PR first implements a workaround: making the upper and lower bounds of the morsel size configurable.
Have you tried using `into_batches`?
@colin-ho Do you mean replacing `repartition(1)` with `into_batches`?
I mean using `into_batches`.
It does work.
@colin-ho Thank you very much.
The way morsel sizing currently works is that there is a default morsel size range of (0, ~128k rows]. This requirement is propagated top down until an operator with a required morsel size is reached, e.g. a UDF with a batch size, and then that batch size becomes the new morsel size range. See #4894 for more details. The benefit of this is memory: if the UDF requires a batch size of 100, then the upstream scan does not need to scan more than 100 rows at a time. For this PR, let's keep the new configs for min/max morsel size that you have already added, and also add a deprecation warning for the existing `default_morsel_size` config.
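To make the top-down propagation concrete, here is a minimal self-contained sketch; the `MorselRange`, `Node`, and `propagate` names are invented for illustration and are not Daft's actual types:

```rust
// Sketch: the default (0, ~128k] range flows down the pipeline until an
// operator with its own requirement (e.g. a UDF batch size) replaces it
// for everything upstream of that operator.
#[derive(Clone, Copy, Debug, PartialEq)]
struct MorselRange(usize, usize); // (lower, upper) row bounds

struct Node {
    own_requirement: Option<MorselRange>, // e.g. Some(MorselRange(100, 100)) for a UDF
    child: Option<Box<Node>>,             // the upstream operator
}

fn propagate(node: &Node, downstream: MorselRange) -> Vec<MorselRange> {
    // An operator's own requirement overrides the downstream range.
    let effective = node.own_requirement.unwrap_or(downstream);
    let mut ranges = vec![effective];
    if let Some(child) = &node.child {
        ranges.extend(propagate(child, effective));
    }
    ranges
}

fn main() {
    // udf(batch_size=100) -> scan: the scan only needs 100 rows at a time.
    let scan = Node { own_requirement: None, child: None };
    let udf = Node { own_requirement: Some(MorselRange(100, 100)), child: Some(Box::new(scan)) };
    assert_eq!(propagate(&udf, MorselRange(0, 131_072)), vec![MorselRange(100, 100); 2]);
}
```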
```diff
 pipeline_node.propagate_morsel_size_requirement(
-    MorselSizeRequirement::Flexible(0, cfg.default_morsel_size),
+    MorselSizeRequirement::Flexible(
```
@colin-ho I directly modified the default value here to make it configurable. However, I don't really understand the statement "we should not change morsel size directly in write.rs or any other operator". It mainly refers to https://github.com/Eventual-Inc/Daft/blob/main/src/daft-local-execution/src/sinks/write.rs#L220C2-L222C50, which also looks like a fixed configuration. Why does the maximum value there use int::max instead of 128k rows? Can I modify it according to the current global configuration (even though I already know this can be achieved through into_batches)?
@colin-ho bump
> Why is the maximum value here using int::max instead of 128k rows?

`int::max` means that there won't be any buffering. We use it instead of 128k rows because the write sinks internally do their own buffering, such as the parquet writer, so we don't need to pre-buffer.

> Can I modify it according to the current global configuration (even though I already know that this function can be achieved through into_batches)?

Let's not directly modify the batch size based on global configuration in `write.rs`. I think if users want to change the configurations for writes, like batch size or row group size, they should be able to use `into_batches` or a parameter on the write operation itself.
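A minimal sketch of why a `usize::MAX` upper bound amounts to no pre-buffering; the `dispatch` function and its semantics are assumptions for illustration, not Daft's actual dispatcher:

```rust
// A morsel is only buffered or split when it falls outside the
// (lower, upper) range, so bounds of (0, usize::MAX) pass everything
// straight through to the sink's own internal buffering.
fn dispatch(morsel_rows: usize, lower: usize, upper: usize) -> &'static str {
    if morsel_rows < lower {
        "buffer until the lower bound is reached"
    } else if morsel_rows > upper {
        "split into chunks of at most `upper` rows"
    } else {
        "pass through unchanged"
    }
}

fn main() {
    // With bounds (0, usize::MAX), no morsel is ever buffered or split.
    assert_eq!(dispatch(128_000, 0, usize::MAX), "pass through unchanged");
}
```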
I have made the modifications accordingly, and I have left some questions on the corresponding code sections. We can discuss them there.
Changes Made

Related Issues

Checklist

- docs/mkdocs.yml navigation