Add JoinType preservation helpers and `dynamic_filter_side`; enable dynamic filter pushdown in HashJoinExec #17518
Open: kosiew wants to merge 41 commits into apache:main from kosiew:df-join-metadata-16973
+980
−207
* **Refactored `create_dynamic_filter`** for better readability and clearer join side handling.
* **Improved dynamic filter side tests** for correctness across various join types.
* **Cleaned up join key null filtering and filter pushdown logic**, removing unused functions and clarifying join preservation behavior.
* **Enhanced `right_side_dynamic_filter_test`** with better clarity and extended support for more join types.
* **Upgraded `HashJoinExec`** to:
  * Clearly specify dynamic filter sides.
  * Improve documentation.
  * Support dynamic filter pushdown based on join preservation metadata.
* **Refactored `JoinType`** to:
  * Add helper methods for unmatched row preservation.
  * Improve clarity in dynamic filter logic.
  * Support new method signatures used in `HashJoinExec`.
* **Consolidated and expanded dynamic filter tests**:
  * Merged truth table test cases.
  * Added comprehensive tests for `dynamic_filter_side`.
* **Shelved unrelated changes** to maintain focus on dynamic filter and join logic improvements.
* Implemented dynamic filter pushdown in `HashJoinExec` to collect and apply bounds from both join sides.
* Added probe expressions and improved bounds accumulation logic, eliminating unnecessary futures for partition-bounds reporting.
* Updated inner join logic to prefer the right side for filter pushdown and added support for full join pushdown.

JoinType API Improvements

* Added methods to `JoinType` for dynamic filter pushdown checks and unmatched-row preservation.
* Introduced swap functionality with comprehensive truth-table and unit tests.

Refactoring & Test Coverage

* Refactored dynamic filter handling in `HashJoinExec` for clarity and performance.
* Removed unused dynamic filter side and related expressions.
* Expanded test coverage with join-type truth tables, full join scenarios, and `JoinType` swap tests.

---

These changes collectively enhance performance and correctness of dynamic filter pushdown across different join types, while simplifying related code and ensuring robust test coverage.
Improves partition bounds reporting and synchronization mechanisms within the dynamic filter logic of `HashJoinExec`. This refactor increases robustness and reliability during query execution.

Clarify Dynamic Filter Side Logic

Adds explanatory comments to document the reasoning behind child selection logic based on the dynamic filter side in `HashJoinExec`, aiding future maintenance and readability.
…ight child pushdown
…onsExec and improve batch creation in build_int32_scan
…sAccumulator
- Adjust dynamic filter pushdown conditions to ensure correct join side handling.
- Refactor join expressions to clarify which side receives the dynamic filter.
- Improve bounds reporting logic in HashJoinStream based on join type.
…oin side and streamline filter creation
…joins in HashJoinExec" This reverts commit 13ed362.
…h non-preserving join types
…ty and maintainability
…titionsExec and CoalesceBatchesExec for improved partitioning and batch handling
…ng HashJoin and TopK execution plans
…lect probe sides for various join types
…add tests for preservation logic
…ide and update related logic
…pping filter creation
51b0bad to f5f1fc3
Labels
* common: Related to common crate
* core: Core DataFusion crate
* physical-plan: Changes to the physical-plan crate
Which issue does this PR close?
This is part of a series of smaller PRs to reimplement #17090
Rationale for this change
Dynamic filter pushdown previously assumed the probe side (right) for inner joins and was overly conservative or unsafe for non-inner joins because the optimizer lacked clear metadata about which join input must be preserved in output. This caused missed pruning opportunities and required ad-hoc logic in join implementations.
This change introduces explicit preservation metadata on `JoinType` and a `dynamic_filter_side()` helper so join implementations (currently HashJoinExec) and optimizer/rewriters can determine which input can safely accept dynamic filters. With this information the physical operator can:

Overall this enables safer and more effective dynamic filter pushdown across a wider set of join types.
What changes are included in this PR?
High level summary of code changes (by module):
`datafusion/common/src/join_type.rs`
* `preserves`, `preserves_left`, `preserves_right` helpers and small `LEFT_PRESERVING` / `RIGHT_PRESERVING` const arrays.
* `dynamic_filter_side()` which returns the `JoinSide` eligible for receiving a dynamic filter (or `None` if both sides must be preserved).
* `dynamic_filter_side` truth table.

`datafusion/core/tests/physical_optimizer/filter_pushdown/`
* `build_scan`, `build_hash_join`, `build_topk`, `sort_expr`, etc. to construct small plans in tests.

`datafusion/physical-plan/src/joins/hash_join/exec.rs`
* `join_exprs_for_side(...)` helper and change `create_dynamic_filter(...)` to accept an explicit `JoinSide` argument.
* `join_type.dynamic_filter_side()` to determine where to attach dynamic filters and whether to enable bounds accumulation for a particular side.
* `gather_filters_for_pushdown_with_side(...)` so the correct child receives the `DynamicFilterPhysicalExpr`.
* `handle_child_pushdown_result_with_side(...)` respect `dynamic_filter_side()` and construct an updated `HashJoinExec` node when a dynamic filter is received from the expected child.
* (`JoinSide::None`).

`datafusion/physical-plan/src/joins/hash_join/shared_bounds.rs`
* `on_right` -> `join_exprs` and document it as "join expressions on the side receiving the dynamic filter".
* `join_exprs`.
* `SharedBoundsAccumulator` ensuring updates are applied only after all partitions reported.

`datafusion/physical-plan/src/joins/hash_join/stream.rs`
* `ProbeSideBoundsAccumulator` to accumulate min/max for probe-side when dynamic filters target the left side.
* `probe_bounds_accumulators` and `probe_side_row_count` through `HashJoinStream` and update them while scanning probe batches.

Tests & snapshots
* `DynamicFilterPhysicalExpr` where expected, verify metrics reflect pruning and verify that FULL joins do not incorrectly prune rows.

Are these changes tested?
Yes. This PR adds and updates tests at multiple levels:
* `join_type.rs` verifying the `preserves*` helpers and `dynamic_filter_side()` truth table.
* `shared_bounds.rs` and `hash_join` modules verifying synchronization, dynamic filter creation errors, and accumulation/reporting behavior.
* `core/tests/physical_optimizer/filter_pushdown` that run parts of the physical plan with `FilterPushdown::new_post_optimization()` and assert both the optimized plan (snapshots) and the runtime results (record batches and scan metrics).

All new behavior is covered by tests that both assert the plan contains `DynamicFilterPhysicalExpr` (or not) and that runtime metrics / output rows are correct (including assertions that FULL joins preserve rows and are not pruned).

Are there any user-facing changes?
API compatibility:
* `JoinType` gained extra helper methods, with no breaking changes to existing variants or serialization. This should be backward compatible for downstream code that enumerates `JoinType` variants.

Implementation notes & rationale
`dynamic_filter_side()` encodes conservative semantics: if a join preserves both sides (`Full`) it returns `None` (no dynamic filters). If exactly one side is preserved, the opposite side is eligible to receive a dynamic filter. When neither side is preserved (e.g., `Inner`, semis, antis), a default probe-side preference is used (right by default), but `LeftSemi`/`LeftAnti` prefer left and `RightSemi`/`RightAnti` prefer right.

HashJoinExec uses `dynamic_filter_side()` to decide three things:
* `DynamicFilterPhysicalExpr` (left or right child description) during filter gather.
* `handle_child_pushdown_result`.

`SharedBoundsAccumulator` was generalized to accept `join_exprs` that represent the join key expressions for the side receiving the dynamic filter. This removes implicit "right-side" assumptions and makes the accumulator symmetric.

`ProbeSideBoundsAccumulator` mirrors the build-side accumulator behavior to support collecting min/max values when the join expects dynamic filters on the left side.

Tests intentionally keep small, in-memory scans and snapshot assertions so reviewers can quickly inspect the expected plan strings and runtime outputs.
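The side-selection rule described in these notes can be sketched as a small truth table. This is a minimal illustration, not the code in this PR: the `JoinType` and `JoinSide` enums below are local stand-ins rather than the DataFusion types, mark-join variants are omitted, and the sets of preserving join types (`Left`/`Full` preserve the left input, `Right`/`Full` preserve the right input) are an assumption based on standard outer-join semantics.

```rust
// Local stand-ins for illustration only; these are NOT the DataFusion definitions.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum JoinType {
    Inner,
    Left,
    Right,
    Full,
    LeftSemi,
    LeftAnti,
    RightSemi,
    RightAnti,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum JoinSide {
    Left,
    Right,
    None,
}

impl JoinType {
    /// Assumption: unmatched left rows must appear in the output for Left and Full joins.
    fn preserves_left(self) -> bool {
        matches!(self, JoinType::Left | JoinType::Full)
    }

    /// Assumption: unmatched right rows must appear in the output for Right and Full joins.
    fn preserves_right(self) -> bool {
        matches!(self, JoinType::Right | JoinType::Full)
    }

    /// Side that may receive a dynamic filter, following the rule in the notes above.
    fn dynamic_filter_side(self) -> JoinSide {
        match (self.preserves_left(), self.preserves_right()) {
            // Both sides preserved: filtering either input could drop required rows.
            (true, true) => JoinSide::None,
            // Exactly one side preserved: only the opposite side is eligible.
            (true, false) => JoinSide::Right,
            (false, true) => JoinSide::Left,
            // Neither side preserved: right by default, left for left semi/anti joins.
            (false, false) => match self {
                JoinType::LeftSemi | JoinType::LeftAnti => JoinSide::Left,
                _ => JoinSide::Right,
            },
        }
    }
}
```

Under these assumptions, `JoinType::Left.dynamic_filter_side()` evaluates to `JoinSide::Right`, and `JoinType::Full.dynamic_filter_side()` evaluates to `JoinSide::None`.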
Suggested reviewers / areas to focus review on
* `hash_join::exec.rs`: verify the logic selecting which side to attach dynamic filters and the new `gather_filters_for_pushdown_with_side` / `handle_child_pushdown_result_with_side` functions.
* `shared_bounds.rs` and `stream.rs`: correctness of bounds accumulation and partition synchronization across both sides.
* `core/tests/physical_optimizer/filter_pushdown`: ensure that the new helpers and snapshots represent the intended behavior and are not overly brittle.
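As a reference point for reviewing partition synchronization, the invariant that bounds are published only after every partition has reported can be sketched with the standard library alone. This is a simplified illustration, not the `SharedBoundsAccumulator` implementation: `Bounds`, `Report`, `BoundsBarrier`, and `bounds_of` are hypothetical names, and the join key is assumed to be a single `i32` column.

```rust
use std::sync::Mutex;

/// Min/max of a join key observed by one partition (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Bounds {
    min: i32,
    max: i32,
}

/// Outcome of a single partition report (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
enum Report {
    /// Other partitions have not reported yet; no filter update may be applied.
    Pending,
    /// Last partition reported; carries the merged bounds (None if all were empty).
    Complete(Option<Bounds>),
}

struct State {
    merged: Option<Bounds>,
    remaining: usize,
}

/// Hypothetical stand-in for a shared, partition-aware bounds accumulator.
struct BoundsBarrier {
    state: Mutex<State>,
}

/// Computes min/max over one partition's key values; None for an empty partition.
fn bounds_of(values: &[i32]) -> Option<Bounds> {
    let min = *values.iter().min()?;
    let max = *values.iter().max()?;
    Some(Bounds { min, max })
}

impl BoundsBarrier {
    fn new(partitions: usize) -> Self {
        Self {
            state: Mutex::new(State { merged: None, remaining: partitions }),
        }
    }

    /// Merges one partition's bounds and reports completion only on the final call.
    fn report(&self, bounds: Option<Bounds>) -> Report {
        let mut s = self.state.lock().unwrap();
        assert!(s.remaining > 0, "more reports than partitions");
        s.merged = match (s.merged, bounds) {
            (Some(a), Some(b)) => Some(Bounds {
                min: a.min.min(b.min),
                max: a.max.max(b.max),
            }),
            // At most one of the two is Some here; keep whichever exists.
            (a, b) => a.or(b),
        };
        s.remaining -= 1;
        if s.remaining == 0 {
            Report::Complete(s.merged)
        } else {
            Report::Pending
        }
    }
}
```

With three partitions, the first two calls to `report` return `Report::Pending` and only the third returns `Report::Complete` with the merged range, so a consumer never sees bounds computed from a subset of partitions.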