-
Notifications
You must be signed in to change notification settings - Fork 1.6k
fix(SubqueryAlias): use maybe_project_redundant_column #17478
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Seems like the CI flaked. Retrying... |
cc @LiaCastaneda @berkaysynnada for the review. Thanks! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This makes sense to me and is indeed cleaner than the fix from before! just left a few questions I had
let func_dependencies = plan.schema().functional_dependencies().clone(); | ||
|
||
let schema = DFSchema::from_unqualified_fields(fields, meta_data)?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here the join.schema() will have the :<N>
still no? since we are returning the schema from the top level projection ( if it was inserted ). I know this solves the issue + makes the optimizer pass work but I don't understand how the optimizer works with this fix if we are still keeping the :<N>
in the top level schema..
Projection: left.count(Int64(1)) AS count_first, left.category, left.count(Int64(1)):1 AS count_second, right.count(Int64(1)) AS count_third | ||
Left Join: left.id = right.id | ||
SubqueryAlias: left | ||
Projection: left.id, left.count(Int64(1)), left.id:1, left.category, right.id AS id:2, right.count(Int64(1)) AS count(Int64(1)):1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this solves my doubt above, this is not requalifying the column name but its adding the :<N>
as an alias, so I guess the optimizer sees the column names for resultion not the alias?
.collect::<Vec<_>>(); | ||
|
||
// Check if there is at least an alias | ||
let is_projection_needed = aliases.iter().any(Option::is_some); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
very nit: maybe we can keep a flag inside the iteration above? so we avoid this one
Projection: left.count(Int64(1)) AS count_first, left.category, left.count(Int64(1)):1 AS count_second, right.count(Int64(1)) AS count_third | ||
Left Join: left.id = right.id | ||
SubqueryAlias: left | ||
Projection: left.id, left.count(Int64(1)), left.id:1, left.category, right.id AS id:2, right.count(Int64(1)) AS count(Int64(1)):1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, I was looking into the logical optimizer and there is this rule to remove unnecessary projections, I don't think it will remove this one but maybe worth checking?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is an unnecessary projection as the aliases are needed for the correct semantic representation of left.count(Int64(1)):1
in the final projection which actually refers to right.count(Int64(1))
which gets aliased to left
.
Which issue does this PR close?
change_redundant_column
lossy approach breaks logical optimizer and physical planner #17405.Rationale for this change
When creating nested
SubqueryAlias
operations in complex joins, DataFusion was incorrectly handling column name conflicts by appending suffixes like:1
to duplicate column names. This caused the physical planner to fail with "Input field name {} does not match with the projection expression {}" errors, as the optimizer couldn't properly match columns with these modified names.The root cause was that the
SubqueryAlias
creation process was stripping qualification information and mixing columns from left and right sides of joins, leading to name collisions that were resolved by adding numeric suffixes. This approach lost important context needed for proper column resolution.What changes are included in this PR?
SubqueryAlias
with a projection-based solutionmaybe_project_redundant_column
function that creates explicit projections with aliases when needed, instead of modifying column names directlymaybe_fix_physical_column_name
function from the physical planner that was attempting to fix these naming issues downstreamSubqueryAlias::try_new
to use the new projection approach, preserving qualification information properlyAre these changes tested?
The changes include a new test case
subquery_alias_confusing_the_optimizer
that reproduces the original issue and verifies the fix works correctly. Note: The newly added functionmaybe_project_redundant_column
is missing comprehensive tests.Are there any user-facing changes?
No user-facing changes. This is an internal fix that resolves query planning errors for complex nested join scenarios without changing the public API or query behavior.