notfilippo
Contributor

Which issue does this PR close?

Rationale for this change

When creating nested SubqueryAlias operations in complex joins, DataFusion was incorrectly handling column name conflicts by appending suffixes like :1 to duplicate column names. This caused the physical planner to fail with "Input field name {} does not match with the projection expression {}" errors, as the optimizer couldn't properly match columns with these modified names.

The root cause was that the SubqueryAlias creation process was stripping qualification information and mixing columns from left and right sides of joins, leading to name collisions that were resolved by adding numeric suffixes. This approach lost important context needed for proper column resolution.

What changes are included in this PR?

  • Replaced the hacky column renaming approach in SubqueryAlias with a projection-based solution
  • Added maybe_project_redundant_column function that creates explicit projections with aliases when needed, instead of modifying column names directly
  • Removed the maybe_fix_physical_column_name function from the physical planner that was attempting to fix these naming issues downstream
  • Updated SubqueryAlias::try_new to use the new projection approach, preserving qualification information properly
  • Added test case demonstrating the fix for nested subquery alias scenarios
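To make the projection-based approach in the list above concrete, here is a minimal, self-contained sketch in plain Rust. It does not use DataFusion types: `QualifiedCol`, `alias_for_duplicates`, and `maybe_projection` are hypothetical names chosen for illustration, and the real `maybe_project_redundant_column` may pick suffixes differently. The sketch only shows the idea that a duplicate output name gets an explicit alias while the qualified source column is left untouched.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for a qualified column; this is not DataFusion's `Column`.
pub struct QualifiedCol<'a> {
    pub qualifier: &'a str,
    pub name: &'a str,
}

/// For each column, return `None` if its unqualified name is still free,
/// or `Some(alias)` with the first unused `name:N` suffix otherwise.
pub fn alias_for_duplicates(cols: &[QualifiedCol<'_>]) -> Vec<Option<String>> {
    let mut used: HashSet<String> = HashSet::new();
    cols.iter()
        .map(|c| {
            // `insert` returns true when the name was not present yet.
            if used.insert(c.name.to_string()) {
                return None;
            }
            let mut n = 1;
            loop {
                let candidate = format!("{}:{}", c.name, n);
                if used.insert(candidate.clone()) {
                    return Some(candidate);
                }
                n += 1;
            }
        })
        .collect()
}

/// Render the projection expressions, or `None` when no alias is required
/// and the input can be used as-is.
pub fn maybe_projection(cols: &[QualifiedCol<'_>]) -> Option<Vec<String>> {
    let aliases = alias_for_duplicates(cols);
    if aliases.iter().all(Option::is_none) {
        return None;
    }
    Some(
        cols.iter()
            .zip(aliases)
            .map(|(c, alias)| match alias {
                Some(a) => format!("{}.{} AS {}", c.qualifier, c.name, a),
                None => format!("{}.{}", c.qualifier, c.name),
            })
            .collect(),
    )
}
```

Because the alias lives in a projection expression rather than in a rewritten field name, the original qualifier (`right` in `right.id AS id:2`) remains visible to later planning stages.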

Are these changes tested?

The changes include a new test case subquery_alias_confusing_the_optimizer that reproduces the original issue and verifies the fix works correctly. Note: The newly added function maybe_project_redundant_column is missing comprehensive tests.

Are there any user-facing changes?

No user-facing changes. This is an internal fix that resolves query planning errors for complex nested join scenarios without changing the public API or query behavior.

@github-actions github-actions bot added the logical-expr (Logical plan and expressions) and core (Core DataFusion crate) labels Sep 8, 2025
@github-actions github-actions bot added the substrait (Changes to the substrait crate) label Sep 9, 2025
@notfilippo
Contributor Author

Seems like the CI flaked. Retrying...

@notfilippo
Contributor Author

cc @LiaCastaneda @berkaysynnada for the review. Thanks!

Contributor

@LiaCastaneda LiaCastaneda left a comment

This makes sense to me and is indeed cleaner than the previous fix! I just left a few questions.

let func_dependencies = plan.schema().functional_dependencies().clone();

let schema = DFSchema::from_unqualified_fields(fields, meta_data)?;
Contributor

Here the join.schema() will still have the :<N>, no? We are returning the schema from the top-level projection (if it was inserted). I know this solves the issue and makes the optimizer pass work, but I don't understand how the optimizer works with this fix if we are still keeping the :<N> in the top-level schema.

Projection: left.count(Int64(1)) AS count_first, left.category, left.count(Int64(1)):1 AS count_second, right.count(Int64(1)) AS count_third
Left Join: left.id = right.id
SubqueryAlias: left
Projection: left.id, left.count(Int64(1)), left.id:1, left.category, right.id AS id:2, right.count(Int64(1)) AS count(Int64(1)):1
Contributor

I think this resolves my doubt above: this is not requalifying the column name, it's adding the :<N> as an alias, so I guess the optimizer sees the column names for resolution, not the alias?

.collect::<Vec<_>>();

// Check if there is at least an alias
let is_projection_needed = aliases.iter().any(Option::is_some);
Contributor

Very minor nit: maybe we can keep a flag inside the iteration above, so we avoid this second pass?
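A rough sketch of what that nit could look like, written in plain Rust with no DataFusion types. The function name `collect_with_flag` is hypothetical, and the alias values are assumed to be `Option<String>` as the `Option::is_some` check suggests; the PR's actual iterator chain is not shown here and may differ.

```rust
/// Collect the aliases and, in the same pass, record whether at least one
/// alias is present, instead of scanning the vector a second time.
pub fn collect_with_flag<I>(aliases: I) -> (Vec<Option<String>>, bool)
where
    I: IntoIterator<Item = Option<String>>,
{
    let mut is_projection_needed = false;
    let collected: Vec<Option<String>> = aliases
        .into_iter()
        // `inspect` observes each item without consuming it.
        .inspect(|alias| is_projection_needed |= alias.is_some())
        .collect();
    (collected, is_projection_needed)
}
```

The separate `any(Option::is_some)` scan is short and stops at the first alias it finds, so the single-pass version is mostly a stylistic choice.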

Projection: left.count(Int64(1)) AS count_first, left.category, left.count(Int64(1)):1 AS count_second, right.count(Int64(1)) AS count_third
Left Join: left.id = right.id
SubqueryAlias: left
Projection: left.id, left.count(Int64(1)), left.id:1, left.category, right.id AS id:2, right.count(Int64(1)) AS count(Int64(1)):1
Contributor

Also, I was looking into the logical optimizer and there is a rule that removes unnecessary projections. I don't think it will remove this one, but it may be worth checking?

Contributor Author

I don't think this is an unnecessary projection: the aliases are needed for the correct semantic representation of left.count(Int64(1)):1 in the final projection, which actually refers to right.count(Int64(1)), which in turn gets aliased under left.
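To illustrate that point, here is a small standalone model in plain Rust, not DataFusion code. `visible_columns` is a hypothetical helper, and it assumes that a SubqueryAlias exposes each output name of its input under the alias qualifier. Under that assumption, the inner projection's aliases are the only link between the outer name and the original source column.

```rust
use std::collections::HashMap;

/// Map each name visible above `SubqueryAlias: <alias>` back to the source
/// expression it came from. `projected` holds (source expression, output name)
/// pairs taken from the projection directly below the alias.
pub fn visible_columns(alias: &str, projected: &[(&str, &str)]) -> HashMap<String, String> {
    projected
        .iter()
        .map(|(source, output)| (format!("{}.{}", alias, output), source.to_string()))
        .collect()
}
```

With the inner projection from the plan quoted in this thread, looking up `left.count(Int64(1)):1` in the resulting map yields `right.count(Int64(1))`. Dropping the projection would remove that mapping.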


Successfully merging this pull request may close these issues.

change_redundant_column lossy approach breaks logical optimizer and physical planner
2 participants