@Jefffrey commented on Sep 12, 2025

Which issue does this PR close?

Rationale for this change

So that the distinct aggregates avg_distinct and sum_distinct can be used through the DataFrame API.

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@github-actions bot added the documentation, core, and functions labels on Sep 12, 2025
Comment on lines +65 to +74
pub fn avg_distinct(expr: Expr) -> Expr {
    Expr::AggregateFunction(datafusion_expr::expr::AggregateFunction::new_udf(
        avg_udaf(),
        vec![expr],
        true,
        None,
        vec![],
        None,
    ))
}
Contributor Author

This follows the same pattern count already uses for its distinct variant:

pub fn count_distinct(expr: Expr) -> Expr {
    Expr::AggregateFunction(datafusion_expr::expr::AggregateFunction::new_udf(
        count_udaf(),
        vec![expr],
        true,
        None,
        vec![],
        None,
    ))
}
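
To illustrate the pattern in both snippets (the plain and distinct variants share one underlying aggregate and differ only in a boolean flag), here is a minimal, self-contained sketch. It does not use DataFusion: `AvgCall`, its fields, and `evaluate` are hypothetical stand-ins invented for this example, and the deduplicate-then-average behaviour is the usual meaning of a DISTINCT aggregate rather than a description of DataFusion's internals.

```rust
use std::collections::BTreeSet;

/// Hypothetical stand-in for an aggregate expression; NOT a DataFusion type.
#[derive(Debug, Clone, PartialEq)]
pub struct AvgCall {
    pub column: String,
    pub distinct: bool,
}

/// Plain average: the distinct flag is off.
pub fn avg(column: &str) -> AvgCall {
    AvgCall { column: column.to_string(), distinct: false }
}

/// Distinct average: same aggregate, distinct flag on.
pub fn avg_distinct(column: &str) -> AvgCall {
    AvgCall { column: column.to_string(), distinct: true }
}

impl AvgCall {
    /// Toy evaluation over integer values; returns None for empty input.
    pub fn evaluate(&self, values: &[i64]) -> Option<f64> {
        // When the flag is set, drop duplicate values before averaging.
        let kept: Vec<i64> = if self.distinct {
            values.iter().copied().collect::<BTreeSet<_>>().into_iter().collect()
        } else {
            values.to_vec()
        };
        if kept.is_empty() {
            return None;
        }
        let sum: i64 = kept.iter().sum();
        Some(sum as f64 / kept.len() as f64)
    }
}
```

With the input `[1, 1, 2, 3, 3, 3]`, the plain variant averages all six values, while the distinct variant averages only 1, 2 and 3.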

Comment on lines +504 to +511
min(col("c4")).alias("min(c4)"),
max(col("c4")).alias("max(c4)"),
avg(col("c4")).alias("avg(c4)"),
avg_distinct(col("c4")).alias("avg_distinct(c4)"),
sum(col("c4")).alias("sum(c4)"),
sum_distinct(col("c4")).alias("sum_distinct(c4)"),
count(col("c4")).alias("count(c4)"),
count_distinct(col("c4")).alias("count_distinct(c4)"),
Contributor Author

I switched the test column from c12 to c4 because avg_distinct on c12 showed small precision variations that made the test results inconsistent. Switching columns seemed easier than wrapping the outputs in round.
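
The comment does not say where the precision variations come from. One plausible source, assuming c12 holds floating-point values and the distinct values are not always accumulated in the same order, is that floating-point addition is not associative. The sketch below shows that effect in plain Rust, along with the rounding workaround the comment decided against. `sum_in_order`, `avg_in_order`, and `round_to` are hypothetical helpers written for this example, not DataFusion functions.

```rust
/// Sum `values` strictly left to right, in the order given.
pub fn sum_in_order(values: &[f64]) -> f64 {
    values.iter().fold(0.0, |acc, v| acc + v)
}

/// Average of `values`, built on the order-sensitive sum above.
/// Assumes `values` is non-empty.
pub fn avg_in_order(values: &[f64]) -> f64 {
    sum_in_order(values) / values.len() as f64
}

/// Round `value` to `places` decimal places (hypothetical stand-in for
/// applying a SQL-style round to the test output).
pub fn round_to(value: f64, places: i32) -> f64 {
    let factor = 10f64.powi(places);
    (value * factor).round() / factor
}
```

Summing `[0.1, 0.2, 0.3]` and `[0.3, 0.2, 0.1]` with `sum_in_order` gives two results that differ in the last bit, while `round_to(avg_in_order(..), 6)` agrees for both orders. Integer inputs do not have this problem, which fits the choice of a different column over rounding.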

Successfully merging this pull request may close these issues.

Introduce avg_distinct() function to dataframe
Introduce sum_distinct() function to dataframe
1 participant