Aggregation of empty grouped df should still generate columns #1537

Jolanrensen · 2025-10-31T13:46:49Z

The reason

emptyDf.groupBy { name }.aggregate {
    count() into "count"
}

or

emptyDf.groupBy { name }.count()

would not generate a count column, was because the aggregation was run once for each group and then concatenated. This means it was never called if the dataframe was empty and there were no groups.

This is unexpected behavior and causes runtime exceptions when accessing the result of the aggregation with the compiler plugin (it cannot tell whether a DF is empty or not).

I fixed it by calling the aggregate {} body once with an empty group if there are no groups. This generates all columns as expected even though there's no data inside.

…egation columns with test. Issue #1531

… `true`

AndreiKingsley

I believe we need to add some info about groupBy / aggregation behavior on empty dataframe in our documentation.

Jolanrensen · 2025-11-03T12:45:03Z

I believe we need to add some info about groupBy / aggregation behavior on empty dataframe in our documentation.

hmm maybe yes, but on the other hand, it's not very common and I believe this new behavior is what people will expect; if you aggregate an empty dataframe, your aggregators will have to operate on empty columns. Makes sense, right?

But we could mention it as part of #1535

koperagen · 2025-11-03T13:43:48Z

What if aggregate would throw exception if df is empty? "Fail fast"

Jolanrensen · 2025-11-03T19:00:27Z

@koperagen We could, but I think it's best to not throw an exception if we don't have to. If users create a pipeline for DataFrame<TheirType>, it should behave the same in terms of types regardless of the df being empty or not.
In this case, all statistical aggregation functions have orNull {} versions specifically for empty columns (or they work fine on empty ones, like sum() or count()), that way the user can handle empty dataframes gracefully if they expect them.

Plus, most functions, like max {}, already throw an exception if they get an empty column as input, suggesting that users use maxOrNull {} instead.

...come to think of it. Actually, using -OrNull overloads makes zero sense here

After running this, max doesn't have to be of type DataColumn<String?>, it can remain DataColumn<String>, it's just empty. The problem is that max { col } throws an exception when col is empty, so we can't execute it...

I need to think about this for a bit... Maybe throwing and exception earlier is better after all...

We could really use Rich Errors here! That way max {} could return ColumnIsEmptyError<T> | T and we could make an empty column of type T... When an exception is thrown, the other lines in the aggregate body are not executed so we don't see them at all :/

... actually, another way to look at it: we kinda already use "rich errors" for these statistics functions. We filter out all nulls from the input of columns before performing the operation, so if maxOrNull returns null, this is comparable to returning a ColumnIsEmptyError. This holds for all statistical operations. The compiler plugin would just need to be adjusted to know that "-OrNull overloads of statistics will not introduce nullability to the columns" :)

... I have no idea how, though, because users can also introduce nulls in aggregations whenever they please:
emptyGroupBy.aggregate { null into "nulls" }.nulls and then nulls should definitely be of type Nothing?.

koperagen · 2025-11-04T13:09:27Z

The compiler plugin would just need to be adjusted to know that "-OrNull overloads of statistics will not introduce nullability to the columns" :)

interesting!

Jolanrensen added 2 commits October 31, 2025 14:44

Made sure aggregating an empty grouped dataframe still generates aggr…

5068a6d

…egation columns with test. Issue #1531

Removed removeColumns argument from aggregateGroupBy, as it is always…

7a2de0a

… `true`

Jolanrensen requested review from Allex-Nik, AndreiKingsley and zaleslaw October 31, 2025 13:46

Jolanrensen added the bug Something isn't working label Oct 31, 2025

Jolanrensen added this to the 1.0.0-Beta4 milestone Oct 31, 2025

Jolanrensen requested a review from koperagen October 31, 2025 13:53

AndreiKingsley approved these changes Nov 3, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Aggregation of empty grouped df should still generate columns #1537

Aggregation of empty grouped df should still generate columns #1537

Uh oh!

Jolanrensen commented Oct 31, 2025 •

edited

Loading

Uh oh!

AndreiKingsley left a comment

Uh oh!

Jolanrensen commented Nov 3, 2025 •

edited

Loading

Uh oh!

koperagen commented Nov 3, 2025 •

edited

Loading

Uh oh!

Jolanrensen commented Nov 3, 2025 •

edited

Loading

Uh oh!

koperagen commented Nov 4, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Aggregation of empty grouped df should still generate columns #1537

Are you sure you want to change the base?

Aggregation of empty grouped df should still generate columns #1537

Uh oh!

Conversation

Jolanrensen commented Oct 31, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

AndreiKingsley left a comment

Choose a reason for hiding this comment

Uh oh!

Jolanrensen commented Nov 3, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

koperagen commented Nov 3, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Jolanrensen commented Nov 3, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

koperagen commented Nov 4, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Jolanrensen commented Oct 31, 2025 •

edited

Loading

Jolanrensen commented Nov 3, 2025 •

edited

Loading

koperagen commented Nov 3, 2025 •

edited

Loading

Jolanrensen commented Nov 3, 2025 •

edited

Loading