Skip to content

Conversation

@Jolanrensen
Copy link
Collaborator

@Jolanrensen Jolanrensen commented Oct 31, 2025

Fixes #1531

The reason

emptyDf.groupBy { name }.aggregate {
    count() into "count"
}

or

emptyDf.groupBy { name }.count()

would not generate a count column, was because the aggregation was run once for each group and then concatenated. This means it was never called if the dataframe was empty and there were no groups.

This is unexpected behavior and causes runtime exceptions when accessing the result of the aggregation with the compiler plugin (it cannot tell whether a DF is empty or not).

I fixed it by calling the aggregate {} body once with an empty group if there are no groups. This generates all columns as expected even though there's no data inside.

@Jolanrensen Jolanrensen added the bug Something isn't working label Oct 31, 2025
@Jolanrensen Jolanrensen added this to the 1.0.0-Beta4 milestone Oct 31, 2025
@Jolanrensen Jolanrensen requested a review from koperagen October 31, 2025 13:53
Copy link
Collaborator

@AndreiKingsley AndreiKingsley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe we need to add some info about groupBy / aggregation behavior on empty dataframe in our documentation.

@Jolanrensen
Copy link
Collaborator Author

Jolanrensen commented Nov 3, 2025

I believe we need to add some info about groupBy / aggregation behavior on empty dataframe in our documentation.

hmm maybe yes, but on the other hand, it's not very common and I believe this new behavior is what people will expect; if you aggregate an empty dataframe, your aggregators will have to operate on empty columns. Makes sense, right?

But we could mention it as part of #1535

@koperagen
Copy link
Collaborator

koperagen commented Nov 3, 2025

What if aggregate would throw exception if df is empty? "Fail fast"

@Jolanrensen
Copy link
Collaborator Author

Jolanrensen commented Nov 3, 2025

@koperagen We could, but I think it's best to not throw an exception if we don't have to. If users create a pipeline for DataFrame<TheirType>, it should behave the same in terms of types regardless of the df being empty or not.
In this case, all statistical aggregation functions have orNull {} versions specifically for empty columns (or they work fine on empty ones, like sum() or count()), that way the user can handle empty dataframes gracefully if they expect them.

Plus, most functions, like max {}, already throw an exception if they get an empty column as input, suggesting that users use maxOrNull {} instead.

...come to think of it. Actually, using -OrNull overloads makes zero sense here
image

After running this, max doesn't have to be of type DataColumn<String?>, it can remain DataColumn<String>, it's just empty. The problem is that max { col } throws an exception when col is empty, so we can't execute it...

I need to think about this for a bit... Maybe throwing and exception earlier is better after all...

We could really use Rich Errors here! That way max {} could return ColumnIsEmptyError<T> | T and we could make an empty column of type T... When an exception is thrown, the other lines in the aggregate body are not executed so we don't see them at all :/

... actually, another way to look at it: we kinda already use "rich errors" for these statistics functions. We filter out all nulls from the input of columns before performing the operation, so if maxOrNull returns null, this is comparable to returning a ColumnIsEmptyError. This holds for all statistical operations. The compiler plugin would just need to be adjusted to know that "-OrNull overloads of statistics will not introduce nullability to the columns" :)

... I have no idea how, though, because users can also introduce nulls in aggregations whenever they please:
emptyGroupBy.aggregate { null into "nulls" }.nulls and then nulls should definitely be of type Nothing?.

@koperagen
Copy link
Collaborator

The compiler plugin would just need to be adjusted to know that "-OrNull overloads of statistics will not introduce nullability to the columns" :)

interesting!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Aggregating an empty DF does not make aggregation columns appear

4 participants