Aggregation of empty grouped df should still generate columns #1537
…egation columns with test. Issue #1531
I believe we need to add some info about groupBy / aggregation behavior on empty dataframe in our documentation.
hmm maybe yes, but on the other hand, it's not very common and I believe this new behavior is what people will expect; if you aggregate an empty dataframe, your aggregators will have to operate on empty columns. Makes sense, right? But we could mention it as part of #1535
What if aggregate would throw exception if df is empty? "Fail fast"
@koperagen We could, but I think it's best to not throw an exception if we don't have to. If users create a pipeline for Plus, most functions, like ...come to think of it. Actually, using After running this, I need to think about this for a bit... Maybe throwing an exception earlier is better after all... We could really use Rich Errors here! That way ... actually, another way to look at it: we kinda already use "rich errors" for these statistics functions. We filter out all nulls from the input of columns before performing the operation, so if ... I have no idea how, though, because users can also introduce
interesting!

Fixes #1531
The reason `emptyDf.groupBy { name }.aggregate { count() into "count" }` or `emptyDf.groupBy { name }.count()` would not generate a `count` column was that the aggregation was run once for each group and the results were then concatenated. This means it was never called if the dataframe was empty and there were no groups. This is unexpected behavior and causes runtime exceptions when accessing the result of the aggregation with the compiler plugin (it cannot tell whether a DF is empty or not).
I fixed it by calling the `aggregate {}` body once with an empty group if there are no groups. This generates all columns as expected even though there's no data inside.
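To illustrate the idea outside the library, here is a minimal plain-Kotlin sketch. It is not the actual Kotlin DataFrame implementation: `Row`, `Table`, and `groupAndAggregate` are hypothetical names made up for this example, and rows are modeled as simple maps. It only shows why per-group aggregation yields no columns for empty input, and how one extra run on an empty group recovers them.

```kotlin
// Hypothetical, simplified model of grouped aggregation (not the library's code).
typealias Row = Map<String, Any?>

data class Table(val columns: List<String>, val rows: List<Row>)

fun groupAndAggregate(
    rows: List<Row>,
    key: String,
    aggregate: (List<Row>) -> Map<String, Any?>,
): Table {
    val groups = rows.groupBy { it[key] }
    if (groups.isEmpty()) {
        // No groups: run the aggregator once on an empty group purely to
        // discover which columns it produces; the values are discarded.
        val aggregatedColumns = aggregate(emptyList()).keys
        return Table(listOf(key) + aggregatedColumns, emptyList())
    }
    // Usual path: aggregate each group, then concatenate the per-group rows.
    val resultRows = groups.map { (k, groupRows) -> mapOf(key to k) + aggregate(groupRows) }
    return Table(resultRows.first().keys.toList(), resultRows)
}

fun main() {
    val countAgg: (List<Row>) -> Map<String, Any?> = { group -> mapOf("count" to group.size) }

    val empty = groupAndAggregate(emptyList(), "name", countAgg)
    println(empty.columns) // [name, count]
    println(empty.rows)    // []

    val data = listOf(mapOf("name" to "a"), mapOf("name" to "a"), mapOf("name" to "b"))
    println(groupAndAggregate(data, "name", countAgg).rows) // [{name=a, count=2}, {name=b, count=1}]
}
```

Without the `groups.isEmpty()` branch, the aggregator would never run for empty input and the result would have no `count` column at all, which is the situation described above. A consequence of this approach, as noted in the discussion, is that aggregators must be able to operate on empty input.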