-
Notifications
You must be signed in to change notification settings - Fork 36
Polars Course Missing Data Notebook #96
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great first notebook contrib! Left some comments as part of the review.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left some comments mainly for stacking & labelling of outputs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Taking a look, I feel like there are some more things we may want to mention to make it more complete so left some comments, but most importantly: For me the NaN explanations read like "NaN is the same as null, just used in its place for floats" which is incorrect in multiple ways.
If you think it would take too much space to explain it better, it might be better to only talk about null
and tell the readers to read the official explanation if they must work with NaN
You may also want to mention the nulls_equal
keyword argument on df.join
, explaining the default behavior and how it changes using it, and perhaps mention how null values are treated on aggregations (both as part of the column(s) to group by as well as the values being aggregated)
polars/11_missing_data.py
Outdated
|
||
@app.cell | ||
def _(mo): | ||
mo.md("""Polars datatype specific features for missing data are NaN(for float values) and null(for everything else).""") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
null
and NaN
are completely different things in polars, Not a Number is not a missing value in polars, it is a float value like 0.0
or inf
There are some methods dedicated specially to dealing with it like expr.is_nan()
(similar to but distinct from expr.is_null()
), and you can have both NaN and nulls in float columns.
That confusion might arise from the way NaN is used by other libraries like numpy and polars?
Co-authored-by: Srihari Thyagarajan <[email protected]>
Co-authored-by: Srihari Thyagarajan <[email protected]>
Co-authored-by: Srihari Thyagarajan <[email protected]>
@folicks Hey Felix, let me know when I can review this PR (no rush). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you mixed these up?
(pl.col("fruits") == pl.col("fruits")).alias("A_eq_B")
pl.col("score").eq_missing(pl.col("B")).alias("A_eq_missing_B")
Maybe try to use a slightly more realistic dataset rather than having columns literally called A and B
edit; oh wait I didn't realize it was set to Draft, never mind
|
||
# Compare using == operator (same as .eq()) | ||
eq_result = df.with_columns([ | ||
(pl.col("fruits") == pl.col("fruits")).alias("A_eq_B") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That alias is wrong or at at least extremely misleading
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@folicks;
Agree with @etrotta here;
You're comparing fruits
to itself but labeling it A_eq_B
, then comparing score
to B
but calling it A_eq_missing_B
. Also doesn't show the diff b/w ==
and eq_missing()
.
I think something like works:
# Compare fruits column to itself
fruits_eq = df.with_columns([
(pl.col("fruits") == pl.col("fruits")).alias("fruits_eq_self")
])
# Compare with eq_missing - shows true for null == null
fruits_eq_missing = df.with_columns([
pl.col("fruits").eq_missing(pl.col("fruits")).alias("fruits_eq_missing_self")
])
) | ||
return | ||
|
||
|
||
@app.cell | ||
def _(df, pl): | ||
def _(df, mo, pl): | ||
mean_age = df.select(pl.col("age").mean()).item() | ||
imputed = df.with_columns( | ||
pl.col("age").fill_null(mean_age).alias("age_imputed") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could use pl.col("age").fill_null(pl.col("age").mean())
directly, or even pl.col("age").fill_null(strategy="mean")
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for addressing the past comments/feedback, appreciate it 🎉; would recommend @etrotta's comments and some minor nits mention in the latest review.
r""" | ||
# Handling Missing Data in Polars | ||
|
||
_By Felix Najera (https://github.com/folicks)._ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_By Felix Najera (https://github.com/folicks)._ | |
_By [Felix Najera](https://github.com/folicks)._ |
def _(mo): | ||
mo.md( | ||
r""" | ||
/// details | Disclamier |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// details | Disclamier | |
/// details | Disclaimer |
|
||
# Compare using == operator (same as .eq()) | ||
eq_result = df.with_columns([ | ||
(pl.col("fruits") == pl.col("fruits")).alias("A_eq_B") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@folicks;
Agree with @etrotta here;
You're comparing fruits
to itself but labeling it A_eq_B
, then comparing score
to B
but calling it A_eq_missing_B
. Also doesn't show the diff b/w ==
and eq_missing()
.
I think something like works:
# Compare fruits column to itself
fruits_eq = df.with_columns([
(pl.col("fruits") == pl.col("fruits")).alias("fruits_eq_self")
])
# Compare with eq_missing - shows true for null == null
fruits_eq_missing = df.with_columns([
pl.col("fruits").eq_missing(pl.col("fruits")).alias("fruits_eq_missing_self")
])
|
||
@app.cell(hide_code=True) | ||
def _(mo): | ||
mo.md(r"""Above we have a dataframe containing `null` is used to indicate missing data in any data types. For the purposes of this guide we won't mention all pelicularities that may come from alternative dataframes found in other packages such as Pandas.""") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mo.md(r"""Above we have a dataframe containing `null` is used to indicate missing data in any data types. For the purposes of this guide we won't mention all pelicularities that may come from alternative dataframes found in other packages such as Pandas.""") | |
mo.md(r"""Above we have a dataframe containing `null` is used to indicate missing data in any data types. For the purposes of this guide we won't mention all peculiarities that may come from alternative dataframes found in other packages such as Pandas.""") |
@folicks Just dropping by to follow-up on if you had a chance to look at this PR again (& the review comments posted above). |
📝 Summary
https://github.com//issues/40📋 Checklist
--sandbox
README.md