Skip to content

Conversation

folicks
Copy link

@folicks folicks commented May 1, 2025

📝 Summary

https://github.com//issues/40

📋 Checklist

  • I have included package dependencies in the notebook file using --sandbox
  • If adding a course, include a README.md
  • Keep language direct and simple.

Copy link
Collaborator

@Haleshot Haleshot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great first notebook contrib! Left some comments as part of the review.

@Haleshot
Copy link
Collaborator

Haleshot commented May 7, 2025

I'd also rephrase your above PR description (on GitHub) to reflect:

  • A summary of the notebook and its contents (you can refer to other PRs for reference eg: this and this).
  • There is no README being added here, so I'd uncheck the box.

Copy link
Collaborator

@Haleshot Haleshot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some comments mainly for stacking & labelling of outputs.

Copy link
Contributor

@etrotta etrotta left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Taking a look, I feel like there are some more things we may want to mention to make it more complete so left some comments, but most importantly: For me the NaN explanations read like "NaN is the same as null, just used in its place for floats" which is incorrect in multiple ways.

If you think it would take too much space to explain it better, it might be better to only talk about null and tell the readers to read the official explanation if they must work with NaN

You may also want to mention the nulls_equal keyword argument on df.join, explaining the default behavior and how it changes using it, and perhaps mention how null values are treated on aggregations (both as part of the column(s) to group by as well as the values being aggregated)


@app.cell
def _(mo):
mo.md("""Polars datatype specific features for missing data are NaN(for float values) and null(for everything else).""")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

null and NaN are completely different things in polars, Not a Number is not a missing value in polars, it is a float value like 0.0 or inf
There are some methods dedicated specially to dealing with it like expr.is_nan() (similar to but distinct from expr.is_null()), and you can have both NaN and nulls in float columns.

That confusion might arise from the way NaN is used by other libraries like numpy and polars?

folicks and others added 3 commits May 13, 2025 17:49
Co-authored-by: Srihari Thyagarajan <[email protected]>
Co-authored-by: Srihari Thyagarajan <[email protected]>
@folicks folicks marked this pull request as draft May 28, 2025 03:11
@folicks folicks marked this pull request as ready for review May 28, 2025 03:16
@folicks folicks marked this pull request as draft May 28, 2025 03:17
@Haleshot
Copy link
Collaborator

@folicks Hey Felix, let me know when I can review this PR (no rush).

Copy link
Contributor

@etrotta etrotta left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you mixed these up?

(pl.col("fruits") == pl.col("fruits")).alias("A_eq_B")
pl.col("score").eq_missing(pl.col("B")).alias("A_eq_missing_B")

Maybe try to use a slightly more realistic dataset rather than having columns literally called A and B

edit; oh wait I didn't realize it was set to Draft, never mind


# Compare using == operator (same as .eq())
eq_result = df.with_columns([
(pl.col("fruits") == pl.col("fruits")).alias("A_eq_B")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That alias is wrong or at at least extremely misleading

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@folicks;
Agree with @etrotta here;

You're comparing fruits to itself but labeling it A_eq_B, then comparing score to B but calling it A_eq_missing_B. Also doesn't show the diff b/w == and eq_missing().

I think something like works:

# Compare fruits column to itself
fruits_eq = df.with_columns([
    (pl.col("fruits") == pl.col("fruits")).alias("fruits_eq_self")
])

# Compare with eq_missing - shows true for null == null  
fruits_eq_missing = df.with_columns([
    pl.col("fruits").eq_missing(pl.col("fruits")).alias("fruits_eq_missing_self")
])

)
return


@app.cell
def _(df, pl):
def _(df, mo, pl):
mean_age = df.select(pl.col("age").mean()).item()
imputed = df.with_columns(
pl.col("age").fill_null(mean_age).alias("age_imputed")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could use pl.col("age").fill_null(pl.col("age").mean()) directly, or even pl.col("age").fill_null(strategy="mean")

@folicks
Copy link
Author

folicks commented May 29, 2025

@folicks Hey Felix, let me know when I can review this PR (no rush). yes I think so

Copy link
Collaborator

@Haleshot Haleshot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for addressing the past comments/feedback, appreciate it 🎉; would recommend @etrotta's comments and some minor nits mention in the latest review.

r"""
# Handling Missing Data in Polars

_By Felix Najera (https://github.com/folicks)._
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
_By Felix Najera (https://github.com/folicks)._
_By [Felix Najera](https://github.com/folicks)._

def _(mo):
mo.md(
r"""
/// details | Disclamier
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// details | Disclamier
/// details | Disclaimer


# Compare using == operator (same as .eq())
eq_result = df.with_columns([
(pl.col("fruits") == pl.col("fruits")).alias("A_eq_B")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@folicks;
Agree with @etrotta here;

You're comparing fruits to itself but labeling it A_eq_B, then comparing score to B but calling it A_eq_missing_B. Also doesn't show the diff b/w == and eq_missing().

I think something like works:

# Compare fruits column to itself
fruits_eq = df.with_columns([
    (pl.col("fruits") == pl.col("fruits")).alias("fruits_eq_self")
])

# Compare with eq_missing - shows true for null == null  
fruits_eq_missing = df.with_columns([
    pl.col("fruits").eq_missing(pl.col("fruits")).alias("fruits_eq_missing_self")
])


@app.cell(hide_code=True)
def _(mo):
mo.md(r"""Above we have a dataframe containing `null` is used to indicate missing data in any data types. For the purposes of this guide we won't mention all pelicularities that may come from alternative dataframes found in other packages such as Pandas.""")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
mo.md(r"""Above we have a dataframe containing `null` is used to indicate missing data in any data types. For the purposes of this guide we won't mention all pelicularities that may come from alternative dataframes found in other packages such as Pandas.""")
mo.md(r"""Above we have a dataframe containing `null` is used to indicate missing data in any data types. For the purposes of this guide we won't mention all peculiarities that may come from alternative dataframes found in other packages such as Pandas.""")

@Haleshot
Copy link
Collaborator

@folicks Just dropping by to follow-up on if you had a chance to look at this PR again (& the review comments posted above).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants