fix: scalar reductions on empty inputs #1715

Merged: 3 commits, Jan 4, 2025

camriddell
Contributor

What type of PR is this? (check all applicable)

  • 💾 Refactor
  • ✨ Feature
  • 🐛 Bug Fix
  • 🔧 Optimization
  • 📝 Documentation
  • ✅ Test
  • 🐳 Other

Related issues

Checklist

  • Code follows style guide (ruff)
  • Tests added
  • Documented the changes

If you have comments or can explain your changes, please do so below

When performing a scalar reduction on an empty DataFrame, Polars, pandas, and PyArrow produced different results with default options. For example, when computing the sum of an empty column (via Narwhals):

  • Polars returns a value of 0
  • pandas raises a ValueError when the Series is reconstructed (the reduction itself also produces 0 as the scalar)
  • PyArrow returns null (the default scalar reduction options require at least 1 observed value to produce a non-null result)

pandas & PyArrow backends now produce output that is consistent with Polars.

Furthermore, the result differs between .select and .with_columns: the former may return a value even if the input was empty, whereas the latter returns an empty DataFrame because its input was empty. Both the Polars and PyArrow backends already exhibited this behavior, so pandas now does the same.
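
The select versus with_columns distinction can be illustrated in plain pandas. This is a hedged sketch of the semantics only, not of how the Narwhals pandas backend implements them. `selected` and `with_cols` are illustrative names, and `DataFrame.assign` stands in for with_columns.

```python
import pandas as pd

df = pd.DataFrame({"name": ["John", "Doe", "Jane"], "value": [10, 5, 20]})
empty = df[df["name"] == "Boo"]  # no rows match, so the frame is empty

total = empty["value"].sum()  # the scalar reduction yields 0

# select-style: the reduction stands on its own, so it becomes a one-row frame.
selected = pd.DataFrame({"value": [total]})

# with_columns-style: the scalar is broadcast to the existing rows,
# and there are none, so the result stays empty.
with_cols = empty.assign(res=total)

print(selected.shape)   # (1, 1)
print(with_cols.shape)  # (0, 3)
```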


See the following example for how the behavior of .sum was changed.

import narwhals as nw
import pandas as pd
import polars as pl
import pyarrow as pa
from itertools import product

@nw.narwhalify
def nw_select(df):
    return (
        df
        .filter(nw.col("name") == "Boo")
        .select(nw.col("value").sum())
    )

@nw.narwhalify
def nw_with_columns(df):
    return (
        df
        .filter(nw.col("name") == "Boo")
        .with_columns(res=nw.col("value").sum())
    )

pl_df = pl.DataFrame(
    {
        "name": ["John", "Doe", "Jane"],
        "value": [10, 5, 20],
    },
)
pd_df = pl_df.to_pandas()
pa_df = pa.Table.from_pandas(pd_df)


for func, df in product([nw_select, nw_with_columns], [pl_df, pd_df, pa_df]):
    print(f' {func.__name__} → {(type(df).__module__.split(".")[0])} '.center(50, '\N{box drawings light horizontal}'))
    try:
        print(func(df))
    except ValueError as e:
        print(e)
    print()

Old behavior

  • pandas errors when reconstructing a series from a scalar (index is empty, but passed data is not)
  • pyarrow returns null where Polars returns a value
─────────────── nw_select → polars ───────────────
shape: (1, 1)
┌───────┐
│ value │
│ ---   │
│ i64   │
╞═══════╡
│ 0     │
└───────┘

─────────────── nw_select → pandas ───────────────
Length of values (1) does not match length of index (0)

────────────── nw_select → pyarrow ───────────────
pyarrow.Table
value: int64
----
value: [[null]]

──────────── nw_with_columns → polars ────────────
shape: (0, 3)
┌──────┬───────┬─────┐
│ name ┆ value ┆ res │
│ ---  ┆ ---   ┆ --- │
│ str  ┆ i64   ┆ i64 │
╞══════╪═══════╪═════╡
└──────┴───────┴─────┘

──────────── nw_with_columns → pandas ────────────
Length of values (1) does not match length of index (0)

─────────── nw_with_columns → pyarrow ────────────
pyarrow.Table
name: string
value: int64
res: null
----
name: []
value: []
res: [0 nulls]

New behavior

  • select operations will produce a consistent scalar output even if the input was empty
    • all → True
    • any → False
    • sum → 0
    • max → Null or NaN if non-nullable
    • min → Null or NaN if non-nullable
    • mean → Null or NaN if non-nullable
  • with_columns returns empty if the input was empty
─────────────── nw_select → polars ───────────────
shape: (1, 1)
┌───────┐
│ value │
│ ---   │
│ i64   │
╞═══════╡
│ 0     │
└───────┘

─────────────── nw_select → pandas ───────────────
   value
0      0

────────────── nw_select → pyarrow ───────────────
pyarrow.Table
value: int64
----
value: [[0]]

──────────── nw_with_columns → polars ────────────
shape: (0, 3)
┌──────┬───────┬─────┐
│ name ┆ value ┆ res │
│ ---  ┆ ---   ┆ --- │
│ str  ┆ i64   ┆ i64 │
╞══════╪═══════╪═════╡
└──────┴───────┴─────┘

──────────── nw_with_columns → pandas ────────────
Empty DataFrame
Columns: [name, value, res]
Index: []

─────────── nw_with_columns → pyarrow ────────────
pyarrow.Table
name: string
value: int64
res: null
----
name: []
value: []
res: [0 nulls]
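
The scalar values listed under New behavior follow the usual convention for reducing an empty collection: all, any, and sum have identity elements, while max, min, and mean have no defined value. As a rough analogy only, the same pattern shows up in Python's built-ins. The sketch below uses the standard library, not Narwhals, and `None` stands in for Null.

```python
from statistics import StatisticsError, mean

empty = []

results = {
    "all": all(empty),                # True: no element is falsy
    "any": any(empty),                # False: no element is truthy
    "sum": sum(empty),                # 0: the additive identity
    "max": max(empty, default=None),  # no maximum exists, so fall back to None
    "min": min(empty, default=None),  # no minimum exists, so fall back to None
}

# statistics.mean refuses empty input; map that to None as a null stand-in.
try:
    results["mean"] = mean(empty)
except StatisticsError:
    results["mean"] = None

print(results)
```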

Member

@FBruzzesi FBruzzesi left a comment

Nice one! Thanks for the fix @camriddell 🚀

@MarcoGorelli
Member

thanks @camriddell and @FBruzzesi for review!

@MarcoGorelli MarcoGorelli merged commit ab8f515 into narwhals-dev:main Jan 4, 2025
24 checks passed
Successfully merging this pull request may close these issues.

[Bug]: Summing empty Pandas DataFrame