Add drop_nulls in cudf-polars #16290

Merged
30 changes: 29 additions & 1 deletion python/cudf_polars/cudf_polars/dsl/expr.py
@@ -867,7 +867,14 @@ def __init__(
self.name = name
self.options = options
self.children = children
if self.name not in ("mask_nans", "round", "setsorted", "unique"):
if self.name not in (
"mask_nans",
"round",
"setsorted",
"unique",
"dropnull",
"fill_null",
Comment on lines +890 to +891

Contributor: Why is dropnull without an underscore while fill_null has one... seems like something to request alignment on in polars.

Contributor Author: Agreed.

Contributor Author: This can be fixed when we update the expression node on the rust side, which it seems like we need to do anyways.

Contributor: Just to note that if we do that we need to map the old name to the new name until such time as we drop support for the older versions (probably do this in translate.py).
):
raise NotImplementedError(f"Unary function {name=}")

def do_evaluate(
@@ -953,6 +960,27 @@ def do_evaluate(
order=order,
null_order=null_order,
)
elif self.name == "dropnull":
(column,) = (
child.evaluate(df, context=context, mapping=mapping)
for child in self.children
)
return Column(
plc.stream_compaction.drop_nulls(
plc.Table([column.obj]), [0], 1
).columns()[0]
)
elif self.name == "fill_null":
column = self.children[0].evaluate(df, context=context, mapping=mapping)
if isinstance(self.children[1], Literal):
arg = plc.interop.from_arrow(self.children[1].value)
else:
evaluated = self.children[1].evaluate(
df, context=context, mapping=mapping
)
arg = evaluated.obj_scalar if evaluated.is_scalar else evaluated.obj
return Column(plc.replace.replace_nulls(column.obj, arg))

raise NotImplementedError(
f"Unimplemented unary function {self.name=}"
) # pragma: no cover; init trips first
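Setting the pylibcudf calls aside, the semantics of the two new branches can be modelled in plain Python, with `None` standing in for nulls. This is an illustration of the intended behavior only, not the GPU-backed implementation above:

```python
# Simplified model of the two new unary functions. None plays the null;
# real columns go through pylibcudf, this is just the row-level semantics.

def drop_nulls(column):
    """Mirror of the "dropnull" branch: keep only the non-null rows."""
    return [v for v in column if v is not None]


def fill_null(column, arg):
    """Mirror of the "fill_null" branch: replace nulls with a scalar,
    or element-wise with values from another column."""
    if isinstance(arg, list):  # column argument: replace row by row
        return [f if v is None else v for v, f in zip(column, arg)]
    return [arg if v is None else v for v in column]


print(drop_nulls([1, 2, None, 3]))         # [1, 2, 3]
print(fill_null([1, None, 3], 0))          # [1, 0, 3]
print(fill_null([1, None, 3], [7, 8, 9]))  # [1, 8, 3]
```

Note that `drop_nulls` shortens the column while `fill_null` preserves its length, which is why the former goes through `stream_compaction` and the latter through `replace_nulls` in the diff.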
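The name-mapping fallback suggested in the review thread (kept until support for older polars versions is dropped) could be a small shim in translate.py. The following sketch is hypothetical: both the mapping contents and `drop_nulls` as the eventual new name are assumptions, not code from this PR:

```python
# Hypothetical compatibility shim: map legacy unary-function names
# emitted by older polars versions onto their renamed successors.
# The dict contents and "drop_nulls" as the new name are assumptions.
_LEGACY_UNARY_NAMES = {
    "dropnull": "drop_nulls",
}


def normalize_unary_name(name: str) -> str:
    """Return the canonical name, passing unknown names through unchanged."""
    return _LEGACY_UNARY_NAMES.get(name, name)


print(normalize_unary_name("dropnull"))   # drop_nulls
print(normalize_unary_name("fill_null"))  # fill_null
```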
56 changes: 56 additions & 0 deletions python/cudf_polars/tests/test_drop_nulls.py
@@ -0,0 +1,56 @@
# SPDX-FileCopyrightText: Copyright (c) 2024 NVIDIA CORPORATION & AFFILIATES.
# SPDX-License-Identifier: Apache-2.0
from __future__ import annotations

import pytest

import polars as pl

from cudf_polars.testing.asserts import (
assert_gpu_result_equal,
assert_ir_translation_raises,
)


@pytest.fixture(
params=[
[1, 2, 1, 3, 5, None, None],
[1.5, 2.5, None, 1.5, 3, float("nan"), 3],
[],
[None, None],
Comment on lines +19 to +20

Contributor Author: These two tests fail for me. Looks like there's some issue constructing empty columns that I'm looking into.

Contributor: If there is no dtype, polars will create a column which has dtype arrow::null. We convert that to a cudf::type_id::EMPTY column type, but AFAICT libcudf barfs on basically all operations on EMPTY columns, so I complain on construction.

Contributor Author: I see, I can get around this issue in this PR by making sure that the data is typed before I try dropping nulls. For a moment I wondered if this would create any UX issues. In particular, I think this means that if I create a dataframe with an untyped list of nulls [None, None] or even an empty list [], it disables collecting the frame and, I think, any operations that materialize that column:

    >>> df = pl.DataFrame({'a': [None, None], 'b': [1, 2]}).lazy()
    >>> df.collect(post_opt_callback=partial(execute_with_cudf, raise_on_fail=True))  # error

IIUC the presence of an untyped null column in the data means we'll end up falling back to CPU for ops that would be permissible otherwise. Do we care about that?

Contributor: No, because you can't do almost anything with an EMPTY column in libcudf. If this turns out to be problematic we can change it later.

Contributor: What happened with this discussion? It looks like this test is passing now. Is this behavior already the default somewhere?

Contributor Author: We now fall back to Float64 if a column has the empty dtype, at least for these tests.
[1, 2, 3, 4, 5],
]
)
def null_data(request):
is_empty = pl.Series(request.param).dtype == pl.Null
return pl.DataFrame(
{
"a": pl.Series(request.param, dtype=pl.Float64 if is_empty else None),
"b": pl.Series(request.param, dtype=pl.Float64 if is_empty else None),
}
).lazy()
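The fixture's Float64 fallback can be modelled without polars: a Python list maps to the `pl.Null` dtype exactly when it is empty or all-`None`, in which case the fixture pins the dtype. The helper below is a stdlib stand-in for that check, not part of the test suite:

```python
# Stand-in for the fixture's dtype fallback: a plain list corresponds
# to the Null dtype when it is empty or contains only None, and in that
# case we pin Float64 instead of letting dtype inference produce EMPTY.
def choose_dtype(values):
    is_null_typed = all(v is None for v in values)  # True for [] as well
    return "Float64" if is_null_typed else None


print(choose_dtype([]))            # Float64
print(choose_dtype([None, None]))  # Float64
print(choose_dtype([1, 2, None]))  # None (let polars infer)
```

Note that `all()` over an empty list is `True`, so the empty-list case is covered by the same predicate.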


def test_drop_null(null_data):
q = null_data.select(pl.col("a").drop_nulls())
assert_gpu_result_equal(q)


@pytest.mark.parametrize(
"value",
[0, pl.col("a").mean(), pl.col("b")],
ids=["scalar", "aggregation", "column_expression"],
)
def test_fill_null(null_data, value):
q = null_data.select(pl.col("a").fill_null(value))
assert_gpu_result_equal(q)


@pytest.mark.parametrize(
"strategy", ["forward", "backward", "min", "max", "mean", "zero", "one"]
)
def test_fill_null_with_strategy(null_data, strategy):
q = null_data.select(pl.col("a").fill_null(strategy=strategy))
Contributor Author: I think we might not be passing the necessary options across the rust-python boundary again here:

https://github.com/pola-rs/polars/blob/6f8b4785c1b373b7740d9865671d624af325f9ed/py-polars/src/lazyframe/visitor/expr_nodes.rs#L1073

This doesn't seem to affect strategy, because using that kwarg apparently results in a different expression node in rust (FillNullWithStrategy), but we'd have to expose this to selectively fall back when limit is passed, rather than the current state, which I think is ignoring it.

Perhaps we can come up with a quick polars patch to fix this and tack it on to pola-rs/polars#17702 before it goes in.

cc @wence-

Contributor Author: Actually, the current state may be sufficient, since limit requires strategy={"forward", "backward"}, so we get a FillNullWithStrategy either way, which falls back.


# Not yet exposed to python from rust
assert_ir_translation_raises(q, NotImplementedError)
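Until FillNullWithStrategy crosses the rust-python boundary, these strategies fall back to CPU. For reference, the forward and backward strategies the test exercises can be sketched in plain Python (an illustration of the semantics, not the polars implementation):

```python
# Plain-Python sketch of two fill_null strategies; None plays the null.
def fill_forward(column):
    """Propagate the most recent non-null value forward."""
    out, last = [], None
    for v in column:
        if v is not None:
            last = v
        out.append(last)
    return out


def fill_backward(column):
    """Propagate the next non-null value backward: forward-fill the
    reversed column, then reverse the result."""
    return fill_forward(column[::-1])[::-1]


print(fill_forward([None, 1, None, 2, None]))   # [None, 1, 1, 2, 2]
print(fill_backward([None, 1, None, 2, None]))  # [1, 1, 2, 2, None]
```

Leading nulls survive a forward fill and trailing nulls survive a backward fill, which is why polars pairs the `limit` kwarg with exactly these two strategies.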