Add `drop_nulls` in cudf-polars (#16290)
New test file (+56 lines). The license header, imports, and the `null_data` fixture:

```python
# SPDX-FileCopyrightText: Copyright (c) 2024 NVIDIA CORPORATION & AFFILIATES.
# SPDX-License-Identifier: Apache-2.0
from __future__ import annotations

import pytest

import polars as pl

from cudf_polars.testing.asserts import (
    assert_gpu_result_equal,
    assert_ir_translation_raises,
)


@pytest.fixture(
    params=[
        [1, 2, 1, 3, 5, None, None],
        [1.5, 2.5, None, 1.5, 3, float("nan"), 3],
        [],
        [None, None],
        [1, 2, 3, 4, 5],
    ]
)
def null_data(request):
    is_empty = pl.Series(request.param).dtype == pl.Null
    return pl.DataFrame(
        {
            "a": pl.Series(request.param, dtype=pl.Float64 if is_empty else None),
            "b": pl.Series(request.param, dtype=pl.Float64 if is_empty else None),
        }
    ).lazy()
```

Review thread on the `[]` and `[None, None]` fixture params (lines +19 to +20):

- Reviewer: These two tests fail for me. Looks like there are some issues constructing empty columns that I'm looking into.
- Reviewer: If there is no dtype, polars will create a column which has dtype `Null`.
- Author: I see, I can get around this issue in this PR by making sure that the data is typed before I try dropping nulls. For a moment I wondered whether this would create any UX issues. In particular, I think this means that if I create a dataframe with an untyped list of nulls:

  ```python
  >>> df = pl.DataFrame({'a': [None, None], 'b': [1, 2]}).lazy()
  >>> df.collect(post_opt_callback=partial(execute_with_cudf, raise_on_fail=True))  # error
  ```

  IIUC, the presence of an untyped null column in the data means we'll end up falling back to CPU for ops that would be permissible otherwise. Do we care about that?
- Reviewer: No, because you can't do almost anything with an EMPTY column in libcudf. If this turns out to be problematic we can change it later.
- Reviewer: What happened with this discussion? It looks like this test is passing now. Is this behavior already the default somewhere?
- Author: We now fall back to […]
The tests:

```python
def test_drop_null(null_data):
    q = null_data.select(pl.col("a").drop_nulls())
    assert_gpu_result_equal(q)


@pytest.mark.parametrize(
    "value",
    [0, pl.col("a").mean(), pl.col("b")],
    ids=["scalar", "aggregation", "column_expression"],
)
def test_fill_null(null_data, value):
    q = null_data.select(pl.col("a").fill_null(value))
    assert_gpu_result_equal(q)


@pytest.mark.parametrize(
    "strategy", ["forward", "backward", "min", "max", "mean", "zero", "one"]
)
def test_fill_null_with_strategy(null_data, strategy):
    q = null_data.select(pl.col("a").fill_null(strategy=strategy))

    # Not yet exposed to python from rust
    assert_ir_translation_raises(q, NotImplementedError)
```

Review thread on `test_fill_null_with_strategy`:

- Reviewer: Can you add a test for the `limit` keyword?
- Author: I think we might not be passing the necessary options across the rust-python boundary again here. This doesn't seem to affect […]. Perhaps we can come up with a quick polars patch to fix this and tack it on to pola-rs/polars#17702 before it goes in. cc @wence-
- Author: Actually, the current state may be sufficient, since […]
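For context on what the `limit` keyword would control: with the forward strategy, each null takes the most recent non-null value, and `limit` caps how many consecutive nulls get filled. A pure-Python sketch of those semantics (the function name is illustrative, not part of cudf-polars):

```python
def forward_fill(values, limit=None):
    """Sketch of fill_null(strategy="forward", limit=...): propagate the
    last non-null value forward, filling at most `limit` consecutive
    nulls (unbounded if limit is None)."""
    out, last, run = [], None, 0
    for v in values:
        if v is None:
            run += 1
            out.append(last if (limit is None or run <= limit) else None)
        else:
            last, run = v, 0
            out.append(v)
    return out

assert forward_fill([1, None, None, 4]) == [1, 1, 1, 4]
assert forward_fill([1, None, None, 4], limit=1) == [1, 1, None, 4]
```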
Review thread on `drop_nulls`:

- Reviewer: Why is `dropnull` without an underscore while `fill_null` has one? Seems like something to request alignment on in polars.
- Author: Agreed.
- Reviewer: This can be fixed when we update the expression node on the rust side, which it seems like we need to do anyway.
- Reviewer: Just to note that if we do that, we need to map the old name to the new name until such time as we drop support for the older versions (probably do this in `translate.py`).
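The backwards-compatibility shim suggested for `translate.py` could be as simple as a lookup table consulted during IR translation. A hypothetical sketch (the names `_LEGACY_NAME_MAP` and `normalize_function_name` are illustrative, not actual cudf-polars API):

```python
# Map legacy expression-node names (as emitted by older polars
# versions) to their current spellings, so translation keeps working
# across the rename.
_LEGACY_NAME_MAP = {
    "dropnull": "drop_nulls",
}

def normalize_function_name(name: str) -> str:
    """Return the canonical name for a (possibly legacy) expression node."""
    return _LEGACY_NAME_MAP.get(name, name)

assert normalize_function_name("dropnull") == "drop_nulls"
assert normalize_function_name("fill_null") == "fill_null"
```

Once support for the older polars versions is dropped, the map and its call sites can be deleted in one place.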