Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Define Column.nan_as_null to return self #15923

Merged
merged 4 commits into from
Jun 7, 2024

Conversation

mroeschke
Copy link
Contributor

Description

While trying to clean all the fillna logic, I needed to have a Column.nan_as_null defined to make the fillna logic more re-useable.

This allows other nan_as_null usages in cudf to avoiding checking whether it's defined on the column or not.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@mroeschke mroeschke added Python Affects Python cuDF API. improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Jun 5, 2024
@mroeschke mroeschke requested a review from a team as a code owner June 5, 2024 00:04
Copy link
Contributor

@wence- wence- left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A question, but nice simplification!

@@ -1247,6 +1249,10 @@ def unary_operator(self, unaryop: str):
f"Operation {unaryop} not supported for dtype {self.dtype}."
)

def nans_to_nulls(self: Self) -> Self:
"""Convert NaN to NA."""
return self
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: If the column is a floating point one, nans_to_nulls might (always?) produce a copy. Here, we would share data. Is that problematic for downstream consumers who might implicitly assume that nans_to_nulls copies?

I see you handle this explicitly in the one case it is necessary below.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I was wrestling back and forth between whether this should copy or not.

One one hand, as you mentioned, there was only 1 API where we needed to guarantee the result didn't share data.

For most other cases, we wanted this to no-op in the cases where no nan conversions would happen.

So I see it as avoid unnecessary copies by default vs the caller guards against calling nan_as_null if there are no nans to avoid unnecessary copies. WDYT?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For most other cases, we wanted this to no-op in the cases where no nan conversions would happen.

For this reason, I feel it's okay not to create copies and generate more memory pressure for this API.

@@ -702,7 +702,9 @@ def fillna(
Returns a copy with null filled.
"""
return libcudf.replace.replace_nulls(
input_col=self, replacement=fill_value, method=method
input_col=self.nans_to_nulls(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you fixed a bug that is resulting in an increase in test pass rate:

Screenshot 2024-06-05 at 11 43 24 AM

Would you be able to add a test having a mix of nan's & NA's performing a to_arrow ? Just wanting to make sure we have it captured in pytest so that we know if it breaks/someone changes this code unintentionally.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you be able to add a test having a mix of nan's & NA's performing a to_arrow ?

I assume you meant fillna instead of to_arrow? If so, I added that test in 0d74275

@mroeschke
Copy link
Contributor Author

/merge

@rapids-bot rapids-bot bot merged commit d83d086 into rapidsai:branch-24.08 Jun 7, 2024
72 checks passed
@mroeschke mroeschke deleted the ref/nan_to_null branch June 7, 2024 17:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
improvement Improvement / enhancement to an existing function non-breaking Non-breaking change Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants