-
Notifications
You must be signed in to change notification settings - Fork 915
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Define Column.nan_as_null to return self #15923
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A question, but nice simplification!
@@ -1247,6 +1249,10 @@ def unary_operator(self, unaryop: str): | |||
f"Operation {unaryop} not supported for dtype {self.dtype}." | |||
) | |||
|
|||
def nans_to_nulls(self: Self) -> Self: | |||
"""Convert NaN to NA.""" | |||
return self |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
question: If the column is a floating point one, nans_to_nulls
might (always?) produce a copy. Here, we would share data. Is that problematic for downstream consumers who might implicitly assume that nans_to_nulls
copies?
I see you handle this explicitly in the one case it is necessary below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I was wrestling back and forth between whether this should copy or not.
One one hand, as you mentioned, there was only 1 API where we needed to guarantee the result didn't share data.
For most other cases, we wanted this to no-op in the cases where no nan conversions would happen.
So I see it as avoid unnecessary copies by default vs the caller guards against calling nan_as_null
if there are no nans to avoid unnecessary copies. WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For most other cases, we wanted this to no-op in the cases where no nan conversions would happen.
For this reason, I feel it's okay not to create copies and generate more memory pressure for this API.
@@ -702,7 +702,9 @@ def fillna( | |||
Returns a copy with null filled. | |||
""" | |||
return libcudf.replace.replace_nulls( | |||
input_col=self, replacement=fill_value, method=method | |||
input_col=self.nans_to_nulls(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would you be able to add a test having a mix of nan's & NA's performing a to_arrow ?
I assume you meant fillna
instead of to_arrow
? If so, I added that test in 0d74275
/merge |
Description
While trying to clean all the
fillna
logic, I needed to have aColumn.nan_as_null
defined to make thefillna
logic more re-useable.This allows other
nan_as_null
usages in cudf to avoiding checking whether it's defined on the column or not.Checklist