-
Notifications
You must be signed in to change notification settings - Fork 915
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix nan_as_null
not being respected when passing cudf object
#14687
Fix nan_as_null
not being respected when passing cudf object
#14687
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A small suggestion.
elif isinstance(arbitrary, cudf.Series): | ||
data = arbitrary._column | ||
column = arbitrary | ||
if column.dtype.kind == "f" and (nan_as_null is None or nan_as_null): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Usually I tend to prefer some parameter handling at the start of the function, such as:
if nan_as_null is None:
nan_as_null = True
…nan_as_null_cudf_objs
…nan_as_null_cudf_objs
elif isinstance(arbitrary, cudf.Series): | ||
data = arbitrary._column | ||
column = arbitrary | ||
if column.dtype.kind == "f" and (nan_as_null is None or nan_as_null): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mroeschke - The CI failure is a result of us hitting this check when we call compute() in the dask-cudf test.
When a dask collection is "computed", the partitions need to be concatenated together. It seems that calling cudf.concat
uses the cudf.Series._concat, which in turn calls this as_column
function with a default nan_as_null=True
setting. This means that dask-cudf correctly uses the user-provided nan_as_null=False
option when each partition is converted from pandas to cudf, but then the nans are lost when the partitions are concatenated together in compute
.
One possible fix may be to modify this line to something like:
return cls(data=col, index=index, name=name, nan_as_null=False)
However, I'm not immediately sure if there are other corner cases that this doesn't catch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, I suppose the best way to test this outside of dask is to add a cudf.concat
test that checks that existing nan values are preserved. In the case that there are both nan and null values, then we would probably need to raise an error or "promote" everything to null.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the insight! The Series._concat
piece is what I was missing. I am going to see what breaks if nan_as_null
is set to False
. In theory, I would imagine nans and nulls should not be mangled together when merged together
Similar to #14687, nan was not being interpreted as null when `nan_as_null=True` Authors: - Matthew Roeschke (https://github.com/mroeschke) Approvers: - Vyas Ramasubramani (https://github.com/vyasr) URL: #14688
…nan_as_null_cudf_objs
…nan_as_null_cudf_objs
…nan_as_null_cudf_objs
Looks like there's too many moving parts here when having |
Description
Fixes the following
Needs #14450 first
Checklist