-
Notifications
You must be signed in to change notification settings - Fork 915
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[REVIEW] Preserve the correct ListDtype
while creating an identical empty column
#10151
Conversation
Codecov Report
@@ Coverage Diff @@
## branch-22.04 #10151 +/- ##
================================================
+ Coverage 10.37% 10.42% +0.05%
================================================
Files 119 119
Lines 20149 20597 +448
================================================
+ Hits 2091 2148 +57
- Misses 18058 18449 +391
Continue to review full report at Codecov.
|
@gpucibot merge |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just realized while submitting this review that I was a little late. Sorry @galipremsagar - I opened the tab earlier when you asked for review but was on a meeting until now. These suggestions might be helpful but are not blocking.
{"a": cudf.Series([["a", "b"], []]), "b": [1, 2], "c": [123, 321]} | ||
) | ||
df1 = df[df.b.isna()] | ||
df1["b"] = df1["c"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This needs a comment to explain the logic of the test, and particularly why this line is necessary. On the surface, this line appears to do nothing (it assigns an empty Series "b"
to have the data of another empty Series "c"
, but both are int64
types). In the failing case, this triggers a change in the dtype of df1["a"]
because it is empty and goes from ListDtype(object)
(list of strings) to ListDtype(int8)
. That side-effect is not clear because this line doesn't reference "a"
at all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed these comments in https://github.com/rapidsai/cudf/pull/10176/files
df1 = df[df.b.isna()] | ||
df1["b"] = df1["c"] | ||
df2 = df1.drop(["c"], axis=1) | ||
assert df2.a.dtype == df.a.dtype |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would prefer to use the same square-bracket indexing here as above, rather than switch to attribute-based access:
assert df2.a.dtype == df.a.dtype | |
assert df2["a"].dtype == df["a"].dtype |
PR to address #10151 (review) Authors: - GALI PREM SAGAR (https://github.com/galipremsagar) Approvers: - Bradley Dice (https://github.com/bdice) URL: #10176
Fixes: #10122
This PR fixes an issue where the list columns children[1]'s dtype wasn't being preserved correctly.