-
Notifications
You must be signed in to change notification settings - Fork 916
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add struct type support for drop_list_duplicates
#9202
Add struct type support for drop_list_duplicates
#9202
Conversation
|
||
// If nans are considered as NOT equal, even both element(i) and element(j) are NaNs this | ||
// comparison will still return `false`. This is the desired behavior in Apache Spark. | ||
return lhs_val == rhs_val; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the purposes of checking for duplicates, are we expecting to be checking for exact duplicates or does potential floating point precision come into play? If this is the latter, it might useful in the future to examine the overlap between this code and the equality/equivalency checking code in cudf_test/column_utilities
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an interesting question. I thought about this before when firstly implemented it, but since there is no requirement for adding precision control thus I just did an exact comparison. If there is a request for adding precision control then we can go back and add it easily.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is something we want to entertain. The implication would then be that all libcudf functions that are comparing floating point values for equality should allow controlling the precision. This is just part of how floating point values work and while it makes sense to have that control in testing utilities, putting it in actual compute APIs would be very detrimental to maintenance/performance/testing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A couple of minor questions, but this looks pretty close to ready.
{ | ||
// Two entries are not considered for equality if they belong to different lists | ||
// If both element(i) and element(j) are NaNs and nans are considered as equal value then this | ||
// comparison will return `true`. This is the desired behavior in Pandas. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe this is something we can improve in a future PR if it's too big a change now, but do we want to embed pandas/Spark specific behavior this deep in the call stack rather than having some configuration value passed down? I can see that it would need to get passed through about 4 levels of calls to get here, which would be annoying, but it also seems more in line with our general design philosophy to have an enum to configure this behavior rather than encoding it here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Absolutely agree with this. When this API was first reviewed, there was a long discussion about how to satisfy both Pandas' users and Spark's users, since each side requires a different behavior (NaN are equal or not). In the meantime, I don't have a better idea how to avoid this deep call stack. I'm happy to do it (or anybody can do it) if there is a better way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe I'm missing something, but would it not be possible to define an enum for different nan behaviors and then pass that down (all the way through get_unique_entries_and_list_offsets
->get_unique_entries_dispatch
... I know it's a lot of forwarding) to control this behavior based on a templated version of the column_row_comparator_fn
? I'm not saying that we need to do it in this PR, but if that solution would work and be preferable we should at least make an issue for this to avoid losing track once the PR is merged.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. I'll take a note on this. There should be a room for improvement. I'll create an issue for this too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. Here it is: #9257
Codecov Report
@@ Coverage Diff @@
## branch-21.10 #9202 +/- ##
================================================
- Coverage 10.85% 10.84% -0.01%
================================================
Files 115 116 +1
Lines 19158 19171 +13
================================================
Hits 2080 2080
- Misses 17078 17091 +13
Continue to review full report at Codecov.
|
rerun tests |
Rerun tests. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor comments inline. For the record, I really dislike the -NaN replacement requirement. Ideally this should be factored out into an opt-in extension for platform-specific NaN behavior. But I guess libcudf is already riddled with platform-specific special casing at a very low level.
@gpucibot merge |
This PR add support for struct type into the existing
drop_list_duplicates
API. This is the first time a nested type is supported in this function. Some more code cleanup has also been done.To be clear: Only structs of basic types and structs of structs are supported. Structs of lists are not, due to their complex nature.
Closes #8972.
Blocked by #9218 (it is merged).