Fix NaN handling in drop_list_duplicates #7662
…ng point numbers with NaN
Codecov Report
@@ Coverage Diff @@
## branch-0.19 #7662 +/- ##
===============================================
+ Coverage 81.86% 82.28% +0.42%
===============================================
Files 101 101
Lines 16884 17066 +182
===============================================
+ Hits 13822 14043 +221
+ Misses 3062 3023 -39
Rerun tests.
I'm just thinking of another solution: re-implement Note that we still have to call |
That's a good point. I forgot CUB's radix sort doesn't take iterators. I think your current implementation is the best we can do for now.
On second pass, another thought occurred to me. Instead of having to materialize the replaced NaN column, couldn't the equality comparator just be parameterized on `nan_equality` to determine whether `-NaN` and `NaN` are equal?
I thought about that, but couldn't apply it. Here is the reason: After sorting, |
Oh, interesting. That must be an implementation detail of the CUB segmented radix sort. Well, in that case, what you've done seems like the best we can do.
…AL before calling to `has_negative_nans`
@gpucibot merge
Rerun tests.
Rerun tests.
Rerun tests.
rerun tests
Rerun tests.
This PR modifies the behavior of `drop_list_duplicates` to satisfy both Apache Spark and Pandas behavior when dealing with `NaN` values in floating-point column data:
- `NaN`s are treated as different values, so no `NaN` entry is removed after calling `drop_list_duplicates`.
- `NaN`s are considered the same value, and even `-NaN` is considered the same as `NaN`, so only one `NaN` entry per list is kept.

New tests have also been added to verify this behavior.