Implement Python drop_duplicates with cudf::stable_distinct. #11656
This can also resolve #5286. It can be closed either by this PR or by a separate PR after we can preserve order in
Codecov Report
Patch coverage:
Additional details and impacted files
@@ Coverage Diff @@
## branch-23.06 #11656 +/- ##
================================================
+ Coverage 85.47% 87.52% +2.04%
================================================
Files 153 133 -20
Lines 25006 21776 -3230
================================================
- Hits 21375 19059 -2316
+ Misses 3631 2717 -914
Retargeted to 23.04.
The changed tests and docstrings LGTM.
I'm planning to merge this at the end of tomorrow. I'm giving it a bit of cool-down time since it's a "breaking" PR during burndown.
Thanks @galipremsagar for doing some additional testing of this PR. I feel confident enough to merge this. I have some follow-up work planned, so I am going to merge now and start on that follow-up work.
/merge
Depends on #13392.
Closes #11638
Closes #12449
Closes #11230
Closes #5286
This PR re-implements Python's `DataFrame.drop_duplicates` / `Series.drop_duplicates` to use the `stable_distinct` algorithm. This fixes a large number of correctness issues (results are now ordered the same way as pandas) and also improves performance by eliminating a sorting step.
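To illustrate the ordering difference, here is a minimal pure-Python sketch. It is not the libcudf implementation: `stable_distinct` and `sorted_distinct` below are hypothetical helpers, and the `keep` parameter is only assumed to mirror the pandas `keep="first"` / `keep="last"` options.

```python
def stable_distinct(values, keep="first"):
    """Return the distinct values of a list in their original input order.

    Hypothetical helper for illustration only; not the cudf or libcudf API.
    """
    if keep == "first":
        seen = set()
        out = []
        for v in values:
            if v not in seen:  # keep only the first occurrence
                seen.add(v)
                out.append(v)
        return out
    if keep == "last":
        # Keeping the last occurrence is the same as keeping the first
        # occurrence of the reversed input, then restoring the order.
        return stable_distinct(values[::-1], keep="first")[::-1]
    raise ValueError(f"unsupported keep value: {keep!r}")


def sorted_distinct(values):
    """Sort-based deduplication: the result order is the sorted order."""
    return sorted(set(values))


data = [3, 1, 3, 2, 1]
stable_first = stable_distinct(data)              # [3, 1, 2]
stable_last = stable_distinct(data, keep="last")  # [3, 2, 1]
sorted_result = sorted_distinct(data)             # [1, 2, 3]
```

The stable variants keep rows in the order in which they appear in the input, which is what pandas does. The sort-based variant returns the same set of values in a different order.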
As a consequence of changing the behavior of `drop_duplicates`, a lot of refactoring was needed. The `drop_duplicates` function was used to implement `unique()`, which cascaded into changes for several groupby functions, one-hot encoding, `np.unique` array function dispatches, and more. Those downstream functions relied on the sorting order of `drop_duplicates` and `unique`, which is not promised by pandas.