Implement Python drop_duplicates with cudf::stable_distinct. #11656

Merged


brandon-b-miller
Contributor

@brandon-b-miller brandon-b-miller commented Sep 6, 2022

Depends on #13392.

Closes #11638
Closes #12449
Closes #11230
Closes #5286

This PR re-implements DataFrame.drop_duplicates and Series.drop_duplicates in the cuDF Python API on top of the cudf::stable_distinct algorithm.

This fixes a large number of correctness issues (results are now ordered the same way as in pandas) and also improves performance by eliminating a sorting step.
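To make the ordering difference concrete, here is a minimal pure-Python sketch of the semantics only. It is not the libcudf implementation; `sort_based_distinct` and `stable_distinct_sketch` are hypothetical names used for illustration, and the `keep` options are assumed to mirror the `keep` parameter of pandas `drop_duplicates`.

```python
from collections import Counter


def sort_based_distinct(values):
    """Deduplicate via sorting: the result comes back in sorted order."""
    return sorted(set(values))


def stable_distinct_sketch(values, keep="first"):
    """Deduplicate while preserving the input order of the kept elements.

    Illustration only: assumes `keep` behaves like pandas' keep parameter.
    """
    if keep == "first":
        seen = set()
        out = []
        for v in values:
            if v not in seen:
                seen.add(v)
                out.append(v)
        return out
    if keep == "last":
        # Remember the final position of each value, then keep only those rows.
        last_index = {v: i for i, v in enumerate(values)}
        return [v for i, v in enumerate(values) if last_index[v] == i]
    if keep is False:
        # Drop every value that occurs more than once.
        counts = Counter(values)
        return [v for v in values if counts[v] == 1]
    raise ValueError("keep must be 'first', 'last', or False")


data = [3, 1, 3, 2, 1]
print(sort_based_distinct(data))             # [1, 2, 3]
print(stable_distinct_sketch(data))          # [3, 1, 2]
print(stable_distinct_sketch(data, "last"))  # [3, 2, 1]
print(stable_distinct_sketch(data, False))   # [2]
```

The stable variant never sorts, which is the step the PR description says was eliminated.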

As a consequence of changing the behavior of drop_duplicates, a lot of refactoring was needed. The drop_duplicates function was used to implement unique(), which cascaded into changes for several groupby functions, one-hot encoding, np.unique array function dispatches, and more. Those downstream functions relied on drop_duplicates and unique returning sorted output, an ordering that pandas does not promise.
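The kind of downstream adjustment described above can be sketched in plain Python. This is an assumption-laden illustration, not cuDF code: `unique_in_order` and `one_hot` are hypothetical helpers, and the point is only that a caller that needs sorted categories has to sort explicitly once unique values arrive in order of first appearance.

```python
def unique_in_order(values):
    """Unique values in order of first appearance (dicts preserve insertion order)."""
    return list(dict.fromkeys(values))


def one_hot(values, sort_categories=True):
    """One-hot encode `values`; sort the categories explicitly if requested."""
    categories = unique_in_order(values)
    if sort_categories:
        # Do not rely on the uniqueness step to sort; do it here.
        categories = sorted(categories)
    rows = [[int(v == c) for c in categories] for v in values]
    return categories, rows


cats, rows = one_hot(["b", "a", "b"])
print(cats)  # ['a', 'b']
print(rows)  # [[0, 1], [1, 0], [0, 1]]
```

With `sort_categories=False` the columns follow first-appearance order instead, which is what code relying on an implicit sort would silently start producing.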

@github-actions github-actions bot added Python Affects Python cuDF API. libcudf Affects libcudf (C++/CUDA) code. labels Sep 6, 2022
@brandon-b-miller brandon-b-miller added feature request New feature or request Cython non-breaking Non-breaking change labels Sep 6, 2022
@ttnghia
Contributor

ttnghia commented Sep 6, 2022

This can also resolve #5286. That issue can be closed either by this PR or by a separate PR once drop_duplicates preserves order.

@codecov

codecov bot commented Sep 26, 2022

Codecov Report

Patch coverage: 66.66% and project coverage change: +2.04 🎉

Comparison is base (209ab6e) 85.47% compared to head (a550a2e) 87.52%.

❗ Current head a550a2e differs from pull request most recent head c3a3bf7. Consider uploading reports for the commit c3a3bf7 to get more accurate results

Additional details and impacted files
@@               Coverage Diff                @@
##           branch-23.06   #11656      +/-   ##
================================================
+ Coverage         85.47%   87.52%   +2.04%     
================================================
  Files               153      133      -20     
  Lines             25006    21776    -3230     
================================================
- Hits              21375    19059    -2316     
+ Misses             3631     2717     -914     
Impacted Files                             Coverage Δ
python/cudf/cudf/core/dataframe.py         93.77% <ø> (+0.29%) ⬆️
python/cudf/cudf/core/indexed_frame.py     92.03% <66.66%> (-0.68%) ⬇️

... and 114 files with indirect coverage changes


@brandon-b-miller brandon-b-miller changed the base branch from branch-22.10 to branch-22.12 November 7, 2022 15:04
@shwina shwina changed the base branch from branch-22.12 to branch-23.02 November 23, 2022 13:28
@shwina shwina changed the base branch from branch-23.02 to branch-23.04 January 26, 2023 16:45
@shwina
Contributor

shwina commented Jan 26, 2023

Retargeted to 23.04

@bdice bdice changed the base branch from branch-23.04 to branch-23.06 April 17, 2023 18:17
@bdice bdice marked this pull request as ready for review May 22, 2023 21:42
@bdice bdice requested a review from a team as a code owner May 22, 2023 21:42
@bdice bdice requested review from vyasr and mroeschke May 22, 2023 21:42
Contributor

@mroeschke mroeschke left a comment


The changed tests and docstrings LGTM

@bdice bdice added breaking Breaking change and removed non-breaking Non-breaking change CMake CMake build issue libcudf Affects libcudf (C++/CUDA) code. labels May 22, 2023
@bdice
Contributor

bdice commented May 23, 2023

I'm planning to merge this at the end of tomorrow. I'm giving it a bit of cool-down time since it's a "breaking" PR during burndown.

@bdice
Contributor

bdice commented May 23, 2023

Thanks @galipremsagar for doing some additional testing of this PR. I feel confident enough to merge this. I have some follow-up work planned, so I am going to merge now and start on that follow-up work.

@bdice
Contributor

bdice commented May 23, 2023

/merge

Labels
breaking Breaking change feature request New feature or request Python Affects Python cuDF API.
6 participants