Add cudf::stable_sort_by_key
#10387
Codecov Report
@@              Coverage Diff               @@
##           branch-22.04   #10387    +/-   ##
==============================================
+ Coverage      10.42%    10.50%   +0.07%
==============================================
  Files            119       126      +7
  Lines          20603     21218    +615
==============================================
+ Hits            2148      2228     +80
- Misses         18455     18990    +535
@gpucibot merge
@PointKernel you should be getting two C++ approvals for libcudf PRs before merging.
Thanks for the reminder. I realized this right after merging.
Closes #9413

Depends on #10387.

There are several changes involved in this PR:

- Refactors `cudf::drop_duplicates` to match `std::unique`'s behavior and renames it as `cudf::unique`. `cudf::unique` creates a table by removing duplicate rows in each consecutive group of equivalent rows of the input.
- Renames `cudf::unordered_drop_duplicates` as `cudf::distinct`. `cudf::distinct` creates a table by keeping unique rows across the whole input table. Unique rows in the new table are in unspecified order due to the nature of hash-based algorithms.
- Renames `cudf::unordered_distinct_count` as `cudf::distinct_count`: the row count of `cudf::distinct`.
- Renames `cudf::distinct_count` as `cudf::unique_count`: the row count of `cudf::unique`.
- Updates corresponding tests and benchmarks.
- Updates related JNI/Cython bindings. In order not to break the existing behavior in Java and Python, the JNI and Cython bindings of `drop_duplicates` are updated to stably sort the input table first and then call `cudf::unique`.

Performance hints for `cudf::unique` and `cudf::distinct`:

- If the input is pre-sorted, use `cudf::unique`.
- If the input is **not** pre-sorted and the behavior of `pandas.DataFrame.drop_duplicates` is desired:
  - If `keep` control (keep the first, last, or none of the duplicates) doesn't matter, use the hash-based `cudf::distinct`.
  - If `keep` control is required, stable sort the input and then call `cudf::unique`.

Authors:
- Yunsong Wang (https://github.com/PointKernel)

Approvers:
- Bradley Dice (https://github.com/bdice)
- https://github.com/brandon-b-miller
- MithunR (https://github.com/mythrocks)
- Vyas Ramasubramani (https://github.com/vyasr)

URL: #10370
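The difference between the two semantics described above can be sketched on the host with the C++ standard library. This is only an illustration, not libcudf code: `Row`, `unique_consecutive`, `distinct_rows`, and `drop_duplicates_keep_first` are hypothetical names, and a row is reduced to a `(key, payload)` pair where only the key decides equivalence.

```cpp
#include <algorithm>
#include <unordered_set>
#include <utility>
#include <vector>

// Hypothetical host-side stand-in for a table row: (key, payload).
using Row = std::pair<int, char>;

// Analogue of the `cudf::unique` semantics: drop repeats only within each
// run of consecutive rows with equal keys, keeping the first row of the run.
std::vector<Row> unique_consecutive(std::vector<Row> rows) {
  auto same_key = [](Row const& a, Row const& b) { return a.first == b.first; };
  rows.erase(std::unique(rows.begin(), rows.end(), same_key), rows.end());
  return rows;
}

// Analogue of the `cudf::distinct` semantics: one row per key across the
// whole input. This host loop happens to keep first occurrences in input
// order; the hash-based libcudf algorithm leaves the output order unspecified.
std::vector<Row> distinct_rows(std::vector<Row> const& rows) {
  std::unordered_set<int> seen;
  std::vector<Row> out;
  for (auto const& r : rows) {
    if (seen.insert(r.first).second) { out.push_back(r); }
  }
  return out;
}

// keep-first behavior on unsorted input: stable sort by key, then unique.
// Stability guarantees the first occurrence of each key leads its run.
std::vector<Row> drop_duplicates_keep_first(std::vector<Row> rows) {
  std::stable_sort(rows.begin(), rows.end(),
                   [](Row const& a, Row const& b) { return a.first < b.first; });
  return unique_consecutive(std::move(rows));
}
```

On unsorted input, `unique_consecutive` alone can leave the same key in the output more than once, which is why the bindings sort stably before calling `cudf::unique`.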
This PR adds a new `stable_sort_by_key` API to libcudf. The new API helps simplify the Cython/JNI bindings of `drop_duplicates` (#10370).
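As a rough picture of what a stable sort-by-key does, here is a minimal host-side sketch using only the C++ standard library. It is not the libcudf implementation or its signature: `stable_sort_by_key_host` is a hypothetical name, and single `std::vector` columns stand in for the values and keys tables, which are assumed to have the same length.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <string>
#include <vector>

// Hypothetical host analogue: reorder `values` by ascending `keys`, keeping
// the input order of values whose keys compare equal.
// Assumes values.size() == keys.size().
std::vector<std::string> stable_sort_by_key_host(std::vector<std::string> const& values,
                                                 std::vector<int> const& keys) {
  // Build the index sequence 0..n-1 and stably sort it by key.
  std::vector<std::size_t> order(keys.size());
  std::iota(order.begin(), order.end(), std::size_t{0});
  std::stable_sort(order.begin(), order.end(),
                   [&keys](std::size_t a, std::size_t b) { return keys[a] < keys[b]; });

  // Gather the values in the sorted order.
  std::vector<std::string> out;
  out.reserve(values.size());
  for (auto i : order) { out.push_back(values[i]); }
  return out;
}
```

Because ties keep their original relative order, a later pass that keeps the first row of each run of equal keys retains the row that appeared first in the unsorted input.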