Add cudf::stable_sort_by_key #10387

Merged: 9 commits, Mar 7, 2022

PointKernel (Member) commented Mar 2, 2022

This PR adds a new `cudf::stable_sort_by_key` API to libcudf. The new API simplifies the Cython/JNI bindings of `drop_duplicates` (#10370).
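As a rough host-side illustration of what a stable sort by key means, the sketch below reorders a `values` vector according to a separate `keys` vector using only the C++ standard library. It is not the libcudf implementation, and it does not reproduce the libcudf signature, which this page does not show. The function name `stable_sort_by_key_host` and the use of `std::vector` are assumptions made for the example.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <string>
#include <vector>

// Hypothetical host-side analogue (not a libcudf function): returns `values`
// reordered so that the corresponding `keys` are ascending. Rows with equal
// keys keep their original relative order, which is the "stable" guarantee.
// Assumes values.size() == keys.size().
std::vector<std::string> stable_sort_by_key_host(std::vector<std::string> const& values,
                                                 std::vector<int> const& keys)
{
  // Build the row indices 0, 1, ..., n-1.
  std::vector<std::size_t> order(keys.size());
  std::iota(order.begin(), order.end(), std::size_t{0});

  // Stably sort the indices by comparing the keys they refer to.
  std::stable_sort(order.begin(), order.end(),
                   [&keys](std::size_t a, std::size_t b) { return keys[a] < keys[b]; });

  // Gather the values in the sorted order.
  std::vector<std::string> result;
  result.reserve(values.size());
  for (std::size_t i : order) { result.push_back(values[i]); }
  return result;
}
```

For keys `{2, 1, 2, 1}` and values `{"a", "b", "c", "d"}`, this returns `{"b", "d", "a", "c"}`: within each key, the values stay in their input order.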

@PointKernel added the labels feature request, 3 - Ready for Review, libcudf, CMake, and non-breaking on Mar 2, 2022
@PointKernel self-assigned this on Mar 2, 2022
@PointKernel requested a review from a team as a code owner on March 2, 2022 22:05
Review threads (outdated, resolved): `cpp/src/sort/sort.cu`, `cpp/tests/sort/stable_sort_tests.cpp`
codecov bot commented Mar 3, 2022

Codecov Report

Merging #10387 (7486621) into branch-22.04 (a7d88cd) will increase coverage by 0.07%.
The diff coverage is n/a.

@@               Coverage Diff                @@
##           branch-22.04   #10387      +/-   ##
================================================
+ Coverage         10.42%   10.50%   +0.07%     
================================================
  Files               119      126       +7     
  Lines             20603    21218     +615     
================================================
+ Hits               2148     2228      +80     
- Misses            18455    18990     +535     
| Impacted Files | Coverage Δ |
|---|---|
| ...ython/custreamz/custreamz/tests/test_dataframes.py | 99.39% <0.00%> (-0.01%) ⬇️ |
| python/cudf/cudf/errors.py | 0.00% <0.00%> (ø) |
| python/cudf/cudf/io/orc.py | 0.00% <0.00%> (ø) |
| python/cudf/cudf/_version.py | 0.00% <0.00%> (ø) |
| python/cudf/cudf/core/ops.py | 0.00% <0.00%> (ø) |
| python/cudf/cudf/datasets.py | 0.00% <0.00%> (ø) |
| python/cudf/cudf/core/frame.py | 0.00% <0.00%> (ø) |
| python/cudf/cudf/core/index.py | 0.00% <0.00%> (ø) |
| python/cudf/cudf/io/parquet.py | 0.00% <0.00%> (ø) |
| python/cudf/cudf/core/scalar.py | 0.00% <0.00%> (ø) |

... and 45 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 5d8ea19...7486621.

@PointKernel requested a review from davidwendt on March 7, 2022 17:26
PointKernel (Member, Author) commented:

@gpucibot merge

rapids-bot merged commit 4f8c60a into rapidsai:branch-22.04 on Mar 7, 2022
davidwendt (Contributor) commented:

@PointKernel you should be getting two C++ approvals for libcudf PRs before merging.

PointKernel (Member, Author) commented:

> @PointKernel you should be getting two C++ approvals for libcudf PRs before merging.

Thanks for the reminder. I realized this right after merging.

rapids-bot bot pushed a commit that referenced this pull request Mar 12, 2022
Closes #9413

Depends on #10387.

There are several changes involved in this PR:

- Refactors `cudf::drop_duplicates` to match the behavior of `std::unique` and renames it to `cudf::unique`. `cudf::unique` creates a table by removing duplicate rows within each consecutive group of equivalent rows of the input.
- Renames `cudf::unordered_drop_duplicates` to `cudf::distinct`. `cudf::distinct` creates a table by keeping unique rows across the whole input table. Rows in the new table appear in an unspecified order because the algorithm is hash-based.
- Renames `cudf::unordered_distinct_count` to `cudf::distinct_count`: the count of rows produced by `cudf::distinct`.
- Renames `cudf::distinct_count` to `cudf::unique_count`: the count of rows produced by `cudf::unique`.
- Updates the corresponding tests and benchmarks.
- Updates the related JNI/Cython bindings. To avoid breaking existing behavior in Java and Python, the JNI and Cython bindings of `drop_duplicates` now stably sort the input table first and then call `cudf::unique`.
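The difference between the two renamed operations can be sketched on the host with the standard library. This is only an analogy on a single column of integers, not libcudf code. The function names `unique_consecutive` and `distinct_any_order` are hypothetical, and `std::unordered_set` stands in for whatever hash-based structure libcudf uses.

```cpp
#include <algorithm>
#include <unordered_set>
#include <vector>

// Analogue of the `cudf::unique` semantics described above: collapse each
// run of consecutive equal rows into one row, exactly as std::unique does.
// Equal rows that are not adjacent are NOT merged.
std::vector<int> unique_consecutive(std::vector<int> rows)
{
  rows.erase(std::unique(rows.begin(), rows.end()), rows.end());
  return rows;
}

// Analogue of the `cudf::distinct` semantics described above: keep one copy
// of every value across the whole input. The output order is unspecified
// because it comes from iterating a hash set.
std::vector<int> distinct_any_order(std::vector<int> const& rows)
{
  std::unordered_set<int> seen(rows.begin(), rows.end());
  return std::vector<int>(seen.begin(), seen.end());
}
```

For the input `{1, 1, 2, 1, 3, 3}`, `unique_consecutive` returns `{1, 2, 1, 3}` (4 rows, the analogue of `cudf::unique_count`), while `distinct_any_order` returns the values 1, 2, and 3 in some order (3 rows, the analogue of `cudf::distinct_count`).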


Performance hints for `cudf::unique` and `cudf::distinct`:

- If the input is pre-sorted, use `cudf::unique`.
- If the input is **not** pre-sorted and the behavior of `pandas.DataFrame.drop_duplicates` is desired:

  - If `keep` control (keeping the first, the last, or none of the duplicates) doesn't matter, use the hash-based `cudf::distinct`.
  - If `keep` control is required, stably sort the input, then apply `cudf::unique`.
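The last hint relies on stability: after a stable sort by key, the first row of each group of equal keys is the first occurrence in the original input, and the last row of the group is the last occurrence. A minimal host-side sketch of that idea follows. It assumes rows are `(key, payload)` pairs, and the names `Keep`, `Row`, and `drop_duplicates_host` are hypothetical. It is not the JNI/Cython binding code, and its output is in key order rather than input order.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

enum class Keep { First, Last, None };  // hypothetical mirror of the `keep` option

using Row = std::pair<int, char>;  // (key, payload)

// Stably sort by key, then walk each group of equal keys and keep the
// first row, the last row, or (for Keep::None) only rows whose key occurs once.
std::vector<Row> drop_duplicates_host(std::vector<Row> rows, Keep keep)
{
  // Stable sort: rows with equal keys stay in their original relative order.
  std::stable_sort(rows.begin(), rows.end(),
                   [](Row const& a, Row const& b) { return a.first < b.first; });

  std::vector<Row> result;
  std::size_t begin = 0;
  while (begin < rows.size()) {
    // Find the end of the current group of equal keys: [begin, end).
    std::size_t end = begin + 1;
    while (end < rows.size() && rows[end].first == rows[begin].first) { ++end; }

    if (keep == Keep::First) {
      result.push_back(rows[begin]);    // earliest occurrence in the input
    } else if (keep == Keep::Last) {
      result.push_back(rows[end - 1]);  // latest occurrence in the input
    } else if (end - begin == 1) {
      result.push_back(rows[begin]);    // Keep::None: key is not duplicated
    }
    begin = end;
  }
  return result;
}
```

For the input `{(2,'a'), (1,'b'), (2,'c'), (3,'d'), (1,'e')}`, `Keep::First` yields `{(1,'b'), (2,'a'), (3,'d')}`, `Keep::Last` yields `{(1,'e'), (2,'c'), (3,'d')}`, and `Keep::None` yields `{(3,'d')}`. With an unstable sort, which of `'b'` or `'e'` comes first within key 1 would not be guaranteed, so the first/last choice would be arbitrary.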

Authors:
  - Yunsong Wang (https://github.com/PointKernel)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - https://github.com/brandon-b-miller
  - MithunR (https://github.com/mythrocks)
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: #10370
@PointKernel deleted the stable-sort-by-key branch on May 26, 2022 17:43