Refactor stream compaction APIs #10370

PointKernel · 2022-02-28T23:18:56Z

Closes #9413

Depending on #10387.

There are several changes involved in this PR:

Refactors cudf::drop_duplicates to match std::unique's behavior and renames it as cudf::unique. cudf::unique creates a table by removing duplicate rows in each consecutive group of equivalent rows of the input.
Renames cudf::unordered_drop_duplicates as cudf::distinct. cudf::distinct creates a table by keeping unique rows across the whole input table. Unique rows in the new table are in unspecified orders due to the nature of hash-based algorithms.
Renames cudf::unordered_distinct_count as cudf::distinct_count: count of cudf::distinct
Renames cudf::distinct_count as cudf::unique_count: count of cudf::unique
Updates corresponding tests and benchmarks.
Updates related JNI/Cython bindings. In order not to break the existing behavior in java and python, JNI and Cython bindings of drop_duplicates are updated to stably sort the input table first and then cudf::unique.

Performance hints for cudf::unique and cudf::distinct:

If the input is pre-sorted, use cudf::unique
If the input is not pre-sorted and the behavior of pandas.DataFrame.drop_duplicates is desired:
- If keep control (keep the first, last, or none of the duplicates) doesn't matter, use the hash-based cudf::distinct
- If keep control is required, stable sort the input then cudf::unique

codecov · 2022-03-02T00:15:19Z

Codecov Report

Merging #10370 (81ef1fa) into branch-22.04 (5207eff) will increase coverage by 75.65%.
The diff coverage is 99.54%.

❗ Current head 81ef1fa differs from pull request most recent head 2c3c02a. Consider uploading reports for the commit 2c3c02a to get more accurate results

@@                Coverage Diff                @@
##           branch-22.04   #10370       +/-   ##
=================================================
+ Coverage         10.50%   86.16%   +75.65%     
=================================================
  Files               127      139       +12     
  Lines             21200    22457     +1257     
=================================================
+ Hits               2228    19350    +17122     
+ Misses            18972     3107    -15865

Impacted Files	Coverage Δ
python/cudf/cudf/core/column/string.py	`88.39% <ø> (+88.39%)`	⬆️
python/cudf/cudf/core/_base_index.py	`86.43% <50.00%> (+86.43%)`	⬆️
python/cudf/cudf/core/index.py	`92.19% <98.97%> (+92.19%)`	⬆️
python/cudf/cudf/core/dataframe.py	`93.57% <99.07%> (+93.57%)`	⬆️
python/cudf/cudf/core/column/column.py	`89.16% <100.00%> (+89.16%)`	⬆️
python/cudf/cudf/core/column/datetime.py	`89.07% <100.00%> (+89.07%)`	⬆️
python/cudf/cudf/core/column/numerical.py	`94.69% <100.00%> (+94.69%)`	⬆️
python/cudf/cudf/core/frame.py	`91.72% <100.00%> (+91.72%)`	⬆️
python/cudf/cudf/core/groupby/groupby.py	`91.92% <100.00%> (+91.92%)`	⬆️
python/cudf/cudf/core/indexed_frame.py	`92.25% <100.00%> (+92.25%)`	⬆️
... and 119 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5207eff...2c3c02a. Read the comment docs.

bdice

Very nice work @PointKernel! I have only a few minor comments and then I can approve.

I am especially glad you wrote the comments in the PR description about what API to call for optimal performance in each situation. That was very helpful!

cpp/src/stream_compaction/drop_duplicates.cu

bdice · 2022-03-02T17:35:06Z

java/src/main/native/src/TableJni.cpp

+                                                     cudf::duplicate_keep_option::KEEP_LAST,
+                                        nulls_equal ? cudf::null_equality::EQUAL :
+                                                      cudf::null_equality::UNEQUAL,
+                                        rmm::mr::get_current_device_resource());


Is the result supposed to use the same memory resource as the temporary values gather_map and sorted_input? I don't know how the Java API handles memory resources, but this PR is the only place I see rmm::mr being used explicitly (explicit is my preference, but it's not in line with the rest of the file).

I don't have a firm answer to your question. But for reference, the old drop_duplicates performs stable sort internally and the equivalent of gather_map/sorted_input was using the same memory resource as result.

Given the defaults in cudf::drop_duplicates(), I'm not sure why rmm::mr::get_current_device_resource() was required here in the old version of this function. @ttnghia might know.

python/cudf/cudf/_lib/stream_compaction.pyx

Co-authored-by: Bradley Dice <[email protected]>

…into update-drop-duplicates

PointKernel · 2022-03-10T22:38:50Z

@bdice @mythrocks @brandon-b-miller Sorry for re-requesting your reviews. There are substantial changes along with API renaming but most of them are just moving code around.

bdice

Mostly wording-related changes to align with the new API names. Thanks for the through refactor!

cpp/include/cudf/stream_compaction.hpp

cpp/src/dictionary/set_keys.cu

cpp/src/stream_compaction/distinct.cu

cpp/src/stream_compaction/unique_count.cu

cpp/src/transform/encode.cu

mythrocks · 2022-03-11T05:03:34Z

@mythrocks I think I am missing something. Java's distinct and Scala's distinct have the same semantics as what we are proposing for distinct. distinct removes all duplicates, unique removes adjacent duplicates. Can you clarify your concern?

No concerns. I assure you, we are in agreement. I'm trying to ascertain the right course of action for the Java API, without adding to @PointKernel's work.

TLDR: I'm fine with either of the following:

(Only) If time permits, the Java dropDuplicates() can be renamed to distinct(), implemented with cudf::distinct().
Otherwise, let's save @PointKernel's time and keep his current changes. (i.e. Keep it named dropDuplicates(), using stable_sort() + cudf::unique().)

As I've already said, this will have no repercussions on spark-rapids.

mythrocks · 2022-03-11T05:49:04Z

have no plans to use it, as far as I can tell?

As @revans2 is likely aware, cudf::jni::*dropDuplicates() arose from #9115. But @PointKernel needn't worry about breaking spark-rapids: we don't use dropDuplicates() because of the expensive sort().
As @jlowe might've mentioned, cudf::distinct() might be more our tempo, eventually. Maybe.

We have an accord: Either option above is acceptable for now. It appears we're going with the second.

Why are we arguing about Java and Scala semantics vs C++ semantics...

@revans2, @codereport: That was me agreeing with @codereport's analysis, and adding how the Java bits fit. :]
Let's leave this discussion here, lest it further resemble an Oscar Wilde play. :]

cpp/benchmarks/stream_compaction/distinct.cpp

cpp/src/stream_compaction/unique_count.cu

mythrocks

Thank you for your patience, @PointKernel. Also, thank you for courtesy to consumers of the Java API.

This looks nearly ready to ship! I've left a couple of (optional) nitpicks. There are a couple of concerns regarding distinct vs unique in the doxygen, as @bdice has also noted.

mythrocks

LGTM!

bdice

I am only hung up on the docs in stream_compaction.hpp at this point but I don't want to hold the PR any longer since it is so large. If all other reviewers are satisfied, we can merge this and do a follow-up PR to fix the docstrings with these comments.

cpp/benchmarks/stream_compaction/distinct.cpp

cpp/include/cudf/stream_compaction.hpp

vyasr

Python and CMake approval (didn't look at the C++ source).

@PointKernel one (non-blocking) request: it would be nice if the performance tips in the PR description (unique vs distinct) were added to the doxygen documentation. You could also use the sa ("see also") doxygen tag to link between them as well. CC @bdice and @mythrocks in case they think it would be worthwhile since they reviewed the C++.

PointKernel · 2022-03-12T05:12:03Z

@gpucibot merge

Update drop_duplicates to work like std::unique

535b91e

PointKernel added feature request New feature or request 2 - In Progress Currently a work in progress libcudf Affects libcudf (C++/CUDA) code. breaking Breaking change labels Feb 28, 2022

PointKernel self-assigned this Feb 28, 2022

Update drop_duplicates cython code

d17788e

github-actions bot added the Python Affects Python cuDF API. label Mar 1, 2022

Update drop_duplicates JNI code

bd9617b

github-actions bot added the Java Affects Java cuDF API. label Mar 1, 2022

Minor comment updates

616c095

PointKernel marked this pull request as ready for review March 2, 2022 17:18

PointKernel requested review from a team as code owners March 2, 2022 17:18

PointKernel requested review from bdice, brandon-b-miller and mythrocks March 2, 2022 17:18

PointKernel added 3 - Ready for Review Ready for review by team and removed 2 - In Progress Currently a work in progress labels Mar 2, 2022

bdice requested changes Mar 2, 2022

View reviewed changes

PointKernel and others added 4 commits March 2, 2022 12:51

Update python/cudf/cudf/_lib/stream_compaction.pyx

6c01469

Co-authored-by: Bradley Dice <[email protected]>

Address review: replace INT32 with type_to_id of size_type

c752247

Merge branch 'update-drop-duplicates' of github.com:PointKernel/cudf …

5111f1e

…into update-drop-duplicates

Minor cython correction

910f5ac

PointKernel requested a review from bdice March 2, 2022 19:43

bdice approved these changes Mar 2, 2022

View reviewed changes

Update JNI and Cython bindings

053cda3

github-actions bot added the CMake CMake build issue label Mar 10, 2022

cmake format

cdf35bd

PointKernel requested a review from a team as a code owner March 10, 2022 21:56

PointKernel changed the title ~~Update drop_duplicates to work like std::unique~~ Refactor stream compaction APIs Mar 10, 2022

PointKernel requested review from mythrocks, brandon-b-miller and bdice March 10, 2022 22:38

bdice requested changes Mar 11, 2022

View reviewed changes

mythrocks reviewed Mar 11, 2022

View reviewed changes

cpp/benchmarks/stream_compaction/distinct.cpp Outdated Show resolved Hide resolved

mythrocks reviewed Mar 11, 2022

View reviewed changes

cpp/src/stream_compaction/unique_count.cu Show resolved Hide resolved

mythrocks reviewed Mar 11, 2022

View reviewed changes

cpp/src/stream_compaction/unique_count.cu Show resolved Hide resolved

mythrocks reviewed Mar 11, 2022

View reviewed changes

Address review comments: update docs/comments + code cleanups

3470dc2

PointKernel requested review from bdice and mythrocks March 11, 2022 19:30

mythrocks approved these changes Mar 11, 2022

View reviewed changes

bdice approved these changes Mar 11, 2022

View reviewed changes

PointKernel added 2 commits March 11, 2022 16:12

Remove keep options in disinct benchmark

93cef5b

Update docs

56f0a56

vyasr approved these changes Mar 11, 2022

View reviewed changes

Add performance hints into doc

2c3c02a

rapids-bot bot merged commit 0be0b00 into rapidsai:branch-22.04 Mar 12, 2022

PointKernel deleted the update-drop-duplicates branch May 26, 2022 17:43

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Refactor stream compaction APIs #10370

Refactor stream compaction APIs #10370

PointKernel commented Feb 28, 2022 •

edited

Loading

codecov bot commented Mar 2, 2022 •

edited

Loading

bdice left a comment

bdice Mar 2, 2022

PointKernel Mar 2, 2022 •

edited

Loading

mythrocks Mar 8, 2022

PointKernel commented Mar 10, 2022

bdice left a comment

mythrocks commented Mar 11, 2022 •

edited

Loading

mythrocks commented Mar 11, 2022 •

edited

Loading

mythrocks left a comment

mythrocks left a comment

bdice left a comment

vyasr left a comment

PointKernel commented Mar 12, 2022

Refactor stream compaction APIs #10370

Refactor stream compaction APIs #10370

Conversation

PointKernel commented Feb 28, 2022 • edited Loading

codecov bot commented Mar 2, 2022 • edited Loading

Codecov Report

bdice left a comment

Choose a reason for hiding this comment

bdice Mar 2, 2022

Choose a reason for hiding this comment

PointKernel Mar 2, 2022 • edited Loading

Choose a reason for hiding this comment

mythrocks Mar 8, 2022

Choose a reason for hiding this comment

PointKernel commented Mar 10, 2022

bdice left a comment

Choose a reason for hiding this comment

mythrocks commented Mar 11, 2022 • edited Loading

mythrocks commented Mar 11, 2022 • edited Loading

mythrocks left a comment

Choose a reason for hiding this comment

mythrocks left a comment

Choose a reason for hiding this comment

bdice left a comment

Choose a reason for hiding this comment

vyasr left a comment

Choose a reason for hiding this comment

PointKernel commented Mar 12, 2022

PointKernel commented Feb 28, 2022 •

edited

Loading

codecov bot commented Mar 2, 2022 •

edited

Loading

PointKernel Mar 2, 2022 •

edited

Loading

mythrocks commented Mar 11, 2022 •

edited

Loading

mythrocks commented Mar 11, 2022 •

edited

Loading