Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor stream compaction APIs #10370

Merged
merged 19 commits into from
Mar 12, 2022

Conversation

PointKernel
Copy link
Member

@PointKernel PointKernel commented Feb 28, 2022

Closes #9413

Depending on #10387.

There are several changes involved in this PR:

  • Refactors cudf::drop_duplicates to match std::unique's behavior and renames it as cudf::unique. cudf::unique creates a table by removing duplicate rows in each consecutive group of equivalent rows of the input.
  • Renames cudf::unordered_drop_duplicates as cudf::distinct. cudf::distinct creates a table by keeping unique rows across the whole input table. Unique rows in the new table are in unspecified orders due to the nature of hash-based algorithms.
  • Renames cudf::unordered_distinct_count as cudf::distinct_count: count of cudf::distinct
  • Renames cudf::distinct_count as cudf::unique_count: count of cudf::unique
  • Updates corresponding tests and benchmarks.
  • Updates related JNI/Cython bindings. In order not to break the existing behavior in java and python, JNI and Cython bindings of drop_duplicates are updated to stably sort the input table first and then cudf::unique.

Performance hints for cudf::unique and cudf::distinct:

  • If the input is pre-sorted, use cudf::unique

  • If the input is not pre-sorted and the behavior of pandas.DataFrame.drop_duplicates is desired:

    • If keep control (keep the first, last, or none of the duplicates) doesn't matter, use the hash-based cudf::distinct
    • If keep control is required, stable sort the input then cudf::unique

@PointKernel PointKernel added feature request New feature or request 2 - In Progress Currently a work in progress libcudf Affects libcudf (C++/CUDA) code. breaking Breaking change labels Feb 28, 2022
@PointKernel PointKernel self-assigned this Feb 28, 2022
@github-actions github-actions bot added the Python Affects Python cuDF API. label Mar 1, 2022
@github-actions github-actions bot added the Java Affects Java cuDF API. label Mar 1, 2022
@codecov
Copy link

codecov bot commented Mar 2, 2022

Codecov Report

Merging #10370 (81ef1fa) into branch-22.04 (5207eff) will increase coverage by 75.65%.
The diff coverage is 99.54%.

❗ Current head 81ef1fa differs from pull request most recent head 2c3c02a. Consider uploading reports for the commit 2c3c02a to get more accurate results

Impacted file tree graph

@@                Coverage Diff                @@
##           branch-22.04   #10370       +/-   ##
=================================================
+ Coverage         10.50%   86.16%   +75.65%     
=================================================
  Files               127      139       +12     
  Lines             21200    22457     +1257     
=================================================
+ Hits               2228    19350    +17122     
+ Misses            18972     3107    -15865     
Impacted Files Coverage Δ
python/cudf/cudf/core/column/string.py 88.39% <ø> (+88.39%) ⬆️
python/cudf/cudf/core/_base_index.py 86.43% <50.00%> (+86.43%) ⬆️
python/cudf/cudf/core/index.py 92.19% <98.97%> (+92.19%) ⬆️
python/cudf/cudf/core/dataframe.py 93.57% <99.07%> (+93.57%) ⬆️
python/cudf/cudf/core/column/column.py 89.16% <100.00%> (+89.16%) ⬆️
python/cudf/cudf/core/column/datetime.py 89.07% <100.00%> (+89.07%) ⬆️
python/cudf/cudf/core/column/numerical.py 94.69% <100.00%> (+94.69%) ⬆️
python/cudf/cudf/core/frame.py 91.72% <100.00%> (+91.72%) ⬆️
python/cudf/cudf/core/groupby/groupby.py 91.92% <100.00%> (+91.92%) ⬆️
python/cudf/cudf/core/indexed_frame.py 92.25% <100.00%> (+92.25%) ⬆️
... and 119 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5207eff...2c3c02a. Read the comment docs.

@PointKernel PointKernel marked this pull request as ready for review March 2, 2022 17:18
@PointKernel PointKernel requested review from a team as code owners March 2, 2022 17:18
@PointKernel PointKernel added 3 - Ready for Review Ready for review by team and removed 2 - In Progress Currently a work in progress labels Mar 2, 2022
Copy link
Contributor

@bdice bdice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice work @PointKernel! I have only a few minor comments and then I can approve.

I am especially glad you wrote the comments in the PR description about what API to call for optimal performance in each situation. That was very helpful!

cpp/src/stream_compaction/drop_duplicates.cu Outdated Show resolved Hide resolved
cudf::duplicate_keep_option::KEEP_LAST,
nulls_equal ? cudf::null_equality::EQUAL :
cudf::null_equality::UNEQUAL,
rmm::mr::get_current_device_resource());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the result supposed to use the same memory resource as the temporary values gather_map and sorted_input? I don't know how the Java API handles memory resources, but this PR is the only place I see rmm::mr being used explicitly (explicit is my preference, but it's not in line with the rest of the file).

Copy link
Member Author

@PointKernel PointKernel Mar 2, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have a firm answer to your question. But for reference, the old drop_duplicates performs stable sort internally and the equivalent of gather_map/sorted_input was using the same memory resource as result.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given the defaults in cudf::drop_duplicates(), I'm not sure why rmm::mr::get_current_device_resource() was required here in the old version of this function. @ttnghia might know.

python/cudf/cudf/_lib/stream_compaction.pyx Outdated Show resolved Hide resolved
@PointKernel PointKernel requested a review from bdice March 2, 2022 19:43
@github-actions github-actions bot added the CMake CMake build issue label Mar 10, 2022
@PointKernel PointKernel requested a review from a team as a code owner March 10, 2022 21:56
@PointKernel PointKernel changed the title Update drop_duplicates to work like std::unique Refactor stream compaction APIs Mar 10, 2022
@PointKernel
Copy link
Member Author

@bdice @mythrocks @brandon-b-miller Sorry for re-requesting your reviews. There are substantial changes along with API renaming but most of them are just moving code around.

Copy link
Contributor

@bdice bdice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly wording-related changes to align with the new API names. Thanks for the through refactor!

cpp/include/cudf/stream_compaction.hpp Outdated Show resolved Hide resolved
cpp/src/dictionary/set_keys.cu Outdated Show resolved Hide resolved
cpp/src/dictionary/set_keys.cu Outdated Show resolved Hide resolved
cpp/src/stream_compaction/distinct.cu Outdated Show resolved Hide resolved
cpp/src/stream_compaction/distinct.cu Outdated Show resolved Hide resolved
cpp/src/stream_compaction/unique_count.cu Outdated Show resolved Hide resolved
cpp/src/transform/encode.cu Outdated Show resolved Hide resolved
@mythrocks
Copy link
Contributor

mythrocks commented Mar 11, 2022

@mythrocks I think I am missing something. Java's distinct and Scala's distinct have the same semantics as what we are proposing for distinct. distinct removes all duplicates, unique removes adjacent duplicates. Can you clarify your concern?

No concerns. I assure you, we are in agreement. I'm trying to ascertain the right course of action for the Java API, without adding to @PointKernel's work.

TLDR: I'm fine with either of the following:

  1. (Only) If time permits, the Java dropDuplicates() can be renamed to distinct(), implemented with cudf::distinct().
  2. Otherwise, let's save @PointKernel's time and keep his current changes. (i.e. Keep it named dropDuplicates(), using stable_sort() + cudf::unique().)

As I've already said, this will have no repercussions on spark-rapids.

@mythrocks
Copy link
Contributor

mythrocks commented Mar 11, 2022

have no plans to use it, as far as I can tell?

As @revans2 is likely aware, cudf::jni::*dropDuplicates() arose from #9115. But @PointKernel needn't worry about breaking spark-rapids: we don't use dropDuplicates() because of the expensive sort().
As @jlowe might've mentioned, cudf::distinct() might be more our tempo, eventually. Maybe.

We have an accord: Either option above is acceptable for now. It appears we're going with the second.

Why are we arguing about Java and Scala semantics vs C++ semantics...

@revans2, @codereport: That was me agreeing with @codereport's analysis, and adding how the Java bits fit. :]
Let's leave this discussion here, lest it further resemble an Oscar Wilde play. :]

Copy link
Contributor

@mythrocks mythrocks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your patience, @PointKernel. Also, thank you for courtesy to consumers of the Java API.

This looks nearly ready to ship! I've left a couple of (optional) nitpicks. There are a couple of concerns regarding distinct vs unique in the doxygen, as @bdice has also noted.

@PointKernel PointKernel requested review from bdice and mythrocks March 11, 2022 19:30
Copy link
Contributor

@mythrocks mythrocks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Copy link
Contributor

@bdice bdice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am only hung up on the docs in stream_compaction.hpp at this point but I don't want to hold the PR any longer since it is so large. If all other reviewers are satisfied, we can merge this and do a follow-up PR to fix the docstrings with these comments.

cpp/benchmarks/stream_compaction/distinct.cpp Outdated Show resolved Hide resolved
cpp/include/cudf/stream_compaction.hpp Outdated Show resolved Hide resolved
cpp/include/cudf/stream_compaction.hpp Outdated Show resolved Hide resolved
cpp/include/cudf/stream_compaction.hpp Outdated Show resolved Hide resolved
cpp/include/cudf/stream_compaction.hpp Outdated Show resolved Hide resolved
cpp/include/cudf/stream_compaction.hpp Outdated Show resolved Hide resolved
cpp/include/cudf/stream_compaction.hpp Show resolved Hide resolved
Copy link
Contributor

@vyasr vyasr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Python and CMake approval (didn't look at the C++ source).

@PointKernel one (non-blocking) request: it would be nice if the performance tips in the PR description (unique vs distinct) were added to the doxygen documentation. You could also use the sa ("see also") doxygen tag to link between them as well. CC @bdice and @mythrocks in case they think it would be worthwhile since they reviewed the C++.

@PointKernel
Copy link
Member Author

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 0be0b00 into rapidsai:branch-22.04 Mar 12, 2022
@PointKernel PointKernel deleted the update-drop-duplicates branch May 26, 2022 17:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team breaking Breaking change CMake CMake build issue feature request New feature or request Java Affects Java cuDF API. libcudf Affects libcudf (C++/CUDA) code. Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEA] drop_duplicates and distinct_count behavior/implementation is very inefficient
9 participants