Refactor stream compaction APIs #10370
Codecov Report
@@ Coverage Diff @@
## branch-22.04 #10370 +/- ##
=================================================
+ Coverage 10.50% 86.16% +75.65%
=================================================
Files 127 139 +12
Lines 21200 22457 +1257
=================================================
+ Hits 2228 19350 +17122
+ Misses 18972 3107 -15865
Very nice work @PointKernel! I have only a few minor comments and then I can approve.
I am especially glad you wrote the comments in the PR description about what API to call for optimal performance in each situation. That was very helpful!
cudf::duplicate_keep_option::KEEP_LAST,
nulls_equal ? cudf::null_equality::EQUAL : cudf::null_equality::UNEQUAL,
rmm::mr::get_current_device_resource());
Is the result supposed to use the same memory resource as the temporary values `gather_map` and `sorted_input`? I don't know how the Java API handles memory resources, but this PR is the only place I see `rmm::mr` being used explicitly (explicit is my preference, but it's not in line with the rest of the file).
I don't have a firm answer to your question. But for reference, the old `drop_duplicates` performs a stable sort internally, and the equivalent of `gather_map`/`sorted_input` was using the same memory resource as `result`.
Given the defaults in `cudf::drop_duplicates()`, I'm not sure why `rmm::mr::get_current_device_resource()` was required here in the old version of this function. @ttnghia might know.
`drop_duplicates` to work like `std::unique`
@bdice @mythrocks @brandon-b-miller Sorry for re-requesting your reviews. There are substantial changes along with API renaming, but most of them are just moving code around.
Mostly wording-related changes to align with the new API names. Thanks for the thorough refactor!
No concerns. I assure you, we are in agreement. I'm trying to ascertain the right course of action for the Java API, without adding to @PointKernel's work. TLDR: I'm fine with either of the following:
As I've already said, this will have no repercussions on
As @revans2 is likely aware, We have an accord: Either option above is acceptable for now. It appears we're going with the second.
@revans2, @codereport: That was me agreeing with @codereport's analysis, and adding how the Java bits fit. :]
Thank you for your patience, @PointKernel. Also, thank you for your courtesy to consumers of the Java API.
This looks nearly ready to ship! I've left a couple of (optional) nitpicks. There are a couple of concerns regarding `distinct` vs `unique` in the doxygen, as @bdice has also noted.
LGTM!
I am only hung up on the docs in `stream_compaction.hpp` at this point, but I don't want to hold the PR any longer since it is so large. If all other reviewers are satisfied, we can merge this and do a follow-up PR to fix the docstrings with these comments.
Python and CMake approval (didn't look at the C++ source).
@PointKernel one (non-blocking) request: it would be nice if the performance tips in the PR description (`unique` vs `distinct`) were added to the doxygen documentation. You could also use the `sa` ("see also") doxygen tag to link between them. CC @bdice and @mythrocks in case they think it would be worthwhile since they reviewed the C++.
@gpucibot merge
Closes #9413
Depending on #10387.
There are several changes involved in this PR:
- Updates `cudf::drop_duplicates` to match `std::unique`'s behavior and renames it as `cudf::unique`. `cudf::unique` creates a table by removing duplicate rows in each consecutive group of equivalent rows of the input.
- Renames `cudf::unordered_drop_duplicates` as `cudf::distinct`. `cudf::distinct` creates a table by keeping unique rows across the whole input table. Unique rows in the new table are in unspecified order due to the nature of hash-based algorithms.
- Renames `cudf::unordered_distinct_count` as `cudf::distinct_count`: count of `cudf::distinct`.
- Renames `cudf::distinct_count` as `cudf::unique_count`: count of `cudf::unique`.
- Existing uses of `drop_duplicates` are updated to stably sort the input table first and then call `cudf::unique`.

Performance hints for `cudf::unique` and `cudf::distinct`:
- If the input is pre-sorted, use `cudf::unique`.
- If the input is not pre-sorted and the behavior of `pandas.DataFrame.drop_duplicates` is desired:
  - If `keep` control (keep the first, last, or none of the duplicates) doesn't matter, use the hash-based `cudf::distinct`.
  - If `keep` control is required, stable sort the input then `cudf::unique`.
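
To make the distinction above concrete, here is a minimal host-side sketch using only the C++ standard library. It is an analogy, not the cuDF API: `std::vector` stands in for a table, and the names `Row`, `unique_consecutive`, `distinct_values`, `sort_then_unique_keep_first`, and `check_examples` are hypothetical helpers invented for this illustration. It assumes the behavior described in this PR: `cudf::unique` acts like `std::unique` (deduplicating only within consecutive runs), `cudf::distinct` acts like a hash set (no order guarantee), and a stable sort followed by unique gives a keep-first result.

```cpp
#include <algorithm>
#include <cassert>
#include <unordered_set>
#include <utility>
#include <vector>

// Hypothetical row type: `key` is compared, `tag` records the original position.
struct Row {
  int key;
  char tag;
};

// Analogue of cudf::unique: collapse each run of consecutive equal values.
// Equal values that are not adjacent are NOT merged.
std::vector<int> unique_consecutive(std::vector<int> rows) {
  rows.erase(std::unique(rows.begin(), rows.end()), rows.end());
  return rows;
}

// Analogue of cudf::distinct: hash-based, so every value appears once
// across the whole input, but the iteration order is unspecified.
std::unordered_set<int> distinct_values(const std::vector<int>& rows) {
  return std::unordered_set<int>(rows.begin(), rows.end());
}

// Analogue of "stable sort, then unique" with keep-first semantics:
// the stable sort preserves input order among equal keys, and std::unique
// keeps the first element of every run.
std::vector<Row> sort_then_unique_keep_first(std::vector<Row> rows) {
  std::stable_sort(rows.begin(), rows.end(),
                   [](const Row& a, const Row& b) { return a.key < b.key; });
  auto last = std::unique(rows.begin(), rows.end(),
                          [](const Row& a, const Row& b) { return a.key == b.key; });
  rows.erase(last, rows.end());
  return rows;
}

// Small self-check exercising the three helpers.
void check_examples() {
  const std::vector<int> input{1, 1, 2, 1, 3, 3};

  // Unsorted input: the second run of 1 survives.
  assert((unique_consecutive(input) == std::vector<int>{1, 2, 1, 3}));

  // Hash-based: three distinct values, order not guaranteed.
  assert(distinct_values(input).size() == 3);

  // Keep-first: for each key, the earliest row in the input wins.
  const auto kept = sort_then_unique_keep_first({{2, 'a'}, {1, 'b'}, {2, 'c'}, {1, 'd'}});
  assert(kept.size() == 2);
  assert(kept[0].key == 1 && kept[0].tag == 'b');
  assert(kept[1].key == 2 && kept[1].tag == 'a');
}
```

The sketch mirrors the hints: when the input is already sorted, the consecutive-run pass alone is enough; when order and `keep` semantics don't matter, the hash-based path avoids the sort; and when `keep` matters, the stable sort is what makes "first" well defined.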