Implement `lists::distinct` and `cudf::detail::stable_distinct` #11149
CMake changes LGTM
Some final nitpicks. Looks good in general.
I didn't pay attention to the default stream until this round of review. Otherwise, looks good to me.
LGTM 👍
@gpucibot merge
This PR completely removes `cudf::lists::drop_list_duplicates`. It is replaced by the new API `cudf::lists::distinct`, which has a simpler implementation and better performance. The replacements for internal cudf usage have all been merged already, so this work introduces no side effects or breakage for the existing APIs.

Closes #11114, #11093, #11053, #11034, and closes #9257.

Depends on:
* #11228
* #11149
* #11234
* #11233

Authors:
- Nghia Truong (https://github.com/ttnghia)

Approvers:
- Jordan Jacobelli (https://github.com/Ethyling)
- Robert Maynard (https://github.com/robertmaynard)
- Vukasin Milovanovic (https://github.com/vuule)
- Bradley Dice (https://github.com/bdice)

URL: #11236
This PR follows up on #11149 to add the lists filtering (stream compaction) APIs to a doxygen group. The previous doxygen group `lists_drop_duplicates` is empty after #11326.

Authors:
- Bradley Dice (https://github.com/bdice)

Approvers:
- David Wendt (https://github.com/davidwendt)
- Vukasin Milovanovic (https://github.com/vuule)

URL: #11336
This PR adds the following APIs for set operations:
* `lists::have_overlap`
* `lists::intersect_distinct`
* `lists::union_distinct`
* `lists::difference_distinct`

### Name Convention

Except for the first API (`lists::have_overlap`), which returns a boolean column, the suffix `_distinct` on the remaining APIs denotes that their results are lists columns in which every list row has been post-processed to remove duplicates. As such, their results are effectively "set" columns in which each row is a "set" of distinct elements.

---

Depends on:
* #10945
* #11017
* NVIDIA/cuCollections#175
* #11052
* #11118
* #11100
* #11149

Closes #10409.

Authors:
- Nghia Truong (https://github.com/ttnghia)
- Yunsong Wang (https://github.com/PointKernel)

Approvers:
- Michael Wang (https://github.com/isVoid)
- AJ Schmidt (https://github.com/ajschmidt8)
- Bradley Dice (https://github.com/bdice)
- Yunsong Wang (https://github.com/PointKernel)

URL: #11043
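To make the naming convention concrete, here is a minimal CPU-only sketch of what the four set operations mean for a single pair of list rows. It does not use cudf at all: the names `row_sketch`, `set_sketch`, and every `*_sketch` function are hypothetical and exist only for illustration, nulls and NaNs are ignored, and the sorted output order is an artifact of `std::set`, not a claim about the real APIs.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <set>
#include <vector>

// Hypothetical stand-ins: one list row, and one "set" row of distinct elements.
using row_sketch = std::vector<int>;
using set_sketch = std::set<int>;

// Collapse a row (which may contain duplicates) into its distinct elements.
set_sketch to_set_sketch(row_sketch const& row) { return set_sketch(row.begin(), row.end()); }

// Boolean result: true if the two rows share at least one element.
bool have_overlap_sketch(row_sketch const& lhs, row_sketch const& rhs)
{
  auto const rhs_set = to_set_sketch(rhs);
  return std::any_of(lhs.begin(), lhs.end(), [&](int v) { return rhs_set.count(v) > 0; });
}

// Distinct elements present in both rows.
set_sketch intersect_distinct_sketch(row_sketch const& lhs, row_sketch const& rhs)
{
  auto const a = to_set_sketch(lhs);
  auto const b = to_set_sketch(rhs);
  set_sketch out;
  std::set_intersection(a.begin(), a.end(), b.begin(), b.end(), std::inserter(out, out.end()));
  return out;
}

// Distinct elements present in either row.
set_sketch union_distinct_sketch(row_sketch const& lhs, row_sketch const& rhs)
{
  auto out = to_set_sketch(lhs);
  out.insert(rhs.begin(), rhs.end());
  return out;
}

// Distinct elements of lhs that do not appear in rhs.
set_sketch difference_distinct_sketch(row_sketch const& lhs, row_sketch const& rhs)
{
  auto const a = to_set_sketch(lhs);
  auto const b = to_set_sketch(rhs);
  set_sketch out;
  std::set_difference(a.begin(), a.end(), b.begin(), b.end(), std::inserter(out, out.end()));
  return out;
}
```

The real APIs operate on whole lists columns, applying this kind of logic to each pair of corresponding rows; the sketch handles one pair to keep the semantics visible.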
This adds new APIs:
* `lists::distinct` as a stream compaction component of `cudf::lists::`, allowing distinct elements to be extracted from each list in a lists column. The new API does a similar job to `lists::drop_list_duplicates` but can operate on arbitrary data types, whereas `lists::drop_list_duplicates` only works on basic data types and flat structs.
* `cudf::detail::stable_distinct`, which is implemented in the main stream compaction module. This API is introduced as just a `detail::` API first (which means we can expose it to the public if needed), producing output equivalent to `cudf::distinct` but with row order preserved. It is used as a building block to implement `lists::distinct`.

This PR is a dependency to implement set-like operations (#11043).

Note: This new `lists::distinct` API will completely replace `lists::drop_list_duplicates` (which in turn will be deprecated). This will be the follow-up work.
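As a rough illustration of how an order-preserving distinct can serve as a building block for a per-list distinct, the following CPU-only sketch applies a stable deduplication to each row of a nested vector. It is not the cudf implementation: `stable_distinct_sketch`, `lists_distinct_sketch`, and `lists_column_sketch` are hypothetical names, the sketch assumes the first occurrence of each value is the one kept, and it ignores nulls, NaNs, and nested types.

```cpp
#include <cassert>
#include <unordered_set>
#include <vector>

// Hypothetical helper: keep the first occurrence of each value while
// preserving the input order (the "stable" property).
std::vector<int> stable_distinct_sketch(std::vector<int> const& input)
{
  std::unordered_set<int> seen;
  std::vector<int> output;
  for (int value : input) {
    // insert().second is true only the first time a value is encountered.
    if (seen.insert(value).second) { output.push_back(value); }
  }
  return output;
}

// Hypothetical stand-in for a lists column: one inner vector per row.
using lists_column_sketch = std::vector<std::vector<int>>;

// Apply the stable distinct to every row independently, so that elements
// never move between rows and the number of rows is unchanged.
lists_column_sketch lists_distinct_sketch(lists_column_sketch const& input)
{
  lists_column_sketch output;
  output.reserve(input.size());
  for (auto const& row : input) { output.push_back(stable_distinct_sketch(row)); }
  return output;
}
```

For example, under these assumptions a column with rows `{1, 1, 2}`, `{}`, and `{5, 4, 5}` becomes `{1, 2}`, `{}`, and `{5, 4}`. The element order inside each output row comes from the helper used here and should not be read as a guarantee of the real `lists::distinct`.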