[FEA] Add distinct-key joins to libcudf #14948
Labels
2 - In Progress
Currently a work in progress
feature request
New feature or request
libcudf
Affects libcudf (C++/CUDA) code.
Performance
Performance related issue
Spark
Functionality that helps Spark RAPIDS
Comments
GregoryKimball added the labels feature request, 0 - Backlog (in queue waiting for assignment), Performance, and Spark on Feb 1, 2024.
PointKernel changed the title from "[FEA] Add unique-key joins to libcudf" to "[FEA] Add distinct-key joins to libcudf" on Feb 13, 2024.
GregoryKimball added the 2 - In Progress label and removed the 0 - Backlog label on Feb 21, 2024.
rapids-bot bot pushed a commit that referenced this issue on Feb 23, 2024:

Contributes to #14948

This PR adds a public `cudf::distinct_hash_join` class that provides a fast code path for joins with distinct keys. Only distinct inner join is tackled in the current PR.

Authors:
- Yunsong Wang (https://github.com/PointKernel)

Approvers:
- Jason Lowe (https://github.com/jlowe)
- Bradley Dice (https://github.com/bdice)
- Lawrence Mitchell (https://github.com/wence-)
- David Wendt (https://github.com/davidwendt)
- Nghia Truong (https://github.com/ttnghia)

URL: #14990
rapids-bot bot pushed a commit that referenced this issue on Mar 6, 2024:

Contributes to #14948

This PR adds distinct left join. It also cleans up the distinct inner code to use the terms "build" and "probe" consistently instead of "left" and "right".

Authors:
- Yunsong Wang (https://github.com/PointKernel)
- Nghia Truong (https://github.com/ttnghia)

Approvers:
- Bradley Dice (https://github.com/bdice)
- Jason Lowe (https://github.com/jlowe)
- Nghia Truong (https://github.com/ttnghia)

URL: #15149
Based on tests, explicit shared memory hash tables for joins don't help improve performance:
This issue can be closed with confirmation from the Spark side, and the follow-up work will be tracked via #15502.
Closing as completed. NVIDIA/spark-rapids#7529 is more related to high-multiplicity tuning, and a shared memory hash table won't help there.
Is your feature request related to a problem? Please describe.

For equality joins in which the keys of one of the tables do not contain any duplicates, we can provide a more efficient implementation based on `cuco::static_set`. Distinct-key joins also have more predictable output sizes, and most join types can be implemented with single-pass kernels. The join APIs currently in libcudf's `hash_join` class use the `cuco::static_multimap` data structure to support duplicates.

Describe the solution you'd like

We should provide a new `distinct_hash_join` class that uses the `cuco::static_set` data structure and does not support duplicate keys in the build table. This class would have member functions for the `inner_join` and `left_join` join types.

Staging the work

- [...] `56c53beb` (Fetch the latest cuco and remove outdated patches rapids-cmake#526)
- [...] `static_set` data structure (Add distinct key inner join #14990)
- [...] `int32` or `int64` keying column. To be revisited after Add set retrieve NVIDIA/cuCollections#442
- [...] `retrieve` #15636

Additional context

See also #12261, which includes refactoring `hash_join` from using `cuco::static_multimap` to `cuco::static_multiset`. If we add the simpler and more efficient distinct-key joins, it will make it easier to experiment with join implementations using set-like data structures.

Distinct-key joins are common in "primary key / foreign key" joins because the primary key in a table is required to never have duplicates.