[FEA] Add distinct-key joins to libcudf #14948

GregoryKimball · 2024-02-01T06:35:50Z

Is your feature request related to a problem? Please describe.
For equality joins in which the keys of one of the tables do not contain any duplicates, then we can provide a more efficient implementation based on cuco::static_set. Distinct-key joins also have more predictable output sizes and most join types can be implemented with single-pass kernels. The join APIs currently in libcudf's hash_join class use the cuco::static_multimap data structure to support duplicates.

Describe the solution you'd like
We should provide a new distinct_hash_join class that uses the cuco::static_set data structure and does not support duplicate keys in the build table. This class would have member functions for inner_join and left_join join types.

Staging the work

Update RAPIDS to use cuco version 56c53beb (Fetch the latest cuco and remove outdated patches rapids-cmake#526)
Merge distinct-key inner join based on a static_set data structure (Add distinct key inner join #14990)
Add left and full join types (Add distinct left join #15149)
Explore a fast path with a single int32 or int64 keying column. To be revisited after Add set retrieve NVIDIA/cuCollections#442
refactor the custom device code from the first distinct key join implementation Improve distinct join with set retrieve #15636
Explore a shared memory hash table if cardinality is below the size threshold to fit in shared memory. We can estimate if the hash table will fit in shared memory based on the size of the build table. Also see [BUG] low cardinality joins on the GPU are really bad. NVIDIA/spark-rapids#7529

Additional context
See also #12261, which includes refactoring hash_join from using cuco::static_multimap to cuco::static_multiset. If we add the simpler and more efficient distinct-key joins, it will make it easier to experiment with join implementations using set-like data structures.

Distinct-key joins are common in "primary key / foreign key" joins because the primary key in a table is required to never have duplicates.

The text was updated successfully, but these errors were encountered:

Contributes to #14948 This PR adds a public `cudf::distinct_hash_join` class that provides a fast code path for joins with distinct keys. Only distinct inner join is tackled in the current PR. Authors: - Yunsong Wang (https://github.com/PointKernel) Approvers: - Jason Lowe (https://github.com/jlowe) - Bradley Dice (https://github.com/bdice) - Lawrence Mitchell (https://github.com/wence-) - David Wendt (https://github.com/davidwendt) - Nghia Truong (https://github.com/ttnghia) URL: #14990

Contributes to #14948 This PR adds distinct left join. It also cleans up the distinct inner code to use the terms "build" and "probe" consistently instead of "left" and "right". Authors: - Yunsong Wang (https://github.com/PointKernel) - Nghia Truong (https://github.com/ttnghia) Approvers: - Bradley Dice (https://github.com/bdice) - Jason Lowe (https://github.com/jlowe) - Nghia Truong (https://github.com/ttnghia) URL: #15149

PointKernel · 2024-05-29T19:58:23Z

Based on tests, explicit shared memory hash tables for joins don't help improve performance:

shared_memory:

| Key | Distribution | NumOutputs | Samples |  CPU Time  | Noise  |  GPU Time  | Noise  |  Elem/s  |
|-----|--------------|------------|---------|------------|--------|------------|--------|----------|
| I32 |       UNIQUE |        800 |  36864x |  16.658 us | 26.38% |  13.567 us | 12.03% |  58.967M |
| I32 |       UNIQUE |       2000 |  35200x |  17.238 us | 23.20% |  14.207 us |  8.56% | 140.774M |
| I32 |       UNIQUE |       8000 |  23424x |  24.394 us | 16.17% |  21.355 us |  7.31% | 374.615M |
| I32 |       UNIQUE |      80000 |   5088x | 101.504 us |  3.65% |  98.452 us |  1.74% | 812.575M |
| I32 |       UNIQUE |     800000 |   1042x | 890.035 us |  0.61% | 887.040 us |  0.50% | 901.876M |
| I32 |       UNIQUE |    8000000 |     58x |   8.730 ms |  0.17% |   8.727 ms |  0.17% | 916.703M |
| I32 |       UNIQUE |   80000000 |     11x |  87.252 ms |  0.01% |  87.250 ms |  0.01% | 916.902M |
| I64 |       UNIQUE |        800 |  30928x |  19.206 us | 20.21% |  16.174 us |  7.08% |  49.463M |
| I64 |       UNIQUE |       2000 |  34576x |  17.513 us | 22.77% |  14.463 us |  8.11% | 138.283M |
| I64 |       UNIQUE |       8000 |  16272x |  33.790 us | 10.68% |  30.754 us |  3.78% | 260.125M |
| I64 |       UNIQUE |      80000 |   2640x | 192.500 us |  1.73% | 189.497 us |  0.67% | 422.169M |
| I64 |       UNIQUE |     800000 |    283x |   1.774 ms |  0.51% |   1.771 ms |  0.47% | 451.705M |
| I64 |       UNIQUE |    8000000 |     29x |  17.573 ms |  0.07% |  17.570 ms |  0.07% | 455.313M |
| I64 |       UNIQUE |   80000000 |     11x | 175.586 ms |  0.02% | 175.585 ms |  0.02% | 455.619M |

global memory:

| Key | Distribution | NumOutputs | Samples |  CPU Time  | Noise  |  GPU Time  | Noise  |  Elem/s  |
|-----|--------------|------------|---------|------------|--------|------------|--------|----------|
| I32 |       UNIQUE |        800 |  66784x |  10.678 us | 47.51% |   7.487 us | 18.63% | 106.854M |
| I32 |       UNIQUE |       2000 |  64320x |  10.932 us | 44.70% |   7.774 us | 16.48% | 257.269M |
| I32 |       UNIQUE |       8000 |  62640x |  11.161 us | 44.79% |   7.982 us | 18.55% |   1.002G |
| I32 |       UNIQUE |      80000 |  35520x |  17.181 us | 26.87% |  14.080 us | 14.23% |   5.682G |
| I32 |       UNIQUE |     800000 |   7552x |  69.285 us |  5.07% |  66.239 us |  2.00% |  12.077G |
| I32 |       UNIQUE |    8000000 |   2016x | 622.167 us |  1.21% | 619.066 us |  1.09% |  12.923G |
| I32 |       UNIQUE |   80000000 |    848x |   6.216 ms |  0.82% |   6.213 ms |  0.82% |  12.876G |
| I64 |       UNIQUE |        800 |  64640x |  10.884 us | 46.03% |   7.736 us | 19.71% | 103.415M |
| I64 |       UNIQUE |       2000 |  63936x |  11.054 us | 45.48% |   7.822 us | 16.70% | 255.690M |
| I64 |       UNIQUE |       8000 |  59792x |  11.565 us | 42.66% |   8.363 us | 16.62% | 956.566M |
| I64 |       UNIQUE |      80000 |  33040x |  18.327 us | 26.25% |  15.140 us | 14.08% |   5.284G |
| I64 |       UNIQUE |     800000 |   7152x |  74.976 us |  5.02% |  71.897 us |  2.42% |  11.127G |
| I64 |       UNIQUE |    8000000 |   2016x | 671.884 us |  1.34% | 668.698 us |  1.24% |  11.964G |
| I64 |       UNIQUE |   80000000 |   1104x |   6.662 ms |  0.86% |   6.659 ms |  0.86% |  12.014G |

This issue can be closed with confirmation from the spark side and the follow up work will be tracked via #15502

PointKernel · 2024-05-29T21:29:47Z

Closing as completed.

NVIDIA/spark-rapids#7529 is more related to high-multiplicity tuning and shared memory hash table won't help.

GregoryKimball added feature request New feature or request 0 - Backlog In queue waiting for assignment Performance Performance related issue Spark Functionality that helps Spark RAPIDS labels Feb 1, 2024

GregoryKimball added this to the Joins continuous improvement milestone Feb 1, 2024

GregoryKimball added this to libcudf Feb 1, 2024

GregoryKimball added the libcudf Affects libcudf (C++/CUDA) code. label Feb 1, 2024

PointKernel mentioned this issue Feb 7, 2024

Add distinct key inner join #14990

Merged

3 tasks

PointKernel changed the title ~~[FEA] Add unique-key joins to libcudf~~ [FEA] Add distinct-key joins to libcudf Feb 13, 2024

GregoryKimball added 2 - In Progress Currently a work in progress and removed 0 - Backlog In queue waiting for assignment labels Feb 21, 2024

GregoryKimball moved this to In progress in libcudf Feb 21, 2024

PointKernel mentioned this issue Feb 26, 2024

Add distinct left join #15149

Merged

3 tasks

GregoryKimball mentioned this issue Mar 8, 2024

[FEA] Add shared memory hash map for low-cardinality aggregations #15262

Closed

PointKernel mentioned this issue May 29, 2024

[BUG] low cardinality joins on the GPU are really bad. NVIDIA/spark-rapids#7529

Open

PointKernel closed this as completed May 29, 2024

GregoryKimball mentioned this issue Jun 13, 2024

[FEA] Investigate fast-path for hash joins that bypasses row operators #16026

Open

GregoryKimball removed this from libcudf Jul 22, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[FEA] Add distinct-key joins to libcudf #14948

[FEA] Add distinct-key joins to libcudf #14948

GregoryKimball commented Feb 1, 2024 •

edited

Loading

PointKernel commented May 29, 2024

PointKernel commented May 29, 2024

[FEA] Add distinct-key joins to libcudf #14948

[FEA] Add distinct-key joins to libcudf #14948

Comments

GregoryKimball commented Feb 1, 2024 • edited Loading

PointKernel commented May 29, 2024

PointKernel commented May 29, 2024

GregoryKimball commented Feb 1, 2024 •

edited

Loading