Migrate set retrieve to use the OA implementation #637
Compared to the current set algorithm, the OA retrieve achieves comparable performance after adding an early exit for a single hash set, though it still experiences about a 20% slowdown:
On the bright side, I realized that cudf’s distinct inner join can utilize Is a clean and nice block-wise
Can you elaborate on this? I don't think the API granularity is the culprit here as it just mimics the behavior of the former
Based on profiling, the performance difference mainly came from memory transactions and register usage. I highly suspect the former is due to the current buffer flushing strategy:
Will keep you posted.
Force-pushed: e3d12a9 to be3f83d
Whups, that is probably my bad. It should be feasible to flush the remaining elements at the end of the device-batch retrieve. The current approach is conceptually not that different from the former retrieve kernel. Instead of defining the shmem buffers at kernel scope and then passing them to the API, we define them directly inside the batch retrieve function, which gives us full control over the size and the current state of the buffers.
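To make the "buffer inside the batch function, flush the remainder at the end" idea concrete, here is a minimal host-side C++ model. It is only a sketch under stated assumptions: `staged_output` and `batch_retrieve` are hypothetical names, a `std::array` stands in for the shared-memory buffer, and a linear scan stands in for the hash probe. It is not the cuco kernel code.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a shared-memory staging buffer (not cuco API).
template <typename T, std::size_t Capacity>
class staged_output {
 public:
  explicit staged_output(std::vector<T>& out) : out_{out} {}

  // Stage one element and flush as soon as the buffer is full.
  void push(T const& value)
  {
    buffer_[size_++] = value;
    if (size_ == Capacity) { flush(); }
  }

  // Copy all staged elements to the output in one bulk write.
  void flush()
  {
    if (size_ == 0) { return; }
    auto const last = buffer_.begin() + static_cast<std::ptrdiff_t>(size_);
    out_.insert(out_.end(), buffer_.begin(), last);
    ++num_flushes_;
    size_ = 0;
  }

  std::size_t num_flushes() const { return num_flushes_; }

 private:
  std::vector<T>& out_;
  std::array<T, Capacity> buffer_{};
  std::size_t size_{0};
  std::size_t num_flushes_{0};
};

// The buffer is owned by the batch function itself, so its size and state
// are fully under local control. Returns the number of bulk writes issued.
template <std::size_t Capacity>
std::size_t batch_retrieve(std::vector<int> const& keys,
                           std::vector<int> const& probes,
                           std::vector<int>& out)
{
  staged_output<int, Capacity> staged{out};
  for (int const probe : probes) {
    // Linear scan used for brevity in place of a hash-table probe.
    if (std::find(keys.begin(), keys.end(), probe) != keys.end()) { staged.push(probe); }
  }
  staged.flush();  // final flush of the remaining (fewer than Capacity) elements
  return staged.num_flushes();
}
```

With keys `{1, 2, 3, 4, 5}`, probes `{5, 9, 1, 3, 7, 2}`, and a capacity of 3, the model issues one full flush and one final partial flush.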
Yeah, I've already included that change in this PR and there is no performance impact.
The coalesced group still seems like the most likely culprit for now. I have a local branch that mimics the behavior of the legacy code but manages all shared memory operations at the block level. The performance is significantly worse (I assume that’s where you started as well 😊). I plan to investigate the root cause of the additional local store/load operations tomorrow.
At most about 3.7% slower than the legacy set retrieve (roughly 1.3% to 1.9% for I32 and 2.3% to 3.7% for I64).
# static_set_retrieve_uniform_occupancy
## [0] Quadro RTX 8000
| Key | Distribution | NumInputs | Occupancy | MatchingRate | Multiplicity | Ref Time | Ref Noise | Cmp Time | Cmp Noise | Diff | %Diff | Status |
|-------|----------------|-------------|-------------|----------------|----------------|------------|-------------|------------|-------------|------------|---------|----------|
| I32 | UNIFORM | 100000000 | 0.1 | 1 | 1 | 31.655 ms | 0.16% | 32.138 ms | 0.49% | 483.156 us | 1.53% | FAIL |
| I32 | UNIFORM | 100000000 | 0.2 | 1 | 1 | 31.561 ms | 0.03% | 32.031 ms | 0.09% | 470.460 us | 1.49% | FAIL |
| I32 | UNIFORM | 100000000 | 0.3 | 1 | 1 | 31.600 ms | 0.04% | 32.059 ms | 0.02% | 459.550 us | 1.45% | FAIL |
| I32 | UNIFORM | 100000000 | 0.4 | 1 | 1 | 31.625 ms | 0.05% | 32.098 ms | 0.03% | 472.842 us | 1.50% | FAIL |
| I32 | UNIFORM | 100000000 | 0.5 | 1 | 1 | 31.768 ms | 0.02% | 32.279 ms | 0.02% | 510.634 us | 1.61% | FAIL |
| I32 | UNIFORM | 100000000 | 0.6 | 1 | 1 | 32.028 ms | 0.09% | 32.539 ms | 0.14% | 510.931 us | 1.60% | FAIL |
| I32 | UNIFORM | 100000000 | 0.7 | 1 | 1 | 32.365 ms | 0.05% | 32.888 ms | 0.02% | 522.923 us | 1.62% | FAIL |
| I32 | UNIFORM | 100000000 | 0.8 | 1 | 1 | 32.854 ms | 0.05% | 33.434 ms | 0.03% | 580.426 us | 1.77% | FAIL |
| I32 | UNIFORM | 100000000 | 0.9 | 1 | 1 | 33.509 ms | 0.05% | 34.158 ms | 0.11% | 648.794 us | 1.94% | FAIL |
| I64 | UNIFORM | 100000000 | 0.1 | 1 | 1 | 33.785 ms | 0.09% | 34.664 ms | 0.02% | 878.707 us | 2.60% | FAIL |
| I64 | UNIFORM | 100000000 | 0.2 | 1 | 1 | 33.780 ms | 0.02% | 34.640 ms | 0.05% | 860.681 us | 2.55% | FAIL |
| I64 | UNIFORM | 100000000 | 0.3 | 1 | 1 | 33.828 ms | 0.28% | 34.688 ms | 0.09% | 860.019 us | 2.54% | FAIL |
| I64 | UNIFORM | 100000000 | 0.4 | 1 | 1 | 33.886 ms | 0.08% | 34.794 ms | 0.04% | 907.827 us | 2.68% | FAIL |
| I64 | UNIFORM | 100000000 | 0.5 | 1 | 1 | 34.030 ms | 0.02% | 35.002 ms | 0.27% | 971.882 us | 2.86% | FAIL |
| I64 | UNIFORM | 100000000 | 0.6 | 1 | 1 | 34.288 ms | 0.04% | 35.342 ms | 0.10% | 1.054 ms | 3.07% | FAIL |
| I64 | UNIFORM | 100000000 | 0.7 | 1 | 1 | 34.644 ms | 0.05% | 35.750 ms | 0.06% | 1.107 ms | 3.19% | FAIL |
| I64 | UNIFORM | 100000000 | 0.8 | 1 | 1 | 35.143 ms | 0.04% | 36.316 ms | 0.05% | 1.173 ms | 3.34% | FAIL |
| I64 | UNIFORM | 100000000 | 0.9 | 1 | 1 | 35.839 ms | 0.04% | 37.148 ms | 0.07% | 1.309 ms | 3.65% | FAIL |
# static_set_retrieve_uniform_matching_rate
## [0] Quadro RTX 8000
| Key | Distribution | NumInputs | Occupancy | MatchingRate | Multiplicity | Ref Time | Ref Noise | Cmp Time | Cmp Noise | Diff | %Diff | Status |
|-------|----------------|-------------|-------------|----------------|----------------|------------|-------------|------------|-------------|------------|---------|----------|
| I32 | UNIFORM | 100000000 | 0.5 | 0.1 | 1 | 32.716 ms | 0.07% | 33.242 ms | 0.02% | 526.718 us | 1.61% | FAIL |
| I32 | UNIFORM | 100000000 | 0.5 | 0.2 | 1 | 32.669 ms | 0.05% | 33.209 ms | 0.03% | 539.526 us | 1.65% | FAIL |
| I32 | UNIFORM | 100000000 | 0.5 | 0.3 | 1 | 32.619 ms | 0.04% | 33.174 ms | 0.06% | 555.652 us | 1.70% | FAIL |
| I32 | UNIFORM | 100000000 | 0.5 | 0.4 | 1 | 32.631 ms | 0.33% | 33.147 ms | 0.03% | 515.483 us | 1.58% | FAIL |
| I32 | UNIFORM | 100000000 | 0.5 | 0.5 | 1 | 32.519 ms | 0.15% | 33.007 ms | 0.02% | 487.826 us | 1.50% | FAIL |
| I32 | UNIFORM | 100000000 | 0.5 | 0.6 | 1 | 32.387 ms | 0.08% | 32.900 ms | 0.27% | 512.544 us | 1.58% | FAIL |
| I32 | UNIFORM | 100000000 | 0.5 | 0.7 | 1 | 32.204 ms | 0.06% | 32.697 ms | 0.05% | 492.261 us | 1.53% | FAIL |
| I32 | UNIFORM | 100000000 | 0.5 | 0.8 | 1 | 32.080 ms | 0.06% | 32.546 ms | 0.08% | 466.007 us | 1.45% | FAIL |
| I32 | UNIFORM | 100000000 | 0.5 | 0.9 | 1 | 31.951 ms | 0.09% | 32.399 ms | 0.02% | 447.622 us | 1.40% | FAIL |
| I32 | UNIFORM | 100000000 | 0.5 | 1 | 1 | 31.825 ms | 0.08% | 32.275 ms | 0.04% | 449.686 us | 1.41% | FAIL |
| I64 | UNIFORM | 100000000 | 0.5 | 0.1 | 1 | 35.011 ms | 0.04% | 36.080 ms | 0.03% | 1.069 ms | 3.05% | FAIL |
| I64 | UNIFORM | 100000000 | 0.5 | 0.2 | 1 | 34.972 ms | 0.13% | 36.024 ms | 0.02% | 1.051 ms | 3.01% | FAIL |
| I64 | UNIFORM | 100000000 | 0.5 | 0.3 | 1 | 34.925 ms | 0.03% | 35.978 ms | 0.04% | 1.052 ms | 3.01% | FAIL |
| I64 | UNIFORM | 100000000 | 0.5 | 0.4 | 1 | 34.927 ms | 0.29% | 35.935 ms | 0.03% | 1.008 ms | 2.89% | FAIL |
| I64 | UNIFORM | 100000000 | 0.5 | 0.5 | 1 | 34.788 ms | 0.06% | 35.803 ms | 0.01% | 1.016 ms | 2.92% | FAIL |
| I64 | UNIFORM | 100000000 | 0.5 | 0.6 | 1 | 34.655 ms | 0.08% | 35.675 ms | 0.28% | 1.020 ms | 2.94% | FAIL |
| I64 | UNIFORM | 100000000 | 0.5 | 0.7 | 1 | 34.478 ms | 0.07% | 35.462 ms | 0.03% | 983.448 us | 2.85% | FAIL |
| I64 | UNIFORM | 100000000 | 0.5 | 0.8 | 1 | 34.339 ms | 0.04% | 35.327 ms | 0.13% | 987.044 us | 2.87% | FAIL |
| I64 | UNIFORM | 100000000 | 0.5 | 0.9 | 1 | 34.222 ms | 0.05% | 35.138 ms | 0.02% | 915.469 us | 2.68% | FAIL |
| I64 | UNIFORM | 100000000 | 0.5 | 1 | 1 | 34.099 ms | 0.04% | 35.003 ms | 0.02% | 904.032 us | 2.65% | FAIL |
# static_set_retrieve_uniform_multiplicity
## [0] Quadro RTX 8000
| Key | Distribution | NumInputs | Occupancy | MatchingRate | Multiplicity | Ref Time | Ref Noise | Cmp Time | Cmp Noise | Diff | %Diff | Status |
|-------|----------------|-------------|-------------|----------------|----------------|------------|-------------|------------|-------------|------------|---------|----------|
| I32 | UNIFORM | 100000000 | 0.5 | 1 | 1 | 31.872 ms | 0.05% | 32.286 ms | 0.04% | 414.806 us | 1.30% | FAIL |
| I32 | UNIFORM | 100000000 | 0.5 | 1 | 2 | 31.635 ms | 0.04% | 32.042 ms | 0.08% | 407.646 us | 1.29% | FAIL |
| I32 | UNIFORM | 100000000 | 0.5 | 1 | 4 | 31.487 ms | 0.05% | 31.886 ms | 0.06% | 398.762 us | 1.27% | FAIL |
| I32 | UNIFORM | 100000000 | 0.5 | 1 | 8 | 31.446 ms | 0.04% | 31.848 ms | 0.16% | 401.280 us | 1.28% | FAIL |
| I32 | UNIFORM | 100000000 | 0.5 | 1 | 16 | 31.354 ms | 0.08% | 31.793 ms | 0.22% | 439.230 us | 1.40% | FAIL |
| I64 | UNIFORM | 100000000 | 0.5 | 1 | 1 | 34.104 ms | 0.05% | 35.028 ms | 0.30% | 924.195 us | 2.71% | FAIL |
| I64 | UNIFORM | 100000000 | 0.5 | 1 | 2 | 33.879 ms | 0.03% | 34.714 ms | 0.02% | 834.471 us | 2.46% | FAIL |
| I64 | UNIFORM | 100000000 | 0.5 | 1 | 4 | 33.761 ms | 0.03% | 34.549 ms | 0.05% | 788.689 us | 2.34% | FAIL |
| I64 | UNIFORM | 100000000 | 0.5 | 1 | 8 | 33.719 ms | 0.08% | 34.480 ms | 0.02% | 761.383 us | 2.26% | FAIL |
| I64 | UNIFORM | 100000000 | 0.5 | 1 | 16 | 33.633 ms | 0.03% | 34.413 ms | 0.02% | 780.286 us | 2.32% | FAIL |
For multisets, it is about 10% to 40% faster compared to the current open-addressing retrieve.
# static_multiset_retrieve_uniform_occupancy
## [0] Quadro RTX 8000
| Key | Distribution | NumInputs | Occupancy | MatchingRate | Multiplicity | Ref Time | Ref Noise | Cmp Time | Cmp Noise | Diff | %Diff | Status |
|-------|----------------|-------------|-------------|----------------|----------------|------------|-------------|------------|-------------|----------------|---------|----------|
| I32 | UNIFORM | 100000000 | 0.1 | 1 | 1 | 38.650 ms | 0.14% | 33.982 ms | 0.43% | -4668.478 us | -12.08% | FAIL |
| I32 | UNIFORM | 100000000 | 0.2 | 1 | 1 | 42.247 ms | 0.05% | 34.941 ms | 3.81% | -7305.580 us | -17.29% | FAIL |
| I32 | UNIFORM | 100000000 | 0.3 | 1 | 1 | 50.699 ms | 0.06% | 36.139 ms | 0.03% | -14559.651 us | -28.72% | FAIL |
| I32 | UNIFORM | 100000000 | 0.4 | 1 | 1 | 62.085 ms | 0.02% | 38.581 ms | 0.03% | -23504.343 us | -37.86% | FAIL |
| I32 | UNIFORM | 100000000 | 0.5 | 1 | 1 | 74.446 ms | 0.16% | 42.371 ms | 0.08% | -32075.106 us | -43.09% | FAIL |
| I32 | UNIFORM | 100000000 | 0.6 | 1 | 1 | 88.869 ms | 0.15% | 48.364 ms | 0.22% | -40504.642 us | -45.58% | FAIL |
| I32 | UNIFORM | 100000000 | 0.7 | 1 | 1 | 109.835 ms | 0.03% | 57.400 ms | 0.12% | -52434.290 us | -47.74% | FAIL |
| I32 | UNIFORM | 100000000 | 0.8 | 1 | 1 | 148.494 ms | 0.04% | 73.864 ms | 0.14% | -74629.652 us | -50.26% | FAIL |
| I32 | UNIFORM | 100000000 | 0.9 | 1 | 1 | 258.161 ms | 0.05% | 117.360 ms | 0.10% | -140800.405 us | -54.54% | FAIL |
| I64 | UNIFORM | 100000000 | 0.1 | 1 | 1 | 41.075 ms | 0.06% | 37.750 ms | 0.03% | -3325.859 us | -8.10% | FAIL |
| I64 | UNIFORM | 100000000 | 0.2 | 1 | 1 | 46.242 ms | 5.98% | 39.479 ms | 5.77% | -6762.625 us | -14.62% | FAIL |
| I64 | UNIFORM | 100000000 | 0.3 | 1 | 1 | 54.064 ms | 0.02% | 40.101 ms | 0.05% | -13963.470 us | -25.83% | FAIL |
| I64 | UNIFORM | 100000000 | 0.4 | 1 | 1 | 66.349 ms | 0.04% | 42.776 ms | 0.09% | -23572.833 us | -35.53% | FAIL |
| I64 | UNIFORM | 100000000 | 0.5 | 1 | 1 | 79.403 ms | 0.07% | 46.942 ms | 0.24% | -32461.705 us | -40.88% | FAIL |
| I64 | UNIFORM | 100000000 | 0.6 | 1 | 1 | 94.554 ms | 0.08% | 53.216 ms | 0.07% | -41338.752 us | -43.72% | FAIL |
| I64 | UNIFORM | 100000000 | 0.7 | 1 | 1 | 116.591 ms | 0.08% | 62.678 ms | 0.07% | -53913.335 us | -46.24% | FAIL |
| I64 | UNIFORM | 100000000 | 0.8 | 1 | 1 | 158.597 ms | 0.06% | 79.747 ms | 0.17% | -78850.020 us | -49.72% | FAIL |
| I64 | UNIFORM | 100000000 | 0.9 | 1 | 1 | 278.029 ms | 0.06% | 124.728 ms | 0.14% | -153301.718 us | -55.14% | FAIL |
# static_multiset_retrieve_uniform_matching_rate
## [0] Quadro RTX 8000
| Key | Distribution | NumInputs | Occupancy | MatchingRate | Multiplicity | Ref Time | Ref Noise | Cmp Time | Cmp Noise | Diff | %Diff | Status |
|-------|----------------|-------------|-------------|----------------|----------------|------------|-------------|------------|-------------|---------------|---------|----------|
| I32 | UNIFORM | 100000000 | 0.5 | 0.1 | 1 | 57.847 ms | 0.05% | 36.902 ms | 0.05% | -20944.641 us | -36.21% | FAIL |
| I32 | UNIFORM | 100000000 | 0.5 | 0.2 | 1 | 58.533 ms | 0.07% | 37.109 ms | 0.09% | -21424.232 us | -36.60% | FAIL |
| I32 | UNIFORM | 100000000 | 0.5 | 0.3 | 1 | 59.248 ms | 0.10% | 37.319 ms | 0.31% | -21928.637 us | -37.01% | FAIL |
| I32 | UNIFORM | 100000000 | 0.5 | 0.4 | 1 | 59.944 ms | 0.09% | 37.462 ms | 0.05% | -22481.984 us | -37.50% | FAIL |
| I32 | UNIFORM | 100000000 | 0.5 | 0.5 | 1 | 61.782 ms | 0.10% | 38.002 ms | 0.06% | -23780.763 us | -38.49% | FAIL |
| I32 | UNIFORM | 100000000 | 0.5 | 0.6 | 1 | 63.841 ms | 0.08% | 38.612 ms | 0.04% | -25228.463 us | -39.52% | FAIL |
| I32 | UNIFORM | 100000000 | 0.5 | 0.7 | 1 | 66.673 ms | 0.08% | 39.531 ms | 0.07% | -27142.926 us | -40.71% | FAIL |
| I32 | UNIFORM | 100000000 | 0.5 | 0.8 | 1 | 69.311 ms | 0.12% | 40.475 ms | 0.06% | -28836.524 us | -41.60% | FAIL |
| I32 | UNIFORM | 100000000 | 0.5 | 0.9 | 1 | 71.696 ms | 0.15% | 41.412 ms | 0.08% | -30284.182 us | -42.24% | FAIL |
| I32 | UNIFORM | 100000000 | 0.5 | 1 | 1 | 74.195 ms | 0.11% | 42.482 ms | 0.05% | -31712.819 us | -42.74% | FAIL |
| I64 | UNIFORM | 100000000 | 0.5 | 0.1 | 1 | 61.779 ms | 0.02% | 40.839 ms | 0.06% | -20940.799 us | -33.90% | FAIL |
| I64 | UNIFORM | 100000000 | 0.5 | 0.2 | 1 | 62.626 ms | 0.05% | 41.049 ms | 0.17% | -21577.183 us | -34.45% | FAIL |
| I64 | UNIFORM | 100000000 | 0.5 | 0.3 | 1 | 63.385 ms | 0.03% | 41.300 ms | 0.23% | -22085.006 us | -34.84% | FAIL |
| I64 | UNIFORM | 100000000 | 0.5 | 0.4 | 1 | 64.189 ms | 0.01% | 41.489 ms | 0.09% | -22699.447 us | -35.36% | FAIL |
| I64 | UNIFORM | 100000000 | 0.5 | 0.5 | 1 | 66.202 ms | 0.02% | 42.068 ms | 0.03% | -24133.859 us | -36.45% | FAIL |
| I64 | UNIFORM | 100000000 | 0.5 | 0.6 | 1 | 68.467 ms | 0.18% | 42.806 ms | 0.08% | -25660.686 us | -37.48% | FAIL |
| I64 | UNIFORM | 100000000 | 0.5 | 0.7 | 1 | 71.496 ms | 0.03% | 43.848 ms | 0.05% | -27647.586 us | -38.67% | FAIL |
| I64 | UNIFORM | 100000000 | 0.5 | 0.8 | 1 | 74.177 ms | 0.02% | 44.881 ms | 0.06% | -29296.193 us | -39.49% | FAIL |
| I64 | UNIFORM | 100000000 | 0.5 | 0.9 | 1 | 76.677 ms | 0.03% | 45.889 ms | 0.04% | -30788.112 us | -40.15% | FAIL |
| I64 | UNIFORM | 100000000 | 0.5 | 1 | 1 | 79.335 ms | 0.04% | 47.076 ms | 0.05% | -32258.921 us | -40.66% | FAIL |
# static_multiset_retrieve_uniform_multiplicity
## [0] Quadro RTX 8000
| Key | Distribution | NumInputs | Occupancy | MatchingRate | Multiplicity | Ref Time | Ref Noise | Cmp Time | Cmp Noise | Diff | %Diff | Status |
|-------|----------------|-------------|-------------|----------------|----------------|------------|-------------|------------|-------------|---------------|---------|----------|
| I32 | UNIFORM | 100000000 | 0.5 | 1 | 1 | 74.280 ms | 0.13% | 42.491 ms | 0.07% | -31788.668 us | -42.80% | FAIL |
| I32 | UNIFORM | 100000000 | 0.5 | 1 | 2 | 82.512 ms | 0.08% | 46.722 ms | 0.26% | -35790.760 us | -43.38% | FAIL |
| I32 | UNIFORM | 100000000 | 0.5 | 1 | 4 | 105.811 ms | 0.12% | 62.029 ms | 0.05% | -43782.403 us | -41.38% | FAIL |
| I32 | UNIFORM | 100000000 | 0.5 | 1 | 8 | 142.387 ms | 0.07% | 89.924 ms | 0.07% | -52462.611 us | -36.85% | FAIL |
| I32 | UNIFORM | 100000000 | 0.5 | 1 | 16 | 204.460 ms | 0.09% | 138.475 ms | 0.11% | -65985.654 us | -32.27% | FAIL |
| I64 | UNIFORM | 100000000 | 0.5 | 1 | 1 | 79.304 ms | 0.03% | 47.139 ms | 0.19% | -32165.628 us | -40.56% | FAIL |
| I64 | UNIFORM | 100000000 | 0.5 | 1 | 2 | 87.966 ms | 0.14% | 51.644 ms | 0.07% | -36321.473 us | -41.29% | FAIL |
| I64 | UNIFORM | 100000000 | 0.5 | 1 | 4 | 112.007 ms | 0.02% | 67.364 ms | 0.20% | -44642.472 us | -39.86% | FAIL |
| I64 | UNIFORM | 100000000 | 0.5 | 1 | 8 | 150.420 ms | 0.02% | 96.340 ms | 0.07% | -54079.837 us | -35.95% | FAIL |
| I64 | UNIFORM | 100000000 | 0.5 | 1 | 16 | 216.731 ms | 0.05% | 149.005 ms | 0.08% | -67726.028 us | -31.25% | FAIL |
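As a reading aid for the tables above: the `Diff` and `%Diff` columns appear consistent with `Diff = Cmp Time - Ref Time` and `%Diff = 100 * Diff / Ref Time`, so negative values mean the compared (new) code is faster. The sketch below assumes those definitions; `comparison` and `compare_times` are hypothetical names used only for illustration.

```cpp
// Result of comparing a reference timing against a candidate timing.
struct comparison {
  double diff_ms;   // Cmp - Ref, in milliseconds
  double pct_diff;  // 100 * (Cmp - Ref) / Ref
};

// Assumed definitions, chosen because they reproduce the table rows.
inline comparison compare_times(double ref_ms, double cmp_ms)
{
  double const diff = cmp_ms - ref_ms;
  return {diff, 100.0 * diff / ref_ms};
}
```

For example, the first set row (31.655 ms vs. 32.138 ms) gives a difference of about 0.483 ms, or about 1.53%, and the first multiset row (38.650 ms vs. 33.982 ms) gives about -12.08%.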
/ok to test
/ok to test
include/cuco/detail/open_addressing/open_addressing_ref_impl.cuh
/ok to test
/merge
This PR updates the legacy set retrieve to use the new open-addressing solution. It enhances open-addressing retrieve by eliminating the use of coalesced groups to reduce register pressure, resulting in approximately 10% to 40% speedups in multiset retrieve benchmarks.