Migrate set retrieve to use the OA implementation #637

Merged
merged 25 commits into from
Nov 21, 2024

PointKernel
Member

@PointKernel PointKernel commented Nov 8, 2024

This PR updates the legacy set retrieve to use the new open-addressing solution. It enhances open-addressing retrieve by eliminating the use of coalesced groups to reduce register pressure, resulting in approximately 10% to 40% speedups in multiset retrieve benchmarks.

@PointKernel PointKernel added type: improvement Improvement / enhancement to an existing function topic: static_set Issue related to the static_set labels Nov 8, 2024
@PointKernel
Member Author

PointKernel commented Nov 8, 2024

After adding an early exit for a single hash set, the OA retrieve gets close to the current set algorithm, though it is still slower, by about 6% to 27% in the runs below (roughly 20% at high occupancy or low matching rate):

yunsongw@0c23fdd-lcedt:~/Work/nvbench/scripts$ ./nvbench_compare.py old_retrieve.json oa-items-block.json 
['old_retrieve.json', 'oa-items-block.json']
# static_set_retrieve_uniform_occupancy

## [0] Quadro RTX 8000

|  Key  |  Distribution  |  Occupancy  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |      Diff |   %Diff |  Status  |
|-------|----------------|-------------|------------|-------------|------------|-------------|-----------|---------|----------|
|  I32  |    UNIFORM     |     0.1     |  31.708 ms |       0.40% |  34.022 ms |       0.11% |  2.314 ms |   7.30% |   FAIL   |
|  I32  |    UNIFORM     |     0.2     |  31.706 ms |       0.32% |  34.080 ms |       0.06% |  2.375 ms |   7.49% |   FAIL   |
|  I32  |    UNIFORM     |     0.3     |  31.689 ms |       0.29% |  34.312 ms |       0.13% |  2.623 ms |   8.28% |   FAIL   |
|  I32  |    UNIFORM     |     0.4     |  32.002 ms |       1.16% |  34.864 ms |       0.04% |  2.862 ms |   8.94% |   FAIL   |
|  I32  |    UNIFORM     |     0.5     |  32.161 ms |       0.07% |  35.880 ms |       0.08% |  3.719 ms |  11.56% |   FAIL   |
|  I32  |    UNIFORM     |     0.6     |  32.711 ms |       0.10% |  37.520 ms |       0.26% |  4.808 ms |  14.70% |   FAIL   |
|  I32  |    UNIFORM     |     0.7     |  33.454 ms |       0.06% |  39.609 ms |       0.01% |  6.155 ms |  18.40% |   FAIL   |
|  I32  |    UNIFORM     |     0.8     |  34.625 ms |       0.17% |  42.471 ms |       0.03% |  7.846 ms |  22.66% |   FAIL   |
|  I32  |    UNIFORM     |     0.9     |  36.350 ms |       0.09% |  46.138 ms |       0.05% |  9.788 ms |  26.93% |   FAIL   |
|  I64  |    UNIFORM     |     0.1     |  33.798 ms |       0.04% |  36.332 ms |       0.24% |  2.534 ms |   7.50% |   FAIL   |
|  I64  |    UNIFORM     |     0.2     |  33.976 ms |       1.76% |  36.435 ms |       0.04% |  2.459 ms |   7.24% |   FAIL   |
|  I64  |    UNIFORM     |     0.3     |  33.978 ms |       0.14% |  36.717 ms |       0.05% |  2.739 ms |   8.06% |   FAIL   |
|  I64  |    UNIFORM     |     0.4     |  34.090 ms |       0.10% |  37.314 ms |       0.04% |  3.223 ms |   9.46% |   FAIL   |
|  I64  |    UNIFORM     |     0.5     |  34.412 ms |       0.06% |  38.389 ms |       0.10% |  3.977 ms |  11.56% |   FAIL   |
|  I64  |    UNIFORM     |     0.6     |  34.927 ms |       0.03% |  40.046 ms |       0.05% |  5.119 ms |  14.66% |   FAIL   |
|  I64  |    UNIFORM     |     0.7     |  35.790 ms |       0.28% |  42.317 ms |       0.04% |  6.527 ms |  18.24% |   FAIL   |
|  I64  |    UNIFORM     |     0.8     |  36.891 ms |       0.06% |  45.269 ms |       0.03% |  8.378 ms |  22.71% |   FAIL   |
|  I64  |    UNIFORM     |     0.9     |  38.686 ms |       0.04% |  49.043 ms |       0.02% | 10.357 ms |  26.77% |   FAIL   |

# static_set_retrieve_uniform_matching_rate

## [0] Quadro RTX 8000

|  Key  |  Distribution  |  MatchingRate  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |     Diff |   %Diff |  Status  |
|-------|----------------|----------------|------------|-------------|------------|-------------|----------|---------|----------|
|  I32  |    UNIFORM     |      0.1       |  32.647 ms |       0.06% |  40.862 ms |       0.12% | 8.214 ms |  25.16% |   FAIL   |
|  I32  |    UNIFORM     |      0.2       |  32.670 ms |       0.37% |  40.687 ms |       0.21% | 8.017 ms |  24.54% |   FAIL   |
|  I32  |    UNIFORM     |      0.3       |  32.532 ms |       0.10% |  40.397 ms |       0.02% | 7.865 ms |  24.18% |   FAIL   |
|  I32  |    UNIFORM     |      0.4       |  32.472 ms |       0.08% |  40.154 ms |       0.04% | 7.682 ms |  23.66% |   FAIL   |
|  I32  |    UNIFORM     |      0.5       |  32.498 ms |       0.35% |  39.601 ms |       0.06% | 7.103 ms |  21.86% |   FAIL   |
|  I32  |    UNIFORM     |      0.6       |  32.392 ms |       0.05% |  38.879 ms |       0.03% | 6.487 ms |  20.03% |   FAIL   |
|  I32  |    UNIFORM     |      0.7       |  32.340 ms |       0.09% |  37.965 ms |       0.04% | 5.625 ms |  17.39% |   FAIL   |
|  I32  |    UNIFORM     |      0.8       |  32.283 ms |       0.06% |  37.166 ms |       0.07% | 4.883 ms |  15.12% |   FAIL   |
|  I32  |    UNIFORM     |      0.9       |  32.241 ms |       0.06% |  36.486 ms |       0.04% | 4.245 ms |  13.17% |   FAIL   |
|  I32  |    UNIFORM     |       1        |  32.243 ms |       0.14% |  35.845 ms |       0.04% | 3.603 ms |  11.17% |   FAIL   |
|  I64  |    UNIFORM     |      0.1       |  34.905 ms |       0.04% |  43.508 ms |       0.06% | 8.603 ms |  24.65% |   FAIL   |
|  I64  |    UNIFORM     |      0.2       |  34.845 ms |       0.03% |  43.344 ms |       0.16% | 8.500 ms |  24.39% |   FAIL   |
|  I64  |    UNIFORM     |      0.3       |  34.806 ms |       0.05% |  43.120 ms |       0.04% | 8.314 ms |  23.89% |   FAIL   |
|  I64  |    UNIFORM     |      0.4       |  34.754 ms |       0.05% |  42.887 ms |       0.07% | 8.132 ms |  23.40% |   FAIL   |
|  I64  |    UNIFORM     |      0.5       |  34.727 ms |       0.18% |  42.269 ms |       0.04% | 7.543 ms |  21.72% |   FAIL   |
|  I64  |    UNIFORM     |      0.6       |  34.656 ms |       0.04% |  41.548 ms |       0.03% | 6.891 ms |  19.89% |   FAIL   |
|  I64  |    UNIFORM     |      0.7       |  34.599 ms |       0.05% |  40.584 ms |       0.02% | 5.985 ms |  17.30% |   FAIL   |
|  I64  |    UNIFORM     |      0.8       |  34.570 ms |       0.04% |  39.717 ms |       0.03% | 5.147 ms |  14.89% |   FAIL   |
|  I64  |    UNIFORM     |      0.9       |  34.533 ms |       0.06% |  38.977 ms |       0.02% | 4.444 ms |  12.87% |   FAIL   |
|  I64  |    UNIFORM     |       1        |  34.453 ms |       0.05% |  38.263 ms |       0.03% | 3.810 ms |  11.06% |   FAIL   |

# static_set_retrieve_uniform_multiplicity

## [0] Quadro RTX 8000

|  Key  |  Distribution  |  Multiplicity  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |     Diff |   %Diff |  Status  |
|-------|----------------|----------------|------------|-------------|------------|-------------|----------|---------|----------|
|  I32  |    UNIFORM     |       1        |  32.236 ms |       0.11% |  35.795 ms |       0.03% | 3.559 ms |  11.04% |   FAIL   |
|  I32  |    UNIFORM     |       2        |  31.634 ms |       0.12% |  34.570 ms |       0.06% | 2.937 ms |   9.28% |   FAIL   |
|  I32  |    UNIFORM     |       4        |  31.494 ms |       0.28% |  33.824 ms |       0.03% | 2.330 ms |   7.40% |   FAIL   |
|  I32  |    UNIFORM     |       8        |  31.412 ms |       0.05% |  33.685 ms |       0.04% | 2.272 ms |   7.23% |   FAIL   |
|  I32  |    UNIFORM     |       16       |  31.345 ms |       0.07% |  33.585 ms |       0.03% | 2.241 ms |   7.15% |   FAIL   |
|  I64  |    UNIFORM     |       1        |  34.478 ms |       0.05% |  38.306 ms |       0.12% | 3.828 ms |  11.10% |   FAIL   |
|  I64  |    UNIFORM     |       2        |  33.858 ms |       0.04% |  36.990 ms |       0.04% | 3.132 ms |   9.25% |   FAIL   |
|  I64  |    UNIFORM     |       4        |  34.050 ms |       3.52% |  36.166 ms |       0.04% | 2.116 ms |   6.22% |   FAIL   |
|  I64  |    UNIFORM     |       8        |  33.701 ms |       0.04% |  36.028 ms |       0.05% | 2.326 ms |   6.90% |   FAIL   |
|  I64  |    UNIFORM     |       16       |  33.645 ms |       0.11% |  35.898 ms |       0.03% | 2.252 ms |   6.69% |   FAIL   |
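
For anyone reading these tables, the `%Diff` values are consistent with `(Cmp Time - Ref Time) / Ref Time`, so a positive number means the compared run is slower than the reference. A minimal sketch of that arithmetic follows; the helper names `percent_diff` and `speedup` are made up for illustration and are not part of nvbench.

```cpp
#include <cassert>

// Hypothetical helper: relative change of the compared run against the
// reference run, in percent. Positive means the compared run is slower.
double percent_diff(double ref_ms, double cmp_ms)
{
  assert(ref_ms > 0.0);
  return (cmp_ms - ref_ms) / ref_ms * 100.0;
}

// Hypothetical helper: how many times faster the compared run is.
// Values above 1 mean the compared run is faster than the reference.
double speedup(double ref_ms, double cmp_ms)
{
  assert(cmp_ms > 0.0);
  return ref_ms / cmp_ms;
}
```

For the I32 row at occupancy 0.1 (31.708 ms versus 34.022 ms), `percent_diff` gives about 7.30, matching the table.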

On the bright side, I realized that cudf’s distinct inner join can utilize find instead of the costly atomic-bounded retrieve operation, providing noticeable speedups (rapidsai/cudf#17278). This allows us to concentrate specifically on multiset use cases.
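
To illustrate why a distinct-key join can get by with a find-style lookup: when the build keys are unique, each probe key has at most one match, so the result for probe `i` can be written straight to output slot `i` without a shared output counter. The following is a host-side sketch of that idea using `std::unordered_map`. It is not cudf or cuco code, and `distinct_join_find` and `no_match` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Sentinel for probe keys without a match (an assumption of this sketch).
constexpr std::int64_t no_match = -1;

// Hypothetical helper: for each probe key, return the matching build-side
// row index, or no_match. Build keys are assumed to be distinct.
std::vector<std::int64_t> distinct_join_find(std::vector<int> const& build_keys,
                                             std::vector<int> const& probe_keys)
{
  std::unordered_map<int, std::int64_t> index;
  for (std::size_t i = 0; i < build_keys.size(); ++i) {
    bool const inserted =
      index.emplace(build_keys[i], static_cast<std::int64_t>(i)).second;
    assert(inserted);  // duplicate build keys would violate the assumption
    (void)inserted;
  }

  // One output slot per probe key: no counter is needed to place results.
  std::vector<std::int64_t> out(probe_keys.size(), no_match);
  for (std::size_t i = 0; i < probe_keys.size(); ++i) {
    auto const it = index.find(probe_keys[i]);
    if (it != index.end()) { out[i] = it->second; }
  }
  return out;
}
```

A multiset cannot use this shortcut, because one probe key may produce any number of matches and the output positions are not known in advance.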

Is a clean and nice block-wise retrieve API worth a 20% performance slowdown?

@sleeepyjack
Collaborator

> Is a clean and nice block-wise retrieve API worth a 20% performance slowdown?

Can you elaborate on this? I don't think the API granularity is the culprit here as it just mimics the behavior of the former retrieve kernel. There must be something wrong in the implementation itself rather than the way we partition the work among CTAs.

@PointKernel
Member Author

PointKernel commented Nov 8, 2024

> > Is a clean and nice block-wise retrieve API worth a 20% performance slowdown?
>
> Can you elaborate on this? I don't think the API granularity is the culprit here as it just mimics the behavior of the former retrieve kernel. There must be something wrong in the implementation itself rather than the way we partition the work among CTAs.

Based on profiling, the performance difference mainly came from memory transactions and register usage. I highly suspect the former is due to the current buffer flushing strategy:

  1. Buffer size: the current setup requires a flush even if there is only one element in the buffer.
  2. Remainder flushing: this is needed only when all the probing sequences have terminated and there are still elements in the buffer. The legacy implementation performs this operation after the grid-stride loop, right before the kernel terminates. We cannot do the same since buffers are not exposed to the kernel in the current setup. Yesterday I thought a clean final flush was infeasible, but this morning I came up with an idea to try.
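
The two points above can be sketched on the host as a staging buffer that writes out only when it is full, plus a single remainder flush once all probing is done. This is a simplified illustration, not the cuco implementation: `staged_output` is a hypothetical class, a `std::vector` stands in for global memory, and a `std::array` stands in for the shared-memory buffer.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical staging buffer standing in for a shared-memory output buffer.
template <typename T, std::size_t Capacity>
class staged_output {
  static_assert(Capacity > 0, "buffer must hold at least one element");

 public:
  explicit staged_output(std::vector<T>& sink) : sink_{sink} {}

  // Stage one match; write to the sink only when the buffer is full.
  void push(T const& value)
  {
    assert(size_ < Capacity);
    buffer_[size_++] = value;
    if (size_ == Capacity) { flush(); }
  }

  // Remainder flush: call once after all probing sequences have terminated.
  void finish()
  {
    if (size_ > 0) { flush(); }
  }

  std::size_t num_flushes() const { return num_flushes_; }

 private:
  void flush()
  {
    auto const count = static_cast<std::ptrdiff_t>(size_);
    sink_.insert(sink_.end(), buffer_.begin(), buffer_.begin() + count);
    size_ = 0;
    ++num_flushes_;
  }

  std::vector<T>& sink_;           // stand-in for the global-memory output
  std::array<T, Capacity> buffer_{};
  std::size_t size_{0};
  std::size_t num_flushes_{0};
};
```

With a capacity of 4 and 10 matches, this performs two full flushes and one remainder flush, whereas flushing on every element would perform 10 writes.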

Will keep you posted

@sleeepyjack
Collaborator

> The remainder flushing is needed only when all the probing sequences are terminated and there are still elements present in the buffer. The legacy implementation performs this operation right before the kernel termination after the grid stride loop. We cannot do the same since buffers are not exposed to the kernel in the current setup.

Whoops, that is probably my bad. It should be feasible to flush the remaining elements at the end of the device-batch retrieve. The current approach is conceptually not that different from the former retrieve kernel. Instead of defining the shmem buffers at kernel scope and then passing them to the API, we define them directly inside the batch retrieve function, giving us full control over the size and the current state of the buffers.

@PointKernel
Member Author

> It should be feasible to flush the remaining elements at the end of the device-batch retrieve.

Yeah, I've already included that change in this PR and there is no performance impact.

> The current approach is conceptually not that different from the former retrieve kernel.

The coalesced group still seems like the most likely culprit for now. I have a local branch that mimics the behavior of the legacy code but manages all shared memory operations at the block level. The performance is significantly worse (I assume that’s where you started as well 😊). I plan to investigate the root cause of the additional local store/load operations tomorrow.

@PointKernel PointKernel added the In Progress Currently a work in progress label Nov 16, 2024
@PointKernel
Member Author

The OA retrieve is now at most about 3.7% slower than the legacy set retrieve (roughly 1.3% to 3.7% across the benchmarks below).

# static_set_retrieve_uniform_occupancy

## [0] Quadro RTX 8000

|  Key  |  Distribution  |  NumInputs  |  Occupancy  |  MatchingRate  |  Multiplicity  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |       Diff |   %Diff |  Status  |
|-------|----------------|-------------|-------------|----------------|----------------|------------|-------------|------------|-------------|------------|---------|----------|
|  I32  |    UNIFORM     |  100000000  |     0.1     |       1        |       1        |  31.655 ms |       0.16% |  32.138 ms |       0.49% | 483.156 us |   1.53% |   FAIL   |
|  I32  |    UNIFORM     |  100000000  |     0.2     |       1        |       1        |  31.561 ms |       0.03% |  32.031 ms |       0.09% | 470.460 us |   1.49% |   FAIL   |
|  I32  |    UNIFORM     |  100000000  |     0.3     |       1        |       1        |  31.600 ms |       0.04% |  32.059 ms |       0.02% | 459.550 us |   1.45% |   FAIL   |
|  I32  |    UNIFORM     |  100000000  |     0.4     |       1        |       1        |  31.625 ms |       0.05% |  32.098 ms |       0.03% | 472.842 us |   1.50% |   FAIL   |
|  I32  |    UNIFORM     |  100000000  |     0.5     |       1        |       1        |  31.768 ms |       0.02% |  32.279 ms |       0.02% | 510.634 us |   1.61% |   FAIL   |
|  I32  |    UNIFORM     |  100000000  |     0.6     |       1        |       1        |  32.028 ms |       0.09% |  32.539 ms |       0.14% | 510.931 us |   1.60% |   FAIL   |
|  I32  |    UNIFORM     |  100000000  |     0.7     |       1        |       1        |  32.365 ms |       0.05% |  32.888 ms |       0.02% | 522.923 us |   1.62% |   FAIL   |
|  I32  |    UNIFORM     |  100000000  |     0.8     |       1        |       1        |  32.854 ms |       0.05% |  33.434 ms |       0.03% | 580.426 us |   1.77% |   FAIL   |
|  I32  |    UNIFORM     |  100000000  |     0.9     |       1        |       1        |  33.509 ms |       0.05% |  34.158 ms |       0.11% | 648.794 us |   1.94% |   FAIL   |
|  I64  |    UNIFORM     |  100000000  |     0.1     |       1        |       1        |  33.785 ms |       0.09% |  34.664 ms |       0.02% | 878.707 us |   2.60% |   FAIL   |
|  I64  |    UNIFORM     |  100000000  |     0.2     |       1        |       1        |  33.780 ms |       0.02% |  34.640 ms |       0.05% | 860.681 us |   2.55% |   FAIL   |
|  I64  |    UNIFORM     |  100000000  |     0.3     |       1        |       1        |  33.828 ms |       0.28% |  34.688 ms |       0.09% | 860.019 us |   2.54% |   FAIL   |
|  I64  |    UNIFORM     |  100000000  |     0.4     |       1        |       1        |  33.886 ms |       0.08% |  34.794 ms |       0.04% | 907.827 us |   2.68% |   FAIL   |
|  I64  |    UNIFORM     |  100000000  |     0.5     |       1        |       1        |  34.030 ms |       0.02% |  35.002 ms |       0.27% | 971.882 us |   2.86% |   FAIL   |
|  I64  |    UNIFORM     |  100000000  |     0.6     |       1        |       1        |  34.288 ms |       0.04% |  35.342 ms |       0.10% |   1.054 ms |   3.07% |   FAIL   |
|  I64  |    UNIFORM     |  100000000  |     0.7     |       1        |       1        |  34.644 ms |       0.05% |  35.750 ms |       0.06% |   1.107 ms |   3.19% |   FAIL   |
|  I64  |    UNIFORM     |  100000000  |     0.8     |       1        |       1        |  35.143 ms |       0.04% |  36.316 ms |       0.05% |   1.173 ms |   3.34% |   FAIL   |
|  I64  |    UNIFORM     |  100000000  |     0.9     |       1        |       1        |  35.839 ms |       0.04% |  37.148 ms |       0.07% |   1.309 ms |   3.65% |   FAIL   |

# static_set_retrieve_uniform_matching_rate

## [0] Quadro RTX 8000

|  Key  |  Distribution  |  NumInputs  |  Occupancy  |  MatchingRate  |  Multiplicity  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |       Diff |   %Diff |  Status  |
|-------|----------------|-------------|-------------|----------------|----------------|------------|-------------|------------|-------------|------------|---------|----------|
|  I32  |    UNIFORM     |  100000000  |     0.5     |      0.1       |       1        |  32.716 ms |       0.07% |  33.242 ms |       0.02% | 526.718 us |   1.61% |   FAIL   |
|  I32  |    UNIFORM     |  100000000  |     0.5     |      0.2       |       1        |  32.669 ms |       0.05% |  33.209 ms |       0.03% | 539.526 us |   1.65% |   FAIL   |
|  I32  |    UNIFORM     |  100000000  |     0.5     |      0.3       |       1        |  32.619 ms |       0.04% |  33.174 ms |       0.06% | 555.652 us |   1.70% |   FAIL   |
|  I32  |    UNIFORM     |  100000000  |     0.5     |      0.4       |       1        |  32.631 ms |       0.33% |  33.147 ms |       0.03% | 515.483 us |   1.58% |   FAIL   |
|  I32  |    UNIFORM     |  100000000  |     0.5     |      0.5       |       1        |  32.519 ms |       0.15% |  33.007 ms |       0.02% | 487.826 us |   1.50% |   FAIL   |
|  I32  |    UNIFORM     |  100000000  |     0.5     |      0.6       |       1        |  32.387 ms |       0.08% |  32.900 ms |       0.27% | 512.544 us |   1.58% |   FAIL   |
|  I32  |    UNIFORM     |  100000000  |     0.5     |      0.7       |       1        |  32.204 ms |       0.06% |  32.697 ms |       0.05% | 492.261 us |   1.53% |   FAIL   |
|  I32  |    UNIFORM     |  100000000  |     0.5     |      0.8       |       1        |  32.080 ms |       0.06% |  32.546 ms |       0.08% | 466.007 us |   1.45% |   FAIL   |
|  I32  |    UNIFORM     |  100000000  |     0.5     |      0.9       |       1        |  31.951 ms |       0.09% |  32.399 ms |       0.02% | 447.622 us |   1.40% |   FAIL   |
|  I32  |    UNIFORM     |  100000000  |     0.5     |       1        |       1        |  31.825 ms |       0.08% |  32.275 ms |       0.04% | 449.686 us |   1.41% |   FAIL   |
|  I64  |    UNIFORM     |  100000000  |     0.5     |      0.1       |       1        |  35.011 ms |       0.04% |  36.080 ms |       0.03% |   1.069 ms |   3.05% |   FAIL   |
|  I64  |    UNIFORM     |  100000000  |     0.5     |      0.2       |       1        |  34.972 ms |       0.13% |  36.024 ms |       0.02% |   1.051 ms |   3.01% |   FAIL   |
|  I64  |    UNIFORM     |  100000000  |     0.5     |      0.3       |       1        |  34.925 ms |       0.03% |  35.978 ms |       0.04% |   1.052 ms |   3.01% |   FAIL   |
|  I64  |    UNIFORM     |  100000000  |     0.5     |      0.4       |       1        |  34.927 ms |       0.29% |  35.935 ms |       0.03% |   1.008 ms |   2.89% |   FAIL   |
|  I64  |    UNIFORM     |  100000000  |     0.5     |      0.5       |       1        |  34.788 ms |       0.06% |  35.803 ms |       0.01% |   1.016 ms |   2.92% |   FAIL   |
|  I64  |    UNIFORM     |  100000000  |     0.5     |      0.6       |       1        |  34.655 ms |       0.08% |  35.675 ms |       0.28% |   1.020 ms |   2.94% |   FAIL   |
|  I64  |    UNIFORM     |  100000000  |     0.5     |      0.7       |       1        |  34.478 ms |       0.07% |  35.462 ms |       0.03% | 983.448 us |   2.85% |   FAIL   |
|  I64  |    UNIFORM     |  100000000  |     0.5     |      0.8       |       1        |  34.339 ms |       0.04% |  35.327 ms |       0.13% | 987.044 us |   2.87% |   FAIL   |
|  I64  |    UNIFORM     |  100000000  |     0.5     |      0.9       |       1        |  34.222 ms |       0.05% |  35.138 ms |       0.02% | 915.469 us |   2.68% |   FAIL   |
|  I64  |    UNIFORM     |  100000000  |     0.5     |       1        |       1        |  34.099 ms |       0.04% |  35.003 ms |       0.02% | 904.032 us |   2.65% |   FAIL   |

# static_set_retrieve_uniform_multiplicity

## [0] Quadro RTX 8000

|  Key  |  Distribution  |  NumInputs  |  Occupancy  |  MatchingRate  |  Multiplicity  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |       Diff |   %Diff |  Status  |
|-------|----------------|-------------|-------------|----------------|----------------|------------|-------------|------------|-------------|------------|---------|----------|
|  I32  |    UNIFORM     |  100000000  |     0.5     |       1        |       1        |  31.872 ms |       0.05% |  32.286 ms |       0.04% | 414.806 us |   1.30% |   FAIL   |
|  I32  |    UNIFORM     |  100000000  |     0.5     |       1        |       2        |  31.635 ms |       0.04% |  32.042 ms |       0.08% | 407.646 us |   1.29% |   FAIL   |
|  I32  |    UNIFORM     |  100000000  |     0.5     |       1        |       4        |  31.487 ms |       0.05% |  31.886 ms |       0.06% | 398.762 us |   1.27% |   FAIL   |
|  I32  |    UNIFORM     |  100000000  |     0.5     |       1        |       8        |  31.446 ms |       0.04% |  31.848 ms |       0.16% | 401.280 us |   1.28% |   FAIL   |
|  I32  |    UNIFORM     |  100000000  |     0.5     |       1        |       16       |  31.354 ms |       0.08% |  31.793 ms |       0.22% | 439.230 us |   1.40% |   FAIL   |
|  I64  |    UNIFORM     |  100000000  |     0.5     |       1        |       1        |  34.104 ms |       0.05% |  35.028 ms |       0.30% | 924.195 us |   2.71% |   FAIL   |
|  I64  |    UNIFORM     |  100000000  |     0.5     |       1        |       2        |  33.879 ms |       0.03% |  34.714 ms |       0.02% | 834.471 us |   2.46% |   FAIL   |
|  I64  |    UNIFORM     |  100000000  |     0.5     |       1        |       4        |  33.761 ms |       0.03% |  34.549 ms |       0.05% | 788.689 us |   2.34% |   FAIL   |
|  I64  |    UNIFORM     |  100000000  |     0.5     |       1        |       8        |  33.719 ms |       0.08% |  34.480 ms |       0.02% | 761.383 us |   2.26% |   FAIL   |
|  I64  |    UNIFORM     |  100000000  |     0.5     |       1        |       16       |  33.633 ms |       0.03% |  34.413 ms |       0.02% | 780.286 us |   2.32% |   FAIL   |

For multisets, it is about 10% to 40% faster than the current open-addressing retrieve solution; the measured runtime reductions below range from about 8% to 55%.

# static_multiset_retrieve_uniform_occupancy

## [0] Quadro RTX 8000

|  Key  |  Distribution  |  NumInputs  |  Occupancy  |  MatchingRate  |  Multiplicity  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |           Diff |   %Diff |  Status  |
|-------|----------------|-------------|-------------|----------------|----------------|------------|-------------|------------|-------------|----------------|---------|----------|
|  I32  |    UNIFORM     |  100000000  |     0.1     |       1        |       1        |  38.650 ms |       0.14% |  33.982 ms |       0.43% |   -4668.478 us | -12.08% |   FAIL   |
|  I32  |    UNIFORM     |  100000000  |     0.2     |       1        |       1        |  42.247 ms |       0.05% |  34.941 ms |       3.81% |   -7305.580 us | -17.29% |   FAIL   |
|  I32  |    UNIFORM     |  100000000  |     0.3     |       1        |       1        |  50.699 ms |       0.06% |  36.139 ms |       0.03% |  -14559.651 us | -28.72% |   FAIL   |
|  I32  |    UNIFORM     |  100000000  |     0.4     |       1        |       1        |  62.085 ms |       0.02% |  38.581 ms |       0.03% |  -23504.343 us | -37.86% |   FAIL   |
|  I32  |    UNIFORM     |  100000000  |     0.5     |       1        |       1        |  74.446 ms |       0.16% |  42.371 ms |       0.08% |  -32075.106 us | -43.09% |   FAIL   |
|  I32  |    UNIFORM     |  100000000  |     0.6     |       1        |       1        |  88.869 ms |       0.15% |  48.364 ms |       0.22% |  -40504.642 us | -45.58% |   FAIL   |
|  I32  |    UNIFORM     |  100000000  |     0.7     |       1        |       1        | 109.835 ms |       0.03% |  57.400 ms |       0.12% |  -52434.290 us | -47.74% |   FAIL   |
|  I32  |    UNIFORM     |  100000000  |     0.8     |       1        |       1        | 148.494 ms |       0.04% |  73.864 ms |       0.14% |  -74629.652 us | -50.26% |   FAIL   |
|  I32  |    UNIFORM     |  100000000  |     0.9     |       1        |       1        | 258.161 ms |       0.05% | 117.360 ms |       0.10% | -140800.405 us | -54.54% |   FAIL   |
|  I64  |    UNIFORM     |  100000000  |     0.1     |       1        |       1        |  41.075 ms |       0.06% |  37.750 ms |       0.03% |   -3325.859 us |  -8.10% |   FAIL   |
|  I64  |    UNIFORM     |  100000000  |     0.2     |       1        |       1        |  46.242 ms |       5.98% |  39.479 ms |       5.77% |   -6762.625 us | -14.62% |   FAIL   |
|  I64  |    UNIFORM     |  100000000  |     0.3     |       1        |       1        |  54.064 ms |       0.02% |  40.101 ms |       0.05% |  -13963.470 us | -25.83% |   FAIL   |
|  I64  |    UNIFORM     |  100000000  |     0.4     |       1        |       1        |  66.349 ms |       0.04% |  42.776 ms |       0.09% |  -23572.833 us | -35.53% |   FAIL   |
|  I64  |    UNIFORM     |  100000000  |     0.5     |       1        |       1        |  79.403 ms |       0.07% |  46.942 ms |       0.24% |  -32461.705 us | -40.88% |   FAIL   |
|  I64  |    UNIFORM     |  100000000  |     0.6     |       1        |       1        |  94.554 ms |       0.08% |  53.216 ms |       0.07% |  -41338.752 us | -43.72% |   FAIL   |
|  I64  |    UNIFORM     |  100000000  |     0.7     |       1        |       1        | 116.591 ms |       0.08% |  62.678 ms |       0.07% |  -53913.335 us | -46.24% |   FAIL   |
|  I64  |    UNIFORM     |  100000000  |     0.8     |       1        |       1        | 158.597 ms |       0.06% |  79.747 ms |       0.17% |  -78850.020 us | -49.72% |   FAIL   |
|  I64  |    UNIFORM     |  100000000  |     0.9     |       1        |       1        | 278.029 ms |       0.06% | 124.728 ms |       0.14% | -153301.718 us | -55.14% |   FAIL   |

# static_multiset_retrieve_uniform_matching_rate

## [0] Quadro RTX 8000

|  Key  |  Distribution  |  NumInputs  |  Occupancy  |  MatchingRate  |  Multiplicity  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |          Diff |   %Diff |  Status  |
|-------|----------------|-------------|-------------|----------------|----------------|------------|-------------|------------|-------------|---------------|---------|----------|
|  I32  |    UNIFORM     |  100000000  |     0.5     |      0.1       |       1        |  57.847 ms |       0.05% |  36.902 ms |       0.05% | -20944.641 us | -36.21% |   FAIL   |
|  I32  |    UNIFORM     |  100000000  |     0.5     |      0.2       |       1        |  58.533 ms |       0.07% |  37.109 ms |       0.09% | -21424.232 us | -36.60% |   FAIL   |
|  I32  |    UNIFORM     |  100000000  |     0.5     |      0.3       |       1        |  59.248 ms |       0.10% |  37.319 ms |       0.31% | -21928.637 us | -37.01% |   FAIL   |
|  I32  |    UNIFORM     |  100000000  |     0.5     |      0.4       |       1        |  59.944 ms |       0.09% |  37.462 ms |       0.05% | -22481.984 us | -37.50% |   FAIL   |
|  I32  |    UNIFORM     |  100000000  |     0.5     |      0.5       |       1        |  61.782 ms |       0.10% |  38.002 ms |       0.06% | -23780.763 us | -38.49% |   FAIL   |
|  I32  |    UNIFORM     |  100000000  |     0.5     |      0.6       |       1        |  63.841 ms |       0.08% |  38.612 ms |       0.04% | -25228.463 us | -39.52% |   FAIL   |
|  I32  |    UNIFORM     |  100000000  |     0.5     |      0.7       |       1        |  66.673 ms |       0.08% |  39.531 ms |       0.07% | -27142.926 us | -40.71% |   FAIL   |
|  I32  |    UNIFORM     |  100000000  |     0.5     |      0.8       |       1        |  69.311 ms |       0.12% |  40.475 ms |       0.06% | -28836.524 us | -41.60% |   FAIL   |
|  I32  |    UNIFORM     |  100000000  |     0.5     |      0.9       |       1        |  71.696 ms |       0.15% |  41.412 ms |       0.08% | -30284.182 us | -42.24% |   FAIL   |
|  I32  |    UNIFORM     |  100000000  |     0.5     |       1        |       1        |  74.195 ms |       0.11% |  42.482 ms |       0.05% | -31712.819 us | -42.74% |   FAIL   |
|  I64  |    UNIFORM     |  100000000  |     0.5     |      0.1       |       1        |  61.779 ms |       0.02% |  40.839 ms |       0.06% | -20940.799 us | -33.90% |   FAIL   |
|  I64  |    UNIFORM     |  100000000  |     0.5     |      0.2       |       1        |  62.626 ms |       0.05% |  41.049 ms |       0.17% | -21577.183 us | -34.45% |   FAIL   |
|  I64  |    UNIFORM     |  100000000  |     0.5     |      0.3       |       1        |  63.385 ms |       0.03% |  41.300 ms |       0.23% | -22085.006 us | -34.84% |   FAIL   |
|  I64  |    UNIFORM     |  100000000  |     0.5     |      0.4       |       1        |  64.189 ms |       0.01% |  41.489 ms |       0.09% | -22699.447 us | -35.36% |   FAIL   |
|  I64  |    UNIFORM     |  100000000  |     0.5     |      0.5       |       1        |  66.202 ms |       0.02% |  42.068 ms |       0.03% | -24133.859 us | -36.45% |   FAIL   |
|  I64  |    UNIFORM     |  100000000  |     0.5     |      0.6       |       1        |  68.467 ms |       0.18% |  42.806 ms |       0.08% | -25660.686 us | -37.48% |   FAIL   |
|  I64  |    UNIFORM     |  100000000  |     0.5     |      0.7       |       1        |  71.496 ms |       0.03% |  43.848 ms |       0.05% | -27647.586 us | -38.67% |   FAIL   |
|  I64  |    UNIFORM     |  100000000  |     0.5     |      0.8       |       1        |  74.177 ms |       0.02% |  44.881 ms |       0.06% | -29296.193 us | -39.49% |   FAIL   |
|  I64  |    UNIFORM     |  100000000  |     0.5     |      0.9       |       1        |  76.677 ms |       0.03% |  45.889 ms |       0.04% | -30788.112 us | -40.15% |   FAIL   |
|  I64  |    UNIFORM     |  100000000  |     0.5     |       1        |       1        |  79.335 ms |       0.04% |  47.076 ms |       0.05% | -32258.921 us | -40.66% |   FAIL   |

# static_multiset_retrieve_uniform_multiplicity

## [0] Quadro RTX 8000

|  Key  |  Distribution  |  NumInputs  |  Occupancy  |  MatchingRate  |  Multiplicity  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |          Diff |   %Diff |  Status  |
|-------|----------------|-------------|-------------|----------------|----------------|------------|-------------|------------|-------------|---------------|---------|----------|
|  I32  |    UNIFORM     |  100000000  |     0.5     |       1        |       1        |  74.280 ms |       0.13% |  42.491 ms |       0.07% | -31788.668 us | -42.80% |   FAIL   |
|  I32  |    UNIFORM     |  100000000  |     0.5     |       1        |       2        |  82.512 ms |       0.08% |  46.722 ms |       0.26% | -35790.760 us | -43.38% |   FAIL   |
|  I32  |    UNIFORM     |  100000000  |     0.5     |       1        |       4        | 105.811 ms |       0.12% |  62.029 ms |       0.05% | -43782.403 us | -41.38% |   FAIL   |
|  I32  |    UNIFORM     |  100000000  |     0.5     |       1        |       8        | 142.387 ms |       0.07% |  89.924 ms |       0.07% | -52462.611 us | -36.85% |   FAIL   |
|  I32  |    UNIFORM     |  100000000  |     0.5     |       1        |       16       | 204.460 ms |       0.09% | 138.475 ms |       0.11% | -65985.654 us | -32.27% |   FAIL   |
|  I64  |    UNIFORM     |  100000000  |     0.5     |       1        |       1        |  79.304 ms |       0.03% |  47.139 ms |       0.19% | -32165.628 us | -40.56% |   FAIL   |
|  I64  |    UNIFORM     |  100000000  |     0.5     |       1        |       2        |  87.966 ms |       0.14% |  51.644 ms |       0.07% | -36321.473 us | -41.29% |   FAIL   |
|  I64  |    UNIFORM     |  100000000  |     0.5     |       1        |       4        | 112.007 ms |       0.02% |  67.364 ms |       0.20% | -44642.472 us | -39.86% |   FAIL   |
|  I64  |    UNIFORM     |  100000000  |     0.5     |       1        |       8        | 150.420 ms |       0.02% |  96.340 ms |       0.07% | -54079.837 us | -35.95% |   FAIL   |
|  I64  |    UNIFORM     |  100000000  |     0.5     |       1        |       16       | 216.731 ms |       0.05% | 149.005 ms |       0.08% | -67726.028 us | -31.25% |   FAIL   |

@PointKernel PointKernel added helps: rapids Helps or needed by RAPIDS Needs Review Awaiting reviews before merging and removed In Progress Currently a work in progress labels Nov 18, 2024
@PointKernel PointKernel marked this pull request as ready for review November 18, 2024 23:24

copy-pr-bot bot commented Nov 18, 2024

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@PointKernel
Member Author

/ok to test

@PointKernel
Member Author

/ok to test

@sleeepyjack
Collaborator

/ok to test

@PointKernel
Member Author

/merge

@PointKernel PointKernel merged commit d829576 into NVIDIA:dev Nov 21, 2024
18 checks passed
@PointKernel PointKernel deleted the improve-retrieve branch November 21, 2024 17:18
Labels
helps: rapids Helps or needed by RAPIDS Needs Review Awaiting reviews before merging topic: static_set Issue related to the static_set type: improvement Improvement / enhancement to an existing function
2 participants