Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixes a performance regression in FST #13850

Merged

Conversation

elstehle
Copy link
Contributor

Description

#13344 introduced a performance regression to the FST benchmarks that showed as much as a 35% performance degradation.

It seems that, after the refactor in the above PR, compiler optimization heuristics are deciding differently on loop unrolling in the part of the FST that's writing out transduced symbols.

As a fix, we are enforcing to not unroll that loop.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@elstehle elstehle requested a review from a team as a code owner August 11, 2023 11:49
@elstehle elstehle requested review from karthikeyann and vuule August 11, 2023 11:49
@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Aug 11, 2023
@elstehle
Copy link
Contributor Author

elstehle commented Aug 11, 2023


The following numbers compare performance between current branch-23.10 and this fix of FST_BENCH:


## [0] Tesla V100-SXM2-32GB

|  string_size  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |          Diff |   %Diff |  Status  |
|---------------|------------|-------------|------------|-------------|---------------|---------|----------|
|     2^20      | 112.026 us |       6.10% | 109.783 us |       6.65% |     -2.243 us |  -2.00% |   PASS   |
|     2^21      | 205.869 us |       2.51% | 173.607 us |       2.20% |    -32.262 us | -15.67% |   FAIL   |
|     2^22      | 670.395 us |       5.24% | 331.522 us |       5.73% |   -338.873 us | -50.55% |   FAIL   |
|     2^23      |   1.143 ms |       7.48% | 619.904 us |       3.66% |   -522.665 us | -45.74% |   FAIL   |
|     2^24      |   2.004 ms |       7.44% |   1.168 ms |       4.66% |   -835.960 us | -41.72% |   FAIL   |
|     2^25      |   3.649 ms |       7.49% |   2.260 ms |       1.19% |  -1389.762 us | -38.08% |   FAIL   |
|     2^26      |   6.893 ms |       6.06% |   4.453 ms |       0.82% |  -2439.141 us | -35.39% |   FAIL   |
|     2^27      |  13.287 ms |       4.44% |   8.802 ms |       0.42% |  -4484.298 us | -33.75% |   FAIL   |
|     2^28      |  26.013 ms |       2.94% |  17.515 ms |       0.37% |  -8497.370 us | -32.67% |   FAIL   |
|     2^29      |  51.290 ms |       2.21% |  34.964 ms |       0.28% | -16326.015 us | -31.83% |   FAIL   |
|     2^30      | 102.488 ms |       1.68% |  70.101 ms |       1.45% | -32386.937 us | -31.60% |   FAIL   |

# FST_JSON_no_outidx

## [0] Tesla V100-SXM2-32GB

|  string_size  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |       Diff |   %Diff |  Status  |
|---------------|------------|-------------|------------|-------------|------------|---------|----------|
|     2^20      |  67.092 us |       3.05% |  59.620 us |       3.91% |  -7.472 us | -11.14% |   FAIL   |
|     2^21      | 100.143 us |       2.52% |  87.097 us |       1.19% | -13.045 us | -13.03% |   FAIL   |
|     2^22      | 159.329 us |       1.93% | 152.840 us |       2.38% |  -6.488 us |  -4.07% |   FAIL   |
|     2^23      | 273.194 us |       1.11% | 262.936 us |       1.41% | -10.258 us |  -3.75% |   FAIL   |
|     2^24      | 490.196 us |       0.84% | 476.839 us |       1.09% | -13.357 us |  -2.72% |   FAIL   |
|     2^25      | 923.605 us |       0.43% | 924.554 us |       0.82% |   0.949 us |   0.10% |   PASS   |
|     2^26      |   1.794 ms |       0.25% |   1.793 ms |       0.45% |  -0.800 us |  -0.04% |   PASS   |
|     2^27      |   3.520 ms |       0.16% |   3.531 ms |       0.30% |  11.120 us |   0.32% |   FAIL   |
|     2^28      |   6.967 ms |       0.11% |   6.989 ms |       0.13% |  21.575 us |   0.31% |   FAIL   |
|     2^29      |  13.869 ms |       0.07% |  13.912 ms |       0.10% |  42.916 us |   0.31% |   FAIL   |
|     2^30      |  27.670 ms |       0.03% |  27.763 ms |       0.09% |  93.022 us |   0.34% |   FAIL   |

# FST_JSON_no_out

## [0] Tesla V100-SXM2-32GB

|  string_size  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |      Diff |   %Diff |  Status  |
|---------------|------------|-------------|------------|-------------|-----------|---------|----------|
|     2^20      |  35.246 us |       7.49% |  35.645 us |       6.54% |  0.399 us |   1.13% |   PASS   |
|     2^21      |  39.790 us |       6.34% |  40.562 us |       2.49% |  0.773 us |   1.94% |   PASS   |
|     2^22      |  53.282 us |       4.92% |  52.691 us |       2.64% | -0.591 us |  -1.11% |   PASS   |
|     2^23      |  76.475 us |       3.36% |  76.386 us |       1.73% | -0.089 us |  -0.12% |   PASS   |
|     2^24      | 115.929 us |       2.06% | 116.312 us |       1.28% |  0.383 us |   0.33% |   PASS   |
|     2^25      | 193.174 us |       1.39% | 192.929 us |       0.88% | -0.245 us |  -0.13% |   PASS   |
|     2^26      | 340.896 us |       0.70% | 342.582 us |       0.69% |  1.685 us |   0.49% |   PASS   |
|     2^27      | 637.203 us |       0.54% | 633.735 us |       0.50% | -3.468 us |  -0.54% |   FAIL   |
|     2^28      |   1.223 ms |       0.32% |   1.224 ms |       0.28% |  0.160 us |   0.01% |   PASS   |
|     2^29      |   2.399 ms |       0.23% |   2.399 ms |       0.21% |  0.005 us |   0.00% |   PASS   |
|     2^30      |   4.723 ms |       0.15% |   4.738 ms |       0.14% | 15.486 us |   0.33% |   FAIL   |

# FST_JSON_no_str

## [0] Tesla V100-SXM2-32GB

|  string_size  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |        Diff |   %Diff |  Status  |
|---------------|------------|-------------|------------|-------------|-------------|---------|----------|
|     2^20      |  78.349 us |       4.82% |  76.602 us |       1.03% |   -1.746 us |  -2.23% |   FAIL   |
|     2^21      | 125.573 us |       2.39% | 119.507 us |       0.80% |   -6.067 us |  -4.83% |   FAIL   |
|     2^22      | 271.541 us |       9.04% | 212.044 us |       1.43% |  -59.497 us | -21.91% |   FAIL   |
|     2^23      | 463.312 us |       7.68% | 387.597 us |       1.13% |  -75.715 us | -16.34% |   FAIL   |
|     2^24      | 804.782 us |       5.27% | 720.793 us |       0.84% |  -83.989 us | -10.44% |   FAIL   |
|     2^25      |   1.476 ms |       2.38% |   1.407 ms |       0.57% |  -68.566 us |  -4.65% |   FAIL   |
|     2^26      |   2.826 ms |       1.28% |   2.756 ms |       0.35% |  -69.783 us |  -2.47% |   FAIL   |
|     2^27      |   5.502 ms |       0.70% |   5.454 ms |       0.25% |  -47.536 us |  -0.86% |   FAIL   |
|     2^28      |  10.867 ms |       0.34% |  10.837 ms |       0.26% |  -30.415 us |  -0.28% |   FAIL   |
|     2^29      |  21.625 ms |       0.21% |  21.602 ms |       0.16% |  -22.571 us |  -0.10% |   PASS   |
|     2^30      |  43.166 ms |       1.46% |  42.702 ms |       1.95% | -464.150 us |  -1.08% |   PASS   |

@elstehle
Copy link
Contributor Author

The following numbers compare performance between current branch-23.10 and this fix of NESTED_JSON_TEST:

## [0] Tesla V100-SXM2-32GB

|  string_size  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |       Diff |   %Diff |  Status  |
|---------------|------------|-------------|------------|-------------|------------|---------|----------|
|     2^20      |   5.073 ms |       6.72% |   5.034 ms |       7.51% | -38.284 us |  -0.75% |   PASS   |
|     2^21      |   5.210 ms |       4.17% |   5.207 ms |       3.99% |  -2.199 us |  -0.04% |   PASS   |
|     2^22      |   5.469 ms |       4.55% |   5.466 ms |       3.11% |  -3.510 us |  -0.06% |   PASS   |
|     2^23      |   6.470 ms |       2.81% |   6.484 ms |       2.21% |  13.069 us |   0.20% |   PASS   |
|     2^24      |   8.576 ms |       2.39% |   8.587 ms |       3.86% |  11.227 us |   0.13% |   PASS   |
|     2^25      |  12.630 ms |       2.92% |  12.657 ms |       1.42% |  26.841 us |   0.21% |   PASS   |
|     2^26      |  20.738 ms |       1.41% |  20.880 ms |       1.16% | 141.862 us |   0.68% |   PASS   |
|     2^27      |  36.554 ms |       1.42% |  36.788 ms |       1.65% | 233.606 us |   0.64% |   PASS   |
|     2^28      |  68.380 ms |       1.23% |  68.704 ms |       0.43% | 324.144 us |   0.47% |   FAIL   |
|     2^29      | 131.338 ms |       0.48% | 132.373 ms |       1.24% |   1.035 ms |   0.79% |   FAIL   |
|     2^30      | 258.338 ms |       1.23% | 259.946 ms |       1.35% |   1.607 ms |   0.62% |   PASS   |

# nested_json_gpu_parser_depth

## [0] Tesla V100-SXM2-32GB

|  depth  |  string_size  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |        Diff |   %Diff |  Status  |
|---------|---------------|------------|-------------|------------|-------------|-------------|---------|----------|
|   2^1   |     2^20      |   5.141 ms |       5.67% |   5.165 ms |       5.05% |   23.745 us |   0.46% |   PASS   |
|   2^2   |     2^20      |   5.187 ms |       2.30% |   5.191 ms |       2.68% |    4.169 us |   0.08% |   PASS   |
|   2^3   |     2^20      |  11.176 ms |       2.61% |  11.162 ms |       1.24% |  -14.760 us |  -0.13% |   PASS   |
|   2^4   |     2^20      |  13.422 ms |       0.34% |  13.410 ms |       0.50% |  -11.419 us |  -0.09% |   PASS   |
|   2^1   |     2^22      |   6.056 ms |       2.02% |   6.054 ms |       2.30% |   -1.865 us |  -0.03% |   PASS   |
|   2^2   |     2^22      |   6.069 ms |       2.88% |   6.048 ms |       2.27% |  -21.103 us |  -0.35% |   PASS   |
|   2^3   |     2^22      |  12.037 ms |      25.05% |  11.826 ms |       1.32% | -211.283 us |  -1.76% |   FAIL   |
|   2^4   |     2^22      |  14.369 ms |       2.26% |  14.316 ms |       1.29% |  -52.624 us |  -0.37% |   PASS   |
|   2^1   |     2^24      |  10.481 ms |       1.59% |  10.502 ms |       0.48% |   21.378 us |   0.20% |   PASS   |
|   2^2   |     2^24      |  10.437 ms |       0.50% |  10.520 ms |       1.41% |   83.349 us |   0.80% |   FAIL   |
|   2^3   |     2^24      |  15.868 ms |       0.50% |  15.974 ms |       1.13% |  105.060 us |   0.66% |   FAIL   |
|   2^4   |     2^24      |  19.563 ms |       0.27% |  19.730 ms |       1.16% |  167.066 us |   0.85% |   FAIL   |
|   2^1   |     2^26      |  27.363 ms |       0.94% |  27.674 ms |       1.38% |  310.813 us |   1.14% |   FAIL   |
|   2^2   |     2^26      |  27.270 ms |       0.48% |  27.670 ms |       0.87% |  400.692 us |   1.47% |   FAIL   |
|   2^3   |     2^26      |  34.320 ms |       0.94% |  34.688 ms |       0.41% |  367.849 us |   1.07% |   FAIL   |
|   2^4   |     2^26      |  43.526 ms |       0.50% |  44.068 ms |       0.50% |  542.234 us |   1.25% |   FAIL   |
|   2^1   |     2^28      |  94.395 ms |       0.50% |  95.726 ms |       0.36% |    1.331 ms |   1.41% |   FAIL   |
|   2^2   |     2^28      |  94.669 ms |       0.46% |  95.867 ms |       0.74% |    1.198 ms |   1.27% |   FAIL   |
|   2^3   |     2^28      | 124.110 ms |       0.10% | 126.004 ms |       0.16% |    1.894 ms |   1.53% |   FAIL   |
|   2^4   |     2^28      | 158.065 ms |       0.28% | 160.069 ms |       0.18% |    2.004 ms |   1.27% |   FAIL   |
|   2^1   |     2^30      | 364.525 ms |       0.50% | 370.193 ms |       0.59% |    5.669 ms |   1.56% |   FAIL   |
|   2^2   |     2^30      | 363.589 ms |       0.25% | 369.947 ms |       0.45% |    6.358 ms |   1.75% |   FAIL   |
|   2^3   |     2^30      | 473.029 ms |       0.07% | 479.814 ms |       0.13% |    6.785 ms |   1.43% |   FAIL   |
|   2^4   |     2^30      | 606.921 ms |       0.11% | 615.830 ms |       0.19% |    8.909 ms |   1.47% |   FAIL   |

# json_read_data_type

## [0] Tesla V100-SXM2-32GB

|  data_type  |      io       |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |       Diff |   %Diff |  Status  |
|-------------|---------------|------------|-------------|------------|-------------|------------|---------|----------|
|    FLOAT    | DEVICE_BUFFER | 707.806 ms |       0.09% | 715.990 ms |       0.14% |   8.185 ms |   1.16% |   FAIL   |
|   DECIMAL   | DEVICE_BUFFER | 835.115 ms |       0.07% | 836.620 ms |       0.06% |   1.505 ms |   0.18% |   FAIL   |
|   STRING    | DEVICE_BUFFER | 323.414 ms |       0.14% | 325.918 ms |       0.09% |   2.504 ms |   0.77% |   FAIL   |
|    LIST     | DEVICE_BUFFER | 229.349 ms |       0.18% | 229.249 ms |       0.26% | -99.551 us |  -0.04% |   PASS   |
|   STRUCT    | DEVICE_BUFFER | 884.349 ms |       0.07% | 896.116 ms |       0.10% |  11.766 ms |   1.33% |   FAIL   |

# json_read_io

## [0] Tesla V100-SXM2-32GB

|      io       |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |         Diff |   %Diff |  Status  |
|---------------|------------|-------------|------------|-------------|--------------|---------|----------|
|   FILEPATH    | 621.109 ms |       2.93% | 625.018 ms |       1.80% |     3.909 ms |   0.63% |   PASS   |
|  HOST_BUFFER  | 397.674 ms |       0.05% | 393.501 ms |       0.37% | -4173.206 us |  -1.05% |   FAIL   |
| DEVICE_BUFFER | 256.134 ms |       0.09% | 255.866 ms |       0.12% |  -267.999 us |  -0.10% |   FAIL   |

@elstehle elstehle added 3 - Ready for Review Ready for review by team cuIO cuIO issue improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Aug 11, 2023
@elstehle
Copy link
Contributor Author

rerun tests

@davidwendt
Copy link
Contributor

rerun tests

I don't think this works anymore.
Once all the actions have finished you can go to any of the Details links and there will be a Rerun failed jobs option you can select in the upper-right of the page.

@elstehle
Copy link
Contributor Author

/merge

@rapids-bot rapids-bot bot merged commit 6a407cf into rapidsai:branch-23.10 Aug 11, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team cuIO cuIO issue improvement Improvement / enhancement to an existing function libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants