Performance improvement for nvtext tokenize/token functions #13480

Merged: 31 commits, Jun 29, 2023

@davidwendt (Contributor) commented May 31, 2023

Description

Improves performance for nvtext tokenize functions by minimizing character counting in the `characters_tokenizer` utility functor in `src/text/utilities/tokenize_ops.cuh`.

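Functions this change affects include `nvtext::count_tokens`, `nvtext::tokenize`, `nvtext::replace_tokens`, `nvtext::normalize_spaces`, and `nvtext::ngrams_tokenize`; benchmark results for each are posted in the comments below.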

This change improved performance by at least 10% for all string lengths for most of these functions.

Reference #13048

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@davidwendt added the 2 - In Progress, libcudf, strings, improvement, and non-breaking labels May 31, 2023
@davidwendt self-assigned this May 31, 2023
@davidwendt (Contributor Author) commented Jun 7, 2023

Performance results for nvtext::count_tokens

| width |  rows    |   Ref Time |   New Time |         Diff |   %Diff |
|-------|----------|------------|------------| -------------|---------|
|   32  |     4096 |  48.511 us |  40.038 us |    -8.473 us | -17.47% |
|   64  |     4096 |  65.747 us |  52.758 us |   -12.989 us | -19.76% |
|   128 |     4096 | 116.317 us |  93.920 us |   -22.397 us | -19.25% |
|   256 |     4096 | 255.196 us | 204.905 us |   -50.291 us | -19.71% |
|   512 |     4096 | 488.596 us | 403.411 us |   -85.184 us | -17.43% |
|  1024 |     4096 |   1.081 ms | 833.339 us |  -247.201 us | -22.88% |
|   32  |    32768 |  46.373 us |  38.618 us |    -7.755 us | -16.72% |
|   64  |    32768 |  71.724 us |  56.370 us |   -15.354 us | -21.41% |
|   128 |    32768 | 126.872 us | 101.626 us |   -25.246 us | -19.90% |
|   256 |    32768 | 257.506 us | 209.754 us |   -47.752 us | -18.54% |
|   512 |    32768 | 544.384 us | 451.347 us |   -93.037 us | -17.09% |
|  1024 |    32768 |   1.166 ms | 907.933 us |  -257.842 us | -22.12% |
|   32  |   262144 |  85.930 us |  63.813 us |   -22.116 us | -25.74% |
|   64  |   262144 | 145.192 us | 110.141 us |   -35.051 us | -24.14% |
|   128 |   262144 | 297.499 us | 225.087 us |   -72.412 us | -24.34% |
|   256 |   262144 | 872.486 us | 655.518 us |  -216.968 us | -24.87% |
|   512 |   262144 |   2.087 ms |   1.425 ms |  -661.474 us | -31.70% |
|  1024 |   262144 |   4.234 ms |   2.819 ms | -1415.158 us | -33.42% |
|   32  |  2097152 | 385.129 us | 240.484 us |  -144.645 us | -37.56% |
|   64  |  2097152 | 702.797 us | 476.196 us |  -226.601 us | -32.24% |
|   128 |  2097152 |   1.489 ms |   1.069 ms |  -420.012 us | -28.22% |
|   256 |  2097152 |   3.840 ms |   2.813 ms | -1027.364 us | -26.75% |
|   512 |  2097152 |   9.474 ms |   6.313 ms | -3160.805 us | -33.36% |
|   32  | 16777216 |   2.818 ms |   1.658 ms | -1160.484 us | -41.18% |
|   64  | 16777216 |   5.211 ms |   3.417 ms | -1793.785 us | -34.43% |
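
The Diff and %Diff columns in these tables appear to be New Time minus Ref Time, and that difference divided by Ref Time. Here is a minimal C++ sketch of that arithmetic, assuming both times are in the same unit. The helper name `percent_diff` is hypothetical and is not part of the benchmark tooling.

```cpp
#include <cassert>

// Hypothetical helper (an assumption, not part of the benchmark tooling):
// relative change of the new time against the reference time, in percent.
// Negative values mean the new code is faster.
inline double percent_diff(double ref_time, double new_time)
{
  assert(ref_time > 0.0);  // both times must use the same unit
  return (new_time - ref_time) / ref_time * 100.0;
}
```

For the first row above, `percent_diff(48.511, 40.038)` gives roughly -17.47. Rows whose times are printed in ms are rounded to three decimals, so recomputing from the displayed values may differ slightly from the %Diff shown.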

@davidwendt (Contributor Author)

Performance results for nvtext::tokenize

| width |    rows  |   Ref Time |   New Time |         Diff |   %Diff |
|-------|----------| -----------|------------| -------------|---------|
|   32  |     4096 | 146.036 us | 129.399 us |   -16.637 us | -11.39% |
|   64  |     4096 | 186.565 us | 159.696 us |   -26.869 us | -14.40% |
|   128 |     4096 | 302.437 us | 255.658 us |   -46.780 us | -15.47% |
|   256 |     4096 | 602.833 us | 489.313 us |  -113.520 us | -18.83% |
|   512 |     4096 |   1.008 ms | 796.955 us |  -211.290 us | -20.96% |
|  1024 |     4096 |   2.183 ms |   1.595 ms |  -588.048 us | -26.94% |
|   32  |    32768 | 145.114 us | 129.163 us |   -15.952 us | -10.99% |
|   64  |    32768 | 211.849 us | 181.935 us |   -29.915 us | -14.12% |
|   128 |    32768 | 375.236 us | 321.367 us |   -53.870 us | -14.36% |
|   256 |    32768 | 738.987 us | 624.949 us |  -114.038 us | -15.43% |
|   512 |    32768 |   1.181 ms | 989.479 us |  -191.639 us | -16.23% |
|  1024 |    32768 |   2.502 ms |   1.960 ms |  -542.015 us | -21.66% |
|   32  |   262144 | 303.311 us | 256.434 us |   -46.877 us | -15.46% |
|   64  |   262144 | 563.373 us | 487.956 us |   -75.418 us | -13.39% |
|   128 |   262144 |   1.334 ms |   1.185 ms |  -149.070 us | -11.17% |
|   256 |   262144 |   4.713 ms |   4.149 ms |  -563.564 us | -11.96% |
|   512 |   262144 |   4.717 ms |   3.236 ms | -1481.115 us | -31.40% |
|  1024 |   262144 |   9.552 ms |   6.052 ms | -3499.788 us | -36.64% |
|   32  |  2097152 |   1.492 ms |   1.192 ms |  -300.078 us | -20.11% |
|   64  |  2097152 |   3.206 ms |   2.741 ms |  -464.831 us | -14.50% |
|   128 |  2097152 |   7.227 ms |   6.472 ms |  -755.754 us | -10.46% |
|   256 |  2097152 |  23.687 ms |  22.485 ms | -1201.915 us |  -5.07% |
|   512 |  2097152 |  23.156 ms |  18.011 ms | -5144.751 us | -22.22% |
|   32  | 16777216 |  10.945 ms |   8.513 ms | -2432.283 us | -22.22% |
|   64  | 16777216 |  24.259 ms |  20.522 ms | -3736.454 us | -15.40% |

@davidwendt (Contributor Author)

Performance results for nvtext::replace_tokens

| width |    rows  |   Ref Time |   New Time |         Diff |   %Diff |
|-------|----------|------------|------------| -------------|---------|
|  32   |     4096 | 180.832 us | 156.554 us |   -24.278 us | -13.43% |
|  64   |     4096 | 231.309 us | 204.905 us |   -26.404 us | -11.42% |
|  128  |     4096 | 441.013 us | 379.279 us |   -61.734 us | -14.00% |
|  256  |     4096 | 800.562 us | 705.105 us |   -95.457 us | -11.92% |
|  512  |     4096 |   1.583 ms |   1.372 ms |  -210.590 us | -13.31% |
| 1024  |     4096 |   3.135 ms |   2.696 ms |  -438.833 us | -14.00% |
|  32   |    32768 | 180.878 us | 161.270 us |   -19.609 us | -10.84% |
|  64   |    32768 | 263.387 us | 234.917 us |   -28.469 us | -10.81% |
|  128  |    32768 | 481.632 us | 432.004 us |   -49.628 us | -10.30% |
|  256  |    32768 | 903.393 us | 851.389 us |   -52.004 us |  -5.76% |
|  512  |    32768 |   1.844 ms |   1.654 ms |  -189.833 us | -10.30% |
| 1024  |    32768 |   3.823 ms |   3.125 ms |  -698.387 us | -18.27% |
|  32   |   262144 | 522.113 us | 441.362 us |   -80.752 us | -15.47% |
|  64   |   262144 |   1.004 ms | 907.246 us |   -96.976 us |  -9.66% |
|  128  |   262144 |   2.343 ms |   2.220 ms |  -122.461 us |  -5.23% |
|  256  |   262144 |   5.559 ms |   5.029 ms |  -530.276 us |  -9.54% |
|  512  |   262144 |  12.289 ms |  10.863 ms | -1425.938 us | -11.60% |
| 1024  |   262144 |  24.336 ms |  21.217 ms | -3118.850 us | -12.82% |
|  32   |  2097152 |   3.178 ms |   2.598 ms |  -580.615 us | -18.27% |
|  64   |  2097152 |   6.303 ms |   5.423 ms |  -880.641 us | -13.97% |
|  128  |  2097152 |  15.678 ms |  14.399 ms | -1278.928 us |  -8.16% |
|  256  |  2097152 |  36.717 ms |  32.785 ms | -3932.645 us | -10.71% |
|  512  |  2097152 |  83.745 ms |  73.797 ms | -9948.156 us | -11.88% |
|  32   | 16777216 |  24.484 ms |  19.711 ms | -4772.928 us | -19.49% |
|  64   | 16777216 |  47.634 ms |  41.194 ms | -6440.088 us | -13.52% |

@davidwendt changed the title from "Performance improvement for nvtext tokenize for long strings" to "Performance improvement for nvtext tokenize/token functions" Jun 7, 2023

@davidwendt (Contributor Author) commented Jun 7, 2023

Performance results for nvtext::normalize_spaces

| width |    rows  |   Ref Time |   New Time |         Diff |   %Diff |
|-------|----------|------------|------------|--------------|---------|
|   32  |     4096 |  98.035 us |  80.177 us |   -17.858 us | -18.22% |
|   64  |     4096 | 137.667 us | 112.543 us |   -25.123 us | -18.25% |
|   128 |     4096 | 249.369 us | 201.099 us |   -48.270 us | -19.36% |
|   256 |     4096 | 555.932 us | 464.217 us |   -91.715 us | -16.50% |
|   512 |     4096 |   1.074 ms | 904.616 us |  -169.466 us | -15.78% |
|  1024 |     4096 |   2.350 ms |   1.876 ms |  -473.827 us | -20.16% |
|   32  |    32768 |  94.327 us |  82.569 us |   -11.758 us | -12.46% |
|   64  |    32768 | 156.128 us | 129.143 us |   -26.985 us | -17.28% |
|   128 |    32768 | 286.025 us | 241.780 us |   -44.245 us | -15.47% |
|   256 |    32768 | 599.385 us | 526.894 us |   -72.491 us | -12.09% |
|   512 |    32768 |   1.227 ms |   1.096 ms |  -131.949 us | -10.75% |
|  1024 |    32768 |   2.713 ms |   2.263 ms |  -449.410 us | -16.57% |
|   32  |   262144 | 201.068 us | 174.979 us |   -26.089 us | -12.98% |
|   64  |   262144 | 453.823 us | 399.995 us |   -53.828 us | -11.86% |
|   128 |   262144 |   1.089 ms |   1.024 ms |   -65.534 us |  -6.02% |
|   256 |   262144 |   3.520 ms |   3.123 ms |  -397.206 us | -11.28% |
|   512 |   262144 |   8.082 ms |   7.247 ms |  -835.410 us | -10.34% |
|  1024 |   262144 |  15.123 ms |  13.127 ms | -1996.859 us | -13.20% |
|   32  |  2097152 | 985.586 us | 864.985 us |  -120.602 us | -12.24% |
|   64  |  2097152 |   2.489 ms |   2.286 ms |  -203.301 us |  -8.17% |
|   128 |  2097152 |   5.963 ms |   5.552 ms |  -411.271 us |  -6.90% |
|   256 |  2097152 |  15.976 ms |  14.617 ms | -1358.824 us |  -8.51% |
|   512 |  2097152 |  37.368 ms |  32.946 ms | -4421.712 us | -11.83% |
|   32  | 16777216 |   7.292 ms |   6.394 ms |  -898.460 us | -12.32% |
|   64  | 16777216 |  18.976 ms |  17.409 ms | -1567.228 us |  -8.26% |

@davidwendt (Contributor Author)

Performance results for `nvtext::ngrams_tokenize`

| width |    rows  |   Ref Time |   New Time |          Diff |   %Diff |
|-------|----------|------------|------------|---------------|---------|
|   32  |     4096 | 336.345 us | 254.371 us |    -81.974 us | -24.37% |
|   64  |     4096 | 445.825 us | 349.035 us |    -96.790 us | -21.71% |
|   128 |     4096 | 804.544 us | 651.466 us |   -153.078 us | -19.03% |
|   256 |     4096 |   1.719 ms |   1.419 ms |   -299.558 us | -17.43% |
|   512 |     4096 |   3.639 ms |   3.106 ms |   -532.256 us | -14.63% |
|  1024 |     4096 |   8.066 ms |   6.599 ms |  -1466.539 us | -18.18% |
|   32  |    32768 | 305.689 us | 244.122 us |    -61.567 us | -20.14% |
|   64  |    32768 | 463.945 us | 366.640 us |    -97.305 us | -20.97% |
|   128 |    32768 | 867.306 us | 698.292 us |   -169.015 us | -19.49% |
|   256 |    32768 |   1.863 ms |   1.569 ms |   -293.520 us | -15.76% |
|   512 |    32768 |   4.052 ms |   3.520 ms |   -531.761 us | -13.12% |
|  1024 |    32768 |   8.953 ms |   7.542 ms |  -1411.745 us | -15.77% |
|   32  |   262144 | 409.060 us | 313.378 us |    -95.682 us | -23.39% |
|   64  |   262144 | 662.305 us | 521.544 us |   -140.761 us | -21.25% |
|   128 |   262144 |   1.422 ms |   1.198 ms |   -224.669 us | -15.79% |
|   256 |   262144 |   4.347 ms |   3.854 ms |   -492.958 us | -11.34% |
|   512 |   262144 |  16.679 ms |  15.282 ms |  -1396.587 us |  -8.37% |
|  1024 |   262144 |  48.973 ms |  45.748 ms |  -3225.150 us |  -6.59% |
|   32  |  2097152 |   1.957 ms | 934.373 us |  -1022.834 us | -52.26% |
|   64  |  2097152 |   3.753 ms |   1.850 ms |  -1902.538 us | -50.70% |
|   128 |  2097152 |   7.839 ms |   5.133 ms |  -2706.137 us | -34.52% |
|   256 |  2097152 |  22.544 ms |  16.727 ms |  -5816.646 us | -25.80% |
|   512 |  2097152 |  92.856 ms |  82.702 ms | -10154.290 us | -10.94% |
|   32  | 16777216 |  18.918 ms |   7.365 ms | -11552.982 us | -61.07% |
|   64  | 16777216 |  34.700 ms |  15.020 ms | -19679.438 us | -56.71% |

@davidwendt added the 3 - Ready for Review label and removed the 2 - In Progress label Jun 8, 2023
@davidwendt davidwendt marked this pull request as ready for review June 20, 2023 17:07
@davidwendt davidwendt requested a review from a team as a code owner June 20, 2023 17:07
@davidwendt davidwendt changed the title Performance improvement for nvtext tokenize/token functions Performance improvement for nvtext tokenize/token functions Jun 21, 2023
@davidwendt davidwendt changed the title Performance improvement for nvtext tokenize/token functions Performance improvement for nvtext tokenize/token functions Jun 22, 2023
rapids-bot bot pushed a commit that referenced this pull request Jun 23, 2023
…ings (#13322)

Changes the internal regex logic to minimize character counting to help performance with longer strings. The improvement applies mainly to libcudf regex functions that return strings (i.e. extract, replace, split). The changes here also clarify the internal device APIs to make them easier to maintain. The most significant change makes the position variables input-only and returns an optional pair to indicate a successful match.

Further optimizations are possible here: where character positions are passed back and forth, they could be replaced with byte positions to further reduce counting. Initial measurements showed this noticeably slowed down small strings, so more analysis is required before continuing this optimization.

Reference: #13480

### More Detail

First, there is a change to some internal regex function signatures, notably the `reprog_device::find()` and `reprog_device::extract()` member functions declared in `cpp/src/strings/regex/regex.cuh` that are used by all the libcudf regex functions. The in/out parameters are now input-only parameters (pass by value) and the return is an optional pair that includes the match result. Also, the `begin` parameter is now an iterator and the `end` parameter now has a default. This change requires updating all the definitions and uses of the `find` and `extract` member functions.
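
The signature style described above can be sketched in plain host C++. This is only an illustration under stated assumptions: `find_match`, `count_matches`, and `match_pair` are hypothetical names, the matcher searches for a literal substring instead of running a regex program, `std::optional` stands in for whatever optional type the device code uses, and `begin` stays an integer position here rather than an iterator.

```cpp
#include <cassert>
#include <optional>
#include <string_view>
#include <utility>

// Hypothetical stand-in for a match result: [first, second) positions.
using match_pair = std::pair<int, int>;

// Inputs are passed by value, `end` has a default (-1 means "to the end"),
// and the match range, if any, comes back in the return value instead of
// being written through in/out parameters.
inline std::optional<match_pair> find_match(std::string_view text,
                                            std::string_view needle,
                                            int begin,
                                            int end = -1)
{
  assert(begin >= 0);
  auto const size = static_cast<int>(text.size());
  if (end < 0 || end > size) { end = size; }
  if (begin > end) { return std::nullopt; }
  auto const pos = text.substr(0, end).find(needle, begin);
  if (pos == std::string_view::npos) { return std::nullopt; }
  auto const first = static_cast<int>(pos);
  return match_pair{first, first + static_cast<int>(needle.size())};
}

// Counts non-overlapping matches. The caller's position variable changes only
// where the caller assigns it from the returned pair.
inline int count_matches(std::string_view text, std::string_view needle)
{
  int count = 0;
  int begin = 0;
  while (auto const match = find_match(text, needle, begin)) {
    ++count;
    // Step past the match; move at least one position for an empty needle.
    begin = match->second > match->first ? match->second : match->second + 1;
  }
  return count;
}
```

With this shape a caller cannot have its position variables modified by a failed search, which is one way to read the maintenance benefit mentioned in the commit message.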

Using an iterator as the `begin` parameter allows for some optimizations in the calling code to minimize the character counting needed when processing multi-byte UTF-8 characters. Rather than using the `cudf::string_view::byte_offset()` member function to convert character positions to byte positions, an iterator can be incremented as we traverse the string, which reduces character counting. So the changes here involve removing some calls to `byte_offset()` and incrementing (really moving) iterators with a pattern like `itr += (new_pos - itr.position());`. There is another PR, #13428, to make a `move_to` iterator member function.
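
To illustrate why moving an iterator saves work, here is a minimal host-side sketch. It assumes a hypothetical `utf8_cursor` class, which is not cudf's iterator, that remembers both its character position and its byte offset. Advancing it only scans the bytes between the old and new positions, while a conversion from a character position to a byte offset has to count from the start of the string each time.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical UTF-8 cursor (an assumption, not cudf's iterator): tracks a
// character position together with the matching byte offset.
class utf8_cursor {
 public:
  explicit utf8_cursor(std::string_view text) : text_(text) {}

  int position() const { return char_pos_; }
  std::size_t byte_offset() const { return byte_pos_; }

  // Moves forward by `count` characters, scanning only the bytes in between.
  // Stops at the end of the text if `count` overshoots.
  utf8_cursor& operator+=(int count)
  {
    assert(count >= 0);  // this sketch only moves forward
    while (count > 0 && byte_pos_ < text_.size()) {
      ++byte_pos_;
      // Skip continuation bytes (10xxxxxx) to land on a character start.
      while (byte_pos_ < text_.size() && is_continuation(text_[byte_pos_])) { ++byte_pos_; }
      ++char_pos_;
      --count;
    }
    return *this;
  }

 private:
  static bool is_continuation(char c)
  {
    return (static_cast<unsigned char>(c) & 0xC0) == 0x80;
  }

  std::string_view text_;
  int char_pos_{0};
  std::size_t byte_pos_{0};
};

// Baseline for comparison: converts a character position to a byte offset by
// counting characters from the start of the string on every call.
inline std::size_t byte_offset_from_start(std::string_view text, int char_pos)
{
  utf8_cursor cursor(text);
  cursor += char_pos;
  return cursor.byte_offset();
}
```

A caller that visits increasing positions can write `itr += (new_pos - itr.position());` and read `itr.byte_offset()`, so the total scanning is proportional to the string length rather than to the string length times the number of lookups.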

It is possible to reduce the character counting even more as mentioned above but further optimization requires some deeper analysis.

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Mark Harris (https://github.com/harrism)
  - MithunR (https://github.com/mythrocks)

URL: #13322
@davidwendt davidwendt changed the title Performance improvement for nvtext tokenize/token functions Performance improvement for nvtext tokenize/token functions Jun 26, 2023
@hyperbolic2346 (Contributor) left a comment

Looks good, one random comment for my own edification.

@@ -58,7 +58,7 @@ namespace {
  */
 struct normalize_spaces_fn {
   cudf::column_device_view const d_strings;  // strings to normalize
-  int32_t* d_offsets{};                      // offsets into d_buffer
+  cudf::size_type* d_offsets{};              // offsets into d_chars

Have we deprecated offset_type entirely? Has it been removed?

@davidwendt (Contributor Author) Jun 26, 2023

Not yet. I can look into doing that in a separate PR.

@davidwendt (Contributor Author)

/merge

@rapids-bot rapids-bot bot merged commit 1296ebd into rapidsai:branch-23.08 Jun 29, 2023
@davidwendt davidwendt deleted the nvtext-perf-tokenize branch June 29, 2023 23:02
rapids-bot bot pushed a commit that referenced this pull request Jun 30, 2023
Fixes memcheck error found by the nightly build in the nvtext `characters_tokenizer` utility function.
```
[ RUN      ] TextNgramsTokenizeTest.Tokenize
========= Invalid __global__ read of size 1 bytes
=========     at 0x2360 in void cub::CUB_101702_610_860_NS::DeviceScanKernel<cub::CUB_101702_610_860_NS::DeviceScanPolicy<int>::Policy600, thrust::cuda_cub::transform_input_iterator_t<int, thrust::counting_iterator<int, thrust::use_default, thrust::use_default, thrust::use_default>, nvtext::detail::strings_tokenizer>, int *, cub::CUB_101702_610_860_NS::ScanTileState<int, (bool)1>, thrust::plus<int>, cub::CUB_101702_610_860_NS::NullType, int>(T2, T3, T4, int, T5, T6, T7)
=========     by thread (5,0,0) in block (0,0,0)
=========     Address 0x7f67a0200a65 is out of bounds
=========     and is 1 bytes after the nearest allocation at 0x7f67a0200a00 of size 101 bytes
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame: [0x30b492]
=========                in /usr/lib/x86_64-linux-gnu/libcuda.so.1
=========     Host Frame: [0x1488c]
=========                in /conda/envs/rapids/lib/libcudart.so.11.0
=========     Host Frame:cudaLaunchKernel [0x6c318]
=========                in /conda/envs/rapids/lib/libcudart.so.11.0
=========     Host Frame:nvtext::detail::ngrams_tokenize(cudf::strings_column_view const&, int, cudf::string_scalar const&, cudf::string_scalar const&, rmm::cuda_stream_view, rmm::mr::device_memory_resource*) [0x2693cc9]
=========                in /conda/envs/rapids/lib/libcudf.so

```
This error was introduced by changes in #13480
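
The report above shows a 1-byte read at the address immediately past a 101-byte allocation (offset 0x65 = 101 from its start), which is the usual signature of a scanner reading the byte at index `size`. The actual fix in #13649 is not shown in this thread. The sketch below only illustrates the general guard against that class of bug, using a hypothetical host-side helper `count_tokens_checked` that is not the nvtext code: the bounds test always comes before the dereference.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (an assumption, not the nvtext code): counts
// space-delimited tokens in the byte range [data, data + size).
inline int count_tokens_checked(char const* data, std::size_t size)
{
  assert(data != nullptr || size == 0);
  int count = 0;
  std::size_t pos = 0;
  while (pos < size) {
    // Skip delimiters. `pos < size` is tested first, so data[pos] is never
    // read when pos == size.
    while (pos < size && data[pos] == ' ') { ++pos; }
    if (pos == size) { break; }
    ++count;
    // Consume the token, again checking the bound before each read.
    while (pos < size && data[pos] != ' ') { ++pos; }
  }
  return count;
}
```

Swapping the operands, as in `data[pos] == ' ' && pos < size`, would read one byte past the buffer whenever the scan reaches the end, which is the kind of access a memcheck run reports.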

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Vyas Ramasubramani (https://github.com/vyasr)
  - Bradley Dice (https://github.com/bdice)

URL: #13649
Labels: 3 - Ready for Review, improvement, libcudf, non-breaking, strings