Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Operations with string_view involving character counting are very inefficient #12445

Closed
ttnghia opened this issue Dec 28, 2022 · 2 comments · Fixed by #13100
Closed

[BUG] Operations with string_view involving character counting are very inefficient #12445

ttnghia opened this issue Dec 28, 2022 · 2 comments · Fixed by #13100
Labels
0 - Backlog In queue waiting for assignment bug Something isn't working libcudf Affects libcudf (C++/CUDA) code. strings strings issues (C++ and Python)

Comments

@ttnghia
Copy link
Contributor

ttnghia commented Dec 28, 2022

Due to the nature of strings columns that they only contain offsets to determine string boundaries, there is no way to know in O(1) how many characters were stored in each string. As such, when an operation requires to know the number of characters, an expensive sequential count is executed:

return thrust::count_if(
    thrust::seq, ptr, ptr + bytes, [](uint8_t chr) { return is_begin_utf8_char(chr); });

Because of that, a simple operation like substr can be extremely expensive. One of my tests with 1.4 million (customized) substring ops from just one source string can be slow like this:

|  test_size  | Samples |  CPU Time  | Noise |  GPU Time  | Noise |  Elem/s  | bytes_per_second | peak_memory_usage |
|-------------|---------|------------|-------|------------|-------|----------|------------------|-------------------|
|   2^8 = 256 |     34x | 451.036 ms | 1.62% | 450.978 ms | 1.62% |   8.287M |          8287194 |        35.142 MiB |
| 2^12 = 4096 |      1x |   96.397 s |  inf% |   96.393 s |  inf% | 620.353K |           620353 |       562.277 MiB |

When I just use the simple way of memcpy (which cannot deal with utf8 character, assuming that the input only has ASCII) then the performance went back to be reasonable:

|  test_size  | Samples | CPU Time  | Noise | GPU Time  | Noise |  Elem/s  | bytes_per_second | peak_memory_usage |
|-------------|---------|-----------|-------|-----------|-------|----------|------------------|-------------------|
|   2^8 = 256 |   1899x |  7.884 ms | 4.23% |  7.879 ms | 4.22% | 474.344M |        474344125 |        35.142 MiB |
| 2^12 = 4096 |    788x | 19.030 ms | 2.86% | 19.024 ms | 2.86% |   3.143G |       3143281474 |       562.277 MiB |

Summary for the tables above: using just memcpy for substring without character counting can improve performance from 96 seconds down to 19 milliseconds.

As such, the way of implementing strings column should be improved. We may need to have some efficient way to compute the number of characters in parallel (such as using thrust::reduce_by_key), or have a new way to store strings such that we can compute the number of characters in O(1) time.

I don't have a very good idea of how to implement and use those efficiently for various internal string_view operations but they should be on the table. Otherwise, we will hit this performance issue more in the future.

@ttnghia ttnghia added bug Something isn't working Needs Triage Need team to review and classify labels Dec 28, 2022
@ttnghia
Copy link
Contributor Author

ttnghia commented Dec 28, 2022

CC @davidwendt for visibility.

@davidwendt
Copy link
Contributor

Recommend not using cudf::string_view for parsing JSON. The class is useful when dealing with characters and character offsets. The substr() and find() function accept character positions and lengths as needed by various functions.

@GregoryKimball GregoryKimball added 0 - Backlog In queue waiting for assignment libcudf Affects libcudf (C++/CUDA) code. strings strings issues (C++ and Python) and removed Needs Triage Need team to review and classify labels Apr 2, 2023
rapids-bot bot pushed a commit that referenced this issue Apr 18, 2023
Improves on performance for longer strings with `cudf::strings::slice_strings()` API. The `cudf::string_view::substr` was reworked to minimize counting characters and the gather version of `make_strings_children` is used to build the resulting strings column. This version is already optimized for small and large strings.

Additionally, the code was refactored so the common case of `step==1 and start < stop`  can also make use of the gather approach. Common code was also grouped closer together to help navigate the source file better.

The `slice.cpp` benchmark was updated to better measure large strings with comparable slice boundaries. The benchmark showed performance improvement was up to 9x for larger strings with no significant degradation for smaller strings.

Reference #13048 and #12445

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Nghia Truong (https://github.com/ttnghia)
  - Elias Stehle (https://github.com/elstehle)

URL: #13057
shwina pushed a commit to shwina/cudf that referenced this issue Apr 18, 2023
Improves on performance for longer strings with `cudf::strings::slice_strings()` API. The `cudf::string_view::substr` was reworked to minimize counting characters and the gather version of `make_strings_children` is used to build the resulting strings column. This version is already optimized for small and large strings.

Additionally, the code was refactored so the common case of `step==1 and start < stop`  can also make use of the gather approach. Common code was also grouped closer together to help navigate the source file better.

The `slice.cpp` benchmark was updated to better measure large strings with comparable slice boundaries. The benchmark showed performance improvement was up to 9x for larger strings with no significant degradation for smaller strings.

Reference rapidsai#13048 and rapidsai#12445

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Nghia Truong (https://github.com/ttnghia)
  - Elias Stehle (https://github.com/elstehle)

URL: rapidsai#13057
rapids-bot bot pushed a commit that referenced this issue Apr 20, 2023
Improves performance of calls to `cudf::string_view::find` functions for longer strings by minimizing character and byte counting required for resolving the input parameters. The input values are in character positions/counts so range checks do not include bytes that may be in partial characters. Converting these to bytes requires some counting usually from the beginning of the string for any given character position. The code has been reworked to minimizing this counting as much as possible. This improves performance overall by 10% but greatly improves performance for long strings by up to 3x.

Closes #12445

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - Nghia Truong (https://github.com/ttnghia)

URL: #13100
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
0 - Backlog In queue waiting for assignment bug Something isn't working libcudf Affects libcudf (C++/CUDA) code. strings strings issues (C++ and Python)
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants