[BUG] Operations with `string_view` involving character counting are very inefficient #12445

ttnghia · 2022-12-28T17:39:58Z

Due to the nature of strings columns that they only contain offsets to determine string boundaries, there is no way to know in O(1) how many characters were stored in each string. As such, when an operation requires to know the number of characters, an expensive sequential count is executed:

return thrust::count_if(
    thrust::seq, ptr, ptr + bytes, [](uint8_t chr) { return is_begin_utf8_char(chr); });

Because of that, a simple operation like substr can be extremely expensive. One of my tests with 1.4 million (customized) substring ops from just one source string can be slow like this:

|  test_size  | Samples |  CPU Time  | Noise |  GPU Time  | Noise |  Elem/s  | bytes_per_second | peak_memory_usage |
|-------------|---------|------------|-------|------------|-------|----------|------------------|-------------------|
|   2^8 = 256 |     34x | 451.036 ms | 1.62% | 450.978 ms | 1.62% |   8.287M |          8287194 |        35.142 MiB |
| 2^12 = 4096 |      1x |   96.397 s |  inf% |   96.393 s |  inf% | 620.353K |           620353 |       562.277 MiB |

When I just use the simple way of memcpy (which cannot deal with utf8 character, assuming that the input only has ASCII) then the performance went back to be reasonable:

|  test_size  | Samples | CPU Time  | Noise | GPU Time  | Noise |  Elem/s  | bytes_per_second | peak_memory_usage |
|-------------|---------|-----------|-------|-----------|-------|----------|------------------|-------------------|
|   2^8 = 256 |   1899x |  7.884 ms | 4.23% |  7.879 ms | 4.22% | 474.344M |        474344125 |        35.142 MiB |
| 2^12 = 4096 |    788x | 19.030 ms | 2.86% | 19.024 ms | 2.86% |   3.143G |       3143281474 |       562.277 MiB |

Summary for the tables above: using just memcpy for substring without character counting can improve performance from 96 seconds down to 19 milliseconds.

As such, the way of implementing strings column should be improved. We may need to have some efficient way to compute the number of characters in parallel (such as using thrust::reduce_by_key), or have a new way to store strings such that we can compute the number of characters in O(1) time.

I don't have a very good idea of how to implement and use those efficiently for various internal string_view operations but they should be on the table. Otherwise, we will hit this performance issue more in the future.

The text was updated successfully, but these errors were encountered:

ttnghia · 2022-12-28T17:42:38Z

CC @davidwendt for visibility.

davidwendt · 2023-01-03T15:57:58Z

Recommend not using cudf::string_view for parsing JSON. The class is useful when dealing with characters and character offsets. The substr() and find() function accept character positions and lengths as needed by various functions.

Improves on performance for longer strings with `cudf::strings::slice_strings()` API. The `cudf::string_view::substr` was reworked to minimize counting characters and the gather version of `make_strings_children` is used to build the resulting strings column. This version is already optimized for small and large strings. Additionally, the code was refactored so the common case of `step==1 and start < stop` can also make use of the gather approach. Common code was also grouped closer together to help navigate the source file better. The `slice.cpp` benchmark was updated to better measure large strings with comparable slice boundaries. The benchmark showed performance improvement was up to 9x for larger strings with no significant degradation for smaller strings. Reference #13048 and #12445 Authors: - David Wendt (https://github.com/davidwendt) Approvers: - Nghia Truong (https://github.com/ttnghia) - Elias Stehle (https://github.com/elstehle) URL: #13057

Improves on performance for longer strings with `cudf::strings::slice_strings()` API. The `cudf::string_view::substr` was reworked to minimize counting characters and the gather version of `make_strings_children` is used to build the resulting strings column. This version is already optimized for small and large strings. Additionally, the code was refactored so the common case of `step==1 and start < stop` can also make use of the gather approach. Common code was also grouped closer together to help navigate the source file better. The `slice.cpp` benchmark was updated to better measure large strings with comparable slice boundaries. The benchmark showed performance improvement was up to 9x for larger strings with no significant degradation for smaller strings. Reference rapidsai#13048 and rapidsai#12445 Authors: - David Wendt (https://github.com/davidwendt) Approvers: - Nghia Truong (https://github.com/ttnghia) - Elias Stehle (https://github.com/elstehle) URL: rapidsai#13057

Improves performance of calls to `cudf::string_view::find` functions for longer strings by minimizing character and byte counting required for resolving the input parameters. The input values are in character positions/counts so range checks do not include bytes that may be in partial characters. Converting these to bytes requires some counting usually from the beginning of the string for any given character position. The code has been reworked to minimizing this counting as much as possible. This improves performance overall by 10% but greatly improves performance for long strings by up to 3x. Closes #12445 Authors: - David Wendt (https://github.com/davidwendt) Approvers: - Bradley Dice (https://github.com/bdice) - Nghia Truong (https://github.com/ttnghia) URL: #13100

ttnghia added bug Something isn't working Needs Triage Need team to review and classify labels Dec 28, 2022

ttnghia mentioned this issue Dec 28, 2022

Implement a simplified version of from_json NVIDIA/spark-rapids-jni#844

Merged

GregoryKimball added 0 - Backlog In queue waiting for assignment libcudf Affects libcudf (C++/CUDA) code. strings strings issues (C++ and Python) and removed Needs Triage Need team to review and classify labels Apr 2, 2023

This was referenced Apr 4, 2023

Improve performance of slice_strings for long strings #13057

Merged

Performance improvement for cudf::string_view::find functions #13100

Merged

rapids-bot bot closed this as completed in #13100 Apr 20, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[BUG] Operations with `string_view` involving character counting are very inefficient #12445

[BUG] Operations with `string_view` involving character counting are very inefficient #12445

ttnghia commented Dec 28, 2022 •

edited

Loading

ttnghia commented Dec 28, 2022

davidwendt commented Jan 3, 2023

[BUG] Operations with string_view involving character counting are very inefficient #12445

[BUG] Operations with string_view involving character counting are very inefficient #12445

Comments

ttnghia commented Dec 28, 2022 • edited Loading

ttnghia commented Dec 28, 2022

davidwendt commented Jan 3, 2023

[BUG] Operations with `string_view` involving character counting are very inefficient #12445

[BUG] Operations with `string_view` involving character counting are very inefficient #12445

ttnghia commented Dec 28, 2022 •

edited

Loading