[FEA] Story - Improve performance with long strings #13048
This is a great use of a "Story" issue for documenting this work. I have some very high-level comments and ideas:
Improves performance for longer strings in the `cudf::strings::slice_strings()` API. The `cudf::string_view::substr` logic was reworked to minimize character counting, and the gather version of `make_strings_children` is used to build the resulting strings column; that version is already optimized for both small and large strings. Additionally, the code was refactored so the common case of `step==1 and start < stop` can also use the gather approach, and common code was grouped closer together to make the source file easier to navigate. The `slice.cpp` benchmark was updated to better measure large strings with comparable slice boundaries. The benchmark showed performance improvements of up to 9x for larger strings, with no significant degradation for smaller strings.

Reference #13048 and #12445

Authors:
- David Wendt (https://github.com/davidwendt)

Approvers:
- Nghia Truong (https://github.com/ttnghia)
- Elias Stehle (https://github.com/elstehle)

URL: #13057
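The substr rework above hinges on how costly character counting is for UTF-8 data. As an illustration only (a CPU sketch, not libcudf's implementation; all function names here are hypothetical), character-position slicing can scan only as far into the string as the requested positions require, rather than counting the full character length first:

```cpp
#include <cstddef>
#include <string>

// Returns true if byte b begins a UTF-8 character (i.e. is not a continuation byte).
inline bool is_char_begin(unsigned char b) { return (b & 0xC0) != 0x80; }

// Byte offset of the char_pos-th character. The scan stops as soon as the
// position is found instead of counting characters in the whole string.
std::size_t char_to_byte(std::string const& s, std::size_t char_pos)
{
  std::size_t chars = 0;
  for (std::size_t i = 0; i < s.size(); ++i) {
    if (is_char_begin(static_cast<unsigned char>(s[i]))) {
      if (chars == char_pos) return i;
      ++chars;
    }
  }
  return s.size();  // char_pos is at or past the end
}

// Substring by character positions [start, stop): each boundary lookup scans
// only up to the requested position.
std::string substr_chars(std::string const& s, std::size_t start, std::size_t stop)
{
  auto const b0 = char_to_byte(s, start);
  auto const b1 = char_to_byte(s, stop);
  return s.substr(b0, b1 - b0);
}
```

For pure-ASCII data the byte offset equals the character position, which is why avoiding these scans entirely pays off for long strings.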
(#13226) Improves performance for longer strings with the `cudf::strings::find()` and `cudf::strings::rfind()` APIs. The current implementation works well with small-ish strings, so this new implementation switches to a long-string algorithm when the average number of bytes per string is 64 or greater. The new implementation searches for the target string using a warp-per-string kernel. Additionally, the special logic needed for matching an empty target (easily checked in host code) is factored out into its own transform functor. For longer strings, the performance improvement is about 2-6x.

Reference #13048

Authors:
- David Wendt (https://github.com/davidwendt)

Approvers:
- Nghia Truong (https://github.com/ttnghia)
- Yunsong Wang (https://github.com/PointKernel)
- MithunR (https://github.com/mythrocks)

URL: #13226
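The average-bytes dispatch described above can be sketched on the host as follows. This is an illustration with a hypothetical helper name and a plain host-side column layout, not libcudf's code; the 64-byte threshold is the one stated in the PR description:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Per the PR description, the long-string path is taken when the average
// string is 64 bytes or more.
constexpr std::size_t AVG_BYTES_THRESHOLD = 64;

// Hypothetical dispatch helper: decide between the thread-per-string path
// and the warp-per-string path from the column's average row width.
bool use_warp_per_string(std::vector<std::string> const& column)
{
  if (column.empty()) return false;
  std::size_t total_bytes = 0;
  for (auto const& s : column) total_bytes += s.size();
  return (total_bytes / column.size()) >= AVG_BYTES_THRESHOLD;
}
```

A coarse heuristic like this is cheap to compute (libcudf can derive it from the offsets column in O(1)) and lets one API serve both workload shapes.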
(#13142) Improves performance for longer strings with the `cudf::strings::to_lower()`, `cudf::strings::to_upper()`, and `cudf::strings::swapcase()` APIs. The current implementation works well with small-ish strings, so this new implementation switches to a long-string algorithm when the average number of bytes per string is 64 or greater. The new implementation is similar but computes the output size of each string with a warp-per-string function. In addition, a check is added for long strings to test whether all bytes are ASCII, in which case a faster kernel can be run.

Reference #13048

Authors:
- David Wendt (https://github.com/davidwendt)

Approvers:
- Mark Harris (https://github.com/harrism)
- Karthikeyan (https://github.com/karthikeyann)

URL: #13142
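The all-ASCII fast path works because ASCII case conversion is a fixed per-byte transform, while general UTF-8 case mapping can change character widths. A minimal CPU sketch of the idea (hypothetical names, not the libcudf kernels):

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// If every byte has the high bit clear, the string is pure ASCII and case
// conversion never changes the byte count, so a trivial kernel suffices.
bool all_ascii(std::string const& s)
{
  return std::all_of(s.begin(), s.end(),
                     [](unsigned char c) { return c < 0x80; });
}

// ASCII-only fast path: one byte in, one byte out.
std::string to_lower_ascii(std::string s)
{
  for (auto& c : s) {
    c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
  }
  return s;
}
```

When `all_ascii` fails, the implementation must fall back to full UTF-8 case mapping, which also has to recompute output sizes per string.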
Improves performance for the `cudf::strings::all_characters_of_type()` API, which covers many cudf `is_X` functions. The solution improves performance for all string lengths as measured by the new benchmark included in this PR. Additionally, the code was cleaned up to help with maintenance and clarity.

Reference #13048

Authors:
- David Wendt (https://github.com/davidwendt)
- Karthikeyan (https://github.com/karthikeyann)

Approvers:
- Nghia Truong (https://github.com/ttnghia)
- Karthikeyan (https://github.com/karthikeyann)

URL: #13259
(#13283) Improves performance for longer strings with the `cudf::strings::join_strings()` API. The new implementation is inspired by recent changes in json/writer_impl.cu, which uses the `make_strings_children` factory that accepts an iterator of string pointers to gather the results into a single column of data. That approach loses some performance for small-ish strings (less than 32 bytes), so the original code is retained for smaller strings. A benchmark implementation is also added in this PR to test against various sized input columns. The performance improvement is up to 100x for very long strings (>512 bytes).

Reference #13048

Authors:
- David Wendt (https://github.com/davidwendt)

Approvers:
- Yunsong Wang (https://github.com/PointKernel)
- Nghia Truong (https://github.com/ttnghia)

URL: #13283
Moves some of the nvtext benchmarks to nvbench to help provide a more useful baseline when making improvements to performance. These run with varying parameters for column size and string length. The remaining benchmarks are more involved and may be updated in follow-on PRs.

Reference: #13048

Authors:
- David Wendt (https://github.com/davidwendt)

Approvers:
- Nghia Truong (https://github.com/ttnghia)
- Yunsong Wang (https://github.com/PointKernel)
- Robert Maynard (https://github.com/robertmaynard)

URL: #13368
Minimizes character counting in the kernel logic for `cudf::strings::like` to improve overall performance, especially for longer strings. The nvbench benchmark is updated to include measurements for various string sizes.

Reference #13048

Authors:
- David Wendt (https://github.com/davidwendt)

Approvers:
- Vukasin Milovanovic (https://github.com/vuule)
- Karthikeyan (https://github.com/karthikeyann)

URL: #13594
Improves performance for nvtext tokenize functions by minimizing character counting in the `characters_tokenize` utility functor in `src/text/utilities/tokenize_ops.cuh`. Functions this change affects are:

- [`nvtext::count_tokens`](https://docs.rapids.ai/api/libcudf/stable/group__nvtext__tokenize.html#ga5323d94dac99bf42f0cbb07c4fcd7242) (single delimiter and whitespace)
- [`nvtext::tokenize`](https://docs.rapids.ai/api/libcudf/stable/group__nvtext__tokenize.html#ga64c2806c398ce476fa5174f3155ea0fb) (single delimiter and whitespace)
- [`nvtext::replace_tokens`](https://docs.rapids.ai/api/libcudf/stable/group__nvtext__replace.html#ga66219b7db6155c4e14bf6f6147e1fc81)
- [`nvtext::normalize_spaces`](https://docs.rapids.ai/api/libcudf/stable/group__nvtext__normalize.html#ga9104dffc71baf77e710bc63e5e2a8837)
- [`nvtext::ngrams_tokenize`](https://docs.rapids.ai/api/libcudf/stable/group__nvtext__ngrams.html#gace17045b4ee5a3b10157ed40f9575298)

This change improved performance by at least 10% for all string lengths for most of these functions.

Reference #13048

Authors:
- David Wendt (https://github.com/davidwendt)

Approvers:
- Mike Wilson (https://github.com/hyperbolic2346)
- Bradley Dice (https://github.com/bdice)

URL: #13480
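Character counting can often be avoided in tokenizing because an ASCII delimiter byte can never appear inside a multi-byte UTF-8 sequence, so token boundaries are visible in a plain byte scan. A minimal CPU sketch of whitespace token counting illustrating this (an illustration only, not the `characters_tokenize` functor itself):

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Counts whitespace-delimited tokens in a single byte scan. No per-character
// index bookkeeping is needed: ASCII whitespace bytes cannot occur inside a
// multi-byte UTF-8 character, so byte positions are sufficient.
std::size_t count_tokens_ws(std::string const& s)
{
  std::size_t count = 0;
  bool in_token = false;
  for (unsigned char c : s) {
    bool const ws = std::isspace(c) != 0;
    if (!ws && !in_token) ++count;  // a token starts here
    in_token = !ws;
  }
  return count;
}
```

The same scan can record byte offsets of token starts/ends directly, which is what a tokenizer needs to emit (pointer, size) pairs for the output column.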
Hello @davidwendt, I was discussing this issue with @elstehle. Do you think the "combining" or "copying" strings algorithms could benefit from DeviceBatchMemcpy?
There are several parts of the strings (and nvtext) API implementations that essentially resolve into a fragmented set of (pointer, size) pairs, which are passed to a factory that performs an optimized gather to build a contiguous byte array for an output strings column. So a significant impact here may come from improving that gather function by taking advantage of it. The factory function that uses the gather is illustrated in my blog post, which uses the custom kernel here: `cudf/cpp/examples/strings/custom_prealloc.cu`, line 118 (at commit cd56cc2).
I'll locate other places where libcudf uses this factory function that also include benchmarks.
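The gather step described above can be sketched on the CPU as follows. The struct and function names are hypothetical, and libcudf performs this with device kernels over column memory; the point is only the shape of the operation, scattered (pointer, size) pairs in, offsets plus one contiguous chars buffer out:

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Minimal stand-in for a strings column: an offsets array (n + 1 entries)
// plus one contiguous buffer of character bytes.
struct strings_column {
  std::vector<std::size_t> offsets;
  std::string chars;
};

// Gathers fragmented (pointer, size) string views into a contiguous column.
// This is the step a batched-memcpy primitive could accelerate on the GPU.
strings_column gather(std::vector<std::pair<char const*, std::size_t>> const& views)
{
  strings_column out;
  out.offsets.reserve(views.size() + 1);
  out.offsets.push_back(0);
  for (auto const& [ptr, size] : views) {
    out.chars.append(ptr, size);          // copy this fragment's bytes
    out.offsets.push_back(out.chars.size());
  }
  return out;
}
```

On the GPU the copies are independent per string, which is exactly the workload a batched memcpy is designed for: many small copies of wildly varying sizes issued as one operation.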
Many strings APIs in libcudf use thread-per-string parallelism in their implementation. This approach works great for processing smaller strings of relatively consistent length. However, for long strings (roughly 256 bytes and above) the performance of thread-per-string algorithms begins to degrade. Some strings APIs are compatible with data-parallel algorithms and can be refactored to improve performance for long strings, while other strings APIs are difficult to refactor with data-parallel algorithms.
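As an illustration of the pattern (a CPU analogue with hypothetical names, not actual kernel code), thread-per-string parallelism assigns one entire row to each thread index. A single very long string keeps its thread busy long after its neighbors finish, which is why performance degrades on long rows:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Thread-per-string pattern, CPU analogue: index i plays the thread index,
// and each "thread" walks its whole string. Work per thread is proportional
// to that row's length, so one long row stalls its thread (and, on a GPU,
// the rest of its warp) relative to its neighbors.
std::vector<std::size_t> count_chars_per_string(std::vector<std::string> const& column)
{
  std::vector<std::size_t> counts(column.size(), 0);
  for (std::size_t i = 0; i < column.size(); ++i) {
    for (unsigned char b : column[i]) {
      if ((b & 0xC0) != 0x80) ++counts[i];  // count UTF-8 lead bytes
    }
  }
  return counts;
}
```

Warp-per-string or fully data-parallel variants instead split each row's bytes across many threads, trading some per-thread bookkeeping for far better load balance on long rows.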
Let's use this issue to track the progression:
✅ - this API works well with long strings
🟢 - we think this API will be straightforward to refactor
🟡 - we have some ideas on how to refactor this API, and we'll need to experiment
🔴 - we think this will be very difficult to refactor!
⚪ - long string support is not a priority for this API
| API | Status |
| --- | --- |
| title | 🟡 |
| is_title | 🟡 |
| to_lower | ✅ #13142 |
| to_upper | ✅ #13142 |
| swapcase | ✅ #13142 |
| filter_characters_of_type | 🔴 |
| concatenate | 🟡 |
| join_list_elements | 🔴 |
| matches_re | ⚪ |
| count_re | 🔴 |
| like | 🟢 #13594 |
| find_all | 🔴 |
| from_XXXX | ⚪ |
| repeat_strings | ✅ |
Note: long string support is not a priority for `step > 1` or `step < 0`.
| API | Status |
| --- | --- |
| rfind | ✅ #13226 |
| contains | ✅ #10739 |
| starts_with | ⚪ |
| ends_with | ⚪ |
| find_multiple | 🟢 |
| zfill | ⚪ |
| reverse | 🟡 |
| strip | ✅ |
| translate | 🟡 |
| filter_characters | 🔴 |
| wrap | 🔴 |
| replace_slice | 🟡 |
| replace_re | 🔴 |
| replace_with_backrefs | 🔴 |
| split | ✅ #4922 #13680 |
| split_record | ✅ #12729 |
| split_re | 🔴 |
| split_record_re | 🔴 |
| count_bytes | 🟢 |
Libcudf also includes NVText APIs that will benefit from performance improvements when processing long strings. Generally, long-string performance is even more important for our text APIs, where each row could represent a sentence, paragraph, or document.
| API | Status |
| --- | --- |
| generate_character_ngrams | 🟢 |
| ngrams_tokenize | 🟢 #13480 |
| normalize_spaces | 🟢 #13480 |
| porter_stemmer_measure | 🟢 |
| edit_distance_matrix | ⚪ |
| subword_tokenize | 🟡 |
| tokenize | 🟢 #13480 |
| count_tokens | 🟢 #13480 |
| character_tokenize | 🟡 |
| detokenize | 🟡 |
| filter_tokens | 🟢 #13480 |