Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA] Story - Improve performance with long strings #13048

Open
GregoryKimball opened this issue Apr 3, 2023 · 5 comments
Open

[FEA] Story - Improve performance with long strings #13048

GregoryKimball opened this issue Apr 3, 2023 · 5 comments
Assignees
Labels
2 - In Progress Currently a work in progress libcudf Affects libcudf (C++/CUDA) code. Performance Performance related issue strings strings issues (C++ and Python)

Comments

@GregoryKimball
Copy link
Contributor

GregoryKimball commented Apr 3, 2023

Many strings APIs in libcudf use thread-per-string parallelism in their implementation. This approach works great for processing smaller strings of relatively consistent length. However, for long strings (roughly 256 bytes and above) the performance of thread-per-string algorithms begins to degrade. Some strings APIs are compatible with data-parallel algorithms and can be refactored to improve performance for long strings, while other strings APIs are difficult to refactor with data-parallel algorithms.

Let's use this issue to track the progression:
✅ - this API works well with long strings
🟢 - we think this API will be straightforward to refactor
🟡 - we have some ideas on how to refactor this API, and we'll need to experiment
🔴 - we think this will be very difficult to refactor!
⚪ - long string support is not a priority for this API

Module Function Status Notes
Case capitalize
title
is_title
to_lower
to_upper
swapcase
🟡
🟡
🟡
#13142
#13142
#13142
Character Types all_characters_of_type
filter_characters_of_type
#13259
🔴
Combining join_strings
concatenate
join_list_elements
#13283
🟡
🔴
Searching contains_re
matches_re
count_re
like
find_all
🟡

🔴
🟢#13594
🔴
Converting to_XXXX
from_XXXX

these are rarely long strings
Copying repeat_string
repeat_strings

One overload is an exception
Slicing slice_strings #13057 One overload allows for skipping characters.
Long string support is not a priority for
step > 1 or step < 0
Finding find
rfind
contains
starts_with
ends_with
find_multiple
#13226
#13226
#10739


🟢
Modifying pad
zfill
reverse
strip
translate
filter_characters
wrap
🟡

🟡

🟡
🔴
🔴
Replacing replace
replace_slice
replace_re
replace_with_backrefs
#12858
🟡
🔴
🔴
Splitting partition
split
split_record
split_re
split_record_re
🟡
#4922 #13680
#12729
🔴
🔴
other count_characters
count_bytes
#12779
🟢

Libcudf also includes NVText APIs that will benefit from improvements in performance when processing long strings. Generally long string performance is even more important for our text APIs, where each row could represent a sentence, paragraph or document.

Module Function Status Notes
NGrams generate_ngrams
generate_character_ngrams
ngrams_tokenize

🟢
🟢#13480
these are generally not long strings
Normalizing normalize_characters
normalize_spaces
🟢
🟢#13480
Stemming is_letter
porter_stemmer_measure
🟢
🟢
Edit Distance edit_distance
edit_distance_matrix

these are generally not long strings
Tokenizing byte_pair_encoding
subword_tokenize
tokenize
count_tokens
character_tokenize
detokenize
🟡
🟡
🟢#13480
🟢#13480
🟡
🟡
Replacing replace_tokens
filter_tokens
🟢#13480
🟢#13480
MinHashing minhash #13333
@GregoryKimball GregoryKimball added 2 - In Progress Currently a work in progress libcudf Affects libcudf (C++/CUDA) code. Performance Performance related issue strings strings issues (C++ and Python) labels Apr 3, 2023
@GregoryKimball GregoryKimball moved this to In progress in libcudf Apr 3, 2023
@bdice
Copy link
Contributor

bdice commented Apr 3, 2023

This is a great use of a "Story" issue for documenting this work. I have some very high level comments / ideas:

  • Some of these algorithms might benefit from using Cooperative Groups and/or studying implementations of related algorithms in CUB.
  • For some of these cases, I suspect that "prefix strings" as described in the Velox paper could be useful. That would use a (non-Arrow) layout with a fixed-width prefix that could be computed as a temporary column used to quickly eliminate possible matches. Filtering and ordering are specifically called out in that paper.

@GregoryKimball
Copy link
Contributor Author

After studying the results of the strings microbenchmarks in libcudf, I'd like to share some results and early thoughts on the performance we should target with long strings.

First, here is data throughput versus data size for the split_record API. It shows about 15-35 GB/s throughput at the ~100 MB data size for string lengths between 32 and 8192 bytes.
image

Next, here is the same plot for filter_characters_of_type. It shows 20 GB/s throughput for ~100 MB data size and strings lengths <100 bytes. For longer strings the throughput falls to the 1-6 GB/s range.
image

So perhaps we should target data throughputs for long strings to be better than 80% of data throughput for short strings.

rapids-bot bot pushed a commit that referenced this issue Apr 18, 2023
Improves on performance for longer strings with `cudf::strings::slice_strings()` API. The `cudf::string_view::substr` was reworked to minimize counting characters and the gather version of `make_strings_children` is used to build the resulting strings column. This version is already optimized for small and large strings.

Additionally, the code was refactored so the common case of `step==1 and start < stop`  can also make use of the gather approach. Common code was also grouped closer together to help navigate the source file better.

The `slice.cpp` benchmark was updated to better measure large strings with comparable slice boundaries. The benchmark showed performance improvement was up to 9x for larger strings with no significant degradation for smaller strings.

Reference #13048 and #12445

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Nghia Truong (https://github.com/ttnghia)
  - Elias Stehle (https://github.com/elstehle)

URL: #13057
shwina pushed a commit to shwina/cudf that referenced this issue Apr 18, 2023
Improves on performance for longer strings with `cudf::strings::slice_strings()` API. The `cudf::string_view::substr` was reworked to minimize counting characters and the gather version of `make_strings_children` is used to build the resulting strings column. This version is already optimized for small and large strings.

Additionally, the code was refactored so the common case of `step==1 and start < stop`  can also make use of the gather approach. Common code was also grouped closer together to help navigate the source file better.

The `slice.cpp` benchmark was updated to better measure large strings with comparable slice boundaries. The benchmark showed performance improvement was up to 9x for larger strings with no significant degradation for smaller strings.

Reference rapidsai#13048 and rapidsai#12445

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Nghia Truong (https://github.com/ttnghia)
  - Elias Stehle (https://github.com/elstehle)

URL: rapidsai#13057
rapids-bot bot pushed a commit that referenced this issue May 8, 2023
…#13226)

Improves performance for longer strings with `cudf::strings::find()` and `cudf::strings::rfind()` APIs.

The current implementation works well with small-ish strings and so this new implementation splits into a long-ish string algorithm when the average number of bytes per string is 64 bytes or greater. The new implementation searches for the target string by applying a warp per string kernel. Additionally, the special logic needed for matching an empty target (easily checked in host code) is factored out into its own transform functor.
For longer strings, the performance improvement is about 2-6x.

Reference #13048

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Nghia Truong (https://github.com/ttnghia)
  - Yunsong Wang (https://github.com/PointKernel)
  - MithunR (https://github.com/mythrocks)

URL: #13226
rapids-bot bot pushed a commit that referenced this issue May 9, 2023
…trings (#13142)

Improves on performance for longer strings with `cudf::strings::to_lower()` `cudf::strings::to_upper()` and `cudf::strings::swapcase()` APIs. 

The current implementation works well with smallish strings and so this new implementation splits into a longish string algorithm when average number of bytes per string is 64 bytes or greater. The new implementation is similar but computes the output size of each string with a warp per string function. In addition, a check is added for long strings testing if all bytes are ASCII and thereby can run a faster kernel for this case.

Reference #13048

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Mark Harris (https://github.com/harrism)
  - Karthikeyan (https://github.com/karthikeyann)

URL: #13142
rapids-bot bot pushed a commit that referenced this issue May 15, 2023
)

Improves performance for `cudf::strings::all_characters_of_type()` API which covers many cudf `is_X` functions. The solution improves performance for all string lengths as measured by the new benchmark included in this PR.
Additionally, the code was cleaned up to help with maintenance and clarity.

Reference #13048

Authors:
  - David Wendt (https://github.com/davidwendt)
  - Karthikeyan (https://github.com/karthikeyann)

Approvers:
  - Nghia Truong (https://github.com/ttnghia)
  - Karthikeyan (https://github.com/karthikeyann)

URL: #13259
rapids-bot bot pushed a commit that referenced this issue May 15, 2023
…gs (#13283)

Improves performance for longer strings with `cudf::strings::join_strings()` API.

The new implementation is inspired by recent changes in json/writer_impl.cu which uses the `make_strings_children` factor that accepts an iterator of string pointers to gather the results into a single column of data but does lose some performance for smallish-strings (less than 32 bytes). So the original code is refactored to work on smaller strings.
Also, a benchmark implementation is added in this PR to test against various sized input columns.
For longer strings, the performance improvement is up to 100x for very long strings (>512 bytes).

Reference #13048

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Yunsong Wang (https://github.com/PointKernel)
  - Nghia Truong (https://github.com/ttnghia)

URL: #13283
rapids-bot bot pushed a commit that referenced this issue May 24, 2023
Moves some of the nvtext benchmarks to nvbench to help provide a more useful baseline when making improvements to performance. These run with varying parameters for column size and string length. The remaining benchmarks are more involved a may be updated in follow-on PRs.

Reference: #13048

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Nghia Truong (https://github.com/ttnghia)
  - Yunsong Wang (https://github.com/PointKernel)
  - Robert Maynard (https://github.com/robertmaynard)

URL: #13368
rapids-bot bot pushed a commit that referenced this issue Jun 23, 2023
Minimizes character counting in the kernel logic for `cudf::strings::like` to improve overall performance especially for longer strings.
The nvbench benchmark is updated to include measurements for various strings sizes.

Reference #13048

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Vukasin Milovanovic (https://github.com/vuule)
  - Karthikeyan (https://github.com/karthikeyann)

URL: #13594
rapids-bot bot pushed a commit that referenced this issue Jun 29, 2023
Improves performance for nvtext tokenize functions by minimizing character counting in the `characters_tokenize` utility functor in `src/text/utilities/tokenize_ops.cuh`.

Functions this change effects are:
- [`nvtext::count_tokens`](https://docs.rapids.ai/api/libcudf/stable/group__nvtext__tokenize.html#ga5323d94dac99bf42f0cbb07c4fcd7242) (single delimiter and whitespace)
- [`nvtext::tokenize`](https://docs.rapids.ai/api/libcudf/stable/group__nvtext__tokenize.html#ga64c2806c398ce476fa5174f3155ea0fb) (single delimiter and whitespace)
- [`nvtext::replace_tokens`](https://docs.rapids.ai/api/libcudf/stable/group__nvtext__replace.html#ga66219b7db6155c4e14bf6f6147e1fc81)
- [`nvtext::normalize_spaces`](https://docs.rapids.ai/api/libcudf/stable/group__nvtext__normalize.html#ga9104dffc71baf77e710bc63e5e2a8837)
- [`nvtext::ngrams_tokenize`](https://docs.rapids.ai/api/libcudf/stable/group__nvtext__ngrams.html#gace17045b4ee5a3b10157ed40f9575298)

This change improved performance by at least 10% for all string lengths for most of these functions.

Reference #13048

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Mike Wilson (https://github.com/hyperbolic2346)
  - Bradley Dice (https://github.com/bdice)

URL: #13480
@GregoryKimball GregoryKimball moved this from In progress to Story Issue in libcudf Aug 2, 2023
@GregoryKimball GregoryKimball changed the title Story - Improve performance with long strings [FEA] Story - Improve performance with long strings Aug 2, 2023
@GregoryKimball
Copy link
Contributor Author

Hello @davidwendt, I was discussing this issue with @elstehle. Do you think the "combining" or "copying" strings algorithms could benefit from DeviceBatchMemcpy?

@davidwendt
Copy link
Contributor

Hello @davidwendt, I was discussing this issue with @elstehle. Do you think the "combining" or "copying" strings algorithms could benefit from DeviceBatchMemcpy?

There are several parts of the strings (and nvtext) API implementations that basically resolve into a fragmented set of (pointer,size) pairs which are passed to a factory that performs an optimized gather function to build a contiguous byte array for an output strings column. So a significant impact here may be to improve that gather function by taking advantage of DeviceBatchMemcpy perhaps.

The factory function that uses the gather is illustrated in my blog post which use the custom kernel here:

auto result = cudf::make_strings_column(str_ptrs, cudf::string_view{nullptr, 0}, stream);

I'll locate other places that libcudf uses this factory function that also include benchmarks.

@GregoryKimball
Copy link
Contributor Author

Spark-RAPIDS identified long strings performance issues in find (#15405) as well as upper/lower (#15406)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
2 - In Progress Currently a work in progress libcudf Affects libcudf (C++/CUDA) code. Performance Performance related issue strings strings issues (C++ and Python)
Projects
Status: Story Issue
Development

No branches or pull requests

3 participants