-
Notifications
You must be signed in to change notification settings - Fork 915
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[BUG] Operations with string_view
involving character counting are very inefficient
#12445
Labels
0 - Backlog
In queue waiting for assignment
bug
Something isn't working
libcudf
Affects libcudf (C++/CUDA) code.
strings
strings issues (C++ and Python)
Comments
ttnghia
added
bug
Something isn't working
Needs Triage
Need team to review and classify
labels
Dec 28, 2022
CC @davidwendt for visibility. |
Recommend not using |
GregoryKimball
added
0 - Backlog
In queue waiting for assignment
libcudf
Affects libcudf (C++/CUDA) code.
strings
strings issues (C++ and Python)
and removed
Needs Triage
Need team to review and classify
labels
Apr 2, 2023
This was referenced Apr 4, 2023
rapids-bot bot
pushed a commit
that referenced
this issue
Apr 18, 2023
Improves on performance for longer strings with `cudf::strings::slice_strings()` API. The `cudf::string_view::substr` was reworked to minimize counting characters and the gather version of `make_strings_children` is used to build the resulting strings column. This version is already optimized for small and large strings. Additionally, the code was refactored so the common case of `step==1 and start < stop` can also make use of the gather approach. Common code was also grouped closer together to help navigate the source file better. The `slice.cpp` benchmark was updated to better measure large strings with comparable slice boundaries. The benchmark showed performance improvement was up to 9x for larger strings with no significant degradation for smaller strings. Reference #13048 and #12445 Authors: - David Wendt (https://github.com/davidwendt) Approvers: - Nghia Truong (https://github.com/ttnghia) - Elias Stehle (https://github.com/elstehle) URL: #13057
shwina
pushed a commit
to shwina/cudf
that referenced
this issue
Apr 18, 2023
Improves on performance for longer strings with `cudf::strings::slice_strings()` API. The `cudf::string_view::substr` was reworked to minimize counting characters and the gather version of `make_strings_children` is used to build the resulting strings column. This version is already optimized for small and large strings. Additionally, the code was refactored so the common case of `step==1 and start < stop` can also make use of the gather approach. Common code was also grouped closer together to help navigate the source file better. The `slice.cpp` benchmark was updated to better measure large strings with comparable slice boundaries. The benchmark showed performance improvement was up to 9x for larger strings with no significant degradation for smaller strings. Reference rapidsai#13048 and rapidsai#12445 Authors: - David Wendt (https://github.com/davidwendt) Approvers: - Nghia Truong (https://github.com/ttnghia) - Elias Stehle (https://github.com/elstehle) URL: rapidsai#13057
rapids-bot bot
pushed a commit
that referenced
this issue
Apr 20, 2023
Improves performance of calls to `cudf::string_view::find` functions for longer strings by minimizing character and byte counting required for resolving the input parameters. The input values are in character positions/counts so range checks do not include bytes that may be in partial characters. Converting these to bytes requires some counting usually from the beginning of the string for any given character position. The code has been reworked to minimizing this counting as much as possible. This improves performance overall by 10% but greatly improves performance for long strings by up to 3x. Closes #12445 Authors: - David Wendt (https://github.com/davidwendt) Approvers: - Bradley Dice (https://github.com/bdice) - Nghia Truong (https://github.com/ttnghia) URL: #13100
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Labels
0 - Backlog
In queue waiting for assignment
bug
Something isn't working
libcudf
Affects libcudf (C++/CUDA) code.
strings
strings issues (C++ and Python)
Due to the nature of strings columns that they only contain offsets to determine string boundaries, there is no way to know in
O(1)
how many characters were stored in each string. As such, when an operation requires to know the number of characters, an expensive sequential count is executed:Because of that, a simple operation like
substr
can be extremely expensive. One of my tests with 1.4 million (customized) substring ops from just one source string can be slow like this:When I just use the simple way of
memcpy
(which cannot deal with utf8 character, assuming that the input only has ASCII) then the performance went back to be reasonable:Summary for the tables above: using just
memcpy
for substring without character counting can improve performance from96
seconds down to19
milliseconds.As such, the way of implementing strings column should be improved. We may need to have some efficient way to compute the number of characters in parallel (such as using
thrust::reduce_by_key
), or have a new way to store strings such that we can compute the number of characters inO(1)
time.I don't have a very good idea of how to implement and use those efficiently for various internal
string_view
operations but they should be on the table. Otherwise, we will hit this performance issue more in the future.The text was updated successfully, but these errors were encountered: