Performance improvement for nvtext tokenize/token functions #13480
Conversation
[Benchmark charts: performance results for the affected tokenize functions, including `nvtext::ngrams_tokenize`]
…ings (#13322)

Changes the internal regex logic to minimize character counting, which helps performance with longer strings. The improvement applies mainly to libcudf regex functions that return strings (i.e. extract, replace, split). The changes here also improve the internal device APIs for clarity and easier maintenance. The most significant change makes the position variables input-only and returns an optional pair to indicate a successful match.

Some further optimization is possible: character positions that are passed back and forth could be replaced with byte positions to reduce counting even more. Initial measurements showed this noticeably slowed down small strings, so more analysis is required before continuing that optimization.

Reference: #13480

### More Detail

First, there is a change to some internal regex function signatures, notably the `reprog_device::find()` and `reprog_device::extract()` member functions declared in `cpp/src/strings/regex/regex.cuh`, which are used by all the libcudf regex functions. The in/out parameters are now input-only parameters (passed by value), and the return is an optional pair that includes the match result. Also, the `begin` parameter is now an iterator and the `end` parameter now has a default. This change requires updating all the definitions and uses of the `find` and `extract` member functions.

Using an iterator as the `begin` parameter allows some optimizations in the calling code to minimize the character counting that may be needed when processing multi-byte UTF-8 characters. Rather than using the `cudf::string_view::byte_offset()` member function to convert character positions to byte positions, an iterator can be incremented as we traverse the string, which helps reduce some character counting.
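The in/out-to-optional-pair change described above can be sketched as follows. This is a minimal host-side illustration, not the actual `reprog_device` API: the function name `find`, the toy "match an `'x'`" logic, and the `match_range` alias are all hypothetical stand-ins for the real device code.

```cpp
#include <cassert>
#include <optional>
#include <utility>

// [begin, end) character positions of a match; illustrative only.
using match_range = std::pair<int, int>;

// Sketch of the new signature style: `begin` is input-only (passed by
// value) and success plus the match range come back as an optional pair,
// instead of being written through in/out reference parameters.
std::optional<match_range> find(char const* str, int size, int begin)
{
  // toy "regex": match the first 'x' at or after `begin`
  for (int pos = begin; pos < size; ++pos) {
    if (str[pos] == 'x') { return match_range{pos, pos + 1}; }
  }
  return std::nullopt;  // no match: the caller checks the optional
}
```

A caller then tests the optional directly, e.g. `if (auto m = find(s, n, 0)) { /* use m->first, m->second */ }`, rather than inspecting mutated out-parameters.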
So the changes here involve removing some calls to `byte_offset()` and incrementing (really moving) iterators with a pattern like `itr += (new_pos - itr.position());`. There is another PR, #13428, to add a `move_to` iterator member function. As mentioned above, it is possible to reduce the character counting even further, but that optimization requires some deeper analysis.

Authors:
- David Wendt (https://github.com/davidwendt)

Approvers:
- Mark Harris (https://github.com/harrism)
- MithunR (https://github.com/mythrocks)

URL: #13322
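The iterator-advance pattern can be sketched on the host like this. `char_iterator` is a hypothetical stand-in for a `cudf::string_view` character iterator: it tracks both a byte offset and a character position over UTF-8 data, so advancing it only walks the bytes between the old and new positions instead of recounting from the start of the string.

```cpp
#include <cassert>

// Hypothetical character iterator over UTF-8 bytes (illustrative only).
struct char_iterator {
  char const* data;
  int byte_pos{0};  // current byte offset
  int char_pos{0};  // current character position

  // number of bytes in the UTF-8 character whose first byte is `b`
  static int char_bytes(unsigned char b)
  {
    if (b < 0x80) { return 1; }
    if ((b & 0xE0) == 0xC0) { return 2; }
    if ((b & 0xF0) == 0xE0) { return 3; }
    return 4;
  }

  int position() const { return char_pos; }

  // advance by `n` characters, touching only the bytes actually traversed
  char_iterator& operator+=(int n)
  {
    while (n-- > 0) {
      byte_pos += char_bytes(static_cast<unsigned char>(data[byte_pos]));
      ++char_pos;
    }
    return *this;
  }
};
```

With such an iterator, `itr += (new_pos - itr.position());` moves forward incrementally, whereas a `byte_offset(new_pos)`-style call would have to count characters from offset zero each time.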
Looks good, one random comment for my own edification.
```diff
@@ -58,7 +58,7 @@ namespace {
  */
 struct normalize_spaces_fn {
   cudf::column_device_view const d_strings;  // strings to normalize
-  int32_t* d_offsets{};          // offsets into d_buffer
+  cudf::size_type* d_offsets{};  // offsets into d_chars
```
Have we deprecated offset_type entirely? Has it been removed?
Not yet. I can look into doing that in a separate PR.
/merge
Fixes memcheck error found by the nightly build in the nvtext `characters_tokenizer` utility function.

```
[ RUN      ] TextNgramsTokenizeTest.Tokenize
========= Invalid __global__ read of size 1 bytes
=========     at 0x2360 in void cub::CUB_101702_610_860_NS::DeviceScanKernel<cub::CUB_101702_610_860_NS::DeviceScanPolicy<int>::Policy600, thrust::cuda_cub::transform_input_iterator_t<int, thrust::counting_iterator<int, thrust::use_default, thrust::use_default, thrust::use_default>, nvtext::detail::strings_tokenizer>, int *, cub::CUB_101702_610_860_NS::ScanTileState<int, (bool)1>, thrust::plus<int>, cub::CUB_101702_610_860_NS::NullType, int>(T2, T3, T4, int, T5, T6, T7)
=========     by thread (5,0,0) in block (0,0,0)
=========     Address 0x7f67a0200a65 is out of bounds
=========     and is 1 bytes after the nearest allocation at 0x7f67a0200a00 of size 101 bytes
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame: [0x30b492]
=========                in /usr/lib/x86_64-linux-gnu/libcuda.so.1
=========     Host Frame: [0x1488c]
=========                in /conda/envs/rapids/lib/libcudart.so.11.0
=========     Host Frame:cudaLaunchKernel [0x6c318]
=========                in /conda/envs/rapids/lib/libcudart.so.11.0
=========     Host Frame:nvtext::detail::ngrams_tokenize(cudf::strings_column_view const&, int, cudf::string_scalar const&, cudf::string_scalar const&, rmm::cuda_stream_view, rmm::mr::device_memory_resource*) [0x2693cc9]
=========                in /conda/envs/rapids/lib/libcudf.so
```

This error was introduced by changes in #13480.

Authors:
- David Wendt (https://github.com/davidwendt)

Approvers:
- Vyas Ramasubramani (https://github.com/vyasr)
- Bradley Dice (https://github.aaakk us.kg/bdice)

URL: #13649
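The memcheck report above is the classic signature of a read one byte past the end of an allocation. A common way this class of bug arises in a tokenizer is evaluating `data[pos]` before (or without) checking `pos < size`. The sketch below is purely illustrative of that pattern and its fix; it is not the actual `characters_tokenizer` code, and the function names are hypothetical.

```cpp
#include <cassert>

// BUGGY pattern (illustrative): data[pos] is read even when pos == size,
// touching 1 byte past the end of the buffer -- exactly the kind of
// out-of-bounds access a memcheck tool flags.
bool in_token_buggy(char const* data, int size, int pos)
{
  return data[pos] != ' ' && pos < size;  // read happens before bounds check
}

// FIXED pattern: the bounds check short-circuits before the read,
// so data[pos] is never dereferenced when pos == size.
bool in_token_fixed(char const* data, int size, int pos)
{
  return pos < size && data[pos] != ' ';
}
```

Reordering the conjunction costs nothing at runtime but guarantees the read is always in bounds.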
Description

Improves performance for nvtext tokenize functions by minimizing character counting in the `characters_tokenize` utility functor in `src/text/utilities/tokenize_ops.cuh`. Functions this change affects are:

- `nvtext::count_tokens` (single delimiter and whitespace)
- `nvtext::tokenize` (single delimiter and whitespace)
- `nvtext::replace_tokens`
- `nvtext::normalize_spaces`
- `nvtext::ngrams_tokenize`
This change improved performance by at least 10% for all string lengths for most of these functions.
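The token-counting idea behind these functions can be sketched on the host as a single forward scan that never revisits earlier positions. This is a simplified, illustrative version only; the real functor in `src/text/utilities/tokenize_ops.cuh` runs per-string on the device and handles multi-byte delimiters and whitespace rules.

```cpp
#include <cassert>
#include <string>

// Count delimiter-separated tokens in one forward pass over the bytes.
// A token starts wherever a non-delimiter byte follows a delimiter (or
// the start of the string); no backtracking or re-counting is needed.
int count_tokens(std::string const& str, char delimiter = ' ')
{
  int count      = 0;
  bool in_token  = false;
  for (char ch : str) {
    bool const is_delim = (ch == delimiter);
    if (!is_delim && !in_token) { ++count; }  // a new token starts here
    in_token = !is_delim;
  }
  return count;
}
```

For example, `count_tokens("  hello  world ")` yields 2: each run of non-delimiter bytes is counted exactly once, regardless of how many delimiters separate the runs.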
Reference #13048
Checklist