-
Notifications
You must be signed in to change notification settings - Fork 915
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add nvtext::byte_pair_encoding API #10270
Add nvtext::byte_pair_encoding API #10270
Conversation
Codecov Report
@@ Coverage Diff @@
## branch-22.04 #10270 +/- ##
================================================
+ Coverage 86.13% 86.18% +0.04%
================================================
Files 139 139
Lines 22438 22468 +30
================================================
+ Hits 19328 19363 +35
+ Misses 3110 3105 -5
Continue to review full report at Codecov.
|
Fixes declaration of the internal `MurmurHash3_32::hash_combine()` to add the `const` qualifier. Found this while working on #10270 and trying to call `hash_combine` from a `const` instance. Authors: - David Wendt (https://github.com/davidwendt) Approvers: - Karthikeyan (https://github.com/karthikeyann) - Bradley Dice (https://github.com/bdice) - Conor Hoekstra (https://github.com/codereport) URL: #10311
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CMake approval
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great overall. The use of algorithms is very clean. I have some comments to address but generally approve of the design.
template <typename CharType> | ||
constexpr bool is_whitespace(CharType ch) | ||
{ | ||
return ch <= ' '; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would treat all the special characters like null (000), bell (007), backspace (010), escape (033) as whitespace. Of course, it also captures whitespace characters like newlines (012), carriage return (015), and tab (011). Is this behavior aligned with how other encoders would handle those special characters, or should this function check for specific characters (space, newline, tab, etc.)? If this is intended behavior, a comment to explain the rationale for that behavior would be helpful.
Possible alternatives:
std::isspace
https://en.cppreference.com/w/cpp/string/byte/isspace- However
cudf::strings::string_character_types::SPACE
works
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, you are correct. This is a shortcut for now since it faster than the alternatives but is also a place-holder in case more complicated handling is needed later -- the tokenizer normalization step may convert all whitespace types into this range.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay! That sounds fine. If you think it's appropriate, you might add a comment to indicate that this logic is a shortcut/placeholder for catching common ASCII whitespace characters. Otherwise this can be resolved.
(side note: there are also a smattering of Unicode characters that are considered whitespace that this won't catch. Unicode is so complicated. 😄 https://en.wikipedia.org/wiki/Whitespace_character)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. I have only a couple minor questions/comments.
template <typename CharType> | ||
constexpr bool is_whitespace(CharType ch) | ||
{ | ||
return ch <= ' '; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay! That sounds fine. If you think it's appropriate, you might add a comment to indicate that this logic is a shortcut/placeholder for catching common ASCII whitespace characters. Otherwise this can be resolved.
(side note: there are also a smattering of Unicode characters that are considered whitespace that this won't catch. Unicode is so complicated. 😄 https://en.wikipedia.org/wiki/Whitespace_character)
auto const d_pair = d_merges.element<cudf::string_view>(idx); | ||
auto const lhs = d_pair.data(); | ||
auto const end_str = d_pair.data() + d_pair.size_bytes(); | ||
auto const rhs = thrust::find(thrust::seq, lhs, end_str, ' ') + 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would cause a segfault from malformed input, right? That sounds undesirable and possibly exploitable. If no space is found, I would return an empty string for the right side. Either way, it's okay that the behavior is undefined -- I'd just aim for safe memory access.
@gpucibot merge |
Reference #9657
Add the
nvtext::byte_pair_encoding
API. This is not the BPE tokenizer but just the encoding function. The tokenizer will be a larger effort that will probably span multiple PRs. Providing the encoder here to be evaluated independently.Theoretically, this API could be used like the following to achieve a similar BPE tokenizer behavior perhaps: