Add nvtext::byte_pair_encoding API #10270

Merged 25 commits into rapidsai:branch-22.04 from fea-byte-pair-encoder on Mar 17, 2022

davidwendt
Contributor

Reference #9657

Add the nvtext::byte_pair_encoding API. This is not the BPE tokenizer but just the encoding function. The tokenizer will be a larger effort that will probably span multiple PRs. Providing the encoder here to be evaluated independently.

In theory, this API could be combined with the existing subword tokenizer as follows to approximate BPE tokenizer behavior:

input = strings to tokenize
mps = nvtext::load_merge_pairs_file("merges.txt");
bpe = nvtext::byte_pair_encoding( input, mps );

vocab = nvtext::load_vocabulary_file( "hashed_vocab.txt" );
result = nvtext::subword_tokenize( bpe, vocab, max_length, stride, lower_case, truncate, max_rows );
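To make the encoding step itself concrete, here is a minimal CPU-side sketch of byte pair encoding for a single word. It is an illustration only and not the libcudf implementation: the names `merge_ranks` and `bpe_encode_word` are hypothetical, the rank of a pair is assumed to be its line index in a merges file, and characters are assumed to be single-byte ASCII.

```cpp
#include <cstddef>
#include <limits>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical illustration; not the libcudf implementation.
// Maps an adjacent pair of symbols to its rank (assumed to be the line index
// in a merges file). A lower rank means the pair is merged earlier.
using merge_ranks = std::map<std::pair<std::string, std::string>, std::size_t>;

// Encodes one word by repeatedly merging the leftmost adjacent pair with the
// lowest rank until no adjacent pair appears in `ranks`.
// Assumes single-byte (ASCII) characters for brevity.
std::vector<std::string> bpe_encode_word(std::string const& word, merge_ranks const& ranks)
{
  // Start with one symbol per character.
  std::vector<std::string> symbols;
  for (char ch : word) { symbols.emplace_back(1, ch); }

  constexpr auto no_rank = std::numeric_limits<std::size_t>::max();
  while (symbols.size() > 1) {
    auto best_rank       = no_rank;
    std::size_t best_pos = 0;
    // Find the adjacent pair with the lowest rank.
    for (std::size_t i = 0; i + 1 < symbols.size(); ++i) {
      auto const it = ranks.find({symbols[i], symbols[i + 1]});
      if (it != ranks.end() && it->second < best_rank) {
        best_rank = it->second;
        best_pos  = i;
      }
    }
    if (best_rank == no_rank) { break; }  // nothing left to merge
    // Merge the pair into a single symbol.
    symbols[best_pos] += symbols[best_pos + 1];
    symbols.erase(symbols.begin() + best_pos + 1);
  }
  return symbols;
}
```

With the ranks `("l","o") -> 0`, `("lo","w") -> 1`, and `("e","r") -> 2`, the word `lower` would be encoded as the two symbols `low` and `er`.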

davidwendt added the feature request, 2 - In Progress, libcudf, strings, and non-breaking labels on Feb 10, 2022
davidwendt self-assigned this on Feb 10, 2022
github-actions bot added the CMake and conda labels on Feb 10, 2022
codecov bot commented Feb 11, 2022

Codecov Report

Merging #10270 (4da9b53) into branch-22.04 (4596244) will increase coverage by 0.04%.
The diff coverage is 100.00%.

@@               Coverage Diff                @@
##           branch-22.04   #10270      +/-   ##
================================================
+ Coverage         86.13%   86.18%   +0.04%     
================================================
  Files               139      139              
  Lines             22438    22468      +30     
================================================
+ Hits              19328    19363      +35     
+ Misses             3110     3105       -5     
Impacted Files Coverage Δ
python/cudf/cudf/core/tools/numeric.py 89.24% <100.00%> (+0.11%) ⬆️
python/dask_cudf/dask_cudf/backends.py 86.44% <100.00%> (+1.47%) ⬆️
...ython/dask_cudf/dask_cudf/io/tests/test_parquet.py 100.00% <100.00%> (ø)
python/cudf/cudf/core/column/string.py 88.39% <0.00%> (+0.12%) ⬆️
python/cudf/cudf/core/groupby/groupby.py 91.57% <0.00%> (+0.22%) ⬆️
python/cudf/cudf/core/column/numerical.py 95.28% <0.00%> (+0.29%) ⬆️
python/cudf/cudf/core/tools/datetimes.py 84.49% <0.00%> (+0.30%) ⬆️
python/cudf/cudf/core/column/lists.py 90.56% <0.00%> (+0.47%) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update d4ce5d5...4da9b53.

rapids-bot bot pushed a commit that referenced this pull request Feb 17, 2022
Fixes declaration of the internal `MurmurHash3_32::hash_combine()` to add the `const` qualifier.

Found this while working on #10270 and trying to call `hash_combine` from a `const` instance.

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Karthikeyan (https://github.com/karthikeyann)
  - Bradley Dice (https://github.com/bdice)
  - Conor Hoekstra (https://github.com/codereport)

URL: #10311
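The `const` fix in the commit message above can be illustrated with a small sketch. The type `example_hasher` and the helper `combine_with` are hypothetical stand-ins, not cudf's `MurmurHash3_32`, and the combining formula is only an assumed example. The point is that a member function must be `const`-qualified before it can be called through a `const` instance or reference.

```cpp
#include <cstdint>

// Hypothetical stand-in for a hasher type; not cudf's MurmurHash3_32.
struct example_hasher {
  using result_type = std::uint32_t;

  // The trailing `const` allows calls on const instances.
  // Removing it makes combine_with() below fail to compile.
  constexpr result_type hash_combine(result_type lhs, result_type rhs) const
  {
    // Assumed example formula, used here only to have something to compute.
    return lhs ^ (rhs + 0x9e3779b9u + (lhs << 6) + (lhs >> 2));
  }
};

// Takes the hasher by const reference, so only const member functions
// of example_hasher can be called here.
inline std::uint32_t combine_with(example_hasher const& hasher, std::uint32_t a, std::uint32_t b)
{
  return hasher.hash_combine(a, b);
}
```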
github-actions bot removed the conda label on Feb 17, 2022
davidwendt marked this pull request as ready for review on March 11, 2022 17:31
davidwendt requested review from a team as code owners on March 11, 2022 17:31
davidwendt requested review from bdice and nvdbaranec on March 11, 2022 17:31
vyasr (Contributor) left a comment

CMake approval

bdice (Contributor) left a comment

Looks great overall. The use of algorithms is very clean. I have some comments to address but generally approve of the design.

cpp/include/nvtext/bpe_tokenize.hpp (outdated, resolved)
cpp/include/nvtext/bpe_tokenize.hpp (outdated, resolved)
template <typename CharType>
constexpr bool is_whitespace(CharType ch)
{
return ch <= ' ';
Contributor

This would treat all the special characters like null (000), bell (007), backspace (010), escape (033) as whitespace. Of course, it also captures whitespace characters like newlines (012), carriage return (015), and tab (011). Is this behavior aligned with how other encoders would handle those special characters, or should this function check for specific characters (space, newline, tab, etc.)? If this is intended behavior, a comment to explain the rationale for that behavior would be helpful.

Possible alternatives:

Contributor Author

Yes, you are correct. This is a shortcut for now since it is faster than the alternatives, but it is also a placeholder in case more complicated handling is needed later -- the tokenizer normalization step may convert all whitespace types into this range.

Contributor

Okay! That sounds fine. If you think it's appropriate, you might add a comment to indicate that this logic is a shortcut/placeholder for catching common ASCII whitespace characters. Otherwise this can be resolved.

(side note: there is also a smattering of Unicode characters that are considered whitespace that this won't catch. Unicode is so complicated. 😄 https://en.wikipedia.org/wiki/Whitespace_character)
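The trade-off discussed in this thread can be sketched by placing the shortcut next to an explicit check. The function `is_ascii_whitespace` below is a hypothetical alternative written for illustration; it is not taken from the PR and is not necessarily what the reviewer proposed.

```cpp
// Shortcut as shown in the PR: every code unit at or below ' ' (0x20)
// is treated as whitespace, including control characters such as NUL and BEL.
template <typename CharType>
constexpr bool is_whitespace_shortcut(CharType ch)
{
  return ch <= ' ';
}

// Hypothetical stricter alternative: accepts only the six ASCII whitespace
// characters (space, tab, newline, vertical tab, form feed, carriage return).
constexpr bool is_ascii_whitespace(char ch)
{
  return ch == ' ' || ch == '\t' || ch == '\n' || ch == '\v' || ch == '\f' || ch == '\r';
}

// Both agree on a newline, but differ on the bell character (007).
static_assert(is_whitespace_shortcut('\n') && is_ascii_whitespace('\n'), "newline");
static_assert(is_whitespace_shortcut('\a') && !is_ascii_whitespace('\a'), "bell");
```

One further caveat, which depends on the character type the caller passes: where `char` is signed, bytes of multi-byte UTF-8 sequences (0x80 and above) compare as negative and would also satisfy `ch <= ' '`, so a cast to `unsigned char` may be worth considering.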

cpp/src/text/subword/bpe_tokenizer.cu (outdated, resolved)
cpp/src/text/subword/bpe_tokenizer.cu (outdated, resolved)
cpp/src/text/subword/bpe_tokenizer.cu (outdated, resolved)
cpp/src/text/subword/load_merges_file.cu (outdated, resolved)
cpp/src/text/subword/load_merges_file.cu (resolved)
cpp/src/text/subword/load_merges_file.cu (outdated, resolved)
cpp/src/text/subword/load_merges_file.cu (outdated, resolved)
davidwendt requested a review from bdice on March 15, 2022 17:58
bdice (Contributor) left a comment

Looks good. I have only a couple minor questions/comments.

auto const d_pair = d_merges.element<cudf::string_view>(idx);
auto const lhs = d_pair.data();
auto const end_str = d_pair.data() + d_pair.size_bytes();
auto const rhs = thrust::find(thrust::seq, lhs, end_str, ' ') + 1;
Contributor

This would cause a segfault from malformed input, right? That sounds undesirable and possibly exploitable. If no space is found, I would return an empty string for the right side. Either way, it's okay that the behavior is undefined -- I'd just aim for safe memory access.
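A minimal host-side sketch of the safer split suggested here: the result of the search is compared against the end of the string before stepping past the space, and the right side is empty when no space is found. The helper name `split_merge_pair` is hypothetical, and this uses `std::find` on the host rather than the `thrust::find` device call in the PR.

```cpp
#include <algorithm>
#include <cstddef>
#include <string_view>
#include <utility>

// Hypothetical helper for illustration; not the code merged in the PR.
// Splits "lhs rhs" at the first space. When no space is present, the whole
// input is returned as the left side and the right side is empty, so no
// pointer is ever advanced past the end of the string.
std::pair<std::string_view, std::string_view> split_merge_pair(std::string_view pair)
{
  auto const begin = pair.data();
  auto const end   = begin + pair.size();
  auto const space = std::find(begin, end, ' ');

  // Malformed input: no separator found.
  if (space == end) { return {pair, std::string_view{}}; }

  auto const lhs_size = static_cast<std::size_t>(space - begin);
  auto const rhs_size = static_cast<std::size_t>(end - (space + 1));
  return {std::string_view{begin, lhs_size}, std::string_view{space + 1, rhs_size}};
}
```

For example, `split_merge_pair("lo w")` yields `lo` and `w`, while `split_merge_pair("low")` yields `low` and an empty right side.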

cpp/src/text/subword/bpe_tokenizer.cu (resolved)
cpp/include/cudf/strings/detail/combine.hpp (resolved)
cpp/include/nvtext/bpe_tokenize.hpp (outdated, resolved)
davidwendt requested a review from nvdbaranec on March 16, 2022 09:32
@davidwendt
Contributor Author

@gpucibot merge

rapids-bot bot merged commit 621d26f into rapidsai:branch-22.04 on Mar 17, 2022
davidwendt deleted the fea-byte-pair-encoder branch on March 17, 2022 23:51
Labels
3 - Ready for Review: Ready for review by team
CMake: CMake build issue
feature request: New feature or request
libcudf: Affects libcudf (C++/CUDA) code.
non-breaking: Non-breaking change
strings: strings issues (C++ and Python)