Add nvtext::byte_pair_encoding API #10270

Merged

rapids-bot merged 25 commits into rapidsai:branch-22.04 from davidwendt:fea-byte-pair-encoder

Mar 17, 2022

Contributor

davidwendt commented Feb 10, 2022

Reference #9657

Add the nvtext::byte_pair_encoding API. This is only the encoding function, not the full BPE tokenizer. The tokenizer is a larger effort that will probably span multiple PRs, so the encoder is provided here to be evaluated independently.

In theory, this API could be used as follows to approximate BPE tokenizer behavior:

input = strings to tokenize
mps = nvtext::load_merge_pairs_file("merges.txt");
bpe = nvtext::byte_pair_encoding( input, mps );

vocab = nvtext::load_vocabulary_file( "hashed_vocab.txt" );
result = nvtext::subword_tokenize( bpe, vocab, max_length, stride, lower_case, truncate, max_rows );
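
For readers unfamiliar with the encoding step itself, here is a toy CPU sketch of the general byte-pair merge idea on a single ASCII word. It is not the libcudf implementation and does not use the nvtext API: `merge_ranks` and `toy_bpe_encode` are hypothetical names invented for illustration, and the space separator between output pieces is an assumption.

```cpp
#include <cstddef>
#include <limits>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical type (not part of nvtext): maps a merge pair to its rank.
// A lower rank means the pair is merged earlier.
using merge_ranks = std::map<std::pair<std::string, std::string>, std::size_t>;

// Toy byte-pair encoding of one ASCII word.
// Returns the final pieces joined by `separator`.
std::string toy_bpe_encode(std::string const& word,
                           merge_ranks const& ranks,
                           std::string const& separator = " ")
{
  // Start with one piece per byte.
  std::vector<std::string> pieces;
  for (char ch : word) { pieces.emplace_back(1, ch); }

  while (pieces.size() > 1) {
    // Find the adjacent pair with the lowest rank.
    auto best_rank       = std::numeric_limits<std::size_t>::max();
    std::size_t best_pos = pieces.size();
    for (std::size_t i = 0; i + 1 < pieces.size(); ++i) {
      auto const found = ranks.find(std::make_pair(pieces[i], pieces[i + 1]));
      if (found != ranks.end() && found->second < best_rank) {
        best_rank = found->second;
        best_pos  = i;
      }
    }
    if (best_pos == pieces.size()) { break; }  // no mergeable pair is left

    // Merge the chosen pair into a single piece.
    pieces[best_pos] += pieces[best_pos + 1];
    pieces.erase(pieces.begin() + static_cast<std::ptrdiff_t>(best_pos + 1));
  }

  // Join the remaining pieces with the separator.
  std::string result;
  for (std::size_t i = 0; i < pieces.size(); ++i) {
    if (i > 0) { result += separator; }
    result += pieces[i];
  }
  return result;
}
```

With ranks for the pairs ("l", "o"), ("lo", "w") and ("e", "r"), in that order, the word "lower" comes out as "low er". A word with no matching pairs is simply split into single bytes.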


          Add nvtext::byte_pair_encoding API

56758d8

davidwendt added the feature request, 2 - In Progress, libcudf, strings, and non-breaking labels

davidwendt self-assigned this

github-actions bot added the CMake and conda labels

davidwendt added 3 commits

February 10, 2022 15:45


          Merge branch 'branch-22.04' into fea-byte-pair-encoder

fcb540b


          fix call to detail::rsplit_record

ae2baa0


          Merge branch 'branch-22.04' into fea-byte-pair-encoder

5ee29cc

codecov bot commented Feb 11, 2022 (edited)

Codecov Report

Merging #10270 (4da9b53) into branch-22.04 (4596244) will increase coverage by 0.04%.
The diff coverage is 100.00%.

@@               Coverage Diff                @@
##           branch-22.04   #10270      +/-   ##
================================================
+ Coverage         86.13%   86.18%   +0.04%     
================================================
  Files               139      139              
  Lines             22438    22468      +30     
================================================
+ Hits              19328    19363      +35     
+ Misses             3110     3105       -5

Impacted Files	Coverage Δ
python/cudf/cudf/core/tools/numeric.py	`89.24% <100.00%> (+0.11%)`	⬆️
python/dask_cudf/dask_cudf/backends.py	`86.44% <100.00%> (+1.47%)`	⬆️
...ython/dask_cudf/dask_cudf/io/tests/test_parquet.py	`100.00% <100.00%> (ø)`
python/cudf/cudf/core/column/string.py	`88.39% <0.00%> (+0.12%)`	⬆️
python/cudf/cudf/core/groupby/groupby.py	`91.57% <0.00%> (+0.22%)`	⬆️
python/cudf/cudf/core/column/numerical.py	`95.28% <0.00%> (+0.29%)`	⬆️
python/cudf/cudf/core/tools/datetimes.py	`84.49% <0.00%> (+0.30%)`	⬆️
python/cudf/cudf/core/column/lists.py	`90.56% <0.00%> (+0.47%)`	⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update d4ce5d5...4da9b53.

davidwendt mentioned this pull request

Add const qualifier to MurmurHash3_32::hash_combine #10311

Merged

rapids-bot bot pushed a commit that referenced this pull request


          Add const qualifier to MurmurHash3_32::hash_combine (#10311)

fdad597

Fixes declaration of the internal `MurmurHash3_32::hash_combine()` to add the `const` qualifier.

Found this while working on #10270 and trying to call `hash_combine` from a `const` instance.

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Karthikeyan (https://github.com/karthikeyann)
  - Bradley Dice (https://github.com/bdice)
  - Conor Hoekstra (https://github.com/codereport)

URL: #10311


          change algorithm to use cuco::static-map

aa6f8e8

github-actions bot removed the conda label

davidwendt added 7 commits

February 17, 2022 16:48


          Merge branch 'branch-22.04' into fea-byte-pair-encoder

c215c55


          Merge branch 'branch-22.04' into fea-byte-pair-encoder

85df96e


          handle sliced input column

3df89a0


          Merge branch 'branch-22.04' into fea-byte-pair-encoder

ad438f1


          add leading space to test

6eb6171


          Merge branch 'branch-22.04' into fea-byte-pair-encoder


          add separator test

84a2cbe

davidwendt mentioned this pull request

Move standalone UTF8 functions from string_view.hpp to utf8.hpp #10369

Merged

davidwendt added 4 commits

March 1, 2022 10:22


          Merge branch 'branch-22.04' into fea-byte-pair-encoder

1d35f19


          Merge branch 'branch-22.04' into fea-byte-pair-encoder

61195b7


          Merge branch 'branch-22.04' into fea-byte-pair-encoder

fe7ada7


          fix typos in and clarify comments

d282330

davidwendt marked this pull request as ready for review

March 11, 2022 17:31

davidwendt requested review from a team as code owners

March 11, 2022 17:31

davidwendt requested review from bdice and nvdbaranec

March 11, 2022 17:31

vyasr approved these changes

Contributor

vyasr left a comment

CMake approval


          Merge branch 'branch-22.04' into fea-byte-pair-encoder

f9cdc4f

bdice requested changes

Contributor

bdice left a comment

Looks great overall. The use of algorithms is very clean. I have some comments to address but generally approve of the design.

cpp/include/nvtext/bpe_tokenize.hpp Outdated (resolved)

cpp/include/nvtext/bpe_tokenize.hpp Outdated (resolved)

cpp/src/text/subword/bpe_tokenizer.cu

+              template <typename CharType>
+              constexpr bool is_whitespace(CharType ch)
+              {
+                return ch <= ' ';

Contributor

bdice Mar 14, 2022

This would treat all the special characters like null (000), bell (007), backspace (010), escape (033) as whitespace. Of course, it also captures whitespace characters like newlines (012), carriage return (015), and tab (011). Is this behavior aligned with how other encoders would handle those special characters, or should this function check for specific characters (space, newline, tab, etc.)? If this is intended behavior, a comment to explain the rationale for that behavior would be helpful.

Possible alternatives:

  - std::isspace (https://en.cppreference.com/w/cpp/string/byte/isspace)
  - Whatever mechanism cudf::strings::string_character_types::SPACE uses

Contributor Author

davidwendt Mar 15, 2022

Yes, you are correct. This is a shortcut for now since it is faster than the alternatives, but it is also a placeholder in case more complicated handling is needed later -- the tokenizer normalization step may convert all whitespace types into this range.

Contributor

bdice Mar 15, 2022

Okay! That sounds fine. If you think it's appropriate, you might add a comment to indicate that this logic is a shortcut/placeholder for catching common ASCII whitespace characters. Otherwise this can be resolved.

(side note: there is also a smattering of Unicode characters that are considered whitespace that this won't catch. Unicode is so complicated. 😄 https://en.wikipedia.org/wiki/Whitespace_character)
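
To make the difference discussed in this thread concrete, here is a small standalone sketch. `is_whitespace` is copied from the diff above; `is_ascii_space` is a hypothetical stricter variant that is not part of the PR and only accepts the six characters std::isspace accepts in the default "C" locale.

```cpp
// Shortcut from the diff: every value at or below the space character
// (0x20) is treated as whitespace, including control characters.
template <typename CharType>
constexpr bool is_whitespace(CharType ch)
{
  return ch <= ' ';
}

// Hypothetical stricter variant (not in the PR): space plus the contiguous
// range '\t' (0x09) through '\r' (0x0D), i.e. tab, newline, vertical tab,
// form feed and carriage return.
template <typename CharType>
constexpr bool is_ascii_space(CharType ch)
{
  return ch == ' ' || (ch >= '\t' && ch <= '\r');
}

// Both versions agree on ordinary whitespace and on printable characters.
static_assert(is_whitespace('\t') && is_ascii_space('\t'), "tab is whitespace in both");
static_assert(!is_whitespace('a') && !is_ascii_space('a'), "letters are not whitespace");

// They disagree on control characters such as bell (007) and escape (033).
static_assert(is_whitespace('\a') && !is_ascii_space('\a'), "bell only matches the shortcut");
static_assert(is_whitespace('\x1b') && !is_ascii_space('\x1b'), "escape only matches the shortcut");
```

The shortcut is a single comparison, which fits the performance argument above. The stricter variant costs a few more comparisons and still does not cover the Unicode whitespace characters mentioned in the side note.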

cpp/src/text/subword/bpe_tokenizer.cu Outdated (resolved)

cpp/src/text/subword/bpe_tokenizer.cu Outdated (resolved)

cpp/src/text/subword/bpe_tokenizer.cu Outdated (resolved)

cpp/src/text/subword/load_merges_file.cu Outdated (resolved)

cpp/src/text/subword/load_merges_file.cu (resolved)

cpp/src/text/subword/load_merges_file.cu Outdated (resolved)

cpp/src/text/subword/load_merges_file.cu Outdated (resolved)

davidwendt added 3 commits

March 15, 2022 09:04


          Merge branch 'branch-22.04' into fea-byte-pair-encoder

2fc267c


          fix grammar and typos

93b0842


          Merge branch 'branch-22.04' into fea-byte-pair-encoder

bbc3744

davidwendt requested a review from bdice

March 15, 2022 17:58

bdice approved these changes

Contributor

bdice left a comment

Looks good. I have only a couple minor questions/comments.

cpp/src/text/subword/bpe_tokenizer.cu Outdated

+                  auto const d_pair   = d_merges.element<cudf::string_view>(idx);
+                  auto const lhs      = d_pair.data();
+                  auto const end_str  = d_pair.data() + d_pair.size_bytes();
+                  auto const rhs      = thrust::find(thrust::seq, lhs, end_str, ' ') + 1;

Contributor

bdice Mar 15, 2022

This would cause a segfault from malformed input, right? That sounds undesirable and possibly exploitable. If no space is found, I would return an empty string for the right side. Either way, it's okay that the behavior is undefined -- I'd just aim for safe memory access.
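
A host-side sketch of the guard suggested here, assuming the merge entry is a single "lhs rhs" string. `split_merge_pair` is a hypothetical helper and not code from the PR; std::find stands in for thrust::find, which likewise returns the end iterator when nothing is found.

```cpp
#include <algorithm>
#include <string>
#include <utility>

// Hypothetical helper: split a merge entry at its first space.
// If the entry contains no space, the right-hand side is empty rather than
// pointing one element past the end of the data.
std::pair<std::string, std::string> split_merge_pair(std::string const& entry)
{
  auto const begin = entry.begin();
  auto const end   = entry.end();
  auto const space = std::find(begin, end, ' ');

  // Only step over the separator when one was actually found.
  auto const rhs_begin = (space == end) ? end : space + 1;

  return {std::string(begin, space), std::string(rhs_begin, end)};
}
```

For a well-formed entry such as "lo w" this yields ("lo", "w"). A malformed entry such as "low" yields ("low", "") instead of reading past the end of the string.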

cpp/src/text/subword/bpe_tokenizer.cu (resolved)

nvdbaranec requested changes

cpp/include/cudf/strings/detail/combine.hpp (resolved)

cpp/include/nvtext/bpe_tokenize.hpp Outdated (resolved)

davidwendt added 3 commits

March 15, 2022 17:18


          add more entries in load_merge_pairs_file doxygen example

845a414


          add check for unexpected data format

060077b


          Merge branch 'branch-22.04' into fea-byte-pair-encoder

cdee746

davidwendt requested a review from nvdbaranec

March 16, 2022 09:32


          Merge branch 'branch-22.04' into fea-byte-pair-encoder

4da9b53

nvdbaranec approved these changes

Contributor Author

davidwendt commented Mar 17, 2022

@gpucibot merge

rapids-bot bot merged commit 621d26f into rapidsai:branch-22.04

davidwendt deleted the fea-byte-pair-encoder branch

March 17, 2022 23:51

GregoryKimball mentioned this pull request

[FEA] Improve ORC reader filtering and performance #13882

Open

Labels

3 - Ready for Review, CMake, feature request, libcudf, non-breaking, strings