Add support for special tokens in nvtext::subword_tokenizer #7254

Merged: 7 commits, Feb 11, 2021

davidwendt
Contributor

Closes #6937
This PR adds support for the following 7 special tokens in subword_tokenize:
[BOS], [EOS], [UNK], [SEP], [PAD], [CLS], and [MASK].
Descriptions of these tokens can be found in the links and text in #6937.

These tokens can be placed anywhere in the text and may be upper- or lower-case. They are recognized regardless of whether they exist in the given vocabulary hash table. The following example uses vocab-hash.txt and the code snippet from #6937:

>>> text = '[CLS]I ate dinner.[SEP][BOS]It was yummy.[EOS]'
>>> cudf_ser = cudf.Series([text])
>>> tokens, attention_masks, metadata = cudf_ser.str.subword_tokenize('vocab-hash.txt', do_lower=True, do_truncate=False)
>>> print(id2vocab[tokens[0:17].get()])
['[CLS]' 'i' 'ate' 'dinner' '.' '[SEP]' '[BOS]' 'it' 'was' 'yummy' '.'
 '[EOS]' '[PAD]' '[PAD]' '[PAD]' '[PAD]' '[PAD]']
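
The matching rule described above (a bracketed name from a fixed set of seven, recognized anywhere in the text and in either case) can be sketched on the host roughly as follows. This is an illustrative CPU-only sketch, not the CUDA code in wordpiece_tokenizer.cu. The helper name match_special_token is hypothetical, and comparing case-insensitively per character (which also accepts mixed case) is a simplifying assumption.

```cpp
#include <array>
#include <cctype>
#include <cstddef>
#include <string_view>

// The seven special tokens listed in this PR description.
constexpr std::array<std::string_view, 7> special_tokens{
  "[BOS]", "[EOS]", "[UNK]", "[SEP]", "[PAD]", "[CLS]", "[MASK]"};

// Hypothetical helper: if a special token starts at text[pos], return its
// canonical (upper-case) spelling; otherwise return an empty view.
// Assumption: matching ignores ASCII case, so "[cls]" and "[CLS]" both match.
std::string_view match_special_token(std::string_view text, std::size_t pos)
{
  if (pos >= text.size() || text[pos] != '[') { return {}; }
  for (auto const token : special_tokens) {
    if (text.size() - pos < token.size()) { continue; }  // not enough characters left
    bool matched = true;
    for (std::size_t i = 0; i < token.size() && matched; ++i) {
      auto const ch = static_cast<unsigned char>(text[pos + i]);
      matched = (std::toupper(ch) == static_cast<unsigned char>(token[i]));
    }
    if (matched) { return token; }
  }
  return {};
}
```

Because the lookup uses a fixed list rather than the vocabulary, a token such as [SEP] is found even when the vocabulary hash table has no entry for it, which mirrors the behavior described in this PR.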

A new gtest was added for this feature.
This requires no API change to the C++ or Python interfaces.

@davidwendt added the labels feature request, 3 - Ready for Review, libcudf, and non-breaking on Jan 29, 2021
@davidwendt self-assigned this on Jan 29, 2021
@davidwendt requested a review from a team as a code owner on January 29, 2021, 19:25
@davidwendt
Contributor Author

rerun tests

@codecov

codecov bot commented Feb 4, 2021

Codecov Report

❗ No coverage uploaded for pull request base (branch-0.19@809141d).
The diff coverage is n/a.

@@              Coverage Diff               @@
##             branch-0.19    #7254   +/-   ##
==============================================
  Coverage               ?   82.20%           
==============================================
  Files                  ?      100           
  Lines                  ?    16966           
  Branches               ?        0           
==============================================
  Hits                   ?    13947           
  Misses                 ?     3019           
  Partials               ?        0           

Contributor

@karthikeyann left a comment

LGTM.
minor type suggestions.

Review threads on cpp/src/text/subword/wordpiece_tokenizer.cu: 3 (all resolved, 2 outdated)
@davidwendt
Contributor Author

@gpucibot merge

@rapids-bot merged commit 21d2ce6 into rapidsai:branch-0.19 on Feb 11, 2021
@davidwendt deleted the subword-special-tokens branch on February 11, 2021, 22:36
@ttnghia
Contributor

ttnghia commented Feb 12, 2021

Hi @davidwendt,

I got compilation errors due to this PR:

.../cudf/cpp/src/text/subword/load_hash_file.cu:128:25: error: catching polymorphic type ‘class std::exception’ by value [-Werror=catch-value=]

.../cudf/cpp/src/text/subword/load_hash_file.cu:150:25: error: catching polymorphic type ‘class std::exception’ by value [-Werror=catch-value=]

One way to fix these is to catch the exception by reference.
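
For illustration, here is a minimal sketch of the catch-by-reference pattern. It is not the actual code in load_hash_file.cu, and the function names parse_count and try_parse are hypothetical. Catching std::exception by value copies only the base-class part of the thrown object (slicing), which is what GCC's -Wcatch-value warning flags. Catching by const reference avoids the copy and preserves the dynamic type.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical stand-in for a parsing step that can throw.
int parse_count(std::string const& field)
{
  return std::stoi(field);  // throws std::invalid_argument on non-numeric input
}

// Catch by const reference: no copy, no slicing, and what() is dispatched
// to the derived exception type that was actually thrown.
std::string try_parse(std::string const& field)
{
  try {
    return std::to_string(parse_count(field));
  } catch (std::exception const& e) {  // by-value form: catch (std::exception e)
    return std::string{"error: "} + e.what();
  }
}
```

The only change needed at each flagged catch site is the parameter type. The handler body can stay as it is.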

rapids-bot bot pushed a commit that referenced this pull request Feb 13, 2021
Fixes a compile error (with gcc-9) introduced by the change in #7254, as mentioned in the following comment:
#7254 (comment)

The C++ Core Guidelines on catching exceptions say to always catch by reference.
Polymorphism is supported only through pointers and references.
http://isocpp.github.io/CppCoreGuidelines/CppCoreGuidelines#Re-exception-ref

Authors:
  - David (@davidwendt)

Approvers:
  - Ram (Ramakrishna Prabhu) (@rgsl888prabhu)
  - MithunR (@mythrocks)

URL: #7380