Add support for special tokens in nvtext::subword_tokenizer #7254
rerun tests
Codecov Report
@@ Coverage Diff @@
## branch-0.19 #7254 +/- ##
==============================================
Coverage ? 82.20%
==============================================
Files ? 100
Lines ? 16966
Branches ? 0
==============================================
Hits ? 13947
Misses ? 3019
Partials ? 0
Continue to review full report at Codecov.
LGTM.
minor type suggestions.
@gpucibot merge
Hi @davidwendt, I got compilation errors due to this PR:
A suggested fix is to catch the exception by reference.
Compile error (for gcc-9) created by the change in #7254, as mentioned in the following comment: #7254 (comment)
The C++ Core Guidelines say to always catch exceptions by reference. Polymorphism is supported only with pointers and references.
http://isocpp.github.io/CppCoreGuidelines/CppCoreGuidelines#Re-exception-ref
Authors:
- David (@davidwendt)
Approvers:
- Ram (Ramakrishna Prabhu) (@rgsl888prabhu)
- MithunR (@mythrocks)
URL: #7380
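To illustrate why catching by reference matters, here is a minimal sketch. The types `base_error` and `derived_error` are hypothetical and are not the exception types used in cudf. The compiler output is not reproduced in this thread, so the sketch only assumes the problem was of the kind gcc reports with `-Wcatch-value` (a polymorphic exception caught by value). The by-value case is shown as a plain copy so that the sketch itself compiles cleanly with warnings treated as errors.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical exception hierarchy, for illustration only.
struct base_error : std::runtime_error {
  explicit base_error(std::string const& msg) : std::runtime_error(msg) {}
  virtual std::string kind() const { return "base"; }
};

struct derived_error : base_error {
  explicit derived_error(std::string const& msg) : base_error(msg) {}
  std::string kind() const override { return "derived"; }
};

// A handler written as `catch (base_error e)` copy-initializes a base_error
// from the thrown object. This function performs that same copy directly:
// only the base part survives (slicing), so the override is lost.
std::string kind_after_slicing()
{
  derived_error original("failure");
  base_error copy = original;  // sliced copy, as a by-value catch would make
  return copy.kind();
}

// Catching by const reference binds to the thrown object itself,
// so virtual dispatch still reaches derived_error::kind().
std::string kind_caught_by_reference()
{
  try {
    throw derived_error("failure");
  } catch (base_error const& e) {
    return e.kind();
  }
}
```

Calling `kind_after_slicing()` yields the base class answer, while `kind_caught_by_reference()` yields the derived one, which is the behavior the guideline is protecting.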
Closes #6937
This PR adds support for the following 7 special tokens in subword_tokenize: [BOS], [EOS], [UNK], [SEP], [PAD], [CLS], and [MASK]
Descriptions for these can be found in links/text found in #6937
These can be placed anywhere in the text and may be upper or lower-case. They will be recognized regardless of whether they exist in the given vocabulary hash table.
Example using vocab-hash.txt and code snippet from #6937
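The snippet from #6937 is not reproduced here. As a rough, host-side sketch of the recognition rule described above (a fixed set of bracketed tokens, matched anywhere in the text regardless of case and without consulting a vocabulary), something like the following could be used. It is not the GPU implementation in this PR; `find_special_tokens`, `matches_at`, and `special_token_match` are hypothetical names introduced only for this illustration.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// The seven special tokens listed in the PR description.
static const std::vector<std::string> special_tokens = {
  "[BOS]", "[EOS]", "[UNK]", "[SEP]", "[PAD]", "[CLS]", "[MASK]"};

// Hypothetical result type: where a token was found and its upper-case form.
struct special_token_match {
  std::size_t position;  // byte offset in the input text
  std::string token;     // canonical upper-case spelling
};

// Case-insensitive comparison of `token` against `text` starting at `pos`.
// Assumption of this sketch: any mix of upper and lower case is accepted.
bool matches_at(std::string const& text, std::size_t pos, std::string const& token)
{
  if (pos > text.size() || text.size() - pos < token.size()) return false;
  for (std::size_t i = 0; i < token.size(); ++i) {
    auto const c = static_cast<unsigned char>(text[pos + i]);
    if (std::toupper(c) != token[i]) return false;
  }
  return true;
}

// Scan the text once and report every special token, wherever it appears.
// No vocabulary lookup is involved, mirroring the statement that the tokens
// are recognized whether or not they exist in the vocabulary hash table.
std::vector<special_token_match> find_special_tokens(std::string const& text)
{
  std::vector<special_token_match> matches;
  std::size_t pos = 0;
  while (pos < text.size()) {
    bool found = false;
    if (text[pos] == '[') {
      for (auto const& token : special_tokens) {
        if (matches_at(text, pos, token)) {
          matches.push_back({pos, token});
          pos += token.size();  // continue after the matched token
          found = true;
          break;
        }
      }
    }
    if (!found) ++pos;
  }
  return matches;
}
```

For an input such as `"[cls] hello [SEP] world [mask] [foo]"`, this sketch reports three matches and skips `[foo]`, since it is not one of the seven tokens.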
A new gtest was added for this feature.
This requires no API change to the C++ or Python interfaces.