Subword Tokenizer HuggingFace like API #7942
…a/cudf into fea_subword_inmem_hash_bindings
Changed DataFiles to allow for testing against HuggingFace.
We previously had
approved now that rapidsai/integration#249 is merged.
@kkraus14, @galipremsagar: thanks a ton for your detailed reviews. I think all review comments have been addressed, and this PR is ready for another look.
@gpucibot merge
rerun tests
This PR closes #5868 by adding a new tokenizer API.
We are seeing speedups even at low batch sizes (10 and 100), so this could unlock some inference and training use cases for us.
Benchmarks:
(Thanks to @davidwendt for writing the super fast tokenizer 💥 )
API Comparison to HuggingFace:
The goal of this PR is to make our API match HuggingFace's as closely as possible, so that porting existing code is easy.
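To make the target call shape concrete, here is a minimal pure-Python sketch of a HuggingFace-style tokenizer call. It is an illustration only: `ToyTokenizer`, its vocabulary, and the `pad_id`/`unk_id` defaults are hypothetical and are not the class added in this PR, and whitespace splitting stands in for real subword tokenization. The keyword names (`max_length`, `padding`, `truncation`) and the output keys (`input_ids`, `attention_mask`) follow the HuggingFace convention that the PR aims to mirror.

```python
from typing import Dict, List


class ToyTokenizer:
    """Hypothetical stand-in that mimics the HuggingFace call shape (not the cuDF class)."""

    def __init__(self, vocab: Dict[str, int], do_lower_case: bool = True,
                 pad_id: int = 0, unk_id: int = 1):
        self.vocab = vocab
        self.do_lower_case = do_lower_case
        self.pad_id = pad_id
        self.unk_id = unk_id

    def __call__(self, text: List[str], max_length: int,
                 padding: str = "max_length",
                 truncation: bool = True) -> Dict[str, List[List[int]]]:
        input_ids, attention_mask = [], []
        for sentence in text:
            if self.do_lower_case:
                sentence = sentence.lower()
            # Assumption: whitespace split replaces real subword splitting.
            ids = [self.vocab.get(tok, self.unk_id) for tok in sentence.split()]
            if truncation:
                ids = ids[:max_length]
            mask = [1] * len(ids)
            if padding == "max_length":
                # Pad every row to the same fixed length.
                pad = max_length - len(ids)
                ids = ids + [self.pad_id] * pad
                mask = mask + [0] * pad
            input_ids.append(ids)
            attention_mask.append(mask)
        return {"input_ids": input_ids, "attention_mask": attention_mask}


# Hypothetical toy vocabulary for the demonstration.
vocab = {"[PAD]": 0, "[UNK]": 1, "hello": 2, "gpu": 3, "world": 4}
tokenizer = ToyTokenizer(vocab)
encoded = tokenizer(["Hello GPU world", "hello there"], max_length=4)
```

Because the call signature and the returned keys line up with the HuggingFace convention, downstream code that consumes `encoded["input_ids"]` and `encoded["attention_mask"]` would not need to change when the tokenizer is swapped.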
Proposed API in this PR:
HuggingFace API:
TODO:
CC: @raykallen, @BartleyR (from the cyber team)
CC: @randerzander, @beckernick (from the workflows team)