Subword Tokenizer HuggingFace like API (#7942)
This PR closes #5868 by adding a new tokenizer API.

We are seeing speedups even at low batch sizes (10/100), so this should potentially unlock some inference/training use cases for us.

## Benchmarks:

(Thanks to @davidwendt for writing the super fast tokenizer 💥)

| Batch Size | HuggingFace | Rapids Old API | Rapids Tokenizer API | Tokenizer API Speedup vs HuggingFace | Tokenizer API Speedup vs Rapids Old API |
|- |- |- |- |- |- |
| 1 | 0.000242 | 0.006890 | 0.000497 | 0.487 | 13.863 |
| 10 | 0.002800 | 0.007030 | 0.000516 | 5.426 | 13.624 |
| 100 | 0.016200 | 0.007140 | 0.000537 | 30.168 | 13.296 |
| 1000 | 0.149000 | 0.007150 | 0.000517 | 288.201 | 13.830 |

## API Comparison to HuggingFace:

The goal of this PR is to ensure our API matches HuggingFace's as closely as possible to ease porting.

Proposed API in this PR:

```python
from cudf.core.subword_tokenizer import SubwordTokenizer

# The vocabulary is passed as a hashed vocab file.
tokenizer = SubwordTokenizer('bert-base-cased-vocab-hash.txt', do_lower_case=False)

# str_series and seq_len are defined elsewhere
# (the input strings and the maximum sequence length).
output = tokenizer(str_series,
                   max_num_rows=len(str_series),
                   truncation=True,
                   max_length=seq_len,
                   padding='max_length',
                   add_special_tokens=False,
                   return_tensors='pt')
```

HuggingFace API:

```python
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained('bert-base-cased', do_lower_case=False)

# input_sentence_ls and seq_len are defined elsewhere
# (a list of input sentences and the maximum sequence length).
output = tokenizer(input_sentence_ls,
                   truncation=True,
                   max_length=seq_len,
                   padding='max_length',
                   add_special_tokens=False,
                   return_tensors='pt')

# Move the resulting tensors to the GPU.
output_d = {k: v.cuda() for k, v in output.items()}
```

## TODO:

- [x] Add tests
- [x] Throw appropriate warnings for HuggingFace discrepancies
- [x] API checks
- [x] [Benchmark/Example Notebook](https://nbviewer.jupyter.org/gist/VibhuJawa/350a8479b10be3591dd9c4d5da3cfc3b)

CC: @raykallen, @BartleyR (from the cyber team)
CC: @randerzander, @beckernick (from the workflows team)

Authors:
- Vibhu Jawa (https://github.com/VibhuJawa)

Approvers:
- GALI PREM SAGAR (https://github.com/galipremsagar)
- AJ Schmidt (https://github.com/ajschmidt8)
- Keith Kraus (https://github.com/kkraus14)

URL: #7942
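
As a quick sanity check on the benchmark table above, the two speedup columns can be reproduced from the raw timings. The sketch below assumes each speedup is simply a baseline time divided by the new tokenizer API time, rounded to three decimals. The names `TIMINGS` and `speedup_table` are illustrative and not part of cuDF, and the timing units are not stated in the source.

```python
# Timings copied from the benchmark table (units are not stated in the source).
# Columns: batch size, HuggingFace, Rapids old API, Rapids tokenizer API.
TIMINGS = [
    (1, 0.000242, 0.006890, 0.000497),
    (10, 0.002800, 0.007030, 0.000516),
    (100, 0.016200, 0.007140, 0.000537),
    (1000, 0.149000, 0.007150, 0.000517),
]


def speedup_table(timings=TIMINGS):
    """Return (batch_size, speedup_vs_huggingface, speedup_vs_old_api) rows.

    Assumption: speedup = baseline time / new tokenizer API time.
    """
    rows = []
    for batch_size, hf_time, old_time, new_time in timings:
        vs_hf = round(hf_time / new_time, 3)
        vs_old = round(old_time / new_time, 3)
        rows.append((batch_size, vs_hf, vs_old))
    return rows


if __name__ == "__main__":
    for batch_size, vs_hf, vs_old in speedup_table():
        print(f"batch={batch_size:>4}  vs HuggingFace={vs_hf:>8.3f}  vs old API={vs_old:>7.3f}")
```

Under that assumption the computed ratios agree with the table: at batch size 1 the new API is slower than HuggingFace (ratio below 1), while its advantage over the old Rapids API stays roughly constant across batch sizes.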
Showing 19 changed files with 8,491 additions and 5,596 deletions.