
Subword Tokenizer HuggingFace like API #7942

Merged

Conversation

VibhuJawa
Member

@VibhuJawa VibhuJawa commented Apr 12, 2021

This PR closes #5868 by adding a new tokenizer API.

We are seeing speedups even at low batch sizes (10/100), so this should unlock some inference/training use cases for us.

Benchmarks:

(Thanks to @davidwendt for writing the super fast tokenizer 💥 )

| Batch Size | HuggingFace | Rapids Old API | Rapids New Tokenizer API | New API Speedup vs HuggingFace | New API Speedup vs Old API |
| --- | --- | --- | --- | --- | --- |
| 1 | 0.000242 | 0.006890 | 0.000497 | 0.487 | 13.863 |
| 10 | 0.002800 | 0.007030 | 0.000516 | 5.426 | 13.624 |
| 100 | 0.016200 | 0.007140 | 0.000537 | 30.168 | 13.296 |
| 1000 | 0.149000 | 0.007150 | 0.000517 | 288.201 | 13.830 |
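The speedup columns are simply ratios of the measured times. A quick sanity check in plain Python (times copied from the benchmark table; the interpretation of the two speedup columns as HF/new and old/new is an assumption that the numbers bear out):

```python
# Recompute the speedup columns from the raw timings above.
# Assumed column meanings:
#   speedup vs HuggingFace = HuggingFace time / new-API time
#   speedup vs old API     = old-API time / new-API time
rows = {
    1:    (0.000242, 0.006890, 0.000497),
    10:   (0.002800, 0.007030, 0.000516),
    100:  (0.016200, 0.007140, 0.000537),
    1000: (0.149000, 0.007150, 0.000517),
}
for batch, (hf, old, new) in rows.items():
    print(batch, round(hf / new, 3), round(old / new, 3))
```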

API Comparison with HuggingFace:

The goal of this PR is to match the HuggingFace API as closely as possible to ease porting.

Proposed API in this PR:

```python
from cudf.core.subword_tokenizer import SubwordTokenizer

tokenizer = SubwordTokenizer('bert-base-cased-vocab-hash.txt', do_lower_case=False)

output = tokenizer(str_series,
                   max_num_rows=len(str_series),
                   truncation=True,
                   max_length=seq_len,
                   padding='max_length',
                   add_special_tokens=False,
                   return_tensors='pt')
```

HuggingFace API:

```python
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained('bert-base-cased', do_lower_case=False)

output = tokenizer(input_sentence_ls,
                   truncation=True,
                   max_length=seq_len,
                   padding='max_length',
                   add_special_tokens=False,
                   return_tensors='pt')
# Extra step vs. the cuDF API: move the CPU tensors onto the GPU
output_d = {k: v.cuda() for k, v in output.items()}
```
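For readers unfamiliar with what either tokenizer is doing under the hood, the core idea of WordPiece-style subword tokenization (the scheme BERT vocabularies use) can be sketched in a few lines of plain Python. This is a conceptual illustration only, not the cuDF or HuggingFace implementation, and the tiny vocabulary is made up:

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first subword split (WordPiece-style)."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            sub = word[start:end]
            if start > 0:          # continuation pieces carry a "##" prefix
                sub = "##" + sub
            if sub in vocab:
                piece = sub
                break
            end -= 1
        if piece is None:          # no subword matched: the word is unknown
            return [unk]
        pieces.append(piece)
        start = end
    return pieces

# Toy vocabulary, purely illustrative.
vocab = {"token", "##izer", "##ize", "fast"}
print(wordpiece_tokenize("tokenizer", vocab))  # ['token', '##izer']
```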

TODO:

CC: @raykallen, @BartleyR (from the cyber team)

CC: @randerzander , @beckernick (from the workflows team)

@VibhuJawa VibhuJawa added feature request New feature or request 2 - In Progress Currently a work in progress Python Affects Python cuDF API. labels Apr 12, 2021
@VibhuJawa VibhuJawa changed the title Subword Tokenizer API Subword Tokenizer API [skip-ci] Apr 12, 2021
@VibhuJawa VibhuJawa added the non-breaking Non-breaking change label Apr 15, 2021
Member Author

@VibhuJawa VibhuJawa left a comment


Changed DataFiles to allow for testing against HuggingFace.

@github-actions github-actions bot added the gpuCI label Apr 15, 2021
@VibhuJawa VibhuJawa added gpuCI and removed gpuCI labels Apr 15, 2021
@VibhuJawa VibhuJawa changed the title Subword Tokenizer API [skip-ci] Subword Tokenizer HuggingFace like API Apr 15, 2021
ci/gpu/build.sh (outdated review thread, resolved)
@VibhuJawa VibhuJawa marked this pull request as ready for review April 15, 2021 22:04
@VibhuJawa VibhuJawa requested review from a team as code owners April 15, 2021 22:04
@VibhuJawa
Member Author

> Also, is it necessary to change the text files used for testing here? Changing big files like this just blows up the git repository size over time.

We previously had an uncased vocabulary that was randomly sampled, which made it difficult to run meaningful tests with it. The new vocabulary is cased and produces meaningful tokenization. We can probably download these files on the fly if keeping them in the repository is prohibitive. FWIW, I don't expect this file to change again in the future.

Member

@ajschmidt8 ajschmidt8 left a comment


approved now that rapidsai/integration#249 is merged.

@VibhuJawa VibhuJawa added 3 - Ready for Review Ready for review by team and removed 0 - Waiting on Author Waiting for author to respond to review labels Apr 22, 2021
@VibhuJawa
Member Author

@kkraus14, @galipremsagar. Thanks a ton for your detailed reviews.

I think all reviews have been addressed, and this PR is ready for another review.

@kkraus14 kkraus14 added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 3 - Ready for Review Ready for review by team labels Apr 22, 2021
@kkraus14
Collaborator

@gpucibot merge

@galipremsagar
Contributor

rerun tests

1 similar comment
@VibhuJawa
Member Author

rerun tests

@kkraus14 kkraus14 added 0 - Blocked Cannot progress due to external reasons and removed 5 - Ready to Merge Testing and reviews complete, ready to merge labels Apr 27, 2021
@VibhuJawa VibhuJawa removed the 0 - Blocked Cannot progress due to external reasons label May 3, 2021
@rapids-bot rapids-bot bot merged commit 36eaa06 into rapidsai:branch-0.20 May 3, 2021
Labels
feature request New feature or request non-breaking Non-breaking change Python Affects Python cuDF API.
Development

Successfully merging this pull request may close these issues.

[FEA] Create separate API for loading the vocabulary file for the subword-tokenizer
5 participants