
Subword Tokenizer HuggingFace like API #7942

Merged

Conversation

VibhuJawa
Member

@VibhuJawa VibhuJawa commented Apr 12, 2021

This PR closes #5868 by adding a new tokenizer API.

We are seeing speedups even at low batch sizes (10/100), so this should unlock some inference/training use cases for us.

Benchmarks:

(Thanks to @davidwendt for writing the super fast tokenizer 💥 )

| Batch Size | HuggingFace | Rapids Old API | Rapids New Tokenizer API | New API Speedup vs HuggingFace | New API Speedup vs Old API |
| --- | --- | --- | --- | --- | --- |
| 1 | 0.000242 | 0.006890 | 0.000497 | 0.487 | 13.863 |
| 10 | 0.002800 | 0.007030 | 0.000516 | 5.426 | 13.624 |
| 100 | 0.016200 | 0.007140 | 0.000537 | 30.168 | 13.296 |
| 1000 | 0.149000 | 0.007150 | 0.000517 | 288.201 | 13.830 |
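The speedup columns are simply ratios of the measured times. A quick sanity check in plain Python (times copied from the benchmark table; the interpretation of the two speedup columns as HF/new and old/new is an assumption that the numbers bear out):

```python
# Recompute the speedup columns from the raw timings above.
# Assumed column meanings:
#   speedup vs HuggingFace = HuggingFace time / new-API time
#   speedup vs old API     = old-API time / new-API time
rows = {
    1:    (0.000242, 0.006890, 0.000497),
    10:   (0.002800, 0.007030, 0.000516),
    100:  (0.016200, 0.007140, 0.000537),
    1000: (0.149000, 0.007150, 0.000517),
}
for batch, (hf, old, new) in rows.items():
    print(batch, round(hf / new, 3), round(old / new, 3))
```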

API Comparison with HuggingFace:

The goal of this PR is to match the HuggingFace API as closely as possible to ease porting.

Proposed API in this PR:

```python
from cudf.core.subword_tokenizer import SubwordTokenizer

tokenizer = SubwordTokenizer('bert-base-cased-vocab-hash.txt', do_lower_case=False)

output = tokenizer(str_series,
                   max_num_rows=len(str_series),
                   truncation=True,
                   max_length=seq_len,
                   padding='max_length',
                   add_special_tokens=False,
                   return_tensors='pt')
```

HuggingFace API:

```python
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained('bert-base-cased', do_lower_case=False)

output = tokenizer(input_sentence_ls,
                   truncation=True,
                   max_length=seq_len,
                   padding='max_length',
                   add_special_tokens=False,
                   return_tensors='pt')
# Extra step vs. the cuDF API: move the CPU tensors onto the GPU
output_d = {k: v.cuda() for k, v in output.items()}
```
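For readers unfamiliar with what either tokenizer is doing under the hood, the core idea of WordPiece-style subword tokenization (the scheme BERT vocabularies use) can be sketched in a few lines of plain Python. This is a conceptual illustration only, not the cuDF or HuggingFace implementation, and the tiny vocabulary is made up:

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first subword split (WordPiece-style)."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            sub = word[start:end]
            if start > 0:          # continuation pieces carry a "##" prefix
                sub = "##" + sub
            if sub in vocab:
                piece = sub
                break
            end -= 1
        if piece is None:          # no subword matched: the word is unknown
            return [unk]
        pieces.append(piece)
        start = end
    return pieces

# Toy vocabulary, purely illustrative.
vocab = {"token", "##izer", "##ize", "fast"}
print(wordpiece_tokenize("tokenizer", vocab))  # ['token', '##izer']
```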

TODO:

CC: @raykallen, @BartleyR (from the cyber team)

CC: @randerzander , @beckernick (from the workflows team)

@VibhuJawa VibhuJawa added feature request New feature or request 2 - In Progress Currently a work in progress Python Affects Python cuDF API. labels Apr 12, 2021
@VibhuJawa VibhuJawa changed the title Subword Tokenizer API Subword Tokenizer API [skip-ci] Apr 12, 2021
@VibhuJawa VibhuJawa added the non-breaking Non-breaking change label Apr 15, 2021
Member Author

@VibhuJawa VibhuJawa left a comment


Changed DataFiles to allow for testing against HuggingFace.

@github-actions github-actions bot added the gpuCI label Apr 15, 2021
@VibhuJawa VibhuJawa added gpuCI and removed gpuCI labels Apr 15, 2021
@VibhuJawa VibhuJawa changed the title Subword Tokenizer API [skip-ci] Subword Tokenizer HuggingFace like API Apr 15, 2021
ci/gpu/build.sh (outdated review thread, resolved)
@VibhuJawa VibhuJawa marked this pull request as ready for review April 15, 2021 22:04
@VibhuJawa VibhuJawa requested review from a team as code owners April 15, 2021 22:04
@VibhuJawa
Member Author

> Also, is it necessary to change the text files used for testing here? Changing big files like this just blows up the git repository size over time.

We previously had an uncased vocabulary that was randomly sampled, which made it difficult to run meaningful tests with it. The new vocabulary is cased and produces meaningful tokenization. We can probably download these files on the fly if keeping them in the repository is prohibitive. FWIW, I don't expect this file to change again in the future.

Member

@ajschmidt8 ajschmidt8 left a comment


approved now that rapidsai/integration#249 is merged.

@VibhuJawa VibhuJawa added 3 - Ready for Review Ready for review by team and removed 0 - Waiting on Author Waiting for author to respond to review labels Apr 22, 2021
@VibhuJawa
Member Author

@kkraus14, @galipremsagar. Thanks a ton for your detailed reviews.

I think all reviews have been addressed, and this PR is ready for another review.

@kkraus14 kkraus14 added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 3 - Ready for Review Ready for review by team labels Apr 22, 2021
@kkraus14
Collaborator

@gpucibot merge

@galipremsagar
Contributor

rerun tests

1 similar comment
@VibhuJawa
Member Author

rerun tests

@kkraus14 kkraus14 added 0 - Blocked Cannot progress due to external reasons and removed 5 - Ready to Merge Testing and reviews complete, ready to merge labels Apr 27, 2021
@VibhuJawa VibhuJawa removed the 0 - Blocked Cannot progress due to external reasons label May 3, 2021
@rapids-bot rapids-bot bot merged commit 36eaa06 into rapidsai:branch-0.20 May 3, 2021
Labels
feature request New feature or request non-breaking Non-breaking change Python Affects Python cuDF API.
Development

Successfully merging this pull request may close these issues.

[FEA] Create separate API for loading the vocabulary file for the subword-tokenizer
5 participants