[REVIEW] Add function to create hashed vocabulary file from raw vocabulary #6568
Conversation
Please update the changelog in order to start CI tests. View the gpuCI docs here. |
This PR should be ready for an initial review. |
The 30 s testing addition is not ideal but acceptable; the 1.8 MB, however, is too large for a git repo. Could we use a much smaller file for testing (which would presumably speed up the tests anyway)? |
Sure, I will randomly sample the BERT vocabulary down to 5% of the data; I think that should cover all edge cases. Will update the PR. |
Switched to 5% sampling of non-special symbols; the special symbols have to be kept, IMO. We should be good now, as it adds only 112 KB and 1.5 s to testing. |
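The sampling step described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual test code; the function name `sample_vocab` and the convention that special tokens are bracketed (e.g. `[PAD]`, `[UNK]`) are assumptions:

```python
import random

def sample_vocab(lines, fraction=0.05, seed=42):
    """Keep every special token (assumed to be bracketed, e.g. [PAD], [UNK])
    plus a random fraction of the remaining vocabulary entries."""
    is_special = lambda w: w.startswith("[") and w.endswith("]")
    special = [w for w in lines if is_special(w)]
    regular = [w for w in lines if not is_special(w)]
    rng = random.Random(seed)  # fixed seed so the test fixture is reproducible
    kept = rng.sample(regular, max(1, int(len(regular) * fraction)))
    return special + kept
```

With a fixed seed the sampled fixture is stable across CI runs, which matters when the hashed output is compared against a checked-in ground-truth file.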
Codecov Report
@@             Coverage Diff              @@
##        branch-0.17    #6568      +/-  ##
===============================================
+ Coverage       82.07%   82.67%   +0.59%
  Files              90       91       +1
  Lines           14592    15058     +466
===============================================
+ Hits            11977    12449     +472
+ Misses           2615     2609       -6
Continue to review full report at Codecov.
|
CI currently blocked by dask/distributed#4177 |
This PR closes part of #5799 by upstreaming perfect_hash.py to cudf. Please note that I don't understand the details of the inner workings of perfect_hash.py; this is more of a one-to-one port of the file with minimal code changes. To verify correctness, I checked that we get the same result as perfect-hash.py (vocab_hash.txt) when run on the vocabulary bert-base-uncased-vocab.txt. The main change here is that I have removed the non-compact code path, as it caused failures like those in the issue.
TODO:
Previous Problems:
These have now been addressed by sampling the non-special symbols.
a. Added 30 s to the test suite
b. Added 1.8 MB because of the ground-truth and vocabulary files
We can reduce both, if the above is unacceptable, by sampling the vocabulary down to fewer words.
Updated PR:
These have now been addressed by sampling the non-special symbols.
a. Adds 1.5 s to the test suite
b. Adds 112 KB because of the ground-truth and vocabulary files
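For background, perfect_hash.py builds a collision-free (perfect) hash table over the vocabulary so that token lookup needs no probing. The general technique can be sketched with a minimal FKS-style two-level scheme over integer keys; this is an illustration of the idea, not the actual cudf implementation, and the prime and parameter choices are arbitrary:

```python
import random

PRIME = 2147483647  # a large prime for the hash family (illustrative choice)

def _h(a, b, size, key):
    return ((a * key + b) % PRIME) % size

def build_perfect_hash(keys, seed=0):
    """FKS-style two-level perfect hash: a first-level hash scatters keys
    into n buckets, then each bucket gets its own collision-free table
    of quadratic size found by retrying random hash parameters."""
    rng = random.Random(seed)
    n = len(keys)
    a0, b0 = rng.randrange(1, PRIME), rng.randrange(PRIME)
    buckets = [[] for _ in range(n)]
    for k in keys:
        buckets[_h(a0, b0, n, k)].append(k)
    tables = []
    for bucket in buckets:
        size = len(bucket) ** 2
        if size == 0:
            tables.append((0, 0, 0, []))
            continue
        # Retry until the bucket hashes with no collisions; with
        # size == len(bucket)**2 this succeeds quickly in expectation.
        while True:
            a, b = rng.randrange(1, PRIME), rng.randrange(PRIME)
            slots = [None] * size
            ok = True
            for k in bucket:
                i = _h(a, b, size, k)
                if slots[i] is not None:
                    ok = False
                    break
                slots[i] = k
            if ok:
                tables.append((a, b, size, slots))
                break
    return a0, b0, n, tables

def lookup(table, key):
    a0, b0, n, buckets = table
    a, b, size, slots = buckets[_h(a0, b0, n, key)]
    return size > 0 and slots[_h(a, b, size, key)] == key
```

Because every key lands in its own slot by construction, membership tests are two hash evaluations with no collision handling, which is what makes the structure attractive for GPU-side token lookup.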