[REVIEW] Add function to create hashed vocabulary file from raw vocabulary #6568
Conversation
Please update the changelog in order to start CI tests. View the gpuCI docs here. |
This PR should be ready for an initial review. |
The 30 s testing addition is not ideal but acceptable; the 1.8 MB, however, is too large for a git repo. Could we use a much smaller file for testing (which would presumably speed up the tests anyway)? |
Sure, I will randomly sample the BERT vocabulary down to 5% of the data; I think that should cover all edge cases. Will update the PR. |
Switched to 5% sampling of non-special symbols; the special symbols have to be kept, IMO. We should be good now, as it adds only 112 KB and 1.5 s to testing. |
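The sampling step described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual test code; the function name `sample_vocab` and the convention that special tokens are bracketed (e.g. `[PAD]`, `[UNK]`) are assumptions:

```python
import random

def sample_vocab(lines, fraction=0.05, seed=42):
    """Keep every special token (assumed to be bracketed, e.g. [PAD], [UNK])
    plus a random fraction of the remaining vocabulary entries."""
    is_special = lambda w: w.startswith("[") and w.endswith("]")
    special = [w for w in lines if is_special(w)]
    regular = [w for w in lines if not is_special(w)]
    rng = random.Random(seed)  # fixed seed so the test fixture is reproducible
    kept = rng.sample(regular, max(1, int(len(regular) * fraction)))
    return special + kept
```

With a fixed seed the sampled fixture is stable across CI runs, which matters when the hashed output is compared against a checked-in ground-truth file.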
Codecov Report
@@             Coverage Diff              @@
##        branch-0.17    #6568      +/-  ##
===============================================
+ Coverage       82.07%   82.67%   +0.59%
  Files              90       91       +1
  Lines           14592    15058     +466
===============================================
+ Hits            11977    12449     +472
+ Misses           2615     2609       -6
Continue to review full report at Codecov.
|
CI currently blocked by dask/distributed#4177 |
This PR closes part of #5799 by upstreaming perfect_hash.py to cudf. Please note that I don't understand the details of the inner workings of perfect_hash.py; this is more of a one-to-one port of the file with minimal code changes. To verify correctness, I checked that we get the same result as perfect-hash.py (vocab_hash.txt) when run on the vocabulary bert-base-uncased-vocab.txt. The main change here is that I have removed the non-compact code path, as it caused failures like those in the issue.
TODO:
Previous Problems:
These have now been addressed by sampling the non-special symbols.
a. Added 30 s to the test suite
b. Added 1.8 MB because of the ground-truth and vocabulary files
We can reduce both, if the above is unacceptable, by sampling the vocabulary down to fewer words.
Updated PR:
These have now been addressed by sampling the non-special symbols.
a. Adds 1.5 s to the test suite
b. Adds 112 KB because of the ground-truth and vocabulary files
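For background, perfect_hash.py builds a collision-free (perfect) hash table over the vocabulary so that token lookup needs no probing. The general technique can be sketched with a minimal FKS-style two-level scheme over integer keys; this is an illustration of the idea, not the actual cudf implementation, and the prime and parameter choices are arbitrary:

```python
import random

PRIME = 2147483647  # a large prime for the hash family (illustrative choice)

def _h(a, b, size, key):
    return ((a * key + b) % PRIME) % size

def build_perfect_hash(keys, seed=0):
    """FKS-style two-level perfect hash: a first-level hash scatters keys
    into n buckets, then each bucket gets its own collision-free table
    of quadratic size found by retrying random hash parameters."""
    rng = random.Random(seed)
    n = len(keys)
    a0, b0 = rng.randrange(1, PRIME), rng.randrange(PRIME)
    buckets = [[] for _ in range(n)]
    for k in keys:
        buckets[_h(a0, b0, n, k)].append(k)
    tables = []
    for bucket in buckets:
        size = len(bucket) ** 2
        if size == 0:
            tables.append((0, 0, 0, []))
            continue
        # Retry until the bucket hashes with no collisions; with
        # size == len(bucket)**2 this succeeds quickly in expectation.
        while True:
            a, b = rng.randrange(1, PRIME), rng.randrange(PRIME)
            slots = [None] * size
            ok = True
            for k in bucket:
                i = _h(a, b, size, k)
                if slots[i] is not None:
                    ok = False
                    break
                slots[i] = k
            if ok:
                tables.append((a, b, size, slots))
                break
    return a0, b0, n, tables

def lookup(table, key):
    a0, b0, n, buckets = table
    a, b, size, slots = buckets[_h(a0, b0, n, key)]
    return size > 0 and slots[_h(a, b, size, key)] == key
```

Because every key lands in its own slot by construction, membership tests are two hash evaluations with no collision handling, which is what makes the structure attractive for GPU-side token lookup.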