Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[REVIEW]Add function to create hashed vocabulary file from raw vocabulary #6568

Merged
merged 10 commits into from
Oct 27, 2020

Conversation

VibhuJawa
Copy link
Member

@VibhuJawa VibhuJawa commented Oct 20, 2020

This pr closes part of #5799 by upstreaming the perfect_hash.py to cudf.

Please note I don't understand the details of the inner workings of perfect_hash.py and this is more of a one to one port of the file with minimal code changes.

To ensure correctness i ensured that we get the same result as perfect-hash.py ( vocab_hash.txt) created on the vocabulary bert-base-uncased-vocab.txt

The main change here is that I have gotten rid of the non-compact code path as that caused failures like at issue .

TODO:

  • Add function
  • Add Test to ensure equivalence
  • Add ChangeLog

Previous Problems:

Below have been addressed now by sampling nonspecial symbols.

  1. Adding this test will :
    a. Add 30s to the test suite
    b. Add 1.8 Mb because of the ground truth and vocabulary files

We can reduce both if the above are unexpectable by sampling the vocabulary to lesser words.

Updated PR:

Below have been addressed now by sampling nonspecial symbols.

  1. Adding this test will :
    a. Add 1.5 s to the test suite
    b. Add 112 kb because of the ground truth and vocabulary files

@VibhuJawa VibhuJawa requested a review from a team as a code owner October 20, 2020 22:11
@GPUtester
Copy link
Collaborator

Please update the changelog in order to start CI tests.

View the gpuCI docs here.

@VibhuJawa VibhuJawa changed the title [WIP] Add function to create vocab hash_file to cudf [WIP]Add function to create hashed vocabulary file from raw vocabulary Oct 20, 2020
@VibhuJawa VibhuJawa changed the title [WIP]Add function to create hashed vocabulary file from raw vocabulary [REVIEW]Add function to create hashed vocabulary file from raw vocabulary Oct 20, 2020
@VibhuJawa
Copy link
Member Author

This pr should be ready for an initial review.

@kkraus14
Copy link
Collaborator

  1. Adding this test will :
    a. Add 30s to the test suite
    b. Add 1.8 Mb because of the ground truth and vocabulary files

We can reduce both if the above are unexpectable by sampling the vocabulary to lesser words.

The 30s testing addition is un-ideal but acceptable, but the 1.8 Mb is too large for a git repo. Could we use a much smaller file for testing (which would presumably speed up the testing anyway)?

@VibhuJawa
Copy link
Member Author

VibhuJawa commented Oct 21, 2020

  1. Adding this test will :
    a. Add 30s to the test suite
    b. Add 1.8 Mb because of the ground truth and vocabulary files

We can reduce both if the above are unexpectable by sampling the vocabulary to lesser words.

The 30s testing addition is un-ideal but acceptable, but the 1.8 Mb is too large for a git repo. Could we use a much smaller file for testing (which would presumably speed up the testing anyway)?

Sure, I will do a random sampling of the bert-vocabulary to draw 5% of the data, I think it should cover all edge cases. Will update the PR.

@VibhuJawa VibhuJawa added 2 - In Progress Currently a work in progress and removed 3 - Ready for Review Ready for review by team 4 - Needs cuDF (Python) Reviewer labels Oct 21, 2020
@VibhuJawa VibhuJawa changed the title [REVIEW]Add function to create hashed vocabulary file from raw vocabulary [WIP]Add function to create hashed vocabulary file from raw vocabulary Oct 21, 2020
@VibhuJawa VibhuJawa changed the title [WIP]Add function to create hashed vocabulary file from raw vocabulary [REVIEW]Add function to create hashed vocabulary file from raw vocabulary Oct 21, 2020
@VibhuJawa VibhuJawa added 3 - Ready for Review Ready for review by team 4 - Needs cuDF (Python) Reviewer and removed 2 - In Progress Currently a work in progress labels Oct 21, 2020
@VibhuJawa
Copy link
Member Author

Switched to 5% sampling of nonspecial symbols, special symbols will have to be kept IMO. Anyways, we should be good now as it only adds 112kb and 1.5 s to testing.

@codecov
Copy link

codecov bot commented Oct 21, 2020

Codecov Report

Merging #6568 into branch-0.17 will increase coverage by 0.59%.
The diff coverage is 100.00%.

Impacted file tree graph

@@               Coverage Diff               @@
##           branch-0.17    #6568      +/-   ##
===============================================
+ Coverage        82.07%   82.67%   +0.59%     
===============================================
  Files               90       91       +1     
  Lines            14592    15058     +466     
===============================================
+ Hits             11977    12449     +472     
+ Misses            2615     2609       -6     
Impacted Files Coverage Δ
python/cudf/cudf/core/column/string.py 86.21% <ø> (+0.17%) ⬆️
python/cudf/cudf/utils/hash_vocab_utils.py 100.00% <100.00%> (ø)
python/cudf/cudf/io/feather.py 100.00% <0.00%> (ø)
python/cudf/cudf/comm/serialize.py 0.00% <0.00%> (ø)
python/cudf/cudf/_fuzz_testing/io.py 0.00% <0.00%> (ø)
python/dask_cudf/dask_cudf/_version.py 0.00% <0.00%> (ø)
python/dask_cudf/dask_cudf/io/tests/test_csv.py 100.00% <0.00%> (ø)
python/dask_cudf/dask_cudf/io/tests/test_orc.py 100.00% <0.00%> (ø)
python/dask_cudf/dask_cudf/io/tests/test_json.py 100.00% <0.00%> (ø)
...ython/dask_cudf/dask_cudf/io/tests/test_parquet.py 100.00% <0.00%> (ø)
... and 37 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5a594e6...e55a437. Read the comment docs.

@kkraus14 kkraus14 added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 3 - Ready for Review Ready for review by team 4 - Needs cuDF (Python) Reviewer labels Oct 21, 2020
@kkraus14 kkraus14 added Python Affects Python cuDF API. strings strings issues (C++ and Python) labels Oct 21, 2020
@kkraus14
Copy link
Collaborator

CI currently blocked by dask/distributed#4177

@kkraus14 kkraus14 merged commit e94ed01 into rapidsai:branch-0.17 Oct 27, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
5 - Ready to Merge Testing and reviews complete, ready to merge Python Affects Python cuDF API. strings strings issues (C++ and Python)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants