Add function to create hashed vocabulary file from raw vocabulary (#6568)

This PR closes part of #5799 by upstreaming [`perfect_hash.py`](https://github.com/rapidsai/clx/blob/267c6d30805c9dcbf80840f222bf31c5c4b7068a/python/clx/analytics/perfect_hash.py) to `cudf`.

Please note that I don't fully understand the inner workings of `perfect_hash.py`; this is mostly a one-to-one port of the file with minimal code changes.

To verify correctness, I checked that we get the same result as `perfect-hash.py` ([vocab_hash.txt](https://github.com/rapidsai/cudf/blob/910e5276e2a7b734652d05b18e9fbf9b5571fa25/python/cudf/cudf/tests/data/vocab_hash/ground_truth_vocab_hash.txt)) when run on the vocabulary [`bert-base-uncased-vocab.txt`](python/cudf/cudf/tests/data/vocab_hash/bert-base-uncased-vocab.txt).

The main change here is that I have removed the `non-compact` code path, as it caused failures like the one reported in [#5760](#5760 (comment)).
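For context, `perfect_hash.py` builds a two-level (FKS-style) perfect hash over the vocabulary. The sketch below illustrates the general idea only; the mixing function, seed search, and on-disk layout in the actual port are different, and `h` here is a stand-in, not the cuDF hash family:

```python
# Two-level ("perfect") hashing sketch: level one scatters keys into
# buckets; level two searches, per bucket, for a seed that hashes that
# bucket's keys into a table of size len(bucket)**2 with no collisions.

def h(key, seed, size):
    # Stand-in mixing function (the real perfect_hash.py uses its own
    # explicit hash family); deterministic within one process.
    return hash((seed, key)) % size

def build_perfect_hash(keys):
    n = len(keys)
    buckets = [[] for _ in range(n)]
    for key in keys:
        buckets[h(key, 0, n)].append(key)

    seeds, tables = [], []
    for bucket in buckets:
        size = max(1, len(bucket) ** 2)
        seed = 1
        while True:
            slots = [h(key, seed, size) for key in bucket]
            if len(set(slots)) == len(bucket):  # collision-free seed found
                break
            seed += 1
        table = [None] * size
        for key, slot in zip(bucket, slots):
            table[slot] = key
        seeds.append(seed)
        tables.append(table)
    return seeds, tables

def lookup(key, seeds, tables, n):
    b = h(key, 0, n)  # first level: which bucket
    return tables[b][h(key, seeds[b], len(tables[b]))]  # second level

vocab = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "the", "quick", "brown", "fox"]
seeds, tables = build_perfect_hash(vocab)
assert all(lookup(word, seeds, tables, len(vocab)) == word for word in vocab)
```

The payoff is that every lookup is exactly two hash evaluations and two array reads, with no probing, which is what makes the scheme attractive on the GPU.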


### TODO: 
- [x] Add function
- [x] Add Test to ensure equivalence 
- [x] Add ChangeLog  


### Previous Problems:
1.  Adding this test would:
    a. Add `30 s` to the test suite
    b. Add `1.8 MB` because of the `ground truth` and `vocabulary` files

We can reduce both, if the above is unacceptable, by sampling the vocabulary down to fewer words.



### Updated PR:
The problems above have been addressed by sampling the non-special symbols. Now:
1.  Adding this test:
    a. Adds `1.5 s` to the test suite
    b. Adds `112 KB` because of the `ground truth` and `vocabulary` files
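The sampling approach above can be sketched as follows. This is purely illustrative; the helper name, the bracket convention for detecting special tokens, and the sampling rate are all assumptions, not the code in this PR:

```python
import random

def sample_vocab(tokens, keep_fraction=0.1, seed=0):
    """Keep all special tokens (e.g. [PAD], [CLS]) but only a random
    fraction of the ordinary wordpieces, shrinking the test fixture
    while preserving the tokens the tokenizer depends on."""
    special = [t for t in tokens if t.startswith("[") and t.endswith("]")]
    regular = [t for t in tokens if t not in special]
    rng = random.Random(seed)  # fixed seed keeps the fixture reproducible
    k = max(1, int(len(regular) * keep_fraction))
    return special + rng.sample(regular, k)

vocab = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"] + [f"tok{i}" for i in range(100)]
small = sample_vocab(vocab, keep_fraction=0.1)
print(len(small))  # 15: 5 special tokens + 10 sampled wordpieces
```

Because the special tokens are always retained, the hashed vocabulary built from the sample still supports the same tokenizer API, just over a much smaller file.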
VibhuJawa authored Oct 27, 2020
1 parent e620a73 commit e94ed01
Showing 6 changed files with 5,904 additions and 1 deletion.
3 changes: 2 additions & 1 deletion CHANGELOG.md
```diff
@@ -2,8 +2,9 @@

 ## New Features

-- PR #6460 Add is_timestamp format check API
 - PR #6528 Enable `fixed_point` binary operations
+- PR #6460 Add is_timestamp format check API
+- PR #6568 Add function to create hashed vocabulary file from raw vocabulary
 - PR #6581 Add JNI API to check if PTDS is enabled

 ## Improvements
```
2 changes: 2 additions & 0 deletions python/cudf/cudf/core/column/string.py
```diff
@@ -4201,6 +4201,8 @@ def subword_tokenize(
         ----------
         hash_file : str
             Path to hash file containing vocabulary of words with token-ids.
+            This can be created from the raw vocabulary
+            using the ``cudf.utils.hash_vocab_utils.hash_vocab`` function
         max_length : int, Default is 64
             Limits the length of the sequence returned.
             If tokenized string is shorter than max_length,
```