Add function to create hashed vocabulary file from raw vocabulary (#6568)

This PR closes part of #5799 by upstreaming [`perfect_hash.py`](https://github.com/rapidsai/clx/blob/267c6d30805c9dcbf80840f222bf31c5c4b7068a/python/clx/analytics/perfect_hash.py) to `cudf`.

Please note that I don't fully understand the inner workings of `perfect_hash.py`; this is mostly a one-to-one port of the file with minimal code changes.

To verify correctness, I checked that we get the same result as `perfect-hash.py` ([vocab_hash.txt](https://github.com/rapidsai/cudf/blob/910e5276e2a7b734652d05b18e9fbf9b5571fa25/python/cudf/cudf/tests/data/vocab_hash/ground_truth_vocab_hash.txt)) when run on the vocabulary [`bert-base-uncased-vocab.txt`](python/cudf/cudf/tests/data/vocab_hash/bert-base-uncased-vocab.txt).

The main change here is that I have removed the `non-compact` code path, as it caused failures like the one reported in [#5760](#5760 (comment)).
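For context, `perfect_hash.py` builds a two-level (FKS-style) perfect hash over the vocabulary. The sketch below illustrates the general idea only; the mixing function, seed search, and on-disk layout in the actual port are different, and `h` here is a stand-in, not the cuDF hash family:

```python
# Two-level ("perfect") hashing sketch: level one scatters keys into
# buckets; level two searches, per bucket, for a seed that hashes that
# bucket's keys into a table of size len(bucket)**2 with no collisions.

def h(key, seed, size):
    # Stand-in mixing function (the real perfect_hash.py uses its own
    # explicit hash family); deterministic within one process.
    return hash((seed, key)) % size

def build_perfect_hash(keys):
    n = len(keys)
    buckets = [[] for _ in range(n)]
    for key in keys:
        buckets[h(key, 0, n)].append(key)

    seeds, tables = [], []
    for bucket in buckets:
        size = max(1, len(bucket) ** 2)
        seed = 1
        while True:
            slots = [h(key, seed, size) for key in bucket]
            if len(set(slots)) == len(bucket):  # collision-free seed found
                break
            seed += 1
        table = [None] * size
        for key, slot in zip(bucket, slots):
            table[slot] = key
        seeds.append(seed)
        tables.append(table)
    return seeds, tables

def lookup(key, seeds, tables, n):
    b = h(key, 0, n)  # first level: which bucket
    return tables[b][h(key, seeds[b], len(tables[b]))]  # second level

vocab = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "the", "quick", "brown", "fox"]
seeds, tables = build_perfect_hash(vocab)
assert all(lookup(word, seeds, tables, len(vocab)) == word for word in vocab)
```

The payoff is that every lookup is exactly two hash evaluations and two array reads, with no probing, which is what makes the scheme attractive on the GPU.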


### TODO: 
- [x] Add function
- [x] Add Test to ensure equivalence 
- [x] Add ChangeLog  


### Previous Problems:
1.  Adding this test would:
    a. Add `30 s` to the test suite
    b. Add `1.8 MB` because of the `ground truth` and `vocabulary` files

We can reduce both, if the above is unacceptable, by sampling the vocabulary down to fewer words.



### Updated PR:
The problems above have been addressed by sampling the non-special symbols. Now:
1.  Adding this test:
    a. Adds `1.5 s` to the test suite
    b. Adds `112 KB` because of the `ground truth` and `vocabulary` files
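The sampling approach above can be sketched as follows. This is purely illustrative; the helper name, the bracket convention for detecting special tokens, and the sampling rate are all assumptions, not the code in this PR:

```python
import random

def sample_vocab(tokens, keep_fraction=0.1, seed=0):
    """Keep all special tokens (e.g. [PAD], [CLS]) but only a random
    fraction of the ordinary wordpieces, shrinking the test fixture
    while preserving the tokens the tokenizer depends on."""
    special = [t for t in tokens if t.startswith("[") and t.endswith("]")]
    regular = [t for t in tokens if t not in special]
    rng = random.Random(seed)  # fixed seed keeps the fixture reproducible
    k = max(1, int(len(regular) * keep_fraction))
    return special + rng.sample(regular, k)

vocab = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"] + [f"tok{i}" for i in range(100)]
small = sample_vocab(vocab, keep_fraction=0.1)
print(len(small))  # 15: 5 special tokens + 10 sampled wordpieces
```

Because the special tokens are always retained, the hashed vocabulary built from the sample still supports the same tokenizer API, just over a much smaller file.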
VibhuJawa authored Oct 27, 2020
1 parent e620a73 commit e94ed01
Showing 6 changed files with 5,904 additions and 1 deletion.
3 changes: 2 additions & 1 deletion CHANGELOG.md
```diff
@@ -2,8 +2,9 @@

 ## New Features

-- PR #6460 Add is_timestamp format check API
 - PR #6528 Enable `fixed_point` binary operations
+- PR #6460 Add is_timestamp format check API
+- PR #6568 Add function to create hashed vocabulary file from raw vocabulary
 - PR #6581 Add JNI API to check if PTDS is enabled

 ## Improvements
```
2 changes: 2 additions & 0 deletions python/cudf/cudf/core/column/string.py
```diff
@@ -4201,6 +4201,8 @@ def subword_tokenize(
         ----------
         hash_file : str
             Path to hash file containing vocabulary of words with token-ids.
+            This can be created from the raw vocabulary
+            using the ``cudf.utils.hash_vocab_utils.hash_vocab`` function
         max_length : int, Default is 64
             Limits the length of the sequence returned.
             If tokenized string is shorter than max_length,
```