# [DOC] Documentation for subword_tokenize #5799
I am not sure why we have this parameter, to be honest. I think it might be worth at least throwing a logic error (which I think should come at no extra cost) when the number of input strings is larger than the provided limit. Having this additional parameter with a low default value may trip up other users too. The error a user will get here is a
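For illustration, here is a minimal sketch of the kind of guard suggested above, written as a Python-side wrapper. The `checked_subword_tokenize` helper is hypothetical, and the `Series.str.subword_tokenize` signature is assumed to match the parameters discussed in this issue (it may differ between cudf versions):

```python
import cudf

def checked_subword_tokenize(series, hash_file, max_num_strings=100, **kwargs):
    # Hypothetical wrapper: fail loudly instead of silently hitting undefined
    # behavior when the input has more strings than max_num_strings allows.
    if len(series) > max_num_strings:
        raise ValueError(
            f"got {len(series)} input strings but max_num_strings={max_num_strings}; "
            "increase max_num_strings to cover the whole input"
        )
    # Assumes the Series.str.subword_tokenize signature discussed in this issue.
    return series.str.subword_tokenize(
        hash_file, max_num_strings=max_num_strings, **kwargs
    )
```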
Happy to take this on. Related to this issue, I will need support on how to add the hashing file. See the open question at #5868 (comment). CC: @kkraus14.
This PR closes part of #5799 by upstreaming [`perfect_hash.py`](https://github.com/rapidsai/clx/blob/267c6d30805c9dcbf80840f222bf31c5c4b7068a/python/clx/analytics/perfect_hash.py) to `cudf`. Please note that I don't understand the details of the inner workings of `perfect_hash.py`; this is more of a one-to-one port of the file with minimal code changes. To ensure correctness, I checked that we get the same result as `perfect-hash.py` ([vocab_hash.txt](https://github.com/rapidsai/cudf/blob/910e5276e2a7b734652d05b18e9fbf9b5571fa25/python/cudf/cudf/tests/data/vocab_hash/ground_truth_vocab_hash.txt)) created on the vocabulary [`bert-base-uncased-vocab.txt`](python/cudf/cudf/tests/data/vocab_hash/bert-base-uncased-vocab.txt). The main change here is that I have removed the `non-compact` code path, as it caused failures like the one in this [issue](#5760 (comment)).

### TODO:
- [x] Add function
- [x] Add test to ensure equivalence
- [x] Add ChangeLog

### Previous Problems:
These have since been addressed by sampling non-special symbols (see the updated numbers below).
1. Adding this test will:
   a. Add `30 s` to the test suite
   b. Add `1.8 MB` because of the `ground truth` and `vocabulary` files

We can reduce both, if the above is unacceptable, by sampling the vocabulary down to fewer words.

### Updated PR:
The problems above have been addressed by sampling non-special symbols.
1. Adding this test will:
   a. Add `1.5 s` to the test suite
   b. Add `112 KB` because of the `ground truth` and `vocabulary` files
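For reference, a minimal sketch of how the upstreamed module might be used to build the hash file, assuming the entry point landed as `cudf.utils.hash_vocab_utils.hash_vocab` (the exact module path and name may differ in your cudf version):

```python
# Build the perfect-hash vocabulary file that subword_tokenize consumes as
# its hash_file argument. Assumes the perfect_hash.py port is exposed as
# hash_vocab(); adjust the import if the module lives elsewhere.
from cudf.utils.hash_vocab_utils import hash_vocab

hash_vocab("bert-base-uncased-vocab.txt", "vocab_hash.txt")
```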
The last of the issues will be addressed with #6608.
This PR improves the subword tokenizer docs by improving the example as well as the general docstring, and closes the last bits of #5799. I wasn't sure of the exact details about `max_rows_tensor` (CC: @davidwendt to confirm). It is rendered like below: ![image](https://user-images.githubusercontent.com/4837571/97377583-a0a3cc80-187d-11eb-8fc6-21ae18c7a76e.png)
Report incorrect documentation
We need to improve our documentation for `subword_tokenize`; I am starting this issue to document the places where we need to improve it.
Describe the problems or issues found in the documentation
We need to provide better documentation for the following parameters, as these are new parameters that are not present in the de facto standard of Hugging Face that we should follow.
Input Parameters:

All of these are new parameters that are not present in HF. The one below is present in both HF and ours, but we have a divergence.

Output Parameters:

Handling Special Characters:

We also do not currently seem to handle special characters in inputs differently, but I am guessing that may stem from how we create the hash table (related issue: #5765). The items below have been fixed with PR 5919 and PR 6658.

- `hash_file`: We need to provide documentation for how to create the hash file.
- `max_num_strings`: We have undefined behavior if `input_strings > max_num_strings`, and the default is really low (100). This can lead to issues like the one in the related issue; see the sketch after this list.
- `max_num_chars`: This is also a new parameter that we introduce, so we should provide more info about it.
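As an illustration of the sizing concern above, here is a hedged sketch that derives `max_num_strings` and `max_num_chars` from the input itself so the undefined-behavior case cannot be hit. It assumes the `Series.str.subword_tokenize` signature discussed in this issue; exact parameter names may vary between cudf versions.

```python
import cudf

ser = cudf.Series(["first document", "second document"])

# Size the limits from the input rather than relying on the low defaults.
tokens, masks, metadata = ser.str.subword_tokenize(
    "vocab_hash.txt",                        # hash_file built from the vocabulary
    max_num_strings=len(ser),                # cover every input string
    max_num_chars=int(ser.str.len().sum()),  # cover every input character
)
```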
CC: @davidwendt / @raykallen / @BartleyR / @randerzander