
[DOC]Documentation for subword_tokenize #5799

Closed

VibhuJawa opened this issue Jul 29, 2020 · 4 comments
Assignees: VibhuJawa
Labels: doc Documentation · libcudf Affects libcudf (C++/CUDA) code. · Python Affects Python cuDF API. · strings strings issues (C++ and Python)


VibhuJawa commented Jul 29, 2020

Report incorrect documentation

We need to improve our documentation for subword_tokenize. I am starting this issue to document the places where it needs improvement.

Describe the problems or issues found in the documentation

We need to provide better documentation for the following parameters, as they are new parameters that are not present in the de facto Hugging Face standard that we should follow.

Input Parameters:
All of these are new parameters that are not present in HF.

  • max_rows_tensor: The default for this (500) is also quite low, and it is a new parameter that we introduce; we should look into better ways to document it.

The parameter below is present in both HF and cuDF, but we have a divergence:

  • stride: We might have a divergence when it is set to 0, which is the HF default; we either need to fix that or at least mention it in the documentation.

Output Parameters:

  • metadataColumn: This is also a new parameter that we have introduced here; we should look into providing an example like the one in the blog (a sketch of a call covering these parameters follows below).
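
To make the parameter discussion concrete, here is a minimal sketch of a call that exercises the parameters above. Parameter names other than those already mentioned in this issue (e.g. `max_length`) follow the Hugging Face convention and are assumptions; confirm the exact signature against the cuDF API reference.

```python
import cudf

ser = cudf.Series(["This is a test sentence.", "Another input string."])

# Sketch only: the values are illustrative, not recommendations.
tokens, attention_mask, metadata = ser.str.subword_tokenize(
    "vocab-hash.txt",     # hash_file built from a BERT vocabulary
    max_length=64,        # tokens per output row (assumed name, HF convention)
    stride=48,            # overlap between rows when a string is split
    do_lower=False,
    max_rows_tensor=500,  # low default discussed above; raise for larger inputs
)

# The third output (the metadata column discussed above) is what maps each
# row of `tokens` back to the input string it came from.
```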

Handling Special Characters:

We also do not seem to handle special characters in inputs differently at the moment, but I am guessing that this may stem from how we create the hash_table. (Related issue: #5765)

```python
import cudf

cudf_tensor, _, _ = cudf.Series(["a [SEP] b"]).str.subword_tokenize("vocab-hash.txt", do_lower=False)
cudf_tensor
# 170,   164, 12342,  2101,   166,   171
# [SEP] is broken down into 164, 12342, 2101, 166 ('[', 'SE', '##P', ']') vs. just `[SEP]` in HF.
```
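
For comparison, the Hugging Face behaviour we diverge from; a small sketch assuming the `transformers` BERT tokenizer:

```python
from transformers import BertTokenizer

hf_tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# The special token in the text is kept intact rather than split into pieces.
print(hf_tokenizer.tokenize("a [SEP] b"))
# Expected: ['a', '[SEP]', 'b']
```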

The items below have been fixed by PR #5919 and PR #6658:

  • hash_file: We need to provide documentation for how to create the hash file (see the sketch after this list).

  • max_num_strings: We have undefined behavior if the number of input strings exceeds max_num_strings, and the default (100) is really low. This can lead to issues like the one in the related issue.

  • max_num_chars: This is also a new parameter that we introduce, so we should provide more information about it.
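
As a starting point for the `hash_file` documentation, a hypothetical sketch of producing the hash file from a BERT vocabulary. The module path and `hash_vocab` entry point are assumptions based on the `perfect_hash.py` port discussed later in this thread; confirm them against the cuDF version in use.

```python
# Assumed import path; verify against the installed cuDF version.
from cudf.utils.hash_vocab_utils import hash_vocab

# Convert a standard BERT vocabulary file into the hash file that
# subword_tokenize expects as its hash_file argument.
hash_vocab("bert-base-uncased-vocab.txt", "vocab-hash.txt")
```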

CC: @davidwendt / @raykallen / @BartleyR / @randerzander

VibhuJawa added the doc Documentation and Needs Triage (Need team to review and classify) labels on Jul 29, 2020

VibhuJawa commented Jul 29, 2020

> max_num_strings: We have undefined behavior if the number of input strings exceeds max_num_strings, and the default (100) is really low. This can lead to issues like the one in the related issue.

To be honest, I am not sure why we have this parameter.

I think it might be worth at least throwing a logic error (which I think should come at no extra cost) when the number of input strings is larger than the provided limit. Having this additional parameter with a low default value may trip up other users too. The error a user currently gets is a cudaErrorIllegalAddress, which does not tell them where it is coming from. A sketch of the kind of guard I mean is below.
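
Until such a check exists in libcudf, one way to surface a clearer error is a small user-side guard; a minimal sketch (the wrapper name is hypothetical, and the default mirrors the current 100):

```python
def checked_subword_tokenize(series, hash_file, max_num_strings=100, **kwargs):
    """Hypothetical wrapper: fail with a clear message instead of letting the
    GPU kernel hit cudaErrorIllegalAddress when the input is too large."""
    if len(series) > max_num_strings:
        raise ValueError(
            f"{len(series)} input strings exceed max_num_strings="
            f"{max_num_strings}; increase max_num_strings or split the input."
        )
    return series.str.subword_tokenize(
        hash_file, max_num_strings=max_num_strings, **kwargs
    )
```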

VibhuJawa added the strings label on Jul 29, 2020
kkraus14 added the Python and libcudf labels and removed the Needs Triage label on Aug 5, 2020
@VibhuJawa

Happy to take this on for the 0.17 release. Would love to clean this up.

Related to this issue, I will need support on how to add the hashing file to the cudf package to close part of this.

Related: see the open question at #5868 (comment).

CC: @kkraus14 .

VibhuJawa self-assigned this on Oct 12, 2020
kkraus14 pushed a commit that referenced this issue Oct 27, 2020

This PR closes part of #5799 by upstreaming [`perfect_hash.py`](https://github.com/rapidsai/clx/blob/267c6d30805c9dcbf80840f222bf31c5c4b7068a/python/clx/analytics/perfect_hash.py) to `cudf`.

Please note that I don't understand the details of the inner workings of `perfect_hash.py`; this is more of a one-to-one port of the file with minimal code changes.

To ensure correctness, I verified that we get the same result as `perfect-hash.py` ([vocab_hash.txt](https://github.com/rapidsai/cudf/blob/910e5276e2a7b734652d05b18e9fbf9b5571fa25/python/cudf/cudf/tests/data/vocab_hash/ground_truth_vocab_hash.txt)) when run on the vocabulary [`bert-base-uncased-vocab.txt`](python/cudf/cudf/tests/data/vocab_hash/bert-base-uncased-vocab.txt); a minimal sketch of that check is below.

The main change here is that I have removed the `non-compact` code path, as it caused failures like the one at [issue](#5760 (comment)).
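
A minimal sketch of that equivalence check, using the files linked above (the `hash_vocab` entry point name is an assumption for the ported helper):

```python
from cudf.utils.hash_vocab_utils import hash_vocab  # assumed import path

# Regenerate the hash with the ported code and compare against the ground
# truth produced by the original perfect-hash.py script.
hash_vocab(
    "python/cudf/cudf/tests/data/vocab_hash/bert-base-uncased-vocab.txt",
    "generated_vocab_hash.txt",
)

with open("generated_vocab_hash.txt") as got, open(
    "python/cudf/cudf/tests/data/vocab_hash/ground_truth_vocab_hash.txt"
) as expected:
    assert got.read() == expected.read()
```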


### TODO: 
- [x] Add function
- [x] Add Test to ensure equivalence 
- [x] Add ChangeLog  


### Previous Problems:
The issues below have now been addressed by sampling non-special symbols.
1.  Adding this test will:
a. Add `30s` to the test suite
b. Add `1.8 MB` because of the `ground truth` and `vocabulary` files

We can reduce both, if the above are unacceptable, by sampling the vocabulary down to fewer words.



### Updated  PR:
The above have now been addressed by sampling non-special symbols; a rough sketch of that sampling follows the list below.
1.  Adding this test will:
a. Add `1.5 s` to the test suite
b. Add `112 KB` because of the `ground truth` and `vocabulary` files
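
A rough sketch of the kind of vocabulary sampling described above (the special-token list and sample size are illustrative only, not what the PR necessarily uses):

```python
import random

# Illustrative only: keep the special symbols, sample the ordinary wordpieces.
SPECIAL_TOKENS = {"[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"}

with open("bert-base-uncased-vocab.txt") as f:
    vocab = [line.rstrip("\n") for line in f]

special = [tok for tok in vocab if tok in SPECIAL_TOKENS]
regular = [tok for tok in vocab if tok not in SPECIAL_TOKENS]

random.seed(0)  # deterministic so the sampled vocab and ground truth stay stable
sampled = special + random.sample(regular, k=2000)

with open("sampled-vocab.txt", "w") as out:
    out.write("\n".join(sampled) + "\n")
```
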
@VibhuJawa

The last of the issues will be addressed by #6608.

kkraus14 pushed a commit that referenced this issue Oct 28, 2020
This PR improves the subword tokenizer docs by improving the example as well as the general docstring, and closes the last bits of #5799.

I wasn't sure about the exact details of `max_rows_tensor` (CC: @davidwendt to confirm).

It is rendered like below: 
![image](https://user-images.githubusercontent.com/4837571/97377583-a0a3cc80-187d-11eb-8fc6-21ae18c7a76e.png)
@VibhuJawa

Closed by #6608, #5919 and #6658
