
[DOC]Documentation for subword_tokenize #5799

Closed

VibhuJawa opened this issue Jul 29, 2020 · 4 comments
Assignees: VibhuJawa
Labels: doc Documentation · libcudf Affects libcudf (C++/CUDA) code. · Python Affects Python cuDF API. · strings strings issues (C++ and Python)


VibhuJawa commented Jul 29, 2020

Report incorrect documentation

We need to improve our documentation for subword_tokenize. I am starting this issue to document the places where it needs improvement.

Describe the problems or issues found in the documentation

We need to provide better documentation for the following parameters, as they are new parameters that are not present in the de facto Hugging Face standard that we should follow.

Input Parameters:
All of these are new parameters that are not present in HF.

  • max_rows_tensor: The default for this (500) is also quite low, and it is a new parameter that we introduce; we should look into better ways to document it.

The parameter below is present in both HF and cuDF, but we have a divergence:

  • stride: We might have a divergence when it is set to 0, which is the HF default; we either need to fix that or at least mention it in the documentation.

Output Parameters:

  • metadataColumn: This is also a new parameter that we have introduced here; we should look into providing an example like the one in the blog (a sketch of a call covering these parameters follows below).
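
To make the parameter discussion concrete, here is a minimal sketch of a call that exercises the parameters above. Parameter names other than those already mentioned in this issue (e.g. `max_length`) follow the Hugging Face convention and are assumptions; confirm the exact signature against the cuDF API reference.

```python
import cudf

ser = cudf.Series(["This is a test sentence.", "Another input string."])

# Sketch only: the values are illustrative, not recommendations.
tokens, attention_mask, metadata = ser.str.subword_tokenize(
    "vocab-hash.txt",     # hash_file built from a BERT vocabulary
    max_length=64,        # tokens per output row (assumed name, HF convention)
    stride=48,            # overlap between rows when a string is split
    do_lower=False,
    max_rows_tensor=500,  # low default discussed above; raise for larger inputs
)

# The third output (the metadata column discussed above) is what maps each
# row of `tokens` back to the input string it came from.
```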

Handling Special Characters:

We also do not seem to handle special characters in inputs differently at the moment, but I am guessing that this may stem from how we create the hash_table. (Related issue: #5765)

```python
import cudf

cudf_tensor, _, _ = cudf.Series(["a [SEP] b"]).str.subword_tokenize("vocab-hash.txt", do_lower=False)
cudf_tensor
# 170,   164, 12342,  2101,   166,   171
# [SEP] is broken down into 164, 12342, 2101, 166 ('[', 'SE', '##P', ']') vs. just `[SEP]` in HF.
```
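
For comparison, the Hugging Face behaviour we diverge from; a small sketch assuming the `transformers` BERT tokenizer:

```python
from transformers import BertTokenizer

hf_tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# The special token in the text is kept intact rather than split into pieces.
print(hf_tokenizer.tokenize("a [SEP] b"))
# Expected: ['a', '[SEP]', 'b']
```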

The items below have been fixed by PR #5919 and PR #6658:

  • hash_file: We need to provide documentation for how to create the hash file (see the sketch after this list).

  • max_num_strings: We have undefined behavior if the number of input strings exceeds max_num_strings, and the default (100) is really low. This can lead to issues like the one in the related issue.

  • max_num_chars: This is also a new parameter that we introduce, so we should provide more information about it.
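
As a starting point for the `hash_file` documentation, a hypothetical sketch of producing the hash file from a BERT vocabulary. The module path and `hash_vocab` entry point are assumptions based on the `perfect_hash.py` port discussed later in this thread; confirm them against the cuDF version in use.

```python
# Assumed import path; verify against the installed cuDF version.
from cudf.utils.hash_vocab_utils import hash_vocab

# Convert a standard BERT vocabulary file into the hash file that
# subword_tokenize expects as its hash_file argument.
hash_vocab("bert-base-uncased-vocab.txt", "vocab-hash.txt")
```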

CC: @davidwendt / @raykallen / @BartleyR / @randerzander

VibhuJawa added the doc Documentation and Needs Triage (Need team to review and classify) labels on Jul 29, 2020

VibhuJawa commented Jul 29, 2020

> max_num_strings: We have undefined behavior if the number of input strings exceeds max_num_strings, and the default (100) is really low. This can lead to issues like the one in the related issue.

To be honest, I am not sure why we have this parameter.

I think it might be worth at least throwing a logic error (which I think should come at no extra cost) when the number of input strings is larger than the provided limit. Having this additional parameter with a low default value may trip up other users too. The error a user currently gets is a cudaErrorIllegalAddress, which does not tell them where it is coming from. A sketch of the kind of guard I mean is below.
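
Until such a check exists in libcudf, one way to surface a clearer error is a small user-side guard; a minimal sketch (the wrapper name is hypothetical, and the default mirrors the current 100):

```python
def checked_subword_tokenize(series, hash_file, max_num_strings=100, **kwargs):
    """Hypothetical wrapper: fail with a clear message instead of letting the
    GPU kernel hit cudaErrorIllegalAddress when the input is too large."""
    if len(series) > max_num_strings:
        raise ValueError(
            f"{len(series)} input strings exceed max_num_strings="
            f"{max_num_strings}; increase max_num_strings or split the input."
        )
    return series.str.subword_tokenize(
        hash_file, max_num_strings=max_num_strings, **kwargs
    )
```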

VibhuJawa added the strings label on Jul 29, 2020
kkraus14 added the Python and libcudf labels and removed the Needs Triage label on Aug 5, 2020
@VibhuJawa

Happy to take this on for the 0.17 release. Would love to clean this up.

Related to this issue, I will need support on how to add the hashing file to the cudf package to close part of this.

Related: see the open question at #5868 (comment).

CC: @kkraus14 .

VibhuJawa self-assigned this on Oct 12, 2020
kkraus14 pushed a commit that referenced this issue Oct 27, 2020

This PR closes part of #5799 by upstreaming [`perfect_hash.py`](https://github.com/rapidsai/clx/blob/267c6d30805c9dcbf80840f222bf31c5c4b7068a/python/clx/analytics/perfect_hash.py) to `cudf`.

Please note that I don't understand the details of the inner workings of `perfect_hash.py`; this is more of a one-to-one port of the file with minimal code changes.

To ensure correctness, I verified that we get the same result as `perfect-hash.py` ([vocab_hash.txt](https://github.com/rapidsai/cudf/blob/910e5276e2a7b734652d05b18e9fbf9b5571fa25/python/cudf/cudf/tests/data/vocab_hash/ground_truth_vocab_hash.txt)) when run on the vocabulary [`bert-base-uncased-vocab.txt`](python/cudf/cudf/tests/data/vocab_hash/bert-base-uncased-vocab.txt); a minimal sketch of that check is below.

The main change here is that I have removed the `non-compact` code path, as it caused failures like the one at [issue](#5760 (comment)).
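
A minimal sketch of that equivalence check, using the files linked above (the `hash_vocab` entry point name is an assumption for the ported helper):

```python
from cudf.utils.hash_vocab_utils import hash_vocab  # assumed import path

# Regenerate the hash with the ported code and compare against the ground
# truth produced by the original perfect-hash.py script.
hash_vocab(
    "python/cudf/cudf/tests/data/vocab_hash/bert-base-uncased-vocab.txt",
    "generated_vocab_hash.txt",
)

with open("generated_vocab_hash.txt") as got, open(
    "python/cudf/cudf/tests/data/vocab_hash/ground_truth_vocab_hash.txt"
) as expected:
    assert got.read() == expected.read()
```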


### TODO: 
- [x] Add function
- [x] Add Test to ensure equivalence 
- [x] Add ChangeLog  


### Previous Problems:
The issues below have now been addressed by sampling non-special symbols.
1.  Adding this test will:
a. Add `30s` to the test suite
b. Add `1.8 MB` because of the `ground truth` and `vocabulary` files

We can reduce both, if the above are unacceptable, by sampling the vocabulary down to fewer words.



### Updated  PR:
The above have now been addressed by sampling non-special symbols; a rough sketch of that sampling follows the list below.
1.  Adding this test will:
a. Add `1.5 s` to the test suite
b. Add `112 KB` because of the `ground truth` and `vocabulary` files
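
A rough sketch of the kind of vocabulary sampling described above (the special-token list and sample size are illustrative only, not what the PR necessarily uses):

```python
import random

# Illustrative only: keep the special symbols, sample the ordinary wordpieces.
SPECIAL_TOKENS = {"[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"}

with open("bert-base-uncased-vocab.txt") as f:
    vocab = [line.rstrip("\n") for line in f]

special = [tok for tok in vocab if tok in SPECIAL_TOKENS]
regular = [tok for tok in vocab if tok not in SPECIAL_TOKENS]

random.seed(0)  # deterministic so the sampled vocab and ground truth stay stable
sampled = special + random.sample(regular, k=2000)

with open("sampled-vocab.txt", "w") as out:
    out.write("\n".join(sampled) + "\n")
```
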
@VibhuJawa

The last of the issues will be addressed by #6608.

kkraus14 pushed a commit that referenced this issue Oct 28, 2020
This PR improves the subword tokenizer docs by improving the example as well as the general docstring, and closes the last bits of #5799.

I wasn't sure about the exact details of `max_rows_tensor` (CC: @davidwendt to confirm).

It is rendered like below: 
![image](https://user-images.githubusercontent.com/4837571/97377583-a0a3cc80-187d-11eb-8fc6-21ae18c7a76e.png)
@VibhuJawa

Closed by #6608, #5919 and #6658
