[QST] Hash table related question for the subword_tokenize function #5760

Closed
VibhuJawa opened this issue Jul 24, 2020 · 6 comments
Labels
doc (Documentation) · libcudf (Affects libcudf C++/CUDA code) · Python (Affects Python cuDF API) · question (Further information is requested) · strings (strings issues, C++ and Python)

Comments

@VibhuJawa
Member

[QST] Hash table related question for the subword_tokenize function

We recently added subword_tokenize, which takes a hash_file as input, but we don't provide instructions on how to create that file.

My best guess was that we create it using perfect_hash.py, based on the documentation below:

// obtained from the python script perfect_hash.py which generates the expected tables.

But the perfect_hash.py file may be giving us incorrect results (or we may have a bug downstream).

See the discrepancy with Hugging Face below, which makes me think the results are incorrect:

# !wget https://raw.githubusercontent.com/rapidsai/clx/267c6d30805c9dcbf80840f222bf31c5c4b7068a/python/clx/analytics/perfect_hash.py
# !wget https://cdn.huggingface.co/dslim/bert-base-NER/vocab.txt
# !python3 perfect_hash.py  --vocab 'vocab.txt' --output 'vocab-hash.txt'

import cudf
import numpy as np
from transformers import BertTokenizer, BertModel

text = "Jenna is from London"
cudf_ser = cudf.Series([text])

# Tokenize with cuDF using the hash file generated by perfect_hash.py
cudf_tokens, masks, metadata = cudf_ser.str.subword_tokenize("vocab-hash.txt", do_lower=False)

# Tokenize the same text with Hugging Face for comparison
hugging_face_tokenizer = BertTokenizer(vocab_file="vocab.txt", do_lower_case=False)

for token_id in cudf_tokens:
    if token_id!=0:
        print(f"cudf- token_id = {token_id} token = {hugging_face_tokenizer.convert_ids_to_tokens([token_id])[0]}")
        
print("\n")       
d = hugging_face_tokenizer(text)
h_tokens, token_type_ids, attention_mask = d['input_ids'],d['token_type_ids'],d['attention_mask']
for token_id in h_tokens:
    if token_id!=0:
        print(f"huggingface token_id = {token_id} token = {hugging_face_tokenizer.convert_ids_to_tokens([token_id])[0]}")

Look at how the string "Jenna" is tokenized below.

cudf- token_id = 147 token = J
cudf- token_id = 1424 token = ##en
cudf- token_id = 1605 token = ##na
cudf- token_id = 1110 token = is
cudf- token_id = 1121 token = from
cudf- token_id = 1498 token = London


huggingface token_id = 101 token = [CLS]
huggingface token_id = 13862 token = Jenna
huggingface token_id = 1110 token = is
huggingface token_id = 1121 token = from
huggingface token_id = 1498 token = London
huggingface token_id = 102 token = [SEP]

Questions:

A. Is the perfect_hash.py file linked here correct?
B. Is there a way to check the correctness of the hashing we run?
C. Are there any other reasons we may be seeing this discrepancy?

CC: @davidwendt / @efajardo-nv

VibhuJawa added the question, doc, and Needs Triage labels on Jul 24, 2020
@VibhuJawa
Member Author

VibhuJawa commented Jul 27, 2020

Based on a discussion with Rachel Allen and @BartleyR, we have come to the following conclusions so far:

a. The hashing is correct.
b. The special-token discrepancy can be fixed by passing the keyword argument add_special_tokens=False to Hugging Face.
c. We don't support detokenize with cudf.

My last remaining question is why the encoding of "Jenna" below is different.

# !wget https://raw.githubusercontent.com/rapidsai/clx/267c6d30805c9dcbf80840f222bf31c5c4b7068a/python/clx/analytics/perfect_hash.py
# !wget https://cdn.huggingface.co/dslim/bert-base-NER/vocab.txt
# !python3 perfect_hash.py  --vocab 'vocab.txt' --output 'vocab-hash.txt'

import cudf
import numpy as np
from transformers import BertTokenizer, BertModel

text = "Jenna is from London"
cudf_ser = cudf.Series([text])
cudf_tokens, masks, metadata = cudf_ser.str.subword_tokenize("vocab-hash.txt",do_lower=False)
hugging_face_tokenizer = BertTokenizer(vocab_file="vocab.txt", do_lower_case=False)

d = hugging_face_tokenizer(text,add_special_tokens=False)
h_tokens, token_type_ids, attention_mask = d['input_ids'],d['token_type_ids'],d['attention_mask']

print("cudf_tokens", cudf_tokens[cudf_tokens!=0])
print("h_tokens", h_tokens)

### Discrepancy is for the token Jenna
# Jenna is broken down into -> 147 1424 1605 (J, ##en, ##na)
# Hugging Face is -> 13862 (Jenna)
cudf_tokens [ 147 1424 1605 1110 1121 1498]
h_tokens [13862, 1110, 1121, 1498]

Rachel / @BartleyR, please feel free to edit any errors I may have made in the above answers. :-)

VibhuJawa added the strings label on Jul 27, 2020
@raykallen

I believe this discrepancy is due to the vocab.txt file in the tokenizer test repo being improperly encoded for Unicode strings. Because of this mistake, the tokenizer uses word-splitting rules that are not based on the complete BERT vocab, so it does not match Hugging Face's splitting rules. I think we may need to refactor the subword tokenizer in order for it to match the BERT tokenizer from Hugging Face. CC @efajardo-nv @davidwendt

@raykallen

It seems the issue is that the perfect_hash.py script is no longer generating the expected output.

@raykallen

If you use compact hashing, it works. Use

# !python3 perfect_hash.py --vocab 'vocab.txt' --output 'vocab-hash.txt' --compact

instead of

# !python3 perfect_hash.py --vocab 'vocab.txt' --output 'vocab-hash.txt'
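
For what it's worth, a minimal sketch of how one could sanity-check the hash file regenerated with --compact against Hugging Face, reusing the snippets above (it assumes the returned token array can be converted with .tolist() like a NumPy/CuPy array):

import cudf
from transformers import BertTokenizer

text = "Jenna is from London"

# cuDF tokens using the hash file regenerated with --compact
cudf_tokens, masks, metadata = cudf.Series([text]).str.subword_tokenize("vocab-hash.txt", do_lower=False)

# Hugging Face tokens, without the [CLS]/[SEP] special tokens
hf_tokenizer = BertTokenizer(vocab_file="vocab.txt", do_lower_case=False)
hf_token_ids = hf_tokenizer(text, add_special_tokens=False)["input_ids"]

# Drop the 0 (padding) entries before comparing
assert cudf_tokens[cudf_tokens != 0].tolist() == hf_token_ids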

kkraus14 added the libcudf and Python labels and removed the Needs Triage label on Aug 5, 2020
@davidwendt
Contributor

@VibhuJawa Can we close this?
I think the original question is answered by Rachel above and you have a separate issue #5799 to improve documentation.

@VibhuJawa
Member Author

> @VibhuJawa Can we close this?
> I think the original question is answered by Rachel above and you have a separate issue #5799 to improve documentation.

Yeah, I think this is good to close. We can address this in #5799.

kkraus14 pushed a commit that referenced this issue Oct 27, 2020

This PR closes part of #5799 by upstreaming [`perfect_hash.py`](https://github.com/rapidsai/clx/blob/267c6d30805c9dcbf80840f222bf31c5c4b7068a/python/clx/analytics/perfect_hash.py) to `cudf`.

Please note that I don't understand the details of the inner workings of `perfect_hash.py`; this is more of a one-to-one port of the file with minimal code changes.

To ensure correctness, I verified that we get the same result as `perfect_hash.py` ([vocab_hash.txt](https://github.com/rapidsai/cudf/blob/910e5276e2a7b734652d05b18e9fbf9b5571fa25/python/cudf/cudf/tests/data/vocab_hash/ground_truth_vocab_hash.txt)) when run on the vocabulary [`bert-base-uncased-vocab.txt`](python/cudf/cudf/tests/data/vocab_hash/bert-base-uncased-vocab.txt).
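
For illustration, the equivalence check is roughly of the following shape (a minimal sketch only; the module path and the `hash_vocab` function name are assumptions about where the upstreamed helper lands, and the file names are the test fixtures linked above):

# Assumed import path for the upstreamed perfect_hash.py helper
from cudf.utils.hash_vocab_utils import hash_vocab

# Regenerate the hash table from the test vocabulary ...
hash_vocab("bert-base-uncased-vocab.txt", "generated_vocab_hash.txt")

# ... and compare it line by line against the ground truth produced by perfect_hash.py
with open("generated_vocab_hash.txt") as got, open("ground_truth_vocab_hash.txt") as expected:
    assert got.readlines() == expected.readlines()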

The main change here is that I have removed the `non-compact` code path, as it caused failures like the one at [issue](#5760 (comment)).


### TODO: 
- [x] Add function
- [x] Add Test to ensure equivalence 
- [x] Add ChangeLog  


### Previous Problems:
The issues below have now been addressed by sampling non-special symbols; see the updated numbers further down.
1.  Adding this test will:
a. Add `30 s` to the test suite
b. Add `1.8 MB` because of the `ground truth` and `vocabulary files`

We can reduce both, if the above is unacceptable, by sampling the vocabulary down to fewer words.



### Updated PR:
The above has now been addressed by sampling non-special symbols.
1.  Adding this test will now:
a. Add `1.5 s` to the test suite
b. Add `112 KB` because of the `ground truth` and `vocabulary files`
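
For reference, the sampling of non-special symbols could look something like the sketch below (illustrative only; the exact sampling used for the test fixtures may differ, and the file names are just the ones referenced above):

import random

# Read the full BERT vocabulary
with open("bert-base-uncased-vocab.txt") as f:
    vocab = [line.rstrip("\n") for line in f]

# Keep every special token (e.g. [PAD], [CLS], [SEP], [MASK], [unusedNN]) ...
special = [w for w in vocab if w.startswith("[") and w.endswith("]")]
# ... and sample only a subset of the ordinary (non-special) symbols
ordinary = [w for w in vocab if not (w.startswith("[") and w.endswith("]"))]

random.seed(0)
sampled = special + random.sample(ordinary, k=2000)

with open("sampled-vocab.txt", "w") as f:
    f.write("\n".join(sampled) + "\n")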