[QST] Hash table related question for the `subword_tokenize` function
#5760
Comments
Based on discussion with Rachel Allen and @BartleyR, we have come to the conclusion below so far:

a. The hashing is correct.

My last remaining question is why the encoding below differs:

```python
# !wget https://raw.githubusercontent.com/rapidsai/clx/267c6d30805c9dcbf80840f222bf31c5c4b7068a/python/clx/analytics/perfect_hash.py
# !wget https://cdn.huggingface.co/dslim/bert-base-NER/vocab.txt
# !python3 perfect_hash.py --vocab 'vocab.txt' --output 'vocab-hash.txt'
import cudf
from transformers import BertTokenizer

text = "Jenna is from London"
cudf_ser = cudf.Series([text])
cudf_tokens, masks, metadata = cudf_ser.str.subword_tokenize("vocab-hash.txt", do_lower=False)

hugging_face_tokenizer = BertTokenizer(vocab_file="vocab.txt", do_lower_case=False)
d = hugging_face_tokenizer(text, add_special_tokens=False)
h_tokens, token_type_ids, attention_mask = d['input_ids'], d['token_type_ids'], d['attention_mask']

print("cudf_tokens", cudf_tokens[cudf_tokens != 0])
print("h_tokens", h_tokens)

# Discrepancy is for the token "Jenna":
#   cuDF breaks it down into -> 147 1424 1605 (J, ##en, ##na)
#   Hugging Face gives -> 13862 (Jenna)
```

Rachel / @BartleyR, please feel free to correct any errors I may have made in the above answers. :-)
I believe this discrepancy is due to the vocab.txt file in the tokenizer test repo being improperly encoded for Unicode strings. Because of this mistake, the tokenizer uses word-splitting rules that are not based on the complete BERT vocab, and therefore it does not match Hugging Face's word-splitting rules. I think we may need to refactor the subword tokenizer in order for it to match the BERT tokenizer from Hugging Face. CC @efajardo-nv @davidwendt
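For context on why a missing vocabulary entry yields subword splits rather than an error: WordPiece-style tokenizers greedily take the longest matching piece at each position and fall back to smaller pieces when the whole word is absent. A minimal sketch of that fallback behavior (an illustration only, not cuDF's or Hugging Face's actual implementation):

```python
# Greedy longest-match-first (WordPiece-style) tokenization sketch.
# If a whole word is present in the vocab it stays intact; if it is
# effectively missing (e.g. lost to a mis-encoded vocab file), the
# tokenizer falls back to smaller pieces.
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    tokens, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        # Find the longest vocab entry matching at `start`.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation marker
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matched at all
        tokens.append(piece)
        start = end
    return tokens

full_vocab = {"Jenna", "J", "##en", "##na"}
broken_vocab = {"J", "##en", "##na"}  # "Jenna" lost during encoding
print(wordpiece_tokenize("Jenna", full_vocab))    # ['Jenna']
print(wordpiece_tokenize("Jenna", broken_vocab))  # ['J', '##en', '##na']
```

This matches the behavior reported above: with "Jenna" effectively missing from the hash table, the GPU tokenizer can only produce the subword pieces.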
It seems the issue is that the perfect_hash.py function is no longer generating the expected output.

If you use compact hashing, it works.
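For readers unfamiliar with the terminology: a hash is "perfect" over a vocabulary when every token maps to a distinct slot, which is what lets the table stay compact (close to the vocabulary size). A toy brute-force parameter search illustrating the general idea (a hedged sketch only, not the algorithm `perfect_hash.py` actually implements):

```python
# Toy illustration of perfect hashing: search for parameters (a, b)
# such that every vocabulary token maps to a distinct slot in a small
# table. NOT the scheme perfect_hash.py uses; names are illustrative.
def token_hash(token, a, b, table_size):
    v = 0
    for ch in token:
        v = (a * v + ord(ch)) % (1 << 61)  # simple polynomial hash
    return (v + b) % table_size

def find_perfect_params(tokens, table_size, max_a=1000):
    for a in range(2, max_a):
        for b in range(table_size):
            slots = {token_hash(t, a, b, table_size) for t in tokens}
            if len(slots) == len(tokens):  # injective => perfect
                return a, b
    raise ValueError("no collision-free parameters found in range")

vocab = ["Jenna", "J", "##en", "##na", "London"]
a, b = find_perfect_params(vocab, table_size=8)  # 5 tokens, 8 slots
```

Real tools avoid this brute-force search with smarter constructions, but the correctness criterion is the same: no two vocabulary tokens may share a slot.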
@VibhuJawa Can we close this?

Yeah, I think this is good to close. We can address this in #5799.
This PR closes part of #5799 by upstreaming [`perfect_hash.py`](https://github.com/rapidsai/clx/blob/267c6d30805c9dcbf80840f222bf31c5c4b7068a/python/clx/analytics/perfect_hash.py) to `cudf`. Please note that I don't understand the details of the inner workings of `perfect_hash.py`; this is more of a one-to-one port of the file with minimal code changes. To ensure correctness, I verified that we get the same result as `perfect_hash.py` ([vocab_hash.txt](https://github.com/rapidsai/cudf/blob/910e5276e2a7b734652d05b18e9fbf9b5571fa25/python/cudf/cudf/tests/data/vocab_hash/ground_truth_vocab_hash.txt)) created on the vocabulary [`bert-base-uncased-vocab.txt`](python/cudf/cudf/tests/data/vocab_hash/bert-base-uncased-vocab.txt). The main change here is that I have removed the `non-compact` code path, as it caused failures like the one in [this issue](#5760 (comment)).

### TODO:
- [x] Add function
- [x] Add test to ensure equivalence
- [x] Add changelog

### Previous Problems:
1. Adding this test would:
   a. Add `30 s` to the test suite
   b. Add `1.8 MB` because of the ground-truth and vocabulary files

   We can reduce both, if the above are unacceptable, by sampling the vocabulary down to fewer words.

### Updated PR:
The above have now been addressed by sampling non-special symbols. Adding this test now:
   a. Adds `1.5 s` to the test suite
   b. Adds `112 KB` because of the ground-truth and vocabulary files
We recently added `subword_tokenize`, which takes a hash file as input but doesn't provide instructions on creating it.
My best guess was that we create it using perfect_hash.py, based on the documentation in `cudf/cpp/src/text/subword/detail/hash_utils.cuh` (line 155 at e81d6a1).
But the perfect_hash.py file may be giving us incorrect results (or we may have a bug downstream). See the discrepancy with Hugging Face shown earlier in this thread, which makes me feel that the results are incorrect: look at the `Jenna` string.

Questions:
A. Is the perfect_hash.py file linked here correct?
B. Is there a way to check the correctness of the hashing that we run?
C. Are there any other reasons we may see this discrepancy?
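One generic way to approach question B, independent of `perfect_hash.py`'s internals, is to confirm that a candidate hash assignment is collision-free over the whole vocabulary. A hedged sketch with stand-in hash functions (not the actual hash used by the tokenizer):

```python
# Sketch for question B: a hash table over a fixed vocab can only be
# "perfect" if no two tokens share a slot. `hash_fn` here is a
# stand-in for whatever hash the table was built with.
from collections import defaultdict

def find_collisions(tokens, hash_fn, table_size):
    slots = defaultdict(list)
    for t in tokens:
        slots[hash_fn(t) % table_size].append(t)
    # Any slot holding more than one token is a collision.
    return {s: ts for s, ts in slots.items() if len(ts) > 1}

vocab = ["Jenna", "J", "##en", "##na", "London"]
# A length-based hash is obviously not perfect on this vocab:
print(find_collisions(vocab, len, 16))  # {4: ['##en', '##na']}
```

Running a check like this over the full vocab.txt against the generated hash table would directly reveal whether any token (such as "Jenna") is unreachable or colliding.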
CC: @davidwendt / @efajardo-nv