[QST] Hash table related question for the `subword_tokenize` function
#5760
Comments
Based on discussion with Rachel Allen and @BartleyR, we have come to the conclusion below so far:

a. The hashing is correct.

My last remaining question is why the encoding below differs:

```python
# !wget https://raw.githubusercontent.com/rapidsai/clx/267c6d30805c9dcbf80840f222bf31c5c4b7068a/python/clx/analytics/perfect_hash.py
# !wget https://cdn.huggingface.co/dslim/bert-base-NER/vocab.txt
# !python3 perfect_hash.py --vocab 'vocab.txt' --output 'vocab-hash.txt'
import cudf
from transformers import BertTokenizer

text = "Jenna is from London"
cudf_ser = cudf.Series([text])
cudf_tokens, masks, metadata = cudf_ser.str.subword_tokenize("vocab-hash.txt", do_lower=False)

hugging_face_tokenizer = BertTokenizer(vocab_file="vocab.txt", do_lower_case=False)
d = hugging_face_tokenizer(text, add_special_tokens=False)
h_tokens, token_type_ids, attention_mask = d['input_ids'], d['token_type_ids'], d['attention_mask']

print("cudf_tokens", cudf_tokens[cudf_tokens != 0])
print("h_tokens", h_tokens)

# Discrepancy is for the token "Jenna":
#   cuDF breaks it down into -> 147 1424 1605 (J, ##en, ##na)
#   Hugging Face gives -> 13862 (Jenna)
```

Rachel / @BartleyR, please feel free to correct any errors I may have made in the above answers. :-)
I believe this discrepancy is due to the vocab.txt file in the tokenizer test repo being improperly encoded for Unicode strings. Because of this mistake, the tokenizer uses word-splitting rules that are not based on the complete BERT vocab, and therefore it does not match Hugging Face's word-splitting rules. I think we may need to refactor the subword tokenizer in order for it to match the BERT tokenizer from Hugging Face. CC @efajardo-nv @davidwendt
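For context on why a missing vocabulary entry yields subword splits rather than an error: WordPiece-style tokenizers greedily take the longest matching piece at each position and fall back to smaller pieces when the whole word is absent. A minimal sketch of that fallback behavior (an illustration only, not cuDF's or Hugging Face's actual implementation):

```python
# Greedy longest-match-first (WordPiece-style) tokenization sketch.
# If a whole word is present in the vocab it stays intact; if it is
# effectively missing (e.g. lost to a mis-encoded vocab file), the
# tokenizer falls back to smaller pieces.
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    tokens, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        # Find the longest vocab entry matching at `start`.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation marker
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matched at all
        tokens.append(piece)
        start = end
    return tokens

full_vocab = {"Jenna", "J", "##en", "##na"}
broken_vocab = {"J", "##en", "##na"}  # "Jenna" lost during encoding
print(wordpiece_tokenize("Jenna", full_vocab))    # ['Jenna']
print(wordpiece_tokenize("Jenna", broken_vocab))  # ['J', '##en', '##na']
```

This matches the behavior reported above: with "Jenna" effectively missing from the hash table, the GPU tokenizer can only produce the subword pieces.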
It seems the issue is that the perfect_hash.py function is no longer generating the expected output.

If you use compact hashing, it works.
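For readers unfamiliar with the terminology: a hash is "perfect" over a vocabulary when every token maps to a distinct slot, which is what lets the table stay compact (close to the vocabulary size). A toy brute-force parameter search illustrating the general idea (a hedged sketch only, not the algorithm `perfect_hash.py` actually implements):

```python
# Toy illustration of perfect hashing: search for parameters (a, b)
# such that every vocabulary token maps to a distinct slot in a small
# table. NOT the scheme perfect_hash.py uses; names are illustrative.
def token_hash(token, a, b, table_size):
    v = 0
    for ch in token:
        v = (a * v + ord(ch)) % (1 << 61)  # simple polynomial hash
    return (v + b) % table_size

def find_perfect_params(tokens, table_size, max_a=1000):
    for a in range(2, max_a):
        for b in range(table_size):
            slots = {token_hash(t, a, b, table_size) for t in tokens}
            if len(slots) == len(tokens):  # injective => perfect
                return a, b
    raise ValueError("no collision-free parameters found in range")

vocab = ["Jenna", "J", "##en", "##na", "London"]
a, b = find_perfect_params(vocab, table_size=8)  # 5 tokens, 8 slots
```

Real tools avoid this brute-force search with smarter constructions, but the correctness criterion is the same: no two vocabulary tokens may share a slot.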
@VibhuJawa Can we close this?

Yeah, I think this is good to close. We can address this in #5799.
This PR closes part of #5799 by upstreaming [`perfect_hash.py`](https://github.com/rapidsai/clx/blob/267c6d30805c9dcbf80840f222bf31c5c4b7068a/python/clx/analytics/perfect_hash.py) to `cudf`. Please note that I don't understand the details of the inner workings of `perfect_hash.py`; this is more of a one-to-one port of the file with minimal code changes. To ensure correctness, I verified that we get the same result as `perfect_hash.py` ([vocab_hash.txt](https://github.com/rapidsai/cudf/blob/910e5276e2a7b734652d05b18e9fbf9b5571fa25/python/cudf/cudf/tests/data/vocab_hash/ground_truth_vocab_hash.txt)) created on the vocabulary [`bert-base-uncased-vocab.txt`](python/cudf/cudf/tests/data/vocab_hash/bert-base-uncased-vocab.txt). The main change here is that I have removed the `non-compact` code path, as it caused failures like the one in [this issue](#5760 (comment)).

### TODO:
- [x] Add function
- [x] Add test to ensure equivalence
- [x] Add changelog

### Previous Problems:
1. Adding this test would:
   a. Add `30 s` to the test suite
   b. Add `1.8 MB` because of the ground-truth and vocabulary files

   We can reduce both, if the above are unacceptable, by sampling the vocabulary down to fewer words.

### Updated PR:
The above have now been addressed by sampling non-special symbols. Adding this test now:
   a. Adds `1.5 s` to the test suite
   b. Adds `112 KB` because of the ground-truth and vocabulary files
We recently added `subword_tokenize`, which takes a hash file as input but doesn't provide instructions on creating it.
My best guess was that we create it using perfect_hash.py, based on the documentation in `cudf/cpp/src/text/subword/detail/hash_utils.cuh` (line 155 at e81d6a1).
But the perfect_hash.py file may be giving us incorrect results (or we may have a bug downstream). See the discrepancy with Hugging Face shown earlier in this thread, which makes me feel that the results are incorrect: look at the `Jenna` string.

Questions:
A. Is the perfect_hash.py file linked here correct?
B. Is there a way to check the correctness of the hashing that we run?
C. Are there any other reasons we may see this discrepancy?
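One generic way to approach question B, independent of `perfect_hash.py`'s internals, is to confirm that a candidate hash assignment is collision-free over the whole vocabulary. A hedged sketch with stand-in hash functions (not the actual hash used by the tokenizer):

```python
# Sketch for question B: a hash table over a fixed vocab can only be
# "perfect" if no two tokens share a slot. `hash_fn` here is a
# stand-in for whatever hash the table was built with.
from collections import defaultdict

def find_collisions(tokens, hash_fn, table_size):
    slots = defaultdict(list)
    for t in tokens:
        slots[hash_fn(t) % table_size].append(t)
    # Any slot holding more than one token is a collision.
    return {s: ts for s, ts in slots.items() if len(ts) > 1}

vocab = ["Jenna", "J", "##en", "##na", "London"]
# A length-based hash is obviously not perfect on this vocab:
print(find_collisions(vocab, len, 16))  # {4: ['##en', '##na']}
```

Running a check like this over the full vocab.txt against the generated hash table would directly reveal whether any token (such as "Jenna") is unreachable or colliding.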
CC: @davidwendt / @efajardo-nv