[FEA] Support [CLS] [SEP] for subword_tokenize to handle correctly #6937
Is this different than #5765? Maybe that one could be updated instead?
@davidwendt, Could we close that one and use this one instead? That one was based on hashing logic (since upstreamed), which is where the problem might lie. Anyways, a minimal example of the issue:

**Minimal Example**

Helper function to create vocab + text:

```python
import cudf
import numpy as np

with open('test_vocab.txt', 'w') as f:
    string = '[PAD]\n[UNK]\n[CLS]\n[SEP]\n[MASK]\nclschar\nsepchar\nmsk_char\ni\nate\ndinner\nit\nwas\nyummy\n.'
    f.write(string)

def create_vocab_table(vocabpath):
    """
    Create vocabulary tables from the vocab.txt file.

    Parameters
    ----------
    vocabpath : path of the vocabulary file

    Returns
    -------
    id2vocab : np.array, dtype=<U5
    vocab2id : dict that maps strings to int
    """
    id2vocab = []
    vocab2id = {}
    with open(vocabpath) as f:
        for index, line in enumerate(f):
            token = line.split()[0]
            id2vocab.append(token)
            vocab2id[token] = index
    return np.array(id2vocab), vocab2id

id2vocab, vocab2int = create_vocab_table('test_vocab.txt')

from cudf.utils.hash_vocab_utils import hash_vocab
hash_vocab('test_vocab.txt', 'vocab-hash.txt')
```

Minimal example (the encoding of [CLS] and [SEP] is off):

```python
text = '[CLS]I ate dinner.[SEP]It was yummy.[SEP]'
cudf_ser = cudf.Series([text])
tokens, attention_masks, metadata = cudf_ser.str.subword_tokenize('vocab-hash.txt', do_lower=True, do_truncate=False)
print(tokens[0:17])
print(id2vocab[tokens[0:17].get()])
```

```
[ 1 1 1 8 9 10 14 1 1 1 11 12 13 14 1 1 1]
['[UNK]' '[UNK]' '[UNK]' 'i' 'ate' 'dinner' '.' '[UNK]' '[UNK]' '[UNK]'
 'it' 'was' 'yummy' '.' '[UNK]' '[UNK]' '[UNK]']
```

**Expected output**

If we swap the special symbols for regular vocabulary words, the problem goes away. Below is a work-around for the current issue.

```python
text = '[CLS]I ate dinner.[SEP]It was yummy.[SEP]'
cudf_ser = cudf.Series([text])
cudf_ser = cudf_ser.str.replace(['[CLS]', '[SEP]'], ['clschar ', ' sepchar '], regex=False)
cudf_ser = cudf_ser.str.normalize_spaces()
tokens, attention_masks, metadata = cudf_ser.str.subword_tokenize('vocab-hash.txt', do_lower=True, do_truncate=False)
### replace all occurrences of clschar (id 5 here) with the true [CLS] id (2)
tokens[tokens == 5] = 2
### replace all occurrences of sepchar (id 6 here) with the true [SEP] id (3)
tokens[tokens == 6] = 3
print(tokens[0:17])
print(id2vocab[tokens[0:17].get()])
```

```
[ 2 8 9 10 14 3 11 12 13 14 3 0 0 0 0 0 0]
['[CLS]' 'i' 'ate' 'dinner' '.' '[SEP]' 'it' 'was' 'yummy' '.' '[SEP]'
 '[PAD]' '[PAD]' '[PAD]' '[PAD]' '[PAD]' '[PAD]']
```
There is no code in the subword tokenizer implementation that looks for these special tokens. So this would be a feature request.
The bracket characters '[' and ']' are categorized as pad-with-space, probably so that words inside them are properly parsed/tokenized.
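To make the failure mode concrete, here is a rough, purely illustrative sketch of what a pad-with-space categorization would do to the input (the helper below is not the actual libcudf code):

```python
import re

def pad_with_space(text):
    # Illustrative only: surround '[' and ']' with spaces, mimicking a
    # pad-with-space character category applied before vocabulary lookup.
    return re.sub(r"([\[\]])", r" \1 ", text)

print(pad_with_space("[CLS]I ate dinner.[SEP]"))
# ' [ CLS ] I ate dinner. [ SEP ] '
# '[', 'cls', and ']' are then looked up individually; since only the whole
# string '[CLS]' is in the vocabulary, each piece maps to [UNK].
```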
No, I don't think that is a safe assumption. This can be configurable based on the vocabulary, but the convention is to use them like that.
In most cases we have a finite set (see below, from link), but this can be configurable. See the additional_special_tokens argument.
No, it really should not.
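For reference, a short sketch of the Hugging Face `additional_special_tokens` argument mentioned above (assumes the `transformers` package is installed; the `[NEW_TOK1]`/`[NEW_TOK2]` names are made up for illustration):

```python
from transformers import BertTokenizer

# Register extra special tokens on top of the standard [CLS]/[SEP]/[PAD]/etc.
tok = BertTokenizer.from_pretrained(
    "bert-base-uncased",
    additional_special_tokens=["[NEW_TOK1]", "[NEW_TOK2]"],
)
print(tok.additional_special_tokens)  # ['[NEW_TOK1]', '[NEW_TOK2]']
```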
Gotcha, thanks for explaining that. Yes, then this will be a feature request.

**Behavior for these special tokens:** I believe the requested behavior is that we don't tokenize/lowercase these special tokens and skip any pre-processing, so that they pick up the right token ids. I believe this follows what Hugging Face does.

**Initial solution:** I think just providing support for the above-mentioned 7 tokens with appropriate defaults will cover most use cases, so if handling an arbitrary list of special tokens is extra work we can probably skip it for now.

CC: @raykallen and @BartleyR, in case they have any use cases that need more than the above 7 special tokens.
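For comparison, a quick sketch of the Hugging Face behavior referenced above (assumes the `transformers` package and the `bert-base-uncased` vocabulary are available):

```python
from transformers import BertTokenizer

tok = BertTokenizer.from_pretrained("bert-base-uncased")
text = "[CLS]I ate dinner.[SEP]It was yummy.[SEP]"
print(tok.tokenize(text))
# Special tokens such as '[CLS]' and '[SEP]' come back as single tokens
# rather than being lower-cased and split into '[', 'cls', ']'.
```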
Closes #6937

This PR adds support for the following 7 special tokens in `subword_tokenize`: [BOS], [EOS], [UNK], [SEP], [PAD], [CLS], and [MASK]. Descriptions for these can be found in the links/text in #6937. These can be placed anywhere in the text and may be upper- or lower-case. They will be recognized regardless of whether they exist in the given vocabulary hash table.

Example using vocab-hash.txt and the code snippet from #6937:

```
>>> text = '[CLS]I ate dinner.[SEP][BOS]It was yummy.[EOS]'
>>> cudf_ser = cudf.Series([text])
>>> tokens, attention_masks, metadata = cudf_ser.str.subword_tokenize('vocab-hash.txt', do_lower=True, do_truncate=False)
>>> print(id2vocab[tokens[0:17].get()])
['[CLS]' 'i' 'ate' 'dinner' '.' '[SEP]' '[BOS]' 'it' 'was' 'yummy' '.'
 '[EOS]' '[PAD]' '[PAD]' '[PAD]' '[PAD]' '[PAD]']
```

A new gtest was added for this feature. This requires no API change to the C++ or Python interfaces.

Authors:
- David (@davidwendt)

Approvers:
- Devavret Makkar (@devavret)
- Karthikeyan (@karthikeyann)

URL: #7254
The solution I chose for this in #7254 was to hardcode recognizing the following 7 special tokens: [BOS], [EOS], [UNK], [SEP], [PAD], [CLS], and [MASK].
These can appear anywhere in the string and may be upper- or lower-case.
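A small usage sketch of that behavior, reusing the `vocab-hash.txt` and `id2vocab` from the example above (assumes a cuDF build that includes #7254; the exact ids depend on the vocabulary):

```python
import cudf

# Mixed-case special tokens placed anywhere in the string should now be
# recognized instead of degenerating into runs of [UNK].
cudf_ser = cudf.Series(['[cls]I ate dinner.[SEP]it was yummy.[sep]'])
tokens, attention_masks, metadata = cudf_ser.str.subword_tokenize(
    'vocab-hash.txt', do_lower=True, do_truncate=False
)
print(id2vocab[tokens[0:12].get()])
# Expected to begin with '[CLS]' and contain '[SEP]' rather than '[UNK]'.
```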
@VibhuJawa Can we close this?
@davidwendt, yup, this is good to close. Thanks for your work on this.
Hi @VibhuJawa,
As we discussed in the chat, it seems there was a problem where `subword_tokenize` would not handle special tokens (e.g., `[CLS]`, `[SEP]`) correctly, and I'm creating this GitHub issue to track it. Thanks!
Shang