[FEA] Support [CLS] [SEP] for subword_tokenize to handle correctly #6937

Closed
shangw-nvidia opened this issue Dec 7, 2020 · 8 comments
Labels: feature request, libcudf, strings

Comments

@shangw-nvidia

Hi @VibhuJawa ,

As we discussed in the chat, it seems there is a problem where subword_tokenize does not handle special tokens (e.g., [CLS], [SEP]) correctly, and I'm creating this GitHub issue to track it.

Thanks!
Shang

@shangw-nvidia added the 'Needs Triage' and 'question' labels Dec 7, 2020
@kkraus14 added the 'bug', 'libcudf', and 'strings' labels and removed the 'Needs Triage' and 'question' labels Dec 7, 2020
@kkraus14 changed the title from [QST] Support [CLS] [SEP] for subword_tokenize to handle correctly to [BUG] Support [CLS] [SEP] for subword_tokenize to handle correctly Dec 7, 2020
@davidwendt

Is this different from #5765? Maybe that one could be updated instead?

@VibhuJawa

@davidwendt, could we close that one and use this one instead? That one was based on hashing logic (where the problem might lie) that has since been upstreamed. Anyway, here is a minimal example of the issue:

Minimal Example

Helper function to create the vocab + text:

import cudf
import numpy as np

with open('test_vocab.txt', 'w') as f:
    string = '[PAD]\n[UNK]\n[CLS]\n[SEP]\n[MASK]\nclschar\nsepchar\nmsk_char\ni\nate\ndinner\nit\nwas\nyummy\n.'
    f.write(string)

def create_vocab_table(vocabpath):
    """
    Create vocabulary tables from the vocab.txt file.

    Parameters
    ----------
    vocabpath: path of the vocabulary file

    Returns
    -------
    id2vocab: np.ndarray mapping token id to token string
    vocab2id: dict mapping token string to token id
    """
    id2vocab = []
    vocab2id = {}
    with open(vocabpath) as f:
        for index, line in enumerate(f):
            token = line.split()[0]
            id2vocab.append(token)
            vocab2id[token] = index
    return np.array(id2vocab), vocab2id

id2vocab, vocab2int = create_vocab_table('test_vocab.txt')

from cudf.utils.hash_vocab_utils import hash_vocab
hash_vocab('test_vocab.txt', 'vocab-hash.txt')
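
As a quick sanity check (not part of the original snippet), the token ids follow the line order of test_vocab.txt; the id rewrites in the workaround below rely on this:

```
>>> vocab2int['[CLS]'], vocab2int['[SEP]'], vocab2int['clschar'], vocab2int['sepchar']
(2, 3, 5, 6)
```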

Minimal Example: (The encoding of [CLS], [SEP] is off)

text = '[CLS]I ate dinner.[SEP]It was yummy.[SEP]'
cudf_ser = cudf.Series([text])
tokens, attention_masks, metadata = cudf_ser.str.subword_tokenize('vocab-hash.txt', do_lower=True, do_truncate=False)
print(tokens[0:17])
print(id2vocab[tokens[0:17].get()])
[ 1  1  1  8  9 10 14  1  1  1 11 12 13 14  1  1  1]
['[UNK]' '[UNK]' '[UNK]' 'i' 'ate' 'dinner' '.' '[UNK]' '[UNK]' '[UNK]'
 'it' 'was' 'yummy' '.' '[UNK]' '[UNK]' '[UNK]']

Expected output

If we swap the special symbols for non-special placeholder words (and map the token ids back afterwards), the problem goes away. Below is a workaround for the current issue.

text = '[CLS]I ate dinner.[SEP]It was yummy.[SEP]'
cudf_ser = cudf.Series([text])
cudf_ser = cudf_ser.str.replace(['[CLS]', '[SEP]'], ['clschar ', ' sepchar '], regex=False)
cudf_ser = cudf_ser.str.normalize_spaces()
tokens, attention_masks, metadata = cudf_ser.str.subword_tokenize('vocab-hash.txt', do_lower=True, do_truncate=False)
### replace all occurrences of clschar (id 5) with the true [CLS] id (2)
tokens[tokens==5] = 2
### replace all occurrences of sepchar (id 6) with the true [SEP] id (3)
tokens[tokens==6] = 3
print(tokens[0:17])
print(id2vocab[tokens[0:17].get()])
[ 2  8  9 10 14  3 11 12 13 14  3  0  0  0  0  0  0]
['[CLS]' 'i' 'ate' 'dinner' '.' '[SEP]' 'it' 'was' 'yummy' '.' '[SEP]'
 '[PAD]' '[PAD]' '[PAD]' '[PAD]' '[PAD]' '[PAD]']

@davidwendt commented Jan 14, 2021

There is no code in the subword tokenizer implementation that looks for these special tokens. So this would be a feature request.

[CLS]I ate dinner.[SEP]It was yummy.[SEP]
is tokenized into (after lower-casing):

[  cls  ]  i  ate  dinner  .   [  sep  ]  it  was  yummy  .
1  1    1  8  9    10      14  1  1    1  11  12   13     14

The bracket characters '[' and ']' are categorized as pad-with-space, probably so that words inside them are parsed/tokenized properly.
What are the rules here?

  • Are the special tokens always 3 upper-case characters in brackets [XYZ]?
  • Is there a finite set of special tokens?
  • Should the code just always treat text [*] as a single token? This seems like it would be a significant change if anyone is relying on the current behavior.

@VibhuJawa commented Jan 15, 2021

  • Are the special tokens always 3 upper-case characters in brackets [XYZ]?

No, I don't think that is a safe assumption. They can be configured based on the vocabulary, but the convention is to use them like that.

  • Is there a finite set of special tokens?

In most cases we have a finite set (see below, from link), but this can be configurable; see the additional_special_tokens argument.

bos_token (str or tokenizers.AddedToken, optional) – A special token representing the beginning of a sentence. 

eos_token (str or tokenizers.AddedToken, optional) – A special token representing the end of a sentence. 

unk_token (str or tokenizers.AddedToken, optional) – A special token representing an out-of-vocabulary token.

sep_token (str or tokenizers.AddedToken, optional) – A special token separating two different sentences in the same input (used by BERT for instance). 

pad_token (str or tokenizers.AddedToken, optional) – A special token used to make arrays of tokens the same size for batching purpose.

cls_token (str or tokenizers.AddedToken, optional) – A special token representing the class of the input (used by BERT for instance). 

mask_token (str or tokenizers.AddedToken, optional) – A special token representing a masked token (used by masked-language modeling pretraining objectives, like BERT).

  • Should the code just always treat text [*] as a single token? This seems like it would be a significant change if anyone is relying on the current behavior.

No, it really should not.

So this would be a feature request.

Gotcha, thanks for explaining that. Yes, then this will be a feature request.

Behaviour for these special tokens:

I believe the requested behavior is that we don't tokenize/lower-case these special tokens and skip any pre-processing for them, so that they pick up the right token_ids. I believe this follows what Hugging Face does.

See link and link.
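
For comparison, a minimal sketch of the Hugging Face behavior (this assumes the `transformers` package and the stock bert-base-uncased vocabulary, neither of which is part of this issue):

```
from transformers import BertTokenizerFast

hf_tok = BertTokenizerFast.from_pretrained('bert-base-uncased')
# Special tokens already present in the text are kept intact: they are not
# lower-cased or split, so they map straight to their reserved ids.
print(hf_tok.tokenize('[CLS]I ate dinner.[SEP]It was yummy.[SEP]'))
# expected: ['[CLS]', 'i', 'ate', 'dinner', '.', '[SEP]', 'it', 'was', 'yummy', '.', '[SEP]']
```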

Initial Solution:

I think just providing support for the above-mentioned 7 tokens with appropriate defaults will cover most use cases, so if handling an arbitrary list of special tokens is extra work we can probably skip it for now.
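
Until then, a user-side stop-gap that generalizes the workaround above might look like the sketch below (hypothetical helper name; it assumes the placeholder words are present in the vocabulary file and that vocab2int comes from create_vocab_table earlier in this thread):

```
import cudf

# special token -> placeholder word that actually exists in the vocabulary
special_map = {'[CLS]': 'clschar', '[SEP]': 'sepchar', '[MASK]': 'msk_char'}

def tokenize_with_specials(ser, hash_file, vocab2int):
    # Swap each special token for its placeholder word before tokenizing,
    # then map the placeholder ids back to the true special-token ids.
    ser = ser.str.replace(list(special_map.keys()),
                          [' %s ' % w for w in special_map.values()],
                          regex=False)
    ser = ser.str.normalize_spaces()
    tokens, masks, metadata = ser.str.subword_tokenize(
        hash_file, do_lower=True, do_truncate=False)
    for special, word in special_map.items():
        tokens[tokens == vocab2int[word]] = vocab2int[special]
    return tokens, masks, metadata

tokens, masks, metadata = tokenize_with_specials(
    cudf.Series(['[CLS]I ate dinner.[SEP]It was yummy.[SEP]']),
    'vocab-hash.txt', vocab2int)
```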

CC @raykallen and @BartleyR, in case they have any use cases that need more than the above 7 special tokens.

@harrism changed the title from [BUG] Support [CLS] [SEP] for subword_tokenize to handle correctly to [FEA] Support [CLS] [SEP] for subword_tokenize to handle correctly Jan 18, 2021
@harrism added the 'feature request' label and removed the 'bug' label Jan 18, 2021
@davidwendt self-assigned this Jan 26, 2021
rapids-bot bot pushed a commit that referenced this issue Feb 11, 2021
Closes #6937 
This PR adds support for the following 7 special tokens in `subword_tokenize`:
[BOS], [EOS], [UNK], [SEP], [PAD], [CLS], and [MASK]
Descriptions for these can be found in the links/text in #6937.

These can be placed anywhere in the text and may be upper- or lower-case. They will be recognized regardless of whether they exist in the given vocabulary hash table. Example using the vocab-hash.txt and code snippet from #6937:
```
>>> text = '[CLS]I ate dinner.[SEP][BOS]It was yummy.[EOS]'
>>> cudf_ser = cudf.Series([text])
>>> tokens, attention_masks, metadata = cudf_ser.str.subword_tokenize('vocab-hash.txt', do_lower=True, do_truncate=False)
>>> print(id2vocab[tokens[0:17].get()])
['[CLS]' 'i' 'ate' 'dinner' '.' '[SEP]' '[BOS]' 'it' 'was' 'yummy' '.'
 '[EOS]' '[PAD]' '[PAD]' '[PAD]' '[PAD]' '[PAD]']
```

A new gtest was added for this feature.
This requires no API change to the C++ or Python interfaces.

Authors:
  - David (@davidwendt)

Approvers:
  - Devavret Makkar (@devavret)
  - Karthikeyan (@karthikeyann)

URL: #7254
@davidwendt

The solution I chose for this in #7254 was to hardcode recognizing the following 7 special tokens

[BOS] [EOS] [UNK] [SEP] [PAD] [CLS] [MASK]

These can appear anywhere in the string and may be upper or lower case.
If the provided vocab hash includes these tokens, they will be assigned their ids appropriately; otherwise they will be assigned the [UNK] token value.
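
For example, with the test_vocab.txt hash from earlier in this thread (which has no [BOS]/[EOS] entries), something like the following would be expected; this is a sketch of the described [UNK] fallback, not output verified against the merged code:

```
>>> cudf_ser = cudf.Series(['[CLS]I ate dinner.[SEP][BOS]It was yummy.[EOS]'])
>>> tokens, masks, metadata = cudf_ser.str.subword_tokenize('vocab-hash.txt', do_lower=True, do_truncate=False)
>>> print(id2vocab[tokens[0:13].get()])
['[CLS]' 'i' 'ate' 'dinner' '.' '[SEP]' '[UNK]' 'it' 'was' 'yummy' '.' '[UNK]' '[PAD]']
```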

@github-actions

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@davidwendt

@VibhuJawa Can we close this?
We can reopen it if the solution in #7254 mentioned above is not adequate.

@VibhuJawa

@davidwendt, Yup this is good to close. Thanks for your work on this.
