[FEA] Support [CLS] [SEP] for subword_tokenize to handle correctly #6937

Closed
shangw-nvidia opened this issue Dec 7, 2020 · 8 comments
Labels: feature request, libcudf, strings

Comments

@shangw-nvidia

Hi @VibhuJawa ,

As we discussed in the chat, it seems there is a problem where subword_tokenize does not handle special tokens (e.g., [CLS], [SEP]) correctly, and I'm creating this GitHub issue to track it.

Thanks!
Shang

@shangw-nvidia added the 'Needs Triage' and 'question' labels Dec 7, 2020
@kkraus14 added the 'bug', 'libcudf', and 'strings' labels and removed the 'Needs Triage' and 'question' labels Dec 7, 2020
@kkraus14 changed the title from [QST] Support [CLS] [SEP] for subword_tokenize to handle correctly to [BUG] Support [CLS] [SEP] for subword_tokenize to handle correctly Dec 7, 2020
@davidwendt

Is this different from #5765? Maybe that one could be updated instead?

@VibhuJawa

@davidwendt, could we close that one and use this one instead? That one was based on hashing logic (where the problem might lie) that has since been upstreamed. Anyway, here is a minimal example of the issue:

Minimal Example

Helper function to create the vocab + text:

import cudf
import numpy as np

with open('test_vocab.txt', 'w') as f:
    string = '[PAD]\n[UNK]\n[CLS]\n[SEP]\n[MASK]\nclschar\nsepchar\nmsk_char\ni\nate\ndinner\nit\nwas\nyummy\n.'
    f.write(string)

def create_vocab_table(vocabpath):
    """
    Create vocabulary tables from the vocab.txt file.

    Parameters
    ----------
    vocabpath: path of the vocabulary file

    Returns
    -------
    id2vocab: np.ndarray mapping token id to token string
    vocab2id: dict mapping token string to token id
    """
    id2vocab = []
    vocab2id = {}
    with open(vocabpath) as f:
        for index, line in enumerate(f):
            token = line.split()[0]
            id2vocab.append(token)
            vocab2id[token] = index
    return np.array(id2vocab), vocab2id

id2vocab, vocab2int = create_vocab_table('test_vocab.txt')

from cudf.utils.hash_vocab_utils import hash_vocab
hash_vocab('test_vocab.txt', 'vocab-hash.txt')
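
As a quick sanity check (not part of the original snippet), the token ids follow the line order of test_vocab.txt; the id rewrites in the workaround below rely on this:

```
>>> vocab2int['[CLS]'], vocab2int['[SEP]'], vocab2int['clschar'], vocab2int['sepchar']
(2, 3, 5, 6)
```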

Minimal Example: (The encoding of [CLS], [SEP] is off)

text = '[CLS]I ate dinner.[SEP]It was yummy.[SEP]'
cudf_ser = cudf.Series([text])
tokens, attention_masks, metadata = cudf_ser.str.subword_tokenize('vocab-hash.txt', do_lower=True, do_truncate=False)
print(tokens[0:17])
print(id2vocab[tokens[0:17].get()])
[ 1  1  1  8  9 10 14  1  1  1 11 12 13 14  1  1  1]
['[UNK]' '[UNK]' '[UNK]' 'i' 'ate' 'dinner' '.' '[UNK]' '[UNK]' '[UNK]'
 'it' 'was' 'yummy' '.' '[UNK]' '[UNK]' '[UNK]']

Expected output

If we swap the special symbols for non-special placeholder words (and map the token ids back afterwards), the problem goes away. Below is a workaround for the current issue.

text = '[CLS]I ate dinner.[SEP]It was yummy.[SEP]'
cudf_ser = cudf.Series([text])
cudf_ser = cudf_ser.str.replace(['[CLS]', '[SEP]'], ['clschar ', ' sepchar '], regex=False)
cudf_ser = cudf_ser.str.normalize_spaces()
tokens, attention_masks, metadata = cudf_ser.str.subword_tokenize('vocab-hash.txt', do_lower=True, do_truncate=False)
### replace all occurrences of clschar (id 5) with the true [CLS] id (2)
tokens[tokens==5] = 2
### replace all occurrences of sepchar (id 6) with the true [SEP] id (3)
tokens[tokens==6] = 3
print(tokens[0:17])
print(id2vocab[tokens[0:17].get()])
[ 2  8  9 10 14  3 11 12 13 14  3  0  0  0  0  0  0]
['[CLS]' 'i' 'ate' 'dinner' '.' '[SEP]' 'it' 'was' 'yummy' '.' '[SEP]'
 '[PAD]' '[PAD]' '[PAD]' '[PAD]' '[PAD]' '[PAD]']

@davidwendt commented Jan 14, 2021

There is no code in the subword tokenizer implementation that looks for these special tokens. So this would be a feature request.

[CLS]I ate dinner.[SEP]It was yummy.[SEP]
is tokenized into (after lower-casing):

[  cls  ]  i  ate  dinner  .   [  sep  ]  it  was  yummy  .
1  1    1  8  9    10      14  1  1    1  11  12   13     14

The bracket characters '[' and ']' are categorized as pad-with-space, probably so that words inside them are parsed/tokenized properly.
What are the rules here?

  • Are the special tokens always 3 upper-case characters in brackets [XYZ]?
  • Is there a finite set of special tokens?
  • Should the code just always treat text [*] as a single token? This seems like it would be a significant change if anyone is relying on the current behavior.

@VibhuJawa commented Jan 15, 2021

  • Are the special tokens always 3 upper-case characters in brackets [XYZ]?

No, I don't think that is a safe assumption. They can be configured based on the vocabulary, but the convention is to use them like that.

  • Is there a finite set of special tokens?

In most cases we have a finite set (see below, from link), but this can be configurable; see the additional_special_tokens argument.

bos_token (str or tokenizers.AddedToken, optional) – A special token representing the beginning of a sentence. 

eos_token (str or tokenizers.AddedToken, optional) – A special token representing the end of a sentence. 

unk_token (str or tokenizers.AddedToken, optional) – A special token representing an out-of-vocabulary token.

sep_token (str or tokenizers.AddedToken, optional) – A special token separating two different sentences in the same input (used by BERT for instance). 

pad_token (str or tokenizers.AddedToken, optional) – A special token used to make arrays of tokens the same size for batching purpose.

cls_token (str or tokenizers.AddedToken, optional) – A special token representing the class of the input (used by BERT for instance). 

mask_token (str or tokenizers.AddedToken, optional) – A special token representing a masked token (used by masked-language modeling pretraining objectives, like BERT).

  • Should the code just always treat text [*] as a single token? This seems like it would be a significant change if anyone is relying on the current behavior.

No, it really should not.

So this would be a feature request.

Gotcha, thanks for explaining that. Yes, then this will be a feature request.

Behaviour for these special tokens:

I believe the requested behavior is that we don't tokenize/lower-case these special tokens and skip any pre-processing for them, so that they pick up the right token_ids. I believe this follows what Hugging Face does.

See link and link.
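
For comparison, a minimal sketch of the Hugging Face behavior (this assumes the `transformers` package and the stock bert-base-uncased vocabulary, neither of which is part of this issue):

```
from transformers import BertTokenizerFast

hf_tok = BertTokenizerFast.from_pretrained('bert-base-uncased')
# Special tokens already present in the text are kept intact: they are not
# lower-cased or split, so they map straight to their reserved ids.
print(hf_tok.tokenize('[CLS]I ate dinner.[SEP]It was yummy.[SEP]'))
# expected: ['[CLS]', 'i', 'ate', 'dinner', '.', '[SEP]', 'it', 'was', 'yummy', '.', '[SEP]']
```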

Initial Solution:

I think just providing support for the above-mentioned 7 tokens with appropriate defaults will cover most use cases, so if handling an arbitrary list of special tokens is extra work we can probably skip it for now.
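
Until then, a user-side stop-gap that generalizes the workaround above might look like the sketch below (hypothetical helper name; it assumes the placeholder words are present in the vocabulary file and that vocab2int comes from create_vocab_table earlier in this thread):

```
import cudf

# special token -> placeholder word that actually exists in the vocabulary
special_map = {'[CLS]': 'clschar', '[SEP]': 'sepchar', '[MASK]': 'msk_char'}

def tokenize_with_specials(ser, hash_file, vocab2int):
    # Swap each special token for its placeholder word before tokenizing,
    # then map the placeholder ids back to the true special-token ids.
    ser = ser.str.replace(list(special_map.keys()),
                          [' %s ' % w for w in special_map.values()],
                          regex=False)
    ser = ser.str.normalize_spaces()
    tokens, masks, metadata = ser.str.subword_tokenize(
        hash_file, do_lower=True, do_truncate=False)
    for special, word in special_map.items():
        tokens[tokens == vocab2int[word]] = vocab2int[special]
    return tokens, masks, metadata

tokens, masks, metadata = tokenize_with_specials(
    cudf.Series(['[CLS]I ate dinner.[SEP]It was yummy.[SEP]']),
    'vocab-hash.txt', vocab2int)
```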

CC @raykallen and @BartleyR, in case they have any use cases that need more than the above 7 special tokens.

@harrism changed the title from [BUG] Support [CLS] [SEP] for subword_tokenize to handle correctly to [FEA] Support [CLS] [SEP] for subword_tokenize to handle correctly Jan 18, 2021
@harrism added the 'feature request' label and removed the 'bug' label Jan 18, 2021
@davidwendt self-assigned this Jan 26, 2021
rapids-bot bot pushed a commit that referenced this issue Feb 11, 2021
Closes #6937 
This PR adds support for the following 7 special tokens in `subword_tokenize`:
[BOS], [EOS], [UNK], [SEP], [PAD], [CLS], and [MASK]
Descriptions for these can be found in the links/text in #6937.

These can be placed anywhere in the text and may be upper- or lower-case. They will be recognized regardless of whether they exist in the given vocabulary hash table. Example using the vocab-hash.txt and code snippet from #6937:
```
>>> text = '[CLS]I ate dinner.[SEP][BOS]It was yummy.[EOS]'
>>> cudf_ser = cudf.Series([text])
>>> tokens, attention_masks, metadata = cudf_ser.str.subword_tokenize('vocab-hash.txt', do_lower=True, do_truncate=False)
>>> print(id2vocab[tokens[0:17].get()])
['[CLS]' 'i' 'ate' 'dinner' '.' '[SEP]' '[BOS]' 'it' 'was' 'yummy' '.'
 '[EOS]' '[PAD]' '[PAD]' '[PAD]' '[PAD]' '[PAD]']
```

A new gtest was added for this feature.
This requires no API change to the C++ or Python interfaces.

Authors:
  - David (@davidwendt)

Approvers:
  - Devavret Makkar (@devavret)
  - Karthikeyan (@karthikeyann)

URL: #7254
@davidwendt

The solution I chose for this in #7254 was to hardcode recognizing the following 7 special tokens

[BOS] [EOS] [UNK] [SEP] [PAD] [CLS] [MASK]

These can appear anywhere in the string and may be upper or lower case.
If the provided vocab hash includes these tokens, they will be assigned their ids appropriately; otherwise they will be assigned the [UNK] token value.
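
For example, with the test_vocab.txt hash from earlier in this thread (which has no [BOS]/[EOS] entries), something like the following would be expected; this is a sketch of the described [UNK] fallback, not output verified against the merged code:

```
>>> cudf_ser = cudf.Series(['[CLS]I ate dinner.[SEP][BOS]It was yummy.[EOS]'])
>>> tokens, masks, metadata = cudf_ser.str.subword_tokenize('vocab-hash.txt', do_lower=True, do_truncate=False)
>>> print(id2vocab[tokens[0:13].get()])
['[CLS]' 'i' 'ate' 'dinner' '.' '[SEP]' '[UNK]' 'it' 'was' 'yummy' '.' '[UNK]' '[PAD]']
```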

@github-actions

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@davidwendt

@VibhuJawa Can we close this?
We can reopen it if the solution in #7254 mentioned above is not adequate.

@VibhuJawa

@davidwendt, Yup this is good to close. Thanks for your work on this.
