
[FEA] Byte Pair Encoding Tokenizer #9657

Open · VibhuJawa opened this issue Nov 11, 2021 · 15 comments

Labels: feature request · libcudf · strings

@VibhuJawa (Member) commented Nov 11, 2021

Is your feature request related to a problem? Please describe.

We should add a byte pair encoding tokenizer to cuDF. Just as our subword tokenizer adds a bridge to BERT-like models, a Byte Pair Encoding tokenizer is used by RoBERTa, GPT-2, and GPT-3, and will give us a bridge to many more DL models.

We should focus on porting a pre-trained tokenizer first.

Describe the solution you'd like

The implementation should follow the GPT-2 tokenizer but should be extendable to RoBERTa, GPT-3, Megatron, etc. We should follow the HuggingFace API for this.

Algorithm (a rough Python sketch of the merge-learning loop follows the steps below):

  1. Add an identifier (</w>) at the end of each word to identify the end of a word and then calculate the word frequency in the text.
  2. Split the word into characters and then calculate the character frequency.
  3. From the character tokens, for a predefined number of iterations, count the frequency of the consecutive byte pairs and merge the most frequently occurring byte pairing.
  4. Keep iterating until you have reached the iteration limit (set by you) or until you have reached the token limit.

Ref: Link
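
For illustration, here is a minimal Python sketch of the merge-learning loop described in the steps above (the function name `learn_bpe_merges` and its inputs are made up for this example; this is not the proposed libcudf implementation):

```python
from collections import Counter

def learn_bpe_merges(corpus, num_merges):
    # Steps 1-2: word frequencies, with each word split into characters plus an
    # end-of-word marker.
    word_freqs = Counter(corpus.split())
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}

    merges = []
    for _ in range(num_merges):
        # Step 3: count frequencies of consecutive symbol pairs across all words.
        pair_freqs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_freqs[pair] += freq
        if not pair_freqs:
            break

        # Merge the most frequent pair everywhere it occurs.
        best = max(pair_freqs, key=pair_freqs.get)
        merges.append(best)
        new_vocab = {}
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] = freq
        vocab = new_vocab
    # Step 4: stop after num_merges iterations (or earlier if nothing is left to merge).
    return merges
```

For example, `learn_bpe_merges("low lower lowest low", 10)` learns `('l', 'o')` and then `('lo', 'w')` as its first merges, since those pairs are the most frequent in that toy corpus.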

Additional context

Best Explanation of Algorithm: https://leimao.github.io/blog/Byte-Pair-Encoding/

CC: @randerzander, @beckernick

VibhuJawa added the feature request, Needs Triage, and strings labels and removed the Needs Triage label on Nov 11, 2021
@VibhuJawa (Member, Author)

We will probably need a libcudf implementation of the following BPE function (see the HF reference implementation).

Here, given the rank of each bigram, we repeatedly merge the lowest-ranked (i.e. most frequent) bigram according to the ranks provided in the merges file. Once we have that, we convert the result into token ids using the provided vocabulary.

Actual algorithm:

def get_pairs(word):
    """
    Return the set of adjacent symbol pairs in a word.

    The word is represented as a tuple of symbols (symbols being variable-length strings).
    """
    pairs = set()
    prev_char = word[0]
    for char in word[1:]:
        pairs.add((prev_char, char))
        prev_char = char
    return pairs


def bpe(token, bpe_ranks):
    # (The HF implementation also caches the result per token.)
    word = tuple(token)
    pairs = get_pairs(word)

    if not pairs:
        return token

    while True:
        # Merge the pair with the lowest rank (the earliest-learned merge) first.
        bigram = min(pairs, key=lambda pair: bpe_ranks.get(pair, float("inf")))
        if bigram not in bpe_ranks:
            break
        first, second = bigram
        new_word = []
        i = 0
        while i < len(word):
            try:
                j = word.index(first, i)
            except ValueError:
                new_word.extend(word[i:])
                break
            else:
                new_word.extend(word[i:j])
                i = j

            if word[i] == first and i < len(word) - 1 and word[i + 1] == second:
                # Combine the pair into a single symbol.
                new_word.append(first + second)
                i += 2
            else:
                new_word.append(word[i])
                i += 1
        word = tuple(new_word)
        if len(word) == 1:
            break
        pairs = get_pairs(word)
    # Return the merged symbols separated by spaces.
    return " ".join(word)

Example Call

# wget https://huggingface.co/gpt2/raw/main/merges.txt 
# to get this file

merges_file = 'gpt_2_tokenizer/merges.txt'
with open(merges_file, encoding="utf-8") as merges_handle:
    bpe_merges = merges_handle.read().split("\n")[1:-1]
bpe_merges = [tuple(merge.split()) for merge in bpe_merges]
bpe_ranks = dict(zip(bpe_merges, range(len(bpe_merges))))

bpe("Thisisit", bpe_ranks)
'This is it'
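
The second part mentioned above (converting the merged symbols to token ids using the provided vocabulary) could then look roughly like this; the vocab path and the per-symbol lookup are illustrative, not an existing API:

```python
# wget https://huggingface.co/gpt2/raw/main/vocab.json
# to get this file (the path below is illustrative)
import json

with open('gpt_2_tokenizer/vocab.json', encoding="utf-8") as vocab_handle:
    token_to_id = json.load(vocab_handle)

# Look up the id of each BPE symbol produced by bpe() above.
print([token_to_id.get(sym) for sym in bpe("Thisisit", bpe_ranks).split(" ")])
```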

@VibhuJawa (Member, Author) commented Nov 19, 2021

CC: @davidwendt for awareness.

@meghmak13

There is a need for the aforementioned feature, as we currently only support tokenization for BERT, especially considering that newer architectures like RoBERTa, GPT, and T5 are being adopted.

@VibhuJawa (Member, Author)

Basic algorithm:

  1. Basic pre-processing like space cleanup and UTF-8 decoding.
  2. Tokenize each sentence based on a delimiter.
  3. Call BPE on each token to further tokenize it.
  4. Find the numeric representation of each token in the provided vocabulary.
  5. Pad according to the provided padding and return the input_ids, which are essentially the key lookups from the vocabulary table.
  6. Also return the attention_masks, which are a binary tensor indicating the positions of the padded indices so that the model does not attend to them.

**Extra notes:** We will have to add things like padding and strides, similar to what we have for the subword tokenizer. A rough sketch of steps 4-6 is shown below.
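
As a sketch of steps 4-6 only (the function name `encode_tokens` and its arguments are hypothetical, not an existing cuDF API):

```python
def encode_tokens(tokens, token_to_id, max_length, pad_id=0):
    # Step 4: numeric representation of each token from the provided vocabulary.
    ids = [token_to_id.get(tok, token_to_id.get("<unk>", 0)) for tok in tokens]
    ids = ids[:max_length]
    # Step 5: pad up to max_length with the pad token id.
    n_pad = max_length - len(ids)
    input_ids = ids + [pad_id] * n_pad
    # Step 6: attention mask is 1 for real tokens, 0 for padded positions.
    attention_mask = [1] * len(ids) + [0] * n_pad
    return input_ids, attention_mask
```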

Python code to show this in action

from transformers import GPT2Tokenizer
import pandas as pd
import json

# !wget https://huggingface.co/gpt2/raw/main/vocab.json
# !wget https://huggingface.co/gpt2/raw/main/merges.txt
with open('vocab.json') as f:
    token_to_id = json.load(f)
    id_to_token = {v: k for k, v in token_to_id.items()}
    
text_ser = ["This is test-sentence-1", "This is test sentence-2", "This-is test sentence 3"]
tokenizer = GPT2Tokenizer(vocab_file = 'vocab.json', merges_file='merges.txt')
tokenizer.add_special_tokens({'pad_token': '[PAD]'})
encoded_batch = tokenizer.batch_encode_plus(text_ser,
                                            return_tensors='np',
                                            truncation=True, 
                                            padding='max_length',
                                            max_length=12)




print("BPE output", [tokenizer.bpe(token) for token in text_ser[0].split(' ')])

print("tokenizer-output-with-not=cleaned-up-special-token ", [id_to_token.get(i, '[PAD]') for i in encoded_batch['input_ids'][0]])
print("tokenizer-output-cleaned-up", [tokenizer.decode(i) for i in encoded_batch['input_ids'][0]])
print("Final Output of tokenizer: ", encoded_batch['input_ids'][0])

print("\n"+"*"*50+"\n")
print("Batched Output")
print("Final Output of tokenizer:\n", encoded_batch['input_ids'])
BPE output ['This', 'is', 'test - sent ence - 1']
tokenizer-output-with-not=cleaned-up-special-token  ['This', 'Ġis', 'Ġtest', '-', 'sent', 'ence', '-', '1', '[PAD]', '[PAD]', '[PAD]', '[PAD]']
tokenizer-output-cleaned-up ['This', ' is', ' test', '-', 'sent', 'ence', '-', '1', '[PAD]', '[PAD]', '[PAD]', '[PAD]']
Final Output of tokenizer:  [ 1212   318  1332    12 34086   594    12    16 50257 50257 50257 50257]

**************************************************

Batched Output
Final Output of tokenizer:
 [[ 1212   318  1332    12 34086   594    12    16 50257 50257 50257 50257]
 [ 1212   318  1332  6827    12    17 50257 50257 50257 50257 50257 50257]
 [ 1212    12   271  1332  6827   513 50257 50257 50257 50257 50257 50257]]

CC: @davidwendt

github-actions bot commented Jan 1, 2022

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@teju85 (Member) commented Jan 20, 2022

Has anyone been working on this? Or has this been prioritized for anytime soon? In the past week I got/saw requests for this at a couple of places.

@davidwendt (Contributor) commented Jan 20, 2022

> Has anyone been working on this? Or has this been prioritized for anytime soon? In the past week I got/saw requests for this at a couple of places.

I've not worked on it yet but I hope to start on it in 22.04.

davidwendt self-assigned this Jan 20, 2022
@davidwendt (Contributor)

@VibhuJawa Some questions based on the examples given here. Do you want a BPE function that takes a host string (and the merge/rank table) and returns the BPE result as a host string?

This shows passing in a word (a substring of a string), returning its BPE result, and then the Python code building an array of BPE strings from each token.

text_ser = ["This is test-sentence-1", "This is test sentence-2", "This-is test sentence 3"]
...
print("BPE output", [tokenizer.bpe(token) for token in text_ser[0].split(' ')])

The Thisisit example showed the same thing -- single host string returns a single host string.

I'm trying to understand the inputs and outputs for a cuDF use case. Are you expecting to give the libcudf BPE API a strings column of words and have it return the encoding of each as a strings column?

Or do I have this all wrong and you are expecting a libcudf API that does everything GPT2Tokenizer is doing in the last example above?

@davidwendt (Contributor)

@VibhuJawa (Member, Author)

On the vocab front

I tried to verify whether we can treat the vocab.json files similarly to how we treat the vocab in the subword tokenizer, and I think we can, but there are three main discrepancies I found.

Similarity: the vocab dict maps tokens to a contiguous range of integer ids.

I verified that, across the commonly used models, the token->id dict can be treated as a list since there are no missing ids (it is a contiguous range), like the subword tokenizer vocabulary.

Below for the verification reference (a quick check is also sketched after the link):
https://gist.github.com/VibhuJawa/1670178d07d9659a084a8fbe7d160d23
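
A minimal sketch of that check (assuming `token_to_id` is the dict loaded from vocab.json):

```python
# Check that the ids form a contiguous 0..N-1 range so the dict can be
# flattened into a list indexed by id.
assert sorted(token_to_id.values()) == list(range(len(token_to_id)))
id_to_token_list = [tok for tok, _ in sorted(token_to_id.items(), key=lambda kv: kv[1])]
```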

Discrepancies:

  1. Special tokens:
    Most BPE models have these special tokens:
'<s>', '</s>', '<unk>', '<pad>', '<mask>'

but can also include something like:

'<extra_id_0>', '<extra_id_1>', '<extra_id_2>', '<extra_id_3>',

while the subword one mostly has these:

[BOS],[EOS],[UNK],[SEP],[PAD],[CLS],[MASK]

I think it makes sense to make this configurable from the Python API, which we will initialize with the right defaults.

  2. Padding token:

The padding token's id depends on the dictionary (it is the id of <pad>), so its value can change. We should ensure we handle that correctly.

I think (unsure) we currently just treat it as 0 in the subword tokenizer. A one-line illustration is below.
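
A one-line illustration (hypothetical; `token_to_id` is the vocab dict, and falling back to 0 mirrors the current subword behaviour):

```python
# The pad id comes from the loaded vocab rather than being hard-coded;
# fall back to 0 if the vocab has no <pad> entry (e.g. GPT-2).
pad_id = token_to_id.get('<pad>', 0)
```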

  3. Treating space characters:

BPE treats space characters differently depending on context. That is, "Hello world" and " Hello world" (with a leading space) get mapped differently.

When there is a space before the word it gets mapped to ĠHello, and if there is no space, to Hello.

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("roberta-base")

id_to_token = {v: k for k, v in tokenizer.vocab.items()}

no_space_hello = "Hello world"
no_space_input_ids = tokenizer(no_space_hello, add_special_tokens=False)['input_ids']
print(no_space_input_ids)
print([id_to_token[i] for i in no_space_input_ids])
print("----"*10)
space_hello = " Hello world"
space_input_ids = tokenizer(space_hello, add_special_tokens=False)['input_ids']
print(space_input_ids)
print([id_to_token[i] for i in space_input_ids])
[31414, 232]
['Hello', 'Ġworld']
----------------------------------------
[20920, 232]
['ĠHello', 'Ġworld']

On getting a testable example to you:

Sorry, getting a meaningful end-to-end Python example that works across models turned out to be tougher than I anticipated, but I will update here once I have it working.

github-actions bot commented Mar 6, 2022

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

rapids-bot bot pushed a commit that referenced this issue Mar 17, 2022
Reference #9657 

Add the `nvtext::byte_pair_encoding` API. This is not the BPE tokenizer but just the encoding function. The tokenizer will be a larger effort that will probably span multiple PRs. Providing the encoder here to be evaluated independently.

Theoretically, this API could perhaps be used like the following to achieve _similar_ BPE tokenizer behavior:
```
input = strings to tokenize
mps = nvtext::load_merge_pairs_file("merges.txt");
bpe = nvtext::byte_pair_encoding( input, mps );

vocab = nvtext::load_vocabulary_file( "hashed_vocab.txt" );
result = nvtext::subword_tokenize( bpe, vocab, max_length, stride, lower_case, truncate, max_rows );
```

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Vyas Ramasubramani (https://github.com/vyasr)
  - Bradley Dice (https://github.com/bdice)
  - https://github.com/nvdbaranec

URL: #10270
github-actions bot commented Jun 4, 2022

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

@BartleyR (Member)

We have a potential Morpheus customer who wants to use the phishing detection pipeline, but in a non-English language. So we'd have to replace the BERT model with something else, and it would need a BPE tokenizer. We can do a POC using a CPU-based tokenizer, but it would be good to scope this, if we can, for an upcoming release. @GregoryKimball for visibility.

GregoryKimball added the libcudf label and removed the inactive-30d label on Nov 21, 2022
@GregoryKimball (Contributor)

This request is still relevant. After discussing with @VibhuJawa, the next step is benchmarking a GPT-3-style training workflow and measuring the percentage of time spent in tokenization. If tokenization is 15-30% of the total time (as we see with BERT), then this is worth prioritizing. Otherwise we should recommend tokenization with HuggingFace.

@mtsai-rapids commented Jan 6, 2025

Based on the conversation with @VibhuJawa, we are looking to use RAPIDS for SentencePiece BPE: https://rapids-goai.slack.com/archives/C5E06F4DC/p1736188662934809?thread_ts=1735837564.762089&cid=C5E06F4DC
