
[FEA] Byte Pair Encoding Tokenizer #9657

Open · VibhuJawa opened this issue Nov 11, 2021 · 15 comments

Labels: feature request · libcudf · strings

@VibhuJawa (Member) commented Nov 11, 2021

Is your feature request related to a problem? Please describe.

We should add a byte pair encoding tokenizer to cuDF. Just as our subword tokenizer adds a bridge to BERT-like models, a Byte Pair Encoding tokenizer is used by RoBERTa, GPT-2, and GPT-3, and will give us a bridge to many more DL models.

We should focus on porting a pre-trained tokenizer first.

Describe the solution you'd like

The implementation should follow the GPT-2 tokenizer but should be extendable to RoBERTa, GPT-3, Megatron, etc. We should follow the HuggingFace API for this.

Algorithm (a rough Python sketch of the merge-learning loop follows the steps below):

  1. Add an identifier (</w>) at the end of each word to identify the end of a word and then calculate the word frequency in the text.
  2. Split the word into characters and then calculate the character frequency.
  3. From the character tokens, for a predefined number of iterations, count the frequency of the consecutive byte pairs and merge the most frequently occurring byte pairing.
  4. Keep iterating until you have reached the iteration limit (set by you) or until you have reached the token limit.

Ref: Link
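
For illustration, here is a minimal Python sketch of the merge-learning loop described in the steps above (the function name `learn_bpe_merges` and its inputs are made up for this example; this is not the proposed libcudf implementation):

```python
from collections import Counter

def learn_bpe_merges(corpus, num_merges):
    # Steps 1-2: word frequencies, with each word split into characters plus an
    # end-of-word marker.
    word_freqs = Counter(corpus.split())
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}

    merges = []
    for _ in range(num_merges):
        # Step 3: count frequencies of consecutive symbol pairs across all words.
        pair_freqs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_freqs[pair] += freq
        if not pair_freqs:
            break

        # Merge the most frequent pair everywhere it occurs.
        best = max(pair_freqs, key=pair_freqs.get)
        merges.append(best)
        new_vocab = {}
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] = freq
        vocab = new_vocab
    # Step 4: stop after num_merges iterations (or earlier if nothing is left to merge).
    return merges
```

For example, `learn_bpe_merges("low lower lowest low", 10)` learns `('l', 'o')` and then `('lo', 'w')` as its first merges, since those pairs are the most frequent in that toy corpus.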

Additional context

Best Explanation of Algorithm: https://leimao.github.io/blog/Byte-Pair-Encoding/

CC: @randerzander, @beckernick

VibhuJawa added the feature request, Needs Triage, and strings labels and removed the Needs Triage label on Nov 11, 2021
@VibhuJawa (Member, Author)

We will probably need a libcudf implementation of the following BPE function (see the HF reference implementation).

Here, given the rank of each bigram, we repeatedly merge the lowest-ranked (i.e. most frequent) bigram according to the ranks provided in the merges file. Once we have that, we convert the result into token ids using the provided vocabulary.

Actual algorithm:

def get_pairs(word):
    """
    Return the set of adjacent symbol pairs in a word.

    The word is represented as a tuple of symbols (symbols being variable-length strings).
    """
    pairs = set()
    prev_char = word[0]
    for char in word[1:]:
        pairs.add((prev_char, char))
        prev_char = char
    return pairs


def bpe(token, bpe_ranks):
    # (The HF implementation also caches the result per token.)
    word = tuple(token)
    pairs = get_pairs(word)

    if not pairs:
        return token

    while True:
        # Merge the pair with the lowest rank (the earliest-learned merge) first.
        bigram = min(pairs, key=lambda pair: bpe_ranks.get(pair, float("inf")))
        if bigram not in bpe_ranks:
            break
        first, second = bigram
        new_word = []
        i = 0
        while i < len(word):
            try:
                j = word.index(first, i)
            except ValueError:
                new_word.extend(word[i:])
                break
            else:
                new_word.extend(word[i:j])
                i = j

            if word[i] == first and i < len(word) - 1 and word[i + 1] == second:
                # Combine the pair into a single symbol.
                new_word.append(first + second)
                i += 2
            else:
                new_word.append(word[i])
                i += 1
        word = tuple(new_word)
        if len(word) == 1:
            break
        pairs = get_pairs(word)
    # Return the merged symbols separated by spaces.
    return " ".join(word)

Example Call

# wget https://huggingface.co/gpt2/raw/main/merges.txt 
# to get this file

merges_file = 'gpt_2_tokenizer/merges.txt'
with open(merges_file, encoding="utf-8") as merges_handle:
    bpe_merges = merges_handle.read().split("\n")[1:-1]
bpe_merges = [tuple(merge.split()) for merge in bpe_merges]
bpe_ranks = dict(zip(bpe_merges, range(len(bpe_merges))))

bpe("Thisisit", bpe_ranks)
'This is it'
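
The second part mentioned above (converting the merged symbols to token ids using the provided vocabulary) could then look roughly like this; the vocab path and the per-symbol lookup are illustrative, not an existing API:

```python
# wget https://huggingface.co/gpt2/raw/main/vocab.json
# to get this file (the path below is illustrative)
import json

with open('gpt_2_tokenizer/vocab.json', encoding="utf-8") as vocab_handle:
    token_to_id = json.load(vocab_handle)

# Look up the id of each BPE symbol produced by bpe() above.
print([token_to_id.get(sym) for sym in bpe("Thisisit", bpe_ranks).split(" ")])
```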

@VibhuJawa (Member, Author) commented Nov 19, 2021

CC: @davidwendt for awareness.

@meghmak13

There is a need for the aforementioned feature, as we currently only support tokenization for BERT, especially considering that newer architectures like RoBERTa, GPT, and T5 are being adopted.

@VibhuJawa (Member, Author)

Basic algorithm:

  1. Basic pre-processing like space cleanup and UTF-8 decoding.
  2. Tokenize each sentence based on a delimiter.
  3. Call BPE on each token to further tokenize it.
  4. Find the numeric representation of each token in the provided vocabulary.
  5. Pad according to the provided padding and return the input_ids, which are essentially the key lookups from the vocabulary table.
  6. Also return the attention_masks, which are a binary tensor indicating the positions of the padded indices so that the model does not attend to them.

**Extra notes:** We will have to add things like padding and strides, similar to what we have for the subword tokenizer. A rough sketch of steps 4-6 is shown below.
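
As a sketch of steps 4-6 only (the function name `encode_tokens` and its arguments are hypothetical, not an existing cuDF API):

```python
def encode_tokens(tokens, token_to_id, max_length, pad_id=0):
    # Step 4: numeric representation of each token from the provided vocabulary.
    ids = [token_to_id.get(tok, token_to_id.get("<unk>", 0)) for tok in tokens]
    ids = ids[:max_length]
    # Step 5: pad up to max_length with the pad token id.
    n_pad = max_length - len(ids)
    input_ids = ids + [pad_id] * n_pad
    # Step 6: attention mask is 1 for real tokens, 0 for padded positions.
    attention_mask = [1] * len(ids) + [0] * n_pad
    return input_ids, attention_mask
```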

Python code to show this in action

from transformers import GPT2Tokenizer
import pandas as pd
import json

# !wget https://huggingface.co/gpt2/raw/main/vocab.json
# !wget https://huggingface.co/gpt2/raw/main/merges.txt
with open('vocab.json') as f:
    token_to_id = json.load(f)
    id_to_token = {v: k for k, v in token_to_id.items()}
    
text_ser = ["This is test-sentence-1", "This is test sentence-2", "This-is test sentence 3"]
tokenizer = GPT2Tokenizer(vocab_file = 'vocab.json', merges_file='merges.txt')
tokenizer.add_special_tokens({'pad_token': '[PAD]'})
encoded_batch = tokenizer.batch_encode_plus(text_ser,
                                            return_tensors='np',
                                            truncation=True, 
                                            padding='max_length',
                                            max_length=12)




print("BPE output", [tokenizer.bpe(token) for token in text_ser[0].split(' ')])

print("tokenizer-output-with-not=cleaned-up-special-token ", [id_to_token.get(i, '[PAD]') for i in encoded_batch['input_ids'][0]])
print("tokenizer-output-cleaned-up", [tokenizer.decode(i) for i in encoded_batch['input_ids'][0]])
print("Final Output of tokenizer: ", encoded_batch['input_ids'][0])

print("\n"+"*"*50+"\n")
print("Batched Output")
print("Final Output of tokenizer:\n", encoded_batch['input_ids'])
BPE output ['This', 'is', 'test - sent ence - 1']
tokenizer-output-with-not=cleaned-up-special-token  ['This', 'Ġis', 'Ġtest', '-', 'sent', 'ence', '-', '1', '[PAD]', '[PAD]', '[PAD]', '[PAD]']
tokenizer-output-cleaned-up ['This', ' is', ' test', '-', 'sent', 'ence', '-', '1', '[PAD]', '[PAD]', '[PAD]', '[PAD]']
Final Output of tokenizer:  [ 1212   318  1332    12 34086   594    12    16 50257 50257 50257 50257]

**************************************************

Batched Output
Final Output of tokenizer:
 [[ 1212   318  1332    12 34086   594    12    16 50257 50257 50257 50257]
 [ 1212   318  1332  6827    12    17 50257 50257 50257 50257 50257 50257]
 [ 1212    12   271  1332  6827   513 50257 50257 50257 50257 50257 50257]]

CC: @davidwendt

github-actions bot commented Jan 1, 2022

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@teju85 (Member) commented Jan 20, 2022

Has anyone been working on this? Or has this been prioritized for anytime soon? In the past week I got/saw requests for this at a couple of places.

@davidwendt (Contributor) commented Jan 20, 2022

> Has anyone been working on this? Or has this been prioritized for anytime soon? In the past week I got/saw requests for this at a couple of places.

I've not worked on it yet but I hope to start on it in 22.04.

davidwendt self-assigned this Jan 20, 2022
@davidwendt (Contributor)

@VibhuJawa Some questions based on the examples given here. Do you want a BPE function that takes a host string (and the merge/rank table) and returns the BPE result as a host string?

This shows passing in a word (a substring of a string), returning its BPE result, and then the Python code building an array of BPE strings from each token.

text_ser = ["This is test-sentence-1", "This is test sentence-2", "This-is test sentence 3"]
...
print("BPE output", [tokenizer.bpe(token) for token in text_ser[0].split(' ')])

The Thisisit example showed the same thing -- single host string returns a single host string.

I'm trying to understand the inputs and outputs for a cuDF use case. Are you expecting to give the libcudf BPE API a strings column of words and have it return the encoding of each as a strings column?

Or do I have this all wrong and you are expecting a libcudf API that does everything GPT2Tokenizer is doing in the last example above?

@davidwendt (Contributor)

@VibhuJawa (Member, Author)

On the vocab front

I tried to verify whether we can treat the vocab.json files similarly to how we treat the vocab in the subword tokenizer, and I think we can, but there are three main discrepancies I found.

Similarity: the vocab dict maps tokens to a contiguous range of integer ids.

I verified that, across the commonly used models, the token->id dict can be treated as a list since there are no missing ids (it is a contiguous range), like the subword tokenizer vocabulary.

Below for the verification reference (a quick check is also sketched after the link):
https://gist.github.com/VibhuJawa/1670178d07d9659a084a8fbe7d160d23
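
A minimal sketch of that check (assuming `token_to_id` is the dict loaded from vocab.json):

```python
# Check that the ids form a contiguous 0..N-1 range so the dict can be
# flattened into a list indexed by id.
assert sorted(token_to_id.values()) == list(range(len(token_to_id)))
id_to_token_list = [tok for tok, _ in sorted(token_to_id.items(), key=lambda kv: kv[1])]
```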

Discrepancies:

  1. Special tokens:
    Most BPE models have these special tokens:
'<s>', '</s>', '<unk>', '<pad>', '<mask>'

but can also include something like:

'<extra_id_0>', '<extra_id_1>', '<extra_id_2>', '<extra_id_3>',

while the subword one mostly has these:

[BOS],[EOS],[UNK],[SEP],[PAD],[CLS],[MASK]

I think it makes sense to make this configurable from the Python API, which we will initialize with the right defaults.

  2. Padding token:

The padding token's id depends on the dictionary (it is the id of <pad>), so its value can change. We should ensure we handle that correctly.

I think (unsure) we currently just treat it as 0 in the subword tokenizer. A one-line illustration is below.
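
A one-line illustration (hypothetical; `token_to_id` is the vocab dict, and falling back to 0 mirrors the current subword behaviour):

```python
# The pad id comes from the loaded vocab rather than being hard-coded;
# fall back to 0 if the vocab has no <pad> entry (e.g. GPT-2).
pad_id = token_to_id.get('<pad>', 0)
```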

  3. Treating space characters:

BPE treats space characters differently depending on context. That is, "Hello world" and " Hello world" (with a leading space) get mapped differently.

When there is a space before the word it gets mapped to ĠHello, and if there is no space, to Hello.

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("roberta-base")

id_to_token = {v: k for k, v in tokenizer.vocab.items()}

no_space_hello = "Hello world"
no_space_input_ids = tokenizer(no_space_hello, add_special_tokens=False)['input_ids']
print(no_space_input_ids)
print([id_to_token[i] for i in no_space_input_ids])
print("----"*10)
space_hello = " Hello world"
space_input_ids = tokenizer(space_hello, add_special_tokens=False)['input_ids']
print(space_input_ids)
print([id_to_token[i] for i in space_input_ids])
[31414, 232]
['Hello', 'Ġworld']
----------------------------------------
[20920, 232]
['ĠHello', 'Ġworld']

On getting a testable example to you:

Sorry, getting a meaningful end-to-end Python example that works across models turned out to be tougher than I anticipated, but I will update here once I have it working.

github-actions bot commented Mar 6, 2022

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

rapids-bot bot pushed a commit that referenced this issue Mar 17, 2022
Reference #9657 

Add the `nvtext::byte_pair_encoding` API. This is not the BPE tokenizer but just the encoding function. The tokenizer will be a larger effort that will probably span multiple PRs. Providing the encoder here to be evaluated independently.

Theoretically, this API could perhaps be used like the following to achieve _similar_ BPE tokenizer behavior:
```
input = strings to tokenize
mps = nvtext::load_merge_pairs_file("merges.txt");
bpe = nvtext::byte_pair_encoding( input, mps );

vocab = nvtext::load_vocabulary_file( "hashed_vocab.txt" );
result = nvtext::subword_tokenize( bpe, vocab, max_length, stride, lower_case, truncate, max_rows );
```

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Vyas Ramasubramani (https://github.com/vyasr)
  - Bradley Dice (https://github.com/bdice)
  - https://github.com/nvdbaranec

URL: #10270
github-actions bot commented Jun 4, 2022

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

@BartleyR (Member)

We have a potential Morpheus customer who wants to use the phishing detection pipeline, but in a non-English language. So we'd have to replace the BERT model with something else, and it would need a BPE tokenizer. We can do a POC using a CPU-based tokenizer, but it would be good to scope this, if we can, for an upcoming release. @GregoryKimball for visibility.

GregoryKimball added the libcudf label and removed the inactive-30d label on Nov 21, 2022
@GregoryKimball (Contributor)

This request is still relevant. After discussing with @VibhuJawa, the next step is benchmarking a GPT-3-style training workflow and measuring the percentage of time spent in tokenization. If tokenization is 15-30% of the total time (as we see with BERT), then this is worth prioritizing. Otherwise we should recommend tokenization with HuggingFace.

@mtsai-rapids commented Jan 6, 2025

Based on the conversation with @VibhuJawa, we are looking to use RAPIDS for SentencePiece BPE: https://rapids-goai.slack.com/archives/C5E06F4DC/p1736188662934809?thread_ts=1735837564.762089&cid=C5E06F4DC
