[FEA] Byte Pair Encoding Tokenizer #9657
Comments
We will probably need a libcudf implementation of the following BPE function (see the HF reference implementation). Here, given the rank of each bigram, we merge the highest-ranked (most frequently occurring) bigram based on the ranks provided in the merges file. Once we have that, we convert the result into token ids using the vocabulary provided. Actual algorithm:

```python
def bpe(token, bpe_ranks):
    # if token in self.cache:
    #     return self.cache[token]
    word = tuple(token)
    pairs = get_pairs(word)

    if not pairs:
        return token

    while True:
        # pick the pair with the lowest rank (i.e. the earliest entry in merges.txt)
        bigram = min(pairs, key=lambda pair: bpe_ranks.get(pair, float("inf")))
        if bigram not in bpe_ranks:
            break
        first, second = bigram
        new_word = []
        i = 0
        while i < len(word):
            try:
                j = word.index(first, i)
            except ValueError:
                new_word.extend(word[i:])
                break
            else:
                new_word.extend(word[i:j])
                i = j

            # merge the bigram wherever it occurs, otherwise copy the symbol through
            if word[i] == first and i < len(word) - 1 and word[i + 1] == second:
                new_word.append(first + second)
                i += 2
            else:
                new_word.append(word[i])
                i += 1
        new_word = tuple(new_word)
        word = new_word
        if len(word) == 1:
            break
        else:
            pairs = get_pairs(word)

    word = " ".join(word)
    # self.cache[token] = word
    return word
```
```python
def get_pairs(word):
    """
    Return set of symbol pairs in a word.
    Word is represented as tuple of symbols (symbols being variable-length strings).
    """
    pairs = set()
    prev_char = word[0]
    for char in word[1:]:
        pairs.add((prev_char, char))
        prev_char = char
    return pairs
```

Example call:

```python
# wget https://huggingface.co/gpt2/raw/main/merges.txt
# to get this file
merges_file = 'gpt_2_tokenizer/merges.txt'
with open(merges_file, encoding="utf-8") as merges_handle:
    bpe_merges = merges_handle.read().split("\n")[1:-1]
bpe_merges = [tuple(merge.split()) for merge in bpe_merges]
bpe_ranks = dict(zip(bpe_merges, range(len(bpe_merges))))

bpe("Thisisit", bpe_ranks)
```
|
CC: @davidwendt for awareness. |
There is a need for the aforementioned feature as we currently only support tokenization for BERT, especially considering that newer architectures like RoBERTa, GPT, and T5 are being adopted. |
Basic Algo:
**Extra Notes**

Python code to show this in action:

```python
from transformers import GPT2Tokenizer
import pandas as pd
import json

# !wget https://huggingface.co/gpt2/raw/main/vocab.json
# !wget https://huggingface.co/gpt2/raw/main/merges.txt

with open('vocab.json') as f:
    token_to_id = json.load(f)
id_to_token = {v: k for k, v in token_to_id.items()}

text_ser = ["This is test-sentence-1", "This is test sentence-2", "This-is test sentence 3"]

tokenizer = GPT2Tokenizer(vocab_file='vocab.json', merges_file='merges.txt')
tokenizer.add_special_tokens({'pad_token': '[PAD]'})

encoded_batch = tokenizer.batch_encode_plus(text_ser,
                                            return_tensors='np',
                                            truncation=True,
                                            padding='max_length',
                                            max_length=12)

print("BPE output", [tokenizer.bpe(token) for token in text_ser[0].split(' ')])
print("tokenizer-output-with-not=cleaned-up-special-token ", [id_to_token.get(i, '[PAD]') for i in encoded_batch['input_ids'][0]])
print("tokenizer-output-cleaned-up", [tokenizer.decode(i) for i in encoded_batch['input_ids'][0]])
print("Final Output of tokenizer: ", encoded_batch['input_ids'][0])
print("\n" + "*" * 50 + "\n")
print("Batched Output")
print("Final Output of tokenizer:\n", encoded_batch['input_ids'])
```

Output:

```
BPE output ['This', 'is', 'test - sent ence - 1']
tokenizer-output-with-not=cleaned-up-special-token  ['This', 'Ġis', 'Ġtest', '-', 'sent', 'ence', '-', '1', '[PAD]', '[PAD]', '[PAD]', '[PAD]']
tokenizer-output-cleaned-up ['This', ' is', ' test', '-', 'sent', 'ence', '-', '1', '[PAD]', '[PAD]', '[PAD]', '[PAD]']
Final Output of tokenizer:  [ 1212 318 1332 12 34086 594 12 16 50257 50257 50257 50257]

**************************************************

Batched Output
Final Output of tokenizer:
 [[ 1212 318 1332 12 34086 594 12 16 50257 50257 50257 50257]
  [ 1212 318 1332 6827 12 17 50257 50257 50257 50257 50257 50257]
  [ 1212 12 271 1332 6827 513 50257 50257 50257 50257 50257 50257]]
```

CC: @davidwendt |
Has anyone been working on this? Or has this been prioritized for anytime soon? In the past week I got/saw requests for this at a couple of places. |
I've not worked on it yet but I hope to start on it in 22.04. |
@VibhuJawa Some questions based on the examples given here. You want a BPE function that takes a host string (and the merge/rank table) and returns the BPE as a host string? The example shows passing in a word (a substring of a string) and returning its BPE, and then the Python code builds an array of BPE strings from each token.
I'm trying to understand the inputs and outputs for a cudf use case. Are you expecting to give the libcudf BPE API a strings column of words and return the encoding of each as a strings column? Or do I have this all wrong and you are expecting a libcudf API that does everything? |
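To make the question concrete, here is a small, purely illustrative pandas sketch (not a libcudf API) of the "strings column in, encoded strings column out" behavior being asked about; it reuses the `bpe()` function and `bpe_ranks` table from the first comment above.

```python
import pandas as pd

# Hypothetical column-wise usage: each row holds a single word, and the result
# column holds its byte-pair encoding as a space-separated string.
# `bpe` and `bpe_ranks` are assumed to be defined as in the first comment.
words = pd.Series(["This", "test", "sentence"])
encoded = words.apply(lambda w: bpe(w, bpe_ranks))
print(encoded)
```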
On the vocab front, I tried to verify whether we can indeed treat the BPE vocab the same way as our current one.

Similarity: The vocab dict is a continuous range of ints mapping to tokens. Verified that across the commonly used models the ids form a contiguous range (a small verification sketch is at the end of this comment).

Discrepancy:
1. Special Tokens: A BPE vocab can also include something like `<s>`, `</s>`, `<pad>`, `<unk>`, `<mask>`, while the subword one mostly has these: `[CLS]`, `[SEP]`, `[PAD]`, `[UNK]`, `[MASK]`. I think it might make sense to make this configurable from the Python API, which we would initialize with the right defaults.
2. Padding Token: The padding token's id is dependent on the dictionary. I think (unsure) we currently just treat it as a fixed default.
3. Space handling: BPE seems to treat space characters differently. That is, when there is a space before a word it gets mapped to a different token (prefixed with `Ġ`):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
id_to_token = {v: k for k, v in tokenizer.vocab.items()}

no_space_hello = "Hello world"
no_space_input_ids = tokenizer(no_space_hello, add_special_tokens=False)['input_ids']
print(no_space_input_ids)
print([id_to_token[i] for i in no_space_input_ids])
print("----" * 10)

space_hello = " Hello world"
space_input_ids = tokenizer(space_hello, add_special_tokens=False)['input_ids']
print(space_input_ids)
print([id_to_token[i] for i in space_input_ids])
```
On getting a testable example to you: sorry, a meaningful end-to-end Python example that works across models turns out to be tougher than I anticipated, but I will update here once I have it working. |
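To back up the "contiguous id range" claim in point 1 above, here is a small verification sketch (the helper name is my own, using the HuggingFace `AutoTokenizer` API); it also prints the pad token to show that its id is model-specific:

```python
from transformers import AutoTokenizer

# Check that a model's vocab maps tokens onto the contiguous id range 0..len(vocab)-1,
# and show that the padding token (when defined) has a model-specific id.
def check_vocab(model_name):
    tok = AutoTokenizer.from_pretrained(model_name)
    ids = sorted(tok.get_vocab().values())
    contiguous = ids == list(range(len(ids)))
    print(model_name, "contiguous:", contiguous, "pad_token:", tok.pad_token, tok.pad_token_id)

for name in ["gpt2", "roberta-base", "bert-base-uncased"]:
    check_vocab(name)
```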
Reference #9657. Add the `nvtext::byte_pair_encoding` API. This is not the BPE tokenizer but just the encoding function. The tokenizer will be a larger effort that will probably span multiple PRs. Providing the encoder here to be evaluated independently. Theoretically, this API could be used like the following to achieve a _similar_ BPE tokenizer behavior perhaps:

```
input = strings to tokenize
mps = nvtext::load_merge_pairs_file("merges.txt");
bpe = nvtext::byte_pair_encoding( input, mps );
vocab = nvtext::load_vocabulary_file( "hashed_vocab.txt" );
result = nvtext::subword_tokenize( bpe, vocab, max_length, stride, lower_case, truncate, max_rows );
```

Authors:
- David Wendt (https://github.com/davidwendt)

Approvers:
- Vyas Ramasubramani (https://github.com/vyasr)
- Bradley Dice (https://github.com/bdice)
- https://github.com/nvdbaranec

URL: #10270 |
We have a potential Morpheus customer who wants to use the phishing detection pipeline but in a non-English language. So we'd have to replace the BERT model with something else, and it would need a BPE tokenizer. We can do a POC using a CPU-based tokenizer, but it would be good to scope this, if we can, for an upcoming release. @GregoryKimball for viz |
This request is still relevant. After discussing with @VibhuJawa, the next step is benchmarking a GPT-3 style training workflow, and measuring the percentage of time spent in tokenization. If tokenization is 15-30% of the total time (as we see in |
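As a rough starting point for that measurement, a CPU-side estimate like the following sketch could be used (hypothetical synthetic corpus and a placeholder for the end-to-end step time, using the HuggingFace tokenizer):

```python
import time
from transformers import AutoTokenizer

# Time the CPU tokenizer over a synthetic corpus and compare it to the measured
# end-to-end step time (stubbed here) to estimate tokenization's share.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
texts = [f"sample sentence number {i}" for i in range(10_000)]

start = time.perf_counter()
tokenizer(texts, truncation=True, max_length=128)
tokenize_seconds = time.perf_counter() - start

end_to_end_seconds = 100.0  # placeholder: measure the full training step in practice
print(f"tokenization share: {100 * tokenize_seconds / end_to_end_seconds:.1f}%")
```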
Based on the conversation with @VibhuJawa, we are looking to use RAPIDS for SentencePiece BPE: https://rapids-goai.slack.com/archives/C5E06F4DC/p1736188662934809?thread_ts=1735837564.762089&cid=C5E06F4DC |
Is your feature request related to a problem? Please describe.
We should add a byte pair encoding tokenizer to cuDF. Just as our subword-tokenizer adds a bridge to BERT-like models, a Byte Pair Encoding tokenizer is used by `roberta`, `gpt-2`, and `gpt-3`, and will give us a bridge to a lot of DL models. We should focus on porting a pre-trained tokenizer first.
Describe the solution you'd like
The implementation should follow the GPT-2 tokenizer but should be extendable to `roberta`, `gpt-3`, `megatron`, etc. We should follow the HuggingFace API for this.

Algorithm: Append `</w>` at the end of each word to identify the end of a word, then calculate the word frequencies in the text. Ref: Link
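For illustration, a minimal sketch of that merge-learning loop on a toy corpus (following the algorithm in the blog post referenced under Additional context, not any existing cudf code):

```python
import re
from collections import Counter

def get_stats(vocab):
    """Count frequencies of adjacent symbol pairs over the word-frequency table."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[(symbols[i], symbols[i + 1])] += freq
    return pairs

def merge_vocab(pair, vocab):
    """Merge every occurrence of `pair` into a single symbol in the vocab keys."""
    bigram = re.escape(" ".join(pair))
    pattern = re.compile(r"(?<!\S)" + bigram + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

# Toy word-frequency table; symbols are space-separated and </w> marks word ends.
vocab = {"l o w </w>": 5, "l o w e r </w>": 2, "n e w e s t </w>": 6, "w i d e s t </w>": 3}
for _ in range(10):
    pairs = get_stats(vocab)
    best = max(pairs, key=pairs.get)
    vocab = merge_vocab(best, vocab)
    print(best)  # each printed pair would become one line of a merges file
```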
Additional context
Best Explanation of Algorithm: https://leimao.github.io/blog/Byte-Pair-Encoding/
CC: @randerzander , @beckernick