
Alternative to bpe #50

Open
marcov-dart opened this issue Feb 28, 2024 · 16 comments

Comments

@marcov-dart

Maybe I am completely wrong, but to me using something like BPE to build an encoding for text feels stupid. Sure, it is a fairly easy way, and it builds an encoding that is efficient in terms of sequence length, but is that the only requirement for such an encoding? Would an encoding that makes sense not make training and inference easier?
Should we not engineer an encoding instead, using a priori knowledge of languages, from dictionaries for instance?

@JohannesVod

I don't understand this question. Take a closer look, for example, at the GPT-4 tokenizer. It basically splits the text up into words. How does this encoding not make sense?

@marcov-dart
Author

Well, I did look at the cl100k_base tokens. And sure, it has lots of words in it, and lots of word pieces that make sense to me. But also, just as an example:

26217 - ancellationToken
34233 - cancellationToken
36935 - cancellation
38337 - CancellationToken
50322 - ancellation
82449 - (CancellationToken
83398 - Cancellation
96570 - cancellationToken

Not to mention things like:

28424 - ]])
8432 - [:,
28290 - ")));
27875 - [`
27662 - ')).
27657 - ^^^^
27608 - ,.

I hope this clarifies what I mean when I say it doesn't make much sense.

@marcov-dart
Author

I was thinking of a scheme where the tokeniser doesn't just give a sequence of tokens but also a set of modifiers for the tokens. The integer we would use as normal to look up the token embedding, but the modifiers we would map onto synthetic (non-trained) dimensions of the token embedding.

The tokeniser would have to be very different from just doing BPE. Instead we would make an attempt at building a tokeniser using a priori knowledge of different languages. Just as an example:

" bus", "Bus" and "busses" would all map to the same integer, and the synthetic dimensions would provide the information about the specific variation. Maybe language could be a modifier too, so that the same word in different languages also has the same integer.

@JohannesVod

JohannesVod commented Feb 29, 2024

That's an interesting idea. Normally this mapping happens automatically during training, meaning " bus", "Bus" and "busses" get mapped to similar points in the embedding space; the specific "variation" is simply learned by the language model. Also, I don't see a simple way to calculate the modifiers or to know which words should map to the same integer. But I don't want to discourage you. Maybe you can take an English dictionary and find the variations of each word by some predefined rules. For example, for each verb you could map every variation to the same integer, and then take a unique modifier for each combination of singular/plural, person, etc. and add it after the embedding. I assume this is what you meant. Indeed, I think this would make it easier for the model to learn, so it's a very nice idea. However, you need an English dictionary for this, and you also need to decide what to do with words that are not in the dictionary. It's not easy to do, and there are probably many challenges that will arise.

@marcov-dart
Author

Yes, I do not expect it to be easy. But then again, the current approach shows us that our encoding doesn't have to be perfect. A hybrid approach maybe: partly crafted/designed, with BPE used to fill in the blanks.
Of course, I am working from the hypothesis that a more sensible encoding has tangible benefits during training and inference, and that would require a demonstration. A test of some sort.

@JohannesVod

JohannesVod commented Feb 29, 2024

Yes, I suggest trying it out on the TinyStories dataset, as it only contains simple English words. I would first take a modern BPE, maybe the one from this repo, and compare it with yours. But you have to be very careful about how you measure the performance. Maybe compare model parameters against loss?

@teleprint-me

You can do a lazy evaluation.

For example, just lowercase each input, check if it's part of the subsequent word, and if it is, map it to the same id.

"bus" in "Bus".lower()  # True
"bus" in "busses" # True
# etc

This would require some form of pre-tokenization though.

Keep in mind the more complex it becomes, the more likely you've gone down the wrong path.

The idea and goal would be to remain as general as possible.
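
A slightly fuller sketch of that lazy check, applied while assigning ids during pre-tokenization (the vocab dict and the function are assumptions for illustration, not code from this repo):

vocab = {}

def canonical_id(word, previous=""):
    """Assign ids so that case/whitespace/inflection variants reuse an existing id."""
    key = word.strip().lower()
    # Lazy check: if a previously seen form is contained in this one,
    # reuse its key instead of allocating a new id.
    if previous and previous.strip().lower() in key:
        key = previous.strip().lower()
    if key not in vocab:
        vocab[key] = len(vocab)
    return vocab[key]

print(canonical_id("Bus"))             # 0
print(canonical_id("busses", "Bus"))   # 0, because "bus" is contained in "busses"
print(canonical_id(" bus"))            # 0 as well, after stripping the space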

@marcov-dart
Author

@NiceGuySaysHi Thank you for the suggestion, it was honestly helpful. I think your suggestion of taking a modern BPE is good. If the tokeniser gets trained on the entire training data, it would result in a best-case BPE-based encoder. This encoding would be the shortest encoding for the training data given a certain dictionary size, and would be the most efficient encoding in terms of required compute per training cycle.
The only way I see for a different encoding to win out in that respect would be if it reached a lower loss with fewer training cycles.
If it doesn't, it would be disappointing, but it wouldn't necessarily be a total loss. It may be that the resulting model performs better in other respects. Maybe it performs better outside of the data it was trained on. Or maybe it is less prone to hallucinations. Or maybe the resulting model survives quantisation more gracefully.
It may be interesting to train the same model using the stock cl100k encoding and the llama encoding, and compare their training loss curves with the best-case BPE encoder.

@zouharvi

zouharvi commented Mar 2, 2024

We looked into what makes good tokenizations from a mathematical perspective and found that Rényi entropy is a good ranker across multiple tokenizations. That is, if you tokenize either with BPE, morphologically, or with some other algorithm, it tends to select the tokenization that will lead to the highest model performance.

There's a tool for this evaluation: tokenization-scorer.

I'd be super interested in what kind of tokenization maximizes this metric (it's not BPE) and whether it's really the best one from a performance perspective. It's a bit tangential to the original question, but I still wonder what tokenization could directly optimize this.

Or maybe we could come up with different metrics that measure something else (i.e. not model performance) based on the unigram(?) distribution of the tokenization.
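
For concreteness, a minimal sketch of the Rényi entropy of a tokenization's unigram distribution (a plain reimplementation of the standard formula, not the tokenization-scorer API; the default α is only a placeholder):

import math
from collections import Counter

def renyi_entropy(tokens, alpha=2.5):
    """Rényi entropy H_alpha of the unigram distribution of a tokenized text."""
    counts = Counter(tokens)
    total = sum(counts.values())
    probs = [c / total for c in counts.values()]
    if alpha == 1.0:
        # alpha -> 1 limit is the Shannon entropy
        return -sum(p * math.log(p) for p in probs)
    return math.log(sum(p ** alpha for p in probs)) / (1.0 - alpha)

# Toy usage: score two tokenizations of the same text and compare.
bpe_like = "pre tok en ization is split in to sub word s".split()
morph_like = "pre token iz ation is split into sub word s".split()
print(renyi_entropy(bpe_like), renyi_entropy(morph_like))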

(apologies for pushing another paper here)

@EAzari

EAzari commented Mar 2, 2024

Can we develop tiny language models based on characters (and/or character groups), rather than word-based approaches like tokens?
The idea is that there are n-chars (something similar to n-grams) as multivectors, from geometric algebra:
1-chars are characters
2-chars are words with two characters
3-chars are words with three characters
...
n-chars are words with n characters

So content is split into n-chars as geometric algebra vectors, not into big chunks as tokens (some words and characters).

That way, even a dictionary can help the process.
Characters are finite, limited to a small set! Words are finite! Phrases are finite! And even sentences are finite!

@marcov-dart
Author

@zouharvi You're fine. I am super interested to learn anything relating to the encoding of text and downstream performance. I haven't properly read your paper yet, but I will.
The biggest issue I see with encoding text using something like BPE is that it quite often constructs tokens that obfuscate some of the information contained in the text.
Take 57266 from cl100k for instance. The model needs to learn how and when to use this token from examples. And that is fine, it can do that given enough examples. 57266 by itself doesn't tell me anything, but the sequence of characters associated with this token is ".isLoading", and that suddenly tells me a lot. An encoding that turns this into a sequence of three tokens ".", "is" and "Loading" is certainly longer, but should (I think) allow for much easier generalisation.
I mean, it is not like there are no other tokens that follow a similar scheme: .isPresent, .isBlank, .isVisible, .hasMore, etc.

Don't even get me started on 24106 - .DataGridViewTextBoxColumn. WTAF.
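
To make the kind of split concrete, here is a small heuristic sketch that breaks identifier-style strings on punctuation and camelCase boundaries (the regex is my own assumption, not anything cl100k or any existing tokenizer actually does):

import re

# Split identifier-like strings on punctuation and camelCase boundaries,
# e.g. ".isLoading" -> [".", "is", "Loading"].
_PIECE = re.compile(r"[A-Z]+(?![a-z])|[A-Z][a-z]*|[a-z]+|\d+|[^\sA-Za-z\d]+")

def split_identifier(text):
    return _PIECE.findall(text)

print(split_identifier(".isLoading"))               # ['.', 'is', 'Loading']
print(split_identifier(".DataGridViewTextBoxColumn"))
# ['.', 'Data', 'Grid', 'View', 'Text', 'Box', 'Column']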

@zouharvi

zouharvi commented Mar 3, 2024

@marcov-dart I see your point. Morphological tokenization (e.g. using morfessor) might do just that. For example, the word pretokenization is split into the morphemes pre token iz ation. I point this out because morphemes are the minimal units that carry meaning[1] and do not give way to statistical anomalies.

However, the paper (and many other previous works) shows that this is not the best tokenization. This makes me wonder what really makes a good tokenization.

[1] Please correct me if this is wrong.

@marcov-dart
Author

I feel certain that these problems are related to how human languages work and how the transformer works. In language there is lots of meaning in the separate words and lots of meaning in the structure of the sentence. The transformer learns the meaning of the tokens and learns what to pay attention to in terms of structure. These fit well together, which is nice, and it explains why transformers work as well as they do.
Where it seems to fall down a bit is at the level of the characters that make up the words. I suspect this is because individual characters don't really mean anything, and combining characters into a word is a rather arbitrary exercise unless you are combining existing words into a new one.

@EAzari

EAzari commented Mar 3, 2024

@marcov-dart That's why I mentioned a character-based or character-grouping (vowels, plosives, fricatives, etc.) approach!
The approach I'm working on is something like applying a RAG-like solution in pre-processing, tokenization and the other steps!
DataGridViewTextBoxColumn is a 24-char token as a vector.
Then it is split into known n-char vectors: Data + Grid + View + Text + Box + Column
And each n-char is a combination of 1-char vectors: D + a + t + a

@sgdescent

I have a question regarding this discussion. Has anyone ever tried doing something like backoff merging, i.e. first merge n-grams of N tokens instead of pairs, then reduce N, and finally come down to pairs?
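
For reference, a rough sketch of what such a backoff merge could look like, merging the most frequent n-gram at each width before reducing n (the function names, the merge schedule and the frequency threshold are assumptions, not an existing implementation):

from collections import Counter

def most_frequent_ngram(ids, n):
    """Return the most common contiguous n-gram and its count."""
    grams = Counter(tuple(ids[i:i + n]) for i in range(len(ids) - n + 1))
    if not grams:
        return (), 0
    return grams.most_common(1)[0]

def merge_ngram(ids, gram, new_id):
    """Replace every non-overlapping occurrence of `gram` with `new_id`."""
    out, i, n = [], 0, len(gram)
    while i < len(ids):
        if tuple(ids[i:i + n]) == gram:
            out.append(new_id)
            i += n
        else:
            out.append(ids[i])
            i += 1
    return out

def backoff_train(ids, merges_per_n=10, next_id=256):
    """Merge the most frequent 4-grams first, then 3-grams, then pairs."""
    merges = []
    for n in (4, 3, 2):
        for _ in range(merges_per_n):
            gram, count = most_frequent_ngram(ids, n)
            if count < 2:       # nothing worth merging at this width
                break
            merges.append((gram, next_id))
            ids = merge_ngram(ids, gram, next_id)
            next_id += 1
    return ids, merges

ids, merges = backoff_train(list("abababab abab".encode("utf-8")))
print(len(merges), ids)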

@EAzari

EAzari commented Mar 23, 2024

Library Is All You Need

It would bring efficiency, consistency, scalability, adaptability, and "interoperability" to language models globally.

The idea is to have a universal linguistic-unit library/dictionary/model for representation and processing.

BPE doesn't have a dictionary; it depends on the vocabulary pool the LLM builds during training.

I say let's have a separate (external) standard dictionary of universally indexed linguistic units that is in everyone's hands,
and have everyone develop their LLMs based on it.

Current LLMs, RAGs, and vector DBs are based on processing "individual characters and words", which is inefficient.

There is a need for a universal linguistic-unit library/dictionary/model for representation and processing, to improve LLMs and reduce costs and hallucinations.

We have Unicode characters.
Do we have Unicode character groups?
Do we have Unicode words?
Do we have Unicode phrases?
Do we have Unicode sentences?
etc.
