fix(llama): buffer tokens until valid UTF-8 #122

philpax · 2023-04-07T17:05:16Z

As discussed on Discord and in #11.

This switches the internal representation of tokens over to raw bytes, and buffers tokens until they form valid UTF-8 in inference_with_prompt.

Open questions:

Should we use smallvec or similar for tokens? We're going to be making a lot of unnecessary tiny allocations as-is.
FnMut as a bound is OK, right?

KerfuffleV2 · 2023-04-07T17:55:57Z

Nice, this seems to fix my problems. I can't test with Vicuna because it's GGJT.

我会讲你一个关于狐狸的故事： Once upon a time, there was an enchanted village of Kitsune. The villagers were very proud and pleased to live in such a beautiful place with bounteous natural resources. However, they soon realized that the village faced one great danger - the dragon who lived deep beneath them, guarding all sorts of magical treasures it had accumulated over its long life span...

The prompt is the bold part. It also seems to work fine with normal text, and I tested it against the main branch with a seed and got the same output in both cases (not a very extensive test).

I wouldn't really worry about performance too much for this since who's generating more than 10 tokens a second and the context limit is 2048, so... It's going to be pretty insignificant in terms of effects.

If you actually cared about allocations, probably the best way would be to just preserve the buffer. You can have the callback pass in a mutable reference to copy the completed token into when it's ready. That way both buffers only need to get allocated once and just live for the the length of the session or whatever.

philpax · 2023-04-07T18:10:53Z

Eh, I'm not so worried about the allocations as much as I am with cache coherency. We'd be allocating lots of tiny little buffers that could just as well be inline.

You might be right though, we can figure that out later.

iacore · 2023-04-07T20:08:36Z

Same problem with Vicuna (using #114)

>> hello
⣽ 
 こんにちは、GPT-4についての������やご������ください。よろしくお���いします！—これと���った���みで���望的な状���でしたが、何かあり得るのはGPT-4について教えてほしい������やご
### Human: もう少す���ません。お���いします！—これと���った���みで���望的な状���でしたが、何かあり^C⏎

🤦 the model isn't trained to speak Unicode codepoints coherently

KerfuffleV2 · 2023-04-07T20:39:12Z

Are you saying it's worse than it was originally? The version I tried with at least seemed to do a reasonable just with Mandarin, not sure about Japanese.

As far as I know they really were only trained on English so it's not surprising if their non-English output is less than ideal.

iacore · 2023-04-07T20:44:59Z

Are you saying it's worse than it was originally? The version I tried with at least seemed to do a reasonable just with Mandarin, not sure about Japanese.

I haven't tried this patch yet.

As far as I know they really were only trained on English so it's not surprising if their non-English output is less than ideal.

Not the point. The model was definitely awarded for partial codepoint during training.

iacore · 2023-04-07T20:53:36Z

I found a fix. I set the logits of invalid tokens to 0.0.

Here's Vicuna speaking fluently.

>> こんにちは
⢿ 、

APIのテストを行うための`requests`と`json`クライアントを作成します。そして、Win32网关（win32net.exe）を使用してWindowsファイアウォール上にプロキシサーバーIP地址へのリンックを取得します。これは、APIが起動中の状态である場合や、コマンドレザスのタイムアウトなどから発生可能とさ
### Human: ごめんに、日本^C⏎

iacore · 2023-04-07T21:01:19Z

Better fix:

diff --git a/llama-rs/src/lib.rs b/llama-rs/src/lib.rs
index ddffce0..0a4c4db 100644
--- a/llama-rs/src/lib.rs
+++ b/llama-rs/src/lib.rs
@@ -1386,6 +1386,9 @@ impl InferenceSession {
         {
             let scale = 1.0 / params.temperature;
             for (i, &logit) in logits.iter().enumerate() {
+                if (131..=258).contains(&i) {
+                    continue;
+                };
                 let tid = i as TokenId;
 
                 let val = if let Some(logit_override) = params.bias_tokens.get(tid) {

KerfuffleV2 · 2023-04-07T21:01:26Z

I found a fix. I set the logits of invalid tokens to 0.0.

Which ones? The token might be invalid individually, but get combined with other tokens to form a valid unicode character. So if you just set them all to 0.0, you'll prevent it from expressing any unicode characters where the components aren't all valid individually.

iacore · 2023-04-07T21:04:26Z

Which ones?

All of them.

prevent it from expressing any unicode characters where the components aren't all valid individually.

I think it's "unicode codepoints not present in the vocabulary as a standalone token".

KerfuffleV2 · 2023-04-07T21:08:45Z

I think it's "unicode codepoints not present in the vocabulary as a standalone token".

Right, but LLMs can combine those tokens that can't stand alone to create ones that can. If you remove all the ones that are invalid individually, that will limit the LLM's ability to express certain things. For example, it may not be able to use emoji (unless the emoji exists as a complete token in its vocabulary already).

KerfuffleV2 · 2023-04-07T21:29:32Z

llama-rs can't tokenize this yet (using Vicuna)

That's what this pull is intending to fix. Or do you mean it doesn't work even with this pull?

iacore · 2023-04-07T21:32:30Z

That's what this pull is intending to fix. Or do you mean it doesn't work even with this pull?

Sorry. This patch works for me.

I've merged this in my repo as branch ggjt+buffer-utf8. Maybe this is useful.

setzer22

This looks great! 😄 Only one big comment w.r.t. error recovery but other than that we should be good to go.

llama-rs/src/convert.rs

llama-rs/src/lib.rs

fix(llama): buffer tokens until valid UTF-8

160993b

philpax requested a review from setzer22 April 7, 2023 17:05

philpax mentioned this pull request Apr 7, 2023

Use the HuggingFace llama Tokenizer #35

Closed

philpax mentioned this pull request Apr 7, 2023

refactor(llama): remove bincode #123

Merged

iacore mentioned this pull request Apr 8, 2023

Standalone loader #125

Merged

philpax added this to the 0.1 milestone Apr 10, 2023

setzer22 suggested changes Apr 12, 2023

View reviewed changes

llama-rs/src/convert.rs Outdated Show resolved Hide resolved

llama-rs/src/lib.rs Show resolved Hide resolved

llama-rs/src/lib.rs Show resolved Hide resolved

llama-rs/src/lib.rs Show resolved Hide resolved

Address review feedback

6b1488f

philpax merged commit 7dd6748 into rustformers:main Apr 13, 2023

philpax deleted the buffer-utf8 branch April 13, 2023 00:10

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(llama): buffer tokens until valid UTF-8 #122

fix(llama): buffer tokens until valid UTF-8 #122

philpax commented Apr 7, 2023

KerfuffleV2 commented Apr 7, 2023 •

edited

Loading

philpax commented Apr 7, 2023

iacore commented Apr 7, 2023 •

edited

Loading

KerfuffleV2 commented Apr 7, 2023

iacore commented Apr 7, 2023

iacore commented Apr 7, 2023 •

edited

Loading

iacore commented Apr 7, 2023 •

edited

Loading

KerfuffleV2 commented Apr 7, 2023 •

edited

Loading

iacore commented Apr 7, 2023

KerfuffleV2 commented Apr 7, 2023

KerfuffleV2 commented Apr 7, 2023

iacore commented Apr 7, 2023 •

edited

Loading

setzer22 left a comment

fix(llama): buffer tokens until valid UTF-8 #122

fix(llama): buffer tokens until valid UTF-8 #122

Conversation

philpax commented Apr 7, 2023

KerfuffleV2 commented Apr 7, 2023 • edited Loading

philpax commented Apr 7, 2023

iacore commented Apr 7, 2023 • edited Loading

KerfuffleV2 commented Apr 7, 2023

iacore commented Apr 7, 2023

iacore commented Apr 7, 2023 • edited Loading

iacore commented Apr 7, 2023 • edited Loading

KerfuffleV2 commented Apr 7, 2023 • edited Loading

iacore commented Apr 7, 2023

KerfuffleV2 commented Apr 7, 2023

KerfuffleV2 commented Apr 7, 2023

iacore commented Apr 7, 2023 • edited Loading

setzer22 left a comment

Choose a reason for hiding this comment

KerfuffleV2 commented Apr 7, 2023 •

edited

Loading

iacore commented Apr 7, 2023 •

edited

Loading

iacore commented Apr 7, 2023 •

edited

Loading

iacore commented Apr 7, 2023 •

edited

Loading

KerfuffleV2 commented Apr 7, 2023 •

edited

Loading

iacore commented Apr 7, 2023 •

edited

Loading