
Add GGJT loader #114

Closed
wants to merge 10 commits into from

Conversation

iacore
Contributor

@iacore iacore commented Apr 6, 2023

Related to #62 #93

single-file model format magic=ggjt

@iacore iacore changed the title Add loader stub for GGJT Add GGJT loader Apr 6, 2023
@philpax
Collaborator

philpax commented Apr 6, 2023

Very nice! Glad someone took this on - we were discussing this on the Discord but decided to wait until upstream figured out what they wanted to do.

There's overlap with #84 and #85, so the merge in the future might be tricky. Just a heads-up.

@iacore
Contributor Author

iacore commented Apr 6, 2023

I ported the code from C++.

I'm sure the reading part is correct. Maybe Tensor setup has changed? Or ggml has changed?

The model runs, but it produces garbage:

### Assistant: 1 + 1 is '`
[2023-04-06T20:03:26Z INFO  llama_cli] Warning: Bad token in vocab at index 131
[2023-04-06T20:03:26Z INFO  llama_cli] Warning: Bad token in vocab at index 132
[2023-04-06T20:03:26Z INFO  llama_cli] Warning: Bad token in vocab at index 133
[... identical "Bad token in vocab" warnings for indices 134 through 257 ...]
[2023-04-06T20:03:26Z INFO  llama_cli] Warning: Bad token in vocab at index 258
[2023-04-06T20:03:26Z INFO  llama_cli] ggml ctx size = 7759.50 MB
[2023-04-06T20:03:26Z INFO  llama_cli] Loading model part 1/1 from 'models/ggml-vicuna-13b-4bit.bin'
[2023-04-06T20:03:26Z INFO  llama_cli] Loaded tensor 8/363
[2023-04-06T20:03:26Z INFO  llama_cli] Loaded tensor 16/363
[... "Loaded tensor" progress lines continue in steps of 8 ...]
[2023-04-06T20:03:26Z INFO  llama_cli] Loaded tensor 360/363
[2023-04-06T20:03:26Z INFO  llama_cli] Loading of 'models/ggml-vicuna-13b-4bit.bin' complete
[2023-04-06T20:03:26Z INFO  llama_cli] Model size = 7759.40 MB / num tensors = 363
[2023-04-06T20:03:26Z INFO  llama_cli] Model fully loaded!
>> Hello?
⣟ 

!#
  #
   #!$

@philpax
Collaborator

philpax commented Apr 6, 2023

I'd take the same model in GGMF and GGJT format and compare the loaded tensors. I'm guessing it's a misalignment or something similar.

@iacore
Contributor Author

iacore commented Apr 6, 2023

Is it normal to warn about bad tokens? The bad tokens are single-byte strings with values above 128. That's invalid UTF-8 on its own, but the tokenizer doesn't care?

I'm probably done with this for a while.

@philpax
Collaborator

philpax commented Apr 6, 2023

Yes, we're discussing that over at #11. The short explanation is that the invalid tokens are not valid UTF-8 by themselves, but they compose to form valid UTF-8. We're still figuring out what the actual solution should be.
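As a small standalone illustration (not code from this PR): bytes that are invalid UTF-8 on their own can still concatenate into valid UTF-8, which is why the detokenizer has to buffer bytes rather than reject them.

    fn main() {
        // Three byte-level "tokens"; each is invalid UTF-8 by itself...
        let token_bytes: [&[u8]; 3] = [&[0xE4], &[0xB8], &[0xAD]];
        for t in token_bytes {
            assert!(std::str::from_utf8(t).is_err());
        }
        // ...but concatenated they form the valid UTF-8 encoding of '中'.
        let joined: Vec<u8> = token_bytes.concat();
        assert_eq!(std::str::from_utf8(&joined), Ok("中"));
    }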

@KerfuffleV2
Contributor

Does it work with the old formats and only produce garbage when using a GGJT model, or is it always garbage?

@iacore
Contributor Author

iacore commented Apr 7, 2023

I haven't changed the behavior of the old formats; only GGJT produces garbage.

I don't have an older model. Can you try this branch on GGMF models?

@KerfuffleV2
Contributor

ggml

Fine.

ggmf

thread 'main' panicked at 'Could not load model: TensorWrongSize { tensor_name: "tok_embeddings.weight", path: "blah.bin" }', llama-cli/src/main.rs:206:6

(I don't know if it's actually supposed to work here or not.)

ggjt

Garbage.

@KerfuffleV2
Contributor

Was that last force push just rebasing on the current version or does it involve changes to the loading that may fix stuff that previously didn't work?

@iacore
Contributor Author

iacore commented Apr 7, 2023

Was that last force push just rebasing on the current version or does it involve changes to the loading that may fix stuff that previously didn't work?

Just a rebase.

Can you share a small GGMF model that I can use for testing?

@KerfuffleV2
Contributor

I think this one is GGMF: https://huggingface.co/Sosaka/Alpaca-native-4bit-ggml/

(Don't know if it's a problem to post something like that here, if so let me know and I'll edit it out after iacore sees it.)

@iacore
Contributor Author

iacore commented Apr 7, 2023

That one is GGML (i.e. unversioned), not GGMF.

@KerfuffleV2
Contributor

Sorry, my mistake. Unfortunately, I don't really have a reasonable way to share huge files.

I've been trying to figure out what the issue with GGJT is. One thing I can say is I don't think your logic for finding the tensors/lengths/offsets has a problem.

I added some printfs to llama.cpp and the corresponding ones to llama-rs.

Load: tok_embeddings.weight, offset=432672, size=102403200
Load: norm.weight, offset=102835904, size=20480
Load: output.weight, offset=102856448, size=102403200
[...]
Load: layers.39.feed_forward.w2.weight, offset=8048282880, size=44236800
Load: layers.39.feed_forward.w3.weight, offset=8092519744, size=44236800
Load: layers.39.ffn_norm.weight, offset=8136756608, size=20480

Absolutely no difference in output between the C++ and Rust versions.

@iacore
Contributor Author

iacore commented Apr 7, 2023

Still no clue.

@KerfuffleV2
Contributor

Okay, so I got it actually running inference on a GGJT model. However, what I had to do makes the mmap part pointless.

I believe the problem has something to do with the no_alloc context parameter. In the llama.cpp change that added the format, they set no_alloc to true for the main context and then reduce the context size a lot, so that ggml doesn't allocate memory for the actual tensors.

However, we're still doing the old context size calculation. I tried making the Context::init function take a bool for no_alloc and setting it, but I just got a segfault immediately.

Anyway, in loader.rs:load_weights_ggjt just change:

tensor.set_data(ptr as *mut std::ffi::c_void);

to

ptr.copy_to_nonoverlapping(tensor.data() as *mut u8, tensor.nbytes());

With that, it runs just fine on the GGJT model. Loading speed seems normal compared to the current version.
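For reference, roughly what that ends up looking like (a sketch only; the helper function and its signature are made up here, not the exact code in the branch):

    // Illustrative only: copy a tensor's bytes out of the mmap into the buffer
    // ggml already allocated, rather than repointing the tensor with set_data().
    unsafe fn copy_tensor_from_mmap(mmap: &[u8], data_offset: usize, tensor: &ggml::Tensor) {
        // GGJT aligns each tensor's data to a 32-byte boundary.
        let aligned = (data_offset + 31) & !31;
        let src = mmap[aligned..aligned + tensor.nbytes()].as_ptr();
        src.copy_to_nonoverlapping(tensor.data() as *mut u8, tensor.nbytes());
    }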

Obviously it's silly to mmap and then just copy the data immediately. What I'd recommend is ditching mmap for now and simply reading into the tensor data instead.

Then later on it will be possible to add mmap support as a separate thing which could work in a general way for the other formats too.

By the way, you probably need to run clippy on your changes. It's very unhappy right now!

@philpax
Collaborator

philpax commented Apr 7, 2023

Yeah, I'd be happy with not supporting mmap right now. We can figure out what that's meant to look like once we have support for all the model types working.

@KerfuffleV2
Contributor

KerfuffleV2 commented Apr 7, 2023

The approach I'd go for, if I were writing it, would be a general type that describes each tensor: its type, where it is, its dimensions, etc. Something similar to this: https://github.com/KerfuffleV2/smolrsrwkv/blob/182cd3205b7a7c95571a09bcfbb954b0041e4f90/smolrwkv/src/loader.rs#L15

Then the format-specific code just scans through the file for metadata and builds a list of those descriptors, and generic loader code loads tensors based on that: it could use file reads, mmap, whatever. Edit: it could also convert from something like SafeTensors, PyTorch, etc. If some data actually needs converting, that could be described in the structure too, and the conversion process wouldn't have to care about low-level details like GGJT vs GGML, just "I have a tensor of type X, but I need Y".

I think that approach would make dealing with different file formats much easier.
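As a rough illustration of what I mean (the struct, field, and trait names here are made up for the example, not an existing llama-rs API):

    // Hypothetical format-agnostic tensor descriptor; names are illustrative only.
    pub enum TensorDType {
        F32,
        F16,
        Q4_0,
        Q4_1,
    }

    pub struct TensorDescriptor {
        /// Tensor name, e.g. "tok_embeddings.weight".
        pub name: String,
        /// Shape, in ggml's dimension order.
        pub dims: Vec<usize>,
        /// Element type as stored in the file.
        pub dtype: TensorDType,
        /// Byte offset of the tensor data within the file.
        pub offset: u64,
        /// Size of the tensor data in bytes.
        pub n_bytes: usize,
    }

    /// Each file format (GGML, GGMF, GGJT, ...) only has to produce descriptors;
    /// a single generic loader then reads, mmaps, or converts as needed.
    pub trait TensorSource {
        fn descriptors(&self) -> std::io::Result<Vec<TensorDescriptor>>;
    }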

@philpax
Collaborator

philpax commented Apr 7, 2023

I'd be OK with that. It'd also help with #117 / #84.

@iacore iacore marked this pull request as ready for review April 7, 2023 19:25
@iacore
Contributor Author

iacore commented Apr 7, 2023

Got Vicuna working here. The loading speed is unfortunate.

I can use io_uring to make this faster, but that's more code.

maybe copying from mmap-ed memory is faster?

ship it 🚀

@KerfuffleV2
Contributor

KerfuffleV2 commented Apr 7, 2023

@iacore I was also experimenting with cleaning it up. It does seem like reading is much slower than the copy-from-mmap approach. I don't know why.

My change looks like this:

        // Round the current offset up to the next 32-byte boundary (GGJT aligns tensor data).
        let offset_curr = reader.stream_position()?;
        let offset_aligned: u64 = (offset_curr + 31) & !31;
        reader.seek_relative((offset_aligned - offset_curr) as i64)?;

        // View the tensor's ggml-allocated buffer as a byte slice and read directly into it.
        let td =
            unsafe { std::slice::from_raw_parts_mut(tensor.data() as *mut u8, tensor.nbytes()) };
        reader.read_exact(td)?;
        total_loaded_bytes += tensor.nbytes();
So, uhh... I guess maybe just keep mmap?

edit:

I can use io_uring to make this faster, but that's more code.

We also probably don't want OS-specific optimizations.


I also experimented with making the BufReader capacity really big (up to 2GB) and it didn't seem to help the reading speed.

@iacore
Contributor Author

iacore commented Apr 7, 2023

For me, sequential read is faster than mmap.

I've made a branch, ggjt-variant-copy-mmap, with that solution.

@KerfuffleV2
Contributor

Are you saying you tried the copy_to_nonoverlapping version and it was slower than the current version that changed to BufReader?

@iacore
Contributor Author

iacore commented Apr 7, 2023

Are you saying you tried the copy_to_nonoverlapping version and it was slower than the current version that changed to BufReader?

yes

@KerfuffleV2
Contributor

Ahh, why can't life be simple for once?

What OS are you using, out of curiosity?

@iacore
Contributor Author

iacore commented Apr 7, 2023

Linux.

I think this branch is done. I'll not touch it again.

@KerfuffleV2
Contributor

I'm on Linux as well.

Sorry for the confusion. I've been trying both versions and I don't get consistent results. Not sure what's going on, but I don't think there's a problem with the current approach.

I'll not touch it again.

How can you say that when Clippy is still sad?

You can probably just nuke the set_data method; it doesn't seem like there's even a way to use it successfully at the moment.

@iacore
Contributor Author

iacore commented Apr 7, 2023

You can probably just nuke the set_data method; it doesn't seem like there's even a way to use it successfully at the moment.

it probably will be useful in the future?

@KerfuffleV2
Contributor

it probably will be useful in the future?

We'd have to figure out how to actually use it in the future, though. I don't think it can currently work at all, since you aren't even able to turn off no_alloc when creating a context, so memory for the tensors will always get allocated no matter what.

You'd think it would still be possible to point the tensor at a different chunk of memory, but that didn't actually work: otherwise your first approach would have had no issues.

So my line of thinking is: it's only a couple of lines of code to wrap a ggml function, so it wouldn't be hard to add back later if it were actually needed and could be used, but it's currently non-functional, so it may as well be removed.

Just to be clear, this is just the opinion of some other random person on the internet, so take it for what it's worth; I have no authority here.

@jon-chuang
Contributor

I also got this working with the new ggjt file format.

@jon-chuang
Contributor

jon-chuang commented Apr 12, 2023

Btw, from https://justine.lol/mmap/

Remember that progress bar which made you wait for weights to load each time you ran the command? We got rid of that. Linux users should expect a 100x improvement in load time. Windows and MacOS users should expect a 10x improvement. What this means is that tokens will start being produced effectively instantaneously when you run LLaMA, almost providing a similar UX to ChatGPT on the shell. It's important to note these improvements are due to an amortized cost. The first time you load a model after rebooting your computer, it's still going to go slow, because it has to load the weights from disk. However each time it's loaded afterwards, it should be fast (at least until memory pressure causes your file cache to be evicted).

The speedup is meant to apply to subsequent loads. Did you guys check that, or just the initial load?

Btw, you may need to configure some settings, like read_advise in the mmap2 library, to get better prefetching.

The default mmap way of reading is to trigger page faults.
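Something along these lines might help, assuming the memmap2 crate the loader already uses (method and variant names should be double-checked against the crate version; this is just a sketch):

    use memmap2::{Advice, Mmap};
    use std::fs::File;

    fn map_model(path: &str) -> std::io::Result<Mmap> {
        let file = File::open(path)?;
        // Safety: the file must not be truncated or modified while mapped.
        let mmap = unsafe { Mmap::map(&file)? };
        // Hint that the mapping will be read roughly sequentially and will be
        // needed soon, so the kernel can prefetch instead of faulting per page.
        mmap.advise(Advice::Sequential)?;
        mmap.advise(Advice::WillNeed)?;
        Ok(mmap)
    }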

@jon-chuang
Contributor

Another thing to check: multiple processes utilizing the same mmaped file:

More Processes
You can now run multiple LLaMA processes simultaneously on your computer. Here's a video of Georgi having a conversation with four chatbots powered by four independent llama.cpp processes running on the same Mac. So llama.cpp is not only going to be a better friend to you, it can also serve as your artificial circle of friends too. The trick that makes it possible is mmap() lets us map the read-only weights using MAP_SHARED, which is the same technique that's traditionally been used for loading executable software. So we figured, why aren't we using it to load neural network software too? Now we can.

@iacore
Contributor Author

iacore commented Apr 12, 2023

Superseded by #125.

@iacore iacore closed this Apr 12, 2023