Support the new mmap-able ggml format #93
Comments
@philpax would you be interested if I added proper support in huggingface/safetensors#197? The format allows for …
Hm, don't see why not! It'll depend on model availability, but I imagine we'll start seeing models released in ST format.
FYI: note for …
Seems like some crazy misinformation to me. I've never even seen a multipart GGML file. The whole dataset is also needed for processing each token, so you can't practically use models larger than memory, because that would require repeatedly loading the data from disk. "100x faster" — maybe if the entire thing is already in the buffer cache, but that's only possible when you have enough memory available to load the whole model. mmap is great and some overhead can (sometimes) be avoided when using it, but it's not magic.

Also, does anyone know exactly how the file format changed? Specifically, what is different between the previous version and the current one? Looking at the conversion script isn't that helpful since it just rewrites everything.
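For what it's worth, here's a minimal sketch of what mapping a model file looks like in Rust, assuming the memmap2 crate (not necessarily what llama-rs would actually use). The point is just that the mapping is backed by the OS page cache, same as a normal read:

```rust
use std::fs::File;

use memmap2::Mmap;

fn main() -> std::io::Result<()> {
    // Open the model file read-only; the path is just a placeholder.
    let file = File::open("ggml-model-q4_0.bin")?;

    // Map the whole file into the address space. Pages are backed by the OS
    // page cache, so nothing is pulled from disk until it's actually touched,
    // and a warm cache is shared with any other process mapping the same file.
    let mmap = unsafe { Mmap::map(&file)? };

    println!("mapped {} bytes", mmap.len());
    Ok(())
}
```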
Converting any of the multipart .pths will result in multipart GGMLs.
Yes, I think the primary benefit is for repeated runs, so that the model remains resident across executions.
My understanding is that it's laying out the data so that it matches the layout in memory. See #114.
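To illustrate what that buys you: if the on-disk layout matches the in-memory layout, a tensor can in principle be a zero-copy view into the mapping rather than something you deserialize. A rough sketch of the idea, with hypothetical offsets and only a minimal alignment check:

```rust
/// Interpret a byte region of a mapped file as a slice of f32, without copying.
/// Returns None if the region is out of bounds or misaligned; a real loader
/// would have to guarantee alignment when the file is written/padded.
fn tensor_view(mapped: &[u8], offset: usize, n_elems: usize) -> Option<&[f32]> {
    let bytes = mapped.get(offset..offset + n_elems * 4)?;
    if bytes.as_ptr() as usize % std::mem::align_of::<f32>() != 0 {
        return None;
    }
    // Safety: the region is in bounds, correctly aligned, and f32 has no
    // invalid bit patterns, so reinterpreting the bytes is sound.
    Some(unsafe { std::slice::from_raw_parts(bytes.as_ptr() as *const f32, n_elems) })
}
```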
I see. It may make a bigger difference in that case, but they also could have just changed the conversion process to make stuff contiguous without having to mess with the final file format.
Well, it's just the OS buffer cache. The OS will cache stuff whether you're using mmap or just reading it normally. mmapping may avoid some data copying, but loading the model was already very fast when the cache was hot.
Nice, someone else is dealing with it! Although, right now the new format part is just a …
I think that's what they did - they just discovered that you can't go all the way without changing the format. The post has some details on this (something about the tensors being interleaved), but I just skimmed over it (I'm out right now).
Yeah. The main benefit is that you aren't pointlessly copying memory, which means the memory traffic is much lower (the cached pages can be used without needing to copy them to other pages). Speed is also relative: it's slower on Windows than it is on macOS M1.
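To make the "pointlessly copying" part concrete, here's a hedged comparison of the two loading paths (again assuming memmap2; this is a sketch, not how llama-rs does it today):

```rust
use std::fs::File;
use std::io::Read;

use memmap2::Mmap;

fn load_by_reading(path: &str) -> std::io::Result<Vec<u8>> {
    // Every byte is copied out of the page cache into this heap buffer, so
    // even a hot cache costs a full pass of memory traffic plus the allocation.
    let mut buf = Vec::new();
    File::open(path)?.read_to_end(&mut buf)?;
    Ok(buf)
}

fn load_by_mapping(path: &str) -> std::io::Result<Mmap> {
    // No copy: the mapping aliases the page cache, so pages that are already
    // resident are used as-is.
    let file = File::open(path)?;
    unsafe { Mmap::map(&file) }
}
```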
I think that's been addressed now 😅
That doesn't make sense to me, since there were already 13B and 30B parameter single file GGML models. So the format had to be able to handle that. If multipart models got converted in a way that made dealing with them inefficient, it could have been changed on the converter side.
It is something that just happens once at startup though. I don't notice a huge difference in load times between llama-rs and llama.cpp and it only took a couple seconds even for a 13B model.
Haha, my comment is a whole 15 minutes out of date it seems! That's amazing though, nice work @iacore!
Yeah, I don't know. I haven't been following it that closely - I've just been trying to figure out what we need to get it working here.
hello hello
Done thanks to #125 🎉
Justine's managed to land her mad-lass mmap-optimised format into llama.cpp. We should support loading this format in - and if we're smart about it, we should also support mmap-ing it in. This should hopefully be easier to do in Rust than it is in C++!
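As a starting point, a rough sketch of sniffing which container a file uses before loading it. The magic constants are my assumption of what llama.cpp writes (0x67676d6c, "ggml", for the legacy files and 0x67676a74, "ggjt", for the new mmap-friendly one, stored as a little-endian u32); verify them against the actual loader before relying on this:

```rust
use std::fs::File;
use std::io::Read;

// Assumed magic values; double-check against llama.cpp's loader.
const MAGIC_GGML: u32 = 0x6767_6d6c; // legacy format, not mmap-friendly
const MAGIC_GGJT: u32 = 0x6767_6a74; // new format from the llama.cpp change

/// Read the first four bytes and classify the file. Assumes the magic was
/// written as a little-endian u32, which is what an x86/ARM writer produces.
fn sniff_format(path: &str) -> std::io::Result<&'static str> {
    let mut magic = [0u8; 4];
    File::open(path)?.read_exact(&mut magic)?;
    Ok(match u32::from_le_bytes(magic) {
        MAGIC_GGML => "ggml (legacy)",
        MAGIC_GGJT => "ggjt (mmap-able)",
        _ => "unknown",
    })
}
```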