This repository has been archived by the owner on Jun 24, 2024. It is now read-only.

Support the new mmap-able ggml format #93

Closed
philpax opened this issue Mar 30, 2023 · 12 comments
Labels: issue:enhancement (New feature or request)
Milestone: 0.1

Comments

philpax (Collaborator) commented Mar 30, 2023

Justine's managed to land her mad-lass mmap-optimised format in llama.cpp. We should support loading this format - and if we're smart about it, we should also support mmapping it. This should hopefully be easier to do in Rust than it is in C++!
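
For reference, mapping the file on the Rust side only takes a few lines (a rough sketch, assuming the memmap2 crate; the path is a placeholder):

```rust
// Minimal sketch of mmapping a model file, assuming the `memmap2` crate;
// the path is a placeholder. The mapping derefs to `&[u8]`, backed by the
// OS page cache rather than a heap copy.
use std::fs::File;

use memmap2::Mmap;

fn main() -> std::io::Result<()> {
    let file = File::open("models/7B/ggml-model-q4_0.bin")?;
    // Safety: the file must not be truncated or modified while mapped.
    let mmap = unsafe { Mmap::map(&file)? };

    let bytes: &[u8] = &mmap;
    println!("mapped {} bytes", bytes.len());
    Ok(())
}
```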

philpax (Collaborator, Author) commented Apr 4, 2023

philpax mentioned this issue Apr 5, 2023
Narsil commented Apr 5, 2023

@philpax would you be interested if I added proper support in huggingface/safetensors#197?

The format allows for mmap (or not; either way, it currently aligns buffers for zero-copy loads), and it is a "pure" Rust format (which is also readable from Python).
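
Roughly the zero-copy path this enables (a sketch, assuming the safetensors and memmap2 crates; the file name is a placeholder):

```rust
// Sketch of the zero-copy path: mmap the file, then let `safetensors`
// hand out views that borrow directly from the mapping. Assumes the
// `safetensors` and `memmap2` crates; the file name is a placeholder.
use memmap2::Mmap;
use safetensors::SafeTensors;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let file = std::fs::File::open("model.safetensors")?;
    // Safety: the file must not be truncated or modified while mapped.
    let mmap = unsafe { Mmap::map(&file)? };

    // Only the JSON header is parsed; tensor bytes stay in the mapping.
    let tensors = SafeTensors::deserialize(&mmap)?;
    for name in tensors.names() {
        let view = tensors.tensor(name)?;
        // `data()` borrows from the mmapped file: no copy into the heap.
        let bytes: &[u8] = view.data();
        println!("{name}: {:?} {:?} ({} bytes)", view.dtype(), view.shape(), bytes.len());
    }
    Ok(())
}
```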

philpax (Collaborator, Author) commented Apr 5, 2023

Hm, don't see why not! It'll depend on model availability, but I imagine we'll start seeing models released in safetensors format.

katopz (Contributor) commented Apr 6, 2023

FYI: note for mmap https://justine.lol/mmap/

KerfuffleV2 (Contributor) commented Apr 6, 2023

> FYI: note for mmap https://justine.lol/mmap/

Seems like some crazy misinformation to me. I've never even seen a multipart GGML file. The whole model is also needed for processing each token, so you can't practically use models larger than memory: every token would require re-reading the data from disk.

"100x faster": maybe if the entire thing is already in the buffer cache, but that's only possible when you have enough memory to hold the whole model.

mmap is great, and some overhead can (sometimes) be avoided when using it, but it's not magic.

Also, does anyone know exactly how the file format changed? Specifically, what is different between the previous version and the current one? Looking at the conversion script isn't that helpful, since it just rewrites everything.

philpax (Collaborator, Author) commented Apr 6, 2023

> I've never even seen a multipart GGML file.

Converting any of the multipart .pths will result in multipart GGMLs.

> "100x faster": maybe if the entire thing is already in the buffer cache, but that's only possible when you have enough memory to hold the whole model.

Yes, I think the primary benefit is in repeated executions, so that the model remains resident across runs.

> Also, does anyone know exactly how the file format changed?

My understanding is that it's laying out the data so that it matches the layout in memory. See #114.
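
As far as I can tell (my reading, not a spec: the `ggjt` magic and the 32-byte alignment below are assumptions), the point is that each tensor's data offset is padded so the tensor can be used in place from the mapping:

```rust
// Sketch of the layout change as I understand it: pad each tensor's data
// offset to an alignment boundary so the tensor can be used in place from
// the mmapped file. The 32-byte alignment is an assumption, not a spec.
const GGJT_MAGIC: u32 = 0x6767_6a74; // "ggjt"
const TENSOR_ALIGNMENT: u64 = 32;

/// Round a file offset up to the next tensor-data alignment boundary.
fn align_offset(offset: u64) -> u64 {
    (offset + TENSOR_ALIGNMENT - 1) / TENSOR_ALIGNMENT * TENSOR_ALIGNMENT
}

fn main() {
    assert_eq!(align_offset(0), 0);
    assert_eq!(align_offset(1), 32);
    assert_eq!(align_offset(64), 64);
    println!("ggjt magic: {GGJT_MAGIC:#010x}");
}
```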

KerfuffleV2 (Contributor) commented Apr 6, 2023

> Converting any of the multipart .pths will result in multipart GGMLs.

I see. It may make a bigger difference in that case, but they also could have just changed the conversion process to make stuff contiguous without having to mess with the final file format.

> Yes, I think the primary benefit is in repeated executions, so that the model remains resident across runs.

Well, it's just the OS buffer cache. The OS will cache the file whether you're using mmap or just reading it normally. mmapping may avoid some copying of data, but loading the model was already very fast when the cache was hot.
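
For anyone who wants to check on their own machine, a rough timing sketch (again assuming the memmap2 crate; the path is a placeholder):

```rust
// Rough comparison of a plain read vs an mmap touch of the same file.
// On a warm page cache both are fast; mmap mainly skips the copy into
// the process heap. Assumes the `memmap2` crate; the path is a placeholder.
use std::time::Instant;

use memmap2::Mmap;

fn main() -> std::io::Result<()> {
    let path = "models/7B/ggml-model-q4_0.bin";

    let t = Instant::now();
    let buf = std::fs::read(path)?; // copies the file contents into the heap
    println!("read: {} bytes in {:?}", buf.len(), t.elapsed());

    let t = Instant::now();
    let file = std::fs::File::open(path)?;
    let mmap = unsafe { Mmap::map(&file)? };
    // Touch one byte per page so the timing reflects actual page faults.
    let mut checksum = 0u8;
    for i in (0..mmap.len()).step_by(4096) {
        checksum ^= mmap[i];
    }
    println!("mmap: {} bytes in {:?} (checksum {checksum})", mmap.len(), t.elapsed());
    Ok(())
}
```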

> My understanding is that it's laying out the data so that it matches the layout in memory. See #114.

Nice, someone else is dealing with it! Although right now the new-format code path is just a `todo!("new format here")`, so that doesn't really help with understanding the change at present.

philpax (Collaborator, Author) commented Apr 6, 2023

> It may make a bigger difference in that case, but they also could have just changed the conversion process to make stuff contiguous without having to mess with the final file format.

I think that's what they did - they just discovered that you can't go all the way without changing the format. The post has some details on this (something about the tensors being interleaved), but I only skimmed it (I'm out right now).

> Well, it's just the OS buffer cache. The OS will cache the file whether you're using mmap or just reading it normally. mmapping may avoid some copying of data, but loading the model was already very fast when the cache was hot.

Yeah. The main benefit is that you aren't pointlessly copying memory, which means memory traffic is much lower (the cached pages can be used directly instead of being copied into other pages).

Speed is also relative: it's slower on Windows than it is on an M1 Mac.

> Although right now the new-format code path is just a `todo!("new format here")`, so that doesn't really help with understanding the change at present.

I think that's been addressed now 😅

KerfuffleV2 (Contributor) commented Apr 6, 2023

> I think that's what they did - they just discovered that you can't go all the way without changing the format.

That doesn't make sense to me, since there were already single-file 13B and 30B GGML models, so the format had to be able to handle that. If multipart models got converted in a way that made dealing with them inefficient, that could have been changed on the converter side.

> The main benefit is that you aren't pointlessly copying memory, which means memory traffic is much lower

It's something that only happens once at startup, though. I don't notice a huge difference in load times between llama-rs and llama.cpp, and it only takes a couple of seconds even for a 13B model.

> I think that's been addressed now

Haha, my comment is a whole 15 minutes out of date it seems!

That's amazing though, nice work @iacore!

philpax (Collaborator, Author) commented Apr 6, 2023

Yeah, I don't know. I haven't been following it that closely - I've just been trying to figure out what we need to get it working here.

iacore (Contributor) commented Apr 6, 2023

hello hello

philpax added this to the 0.1 milestone Apr 10, 2023
philpax (Collaborator, Author) commented Apr 23, 2023

Done thanks to #125 🎉

philpax closed this as completed Apr 23, 2023