Support the new mmap-able ggml format #93
Comments
@philpax would you be interested if I added proper support in huggingface/safetensors#197? The format allows for …
Hm, don't see why not! It'll depend on model availability, but I imagine we'll start seeing models released in ST format.
FYI: note for …
Seems like some crazy misinformation to me. I've never even seen a multipart GGML file. The whole dataset is also needed for processing each token, so you can't practically use models larger than memory, because that would require repeatedly loading the data from disk. "100x faster" — maybe if the entire thing is already in the buffer cache, but that's only possible when you have enough memory available to load the whole model. mmap is great and some overhead can (sometimes) be avoided when using it, but it's not magic.

Also, does anyone know exactly how the file format changed? Specifically, what is different between the previous version and the current one? Looking at the conversion script isn't that helpful since it just rewrites everything.
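For what it's worth, here's a minimal sketch of what mapping a model file looks like in Rust, assuming the memmap2 crate (not necessarily what llama-rs would actually use). The point is just that the mapping is backed by the OS page cache, same as a normal read:

```rust
use std::fs::File;

use memmap2::Mmap;

fn main() -> std::io::Result<()> {
    // Open the model file read-only; the path is just a placeholder.
    let file = File::open("ggml-model-q4_0.bin")?;

    // Map the whole file into the address space. Pages are backed by the OS
    // page cache, so nothing is pulled from disk until it's actually touched,
    // and a warm cache is shared with any other process mapping the same file.
    let mmap = unsafe { Mmap::map(&file)? };

    println!("mapped {} bytes", mmap.len());
    Ok(())
}
```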
Converting any of the multipart .pths will result in multipart GGMLs.
Yes, I think the primary benefit is for repeated runs, so that the model remains resident across executions.
My understanding is that it's laying out the data so that it matches the layout in memory. See #114.
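To illustrate what that buys you: if the on-disk layout matches the in-memory layout, a tensor can in principle be a zero-copy view into the mapping rather than something you deserialize. A rough sketch of the idea, with hypothetical offsets and only a minimal alignment check:

```rust
/// Interpret a byte region of a mapped file as a slice of f32, without copying.
/// Returns None if the region is out of bounds or misaligned; a real loader
/// would have to guarantee alignment when the file is written/padded.
fn tensor_view(mapped: &[u8], offset: usize, n_elems: usize) -> Option<&[f32]> {
    let bytes = mapped.get(offset..offset + n_elems * 4)?;
    if bytes.as_ptr() as usize % std::mem::align_of::<f32>() != 0 {
        return None;
    }
    // Safety: the region is in bounds, correctly aligned, and f32 has no
    // invalid bit patterns, so reinterpreting the bytes is sound.
    Some(unsafe { std::slice::from_raw_parts(bytes.as_ptr() as *const f32, n_elems) })
}
```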
I see. It may make a bigger difference in that case, but they also could have just changed the conversion process to make stuff contiguous without having to mess with the final file format.
Well, it's just the OS buffer cache. The OS will cache stuff whether you're using mmap or just reading it normally. mmapping may avoid some data copying, but loading the model was already very fast when the cache was hot.
Nice, someone else is dealing with it! Although, right now the new format part is just a …
I think that's what they did - they just discovered that you can't go all the way without changing the format. The post has some details on this (something about the tensors being interleaved), but I just skimmed over it (I'm out right now).
Yeah. The main benefit is that you aren't pointlessly copying memory, which means the memory traffic is much lower (the cached pages can be used without needing to copy them to other pages). Speed is also relative: it's slower on Windows than it is on macOS M1.
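To make the "pointlessly copying" part concrete, here's a hedged comparison of the two loading paths (again assuming memmap2; this is a sketch, not how llama-rs does it today):

```rust
use std::fs::File;
use std::io::Read;

use memmap2::Mmap;

fn load_by_reading(path: &str) -> std::io::Result<Vec<u8>> {
    // Every byte is copied out of the page cache into this heap buffer, so
    // even a hot cache costs a full pass of memory traffic plus the allocation.
    let mut buf = Vec::new();
    File::open(path)?.read_to_end(&mut buf)?;
    Ok(buf)
}

fn load_by_mapping(path: &str) -> std::io::Result<Mmap> {
    // No copy: the mapping aliases the page cache, so pages that are already
    // resident are used as-is.
    let file = File::open(path)?;
    unsafe { Mmap::map(&file) }
}
```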
I think that's been addressed now 😅
That doesn't make sense to me, since there were already 13B and 30B parameter single file GGML models. So the format had to be able to handle that. If multipart models got converted in a way that made dealing with them inefficient, it could have been changed on the converter side.
It is something that just happens once at startup though. I don't notice a huge difference in load times between llama-rs and llama.cpp and it only took a couple seconds even for a 13B model.
Haha, my comment is a whole 15 minutes out of date it seems! That's amazing though, nice work @iacore!
Yeah, I don't know. I haven't been following it that closely - I've just been trying to figure out what we need to get it working here.
hello hello
Done thanks to #125 🎉
Justine's managed to land her mad-lass mmap-optimised format into llama.cpp. We should support loading this format in - and if we're smart about it, we should also support mmap-ing it in. This should hopefully be easier to do in Rust than it is in C++!
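As a starting point, a rough sketch of sniffing which container a file uses before loading it. The magic constants are my assumption of what llama.cpp writes (0x67676d6c, "ggml", for the legacy files and 0x67676a74, "ggjt", for the new mmap-friendly one, stored as a little-endian u32); verify them against the actual loader before relying on this:

```rust
use std::fs::File;
use std::io::Read;

// Assumed magic values; double-check against llama.cpp's loader.
const MAGIC_GGML: u32 = 0x6767_6d6c; // legacy format, not mmap-friendly
const MAGIC_GGJT: u32 = 0x6767_6a74; // new format from the llama.cpp change

/// Read the first four bytes and classify the file. Assumes the magic was
/// written as a little-endian u32, which is what an x86/ARM writer produces.
fn sniff_format(path: &str) -> std::io::Result<&'static str> {
    let mut magic = [0u8; 4];
    File::open(path)?.read_exact(&mut magic)?;
    Ok(match u32::from_le_bytes(magic) {
        MAGIC_GGML => "ggml (legacy)",
        MAGIC_GGJT => "ggjt (mmap-able)",
        _ => "unknown",
    })
}
```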