Adding SqueezeLLM Support #3093
It seems this is an inference-only implementation? How can we quantize new/custom models with SqueezeLLM? |
Interesting work! Can you try to update the
python3 convert-llama-ggml-to-gguf.py --input models/llama-7b-v2/ggml-model-q4_sq.bin --output models/llama-7b-v2/ggml-model-q4_sq.gguf --squeezellm
* Using config: Namespace(input=PosixPath('models/llama-7b-v2/ggml-model-q4_sq.bin'), output=PosixPath('models/llama-7b-v2/ggml-model-q4_sq.gguf'), name=None, desc=None, gqa=1, eps='5.0e-06', context_length=2048, model_metadata_dir=None, vocab_dir=None, vocabtype='spm', squeezellm=True)
=== WARNING === Be aware that this conversion script is best-effort. Use a native GGUF model if possible. === WARNING ===
- Note: If converting LLaMA2, specifying "--eps 1e-5" is required. 70B models also need "--gqa 8".
* Scanning GGML input file
* File format: GGJTv1 with ftype MOSTLY_Q5_K_S
* GGML model hyperparameters: <Hyperparameters: n_vocab=32000, n_embd=4096, n_mult=256, n_head=32, n_layer=0, n_rot=128, n_ff=11008, ftype=MOSTLY_Q5_K_S>
=== WARNING === Special tokens may not be converted correctly. Use --model-metadata-dir if possible === WARNING ===
* Preparing to save GGUF file
* Adding model parameters and KV items
* Adding 32000 vocab item(s)
* Adding 291 tensor(s)
Traceback (most recent call last):
  File "/Users/ggerganov/development/github/llama.cpp/convert-llama-ggml-to-gguf.py", line 458, in <module>
    main()
  File "/Users/ggerganov/development/github/llama.cpp/convert-llama-ggml-to-gguf.py", line 454, in main
    converter.save()
  File "/Users/ggerganov/development/github/llama.cpp/convert-llama-ggml-to-gguf.py", line 252, in save
    self.add_tensors(gguf_writer)
  File "/Users/ggerganov/development/github/llama.cpp/convert-llama-ggml-to-gguf.py", line 360, in add_tensors
    assert mapped_name is not None, f'Bad name {name}'
           ^^^^^^^^^^^^^^^^^^^^^^^
AssertionError: Bad name layers.0.attention.wq.weight
Edit: Also, double check your "Model Size" column. These are the correct numbers in bytes:
ls -l models/llama-7b-v2/
total 197602496
-rw-r--r-- 1 ggerganov staff 13478104576 Aug 30 11:27 ggml-model-f16.gguf
-rw-r--r-- 1 ggerganov staff 26954272000 Aug 26 23:18 ggml-model-f32.gguf
-rw-r--r-- 1 ggerganov staff 2825940544 Aug 30 11:53 ggml-model-q2_k.gguf
-rw-r--r-- 1 ggerganov staff 3298004544 Aug 30 11:53 ggml-model-q3_k.gguf
-rw-r--r-- 1 ggerganov staff 2948304448 Sep 2 10:21 ggml-model-q3_k_s.gguf
-rw-r--r-- 1 ggerganov staff 3825806912 Aug 30 11:52 ggml-model-q4_0.gguf
-rw-r--r-- 1 ggerganov staff 4238749248 Aug 30 11:52 ggml-model-q4_1.gguf
-rw-r--r-- 1 ggerganov staff 4081004096 Aug 30 11:52 ggml-model-q4_k.gguf
-rw-r--r-- 1 ggerganov staff 3856739904 Sep 2 10:21 ggml-model-q4_k_s.gguf
-rw-r--r-- 1 ggerganov staff 3807322752 Sep 9 11:38 ggml-model-q4_sq.bin
-rw-r--r-- 1 ggerganov staff 4651691584 Aug 30 11:52 ggml-model-q5_0.gguf
-rw-r--r-- 1 ggerganov staff 5064633920 Aug 30 11:52 ggml-model-q5_1.gguf
-rw-r--r-- 1 ggerganov staff 4783156800 Aug 30 11:52 ggml-model-q5_k.gguf
-rw-r--r-- 1 ggerganov staff 4651691584 Sep 2 10:20 ggml-model-q5_k_s.gguf
-rw-r--r-- 1 ggerganov staff 5529194048 Aug 30 11:51 ggml-model-q6_k.gguf
-rw-r--r-- 1 ggerganov staff 7161089600 Aug 30 11:51 ggml-model-q8_0.gguf
Divide by |
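The divisor in the comment above did not survive in this copy, so the following is only a small sketch of how the byte counts from the listing could be turned into a "Model Size" column. It shows both decimal gigabytes (10^9 bytes) and binary gibibytes (2^30 bytes); which unit the table is meant to use is an assumption left open here. The helper names `to_gb` and `to_gib` are made up for this example.

```python
# File sizes in bytes, copied from a subset of the `ls -l` listing above.
SIZES_BYTES = {
    "f16": 13478104576,
    "q4_0": 3825806912,
    "q4_k_s": 3856739904,
    "q4_sq": 3807322752,
    "q8_0": 7161089600,
}


def to_gb(n_bytes: int) -> float:
    """Decimal gigabytes (1 GB = 10**9 bytes)."""
    return n_bytes / 1e9


def to_gib(n_bytes: int) -> float:
    """Binary gibibytes (1 GiB = 2**30 bytes)."""
    return n_bytes / 2**30


# Print both units side by side so the table can use whichever is intended.
for name, size in SIZES_BYTES.items():
    print(f"{name:8s} {to_gb(size):7.2f} GB  {to_gib(size):7.2f} GiB")
```

The two units differ by about 7%, which is enough to change the first decimal place for every model in the listing.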
I think this is because the name doesn't need to be mapped and the name mapping stuff doesn't support an identity operation. We should probably fix that, it would be an easy change. edit: Fixed it with #3095 |
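As an illustration of the "identity operation" idea, here is a minimal sketch of a name lookup that passes through names that are already valid targets. It is not the converter's actual mapping code and not the change made in #3095; `NAME_MAP`, `KNOWN_TARGETS`, and `map_name` are hypothetical, and the table entries are illustrative only.

```python
from typing import Optional

# Hypothetical mapping table: source tensor name -> target tensor name.
# The entries are illustrative and not taken from the real converter.
NAME_MAP = {
    "tok_embeddings.weight": "token_embd.weight",
    "norm.weight": "output_norm.weight",
}

# Names that are already valid on the target side.
KNOWN_TARGETS = set(NAME_MAP.values())


def map_name(name: str) -> Optional[str]:
    """Map a source name to a target name, allowing an identity mapping."""
    if name in NAME_MAP:
        return NAME_MAP[name]
    if name in KNOWN_TARGETS:
        # Identity case: the name is already in target form, so keep it
        # instead of reporting it as a bad name.
        return name
    return None  # Genuinely unknown name; the caller can still assert on this.


print(map_name("tok_embeddings.weight"))
print(map_name("token_embd.weight"))
print(map_name("not.a.tensor"))
```

With a fallback like this, an assertion such as `assert mapped_name is not None` only fires for names that are unknown on both sides.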
Thank you for trying this out! I fixed the steps listed in the comment above. You also need to copy the config.json file and the tokenizer file before running the first conversion step ( |
Is there a reason to include a way to convert to GGML? As far as I know, there isn't really a use case for creating new GGML format files so you can probably make your life easier by just not having to worry about that. |
Adding SqueezeLLM to
It would also be great if the table were updated with the actual perplexity values for the various existing quantization types listed in the table. For instance, the current LLaMA-v1-7B perplexity for
As the provided SqueezeLLM perplexity is for LLaMA-v1-7B, I was curious to see how this approach would perform for LLaMA-v2-7B. I followed the instructions to create the GGUF model and ran the perplexity tool. The calculation is very slow (6+ hours on my M2 Max), so I stopped after 344 batches. At that point, the SqueezeLLM perplexity was higher than
On the bright side, I'm noticing that the provided model does not quantize the
Update: I downloaded a quantized model from https://huggingface.co/TheBloke/LLaMa-7B-GGML/ and I see that, indeed, in these models |
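For readers unfamiliar with the metric being compared, perplexity is the exponential of the mean negative log-likelihood per token. The sketch below shows only that formula; it assumes the per-token natural-log probabilities are already available and does not reproduce how the perplexity tool batches the text. The function name and the input format are assumptions made for this example.

```python
import math
from typing import Sequence


def perplexity(token_log_probs: Sequence[float]) -> float:
    """exp(mean negative log-likelihood), using natural-log probabilities."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# A model that assigns probability 1/4 to every observed token is as
# uncertain as a uniform choice among four options.
example = [math.log(0.25)] * 8
print(perplexity(example))
```

Lower is better, and because the value is a running average, a partial run such as the one stopped after 344 batches gives an estimate rather than the final figure.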
Thank you for the feedback, I've updated the file conversion code to convert directly to |
Thank you for pointing this out! I've updated the table to match the updated perplexities for the existing quantization methods. I've also added a row to correspond to quantizing the input and output embeddings to Q8_0. I'll test quantizing the input embedding to Q4_0 to see if we can also do this with minimal degradation. |
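To make the Q8_0 embedding option concrete, here is a rough sketch of block-wise 8-bit quantization in the style of Q8_0: blocks of 32 values, one scale per block, and signed 8-bit integers. It follows my understanding of the ggml layout and is not the code in this PR; in particular it keeps the scale as a Python float rather than a 16-bit float, and the function names are hypothetical.

```python
from typing import List, Tuple

BLOCK_SIZE = 32  # Assumed block length for Q8_0-style quantization.


def quantize_block(values: List[float]) -> Tuple[float, List[int]]:
    """Quantize one block to (scale, int8 values) with scale = max|x| / 127."""
    if len(values) != BLOCK_SIZE:
        raise ValueError(f"expected {BLOCK_SIZE} values, got {len(values)}")
    amax = max(abs(v) for v in values)
    scale = amax / 127.0
    if scale == 0.0:
        return 0.0, [0] * BLOCK_SIZE  # All-zero block.
    # Python's round() rounds halves to even, which may differ slightly
    # from the rounding used by a C implementation.
    return scale, [round(v / scale) for v in values]


def dequantize_block(scale: float, quants: List[int]) -> List[float]:
    """Reconstruct approximate values from a quantized block."""
    return [scale * q for q in quants]


block = [(i - 16) / 100.0 for i in range(BLOCK_SIZE)]
scale, quants = quantize_block(block)
restored = dequantize_block(scale, quants)
print(max(abs(a - b) for a, b in zip(block, restored)))
```

The per-value reconstruction error is bounded by half the scale, which is why 8-bit embeddings are a plausible candidate for "minimal degradation" while a 4-bit variant needs to be tested separately.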
@ggerganov @KerfuffleV2 @ikawrakow Just following up to see whether there is any additional feedback or whether the updates look good. Please let me know what else we need to do to integrate this! |
I'm hesitant to post stuff in pulls that isn't really contributing since it spams people's notifications but I just wanted to say I'm not ignoring the @ . I'm just not really a person with the ability/authority to decide whether it gets accepted. Unfortunately, I also haven't really had a chance to look too closely at this yet so I don't have any other feedback to add right now. edit: I think I can make the checks run for you though. |
I'm hesitant to integrate this for a few reasons:
For now, we should keep this branch as a PoC |
SqueezeLLM Support
This PR adds support for the SqueezeLLM quantization method, which is described in the following preprint: https://arxiv.org/abs/2306.07629, and which has open-source GPU inference code available at: https://github.com/SqueezeAILab/SqueezeLLM. SqueezeLLM is a post-training quantization framework that allows for high-accuracy and runtime-efficient quantization at low bit precision.
This PR contains the inference code to run the 4-bit dense-only non-uniform quantization scheme outlined in the preprint, as well as the code required to convert the Hugging Face (PyTorch) checkpoints to the binary format expected by the llama.cpp file loader.
SqueezeLLM leverages non-uniform quantization to better represent the underlying weight distribution by shifting the quantization signposts to the optimal positions. SqueezeLLM shows promising results in both accuracy and runtime efficiency relative to the existing integrated quantization methods. The runtime per token was benchmarked on an M1 for 128 tokens without Metal, using checkpoints from: https://huggingface.co/TheBloke/LLaMa-7B-GGML/:
(Edit - numbers updated to match with the comments below, and to also include the model size with 8-bit embedding quantization. Precision estimates are from the link above)
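To illustrate what "shifting the quantization signposts" means, the sketch below builds a 16-entry (4-bit) lookup table by clustering the weights and then stores only the index of the nearest entry for each weight. The preprint describes a sensitivity-weighted clustering; this example substitutes plain unweighted 1-D k-means as a simplified stand-in, runs on synthetic Gaussian data, and uses function names invented for the example. It is not the PR's kernel or the authors' quantization code.

```python
import random
from typing import List


def kmeans_1d(values: List[float], k: int, iters: int = 20) -> List[float]:
    """Plain (unweighted) Lloyd's algorithm in one dimension."""
    ordered = sorted(values)
    n = len(ordered)
    # Start from evenly spaced quantiles so every centroid sits on real data.
    centroids = [ordered[(2 * i + 1) * n // (2 * k)] for i in range(k)]
    for _ in range(iters):
        sums = [0.0] * k
        counts = [0] * k
        for v in values:
            j = min(range(k), key=lambda c: abs(v - centroids[c]))
            sums[j] += v
            counts[j] += 1
        # Move each centroid to the mean of its members; keep empty ones.
        centroids = [sums[j] / counts[j] if counts[j] else centroids[j]
                     for j in range(k)]
    return sorted(centroids)


def uniform_codebook(values: List[float], k: int) -> List[float]:
    """Evenly spaced signposts between the minimum and maximum value."""
    lo, hi = min(values), max(values)
    return [lo + i * (hi - lo) / (k - 1) for i in range(k)]


def quantize(values: List[float], codebook: List[float]) -> List[int]:
    """Index of the nearest codebook entry for each value."""
    return [min(range(len(codebook)), key=lambda c: abs(v - codebook[c]))
            for v in values]


def mse(values: List[float], indices: List[int], codebook: List[float]) -> float:
    """Mean squared reconstruction error after the table lookup."""
    return sum((v - codebook[i]) ** 2 for v, i in zip(values, indices)) / len(values)


rng = random.Random(0)
weights = [rng.gauss(0.0, 1.0) for _ in range(1000)]  # Synthetic stand-in data.
for label, book in (("uniform", uniform_codebook(weights, 16)),
                    ("k-means", kmeans_1d(weights, 16))):
    print(label, mse(weights, quantize(weights, book), book))
```

At inference time, dequantization is a table lookup per weight, which is the part that an implementation has to make fast.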
Quantized checkpoints are publicly available for a range of popular models, including LLaMA1/2, Vicuna 1.1/1.3/1.5, xGEN, and OPT: https://huggingface.co/squeeze-ai-lab
Example usage (for the 7B LLaMA-2 model):
LLAMA_NO_METAL=1 make
python convert-sqllm-to-gguf.py --outtype q4_sq models/7B/sq-llama-2-7b-w4-s0/sq-llama-2-7b-w4-s0.pt --outfile models/7B/sq-llama-2-7b-w4-s0-fp16.gguf -squeezellm
The --equant flag can also be passed to quantize input and output embeddings to 8 bits.
./main -m models/7B/sq-llama-2-7b-w4-s0-fp16.gguf -n 128