AWQ Quantization support - New generic converter for all HF llama-like models - Tutorials - OpenNMT #389
Quantization and Acceleration
We have added support for already quantized models, and revamped the converter for all llama-like models, whether they are quantized or not. Here's an example of the syntax:
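A minimal sketch of the converter invocation, assuming OpenNMT-py's tools/convert_HF_llamalike.py script; --output and --format correspond to the options described just below, while --model_dir and the output path are assumptions used only for illustration:

```bash
# Convert a Hugging Face llama-like (optionally AWQ-quantized) model to OpenNMT-py format
python tools/convert_HF_llamalike.py \
    --model_dir "TheBloke/Nous-Hermes-Llama2-AWQ" \
    --output "/path/to/models/Nous-Hermes-onmt" \
    --format safetensors
```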
- TheBloke/Nous-Hermes-Llama2-AWQ: The name of the repository/model on the Hugging Face Hub.
- output: Specifies the target directory and model name you want to save.
- format: Optionally, you can save as safetensors.

For llama-like models, we download the tokenizer.model and generate a vocab file during the process. If the model is an AWQ quantized model, we will convert it to an OpenNMT-py AWQ quantized model.

After converting, you will need a config file to run translate.py or run_mmlu_opennmt.py.
Here's an example of the config:
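A sketch of what such an inference config can look like; the paths are placeholders and the exact option set is an assumption based on common OpenNMT-py inference settings, not the post's original file:

```yaml
# inference.yaml -- illustrative OpenNMT-py inference config (values are placeholders)
transforms: [sentencepiece]
src_subword_model: "/path/to/Nous-Hermes/tokenizer.model"
tgt_subword_model: "/path/to/Nous-Hermes/tokenizer.model"
# the converted OpenNMT-py AWQ checkpoint produced by the converter above
model: "/path/to/models/Nous-Hermes-onmt.pt"
# decoding settings
world_size: 1
gpu_ranks: [0]
batch_type: sents
batch_size: 8
beam_size: 1
max_length: 256
```

It would then be passed to the inference script with something like `python translate.py -config inference.yaml -src prompts.txt -output answers.txt` (a hypothetical invocation).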
When considering your priority (for example, which AWQ kernel variant suits your batch size), please read more here: GitHub - casper-hansen/AutoAWQ
Important Note:
Offline Quantizer Script:
Enjoy!
VS: Fast Inference with vLLM
Recently, Mistral reported 100 tokens/second for Mistral-7B at batch size 1 and 1250 tokens/sec for a batch of 60 prompts using vLLM. When using Mistral-instruct-v0.2-onmt-awq, the performance was as follows:
This was with a GEMM model. To make a fair comparison, adjust the throughput for the step0 (prompt prefill) time.
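Concretely, with N the number of generated tokens, T_total the total wall-clock time, and T_0 the step0 (prompt prefill) time, the comparable decoding throughput is N / (T_total - T_0) tokens/sec rather than N / T_total.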
Suggested labels
{ "key": "llm-quantization", "value": "Discussions and tools for handling quantized large language models" }