AWQ Quantization support - New generic converter for all HF llama-like models - Tutorials - OpenNMT #389
Quantization and Acceleration
We have added support for already quantized models, and revamped the converter for all llama-like models, whether they are quantized or not. Here's an example of the syntax:
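A minimal sketch of the converter invocation, assuming OpenNMT-py's tools/convert_HF_llamalike.py script; --output and --format correspond to the options described just below, while --model_dir and the output path are assumptions used only for illustration:

```bash
# Convert a Hugging Face llama-like (optionally AWQ-quantized) model to OpenNMT-py format
python tools/convert_HF_llamalike.py \
    --model_dir "TheBloke/Nous-Hermes-Llama2-AWQ" \
    --output "/path/to/models/Nous-Hermes-onmt" \
    --format safetensors
```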
- TheBloke/Nous-Hermes-Llama2-AWQ: The name of the repository/model on the Hugging Face Hub.
- output: Specifies the target directory and model name you want to save.
- format: Optionally, you can save as safetensors.

For llama-like models, we download the tokenizer.model and generate a vocab file during the process. If the model is an AWQ quantized model, we will convert it to an OpenNMT-py AWQ quantized model.

After converting, you will need a config file to run translate.py or run_mmlu_opennmt.py.
Here's an example of the config:
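A sketch of what such an inference config can look like; the paths are placeholders and the exact option set is an assumption based on common OpenNMT-py inference settings, not the post's original file:

```yaml
# inference.yaml -- illustrative OpenNMT-py inference config (values are placeholders)
transforms: [sentencepiece]
src_subword_model: "/path/to/Nous-Hermes/tokenizer.model"
tgt_subword_model: "/path/to/Nous-Hermes/tokenizer.model"
# the converted OpenNMT-py AWQ checkpoint produced by the converter above
model: "/path/to/models/Nous-Hermes-onmt.pt"
# decoding settings
world_size: 1
gpu_ranks: [0]
batch_type: sents
batch_size: 8
beam_size: 1
max_length: 256
```

It would then be passed to the inference script with something like `python translate.py -config inference.yaml -src prompts.txt -output answers.txt` (a hypothetical invocation).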
When considering your priority (for example, which AWQ kernel variant suits your batch size), please read more here: GitHub - casper-hansen/AutoAWQ
Important Note:
Offline Quantizer Script:
Enjoy!
VS: Fast Inference with vLLM
Recently, Mistral reported 100 tokens/second for Mistral-7B at batch size 1 and 1250 tokens/sec for a batch of 60 prompts using vLLM. When using Mistral-instruct-v0.2-onmt-awq, the performance was as follows:
This was with a GEMM model. To make a fair comparison, adjust the throughput for the step0 (prompt prefill) time.
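Concretely, with N the number of generated tokens, T_total the total wall-clock time, and T_0 the step0 (prompt prefill) time, the comparable decoding throughput is N / (T_total - T_0) tokens/sec rather than N / T_total.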
Suggested labels
{ "key": "llm-quantization", "value": "Discussions and tools for handling quantized large language models" }