AWQ Quantization support - New generic converter for all HF llama-like models - Tutorials - OpenNMT #389

Quantization and Acceleration

We have added support for already-quantized models and revamped the converter for all llama-like models, whether quantized or not. Here's an example of the syntax:

```bash
python tools/convert_HF_llamalike.py --model_dir "TheBloke/Nous-Hermes-Llama2-AWQ" --output "/dataAI/llama2-7B/Hermes/Nous-Hermes-onmt.pt" --format safetensors
```

  • TheBloke/Nous-Hermes-Llama2-AWQ: the name of the repository/model on the Hugging Face Hub.
  • --output: the target directory and file name for the converted model.
  • --format: optionally, save the checkpoint as safetensors.
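
The same script also handles non-quantized checkpoints. As a hedged illustration (the repository name and output path below are placeholders, not taken from the post):

```bash
# Hypothetical example: converting a non-quantized llama-like model with the
# same converter; the repo and paths are illustrative placeholders.
python tools/convert_HF_llamalike.py \
    --model_dir "meta-llama/Llama-2-7b-hf" \
    --output "/dataAI/llama2-7B/base/llama2-7b-onmt.pt" \
    --format safetensors
```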

For llama-like models, we download tokenizer.model and generate a vocab file during the process. If the model is an AWQ-quantized model, we convert it to an OpenNMT-py AWQ-quantized model.
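
As a rough sketch of what that vocab-generation step amounts to (this is not the converter's actual code, and the vocab file name is an assumption), the SentencePiece vocabulary can be dumped like so:

```python
# Minimal sketch (not the converter's actual implementation): dump the
# SentencePiece vocabulary from tokenizer.model to a plain-text vocab file.
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="/dataAI/llama2-7B/Hermes/tokenizer.model")
with open("/dataAI/llama2-7B/Hermes/vocab.txt", "w", encoding="utf-8") as f:
    for i in range(sp.get_piece_size()):
        f.write(sp.id_to_piece(i) + "\n")
```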

After converting, you will need a config file to run translate.py or run_mmlu_opennmt.py. Here's an example config:

```yaml
transforms: [sentencepiece]

#### Subword
src_subword_model: "/dataAI/llama2-7B/Hermes/tokenizer.model"
tgt_subword_model: "/dataAI/llama2-7B/Hermes/tokenizer.model"

# Model info
model: "/dataAI/llama2-7B/Hermes/Nous-Hermes-onmt.pt"

# Inference
# ...
```
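
Assuming the config above is saved as inference.yaml (the config, prompt, and output file names here are illustrative), a run might look like:

```bash
# Assumed invocation: translate.py reads the YAML config above;
# the file names are placeholders.
python translate.py -config /dataAI/llama2-7B/Hermes/inference.yaml \
    -src prompts.txt -output answers.txt
```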

When weighing your priorities:

  • If you need small model files that fit your GPU's VRAM, try AWQ, but expect it to be slow at large batch sizes.
  • AWQ models are faster than FP16 at batch size 1.

Please read more here: https://github.com/casper-hansen/AutoAWQ

Important Note:

  • There are two AWQ toolkits, llm-awq and AutoAWQ, and AutoAWQ supports two kernel flavors: GEMM and GEMV.
  • The original llm-awq from MIT is not maintained regularly, so we default to AutoAWQ.
  • If a model is tagged llm-awq on the HF Hub, we use AutoAWQ/GEMV, which is compatible with llm-awq checkpoints.
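
If you want to check which flavor a Hub checkpoint uses, one option is to inspect its quantization_config in config.json; the field names below follow AutoAWQ's convention and may differ for other toolkits:

```python
# Illustrative sketch: read the AWQ flavor from a downloaded config.json.
import json

with open("config.json") as f:
    cfg = json.load(f)

quant = cfg.get("quantization_config", {})
print(quant.get("quant_method"))  # typically "awq" for AWQ checkpoints
print(quant.get("version"))       # "GEMM" or "GEMV" for AutoAWQ models
```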

Offline Quantizer Script:

  • We will provide an offline quantizer script for generic OpenNMT-py models. Note, however, that for small NMT models AWQ may actually slow things down, so it may not be relevant for NMT.

Enjoy!


Comparison: Fast Inference with vLLM

Recently, Mistral reported 100 tokens/second for Mistral-7B at batch size 1, and 1250 tokens/sec for a batch of 60 prompts, using vLLM. Running Mistral-instruct-v0.2-onmt-awq, we measured:

  • Batch size 1: 80.5 tokens/second
  • Batch size 60: 98 tokens/second, with GEMV being 20-25% faster.

These numbers were obtained with a GEMM model. For a fair comparison, adjust the throughput to account for the step-0 (prompt prefill) time.
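
As a back-of-the-envelope sketch of that adjustment (all numbers are placeholders, not measurements from the post):

```python
# Sketch: strip step-0 (prompt prefill) time from a throughput measurement
# so decode-only speeds can be compared fairly. Values are made up.
total_tokens = 512      # tokens generated
total_time = 6.8        # seconds, including prefill
prefill_time = 0.4      # seconds spent on step 0 (prompt prefill)

raw_tps = total_tokens / total_time
decode_tps = total_tokens / (total_time - prefill_time)
print(f"raw: {raw_tps:.1f} tok/s, prefill-adjusted: {decode_tps:.1f} tok/s")
```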

Suggested labels

```json
{ "key": "llm-quantization", "value": "Discussions and tools for handling quantized large language models" }
```
