(Limited) comparison of popular model serving solutions
Solution | Inference backend | Serving backend | Advanced kernel support | Model support |
---|---|---|---|---|
Hugging Face TGI | PyTorch | HF TGI (Rust) | Paged + Flash attention | Language |
DeepSpeed MII | PyTorch | DeepSpeed (Python) | DeepSpeed-Kernels | Language |
TensorRT-LLM | TensorRT-LLM | TensorRT-LLM (C++) | TensorRT XQA | Language |
vLLM | vLLM | vLLM (Python) | Paged + Flash attention | Language |
kv.run | PyTorch | HF TGI + more (Rust) | Paged + Flash attention, FlashInfer | Language, Diffusion models (soon) |
Install Rust:
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
sudo apt-get install libssl-dev gcc -y
PROTOC_ZIP=protoc-21.12-linux-x86_64.zip
curl -OL https://github.com/protocolbuffers/protobuf/releases/download/v21.12/$PROTOC_ZIP
sudo unzip -o $PROTOC_ZIP -d /usr/local bin/protoc
sudo unzip -o $PROTOC_ZIP -d /usr/local 'include/*'
rm -f $PROTOC_ZIP
# Install FlashInfer
# For CUDA 12.1 & torch 2.3
pip install flashinfer==0.1.1 -i https://flashinfer.ai/whl/cu121/torch2.3
# For other CUDA & torch versions, please check https://docs.flashinfer.ai/installation.html
# Install Flash and Paged Attention
cd server && make install-flash-attention && make install-vllm-cuda && make install-flash-attention-v2-cuda
make install
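The FlashInfer install step above hard-codes the wheel index for CUDA 12.1 and torch 2.3. The sketch below shows how that URL could be derived from locally installed versions. The helper `flashinfer_index_url` is hypothetical, and the assumption that other combinations follow the same `cu<major><minor>/torch<major>.<minor>` pattern is not confirmed here; check the FlashInfer installation page before relying on it.

```python
def flashinfer_index_url(cuda_version: str, torch_version: str) -> str:
    """Build a FlashInfer wheel index URL (hypothetical helper).

    Assumption: every index follows the same pattern as the cu121/torch2.3
    URL shown in the install step; not every combination necessarily exists.
    """
    cuda_major, cuda_minor = cuda_version.split(".")[:2]
    # Drop a local build suffix such as "+cu121" before splitting.
    torch_major, torch_minor = torch_version.split("+")[0].split(".")[:2]
    return (
        "https://flashinfer.ai/whl/"
        f"cu{cuda_major}{cuda_minor}/torch{torch_major}.{torch_minor}"
    )


print(flashinfer_index_url("12.1", "2.3.0+cu121"))
```

With PyTorch installed, `torch.version.cuda` and `torch.__version__` can be passed as the two arguments.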
`Dockerfile_kvrun` provides a Docker image build script. We will provide pre-built Docker images shortly.
text-generation-launcher --model-id tjluyao/llama-3-8b
You can use `--disable-flashinfer` to force classic TGI serving.
You can query the model either through `curl`:
curl 127.0.0.1:3000/generate -X POST -d '{"inputs":"What is Deep Learning?","parameters":{"lora_id": "tjluyao/llama-3-8b-math", "max_new_tokens":20}}' -H 'Content-Type: application/json'
or using the Python client. Please refer to README.md.
cd server/examples && python test_local_api.py
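For readers who prefer Python over `curl`, here is a minimal sketch of the same `/generate` request using only the standard library. The function names `build_generate_request` and `generate` are hypothetical and are not part of the project's Python client; the endpoint, port, and payload fields mirror the `curl` example above.

```python
import json
import urllib.request


def build_generate_request(prompt, max_new_tokens=20, lora_id=None,
                           base_url="http://127.0.0.1:3000"):
    """Build the POST request sent by the curl example (hypothetical helper)."""
    parameters = {"max_new_tokens": max_new_tokens}
    if lora_id is not None:
        # The adapter must already be loaded on the server.
        parameters["lora_id"] = lora_id
    body = json.dumps({"inputs": prompt, "parameters": parameters})
    return urllib.request.Request(
        f"{base_url}/generate",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, **kwargs):
    """Send the request to a running server and return the parsed JSON reply."""
    request = build_generate_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `generate("What is Deep Learning?", lora_id="tjluyao/llama-3-8b-math")` requires the launcher to be running on port 3000.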
(Inherited from Punica)
python server/examples/test_ui.py
Add --quantize [Method] to the command above, for example:
text-generation-launcher --model-id TechxGenus/gemma-2b-GPTQ --lora-ids tjluyao/gemma-2b-it-math --quantize gptq
The supported quantization methods include:
- AWQ: 4-bit. Requires a model already quantized with AWQ.
- EETQ: 8-bit. Works with any model.
- GPTQ: 4-bit. Requires a model already quantized with GPTQ.
- bitsandbytes: 8-bit. Works with any model.
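The list above can be captured as a small lookup that rejects a method and checkpoint combination before the launcher starts. This is a sketch only: `QUANT_METHODS` and `check_quantization` are hypothetical names, and the lowercase spellings other than `gptq` are assumed to match what `--quantize` accepts.

```python
# Method -> (weight bit width, needs a pre-quantized checkpoint),
# transcribed from the list above. Only "gptq" appears in the example
# command; the other spellings are assumptions.
QUANT_METHODS = {
    "awq": (4, True),
    "eetq": (8, False),
    "gptq": (4, True),
    "bitsandbytes": (8, False),
}


def check_quantization(method, model_is_prequantized):
    """Return the bit width, or raise if the method cannot be used as asked."""
    key = method.lower()
    if key not in QUANT_METHODS:
        raise ValueError(f"unknown quantization method: {method}")
    bits, needs_prequantized = QUANT_METHODS[key]
    if needs_prequantized and not model_is_prequantized:
        raise ValueError(f"{method} needs a checkpoint quantized with {method}")
    return bits
```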
For AWQ, EETQ, and GPTQ quantization, you need to build their specific kernels:
# AWQ
cd server && make install-awq
git clone https://github.com/casper-hansen/AutoAWQ && cd AutoAWQ
pip install -e .
# EETQ
cd server && make install-eetq
# GPTQ
git clone https://github.com/PanQiWei/AutoGPTQ.git && cd AutoGPTQ
pip install -vvv --no-build-isolation -e .
- To load LoRA adapters, you may either (1) specify them in the launcher arguments using `--lora-ids`:
text-generation-launcher --model-id tjluyao/llama-3-8b --lora-ids 'tjluyao/llama-3-8b-math;tjluyao/llama-3-8b-zh'
or (2) load them dynamically from the client after the model is launched:
curl 127.0.0.1:3000/download_lora_adapter -X POST -d '{"lora_id":"tjluyao/llama-3-8b-math"}' -H 'Content-Type: application/json'
- To query the model, similarly pass `lora_id` in the parameters (make sure the adapter is loaded):
curl 127.0.0.1:3000/generate -X POST -d '{"inputs":"What is Deep Learning?","parameters":{"lora_id": "tjluyao/llama-3-8b-math", "max_new_tokens":20}}' -H 'Content-Type: application/json'
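Both ways of loading adapters can be scripted. The sketch below builds the launcher command with the adapter list safely quoted (an unquoted `;` would otherwise be read by the shell as a command separator) and builds the `download_lora_adapter` request. `launcher_command` and `build_download_request` are hypothetical helper names; the flag, endpoint, and payload come from the examples above.

```python
import json
import shlex
import urllib.request


def launcher_command(model_id, lora_ids):
    """Return a shell-safe launcher command with adapters joined by ';'."""
    return shlex.join([
        "text-generation-launcher",
        "--model-id", model_id,
        "--lora-ids", ";".join(lora_ids),
    ])


def build_download_request(lora_id, base_url="http://127.0.0.1:3000"):
    """Build the POST request that asks a running server to load an adapter."""
    body = json.dumps({"lora_id": lora_id}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/download_lora_adapter",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


print(launcher_command(
    "tjluyao/llama-3-8b",
    ["tjluyao/llama-3-8b-math", "tjluyao/llama-3-8b-zh"],
))
```

Passing the result of `build_download_request(...)` to `urllib.request.urlopen` sends it to a running server.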
Testing Llama-2-7b on RTX 6000 Ada (Vast AI):
Step | Batch Size | Average FlashInfer (tokens/s) | Average TGI (tokens/s) |
---|---|---|---|
Prefill | 1 | 52.16 | 41.14 |
Prefill | 2 | 101.64 | 78.69 |
Prefill | 4 | 191.48 | 154.11 |
Prefill | 8 | 323.21 | 290.82 |
Prefill | 16 | 512.50 | 538.15 |
Prefill | 32 | 697.89 | 783.61 |
Decode | 1 | 56.55 | 40.84 |
Decode | 2 | 108.55 | 77.85 |
Decode | 4 | 207.10 | 154.27 |
Decode | 8 | 383.92 | 297.53 |
Decode | 16 | 682.78 | 562.83 |
Decode | 32 | 1119.92 | 993.33 |
Testing Llama-2-7b on 3090 (Vast AI):
Step | Batch Size | Average FlashInfer (tokens/s) | Average TGI (tokens/s) |
---|---|---|---|
Prefill | 1 | 44.33 | 23.32 |
Prefill | 2 | 74.81 | 46.68 |
Prefill | 4 | 133.93 | 90.51 |
Prefill | 8 | 189.78 | 168.27 |
Prefill | 16 | 231.24 | 218.12 |
Prefill | 32 | 270.12 | 265.74 |
Decode | 1 | 50.21 | 23.13 |
Decode | 2 | 89.70 | 47.26 |
Decode | 4 | 174.92 | 93.09 |
Decode | 8 | 324.06 | 175.21 |
Decode | 16 | 567.67 | 337.92 |
Decode | 32 | 861.50 | 601.03 |
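To make the two benchmark tables easier to compare, the sketch below computes the FlashInfer-to-TGI throughput ratio for each step and batch size. The numbers are copied from the tables above; the `speedups` helper and the dictionary layout are only illustrative.

```python
# Throughput in tokens/s from the tables above: batch size -> (FlashInfer, TGI).
RTX_6000_ADA = {
    "prefill": {1: (52.16, 41.14), 2: (101.64, 78.69), 4: (191.48, 154.11),
                8: (323.21, 290.82), 16: (512.50, 538.15), 32: (697.89, 783.61)},
    "decode": {1: (56.55, 40.84), 2: (108.55, 77.85), 4: (207.10, 154.27),
               8: (383.92, 297.53), 16: (682.78, 562.83), 32: (1119.92, 993.33)},
}
RTX_3090 = {
    "prefill": {1: (44.33, 23.32), 2: (74.81, 46.68), 4: (133.93, 90.51),
                8: (189.78, 168.27), 16: (231.24, 218.12), 32: (270.12, 265.74)},
    "decode": {1: (50.21, 23.13), 2: (89.70, 47.26), 4: (174.92, 93.09),
               8: (324.06, 175.21), 16: (567.67, 337.92), 32: (861.50, 601.03)},
}


def speedups(results):
    """Return step -> batch size -> FlashInfer throughput / TGI throughput."""
    return {
        step: {batch: flashinfer / tgi for batch, (flashinfer, tgi) in rows.items()}
        for step, rows in results.items()
    }


for name, results in (("RTX 6000 Ada", RTX_6000_ADA), ("RTX 3090", RTX_3090)):
    for step, ratios in speedups(results).items():
        line = ", ".join(f"bs={batch}: {ratio:.2f}x" for batch, ratio in ratios.items())
        print(f"{name} {step}: {line}")
```

A ratio above 1 means FlashInfer was faster in that run. In these measurements the ratio drops below 1 only for prefill at batch sizes 16 and 32 on the RTX 6000 Ada, and it is largest for decode at small batch sizes on the 3090.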
Note: L = Language, I = Image
Model | MoE | Size | Modality | Flash & Paged Attention | FlashInfer |
---|---|---|---|---|---|
Idefics | | 9B | L, I ⇒ L | ✔ | |
Idefics 2 | | 8B | L, I ⇒ L | ✔ | |
Llava Next (1.6) | | 13B | L, I ⇒ L | ✔ | |
Llama 2 | | 7B | L ⇒ L | ✔ | ✔ |
Llama 3 | | 8B | L ⇒ L | ✔ | ✔ |
Phi 1.5 | | 2.7B | L ⇒ L | ✔ | |
Phi 3 | | 3.8B | L ⇒ L | ✔ | ✔ |
Gemma | | 2B | L ⇒ L | ✔ | ✔ |
Cohere | | 104B | L ⇒ L | ✔ | |
Dbrx | ✔ | 132B | L ⇒ L | ✔ | |
Mamba | | 2.8B | L ⇒ L | | |
Mistral | | 7B | L ⇒ L | ✔ | ✔ |
Mixtral | ✔ | 8x22B | L ⇒ L | ✔ | |
Gpt Bigcode | | 1.1B | L ⇒ L | ✔ | |
Baichuan | | 7B | L ⇒ L | ✔ | ✔ |
Falcon | | 7B | L ⇒ L | ✔ | |
StarCoder 2 | | 15B | L ⇒ L | ✔ | |
Qwen 2 | | 7B | L ⇒ L | ✔ | ✔ |
Qwen 1.5 | | 7B | L ⇒ L | ✔ | ✔ |
Opt | | 6.7B | L ⇒ L | | |
T5 | | 11B | L ⇒ L | | |
Galactica | | 120B | L ⇒ L | | |
SantaCoder | | 1.1B | L ⇒ L | ✔ | |
Bloom | | 560M | L ⇒ L | | |
Mpt | | 7B | L ⇒ L | | |
Gpt2 | | 124M | L ⇒ L | ✔ | |
Gpt Neox | | 20B | L ⇒ L | ✔ | |
Yi 1.5 | | 9B | L ⇒ L | ✔ | ✔ |
ChatGLM 4 | | 9B | L ⇒ L | ✔ | ✔ |