From e7cccbeeb2d63abe31db3cd6cc720216ecb7c4d6 Mon Sep 17 00:00:00 2001
From: Lianmin Zheng
Date: Thu, 28 Nov 2024 23:14:06 -0800
Subject: [PATCH] Update backend.md

---
 docs/backend/backend.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/backend/backend.md b/docs/backend/backend.md
index a2995455f3d..8f34eb7ce56 100644
--- a/docs/backend/backend.md
+++ b/docs/backend/backend.md
@@ -80,7 +80,7 @@ python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct
 python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --chunked-prefill-size 4096
 ```
 - To enable torch.compile acceleration, add `--enable-torch-compile`. It accelerates small models on small batch sizes. This does not work for FP8 currently.
-- To enable torchao quantization, add `--torchao-config int4wo-128`. It supports various quantization strategies.
+- To enable torchao quantization, add `--torchao-config int4wo-128`. It supports other [quantization strategies (INT8/FP8)](https://github.com/sgl-project/sglang/blob/9a00e6f453e764c0b286e2a62f652a1202c0bf9c/python/sglang/srt/server_args.py#L671) as well.
 - To enable fp8 weight quantization, add `--quantization fp8` on a fp16 checkpoint or directly load a fp8 checkpoint without specifying any arguments.
 - To enable fp8 kv cache quantization, add `--kv-cache-dtype fp8_e5m2`.
 - If the model does not have a chat template in the Hugging Face tokenizer, you can specify a [custom chat template](../references/custom_chat_template.md).
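
For quick reference, below is a sketch of launch commands using the quantization flags this patch documents. The flags and the illustrative model path come from the backend.md bullets above; whether these options can be combined in a single invocation is not claimed here.

```
# torchao int4 weight-only quantization (128 is the torchao group size)
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --torchao-config int4wo-128

# fp8 weight quantization applied to an fp16 checkpoint
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --quantization fp8

# fp8 (e5m2) KV cache quantization
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --kv-cache-dtype fp8_e5m2
```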