Recommendation for quantizing models to fp8 that don't fit on a single GPU (Mixtral) #1777
Labels: Low Precision (lower-bit quantization: int8, int4, fp8), question, triaged
Hi, I want to quantize the Mixtral model to fp8. I have a single H100 GPU; the original model doesn't fit on it, but the quantized one does. Up until #1598 there was an option to run quantization with `--device cpu`, but it was removed. What was the motivation behind that decision?
I was able to bring this functionality back by changing the code a bit, and I successfully built the Mixtral engine in fp8 and served it with Triton (rough workflow sketched below).
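For context, the workflow in question looks roughly like this. This is only a sketch: the paths and calibration settings are illustrative, `--device cpu` is the flag removed in #1598 (restored here by a local patch), and the remaining flags are the ones TensorRT-LLM's `examples/quantization/quantize.py` and `trtllm-build` expose, which may differ between versions:

```bash
# Quantize Mixtral weights to fp8 with calibration running on the CPU
# (host RAM holds the full fp16 model; only the quantized checkpoint
# needs to fit on the GPU later).
# --device cpu is the option removed in #1598; paths are illustrative.
python examples/quantization/quantize.py \
    --model_dir ./Mixtral-8x7B-v0.1 \
    --qformat fp8 \
    --kv_cache_dtype fp8 \
    --calib_size 512 \
    --device cpu \
    --output_dir ./mixtral-fp8-ckpt

# Build the TensorRT-LLM engine from the quantized checkpoint on the H100.
trtllm-build \
    --checkpoint_dir ./mixtral-fp8-ckpt \
    --output_dir ./mixtral-fp8-engine
```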
I see some quality issues, as described in #1738 (comment), but I don't think they are related to this.
What should I expect from a model quantized on the CPU? Is it a fundamentally wrong approach, or are there known issues with it? If it's not a proper way to quantize the model, could you suggest an approach I should take instead?
A machine with 2xH100 GPUs would probably solve this, but getting access to one is very hard because of availability 😅