
Recommendation for quantizing models to fp8 that don't fit on a single GPU (Mixtral) #1777

Closed
bprus opened this issue Jun 13, 2024 · 6 comments
Labels: Low Precision (lower-bit quantization, including int8, int4, fp8) · question (further information is requested) · triaged (issue has been triaged by maintainers)

Comments

bprus (Contributor) commented Jun 13, 2024

Hi, I want to quantize the Mixtral model to fp8. I have a single H100 GPU. The original model doesn't fit on a single GPU (but the quantized one does). Up until #1598 there was an option to run quantization with the --device cpu flag, but it was removed.
What was the motivation behind this decision?
I was able to bring the functionality back by changing the code a bit, and I successfully built the Mixtral engine in fp8 and served it with Triton (the flow is sketched below).
I see some quality issues, as described in #1738 (comment), but I don't think they are related to this.
What should I expect from a model quantized on CPU? Is it a totally wrong approach? Are there any known issues with it?
If this isn't a proper way to quantize a model, could you suggest the approach I should take?
A machine with 2x H100 GPUs would probably solve this, but getting access to one is very hard these days 😅
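
For reference, the flow I'm describing is roughly the following. Paths are placeholders and the exact flag set may differ between versions; --device cpu is the knob that #1598 removed:

```
# Rough sketch (flag names per examples/quantization/quantize.py; exact
# options may vary by version). Calibration runs on the CPU, so the
# unquantized Mixtral weights never have to fit on the H100.
python examples/quantization/quantize.py \
    --model_dir ./Mixtral-8x7B-v0.1 \
    --qformat fp8 \
    --kv_cache_dtype fp8 \
    --calib_size 512 \
    --output_dir ./mixtral_fp8_ckpt \
    --device cpu

# The engine build then runs on the GPU; the fp8 checkpoint fits on one H100.
trtllm-build --checkpoint_dir ./mixtral_fp8_ckpt \
             --output_dir ./mixtral_fp8_engine
```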

nv-guomingz (Collaborator) commented

Good question.

@syuoni, would you please answer the question about removing the --device cpu knob?

syuoni (Collaborator) commented Jun 14, 2024

Thanks @bprus for reporting this. I've started bringing the device flag back to quantize.py.

Just to double-check with you: running calibration on CPU is extremely slow for a large model like Mixtral, right?
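
Conceptually, the flag just controls where the model is placed for calibration. A hypothetical sketch of what it amounts to (written against the Hugging Face loader for illustration, not the actual patch):

```python
# Hypothetical sketch, not the actual change to quantize.py: what a
# --device knob amounts to when loading the model for calibration.
import torch
from transformers import AutoModelForCausalLM

def load_for_calibration(model_dir: str, device: str = "cuda"):
    # "cpu" keeps all weights in host RAM, so a model larger than one GPU's
    # memory can still be calibrated; each calibration forward pass then
    # runs on the CPU, which is why it is slow.
    device_map = "cpu" if device == "cpu" else "auto"
    return AutoModelForCausalLM.from_pretrained(
        model_dir,
        torch_dtype=torch.float16,
        device_map=device_map,
    )
```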

bprus (Contributor, Author) commented Jun 14, 2024

Thanks.
I wouldn't say it's extremely slow. It's much slower than on GPU, but I think the whole process of quantization and engine building takes under an hour. I'll measure the time on Monday.

nv-guomingz added the triaged, waiting for feedback, question, and Low Precision labels on Jun 14, 2024
bprus (Contributor, Author) commented Jun 17, 2024

Hi @syuoni,
I measured the time, and it took:
Quantization done. Total time used: 3022.70 s.
plus an additional Total time used 157.01 s. for saving.
I'm using a 24-core AMD EPYC 9334 CPU.

nv-guomingz (Collaborator) commented

> Hi @syuoni, I measured the time, and it took: Quantization done. Total time used: 3022.70 s. plus an additional Total time used 157.01 s. for saving. I'm using a 24-core AMD EPYC 9334 CPU.

At least we can run fp8 quantization with a single GPU card now.
The changes @syuoni made will be in the coming weekly update, so you may try again then.

Could we close this ticket now?

bprus (Contributor, Author) commented Jun 17, 2024

Sure, thanks for all the help!
