
Recommendation for quantizing models to fp8 that don't fit on a single GPU (Mixtral) #1777

Closed
bprus opened this issue Jun 13, 2024 · 6 comments
Labels: Low Precision (lower-bit quantization, including int8, int4, fp8) · question (further information is requested) · triaged (issue has been triaged by maintainers)

Comments

bprus (Contributor) commented Jun 13, 2024

Hi, I want to quantize the Mixtral model to fp8. I have a single H100 GPU. The original model doesn't fit on a single GPU (but the quantized one does). Up until #1598 there was an option to run quantization with the --device cpu flag, but it was removed.
What was the motivation behind this decision?
I was able to bring the functionality back by changing the code a bit, and I successfully built the Mixtral engine in fp8 and served it with Triton (the flow is sketched below).
I see some quality issues, as described in #1738 (comment), but I don't think they are related to this.
What should I expect from a model quantized on CPU? Is it a totally wrong approach? Are there any known issues with it?
If this isn't a proper way to quantize a model, could you suggest the approach I should take?
A machine with 2x H100 GPUs would probably solve this, but getting access to one is very hard these days 😅
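
For reference, the flow I'm describing is roughly the following. Paths are placeholders and the exact flag set may differ between versions; --device cpu is the knob that #1598 removed:

```
# Rough sketch (flag names per examples/quantization/quantize.py; exact
# options may vary by version). Calibration runs on the CPU, so the
# unquantized Mixtral weights never have to fit on the H100.
python examples/quantization/quantize.py \
    --model_dir ./Mixtral-8x7B-v0.1 \
    --qformat fp8 \
    --kv_cache_dtype fp8 \
    --calib_size 512 \
    --output_dir ./mixtral_fp8_ckpt \
    --device cpu

# The engine build then runs on the GPU; the fp8 checkpoint fits on one H100.
trtllm-build --checkpoint_dir ./mixtral_fp8_ckpt \
             --output_dir ./mixtral_fp8_engine
```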

nv-guomingz (Collaborator) commented

Good question.

@syuoni, would you please answer the question about removing the --device cpu knob?

syuoni (Collaborator) commented Jun 14, 2024

Thanks @bprus for reporting this. I've started bringing the device flag back to quantize.py.

Just to double-check with you: running calibration on CPU is extremely slow for a large model like Mixtral, right?
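
Conceptually, the flag just controls where the model is placed for calibration. A hypothetical sketch of what it amounts to (written against the Hugging Face loader for illustration, not the actual patch):

```python
# Hypothetical sketch, not the actual change to quantize.py: what a
# --device knob amounts to when loading the model for calibration.
import torch
from transformers import AutoModelForCausalLM

def load_for_calibration(model_dir: str, device: str = "cuda"):
    # "cpu" keeps all weights in host RAM, so a model larger than one GPU's
    # memory can still be calibrated; each calibration forward pass then
    # runs on the CPU, which is why it is slow.
    device_map = "cpu" if device == "cpu" else "auto"
    return AutoModelForCausalLM.from_pretrained(
        model_dir,
        torch_dtype=torch.float16,
        device_map=device_map,
    )
```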

bprus (Contributor, Author) commented Jun 14, 2024

Thanks.
I wouldn't say it's extremely slow. It's much slower than on GPU, but I think the whole process of quantization and engine building takes under an hour. I'll measure the time on Monday.

nv-guomingz added the triaged, waiting for feedback, question, and Low Precision labels on Jun 14, 2024
bprus (Contributor, Author) commented Jun 17, 2024

Hi @syuoni,
I measured the time, and it took:
Quantization done. Total time used: 3022.70 s.
plus an additional Total time used 157.01 s. for saving.
I'm using a 24-core AMD EPYC 9334 CPU.

nv-guomingz (Collaborator) commented

> Hi @syuoni, I measured the time, and it took: Quantization done. Total time used: 3022.70 s. plus an additional Total time used 157.01 s. for saving. I'm using a 24-core AMD EPYC 9334 CPU.

At least we can run fp8 quantization with a single GPU card now.
The changes @syuoni made will be in the coming weekly update, so you may try again then.

Could we close this ticket now?

bprus (Contributor, Author) commented Jun 17, 2024

Sure, thanks for all the help!
