Shouldn't q8 work in 3060/12GB? #54
Comments
That is a good question; I'd have to look into it more. Maybe @LaurentMazare would have an opinion on this?

Thanks a lot! Appreciate your time!
I cannot really test this at the moment, but I think it's somewhat expected. The weights are ~8.17GB, but in q8 mode we pre-allocate a kv-cache for 4096 steps (~5 mins of conversation) in f32; that kv-cache is ~4GB. We should aim at using bf16 instead, but that's likely to require some changes on the candle side. Activations and the mimi parts also have to be stored, but they should be pretty small. So overall we're a bit above 12GB here.
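To make the arithmetic above explicit, here is a rough VRAM budget sketch. The weight and f32 kv-cache figures come from the comment; the bf16 figure is an assumption (half the f32 size, since bf16 uses 2 bytes per element instead of 4):

```python
# Rough VRAM budget for the q8 Moshi model on a 12 GB GPU.
# Figures from the discussion: ~8.17 GB quantized weights,
# ~4 GB kv-cache pre-allocated for 4096 steps in f32.
# The bf16 kv-cache size is an assumption (half of f32).

weights_q8 = 8.17                  # GB, quantized weights
kv_cache_f32 = 4.0                 # GB, 4096-step kv-cache in f32
kv_cache_bf16 = kv_cache_f32 / 2   # GB, 2 bytes/element instead of 4

total_f32 = weights_q8 + kv_cache_f32    # ~12.17 GB, already over 12 GB
total_bf16 = weights_q8 + kv_cache_bf16  # ~10.17 GB, would leave headroom

print(f"f32 kv-cache total:  {total_f32:.2f} GB")
print(f"bf16 kv-cache total: {total_bf16:.2f} GB")
```

This ignores activations and the mimi components, which the comment notes are comparatively small, but the weights plus the f32 cache alone already exceed 12GB before counting them.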
Thanks! It did work, but the speed was pretty low. I suspect a 3060 isn't enough to run this.

Observations

How to apply the change

For those who may want some help in applying the suggested change:
I've added an entry in the FAQ: https://github.com/kyutai-labs/moshi/blob/main/FAQ.md#can-i-run-on-a-12gb--8-gb-gpu-
Due diligence
Topic
The Rust implementation
Question
System Config
Observations
Tried:
cargo run --bin moshi-backend -r -- --config moshi-backend/config-q8.json standalone
nvtop shows that the model isn't loading.
Tried:
cargo run --features cuda --bin moshi-backend -r -- --config moshi-backend/config-q8.json standalone