
Shouldn't q8 work in 3060/12GB? #54

Closed
1 task done
jikkuatwork opened this issue Sep 18, 2024 · 5 comments
Labels
question Further information is requested

Comments

@jikkuatwork

jikkuatwork commented Sep 18, 2024

Due diligence

  • I have done my due diligence in trying to find the answer myself.

Topic

The Rust implementation

Question

System Config

  • Ubuntu 22
  • Rust (1.80.1)
  • nvcc
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Mar_28_02:18:24_PDT_2024
Cuda compilation tools, release 12.4, V12.4.131
Build cuda_12.4.r12.4/compiler.34097967_0
  • nvidia-smi
Wed Sep 18 23:30:35 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.107.02             Driver Version: 550.107.02     CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3060        Off |   00000000:0A:00.0  On |                  N/A |
|  0%   46C    P8             15W /  170W |     840MiB /  12288MiB |      4%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A      2039      G   /usr/lib/xorg/Xorg                            537MiB |
|    0   N/A  N/A      2269      G   /usr/bin/gnome-shell                           67MiB |
|    0   N/A  N/A      4224      G   ...9d0e33034f2368c6ed2015474b1d818a902        206MiB |
|    0   N/A  N/A      9191      G   alacritty                                       9MiB |
|    0   N/A  N/A     26191      G   /home/HOME/Apps/Telegram/Telegram               4MiB |
+-----------------------------------------------------------------------------------------+

Observations

Tried: cargo run --bin moshi-backend -r -- --config moshi-backend/config-q8.json standalone

  • The UI loads, but the speed is unacceptably slow and the voice is distorted
  • nvtop shows that the model isn't being loaded onto the GPU

Tried: cargo run --features cuda --bin moshi-backend -r -- --config moshi-backend/config-q8.json standalone

  • Loading the model to the GPU fails! (I thought 12GB would be enough for the ~7GB GGUF; the GPU had barely 1GB in use.)
❮ cargo run --features cuda --bin moshi-backend -r -- --config moshi-backend/config-q8.json standalone
warning: profiles for the non root package will be ignored, specify profiles at the workspace root:
package:   /home/HOME/Projects/outside_projects/moshi/rust/moshi-core/Cargo.toml
workspace: /home/HOME/Projects/outside_projects/moshi/rust/Cargo.toml
warning: profiles for the non root package will be ignored, specify profiles at the workspace root:
package:   /home/HOME/Projects/outside_projects/moshi/rust/moshi-backend/Cargo.toml
workspace: /home/HOME/Projects/outside_projects/moshi/rust/Cargo.toml
warning: profiles for the non root package will be ignored, specify profiles at the workspace root:
package:   /home/HOME/Projects/outside_projects/moshi/rust/moshi-cli/Cargo.toml
workspace: /home/HOME/Projects/outside_projects/moshi/rust/Cargo.toml
    Finished `release` profile [optimized] target(s) in 0.23s
     Running `target/release/moshi-backend --config moshi-backend/config-q8.json standalone`
2024-09-18T18:20:02.612428Z  INFO moshi_backend: build_info=BuildInfo { build_timestamp: "2024-09-18T16:57:00.763883182Z", build_date: "2024-09-18", git_branch: "main", git_timestamp: "2024-09-18T17:45:09.000000000+02:00", git_date: "2024-09-18", git_hash: "f3218c60a115b745b1848bb8297df5eb404a041a", git_describe: "f3218c6", rustc_host_triple: "x86_64-unknown-linux-gnu", rustc_version: "1.80.1", cargo_target_triple: "x86_64-unknown-linux-gnu" }
2024-09-18T18:20:02.612441Z  INFO moshi_backend: starting process with pid 752759
2024-09-18T18:20:02.612457Z  INFO hf_hub: Token file not found "/home/HOME/.cache/huggingface/token"
2024-09-18T18:20:02.682964Z  INFO hf_hub: Token file not found "/home/HOME/.cache/huggingface/token"
2024-09-18T18:20:07.910280Z  INFO moshi_backend::standalone: warming up the model
Error: DriverError(CUDA_ERROR_OUT_OF_MEMORY, "out of memory")
moshi/rust on  main [?] is 📦 v0.2.0 via 🦀 v1.80.1 took 6s
jikkuatwork added the question label on Sep 18, 2024
@adefossez
Collaborator

That is a good question; I would have to look into it more. Maybe @LaurentMazare would have an opinion on this?

@jikkuatwork
Author

Thanks a lot! Appreciate your time!

@LaurentMazare
Member

I cannot really test this at the moment, but I think it's somewhat expected. The weights are ~8.17GB, but in q8 mode we pre-allocate a kv-cache for 4096 steps (~5 minutes of conversation) in f32, which comes to ~4GB. (We should aim at using bf16 instead, but that's likely to require some changes on the candle side.) Activations plus the mimi parts also have to be stored, though they should be pretty small. So overall we're a bit above 12GB here.
One thing you could try is tweaking this line to something like 1000 and seeing if it helps. You'll only be able to have short sessions with moshi, but if it works we could consider making this configurable somehow.
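For intuition, here is a rough back-of-the-envelope sizing sketch in Rust. The architecture numbers (32 layers, 32 heads of head dim 128) are assumptions for a generic 7B-class transformer, not values read from moshi-core; with them, an f32 cache over 4096 steps lands right at the ~4GB figure above:

// Rough kv-cache sizing for a 7B-class transformer. The architecture
// numbers used below (32 layers, 32 heads x 128 head dim) are assumptions
// for illustration, not values taken from moshi-core.
fn kv_cache_bytes(layers: u64, steps: u64, heads: u64, head_dim: u64, elem_bytes: u64) -> u64 {
    // 2x for keys and values, one cache entry per layer per step
    2 * layers * steps * heads * head_dim * elem_bytes
}

fn main() {
    let gib = |b: u64| b as f64 / (1024.0 * 1024.0 * 1024.0);
    // f32 (4 bytes/elem) over 4096 steps: ~4.0 GiB, matching the estimate above
    println!("f32,  4096 steps: {:.2} GiB", gib(kv_cache_bytes(32, 4096, 32, 128, 4)));
    // shrinking the pre-allocation to 1000 steps cuts the cache to ~1.0 GiB
    println!("f32,  1000 steps: {:.2} GiB", gib(kv_cache_bytes(32, 1000, 32, 128, 4)));
    // bf16 (2 bytes/elem) would halve the f32 number
    println!("bf16, 4096 steps: {:.2} GiB", gib(kv_cache_bytes(32, 4096, 32, 128, 2)));
}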

@jikkuatwork
Author

Thanks! It did work, but the speed was quite low. I suppose a 3060 just isn't enough to run this.

Observations

  • The server won't respond over plain http; https must be used, even though most browsers will reject the self-signed certificate.
  • Almost 11.2GB was allocated while loading the model.
  • Interestingly, my speech was often mistranscribed. (I have an Indian accent.)

How to apply the change

For anyone who wants help applying the suggested change:

  • Make the change (a hedged sketch of what it might look like follows this list)
  • Go to ./rust/ and run cargo clean in case you already built without the change
  • Then: cargo build --release
  • Finally: cargo run --features cuda --bin moshi-backend -r -- --config moshi-backend/config-q8.json standalone
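To be concrete, the change amounts to lowering a hard-coded step count. The constant name below is hypothetical (the actual identifier and file in moshi-core may differ); this is only an illustration of the edit Laurent describes:

// Hypothetical sketch; the real constant name and location in moshi-core
// may differ. The edit shrinks the pre-allocated kv-cache from 4096 steps
// (~5 minutes of conversation) to 1000 steps, trading session length for VRAM.
const MAX_STEPS: usize = 1000; // was 4096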

@adefossez
Collaborator

I've added an entry in the FAQ: https://github.com/kyutai-labs/moshi/blob/main/FAQ.md#can-i-run-on-a-12gb--8-gb-gpu-
I'll close this issue then.
