
Shouldn't q8 work in 3060/12GB? #54

Closed
1 task done
jikkuatwork opened this issue Sep 18, 2024 · 5 comments
Labels
question Further information is requested

Comments

@jikkuatwork

jikkuatwork commented Sep 18, 2024

Due diligence

  • I have done my due diligence in trying to find the answer myself.

Topic

The Rust implementation

Question

System Config

  • Ubuntu 22
  • Rust (1.80.1)
  • nvcc
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Mar_28_02:18:24_PDT_2024
Cuda compilation tools, release 12.4, V12.4.131
Build cuda_12.4.r12.4/compiler.34097967_0
  • nvidia-smi
Wed Sep 18 23:30:35 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.107.02             Driver Version: 550.107.02     CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3060        Off |   00000000:0A:00.0  On |                  N/A |
|  0%   46C    P8             15W /  170W |     840MiB /  12288MiB |      4%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A      2039      G   /usr/lib/xorg/Xorg                            537MiB |
|    0   N/A  N/A      2269      G   /usr/bin/gnome-shell                           67MiB |
|    0   N/A  N/A      4224      G   ...9d0e33034f2368c6ed2015474b1d818a902        206MiB |
|    0   N/A  N/A      9191      G   alacritty                                       9MiB |
|    0   N/A  N/A     26191      G   /home/HOME/Apps/Telegram/Telegram               4MiB |
+-----------------------------------------------------------------------------------------+

Observations

Tried: cargo run --bin moshi-backend -r -- --config moshi-backend/config-q8.json standalone

  • The UI loads, but the speed is unacceptably slow and the voice is distorted
  • nvtop shows that the model isn't being loaded onto the GPU

Tried: cargo run --features cuda --bin moshi-backend -r -- --config moshi-backend/config-q8.json standalone

  • Loading the model to the GPU fails! (I thought 12GB would be enough for the ~7GB GGUF; the GPU had barely 1GB in use.)
❮ cargo run --features cuda --bin moshi-backend -r -- --config moshi-backend/config-q8.json standalone
warning: profiles for the non root package will be ignored, specify profiles at the workspace root:
package:   /home/HOME/Projects/outside_projects/moshi/rust/moshi-core/Cargo.toml
workspace: /home/HOME/Projects/outside_projects/moshi/rust/Cargo.toml
warning: profiles for the non root package will be ignored, specify profiles at the workspace root:
package:   /home/HOME/Projects/outside_projects/moshi/rust/moshi-backend/Cargo.toml
workspace: /home/HOME/Projects/outside_projects/moshi/rust/Cargo.toml
warning: profiles for the non root package will be ignored, specify profiles at the workspace root:
package:   /home/HOME/Projects/outside_projects/moshi/rust/moshi-cli/Cargo.toml
workspace: /home/HOME/Projects/outside_projects/moshi/rust/Cargo.toml
    Finished `release` profile [optimized] target(s) in 0.23s
     Running `target/release/moshi-backend --config moshi-backend/config-q8.json standalone`
2024-09-18T18:20:02.612428Z  INFO moshi_backend: build_info=BuildInfo { build_timestamp: "2024-09-18T16:57:00.763883182Z", build_date: "2024-09-18", git_branch: "main", git_timestamp: "2024-09-18T17:45:09.000000000+02:00", git_date: "2024-09-18", git_hash: "f3218c60a115b745b1848bb8297df5eb404a041a", git_describe: "f3218c6", rustc_host_triple: "x86_64-unknown-linux-gnu", rustc_version: "1.80.1", cargo_target_triple: "x86_64-unknown-linux-gnu" }
2024-09-18T18:20:02.612441Z  INFO moshi_backend: starting process with pid 752759
2024-09-18T18:20:02.612457Z  INFO hf_hub: Token file not found "/home/HOME/.cache/huggingface/token"
2024-09-18T18:20:02.682964Z  INFO hf_hub: Token file not found "/home/HOME/.cache/huggingface/token"
2024-09-18T18:20:07.910280Z  INFO moshi_backend::standalone: warming up the model
Error: DriverError(CUDA_ERROR_OUT_OF_MEMORY, "out of memory")
moshi/rust on  main [?] is 📦 v0.2.0 via 🦀 v1.80.1 took 6s
jikkuatwork added the question label on Sep 18, 2024
@adefossez
Collaborator

That is a good question; I would have to look into it more. Maybe @LaurentMazare would have an opinion on this?

@jikkuatwork
Author

Thanks a lot! Appreciate your time!

@LaurentMazare
Member

I cannot really test this at the moment, but I think it's somewhat expected. The weights are ~8.17GB, but in q8 mode we pre-allocate a kv-cache for 4096 steps (~5 minutes of conversation) in f32, which comes to ~4GB. (We should aim at using bf16 instead, but that's likely to require some changes on the candle side.) Activations plus the mimi parts also have to be stored, though they should be pretty small. So overall we're a bit above 12GB here.
One thing you could try is tweaking this line to something like 1000 and seeing if it helps. You'll only be able to have short sessions with moshi, but if it works we could consider making this configurable somehow.
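For intuition, here is a rough back-of-the-envelope sizing sketch in Rust. The architecture numbers (32 layers, 32 heads of head dim 128) are assumptions for a generic 7B-class transformer, not values read from moshi-core; with them, an f32 cache over 4096 steps lands right at the ~4GB figure above:

// Rough kv-cache sizing for a 7B-class transformer. The architecture
// numbers used below (32 layers, 32 heads x 128 head dim) are assumptions
// for illustration, not values taken from moshi-core.
fn kv_cache_bytes(layers: u64, steps: u64, heads: u64, head_dim: u64, elem_bytes: u64) -> u64 {
    // 2x for keys and values, one cache entry per layer per step
    2 * layers * steps * heads * head_dim * elem_bytes
}

fn main() {
    let gib = |b: u64| b as f64 / (1024.0 * 1024.0 * 1024.0);
    // f32 (4 bytes/elem) over 4096 steps: ~4.0 GiB, matching the estimate above
    println!("f32,  4096 steps: {:.2} GiB", gib(kv_cache_bytes(32, 4096, 32, 128, 4)));
    // shrinking the pre-allocation to 1000 steps cuts the cache to ~1.0 GiB
    println!("f32,  1000 steps: {:.2} GiB", gib(kv_cache_bytes(32, 1000, 32, 128, 4)));
    // bf16 (2 bytes/elem) would halve the f32 number
    println!("bf16, 4096 steps: {:.2} GiB", gib(kv_cache_bytes(32, 4096, 32, 128, 2)));
}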

@jikkuatwork
Author

Thanks! It did work, but the speed was quite low. I suppose a 3060 just isn't enough to run this.

Observations

  • The server won't respond over plain http; https must be used, even though most browsers will reject the self-signed certificate.
  • Almost 11.2GB was allocated while loading the model.
  • Interestingly, my speech was often mistranscribed. (I have an Indian accent.)

How to apply the change

For anyone who wants help applying the suggested change:

  • Make the change (a hedged sketch of what it might look like follows this list)
  • Go to ./rust/ and run cargo clean in case you already built without the change
  • Then: cargo build --release
  • Finally: cargo run --features cuda --bin moshi-backend -r -- --config moshi-backend/config-q8.json standalone
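To be concrete, the change amounts to lowering a hard-coded step count. The constant name below is hypothetical (the actual identifier and file in moshi-core may differ); this is only an illustration of the edit Laurent describes:

// Hypothetical sketch; the real constant name and location in moshi-core
// may differ. The edit shrinks the pre-allocated kv-cache from 4096 steps
// (~5 minutes of conversation) to 1000 steps, trading session length for VRAM.
const MAX_STEPS: usize = 1000; // was 4096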

@adefossez
Collaborator

I've added an entry in the FAQ: https://github.com/kyutai-labs/moshi/blob/main/FAQ.md#can-i-run-on-a-12gb--8-gb-gpu-
I'll close this issue then.
