bug: Cortex-cpp continues to have 1 layer offload to CPU while using GPU #1104
Comments
@nguyenhoangthuan99 I am linking this to #1151 as a sub-issue. Please let me know if this is already solved.
This should be resolved by changing …
Edit: set to the maximum.
@hahuyhoang411 @gabrielle-ong I recommend we create a separate issue in the Models repo for our …
This would solve the issue where people are unsure how many layers are in a model, which results in slow inference due to layers being left on the CPU instead of being fully offloaded to the GPU.
Ah yes, in the https://huggingface.co/cortexso/qwen2.5/blob/main/model.yml
Putting this in investigating first. cc @imtuyethan / @louis-jan to update what is needed for Jan after Cortex's hardware API. This is my naive understanding; please cmiiw / add on.
Marking as complete since ngl can be configured in model.yaml.
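As a rough illustration of the resolution described above, a model.yml override along these lines would request full offload (the `ngl` field name comes from this thread; the exact schema and accepted values may differ between cortex versions, so treat this as a sketch rather than a definitive config):

```yaml
# model.yml (sketch): request that all transformer layers be
# offloaded to the GPU. A 33-layer model needs ngl >= 33; per the
# comment above, setting it to the maximum avoids leaving a layer
# on the CPU when the layer count is unknown.
ngl: 33
```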
Describe the bug
When generating responses using a local LLM, cortex-cpp still appears to rely heavily on the CPU.
https://discord.com/channels/1107178041848909847/1149558035971321886/1253148982188838954
To Reproduce
Expected behavior
Since cortex-cpp is using a local llm and the CUDA toolkit, it should primarily use the GPU for processing and not consume as much CPU.
Desktop
Additional context
The logs indicate that 32 out of 33 layers are offloaded to the GPU, but 1 layer is still processed on the CPU. This behavior will be investigated further.
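To make the symptom above easier to spot, one could scan the engine's load logs for the layer-offload line. This is a hypothetical helper, not part of cortex-cpp; the log format is modeled on llama.cpp-style messages such as "offloaded 32/33 layers to GPU":

```python
import re

def check_offload(log_line: str):
    """Return True if all layers were offloaded to the GPU,
    False if some layers stayed on the CPU, or None if the
    line does not describe layer offloading."""
    m = re.search(r"offloaded (\d+)/(\d+) layers to GPU", log_line)
    if m is None:
        return None
    offloaded, total = int(m.group(1)), int(m.group(2))
    return offloaded == total

# The situation reported in this issue: 32 of 33 layers on GPU.
print(check_offload("llm_load_tensors: offloaded 32/33 layers to GPU"))  # False
# The desired outcome after raising ngl: full offload.
print(check_offload("llm_load_tensors: offloaded 33/33 layers to GPU"))  # True
```

A check like this could gate a warning to the user that inference will be slower than expected because of partial offload.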