bug: Cortex-cpp continues to have 1 layer offload to CPU while using GPU #1104
Comments
@nguyenhoangthuan99 I am linking this to #1151 as a sub-issue. Please let me know if this is already solved.
This should be resolved by changing …
Edit: set to the maximum.
@hahuyhoang411 @gabrielle-ong I recommend we create a separate issue in the Models repo for our …
This would solve the issue where people are unsure how many layers are in a model, which results in slow inference due to layers being left on the CPU instead of being fully offloaded to the GPU.
Ah yes, in the https://huggingface.co/cortexso/qwen2.5/blob/main/model.yml
Putting this in investigating first. cc @imtuyethan / @louis-jan to update what is needed for Jan after Cortex's hardware API. This is my naive understanding; please cmiiw / add on.
Marking as complete since ngl can be configured in model.yaml.
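As a rough illustration of the resolution described above, a model.yml override along these lines would request full offload (the `ngl` field name comes from this thread; the exact schema and accepted values may differ between cortex versions, so treat this as a sketch rather than a definitive config):

```yaml
# model.yml (sketch): request that all transformer layers be
# offloaded to the GPU. A 33-layer model needs ngl >= 33; per the
# comment above, setting it to the maximum avoids leaving a layer
# on the CPU when the layer count is unknown.
ngl: 33
```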
Describe the bug
When generating responses using a local LLM, cortex-cpp still appears to rely heavily on the CPU.
https://discord.com/channels/1107178041848909847/1149558035971321886/1253148982188838954
To Reproduce
Expected behavior
Since cortex-cpp is using a local llm and the CUDA toolkit, it should primarily use the GPU for processing and not consume as much CPU.
Desktop
Additional context
The logs indicate that 32 out of 33 layers are offloaded to the GPU, but 1 layer is still processed on the CPU. This behavior will be investigated further.
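To make the symptom above easier to spot, one could scan the engine's load logs for the layer-offload line. This is a hypothetical helper, not part of cortex-cpp; the log format is modeled on llama.cpp-style messages such as "offloaded 32/33 layers to GPU":

```python
import re

def check_offload(log_line: str):
    """Return True if all layers were offloaded to the GPU,
    False if some layers stayed on the CPU, or None if the
    line does not describe layer offloading."""
    m = re.search(r"offloaded (\d+)/(\d+) layers to GPU", log_line)
    if m is None:
        return None
    offloaded, total = int(m.group(1)), int(m.group(2))
    return offloaded == total

# The situation reported in this issue: 32 of 33 layers on GPU.
print(check_offload("llm_load_tensors: offloaded 32/33 layers to GPU"))  # False
# The desired outcome after raising ngl: full offload.
print(check_offload("llm_load_tensors: offloaded 33/33 layers to GPU"))  # True
```

A check like this could gate a warning to the user that inference will be slower than expected because of partial offload.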