bug: Cortex-cpp continues to have 1 layer offload to CPU while using GPU #1104

Closed
Van-QA opened this issue Jun 20, 2024 · 7 comments
Labels
category: engine management Related to engine abstraction type: bug Something isn't working

Comments

@Van-QA
Contributor

Van-QA commented Jun 20, 2024

Describe the bug
When generating responses with a local LLM, cortex-cpp still appears to run partly on the CPU even though GPU acceleration is enabled.
https://discord.com/channels/1107178041848909847/1149558035971321886/1253148982188838954

To Reproduce

  1. Install cortex-cpp and the CUDA toolkit locally.
  2. Turn on GPU acceleration.
  3. Generate responses using a local LLM.
  4. Observe high CPU usage.

Expected behavior
Since cortex-cpp is running a local LLM with the CUDA toolkit installed, it should do most of the processing on the GPU and consume far less CPU.

Desktop

  • OS: Linux

Additional context
The logs indicate that 32 out of 33 layers are offloaded to the GPU, but 1 layer is still processed on the CPU. This behavior will be investigated further.
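For context, llama.cpp-style load logs usually report offload as a fraction such as `offloaded 32/33 layers to GPU`, where (as far as I understand) the total counts the model's repeating layers plus one output layer. A minimal sketch for checking such a line, assuming that log format; the sample line and the helper names `parse_offload` and `layers_left_on_cpu` are illustrative and not taken from the logs attached to this issue:

```python
import re
from typing import Optional, Tuple

# Assumed log format, modeled on llama.cpp-style output; adjust if the engine logs differ.
OFFLOAD_RE = re.compile(r"offloaded\s+(\d+)/(\d+)\s+layers\s+to\s+GPU")


def parse_offload(log_line: str) -> Optional[Tuple[int, int]]:
    """Return (offloaded, total) if the line reports GPU offload, else None."""
    match = OFFLOAD_RE.search(log_line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def layers_left_on_cpu(log_line: str) -> Optional[int]:
    """Number of layers not offloaded, or None if the line has no offload report."""
    parsed = parse_offload(log_line)
    if parsed is None:
        return None
    offloaded, total = parsed
    return total - offloaded


if __name__ == "__main__":
    # Hypothetical sample line matching the 32-of-33 situation described above.
    sample = "llm_load_tensors: offloaded 32/33 layers to GPU"
    print(layers_left_on_cpu(sample))
```

A non-zero result means the configured ngl is lower than the total the engine reports, which matches the 32-of-33 situation described here.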

@Van-QA Van-QA changed the title bug: Cortex-cpp continues to have 1 layer offload to CPU why using GPU bug: Cortex-cpp continues to have 1 layer offload to CPU while using GPU Jun 20, 2024
@imtuyethan imtuyethan added the type: bug Something isn't working label Sep 2, 2024
@freelerobot freelerobot transferred this issue from janhq/jan Sep 5, 2024
@freelerobot freelerobot moved this from Planning to Need Investigation in Menlo Sep 5, 2024
@freelerobot freelerobot added the category: engine management Related to engine abstraction label Sep 6, 2024
@dan-menlo
Contributor

dan-menlo commented Sep 8, 2024

@nguyenhoangthuan99 I am linking this to #1151 as a sub-issue. Please let me know if already solved

@dan-menlo dan-menlo moved this from Need Investigation to Scheduled in Menlo Sep 8, 2024
@vansangpfiev
Contributor

This should be resolved by changing ngl in model.yml
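As a rough illustration of that change, the sketch below rewrites the `ngl` value in the text of a model.yml. It assumes `ngl` is a top-level integer key on its own line; the helper name `set_ngl` and the other keys in the sample are hypothetical. A real YAML library would be more robust than this regex approach, which is used here only to keep the example dependency-free.

```python
import re

# Matches a top-level "ngl:" key and its current value, but not a trailing comment.
NGL_RE = re.compile(r"^ngl:[ \t]*[^\s#]*", flags=re.MULTILINE)


def set_ngl(yaml_text: str, ngl: int) -> str:
    """Replace the value of a top-level `ngl:` key, or append the key if it is missing."""
    if ngl < 0:
        raise ValueError("ngl must be non-negative")
    if NGL_RE.search(yaml_text):
        return NGL_RE.sub(f"ngl: {ngl}", yaml_text, count=1)
    # Key not present: append it, adding a newline separator only when needed.
    sep = "" if yaml_text.endswith("\n") or not yaml_text else "\n"
    return f"{yaml_text}{sep}ngl: {ngl}\n"


if __name__ == "__main__":
    # Hypothetical file contents; only the `ngl` key comes from this thread.
    original = "name: demo-model\nngl: 32\nctx_len: 4096\n"
    print(set_ngl(original, 33))
```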

@github-project-automation github-project-automation bot moved this from Scheduled to Review + QA in Menlo Oct 29, 2024
@gabrielle-ong
Contributor

Edit: set to the maximum ngl

@dan-menlo
Contributor

@hahuyhoang411 @gabrielle-ong I recommend we create a separate issue in the Models repo for our model.yaml to include num_layers. This would:

  • Allow the user to use the slider in Jan to offload to GPU (i.e. "Max")
  • Allow simple bounds checking to prevent out-of-range ints, etc.

This would solve the issue where people are unsure how many layers a model has, which results in slow inference because some layers are left on the CPU instead of being fully offloaded to the GPU.
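A minimal sketch of the bounds check proposed above, assuming model.yaml exposes a `num_layers` value that means "the largest valid ngl for this model". That field does not exist yet per this thread, and the function name `ngl_from_slider` and the `"max"` sentinel are hypothetical:

```python
from typing import Union


def ngl_from_slider(requested: Union[int, str], num_layers: int) -> int:
    """Turn a slider value (an int, or the string "max") into a validated ngl."""
    if num_layers < 0:
        raise ValueError("num_layers must be non-negative")
    if isinstance(requested, str):
        # "Max" on the slider means: offload every layer the model declares.
        if requested.lower() != "max":
            raise ValueError(f"unsupported slider value: {requested!r}")
        return num_layers
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(requested, bool) or not isinstance(requested, int):
        raise TypeError("requested must be an int or the string 'max'")
    if not 0 <= requested <= num_layers:
        raise ValueError(f"ngl {requested} is outside 0..{num_layers}")
    return requested
```

Rejecting out-of-range values, rather than silently clamping them, is one possible design choice here; it surfaces a stale or mistyped ngl early instead of hiding it.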

@hahuyhoang411
Contributor

hahuyhoang411 commented Oct 29, 2024

Ah yes, in the model.yml for each member of a model family I always include the max value for ngl, e.g. Qwen2.5:

https://huggingface.co/cortexso/qwen2.5/blob/main/model.yml


@gabrielle-ong gabrielle-ong added this to the v1.0.2 milestone Nov 5, 2024
@gabrielle-ong gabrielle-ong removed this from the v1.0.2 milestone Nov 12, 2024
@gabrielle-ong
Contributor

Putting in investigating first - cc @imtuyethan / @louis-jan to update what is needed for Jan after Cortex's hardware API

My naive understanding, please cmiiw / add on

  • Pending Cortex hardware detection api: feat: Hardware API #1593
  • Cortex hardware detection API detects the GPU and how many layers to offload
  • Cortex reads model.yaml for max ngl for a model
  • [UI] Jan slider in engine parameters: sets ngl < max ngl (model.yaml) && ngl < detected hardware layers
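The last bullet can be sketched as a simple clamp. This is only an illustration of the flow described above, not the Cortex or Jan implementation: the function and parameter names are hypothetical, and both limits are treated as inclusive upper bounds, which is an assumption.

```python
def effective_ngl(requested: int, model_max_ngl: int, hardware_max_layers: int) -> int:
    """Clamp the slider value to what both the model and the detected hardware allow.

    requested:           value chosen on the (hypothetical) Jan slider
    model_max_ngl:       max ngl read from model.yaml
    hardware_max_layers: layer budget reported by hardware detection (assumed input)
    """
    upper = min(model_max_ngl, hardware_max_layers)
    # Never go below zero, and never exceed either limit.
    return max(0, min(requested, upper))


if __name__ == "__main__":
    # Model declares 33 offloadable layers, but the GPU is assumed to fit only 20.
    print(effective_ngl(requested=33, model_max_ngl=33, hardware_max_layers=20))
```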

@gabrielle-ong
Contributor

gabrielle-ong commented Nov 28, 2024

Marking as complete since ngl can be configured on model.yaml
Linking to separate story #1108 for Cortex then Jan to recommend ngl based on hardware

@gabrielle-ong gabrielle-ong moved this from Investigating to Completed in Menlo Nov 28, 2024
@gabrielle-ong gabrielle-ong moved this from Completed to Discontinued in Menlo Nov 28, 2024
@gabrielle-ong gabrielle-ong moved this from Discontinued to Completed in Menlo Nov 28, 2024
7 participants