epic: Implement Cortex Hardware API for Nvidia #1568

Closed
12 of 18 tasks
Tracked by #3908
vansangpfiev opened this issue Oct 29, 2024 · 10 comments

@vansangpfiev
Contributor

vansangpfiev commented Oct 29, 2024

Implementation for #1165

  • Scope to Nvidia first (AMD, Intel, and Qualcomm are deferred to subsequent sprints)

Tasklist

(Will fill in details when implementing each task)

Hardware API

/engines

  • Modify engine initialization
  • Implement hardware passing to engines

/models/start

  • Handle ngl settings (in /models/start)
  • Implement RAM and VRAM detection
  • Implement fallback logic

Jan

  • System Monitor
  • Model Start
  • Engines Settings?

Bugs to Address

Related bugs:

Out-of-scope

  • AMD
  • Intel
  • Qualcomm/Snapdragon
@vansangpfiev vansangpfiev added the type: epic A major feature or initiative label Oct 29, 2024
@github-project-automation github-project-automation bot moved this to Investigating in Menlo Oct 29, 2024
@vansangpfiev vansangpfiev self-assigned this Oct 29, 2024
@vansangpfiev vansangpfiev moved this from Investigating to In Progress in Menlo Oct 29, 2024
@vansangpfiev
Contributor Author

vansangpfiev commented Oct 31, 2024

Hardware API Documentation

Get hardware information

GET /v1/hardware

Response:

{
  "cpu": {
    "arch": "string",
    "cores": number,
    "model": "string",
    "instructions": ["string"]
  },
  "os": {
    "version": "string",
    "name": "string"
  },
  "ram": {
    "total": number,
    "available": number,
    "type": "string"
  },
  "storage": {
    "total": number,
    "available": number,
    "type": "string"
  },
  "gpus": [
    {
      "model": "string",
      "vram": "string",
      "driver_version": "string"
    }
  ],
  "power": {
    "battery_life": number,
    "charging_status": "string",
    "is_power_saving": boolean
  },
  "monitors": [
    {
      "resolution": "string",
      "refresh_rate": number
    }
  ]
}
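
A minimal sketch of how a client might read the `gpus` array from this response. The sample payload is invented for illustration (only the field names come from the schema above), and `summarize_gpus` is a hypothetical helper, not part of Cortex. The schema does not state units for `vram`, so the sketch passes the string through unchanged.

```python
import json

# Sample payload shaped like the documented response.
# All values are made up for illustration; only the keys come from the schema.
SAMPLE = json.loads("""
{
  "cpu": {"arch": "amd64", "cores": 8, "model": "example-cpu", "instructions": ["avx2"]},
  "ram": {"total": 32768, "available": 16384, "type": "DDR4"},
  "gpus": [
    {"model": "example-gpu-a", "vram": "8192", "driver_version": "551.76"},
    {"model": "example-gpu-b", "vram": "24576", "driver_version": "551.76"}
  ]
}
""")


def summarize_gpus(hardware: dict) -> list[str]:
    """Return one human-readable line per entry in the "gpus" array."""
    lines = []
    for position, gpu in enumerate(hardware.get("gpus", [])):
        lines.append(
            f"#{position}: {gpu['model']} "
            f"(vram={gpu['vram']}, driver={gpu['driver_version']})"
        )
    return lines


for line in summarize_gpus(SAMPLE):
    print(line)
```

A response with no `gpus` key yields an empty list, so a CPU-only machine needs no special handling here.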

Hardware Activation

POST /v1/hardware/activate
{
"gpus": [0, 1]
}
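
For reference, a sketch of building this request from Python with the standard library. The base URL is an assumption about a locally running server; `build_activate_request` is a hypothetical helper. The request is only constructed, not sent, so the snippet runs without a server.

```python
import json
import urllib.request

# Assumption: address of a locally running server; adjust to your setup.
BASE_URL = "http://127.0.0.1:39281"


def build_activate_request(gpus: list[int], base_url: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) a POST /v1/hardware/activate request."""
    body = json.dumps({"gpus": gpus}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/v1/hardware/activate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_activate_request([0, 1])
print(req.get_method(), req.full_url, req.data.decode("utf-8"))
# To actually send it: urllib.request.urlopen(req)
```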

@dan-menlo
Contributor

Thanks @vansangpfiev. Will we be implementing deactivate this sprint?

@vansangpfiev
Contributor Author

Since we have the /activate endpoint, I think adding /deactivate would be redundant.
By default, all GPUs are activated. Any GPU that is not listed in an /activate request is deactivated.
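
The activate-as-the-only-endpoint semantics described here can be sketched roughly as follows. `apply_activation` is a hypothetical function, not the Cortex implementation, and rejecting unknown indices with an error is an assumption.

```python
def apply_activation(detected: list[int], requested: list[int]) -> dict[int, bool]:
    """Return the activation state of every detected GPU after an /activate call.

    GPUs listed in `requested` become active; every other detected GPU
    becomes inactive. An empty `requested` list deactivates all GPUs.
    """
    # Assumption: indices that do not match a detected GPU are rejected.
    unknown = sorted(set(requested) - set(detected))
    if unknown:
        raise ValueError(f"Invalid GPU index provided: {unknown}")
    wanted = set(requested)
    return {gpu: gpu in wanted for gpu in detected}


print(apply_activation([0, 1], [1]))  # only GPU 1 stays active
print(apply_activation([0, 1], []))   # everything is deactivated
```

Because the request always carries the full desired set, the call is idempotent: sending the same list twice leaves the state unchanged.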

@dan-menlo
Contributor

dan-menlo commented Nov 6, 2024

A few notes from our quick call:

Hardware Support

We will need to work with multiple hardware providers, but these can be dealt with in separate sprints:

  • For Intel, can we detect iGPUs, NPUs and CPUs (e.g. Lunar Lake)?
  • For AMD, can we detect iGPUs?
  • For Qualcomm/ARM, can we detect Adreno, etc.?

ngl settings

  • We detect hardware so that we can recommend an ngl setting to users
  • The /models/start API will infer hardware info from the database, and then recommend ngl
  • This is not part of the Hardware API, but /models/start uses hwinfo
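
To make the ngl recommendation idea concrete, here is one possible heuristic, as a sketch only. It is not the logic Cortex uses: the function name, the equal-size-per-layer assumption, and the reserved headroom are all invented for illustration.

```python
def recommend_ngl(free_vram_mib: int, model_size_mib: int, n_layers: int,
                  reserve_mib: int = 512) -> int:
    """Estimate how many layers fit in free VRAM (hypothetical heuristic).

    Assumptions: every layer takes an equal share of the model size, and
    `reserve_mib` of VRAM is held back for context and scratch buffers.
    """
    if n_layers <= 0 or model_size_mib <= 0:
        raise ValueError("n_layers and model_size_mib must be positive")
    usable = max(free_vram_mib - reserve_mib, 0)
    per_layer = model_size_mib / n_layers
    # Never recommend more layers than the model has.
    return min(n_layers, int(usable // per_layer))


print(recommend_ngl(free_vram_mib=2560, model_size_mib=4096, n_layers=32))
```

A result of 0 corresponds to running entirely on CPU, which would be the natural fallback when no VRAM is available.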

@dan-menlo dan-menlo changed the title epic: Implement Cortex Hardware API epic: Implement Cortex Hardware API for Nvidia Nov 8, 2024
@vansangpfiev
Contributor Author

CLI Documentation:

Get hardware information

cortex hardware list --cpu --os --ram --storage --gpu --power --monitors

If no flag is specified, display all hardware information

Activate hardware

cortex hardware activate --gpus [gpu_list]

gpu_list is required; [] deactivates all GPUs

Start model

cortex start [model_id] --gpus [gpu_list]

--gpus is optional; if not specified, all activated GPUs are used

Run

cortex run [model_id] --gpus [gpu_list]

--gpus is optional; if not specified, all activated GPUs are used
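
The `--gpus` defaulting rule for `start` and `run` could be expressed like this. Both helpers are hypothetical, and the bracketed text format (`[0,1]`, `[]`) is inferred from the examples in this thread rather than from a documented grammar.

```python
from typing import Optional


def parse_gpu_list(text: str) -> list[int]:
    """Parse a bracketed list such as "[0,1]" or "[]" into integers."""
    stripped = text.strip()
    if not (stripped.startswith("[") and stripped.endswith("]")):
        raise ValueError(f"expected a bracketed list, got {text!r}")
    inner = stripped[1:-1].strip()
    if not inner:
        return []
    return [int(part) for part in inner.split(",")]


def resolve_gpus(flag_value: Optional[str], activated: list[int]) -> list[int]:
    """Pick the GPUs for a model: the --gpus value if given, else all activated GPUs."""
    if flag_value is None:
        return list(activated)
    return parse_gpu_list(flag_value)


print(resolve_gpus(None, [0, 1]))   # flag omitted: fall back to activated GPUs
print(resolve_gpus("[1]", [0, 1]))  # flag given: use exactly what was requested
```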

@vansangpfiev vansangpfiev moved this from In Progress to Review + QA in Menlo Nov 14, 2024
@gabrielle-ong
Contributor

Nicely done @vansangpfiev! Testing it out now - 2 quick questions:

  1. I can't seem to deactivate the GPU to test without a GPU:
cortex-nightly hardware activate --gpus []
Invalid GPU index provided.
  2. GPU information shows Index=1, ID=0 for the same GPU, which is confusing. Can we standardize on Index like the other fields?

@vansangpfiev
Contributor Author

vansangpfiev commented Nov 14, 2024

Thanks @gabrielle-ong

  1. Let me take a look. Would you mind sharing the cortex.log and cortex-cli.log?
  2. Sure, let me fix it. Actually, the ID is the GPU ID that nvidia-smi reports, and it can differ from the index.

@gabrielle-ong
Contributor

Thanks Sang!
2 - I see, understood. Then it would help to make it clear through the help command that it's the nvidia-smi ID.

1 - It just takes in the empty array; there are no error logs.
cortex-cli.log

20241114 06:32:23.404000 UTC 13784 INFO  CUDA Version: 12.4 - utils/system_info_utils.h:141
20241114 06:32:23.404000 UTC 18228 INFO  Will check for new update, time from last check: 2531 seconds - cortex_upd_cmd.cc:127
20241114 06:32:23.404000 UTC 18228 INFO  Engine release path: https://delta.jan.ai/cortex/latest/version.json - cortex_upd_cmd.cc:138
20241114 06:32:23.545000 UTC 18228 INFO  Got the latest release, update to the config file: v1.0.2-235 - cortex_upd_cmd.cc:175

cortex.log:

20241114 05:38:18.970000 UTC 3728 INFO  Origin:  - main.cc:160
20241114 05:38:19.139000 UTC 12684 INFO  Gpu Driver Version: 551.76 - utils/system_info_utils.h:116
20241114 05:38:19.279000 UTC 12684 INFO  CUDA Version: 12.4 - utils/system_info_utils.h:141
20241114 05:38:19.484000 UTC 12684 INFO  CUDA Version: 12.4 - utils/system_info_utils.h:141
20241114 05:38:19.531000 UTC 12684 INFO  Origin:  - main.cc:160
20241114 05:49:51.989000 UTC 7484 INFO  Origin:  - main.cc:160
20241114 05:49:51.989000 UTC 16792 INFO  activate: {
	"gpus" : 
	[
		0
	]
}
 - hardware.cc:38
20241114 05:49:51.989000 UTC 16792 INFO  No hardware activation changes -> No need to update - hardware_service.cc:211
20241114 05:49:51.989000 UTC 16792 INFO  Origin:  - main.cc:160
20241114 05:50:00.401000 UTC 1384 INFO  Origin:  - main.cc:160
20241114 05:50:00.542000 UTC 4276 INFO  Gpu Driver Version: 551.76 - utils/system_info_utils.h:116
20241114 05:50:00.682000 UTC 4276 INFO  CUDA Version: 12.4 - utils/system_info_utils.h:141
20241114 05:50:00.870000 UTC 4276 INFO  CUDA Version: 12.4 - utils/system_info_utils.h:141
20241114 05:50:00.964000 UTC 4276 INFO  Origin:  - main.cc:160
20241114 05:50:12.567000 UTC 17776 INFO  Origin:  - main.cc:160
20241114 05:50:12.567000 UTC 16092 INFO  activate: {
	"gpus" : 
	[
		1
	]
}
 - hardware.cc:38
20241114 05:50:12.567000 UTC 16092 INFO  Origin:  - main.cc:160
20241114 05:50:37.058000 UTC 15972 INFO  Origin:  - main.cc:160
20241114 05:50:37.058000 UTC 11104 INFO  activate: {
	"gpus" : []
}
 - hardware.cc:38
20241114 05:50:37.058000 UTC 11104 INFO  Origin:  - main.cc:160
20241114 06:32:23.404000 UTC 11156 INFO  Origin:  - main.cc:160
20241114 06:32:23.404000 UTC 6656 INFO  activate: {
	"gpus" : []
}
 - hardware.cc:38
20241114 06:32:23.404000 UTC 6656 INFO  Origin:  - main.cc:160

@vansangpfiev
Contributor Author

@gabrielle-ong Can you please try again with nightly 236?

@gabrielle-ong gabrielle-ong modified the milestones: v1.0.4, v1.0.3 Nov 18, 2024
@gabrielle-ong
Contributor

gabrielle-ong commented Nov 20, 2024

Thanks Sang! Successfully activated and deactivated GPUs with both the CLI and the API, marking as complete

Using GPU: [screenshot]

Using CPU: [screenshot]
