
planning: Cortex Model Compatibility API #1108

Open
2 of 4 tasks
Tracked by #3908
imtuyethan opened this issue Aug 30, 2024 · 7 comments
Assignees
Labels: category: hardware management (Related to hardware & compute), type: epic (A major feature or initiative)
Milestone

Comments

@imtuyethan
Contributor

imtuyethan commented Aug 30, 2024

Goal

  • Cortex can generate a model compatibility prediction based on the user's hardware and model.yaml
  • This should be an API that Jan can call (potentially as part of GET /models and GET /model/<model_id>)
  • Likely linked to planning: Cortex Hardware API #1165
  • Model compatibility should "compute" based on Active Hardware

Related Issues

Original Post

Specs

https://www.notion.so/jan-ai/Hardware-Detection-and-Recommendations-b04bc3109c2846d58572415125e0a9a5?pvs=4

Key user stories

  • Migrate Hardware Settings from Advanced Settings to its own settings page.
  • Missing Hardware Dependencies → Ask users to download dependencies to turn on GPU acceleration.
  • ...

Design

https://www.figma.com/design/DYfpMhf8qiSReKvYooBgDV/Jan-App-(3rd-version)?node-id=5115-60038&t=OgzCw09qXKxZj3DC-4

@dan-menlo
Contributor

dan-menlo commented Sep 1, 2024

Note: This should be driven by the Cortex team, with Jan UI as one of the task items.

I think this is part of a larger "Hardware Detection, Config and Recommendations"

  • Cortex can detect what hardware the user has: CPU, GPU(s)
  • Cortex can be instructed to use certain hardware (e.g. CPU-only, GPU, and which GPU)
  • Cortex can run models with specific hardware (e.g. GPU 1)
  • Cortex can assess hardware compatibility for a given model

@freelerobot freelerobot transferred this issue from janhq/jan Sep 5, 2024
@dan-menlo
Contributor

dan-menlo commented Sep 5, 2024

This is also being discussed in janhq/jan#1089 - let's link both issues. We will need to scope this down to something less ambiguous

  • e.g. Cortex can provide "recommendations" in GET /models, based on activated/detected hardware

@freelerobot freelerobot added the category: hardware management Related to hardware & compute label Sep 6, 2024
@dan-menlo dan-menlo changed the title epic: Hardware Detection and Recommendations epic: Cortex Model Compatibility Prediction Sep 8, 2024
@dan-menlo dan-menlo changed the title epic: Cortex Model Compatibility Prediction epic: Cortex Model Compatibility API Sep 8, 2024
@dan-menlo dan-menlo moved this to Scheduled in Menlo Sep 8, 2024
@dan-menlo
Contributor

Shifting to Sprint 21 to allow the team to focus on Model Folder execution in Sprint 20

@dan-menlo dan-menlo moved this from In Progress to Scheduled in Menlo Sep 29, 2024
@dan-menlo dan-menlo changed the title epic: Cortex Model Compatibility API planning: Cortex Model Compatibility API Oct 21, 2024
@nguyenhoangthuan99
Contributor

nguyenhoangthuan99 commented Nov 11, 2024

To calculate the total memory buffer required for a model, first break it into several parts:

  • Model weights (related to the ngl setting)
  • KV cache reservation (related to ctx_len)
  • Buffer for prompt preprocessing (requires extra GPU VRAM when GPU mode is enabled)

Model weight
The model weights have three parts:

  • Token embeddings: shape (n_vocab, embedding_length). This part is always allocated in CPU RAM and is calculated as n_vocab * embedding_length * 2 * quant_bit/16 bytes. The quant_bit depends on the model's quantization level (e.g. for Q4_K_M, the quant_bit for token embeddings is Q4_K = 4.5 bits).

  • Repeated transformer layers: this is the part the ngl setting adjusts so that the model fits in GPU VRAM.

  • Output layer: this part is treated as 1 layer in the ngl setting. For example, if the model's total ngl is 33 and we set ngl=32, the output layer is loaded into CPU RAM and the repeated transformer layers are loaded onto the GPU. The output layer size is calculated as n_vocab * embedding_length * 2 * quant_bit/16 bytes. The quant_bit for the output layer is usually higher than the model's quantization level, and we can't estimate the exact output-layer quantization for every model (e.g. a Q4_K_M model typically uses a Q6_K output layer).

  • In summary, the equations for model weight are:

RAM = token_embeddings_size + ((total_ngl - ngl) >= 1 ? output_layer_size + (total_ngl - ngl - 1) / (total_ngl - 1) * (total_file_size - token_embeddings_size - output_layer_size) : 0) (bytes)

VRAM = total_file_size - RAM (bytes)
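
A minimal Python sketch of the weight-split equations above (the function and parameter names are illustrative, not Cortex's actual code):

def weight_split(total_file_size, token_embeddings_size, output_layer_size,
                 total_ngl, ngl):
    # All sizes in bytes; ngl = layers offloaded to GPU, total_ngl = the
    # model's total layer count (the output layer counts as one layer).
    ram = token_embeddings_size  # token embeddings always live in CPU RAM
    cpu_layers = total_ngl - ngl
    if cpu_layers >= 1:
        # Output layer stays on CPU, plus its share of the repeated layers.
        repeated = total_file_size - token_embeddings_size - output_layer_size
        ram += output_layer_size + (cpu_layers - 1) / (total_ngl - 1) * repeated
    vram = total_file_size - ram
    return ram, vram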

KV cache

The KV cache size is calculated as follows:

kv_cache_size = (ngl - 1) / 33 * ctx_len / 8192 * hidden_dim / 4096 * quant_bit / 16 * 1 (GB)

quant_bit for the KV cache has 3 modes (f16 = 16 bits, q8_0 = 8 bits, q4_0 = 4.5 bits)
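
A worked example under this formula (the model parameters are illustrative):

def kv_cache_gb(ngl, ctx_len, hidden_dim, quant_bit=16):
    # quant_bit: 16 (f16), 8 (q8_0), or 4.5 (q4_0)
    return (ngl - 1) / 33 * ctx_len / 8192 * hidden_dim / 4096 * quant_bit / 16

# 33-layer model, hidden_dim 4096, 8192 context, f16 cache:
# 32/33 * 1 * 1 * 1 ≈ 0.97 GB
print(kv_cache_gb(33, 8192, 4096))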

Buffer for preprocessing prompt

The buffer for prompt preprocessing is related to n_batch and n_ubatch:

VRAM = min(n_batch, n_ubatch) / 512 * 266 (MiB)

When not all layers (ngl) are offloaded to the GPU, an extra memory buffer must be reserved for the output layer; in that case:

VRAM = min(n_batch, n_ubatch) / 512 * 266 (MiB) + output_layer_size

The default n_batch and n_ubatch for cortex.llamacpp are both 2048.

We also need to reserve an extra 100-200 MiB of RAM for some small buffers during processing.
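
Putting the buffer rule together (a sketch; 2048 is the stated cortex.llamacpp default for n_batch and n_ubatch, and the helper name is hypothetical):

def prompt_buffer_mib(n_batch=2048, n_ubatch=2048,
                      output_layer_size_mib=0.0, all_layers_on_gpu=True):
    vram = min(n_batch, n_ubatch) / 512 * 266
    if not all_layers_on_gpu:
        # Extra reservation for the output layer when ngl < total_ngl.
        vram += output_layer_size_mib
    return vram

# With the defaults: 2048 / 512 * 266 = 1064 MiB of VRAM, plus the extra
# 100-200 MiB of RAM for small working buffers noted above.
print(prompt_buffer_mib())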

@vansangpfiev
Contributor

vansangpfiev commented Nov 21, 2024

API documentation

GET /v1/models

Response

{
  "data": [
    {
      "model": "model_1",
      ...
      "recommendation": {
        "cpu_mode": {
          "ram": number
        },
        "gpu_mode": [{
          "ram": number,
          "vram": number,
          "ngl": number,
          "context_length": number,
          "recommend_ngl": number
        }]
      }
    }
  ]
}
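
A usage sketch against this endpoint (assuming the Cortex server is reachable locally; the host and port 127.0.0.1:39281 are an assumption here, so adjust them to your setup):

import json
import urllib.request

# Fetch the model list and print each model's recommendation block.
with urllib.request.urlopen("http://127.0.0.1:39281/v1/models") as resp:
    body = json.load(resp)

for model in body["data"]:
    print(model["model"], model.get("recommendation"))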

@vansangpfiev
Contributor

CLI Documentation:

Get model list information

cortex model list --cpu_mode --gpu_mode

If no flag is specified, only the model ID is displayed

@TC117

TC117 commented Dec 12, 2024

  • Updated models endpoint
    (screenshot of the updated endpoint response omitted)

@TC117 TC117 moved this from QA to Completed in Menlo Dec 12, 2024
8 participants