
planning: Cortex Model Compatibility API #1108

Open
2 of 4 tasks
Tracked by #3908
imtuyethan opened this issue Aug 30, 2024 · 7 comments
Assignees
Labels: category: hardware management (Related to hardware & compute), type: epic (A major feature or initiative)
Milestone

Comments

@imtuyethan
Contributor

imtuyethan commented Aug 30, 2024

Goal

  • Cortex can generate a model compatibility prediction based on the user's hardware and model.yaml
  • This should be an API that Jan can call (potentially as part of GET /models and GET /model/<model_id>)
  • Likely linked to planning: Cortex Hardware API #1165
  • Model compatibility should "compute" based on Active Hardware

Related Issues

Original Post

Specs

https://www.notion.so/jan-ai/Hardware-Detection-and-Recommendations-b04bc3109c2846d58572415125e0a9a5?pvs=4

Key user stories

  • Migrate Hardware Settings from Advanced Settings to its own settings page.
  • Missing Hardware Dependencies → Ask users to download dependencies to turn on GPU acceleration.
  • ...

Design

https://www.figma.com/design/DYfpMhf8qiSReKvYooBgDV/Jan-App-(3rd-version)?node-id=5115-60038&t=OgzCw09qXKxZj3DC-4

@dan-menlo
Contributor

dan-menlo commented Sep 1, 2024

Note: This should be driven by the Cortex team, with Jan UI as one of the task items.

I think this is part of a larger "Hardware Detection, Config and Recommendations"

  • Cortex can detect what hardware the user has: CPU, GPU(s)
  • Cortex can be instructed to use certain hardware (e.g. CPU-only, GPU, and which GPU)
  • Cortex can run models with specific hardware (e.g. GPU 1)
  • Cortex can assess hardware compatibility for a given model

@freelerobot freelerobot transferred this issue from janhq/jan Sep 5, 2024
@dan-menlo
Contributor

dan-menlo commented Sep 5, 2024

This is also being discussed in janhq/jan#1089 - let's link both issues. We will need to scope this down to something less ambiguous

  • e.g. Cortex can provide "recommendations" in GET /models, based on activated/detected hardware

@freelerobot freelerobot added the category: hardware management Related to hardware & compute label Sep 6, 2024
@dan-menlo dan-menlo changed the title epic: Hardware Detection and Recommendations epic: Cortex Model Compatibility Prediction Sep 8, 2024
@dan-menlo dan-menlo changed the title epic: Cortex Model Compatibility Prediction epic: Cortex Model Compatibility API Sep 8, 2024
@dan-menlo dan-menlo moved this to Scheduled in Menlo Sep 8, 2024
@dan-menlo
Contributor

Shifting to Sprint 21 to allow the team to focus on Model Folder execution in Sprint 20

@dan-menlo dan-menlo moved this from In Progress to Scheduled in Menlo Sep 29, 2024
@dan-menlo dan-menlo changed the title epic: Cortex Model Compatibility API planning: Cortex Model Compatibility API Oct 21, 2024
@nguyenhoangthuan99
Contributor

nguyenhoangthuan99 commented Nov 11, 2024

To calculate the total memory buffer required for a model, first break it into several parts:

  • Model weights (related to the ngl setting)
  • KV cache reservation (related to ctx_len)
  • Buffer for prompt preprocessing (requires extra GPU VRAM when GPU mode is enabled)

Model weight
The model weights have three parts:

  • Token embeddings: shape (n_vocab, embedding_length). This part is always allocated in CPU RAM and is calculated as n_vocab * embedding_length * 2 * quant_bit/16 bytes. The quant_bit depends on the model's quantization level (e.g. for Q4_K_M, the quant_bit for token embeddings is Q4_K = 4.5 bits).

  • Repeated transformer layers: this is the part the ngl setting adjusts so that the model fits in GPU VRAM.

  • Output layer: this part is treated as 1 layer in the ngl setting. For example, if the model's total ngl is 33 and we set ngl=32, the output layer is loaded into CPU RAM and the repeated transformer layers are loaded onto the GPU. The output layer size is calculated as n_vocab * embedding_length * 2 * quant_bit/16 bytes. The quant_bit for the output layer is usually higher than the model's quantization level, and we can't estimate the exact output-layer quantization for every model (e.g. a Q4_K_M model typically uses a Q6_K output layer).

  • In summary, the equations for model weight are:

RAM = token_embeddings_size + ((total_ngl - ngl) >= 1 ? output_layer_size + (total_ngl - ngl - 1) / (total_ngl - 1) * (total_file_size - token_embeddings_size - output_layer_size) : 0) (bytes)

VRAM = total_file_size - RAM (bytes)
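
A minimal Python sketch of the weight-split equations above (the function and parameter names are illustrative, not Cortex's actual code):

def weight_split(total_file_size, token_embeddings_size, output_layer_size,
                 total_ngl, ngl):
    # All sizes in bytes; ngl = layers offloaded to GPU, total_ngl = the
    # model's total layer count (the output layer counts as one layer).
    ram = token_embeddings_size  # token embeddings always live in CPU RAM
    cpu_layers = total_ngl - ngl
    if cpu_layers >= 1:
        # Output layer stays on CPU, plus its share of the repeated layers.
        repeated = total_file_size - token_embeddings_size - output_layer_size
        ram += output_layer_size + (cpu_layers - 1) / (total_ngl - 1) * repeated
    vram = total_file_size - ram
    return ram, vram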

KV cache

The KV cache size is calculated as follows:

kv_cache_size = (ngl - 1) / 33 * ctx_len / 8192 * hidden_dim / 4096 * quant_bit / 16 * 1 (GB)

quant_bit for the KV cache has 3 modes (f16 = 16 bits, q8_0 = 8 bits, q4_0 = 4.5 bits)
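
A worked example under this formula (the model parameters are illustrative):

def kv_cache_gb(ngl, ctx_len, hidden_dim, quant_bit=16):
    # quant_bit: 16 (f16), 8 (q8_0), or 4.5 (q4_0)
    return (ngl - 1) / 33 * ctx_len / 8192 * hidden_dim / 4096 * quant_bit / 16

# 33-layer model, hidden_dim 4096, 8192 context, f16 cache:
# 32/33 * 1 * 1 * 1 ≈ 0.97 GB
print(kv_cache_gb(33, 8192, 4096))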

Buffer for preprocessing prompt

The buffer for prompt preprocessing is related to n_batch and n_ubatch:

VRAM = min(n_batch, n_ubatch) / 512 * 266 (MiB)

When not all layers (ngl) are offloaded to the GPU, an extra memory buffer must be reserved for the output layer; in that case:

VRAM = min(n_batch, n_ubatch) / 512 * 266 (MiB) + output_layer_size

The default n_batch and n_ubatch for cortex.llamacpp are both 2048.

We also need to reserve an extra 100-200 MiB of RAM for some small buffers during processing.
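
Putting the buffer rule together (a sketch; 2048 is the stated cortex.llamacpp default for n_batch and n_ubatch, and the helper name is hypothetical):

def prompt_buffer_mib(n_batch=2048, n_ubatch=2048,
                      output_layer_size_mib=0.0, all_layers_on_gpu=True):
    vram = min(n_batch, n_ubatch) / 512 * 266
    if not all_layers_on_gpu:
        # Extra reservation for the output layer when ngl < total_ngl.
        vram += output_layer_size_mib
    return vram

# With the defaults: 2048 / 512 * 266 = 1064 MiB of VRAM, plus the extra
# 100-200 MiB of RAM for small working buffers noted above.
print(prompt_buffer_mib())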

@vansangpfiev
Contributor

vansangpfiev commented Nov 21, 2024

API documentation

GET /v1/models

Response

{
  "data": [
    {
      "model": "model_1",
      ...
      "recommendation": {
        "cpu_mode": {
          "ram": number
        },
        "gpu_mode": [{
          "ram": number,
          "vram": number,
          "ngl": number,
          "context_length": number,
          "recommend_ngl": number
        }]
      }
    }
  ]
}
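
A usage sketch against this endpoint (assuming the Cortex server is reachable locally; the host and port 127.0.0.1:39281 are an assumption here, so adjust them to your setup):

import json
import urllib.request

# Fetch the model list and print each model's recommendation block.
with urllib.request.urlopen("http://127.0.0.1:39281/v1/models") as resp:
    body = json.load(resp)

for model in body["data"]:
    print(model["model"], model.get("recommendation"))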

@vansangpfiev
Contributor

CLI Documentation:

Get model list information

cortex model list --cpu_mode --gpu_mode

If no flag is specified, only the model ID is displayed

@TC117

TC117 commented Dec 12, 2024

  • Updated models endpoint
    (screenshot of the updated endpoint response omitted)

@TC117 TC117 moved this from QA to Completed in Menlo Dec 12, 2024
8 participants