
planning: Jan's path to cortex.cpp? #3690

Closed
3 tasks
Tracked by #3786
dan-menlo opened this issue Sep 17, 2024 · 11 comments
Labels

  • category: cortex.cpp (Related to cortex.cpp)
  • category: providers (Local & remote inference providers)
  • P1: important (Important feature / fix)
  • type: planning (Discussions, specs and decisions stage)

@dan-menlo
Contributor

dan-menlo commented Sep 17, 2024

Goal

  • Jan should be able to seamlessly move from Nitro to cortex.cpp
  • What is the scope of change?
    • Different inference extensions? (e.g. nitro-extension, and cortex-extension?)
    • Data Structures (old legacy folders, vs. new?)
    • Separation of concerns (e.g. Jan used to be in charge of model downloads, now calls cortex.cpp instead?)
  • What is our strategy?
    • Parallel: support both legacy and new
    • Migration: move from old Nitro to new cortex.cpp?

Tasklist

  • Clearly articulate the architectural change that needs to happen
  • Clearly articulate the scope of changes we need to account for
  • Figure out our migration strategy
@dan-menlo dan-menlo added this to Menlo Sep 17, 2024
@dan-menlo dan-menlo converted this from a draft issue Sep 17, 2024
@dan-menlo dan-menlo changed the title epic: Jan migration from Nitro to cortex.cpp epic: Jan to start using cortex.cpp in addition to Nitro Sep 17, 2024
@dan-menlo dan-menlo changed the title epic: Jan to start using cortex.cpp in addition to Nitro epic: Jan's path to cortex.cpp? Sep 17, 2024
@imtuyethan imtuyethan added the P1: important Important feature / fix label Sep 18, 2024
@louis-jan
Contributor

louis-jan commented Sep 19, 2024

Scope of changes

  • Nitro Inference Extension
  • Model Extension
  • Monitoring Extension

Nitro inference extension

Current implementation

  • Register Models (pre-populate model.json files)
    Any extension that registers models on load pre-populates a model.json at /models/[model-id]/model.json
sequenceDiagram
    participant ModelExtension
    participant BaseExtension
    participant FileSystem

    ModelExtension->>BaseExtension: Register Models
    BaseExtension->>BaseExtension: Pre-populate Data
    BaseExtension->>FileSystem: Write to /models
  • Load Model:
    • Set additional .dll/.so PATH (for engine loading)
    • Hardware Information (to decide engine binary)
    • Run nitro server
    • Parse prompt template
    • Load a GGUF model with its file path and model settings (passed from App)
sequenceDiagram
    participant App
    participant NitroInferenceExtension
    participant NitroServer

    App->>NitroInferenceExtension: loadModel
    NitroInferenceExtension->>NitroInferenceExtension: killProcess
    NitroInferenceExtension->>NitroInferenceExtension: fetch hardware information
    NitroInferenceExtension->>child_process: spawn Nitro process
    NitroInferenceExtension->>NitroServer: wait for server healthy
    NitroInferenceExtension->>NitroInferenceExtension: parsePromptTemplate
    NitroInferenceExtension->>NitroServer: send loadModel request
    NitroInferenceExtension->>NitroServer: wait for model loaded
  • Inference (inheritance - OAIEngine.ts)
    Any extensions inheriting from the Base OAI Engine class will forward requests to their respective inference endpoints.
sequenceDiagram
    participant App
    participant NitroInferenceExtension
    participant NitroServer

    App->>NitroInferenceExtension: inference
    NitroInferenceExtension->>NitroInferenceExtension: transform payload
    NitroInferenceExtension->>NitroServer: chat/completions
     

Possible Changes

| Current ❌ | Upcoming ✅ |
| --- | --- |
| Run Nitro server on model load | Run cortex.cpp daemon service on start |
| Kill nitro process on pre-model-load and pre-app-exit | Keep cortex-cpp alive, daemon process, stop on exit |
| Heavy hardware detection & prompt processing | Just send a request |
| So many requests (check port, check health, model load status) | One request to do the whole thing |
| Mixing of model management and inference - Multiple responsibilities | Single responsibility |
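
A minimal sketch of the daemon-style lifecycle described in the "Upcoming" column: spawn once on app start, reuse the process for every model load, stop on exit. `CortexDaemon`, `ProcessHandle`, and the injected spawn function are hypothetical names for illustration, not Jan or cortex.cpp APIs.

```typescript
// Hypothetical process handle; in Jan this would wrap a real child process.
interface ProcessHandle {
  kill(): void;
}

// Daemon-style lifecycle: spawn once, reuse for every model load, stop on exit.
class CortexDaemon {
  private handle: ProcessHandle | null = null;
  private spawnProcess: () => ProcessHandle;

  constructor(spawnProcess: () => ProcessHandle) {
    this.spawnProcess = spawnProcess;
  }

  // Idempotent: a second call reuses the running process.
  start(): void {
    if (this.handle === null) {
      this.handle = this.spawnProcess();
    }
  }

  isRunning(): boolean {
    return this.handle !== null;
  }

  // Called once, on app exit.
  stop(): void {
    if (this.handle !== null) {
      this.handle.kill();
      this.handle = null;
    }
  }
}

// Example wiring with a fake spawner that only counts calls.
let spawnCount = 0;
let killCount = 0;
const daemon = new CortexDaemon(() => {
  spawnCount += 1;
  return { kill: () => { killCount += 1; } };
});

daemon.start(); // app start
daemon.start(); // a later model load: no new process is spawned
daemon.stop();  // app exit
```

Unlike the current flow, nothing here kills and respawns the server before each model load.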

Model extension

Current implementation

  • Download Model (ModelFile as payload)
  • Delete Model (ModelFile as payload)
  • Get Models (Scan through models folder and return ModelFile[])
  • Import Model (Generate ModelFile and download)
  • Fetch HF Repo Data (for HF model import selection)

App retrieves pre-populated models:

sequenceDiagram

App ->> ModelExtension: get available models
ModelExtension ->> FS: read /models
FS --> ModelExtension : ModelFile

App downloads a model:

sequenceDiagram

App ->> ModelExtension: downloads
ModelExtension ->> Networking : request
Networking ->> FileSystem : filestream
Networking --> ModelExtension : progress


App imports a model

sequenceDiagram

App ->> ModelExtension: downloads
ModelExtension ->> Networking : request
ModelExtension ->> model.json :generate
Networking ->> FileSystem : filestream
Networking --> ModelExtension : progress

App deletes a model

graph LR

App --> |remove| Model_Extension
Model_Extension --> |FS unlink| /models/model/__files__

Possible Changes

| Current ❌ | Upcoming ✅ |
| --- | --- |
| Implementation - Depends on FS | Abstraction - API Forwarding |
| List Available Models: Scan through Model Folder | GET /models |
| Delete: Unlink FS | DELETE /models |
| Download: Download | POST & Progress /models/pulls |
| Broken Model Import - Using default model.json | cortex.cpp handles the model metadata |
| Model prediction depends on model size & available RAM/VRAM only | cortex.cpp predicts based on hardware and model.yaml |
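
A rough sketch of what "API forwarding" could look like for the model extension: each operation becomes one request descriptor instead of file-system work. The paths follow the endpoints assumed in this thread (the pull path appears as both /models/pull and /models/pulls; the sketch uses the former). The `ApiRequest` type, the payload shape, and passing the model ID as a path segment on DELETE are assumptions.

```typescript
// Minimal description of an HTTP call; a real extension would hand this to fetch.
interface ApiRequest {
  method: "GET" | "POST" | "DELETE";
  path: string;
  body?: Record<string, string>;
}

// List available models: replaces scanning the models folder.
function listModels(): ApiRequest {
  return { method: "GET", path: "/models" };
}

// Download a model: replaces the extension's own download logic.
// The { model } payload shape is hypothetical.
function pullModel(modelId: string): ApiRequest {
  return { method: "POST", path: "/models/pull", body: { model: modelId } };
}

// Delete a model: replaces the FS unlink.
// Passing the ID as an encoded path segment is an assumption.
function deleteModel(modelId: string): ApiRequest {
  return { method: "DELETE", path: "/models/" + encodeURIComponent(modelId) };
}
```

The extension keeps no knowledge of folder layout or model.json; it only maps app actions to requests.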

System Monitoring extension

Current implementation

  • Get GPU Settings
  • Get System Information

App gets resources information

graph LR

App --> |getResourcesInfo| Model_Extension
Model_Extension --> |fetch| node-os-utils
Model_Extension --> |getCurrentLoad| nvidia-smi

Possible Changes

| Current ❌ | Upcoming ✅ |
| --- | --- |
| Implementation - Depends on FS & CMD | Abstraction - API Forwarding |
| Execute CMD | GET - Hardware Information Endpoint |
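
To illustrate the forwarding idea for monitoring, here is a sketch that turns a hardware-information response into what a resource monitor needs. The `HardwareInfo` shape and its field names are assumptions; the actual endpoint contract is not defined in this thread.

```typescript
// Hypothetical response shape for a hardware information endpoint.
interface HardwareInfo {
  os: string;
  cpu: string;
  ramTotalMb: number;
  ramUsedMb: number;
  gpus: { name: string; vramTotalMb: number; vramUsedMb: number }[];
}

// What the UI needs to render a resource monitor.
interface ResourceSummary {
  ramUsagePercent: number;
  vramUsagePercent: number | null; // null when no GPU is reported
}

// Rounded percentage; returns 0 when the total is unknown or zero.
function percent(used: number, total: number): number {
  return total > 0 ? Math.round((used / total) * 100) : 0;
}

// Aggregates VRAM across all reported GPUs.
function summarize(info: HardwareInfo): ResourceSummary {
  const vramTotal = info.gpus.reduce((sum, gpu) => sum + gpu.vramTotalMb, 0);
  const vramUsed = info.gpus.reduce((sum, gpu) => sum + gpu.vramUsedMb, 0);
  return {
    ramUsagePercent: percent(info.ramUsedMb, info.ramTotalMb),
    vramUsagePercent: info.gpus.length > 0 ? percent(vramUsed, vramTotal) : null,
  };
}
```

With this split, the app never shells out to nvidia-smi or reads OS utilities itself.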

Overview

Current ❌
Image
Upcoming ✅
Image

Assumption

  • cortex.cpp bundles multiple engines (different CPU instructions and CUDA versions)
  • cortex.cpp supports /models APIs
    • GET: /models (available, active status, compatibility prediction)
    • POST: /models/pull (& progress?)
    • DELETE: /models
  • cortex.cpp supports /hardware-information API

Challenges of moving Nitro to cortex.cpp

  • Different Data (Folder & File) structures
  • Backward / Forward compatibility

The migration

  • How to seamlessly move from Nitro to cortex.cpp, where:
    • cortex.cpp works with new Data Folder structure
    • cortex.cpp works with model.yaml
    • cortex.cpp works with models.list
  • How to maintain the data folder when users switch back to older versions?
    • Older versions rely on model-extension, which searches for a model.json file within the Data Folder.
    • Newer versions rely on cortex-extension, which searches for a model.yaml file within the Data Folder.

Let's think about a couple of our principles.

  1. We don't remove or manipulate user data.
  2. Rollback should always work.
  3. Minimal migration

What are some of the main concerns here?

  1. Can we use model.json and model.yaml side by side?
    1. We should, since the model folder can contain anything: README.md, .gitignore, GGUF, model.yaml, model.json.
    2. Older versions will still function with legacy model.json files.
    3. Newer versions will work with the latest model.yaml files.
  2. How to sync between those two?
    1. It's hard to sync between those two, since different structures could break the app.
    2. We just try to migrate once when there is no models.list available. This is a good flag for migration triggering.
    3. After migrating, each app version works independently with its own model file format.
  3. How about model pre-population? In other words, Model Hub.
    1. Model pre-population is an anti-pattern. Pre-populated models do not work with versioning, and they create unwanted data that confuses users. What happens when our Model Hub lists thousands of models?
    2. We implemented model import, which replaces the need for a model file. Users can just import with the HF repo ID. Users do not have any reason to duplicate or edit a pre-populated model.json.
    3. Model listing can be done from the extension.
    4. In short, in the next version, we don't pre-populate unwanted files to the Data Folder. Only when users decide to download.
    5. Users deleting a model means deleting the persisted model.yaml & model files.
  4. How do other extensions work with their models? E.g., OpenAI
    1. Remote models can be populated during the build, not persisted. registerModels now persists in-memory model DTO.
    2. We don't pre-populate remote models, which is not necessary. Users are better off setting them from Extension Settings. It's more or less Extension configuration, not Model population.
  5. Migration complexity and UX
    1. We don't convert model.json to model.yaml. Instead, we import with a symlink. This could be faster and avoids adding redundant new logic to Jan. It is a lightweight migration with less risk. Maintaining the Model ID is key; otherwise, all threads break.
    2. We don't move any files (e.g., GGUF), since that could make the migration take a long time.
    3. How about new/manual adding GGUFs? The model symlink feature is always there for that.
    4. There were bad migration experiences in the past that we can avoid, such as:
      1. Migrating all pre-populated models
      2. Heavy file movement that made the migration take a long time
      3. Migrating everything at once
    5. Now we just migrate downloaded models:
      1. Import downloaded models only as symlinks (no file movement)
      2. Don't update the ID, which would cause data inconsistency
      3. Another thought: Do we really need to wait for model.yaml creation during migration?
        1. cortex.cpp can work with the models.list to provide available models?
        2. model.yaml generation is an asynchronous operation so:
          1. It generates model.yaml as soon as the user tries to get or load a model.
          2. It generates model.yaml as soon as the user tries to import a model.
          3. Don't block the client GUI; the model list can be built from the models.list contents alone. Any further operation on a given model can generate its model.yaml later.
          4. The client will prioritize the active thread's model, then the others, so that users' working threads are not blocked.
          5. If something goes wrong, the GGUF file will still be there, and the model.yaml can be generated later by other operations. The model.yaml file is not strictly required to be available; it is just a cache of the model file metadata?
  6. Better cache mechanism
    1. Model list and detail previously read from the File System; now they send an API request to cortex.cpp.
    2. To prevent slow loading, the client should cache accordingly on the frontend.
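
The migration rules above (downloaded models only, symlink instead of moving files, keep the Model ID, active thread's model first) can be sketched as a pure planning function. `LegacyModel`, `ImportRequest`, and `planMigration` are hypothetical names, and the request shape is an assumption rather than a cortex.cpp contract.

```typescript
// Hypothetical view of a model as the legacy app knows it.
interface LegacyModel {
  id: string;          // must be preserved, otherwise existing threads break
  ggufPath: string;    // file stays where it is; only a symlink is created
  downloaded: boolean; // pre-populated but never downloaded models are skipped
}

// Hypothetical import-by-symlink request.
interface ImportRequest {
  modelId: string;
  path: string;
  symlink: true;
}

function planMigration(models: LegacyModel[], activeModelId: string | null): ImportRequest[] {
  const requests = models
    .filter((model) => model.downloaded)
    .map((model): ImportRequest => ({ modelId: model.id, path: model.ggufPath, symlink: true }));

  // Put the active thread's model first so the user's current work is unblocked early.
  const active = requests.filter((request) => request.modelId === activeModelId);
  const rest = requests.filter((request) => request.modelId !== activeModelId);
  return active.concat(rest);
}

// Example: one pre-populated model and two downloaded ones.
const plan = planMigration(
  [
    { id: "prepopulated", ggufPath: "/models/prepopulated/model.gguf", downloaded: false },
    { id: "model-b", ggufPath: "/models/model-b/model.gguf", downloaded: true },
    { id: "model-c", ggufPath: "/models/model-c/model.gguf", downloaded: true },
  ],
  "model-c"
);
```

No model.yaml is produced here; under the proposal that would be generated lazily by cortex.cpp.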

Summary

In short, the entire migration process is just creating symlinks to the downloaded models from models.list. No model.yaml or folder manipulation is involved. It should be done instantly?

Migration indicator: whether models.list exists (migrate once when it is missing).

Don't pre-populate models. Remote Extensions work with their own settings instead of pre-populated models. The Cortex Extension registers the models available to pull (templates) in memory.

cortex.cpp is a daemon process that should be started alongside the app.

Jan migration from 0.5.x to 0.6.0
Image

@louis-jan
Contributor

louis-jan commented Sep 19, 2024

Bundled Engines

Is it possible for cortex-cpp to bundle multiple engines but expose only one gateway?

E.g.
The client requests to load a llama.cpp model, and cortex.cpp detects hardware compatibility and runs an efficient binary.

So:

  • Clients do not need to send any extra engine parameters, or only minimal ones (type).
  • Clients don't need to parse prompt templates; that's something the model should handle.
  • cortex.cpp owns the model metadata, allowing it to operate independently.
  • cortex.cpp hides the complex binary distribution behind a simple interface.
  • GPU ON/OFF - GPU Selections can be done via engine /settings?

Eventually, that's all it needs to work with – the Model ID (aka model name).
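
As an illustration of "all it needs is the Model ID", the request payloads could shrink to something like the following. The chat payload follows the familiar OpenAI-style chat/completions shape; the load payload and both builder functions are assumptions, not the actual cortex.cpp schema.

```typescript
// Hypothetical load request: no file path, prompt template, or engine settings.
interface LoadModelRequest {
  model: string;
}

interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

// OpenAI-style chat completion request, keyed by the same Model ID.
interface ChatCompletionRequest {
  model: string;
  messages: ChatMessage[];
  stream: boolean;
}

function buildLoadRequest(modelId: string): LoadModelRequest {
  return { model: modelId };
}

function buildChatRequest(
  modelId: string,
  messages: ChatMessage[],
  stream: boolean = true
): ChatCompletionRequest {
  return { model: modelId, messages: messages, stream: stream };
}

// Example: the client only ever refers to the model by its ID.
const loadPayload = buildLoadRequest("tinyllama");
const chatPayload = buildChatRequest("tinyllama", [{ role: "user", content: "Hello" }]);
```

Prompt templating and engine selection are absent on purpose: under this proposal they would live behind the gateway.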

Simplify model load / chat completions request
Image

@louis-jan
Contributor

louis-jan commented Sep 19, 2024

Incremental Path

  1. We do what's not related to cortex.cpp first - Remote Extensions & Pre-populated Models
    1. Rather than pre-populate, enhance the model configurations.
    2. registerModels now lists the models available for download and does not persist model.json.
  2. Better data caching
    1. Data retrieved from extensions should be cached on the frontend for subsequent loads.
    2. Reduce direct API requests and perform more data synchronization operations.
    3. Implementing a good cache layer would prevent a bad user experience during migration later: the app doesn't need to scan through the models list, but can just dump cached data and import right away. It won't interrupt users' working threads, since asynchronous operations take care of data persistence (model.yaml) and model load requests are typically long-delayed responses.
  3. Minimal Migration Steps (cortex-cpp ready)
    1. Generate models.list based on cached data; there is no need to scan the Model Folder, which can be costly.
    2. Send model import or symlink requests to generate models.list. It would be great if cortex.cpp could support batch symlinks (import), as that would only require creating a models.list file. The model.yaml files can be generated asynchronously. (This would also cover the case where the user edits models.list manually.)
    3. Update extensions to redirect requests.
    4. The worst-case scenario is when users update from significantly older versions that lack the cache improvements. In that case, go through the model folders and send import requests during the app update.
sequenceDiagram
    participant App as "App"
    participant Models as "Models"
    participant ModelList as "Model List"
    participant ModelYaml as "Model YAML"

    App->>Models: import
    activate Models
    Models->>ModelList: update models.list
    activate ModelList
    ModelList->>Models: return data
    deactivate ModelList
    Models->>ModelYaml: generate (async)
    activate ModelYaml
    Note right of ModelYaml: generate model.yaml asynchronously
    ModelYaml->>Models: (async) generated
    deactivate ModelYaml
    deactivate Models
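
The frontend cache from step 2 can be sketched as a small wrapper that calls the extension once and serves later reads from memory. `ModelListCache` and the injected loader are hypothetical; the loader is synchronous here only to keep the sketch short, whereas a real extension call would be asynchronous.

```typescript
interface CachedModel {
  id: string;
  engine: string;
}

// Caches the model list so repeated reads do not hit the extension or API again.
class ModelListCache {
  private models: CachedModel[] | null = null;
  private loader: () => CachedModel[];

  constructor(loader: () => CachedModel[]) {
    this.loader = loader;
  }

  // First call loads; later calls return the cached list.
  list(): CachedModel[] {
    if (this.models === null) {
      this.models = this.loader();
    }
    return this.models;
  }

  // Call after a download, import, or delete so the next read refreshes.
  invalidate(): void {
    this.models = null;
  }
}

// Example with a fake loader that counts how often it is called.
let loadCount = 0;
const cache = new ModelListCache(() => {
  loadCount += 1;
  return [{ id: "tinyllama", engine: "nitro" }];
});

cache.list();       // loads from the extension
cache.list();       // served from memory
cache.invalidate(); // e.g. after a model import
cache.list();       // loads again
```

During migration, the same cached list is what would be dumped and turned into import requests, without scanning the model folders.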

@freelerobot
Contributor

freelerobot commented Sep 29, 2024

This is really well thought through.

Questions @louis-jan :

  1. What are the specific attributes Jan needs from the get hardware info endpoint?
    • OS info
    • CPU info
    • RAM size (total, utilized)
    • GPU SKU
    • VRAM size (total, utilized)
    • What else? (do you need storage info, or additional unforeseen stats?)
  2. Do you need hardware configuration endpoints? i.e. does Jan ever need to let users change some hardware level configuration.
  3. Dumb question, but do you need engine status endpoints? Or is the level of abstraction good enough at the model level
  4. What are the Cortex sub-process endpoints needed? /keepalive /healthcheck?

@louis-jan
Contributor

@0xSage All great questions. Related issue where we discussed the Hardware Info endpoint: janhq/cortex.cpp#1165 (comment)

  1. Jan does not let users change hardware-level configuration, only Engine settings (CPU/GPU mode, CPU threads ...)
  2. Engine status and model status were previously supported in cortex.cpp, but I don't see a clear use case on Jan's side, such as implementing a switch or update mechanism for the engine.
  3. /healthcheck is needed, and is implemented in cortex.cpp

The one thing blocking this is download progress sync, for which we aligned on a socket approach.

@freelerobot
Copy link
Contributor

freelerobot commented Sep 30, 2024 via email

@louis-jan
Contributor

louis-jan commented Oct 2, 2024

Package cortex.cpp into Jan app:

  1. The app will bundle all available cortex.cpp binaries, just like the cortex.cpp installer.
  2. During update, it executes cortex engines install --sources [path_to_binaries], or does the same via API; cortex will detect hardware and install accordingly (from sources).
  3. App Settings
  • Use cortex engines get to see the installed variant -> Update UI accordingly, e.g. GPU On/Off.
  • cortex.cpp will include a flag to select the variant, letting users choose their GPU from the app settings. It will then install the appropriate cortex engine version. Since all binaries are included, it's simply about switching between variants. (idea: Options to select CPU and GPU binaries for llama-cpp engine cortex.cpp#1390)
  • Investigating multiple GPU support... TBD

Pros

  • No internet connection is required.
  • The same user experience.

Cons

  • The package becomes a bit larger because it now includes the CUDA DLLs. However, no additional downloads are needed.

cortex.cpp configurations on spawn

  • Config path
  • Host & Port
  • Log paths
  • Data folder path
  • HF Token (via Env?)
  • Proxy Configs (via Env?) - KIV for now
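
One way to keep the spawn step predictable is to validate these settings before starting the process. The sketch below only checks the values listed above; `CortexSpawnConfig`, its field names, and `validateSpawnConfig` are hypothetical, and how the values are actually passed to cortex.cpp (flags, config file, or environment) is left open.

```typescript
// Hypothetical container for the settings the app passes when spawning cortex.cpp.
interface CortexSpawnConfig {
  configPath: string;
  host: string;
  port: number;
  logPath: string;
  dataFolderPath: string;
  hfToken?: string; // optional; the thread suggests passing it via environment
}

// Returns a list of human-readable problems; an empty list means the config is usable.
function validateSpawnConfig(config: CortexSpawnConfig): string[] {
  const problems: string[] = [];
  const required: [string, string][] = [
    ["configPath", config.configPath],
    ["host", config.host],
    ["logPath", config.logPath],
    ["dataFolderPath", config.dataFolderPath],
  ];
  required.forEach((entry) => {
    if (entry[1].trim() === "") {
      problems.push(entry[0] + " is empty");
    }
  });
  const port = config.port;
  if (Math.floor(port) !== port || port < 1 || port > 65535) {
    problems.push("port must be an integer between 1 and 65535");
  }
  return problems;
}

// Example values are placeholders, not Jan defaults.
const spawnProblems = validateSpawnConfig({
  configPath: "/tmp/jan/cortexrc",
  host: "127.0.0.1",
  port: 8080,
  logPath: "/tmp/jan/logs",
  dataFolderPath: "/tmp/jan/data",
});
```

Failing early here gives a clearer error than a daemon that starts and then cannot find its data folder.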

@louis-jan
Contributor

louis-jan commented Oct 2, 2024

Seamless Models Imports & Run

Since we no longer use models.list, it's now simply based on the engine name for imports.

  1. App 0.6.x opens (given that users have updated to 0.5.5)
  2. Trigger get models on load as usual (extension - asynchronously - background).
  3. Go through the downloaded models (from the cache) — very fast since only previously downloaded models are involved.
  4. Send model import requests for nitro models.
  5. Persist the cache.

This operation runs asynchronously and won't affect the UX, since it works with cached data. Even if the model is not imported yet, it should still function normally (with a stateless model load endpoint). There will be a door for broken requests or attempts.

In case users are updating from a very old version, we run a scan on model.json and persist the cache (as in the current legacy logic) -> Continue with (1)
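
A sketch of the import pass in steps 3 to 5, assuming the cache records each model's engine name and whether it has already been imported. `CachedModel`, the `imported` flag, and the injected `sendImport` callback are hypothetical; a real implementation would send the request to cortex.cpp asynchronously.

```typescript
interface CachedModel {
  id: string;
  engine: string;
  imported: boolean; // hypothetical flag persisted with the cache
}

// Sends one import request per nitro model that has not been imported yet and
// returns the updated cache, ready to be persisted. A failed import stays
// unmarked, so it is retried on the next launch.
function runImportPass(
  cache: CachedModel[],
  sendImport: (modelId: string) => boolean
): CachedModel[] {
  return cache.map((model) => {
    if (model.engine !== "nitro" || model.imported) {
      return model;
    }
    const ok = sendImport(model.id);
    return { id: model.id, engine: model.engine, imported: ok };
  });
}

// Example: only the first model needs an import request.
const sent: string[] = [];
const updated = runImportPass(
  [
    { id: "llama3-8b", engine: "nitro", imported: false },
    { id: "gpt-4o", engine: "openai", imported: false },
    { id: "mistral-7b", engine: "nitro", imported: true },
  ],
  (modelId) => {
    sent.push(modelId);
    return true;
  }
);
```

Because the pass is driven by the cache, it never touches the model folders and the Model IDs stay unchanged.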

@dan-menlo
Contributor Author

Current issues being faced:

  • Jan/Cortex Data Folder issues (backward compatible)

@dan-menlo dan-menlo changed the title epic: Jan's path to cortex.cpp? architecture: Jan's path to cortex.cpp? Oct 14, 2024
@freelerobot freelerobot pinned this issue Oct 14, 2024
@imtuyethan imtuyethan removed this from the v0.5.7 milestone Oct 15, 2024
@dan-menlo dan-menlo changed the title architecture: Jan's path to cortex.cpp? discussion: Jan's path to cortex.cpp? Oct 17, 2024
@dan-menlo dan-menlo moved this from In Progress to Planning in Menlo Oct 17, 2024
@dan-menlo
Contributor Author

@dan-homebrew: will create an Implementation Issue:
@louis-jan Can you link the implementation-related issues here:

@louis-jan
Contributor

I think we can close this and follow up on #3895

@github-project-automation github-project-automation bot moved this from Planning to Review + QA in Menlo Nov 26, 2024
@imtuyethan imtuyethan added this to the v0.5.8 milestone Nov 27, 2024
@imtuyethan imtuyethan moved this from Review + QA to Completed in Menlo Nov 27, 2024