
feat: cortex run model(:gguf) #1076

Closed · 5 tasks · Tracked by #3614
vansangpfiev opened this issue Aug 21, 2024 · 13 comments · Fixed by #1351

@vansangpfiev (Contributor) commented Aug 21, 2024

Goals

  • cortex run should load and run the model
  • Cortex should run the model with relevant parameters
  • Cortex should display informative error messages if it fails

Tasklist

  • cortex run model:llamacpp
  • Ensure the dylib issue is fixed (bug: libengine.dylib not found #953)
  • Decision: should we uncouple engine install? (e.g. an error message directing the user to install the engine with a separate command)
  • Informative error messages for major failure modes (e.g. OOM)
  • Tests
  • Docs
@vansangpfiev vansangpfiev self-assigned this Aug 22, 2024
@vansangpfiev (Contributor, Author) commented Aug 29, 2024

run is a chain command; it will do the following (a minimal sketch follows below):

  • models pull if the model does not exist
  • engines install if the engine does not exist
  • models start
  • chat
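
For illustration only, a minimal sketch of that chain with stubbed-out steps; this is not the actual cortex.cpp code, and every function name here is hypothetical:

```cpp
// Hypothetical sketch of the `run` chain -- not the actual cortex.cpp code.
// All names are illustrative; the real steps would call model/engine services.
#include <iostream>
#include <string>

bool ModelExists(const std::string&) { return false; }   // stub
bool EngineExists(const std::string&) { return false; }  // stub
void PullModel(const std::string& m) { std::cout << "models pull " << m << "\n"; }
void InstallEngine(const std::string& e) { std::cout << "engines install " << e << "\n"; }
void StartModel(const std::string& m) { std::cout << "models start " << m << "\n"; }
void Chat(const std::string& m) { std::cout << "chat with " << m << "\n"; }

void RunCommand(const std::string& model, const std::string& engine) {
  if (!ModelExists(model)) PullModel(model);         // models pull if missing
  if (!EngineExists(engine)) InstallEngine(engine);  // engines install if missing
  StartModel(model);                                 // models start
  Chat(model);                                       // chat
}

int main() { RunCommand("tinyllama:gguf", "llamacpp"); }
```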

@imtuyethan imtuyethan transferred this issue from another repository Sep 2, 2024
@freelerobot freelerobot added the category: model running, category: model management, and P1: important labels Sep 6, 2024
@freelerobot freelerobot changed the title from "cortex run" to "feat: cortex run" Sep 6, 2024
@dan-menlo dan-menlo added and removed the category: model management label Sep 6, 2024
@dan-menlo dan-menlo changed the title from "feat: cortex run" to "feat: cortex run for llama.cpp" Sep 8, 2024
@dan-menlo dan-menlo changed the title from "feat: cortex run for llama.cpp" to "feat: cortex run model:gguf" Sep 8, 2024
@dan-menlo dan-menlo added the type: epic label Sep 8, 2024
@freelerobot (Contributor) commented:

see: #1303

@dan-menlo dan-menlo changed the title from "feat: cortex run model:gguf" to "feat: cortex run model(:gguf)" Sep 26, 2024
@gabrielle-ong (Contributor) commented:

(screenshot attached)

Can I check:

  • I've downloaded tinyllama:gguf
  • cortex run tinyllama asks me to choose the variant -> I chose gguf
  • it re-initiates the download of tinyllama:gguf

@dan-menlo (Contributor) commented Sep 26, 2024

@vansangpfiev @0xSage Can I ask about our understanding of cortex run? I thought we had decided on run being a non-interactive command, to optimize for DevOps engineers.

I thought the previous Cortex JS implementation had the following logic:

# Run command
cortex run tinyllama (runs in background, advises user to use `cortex chat` for interactive)
cortex run tinyllama --chat (for interactive chat)
cortex chat tinyllama (shorthand for run and --chat, for interactive shell)

# Chat completion
cortex chat completion tinyllama -m <message>

The idea was that if you are running Cortex in a docker container, you would not want to have an interactive chat shell. This would be similar to docker run, i.e. runs in the background.
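
To make the docker-run analogy concrete, a minimal sketch of the dispatch, assuming a hypothetical CLI entry point (not the actual cortex.cpp code):

```cpp
// Hypothetical sketch of the docker-run-style split -- not the actual CLI code.
// `run` loads the model and returns; `--chat` additionally attaches a shell.
#include <iostream>
#include <string>
#include <vector>

void StartModelInBackground(const std::string& model) {
  std::cout << model << " model started successfully. Use `cortex chat "
            << model << "` for interactive chat shell.\n";
}

void InteractiveChat(const std::string& model) {
  std::cout << "entering interactive chat with " << model << "\n";
}

int main(int argc, char** argv) {
  std::vector<std::string> args(argv + 1, argv + argc);
  if (args.empty()) return 1;                 // usage: <model> [--chat]
  const std::string model = args[0];
  const bool chat = args.size() > 1 && args[1] == "--chat";
  StartModelInBackground(model);              // always load the model first
  if (chat) InteractiveChat(model);           // attach a shell only when asked
}
```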

I'm open to either path - however, I think we need to think from a UX point of view on this.

@vansangpfiev (Contributor, Author) commented Sep 26, 2024

(Quoting @gabrielle-ong above:)

  • I've downloaded tinyllama:gguf
  • cortex run tinyllama asks me to choose the variant -> I chose gguf
  • it re-initiates the download of tinyllama:gguf

I think the model id is tinyllama:gguf, so when you run cortex run tinyllama, we don't have tinyllama as a model id yet.
You need to run cortex run tinyllama:gguf instead.

@vansangpfiev (Contributor, Author) commented Sep 26, 2024

(Quoting @dan-menlo above:) "Can I ask about our understanding of cortex run? I thought we had decided on run being a non-interactive command, to optimize for DevOps engineers. [...]"

@dan-homebrew I'm good with the old CortexJS design. Let me change the implementation.
However, I would like to confirm:

  • Will cortex chat also be a command chain, i.e. one that includes cortex start and cortex models start?
  • We don't have a cortex chat completion command, correct?

Actually, having two commands for the same task does not make sense to me:

cortex run tinyllama --chat (for interactive chat)
cortex chat tinyllama (shorthand for run and --chat, for interactive shell)

@dan-menlo (Contributor) commented Sep 27, 2024

@vansangpfiev This is how I see it:

cortex run

  • This is similar to docker run, i.e. runs it in background
  • It should have a success/error message
  • It should advise the user to use chat for interactive shell
# Run command
> cortex run tinyllama 

tinyllama model started successfully. Use `cortex chat tinyllama` for interactive chat shell. 

cortex chat

  • This is syntactic sugar on top of cortex run --chat
  • You should be able to re-use the logic from cortex run (i.e. don't reimplement; see the sketch below)
  • Thus, it is also a command chain that includes cortex start and cortex models start
# Same thing
cortex chat tinyllama (shorthand for `cortex run tinyllama --chat`)
cortex run tinyllama --chat (re-uses the logic of cortex run)
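
A minimal sketch of the sugaring, with hypothetical function names, to show the single shared code path (not the actual implementation):

```cpp
// Hypothetical sketch of the sugaring -- not the actual implementation.
// One shared chain, two entry points, so nothing is reimplemented.
#include <iostream>
#include <string>

// The existing `run` chain: cortex start, models start, optional shell.
void RunChain(const std::string& model, bool interactive) {
  std::cout << "cortex start; models start " << model << "\n";
  if (interactive) std::cout << "entering chat shell\n";
}

void RunCommand(const std::string& model, bool chat_flag) {
  RunChain(model, /*interactive=*/chat_flag);  // `cortex run [--chat]`
}

void ChatCommand(const std::string& model) {
  RunCommand(model, /*chat_flag=*/true);       // `cortex chat` == `run --chat`
}

int main() {
  RunCommand("tinyllama", false);  // background run
  ChatCommand("tinyllama");        // interactive sugar
}
```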

cortex chat completion

  • This is in reference to @0xSage's previous post, where Cortex could do a single inference command with chat -m
  • Her rationale was that DevOps and QA may use it to do single test outputs (imho this is not common)
  • Since we have already implemented it, I recommend this be a cortex chat completion command, in line with the API call
cortex chat completion tinyllama -m <message> 
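
For illustration, a minimal sketch of how the one-shot command could map onto the OpenAI-compatible chat completions request body; the endpoint path follows the usual /v1/chat/completions convention, and the host/port in the comment is an assumption:

```cpp
// Hypothetical sketch: the one-shot CLI maps onto an OpenAI-compatible
// chat completions request. JSON is built by hand (no escaping) for brevity.
#include <iostream>
#include <string>

std::string BuildChatCompletionBody(const std::string& model,
                                    const std::string& message) {
  return "{\"model\":\"" + model + "\","
         "\"messages\":[{\"role\":\"user\",\"content\":\"" + message + "\"}]}";
}

int main() {
  // `cortex chat completion tinyllama -m "write a poem"` would POST this body
  // to the local server's /v1/chat/completions endpoint (host/port assumed).
  std::cout << BuildChatCompletionBody("tinyllama", "write a poem") << "\n";
}
```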

cc @gabrielle-ong: Please note, you will be updating Docs for this
cc @0xSage: FYI

@vansangpfiev (Contributor, Author) commented:

Got it. Let me change the implementation.

@vansangpfiev (Contributor, Author) commented:

@dan-homebrew The new cortex chat completion command is ambiguous: we can't tell whether the user wants to send a message to a model or to chat with a model named completion, and syntax error handling for this case is quite complicated. I think we can change that command to cortex chat-completion. What do you think?
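
A minimal sketch of the ambiguity, assuming the parser would have to consult the installed-model list (names are illustrative, not the actual parser):

```cpp
// Hypothetical sketch of the ambiguity: the same argv has two valid parses,
// so resolution would need the installed-model list. Names are illustrative.
#include <iostream>
#include <set>
#include <string>

bool IsInstalledModel(const std::string& id) {
  static const std::set<std::string> models = {"tinyllama:gguf"};  // stub
  return models.count(id) > 0;
}

int main() {
  const std::string token = "completion";  // from `cortex chat completion ...`
  if (IsInstalledModel(token))
    std::cout << "parse as: interactive chat with a model named `completion`\n";
  else
    std::cout << "parse as: chat-completion subcommand (a guess)\n";
  // A dedicated `chat-completion` token removes the guesswork entirely.
}
```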

@freelerobot (Contributor) commented:

(Quoting @vansangpfiev's cortex chat-completion proposal above.)

Yes, sorry to block on this, chat-completion makes a lot of sense 🙏

@dan-menlo (Contributor) commented Sep 27, 2024

(Quoting @vansangpfiev's cortex chat-completion proposal above.)

@vansangpfiev @0xSage I think the correct way to do this is to overload the cortex chat command:

# Chat (interactive)
cortex chat tinyllama
cortex run tinyllama --chat (underlying command)

# Chat (single message)
cortex chat tinyllama -m <message>

I am not very much in favor of creating an additional CLI method for chat-completions; I would prefer to just simplify.
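
A minimal sketch of the overloaded command, with hypothetical helpers, where the presence of -m selects the single-message path:

```cpp
// Hypothetical sketch of the overloaded `chat` command -- not the actual code.
// No -m flag opens an interactive shell; -m <message> does a single round trip.
#include <iostream>
#include <string>
#include <vector>

void InteractiveShell(const std::string& model) {
  std::cout << "interactive chat with " << model << "\n";
}

void SingleMessage(const std::string& model, const std::string& message) {
  std::cout << "one-shot completion on " << model << ": " << message << "\n";
}

int main(int argc, char** argv) {
  std::vector<std::string> args(argv + 1, argv + argc);
  if (args.empty()) return 1;         // usage: <model> [-m <message>]
  const std::string model = args[0];
  for (size_t i = 1; i < args.size(); ++i) {
    if (args[i] == "-m" && i + 1 < args.size()) {
      SingleMessage(model, args[i + 1]);  // `-m` selects the one-shot path
      return 0;
    }
  }
  InteractiveShell(model);
}
```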

@dan-menlo (Contributor) commented Sep 27, 2024

I think the model id is tinyllama:gguf, so when you run cortex run tinyllama, we don't have tinyllama as a model id yet.

You need to run cortex run tinyllama:gguf instead

@vansangpfiev This is probably not the correct UX. We should have a "main" branch equivalent:

cortex run tinyllama
cortex run llama3.2 (pulls :main)

Let me discuss with @gabrielle-ong. We will likely have a main branch on the Cortex Model repo to support this use case, and will have a spec ready by Monday.

In the long term, this will likely be replaced by something that automatically detects the user's hardware and picks the correct engine (sketched below).
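
A minimal sketch of that default-variant resolution; the :main fallback name is an assumption, since the actual branch naming was still to be specced at this point:

```cpp
// Hypothetical sketch of default-variant resolution -- the :main fallback name
// is an assumption; the actual branch naming was still to be specced.
#include <iostream>
#include <string>

std::string ResolveModelId(const std::string& requested) {
  if (requested.find(':') != std::string::npos) return requested;  // explicit
  return requested + ":main";  // bare id falls back to the default branch
}

int main() {
  std::cout << ResolveModelId("tinyllama") << "\n";       // -> tinyllama:main
  std::cout << ResolveModelId("tinyllama:gguf") << "\n";  // unchanged
  // Longer term, this fallback could probe the host hardware (GPU, OS/arch)
  // and pick the engine-appropriate variant automatically.
}
```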

@vansangpfiev vansangpfiev moved this from Review + QA to In Progress in Menlo Sep 30, 2024
@github-project-automation github-project-automation bot moved this from In Progress to Review + QA in Menlo Sep 30, 2024
@gabrielle-ong (Contributor) commented:

QA:
✅ Mac
✅ Windows
✅ Ubuntu

Commands as listed in #1351:

  1. cortex run tinyllama:gguf -> runs in background, advises user to use cortex chat for interactive
  2. cortex run tinyllama:gguf --chat -> interactive chat
  3. cortex chat tinyllama:gguf -> interactive chat
  4. cortex chat tinyllama:gguf -m "write a poem"

Clarified: There is no cortex chat completion command.

To be followed by #1362 - don't require :gguf in the model name
(which requires changes to the Hugging Face repo branches, out of scope of this implementation issue)

@gabrielle-ong gabrielle-ong added this to the v1.0.0 milestone Oct 3, 2024