
Image to text support? #5

Open · saket424 opened this issue Sep 16, 2024 · 7 comments

@saket424 commented Sep 16, 2024
I see text to image listed as a supported feature. How about image to text? There are quite a few capable self-hostable multimodal models these days, such as moondream2 and minicpm2.6, which are supported by ollama and similar.

Is that functionality implicitly supported?

@saket424 (Author) commented Sep 16, 2024

LocalAI supports multimodal chat completions with gpt-4-vision-preview. Can I try baibot with gpt-4-vision-preview instead of gpt-4?

      - id: localai
        provider: localai
        config:
          base_url: http://172.17.0.1:8080/v1
          api_key: null
          text_generation:
            model_id: gpt-4-vision-preview
            prompt: You are a brief, but helpful bot.
            temperature: 1.0
            max_response_tokens: 16384
            max_context_tokens: 128000

Here is the LocalAI model definition that gpt-4-vision-preview maps to:

name: gpt-4-vision-preview

roles:
  user: "USER:"
  assistant: "ASSISTANT:"
  system: "SYSTEM:"

mmproj: llava-v1.6-7b-mmproj-f16.gguf
parameters:
  model: llava-v1.6-mistral-7b.Q5_K_M.gguf
  temperature: 0.2
  top_k: 40
  top_p: 0.95
  seed: -1

template:
  chat: |
    A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions.
    {{.Input}}
    ASSISTANT:

download_files:
- filename: llava-v1.6-mistral-7b.Q5_K_M.gguf
  uri: huggingface://cjpais/llava-1.6-mistral-7b-gguf/llava-v1.6-mistral-7b.Q5_K_M.gguf
- filename: llava-v1.6-7b-mmproj-f16.gguf
  uri: huggingface://cjpais/llava-1.6-mistral-7b-gguf/mmproj-model-f16.gguf

usage: |
    curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
        "model": "gpt-4-vision-preview",
        "messages": [{"role": "user", "content": [{"type":"text", "text": "What is in the image?"}, {"type": "image_url", "image_url": {"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg" }}], "temperature": 0.9}]}'
Running the same query against the local instance returns a valid description:

curl http://172.17.0.1:8080/v1/chat/completions -H "Content-Type: application/json" -d '{ "model": "gpt-4-vision-preview", "messages": [{"role": "user", "content": [{"type":"text", "text": "What is in the image?"}, {"type": "image_url", "image_url": {"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg" }}], "temperature": 0.9}]}'

{
  "created": 1726522282,
  "object": "chat.completion",
  "id": "3a66a0dd-9899-49df-93c4-a2d36309642e",
  "model": "gpt-4-vision-preview",
  "choices": [
    {
      "index": 0,
      "finish_reason": "stop",
      "message": {
        "role": "assistant",
        "content": "The image shows a wooden pathway leading through a field of tall grass. The pathway appears to be a simple, unpaved trail, possibly in a rural or natural setting. The sky is clear and blue, suggesting a sunny day. There are no visible landmarks or distinctive features in the background, which gives the impression of a peaceful, open landscape. </s>"
      }
    }
  ],
  "usage": { "prompt_tokens": 1, "completion_tokens": 76, "total_tokens": 77 }
}

@saket424 (Author) commented

gpt-4-vision-preview does not appear to be supported by baibot -- only gpt-4 for the moment

~/baibot/src/agent/provider/localai$ cat mod.rs 
// LocalAI is based on OpenAI (async-openai), because it seems to be fully compatible.
// Moreover, openai_api_rust does not support speech-to-text, so if we wish to use this feature
// we need to stick to async-openai.

use super::openai_compat::Config;

pub fn default_config() -> Config {
    let mut config = Config {
        base_url: "http://my-localai-self-hosted-service:8080/v1".to_owned(),

        ..Default::default()
    };

    if let Some(ref mut config) = config.text_generation.as_mut() {
        config.model_id = "gpt-4".to_owned();
        config.max_context_tokens = 128_000;
        config.max_response_tokens = 4096;
    }

    if let Some(ref mut config) = config.text_to_speech.as_mut() {
        config.model_id = "tts-1".to_owned();
    }

    if let Some(ref mut config) = config.speech_to_text.as_mut() {
        config.model_id = "whisper-1".to_owned();
    }

    if let Some(ref mut config) = config.image_generation.as_mut() {
        config.model_id = "stablediffusion".to_owned();
    }

    config
} 

@spantaleev (Contributor) commented

This is a valid feature request.

baibot currently ignores all images sent by you. It doesn't support feeding them to a model yet.

@spantaleev (Contributor) commented

To address your previous comment:

> gpt-4-vision-preview does not appear to be supported by baibot -- only gpt-4 for the moment

You're pasting an excerpt from the code that defines the default configuration for agents created with the localai provider. This configuration inherits from the generic "OpenAI compatible" provider and adjusts the defaults to sane values for LocalAI.

The fact that gpt-4 is hardcoded in the default configuration does not mean you can't change it. When creating a new agent dynamically (e.g. !bai agent create-room-local localai my-new-localai-agent), you will be shown the default configuration (which specifies the gpt-4 model), but you can change it however you'd like. You can also define the agent statically in your YAML configuration, as sketched below.
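
For illustration, a static agent definition could look like this (a sketch reusing your excerpt above; the agents/static_definitions nesting is an assumption based on baibot's sample configuration, so double-check it against the config file shipped with the bot):

agents:
  static_definitions:
    - id: localai
      provider: localai
      config:
        base_url: http://172.17.0.1:8080/v1
        api_key: null
        text_generation:
          model_id: gpt-4-vision-preview
          prompt: You are a brief, but helpful bot.
          temperature: 1.0
          max_response_tokens: 16384
          max_context_tokens: 128000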

Perhaps specifying the gpt-4-vision-preview model would make LocalAI route your queries to a different model on its side.

Regardless, baibot cannot send images to the model, so what you're trying to do cannot be done yet.


For completeness, it should be noted that for the actual OpenAI API (recommended to be used via the openai provider), gpt-4-vision-preview is no longer a valid model.

If you try to use it, you get an error:

invalid_request_error: The model gpt-4-vision-preview has been deprecated, learn more here: https://platform.openai.com/docs/deprecations (code: model_not_found)

Here's the relevant part:

> On June 6th, 2024, we notified developers using gpt-4-32k and gpt-4-vision-preview of their upcoming deprecations in one year and six months respectively. As of June 17, 2024, only existing users of these models will be able to continue using them.

Using gpt-4o is the new equivalent to using gpt-4-vision-preview.
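
For example, the earlier curl query needs only the model name changed (a sketch; when talking to the actual OpenAI API, an Authorization header with your API key is also required):

curl https://api.openai.com/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -d '{
    "model": "gpt-4o",
    "messages": [{"role": "user", "content": [
      {"type": "text", "text": "What is in the image?"},
      {"type": "image_url", "image_url": {"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"}}
    ]}]
  }'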

@saket424 (Author) commented Sep 18, 2024

Thanks @spantaleev. In preparation for this new feature request for baibot, I will open an issue with LocalAI to let them know that gpt-4-vision-preview is deprecated and should instead be named gpt-4o, in line with OpenAI API compatibility. That name should get mapped to the llava-1.6-mistral model that the stock Docker CUDA 12 LocalAI v2.20.1 image comes pre-installed with.

References to gpt-4-vision-preview in https://github.com/mudler/LocalAI/blob/master/aio/gpu-8g/vision.yaml, https://github.com/mudler/LocalAI/blob/master/aio/cpu/vision.yaml, and https://github.com/mudler/LocalAI/blob/master/aio/intel/vision.yaml need to be changed to gpt-4o, as you point out.
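
The change itself is essentially a one-line rename in each of those files (a sketch based on the definition quoted earlier in this thread; the remaining fields stay as they are):

name: gpt-4o   # was: gpt-4-vision-preview
mmproj: llava-v1.6-7b-mmproj-f16.gguf
parameters:
  model: llava-v1.6-mistral-7b.Q5_K_M.gguf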

@saket424 (Author) commented Sep 18, 2024

I opened this LocalAI issue: mudler/LocalAI#3596

@saket424 (Author) commented Nov 8, 2024

@spantaleev Any progress on this? I would love for baibot to weigh in when an image and an associated prompt are uploaded. This should be relatively straightforward to support, as it is an extended multimodal use of the existing chat completion API.
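
For what it's worth, here is roughly what such a multimodal request could look like with async-openai, the crate baibot already uses per the mod.rs excerpt above. This is a minimal sketch, not baibot's actual code; the type names match recent async-openai releases and may differ in the version baibot pins:

use async_openai::{
    config::OpenAIConfig,
    types::{
        ChatCompletionRequestMessageContentPartImageArgs,
        ChatCompletionRequestMessageContentPartTextArgs,
        ChatCompletionRequestUserMessageArgs, CreateChatCompletionRequestArgs, ImageUrlArgs,
    },
    Client,
};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Point the client at the self-hosted LocalAI endpoint instead of api.openai.com.
    let config = OpenAIConfig::new().with_api_base("http://172.17.0.1:8080/v1");
    let client = Client::with_config(config);

    // A user message whose content is an array of parts (one text part, one
    // image part) -- the same payload shape as the curl examples earlier.
    let request = CreateChatCompletionRequestArgs::default()
        .model("gpt-4o")
        .max_tokens(300_u32)
        .messages([ChatCompletionRequestUserMessageArgs::default()
            .content(vec![
                ChatCompletionRequestMessageContentPartTextArgs::default()
                    .text("What is in the image?")
                    .build()?
                    .into(),
                ChatCompletionRequestMessageContentPartImageArgs::default()
                    .image_url(ImageUrlArgs::default()
                        .url("https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg")
                        .build()?)
                    .build()?
                    .into(),
            ])
            .build()?
            .into()])
        .build()?;

    let response = client.chat().create(request).await?;
    println!("{}", response.choices[0].message.content.as_deref().unwrap_or(""));
    Ok(())
}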
