
Image to text support? #5

Open · saket424 opened this issue Sep 16, 2024 · 7 comments

@saket424 commented Sep 16, 2024
I see text to image listed as a supported feature. How about image to text? There are quite a few capable self-hostable multimodal models these days, such as moondream2 and minicpm2.6, which are supported by ollama and similar.

Is that functionality implicitly supported?

@saket424 (Author) commented Sep 16, 2024

LocalAI supports multimodal chat completions with gpt-4-vision-preview. Can I try baibot with gpt-4-vision-preview instead of gpt-4?

      - id: localai
        provider: localai
        config:
          base_url: http://172.17.0.1:8080/v1
          api_key: null
          text_generation:
            model_id: gpt-4-vision-preview
            prompt: You are a brief, but helpful bot.
            temperature: 1.0
            max_response_tokens: 16384
            max_context_tokens: 128000

Here is the LocalAI model definition that gpt-4-vision-preview maps to:

name: gpt-4-vision-preview

roles:
  user: "USER:"
  assistant: "ASSISTANT:"
  system: "SYSTEM:"

mmproj: llava-v1.6-7b-mmproj-f16.gguf
parameters:
  model: llava-v1.6-mistral-7b.Q5_K_M.gguf
  temperature: 0.2
  top_k: 40
  top_p: 0.95
  seed: -1

template:
  chat: |
    A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions.
    {{.Input}}
    ASSISTANT:

download_files:
- filename: llava-v1.6-mistral-7b.Q5_K_M.gguf
  uri: huggingface://cjpais/llava-1.6-mistral-7b-gguf/llava-v1.6-mistral-7b.Q5_K_M.gguf
- filename: llava-v1.6-7b-mmproj-f16.gguf
  uri: huggingface://cjpais/llava-1.6-mistral-7b-gguf/mmproj-model-f16.gguf

usage: |
    curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
        "model": "gpt-4-vision-preview",
        "messages": [{"role": "user", "content": [{"type":"text", "text": "What is in the image?"}, {"type": "image_url", "image_url": {"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg" }}], "temperature": 0.9}]}'
Running the same query against the local instance returns a valid description:

curl http://172.17.0.1:8080/v1/chat/completions -H "Content-Type: application/json" -d '{ "model": "gpt-4-vision-preview", "messages": [{"role": "user", "content": [{"type":"text", "text": "What is in the image?"}, {"type": "image_url", "image_url": {"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg" }}], "temperature": 0.9}]}'

{
  "created": 1726522282,
  "object": "chat.completion",
  "id": "3a66a0dd-9899-49df-93c4-a2d36309642e",
  "model": "gpt-4-vision-preview",
  "choices": [
    {
      "index": 0,
      "finish_reason": "stop",
      "message": {
        "role": "assistant",
        "content": "The image shows a wooden pathway leading through a field of tall grass. The pathway appears to be a simple, unpaved trail, possibly in a rural or natural setting. The sky is clear and blue, suggesting a sunny day. There are no visible landmarks or distinctive features in the background, which gives the impression of a peaceful, open landscape. </s>"
      }
    }
  ],
  "usage": { "prompt_tokens": 1, "completion_tokens": 76, "total_tokens": 77 }
}

@saket424 (Author) commented

gpt-4-vision-preview does not appear to be supported by baibot -- only gpt-4 for the moment

~/baibot/src/agent/provider/localai$ cat mod.rs 
// LocalAI is based on OpenAI (async-openai), because it seems to be fully compatible.
// Moreover, openai_api_rust does not support speech-to-text, so if we wish to use this feature
// we need to stick to async-openai.

use super::openai_compat::Config;

pub fn default_config() -> Config {
    let mut config = Config {
        base_url: "http://my-localai-self-hosted-service:8080/v1".to_owned(),

        ..Default::default()
    };

    if let Some(ref mut config) = config.text_generation.as_mut() {
        config.model_id = "gpt-4".to_owned();
        config.max_context_tokens = 128_000;
        config.max_response_tokens = 4096;
    }

    if let Some(ref mut config) = config.text_to_speech.as_mut() {
        config.model_id = "tts-1".to_owned();
    }

    if let Some(ref mut config) = config.speech_to_text.as_mut() {
        config.model_id = "whisper-1".to_owned();
    }

    if let Some(ref mut config) = config.image_generation.as_mut() {
        config.model_id = "stablediffusion".to_owned();
    }

    config
} 

@spantaleev (Contributor) commented

This is a valid feature request.

baibot currently ignores all images sent by you. It doesn't support feeding them to a model yet.

@spantaleev (Contributor) commented

To address your previous comment:

> gpt-4-vision-preview does not appear to be supported by baibot -- only gpt-4 for the moment

You're pasting an excerpt from the code that defines the default configuration for agents created with the localai provider. This configuration inherits from the generic "OpenAI compatible" provider and adjusts the defaults to sane values for LocalAI.

The fact that gpt-4 is hardcoded in the default configuration does not mean you can't change it. When creating a new agent dynamically (e.g. !bai agent create-room-local localai my-new-localai-agent), you will be shown the default configuration (which specifies the gpt-4 model), but you can change it however you'd like. You can also define the agent statically in your YAML configuration, as sketched below.
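
For illustration, a static agent definition could look like this (a sketch reusing your excerpt above; the agents/static_definitions nesting is an assumption based on baibot's sample configuration, so double-check it against the config file shipped with the bot):

agents:
  static_definitions:
    - id: localai
      provider: localai
      config:
        base_url: http://172.17.0.1:8080/v1
        api_key: null
        text_generation:
          model_id: gpt-4-vision-preview
          prompt: You are a brief, but helpful bot.
          temperature: 1.0
          max_response_tokens: 16384
          max_context_tokens: 128000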

Perhaps specifying the gpt-4-vision-preview model would make LocalAI route your queries to a different model on its side.

Regardless, baibot cannot send images to the model, so what you're trying to do cannot be done yet.


For completeness, it should be noted that for the actual OpenAI API (recommended to be used via the openai provider), gpt-4-vision-preview is no longer a valid model.

If you try to use it, you get an error:

invalid_request_error: The model gpt-4-vision-preview has been deprecated, learn more here: https://platform.openai.com/docs/deprecations (code: model_not_found)

Here's the relevant part:

> On June 6th, 2024, we notified developers using gpt-4-32k and gpt-4-vision-preview of their upcoming deprecations in one year and six months respectively. As of June 17, 2024, only existing users of these models will be able to continue using them.

Using gpt-4o is the new equivalent to using gpt-4-vision-preview.
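
For example, the earlier curl query needs only the model name changed (a sketch; when talking to the actual OpenAI API, an Authorization header with your API key is also required):

curl https://api.openai.com/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -d '{
    "model": "gpt-4o",
    "messages": [{"role": "user", "content": [
      {"type": "text", "text": "What is in the image?"},
      {"type": "image_url", "image_url": {"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"}}
    ]}]
  }'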

@saket424 (Author) commented Sep 18, 2024

Thanks @spantaleev. In preparation for this new feature request for baibot, I will open an issue with LocalAI to let them know that gpt-4-vision-preview is deprecated and should instead be named gpt-4o, in line with OpenAI API compatibility. That name should get mapped to the llava-1.6-mistral model that the stock Docker CUDA 12 LocalAI v2.20.1 image comes pre-installed with.

References to gpt-4-vision-preview in https://github.com/mudler/LocalAI/blob/master/aio/gpu-8g/vision.yaml, https://github.com/mudler/LocalAI/blob/master/aio/cpu/vision.yaml, and https://github.com/mudler/LocalAI/blob/master/aio/intel/vision.yaml need to be changed to gpt-4o, as you point out.
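
The change itself is essentially a one-line rename in each of those files (a sketch based on the definition quoted earlier in this thread; the remaining fields stay as they are):

name: gpt-4o   # was: gpt-4-vision-preview
mmproj: llava-v1.6-7b-mmproj-f16.gguf
parameters:
  model: llava-v1.6-mistral-7b.Q5_K_M.gguf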

@saket424 (Author) commented Sep 18, 2024

I opened this LocalAI issue: mudler/LocalAI#3596

@saket424 (Author) commented Nov 8, 2024

@spantaleev Any progress on this? I would love for baibot to weigh in when an image and an associated prompt are uploaded. This should be relatively straightforward to support, as it is an extended multimodal use of the existing chat completion API.
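
For what it's worth, here is roughly what such a multimodal request could look like with async-openai, the crate baibot already uses per the mod.rs excerpt above. This is a minimal sketch, not baibot's actual code; the type names match recent async-openai releases and may differ in the version baibot pins:

use async_openai::{
    config::OpenAIConfig,
    types::{
        ChatCompletionRequestMessageContentPartImageArgs,
        ChatCompletionRequestMessageContentPartTextArgs,
        ChatCompletionRequestUserMessageArgs, CreateChatCompletionRequestArgs, ImageUrlArgs,
    },
    Client,
};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Point the client at the self-hosted LocalAI endpoint instead of api.openai.com.
    let config = OpenAIConfig::new().with_api_base("http://172.17.0.1:8080/v1");
    let client = Client::with_config(config);

    // A user message whose content is an array of parts (one text part, one
    // image part) -- the same payload shape as the curl examples earlier.
    let request = CreateChatCompletionRequestArgs::default()
        .model("gpt-4o")
        .max_tokens(300_u32)
        .messages([ChatCompletionRequestUserMessageArgs::default()
            .content(vec![
                ChatCompletionRequestMessageContentPartTextArgs::default()
                    .text("What is in the image?")
                    .build()?
                    .into(),
                ChatCompletionRequestMessageContentPartImageArgs::default()
                    .image_url(ImageUrlArgs::default()
                        .url("https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg")
                        .build()?)
                    .build()?
                    .into(),
            ])
            .build()?
            .into()])
        .build()?;

    let response = client.chat().create(request).await?;
    println!("{}", response.choices[0].message.content.as_deref().unwrap_or(""));
    Ok(())
}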
