Research issue: gather examples of multi-modal API calls from different LLMs #557
Comments
Simple GPT-4o example from https://simonwillison.net/2024/Aug/25/covidsewage-alt-text/:

import base64, openai
client = openai.OpenAI()
with open("/tmp/covid.png", "rb") as image_file:
encoded_image = base64.b64encode(image_file.read()).decode("utf-8")
messages = [
{
"role": "system",
"content": "Return the concentration levels in the sewersheds - single paragraph, no markdown",
},
{
"role": "user",
"content": [
{
"type": "image_url",
"image_url": {"url": "data:image/png;base64," + encoded_image},
}
],
},
]
completion = client.chat.completions.create(model="gpt-4o", messages=messages)
print(completion.choices[0].message.content)
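For comparison, the same API also accepts a plain https image URL in place of the base64 data: URI. A minimal sketch (the URL and prompt here are placeholders):

import openai

client = openai.OpenAI()
completion = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image in one sentence"},
                # A regular URL works here as well as a data: URI
                {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
            ],
        }
    ],
)
print(completion.choices[0].message.content)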
Claude image example from https://github.com/simonw/tools/blob/0249ab83775861f549abb1aa80af0ca3614dc5ff/haiku.html:

const requestBody = {
model: "claude-3-haiku-20240307",
max_tokens: 1024,
messages: [
{
role: "user",
content: [
{
type: "image",
source: {
type: "base64",
media_type: "image/jpeg",
data: base64Image,
},
},
{ type: "text", text: "Return a haiku inspired by this image" },
],
},
],
};
fetch("https://api.anthropic.com/v1/messages", {
method: "POST",
headers: {
"x-api-key": apiKey,
"anthropic-version": "2023-06-01",
"content-type": "application/json",
"anthropic-dangerous-direct-browser-access": "true"
},
body: JSON.stringify(requestBody),
})
.then((response) => response.json())
.then((data) => {
console.log(JSON.stringify(data, null, 2));
const haiku = data.content[0].text;
responseElement.innerText += haiku + "\n\n";
})
.catch((error) => {
console.error("Error sending image to the Anthropic API:", error);
})
.finally(() => {
// Hide "Generating..." message
generatingElement.style.display = "none";
});
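For completeness, the same request through the Anthropic Python SDK looks roughly like this (a sketch: the file path and prompt are placeholders, and the client reads ANTHROPIC_API_KEY from the environment):

import base64
import anthropic

client = anthropic.Anthropic()

with open("photo.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

message = client.messages.create(
    model="claude-3-haiku-20240307",
    max_tokens=1024,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": "image/jpeg",
                        "data": image_b64,
                    },
                },
                {"type": "text", "text": "Return a haiku inspired by this image"},
            ],
        }
    ],
)
print(message.content[0].text)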
Basic Gemini example from https://github.com/simonw/llm-gemini/blob/4195c4396834e5bccc3ce9a62647591e1b228e2e/llm_gemini.py (my llm-gemini plugin):

messages = []
if conversation:
for response in conversation.responses:
messages.append(
{"role": "user", "parts": [{"text": response.prompt.prompt}]}
)
messages.append({"role": "model", "parts": [{"text": response.text()}]})
if prompt.images:
for image in prompt.images:
messages.append(
{
"role": "user",
"parts": [
{
"inlineData": {
"mimeType": "image/jpeg",
"data": base64.b64encode(image.read()).decode(
"utf-8"
),
}
}
],
}
)
messages.append({"role": "user", "parts": [{"text": prompt.prompt}]})
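Those parts are then POSTed as the "contents" of a generateContent request. A rough sketch of the equivalent raw HTTP call with requests (the model name and the GEMINI_API_KEY environment variable are assumptions; the endpoint shape matches the curl example below):

import base64
import os
import requests

api_key = os.environ["GEMINI_API_KEY"]

with open("photo.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

contents = [
    {
        "role": "user",
        "parts": [
            {"inlineData": {"mimeType": "image/jpeg", "data": image_b64}},
            {"text": "Describe this image"},
        ],
    }
]

url = (
    "https://generativelanguage.googleapis.com/v1beta/models/"
    f"gemini-1.5-pro-latest:generateContent?key={api_key}"
)
response = requests.post(url, json={"contents": contents})
response.raise_for_status()
print(response.json()["candidates"][0]["content"]["parts"][0]["text"])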
Example from Google AI Studio:

API_KEY="YOUR_API_KEY"
# TODO: Make the following files available on the local file system.
FILES=("image.jpg")
MIME_TYPES=("image/jpeg")
for i in "${!FILES[@]}"; do
NUM_BYTES=$(wc -c < "${FILES[$i]}")
curl "https://generativelanguage.googleapis.com/upload/v1beta/files?key=${API_KEY}" \
-H "X-Goog-Upload-Command: start, upload, finalize" \
-H "X-Goog-Upload-Header-Content-Length: ${NUM_BYTES}" \
-H "X-Goog-Upload-Header-Content-Type: ${MIME_TYPES[$i]}" \
-H "Content-Type: application/json" \
-d "{'file': {'display_name': '${FILES[$i]}'}}" \
--data-binary "@${FILES[$i]}"
# TODO: Read the file.uri from the response, store it as FILE_URI_${i}
done
# Adjust safety settings in generationConfig below.
# See https://ai.google.dev/gemini-api/docs/safety-settings
curl \
-X POST https://generativelanguage.googleapis.com/v1beta/models/gemini-1.5-pro-exp-0801:generateContent?key=${API_KEY} \
-H 'Content-Type: application/json' \
-d @<(echo '{
"contents": [
{
"role": "user",
"parts": [
{
"fileData": {
"fileUri": "${FILE_URI_0}",
"mimeType": "image/jpeg"
}
}
]
},
{
"role": "user",
"parts": [
{
"text": "Describe image in detail"
}
]
}
],
"generationConfig": {
"temperature": 1,
"topK": 64,
"topP": 0.95,
"maxOutputTokens": 8192,
"responseMimeType": "text/plain"
}
}')
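The Python SDK wraps that two-step File API dance in a single helper. A hedged sketch using google.generativeai (file name, MIME type, and prompt are placeholders):

import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")

# upload_file() performs the resumable upload and returns a file reference
# that can be passed directly to generate_content()
uploaded = genai.upload_file("image.jpg", mime_type="image/jpeg")

model = genai.GenerativeModel(model_name="gemini-1.5-pro")
response = model.generate_content([uploaded, "Describe image in detail"])
print(response.text)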
Here's Gemini Pro accepting multiple images at once: https://ai.google.dev/gemini-api/docs/vision?lang=python#prompt-multiple

import PIL.Image
sample_file = PIL.Image.open('sample.jpg')
sample_file_2 = PIL.Image.open('piranha.jpg')
sample_file_3 = PIL.Image.open('firefighter.jpg')
model = genai.GenerativeModel(model_name="gemini-1.5-pro")
prompt = (
"Write an advertising jingle showing how the product in the first image "
"could solve the problems shown in the second two images."
)
response = model.generate_content([prompt, sample_file, sample_file_2, sample_file_3])
print(response.text)
I just saw Gemini has been trained to return bounding boxes: https://ai.google.dev/gemini-api/docs/vision?lang=python#bbox

I tried this:

>>> import google.generativeai as genai
>>> genai.configure(api_key="...")
>>> model = genai.GenerativeModel(model_name="gemini-1.5-pro-latest")
>>> import PIL.Image
>>> pelicans = PIL.Image.open('/tmp/pelicans.jpeg')
>>> prompt = 'Return bounding boxes for every pelican in this photo - for each one return [ymin, xmin, ymax, xmax]'
>>> response = model.generate_content([pelicans, prompt])
>>> print(response.text)
I found the following bounding boxes:
- [488, 945, 519, 999]
- [460, 259, 487, 307]
- [472, 574, 498, 612]
- [459, 431, 483, 476]
- [530, 519, 555, 560]
- [445, 733, 470, 769]
- [493, 805, 516, 850]
- [418, 545, 441, 581]
- [400, 428, 425, 466]
- [593, 519, 616, 543]
- [428, 93, 451, 135]
- [431, 224, 456, 266]
- [586, 941, 609, 964]
- [602, 711, 623, 735]
- [397, 500, 419, 535]
I could not find any other pelicans in this image.

Against this photo it got 15 - I count 20.
I don't think those bounding boxes are in the right places. I built a Claude Artifact to render them, and I may not have built it right, but I got this:

Code here: https://static.simonwillison.net/static/2024/gemini-bounding-box-tool.html
Transcript: https://gist.github.com/simonw/40ff639e96d55a1df7ebfa7db1974b92
Tried it again with this photo of goats and got a slightly more convincing result:

>>> goats = PIL.Image.open("/tmp/goats.jpeg")
>>> prompt = 'Return bounding boxes around every goat, for each one return [ymin, xmin, ymax, xmax]'
>>> response = model.generate_content([goats, prompt])
>>> print(response.text)
- 200 90 745 527 goat
- 300 610 904 937 goat
I mucked around a bunch and came up with this, which seems to work: https://static.simonwillison.net/static/2024/gemini-bounding-box-tool-fixed.html

It does a better job with the pelicans, though clearly those boxes aren't right. The goats are spot on though!
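For anyone reproducing this: the Gemini docs describe the returned [ymin, xmin, ymax, xmax] values as normalized to a 0-1000 scale, so they have to be rescaled to the image's pixel dimensions before drawing. A rough Pillow sketch (the line parsing is an assumption - it just grabs any line containing exactly four integers):

import re
from PIL import Image, ImageDraw

def draw_boxes(image_path, response_text, out_path="boxes.png"):
    img = Image.open(image_path)
    draw = ImageDraw.Draw(img)
    width, height = img.size
    for line in response_text.splitlines():
        numbers = [int(n) for n in re.findall(r"\d+", line)]
        if len(numbers) != 4:
            continue
        ymin, xmin, ymax, xmax = numbers
        # Coordinates are normalized to 0-1000: scale them to pixel space
        draw.rectangle(
            [xmin / 1000 * width, ymin / 1000 * height,
             xmax / 1000 * width, ymax / 1000 * height],
            outline="red",
            width=3,
        )
    img.save(out_path)

# e.g. draw_boxes("/tmp/pelicans.jpeg", response.text) with the response above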
Fun, with this heron it found the reflection too:

>>> heron = PIL.Image.open("/tmp/heron.jpeg")
>>> prompt = 'Return bounding boxes around every heron, [ymin, xmin, ymax, xmax]'
>>> response = model.generate_content([heron, prompt])
>>> print(response.text)
- [431, 478, 625, 575]
- [224, 493, 411, 606]
Based on all of that, I built this tool: https://tools.simonwillison.net/gemini-bbox

You have to paste in a Gemini API key when you use it. See the full blog post here: https://simonwillison.net/2024/Aug/26/gemini-bounding-box-visualization/
I'd like to run an image model locally. The docs at https://llama-cpp-python.readthedocs.io/en/latest/#multi-modal-models seem to want a path to a CLIP model though, which I'm not sure how to obtain.
https://huggingface.co/xtuner/llava-llama-3-8b-v1_1-gguf would be a good one to figure out the Python recipe for.
According to perplexity.ai, "the mmproj model is essentially equivalent to the CLIP model in the context of llama-cpp-python and GGUF (GGML Unified Format) files for multimodal models like LLaVA and minicpm2.6".

https://huggingface.co/openbmb/MiniCPM-V-2_6-gguf/resolve/main/mmproj-model-f16.gguf?download=true

It appears the underlying embedding model used is google/siglip-base-patch16-224.
Wow, it appears this functionality was just added to llama-cpp-python yesterday. Eagerly looking forward to MiniCPM-V-2_6-gguf as a supported llm multimodal model.
I tried the newest 2.90 version of llama-cpp-python and it works! Instead of ggml-model-f16.gguf you can use ggml-model-Q4_K_M.gguf if you prefer.
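For reference, a rough sketch of what the llama-cpp-python setup looks like for MiniCPM-V-2.6, based on the library's documented chat-handler API - the handler class name, file names, and context size are assumptions, so check the multi-modal docs linked above:

import base64
from llama_cpp import Llama
from llama_cpp.llama_chat_format import MiniCPMv26ChatHandler

# The mmproj file plays the role of the CLIP/vision projector discussed above
chat_handler = MiniCPMv26ChatHandler(clip_model_path="mmproj-model-f16.gguf")
llm = Llama(
    model_path="ggml-model-Q4_K_M.gguf",
    chat_handler=chat_handler,
    n_ctx=4096,  # extra context to leave room for the image tokens
)

with open("/tmp/image.jpg", "rb") as f:
    data_uri = "data:image/jpeg;base64," + base64.b64encode(f.read()).decode("utf-8")

response = llm.create_chat_completion(
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": data_uri}},
                {"type": "text", "text": "OCR the text from the image."},
            ],
        }
    ]
)
print(response["choices"][0]["message"]["content"])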
Thank you! That’s exactly what I needed to know.
ollama 0.3.10, captured HTTP conversation to /api/chat via the ollama CLI client. The prompt was: "/tmp/image.jpg OCR the text from the image."

POST /api/chat HTTP/1.1
Host: 127.0.0.1:11434
User-Agent: ollama/0.3.10 (amd64 linux) Go/go1.22.5
Content-Length: 1370164
Accept: application/x-ndjson
Content-Type: application/json
Accept-Encoding: gzip
{"model":"minicpm-v","messages":[{"role":"user","content":" OCR the text from the image.","images":["/9j/2wC<truncated base64>/9k="]}],"format":"","options":{}} same JSON pretty-printed: {
"model":"minicpm-v",
"messages":[
{
"role":"user",
"content":" OCR the text from the image.",
"images":[
"/9j/2wC<truncated base64>/9k="
]
}
],
"format":"",
"options":{
}
}
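The same call is easy to reproduce from Python against a local Ollama server - a minimal sketch mirroring the captured JSON (the image path is a placeholder):

import base64
import requests

with open("/tmp/image.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = requests.post(
    "http://127.0.0.1:11434/api/chat",
    json={
        "model": "minicpm-v",
        "stream": False,  # return one JSON object instead of ndjson chunks
        "messages": [
            {
                "role": "user",
                "content": "OCR the text from the image.",
                # Ollama expects raw base64 strings, not data: URIs
                "images": [image_b64],
            }
        ],
    },
)
response.raise_for_status()
print(response.json()["message"]["content"])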
To aid in the design for both of these:
I'm going to gather a bunch of examples of how different LLMs accept multi-modal inputs. I'm particularly interested in the following:
(Can users just pass --file filename.ext, or do I need some other mechanism that helps provide the type as well?)