Research issue: gather examples of multi-modal API calls from different LLMs #557
Comments
Simple GPT-4o example from https://simonwillison.net/2024/Aug/25/covidsewage-alt-text/:

import base64, openai
client = openai.OpenAI()
with open("/tmp/covid.png", "rb") as image_file:
encoded_image = base64.b64encode(image_file.read()).decode("utf-8")
messages = [
{
"role": "system",
"content": "Return the concentration levels in the sewersheds - single paragraph, no markdown",
},
{
"role": "user",
"content": [
{
"type": "image_url",
"image_url": {"url": "data:image/png;base64," + encoded_image},
}
],
},
]
completion = client.chat.completions.create(model="gpt-4o", messages=messages)
print(completion.choices[0].message.content)
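For comparison, the same API also accepts a plain https image URL in place of the base64 data: URI. A minimal sketch (the URL and prompt here are placeholders):

import openai

client = openai.OpenAI()
completion = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image in one sentence"},
                # A regular URL works here as well as a data: URI
                {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
            ],
        }
    ],
)
print(completion.choices[0].message.content)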
Claude image example from https://github.com/simonw/tools/blob/0249ab83775861f549abb1aa80af0ca3614dc5ff/haiku.html:

const requestBody = {
model: "claude-3-haiku-20240307",
max_tokens: 1024,
messages: [
{
role: "user",
content: [
{
type: "image",
source: {
type: "base64",
media_type: "image/jpeg",
data: base64Image,
},
},
{ type: "text", text: "Return a haiku inspired by this image" },
],
},
],
};
fetch("https://api.anthropic.com/v1/messages", {
method: "POST",
headers: {
"x-api-key": apiKey,
"anthropic-version": "2023-06-01",
"content-type": "application/json",
"anthropic-dangerous-direct-browser-access": "true"
},
body: JSON.stringify(requestBody),
})
.then((response) => response.json())
.then((data) => {
console.log(JSON.stringify(data, null, 2));
const haiku = data.content[0].text;
responseElement.innerText += haiku + "\n\n";
})
.catch((error) => {
console.error("Error sending image to the Anthropic API:", error);
})
.finally(() => {
// Hide "Generating..." message
generatingElement.style.display = "none";
});
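For completeness, the same request through the Anthropic Python SDK looks roughly like this (a sketch: the file path and prompt are placeholders, and the client reads ANTHROPIC_API_KEY from the environment):

import base64
import anthropic

client = anthropic.Anthropic()

with open("photo.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

message = client.messages.create(
    model="claude-3-haiku-20240307",
    max_tokens=1024,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": "image/jpeg",
                        "data": image_b64,
                    },
                },
                {"type": "text", "text": "Return a haiku inspired by this image"},
            ],
        }
    ],
)
print(message.content[0].text)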
Basic Gemini example from https://github.com/simonw/llm-gemini/blob/4195c4396834e5bccc3ce9a62647591e1b228e2e/llm_gemini.py (my llm-gemini plugin):

messages = []
if conversation:
for response in conversation.responses:
messages.append(
{"role": "user", "parts": [{"text": response.prompt.prompt}]}
)
messages.append({"role": "model", "parts": [{"text": response.text()}]})
if prompt.images:
for image in prompt.images:
messages.append(
{
"role": "user",
"parts": [
{
"inlineData": {
"mimeType": "image/jpeg",
"data": base64.b64encode(image.read()).decode(
"utf-8"
),
}
}
],
}
)
messages.append({"role": "user", "parts": [{"text": prompt.prompt}]})
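Those parts are then POSTed as the "contents" of a generateContent request. A rough sketch of the equivalent raw HTTP call with requests (the model name and the GEMINI_API_KEY environment variable are assumptions; the endpoint shape matches the curl example below):

import base64
import os
import requests

api_key = os.environ["GEMINI_API_KEY"]

with open("photo.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

contents = [
    {
        "role": "user",
        "parts": [
            {"inlineData": {"mimeType": "image/jpeg", "data": image_b64}},
            {"text": "Describe this image"},
        ],
    }
]

url = (
    "https://generativelanguage.googleapis.com/v1beta/models/"
    f"gemini-1.5-pro-latest:generateContent?key={api_key}"
)
response = requests.post(url, json={"contents": contents})
response.raise_for_status()
print(response.json()["candidates"][0]["content"]["parts"][0]["text"])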
Example from Google AI Studio:

API_KEY="YOUR_API_KEY"
# TODO: Make the following files available on the local file system.
FILES=("image.jpg")
MIME_TYPES=("image/jpeg")
for i in "${!FILES[@]}"; do
NUM_BYTES=$(wc -c < "${FILES[$i]}")
curl "https://generativelanguage.googleapis.com/upload/v1beta/files?key=${API_KEY}" \
-H "X-Goog-Upload-Command: start, upload, finalize" \
-H "X-Goog-Upload-Header-Content-Length: ${NUM_BYTES}" \
-H "X-Goog-Upload-Header-Content-Type: ${MIME_TYPES[$i]}" \
-H "Content-Type: application/json" \
-d "{'file': {'display_name': '${FILES[$i]}'}}" \
--data-binary "@${FILES[$i]}"
# TODO: Read the file.uri from the response, store it as FILE_URI_${i}
done
# Adjust safety settings in generationConfig below.
# See https://ai.google.dev/gemini-api/docs/safety-settings
curl \
-X POST https://generativelanguage.googleapis.com/v1beta/models/gemini-1.5-pro-exp-0801:generateContent?key=${API_KEY} \
-H 'Content-Type: application/json' \
-d @<(echo '{
"contents": [
{
"role": "user",
"parts": [
{
"fileData": {
"fileUri": "${FILE_URI_0}",
"mimeType": "image/jpeg"
}
}
]
},
{
"role": "user",
"parts": [
{
"text": "Describe image in detail"
}
]
}
],
"generationConfig": {
"temperature": 1,
"topK": 64,
"topP": 0.95,
"maxOutputTokens": 8192,
"responseMimeType": "text/plain"
}
}')
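The Python SDK wraps that two-step File API dance in a single helper. A hedged sketch using google.generativeai (file name, MIME type, and prompt are placeholders):

import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")

# upload_file() performs the resumable upload and returns a file reference
# that can be passed directly to generate_content()
uploaded = genai.upload_file("image.jpg", mime_type="image/jpeg")

model = genai.GenerativeModel(model_name="gemini-1.5-pro")
response = model.generate_content([uploaded, "Describe image in detail"])
print(response.text)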
Here's Gemini Pro accepting multiple images at once: https://ai.google.dev/gemini-api/docs/vision?lang=python#prompt-multiple

import PIL.Image
sample_file = PIL.Image.open('sample.jpg')
sample_file_2 = PIL.Image.open('piranha.jpg')
sample_file_3 = PIL.Image.open('firefighter.jpg')
model = genai.GenerativeModel(model_name="gemini-1.5-pro")
prompt = (
"Write an advertising jingle showing how the product in the first image "
"could solve the problems shown in the second two images."
)
response = model.generate_content([prompt, sample_file, sample_file_2, sample_file_3])
print(response.text)
I just saw Gemini has been trained to return bounding boxes: https://ai.google.dev/gemini-api/docs/vision?lang=python#bbox

I tried this:

>>> import google.generativeai as genai
>>> genai.configure(api_key="...")
>>> model = genai.GenerativeModel(model_name="gemini-1.5-pro-latest")
>>> import PIL.Image
>>> pelicans = PIL.Image.open('/tmp/pelicans.jpeg')
>>> prompt = 'Return bounding boxes for every pelican in this photo - for each one return [ymin, xmin, ymax, xmax]'
>>> response = model.generate_content([pelicans, prompt])
>>> print(response.text)
I found the following bounding boxes:
- [488, 945, 519, 999]
- [460, 259, 487, 307]
- [472, 574, 498, 612]
- [459, 431, 483, 476]
- [530, 519, 555, 560]
- [445, 733, 470, 769]
- [493, 805, 516, 850]
- [418, 545, 441, 581]
- [400, 428, 425, 466]
- [593, 519, 616, 543]
- [428, 93, 451, 135]
- [431, 224, 456, 266]
- [586, 941, 609, 964]
- [602, 711, 623, 735]
- [397, 500, 419, 535]
I could not find any other pelicans in this image.

Against this photo it got 15 - I count 20.
I don't think those bounding boxes are in the right places. I built a Claude Artifact to render them, and I may not have built it right, but I got this:

Code here: https://static.simonwillison.net/static/2024/gemini-bounding-box-tool.html
Transcript: https://gist.github.com/simonw/40ff639e96d55a1df7ebfa7db1974b92
Tried it again with this photo of goats and got a slightly more convincing result:

>>> goats = PIL.Image.open("/tmp/goats.jpeg")
>>> prompt = 'Return bounding boxes around every goat, for each one return [ymin, xmin, ymax, xmax]'
>>> response = model.generate_content([goats, prompt])
>>> print(response.text)
- 200 90 745 527 goat
- 300 610 904 937 goat
I mucked around a bunch and came up with this, which seems to work: https://static.simonwillison.net/static/2024/gemini-bounding-box-tool-fixed.html

It does a better job with the pelicans, though clearly those boxes aren't right. The goats are spot on though!
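For anyone reproducing this: the Gemini docs describe the returned [ymin, xmin, ymax, xmax] values as normalized to a 0-1000 scale, so they have to be rescaled to the image's pixel dimensions before drawing. A rough Pillow sketch (the line parsing is an assumption - it just grabs any line containing exactly four integers):

import re
from PIL import Image, ImageDraw

def draw_boxes(image_path, response_text, out_path="boxes.png"):
    img = Image.open(image_path)
    draw = ImageDraw.Draw(img)
    width, height = img.size
    for line in response_text.splitlines():
        numbers = [int(n) for n in re.findall(r"\d+", line)]
        if len(numbers) != 4:
            continue
        ymin, xmin, ymax, xmax = numbers
        # Coordinates are normalized to 0-1000: scale them to pixel space
        draw.rectangle(
            [xmin / 1000 * width, ymin / 1000 * height,
             xmax / 1000 * width, ymax / 1000 * height],
            outline="red",
            width=3,
        )
    img.save(out_path)

# e.g. draw_boxes("/tmp/pelicans.jpeg", response.text) with the response above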
Fun, with this heron it found the reflection too:

>>> heron = PIL.Image.open("/tmp/heron.jpeg")
>>> prompt = 'Return bounding boxes around every heron, [ymin, xmin, ymax, xmax]'
>>> response = model.generate_content([heron, prompt])
>>> print(response.text)
- [431, 478, 625, 575]
- [224, 493, 411, 606]
Based on all of that, I built this tool: https://tools.simonwillison.net/gemini-bbox

You have to paste in a Gemini API key when you use it. See the full blog post here: https://simonwillison.net/2024/Aug/26/gemini-bounding-box-visualization/
I'd like to run an image model locally. The docs at https://llama-cpp-python.readthedocs.io/en/latest/#multi-modal-models seem to want a path to a CLIP model though, which I'm not sure how to obtain.
https://huggingface.co/xtuner/llava-llama-3-8b-v1_1-gguf would be a good one to figure out the Python recipe for.
According to perplexity.ai, "the mmproj model is essentially equivalent to the CLIP model in the context of llama-cpp-python and GGUF (GGML Unified Format) files for multimodal models like LLaVA and minicpm2.6".

https://huggingface.co/openbmb/MiniCPM-V-2_6-gguf/resolve/main/mmproj-model-f16.gguf?download=true

It appears the underlying embedding model used is google/siglip-base-patch16-224.
Wow, it appears this functionality was just added to llama-cpp-python yesterday. Eagerly looking forward to MiniCPM-V-2_6-gguf as a supported llm multimodal model.
I tried the newest 2.90 version of llama-cpp-python and it works! Instead of ggml-model-f16.gguf you can use ggml-model-Q4_K_M.gguf if you prefer.
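For reference, a rough sketch of what the llama-cpp-python setup looks like for MiniCPM-V-2.6, based on the library's documented chat-handler API - the handler class name, file names, and context size are assumptions, so check the multi-modal docs linked above:

import base64
from llama_cpp import Llama
from llama_cpp.llama_chat_format import MiniCPMv26ChatHandler

# The mmproj file plays the role of the CLIP/vision projector discussed above
chat_handler = MiniCPMv26ChatHandler(clip_model_path="mmproj-model-f16.gguf")
llm = Llama(
    model_path="ggml-model-Q4_K_M.gguf",
    chat_handler=chat_handler,
    n_ctx=4096,  # extra context to leave room for the image tokens
)

with open("/tmp/image.jpg", "rb") as f:
    data_uri = "data:image/jpeg;base64," + base64.b64encode(f.read()).decode("utf-8")

response = llm.create_chat_completion(
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": data_uri}},
                {"type": "text", "text": "OCR the text from the image."},
            ],
        }
    ]
)
print(response["choices"][0]["message"]["content"])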
Thank you! That’s exactly what I needed to know.
ollama 0.3.10, captured HTTP conversation to /api/chat via the ollama CLI client. The prompt was: "/tmp/image.jpg OCR the text from the image."

POST /api/chat HTTP/1.1
Host: 127.0.0.1:11434
User-Agent: ollama/0.3.10 (amd64 linux) Go/go1.22.5
Content-Length: 1370164
Accept: application/x-ndjson
Content-Type: application/json
Accept-Encoding: gzip
{"model":"minicpm-v","messages":[{"role":"user","content":" OCR the text from the image.","images":["/9j/2wC<truncated base64>/9k="]}],"format":"","options":{}} same JSON pretty-printed: {
"model":"minicpm-v",
"messages":[
{
"role":"user",
"content":" OCR the text from the image.",
"images":[
"/9j/2wC<truncated base64>/9k="
]
}
],
"format":"",
"options":{
}
}
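The same call is easy to reproduce from Python against a local Ollama server - a minimal sketch mirroring the captured JSON (the image path is a placeholder):

import base64
import requests

with open("/tmp/image.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = requests.post(
    "http://127.0.0.1:11434/api/chat",
    json={
        "model": "minicpm-v",
        "stream": False,  # return one JSON object instead of ndjson chunks
        "messages": [
            {
                "role": "user",
                "content": "OCR the text from the image.",
                # Ollama expects raw base64 strings, not data: URIs
                "images": [image_b64],
            }
        ],
    },
)
response.raise_for_status()
print(response.json()["message"]["content"])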
To aid in the design for both of these:
I'm going to gather a bunch of examples of how different LLMs accept multi-modal inputs. I'm particularly interested in the following:
(Can users just pass --file filename.ext, or do I need some other mechanism that helps provide the type as well?)