Multi-modal support for vision models such as GPT-4 vision #331
Indeed this would be awesome. Does it require changes to LLM core?
I suspect we'll be seeing more multimodal models so inclusion in core makes sense, but I defer to @simonw on this!
I've been thinking about this a lot. The challenge here is that we need to be able to mix both text and images together in the same prompt - because you can call GPT-4 vision with this kind of thing:
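A hedged sketch of that kind of call, using the openai Python SDK (1.x); the model name, example URLs and max_tokens value here are placeholder assumptions rather than anything from the original comment:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4-vision-preview",
    messages=[{
        "role": "user",
        # A single message whose content interleaves text parts and image parts
        "content": [
            {"type": "text", "text": "Take a look at this image:"},
            {"type": "image_url", "image_url": {"url": "https://example.com/image1.jpeg"}},
            {"type": "text", "text": "Now compare it to this:"},
            {"type": "image_url", "image_url": {"url": "https://example.com/image2.png"}},
        ],
    }],
    max_tokens=300,
)
print(response.choices[0].message.content)
```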
My first instinct was to support syntax like this:

```
llm -m gpt-4-vision \
  "Take a look at this image:" \
  -i image1.jpeg \
  "Now compare it to this:" \
  -i https://example.com/image2.png
```

Note that the -i option accepts either a file path or a URL here.

But... I don't think I can implement this, because Click really, really doesn't want to provide a mechanism for storing and retrieving the order of different arguments and parameters relative to each other:
I spent some time trying to get this to work with a custom Click command class. So now I'm considering the following instead:
The trick here is that new […]

On a related note:

```
llm chat -m gpt-4-vision
look at this image
!image image.jpeg
```

For multi-lined chats you would use the existing !multi mechanism:

```
llm chat -m gpt-4-vision
!multi
look at this image
!image image.jpeg
and compare it with
!image https://example.com/image.png
!end
```
Crucially, I want to leave the door open for other LLM models provided by plugins - like maybe https://github.com/SkunkworksAI/BakLLaVA - to also support multi-modal inputs like this. So I think the model class would have a […]
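For illustration, here is a minimal sketch of what such a plugin-provided model might look like. The supports_images flag is a hypothetical name (the actual attribute was cut off above), and the base-class signature is only approximately the LLM plugin API:

```python
import llm


class BakLLaVAVision(llm.Model):
    model_id = "bakllava"
    supports_images = True  # hypothetical capability flag, not an actual LLM attribute

    def execute(self, prompt, stream, response, conversation=None):
        # A real plugin would interleave the prompt text and any attached images
        # into the underlying model's request format here.
        yield "..."


@llm.hookimpl
def register_models(register):
    register(BakLLaVAVision())
```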
One note about the […]:

```
llm -m gpt-4-vision "Caption for this image" -i image.jpeg
```
This work is blocked on:
Would be amazing to get this working with a Bakllava local model - relevant example code using llama.cpp here: https://github.com/cocktailpeanut/mirror/blob/main/app.py
Another claimed Bakllava example (not tried it yet), this one using […] [Actually uses […]]
@simonw how about f-strings/templating style?
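For illustration, the invocation this suggests might look like `llm "Compare {img1} with {img2}" -i img1=photo1.jpeg -i img2=photo2.png` (a hypothetical command line, not something settled in this thread): {name} placeholders in the prompt mark where each named image should be attached. A minimal sketch of that expansion, with the Click option parsing the commenter sketched following below:

```python
import re

# Hypothetical sketch of the templating idea: the dict below is what the Click
# callback in the next snippet would produce from repeated -i key=filename options.
files = {"img1": "photo1.jpeg", "img2": "photo2.png"}
prompt = "Compare {img1} with {img2}"

parts = []
for i, chunk in enumerate(re.split(r"\{(\w+)\}", prompt)):
    if i % 2 == 0:
        # Even chunks are literal prompt text between placeholders
        if chunk.strip():
            parts.append(("text", chunk.strip()))
    else:
        # Odd chunks are placeholder names, resolved to the attached files
        parts.append(("image", files[chunk]))

print(parts)
# -> [('text', 'Compare'), ('image', 'photo1.jpeg'), ('text', 'with'), ('image', 'photo2.png')]
```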
```python
import click


def _infiles_to_dict(
    ctx: click.Context, attribute: click.Option, infiles: tuple[str, ...]
) -> dict[str, str]:
    # Turn repeated key=filename values into a {key: filename} mapping
    return {k: v for k, v in (f.split("=") for f in infiles)}


@click.command()
@click.option(
    "-i",
    "--infile",
    multiple=True,
    callback=_infiles_to_dict,
    help="Input files in the form key=filename. Multiple files can be included.",
)
def prompt(infile: dict[str, str]) -> None:  # hypothetical command for the option to attach to
    click.echo(infile)
```

Misc thoughts:
https://github.com/tbckr/sgpt: SGPT also supports the GPT-4 Vision API. Include input images using the […]

It is also possible to combine URLs and local images:
I built a prototype of this today, in the […]

I gave it this image: […]

And ran this:

```
llm -m 4v 'describe this image' -i image.jpg -o max_tokens 200
```

And got back: […]
Lots still to do on this - I want it to support either URLs or file paths or […]

Maybe have a thing with Pillow as an optional dependency which can resize the images before sending them?

Have to decide what to do about logs. I think I need to log the images to the SQLite database (maybe in a new […]
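A minimal sketch of the Pillow idea mentioned above - shrink an image before sending it to the API. Pillow is the optional dependency being floated; the size cap and function name are arbitrary choices, not anything decided in this thread:

```python
from io import BytesIO

from PIL import Image  # Pillow, the optional dependency being considered


def resized_jpeg(path: str, max_edge: int = 1024) -> bytes:
    # Shrink the longest edge to max_edge before sending, to keep request sizes down
    im = Image.open(path)
    im.thumbnail((max_edge, max_edge))  # preserves aspect ratio, only ever shrinks
    buf = BytesIO()
    im.convert("RGB").save(buf, format="JPEG")
    return buf.getvalue()
```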
I am going to pass around an image object that has a […]

That way plugins like OpenAI that can be sent URLs can use […]

I'm tempted to offer a […]
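A sketch of what that image object might look like. The class and method names are hypothetical (the actual attributes are cut off above), but the idea is that URL-capable APIs use the URL directly while other plugins ask for the bytes or base64 content; httpx is assumed here purely as an HTTP client:

```python
import base64
from dataclasses import dataclass
from pathlib import Path
from typing import Optional

import httpx  # assumption: any HTTP client would do for fetching URL content


@dataclass
class ImageAttachment:
    url: Optional[str] = None
    path: Optional[Path] = None

    def content_bytes(self) -> bytes:
        # URL-capable models can use .url directly and skip this; everyone else
        # gets the raw bytes, read from disk or fetched on demand.
        if self.path is not None:
            return self.path.read_bytes()
        return httpx.get(self.url).content

    def base64_content(self) -> str:
        return base64.b64encode(self.content_bytes()).decode("ascii")
```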
Idea: rather than store the images in the database, I'll store the path to the files on disk. If you attempt to continue a conversation where the file paths no longer resolve to existing images, you'll get an error.
Would be nice if the API server gave you a reference for every uploaded image that you could just refer back to.
Came here looking for non-text API endpoints... I was hoping for a direct view into the audio and text-to-speech API endpoints in particular. So while it would be nice to have llm offer a chat-like interface that interleaves images, maybe an easier first step would be simple "prompt-to-image", "prompt-to-audio", and "audio-to-text" commands?
Quick survey on Twitter: https://twitter.com/simonw/status/1768445876274635155

Consensus is loosely to do image and then text, rather than text then image:
Claude 3 Haiku is cheaper than GPT-3.5 Turbo and supports image inputs - a great incentive to finally get this feature shipped!
https://twitter.com/invisiblecomma/status/1768561708090417603
https://docs.anthropic.com/claude/docs/vision#image-best-practices
Should I enforce this for the Claude model? Easiest to let the Claude API return an error at first. I'm not yet sure if LLM should depend on Pillow and use it to resize large images before sending them. Maybe a plugin hook to allow things like resizing and HEIC conversion would be useful?
IMO […]

BTW, OpenAI supports both URLs and base64-encoded images.
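For reference, the base64 form is passed as a data: URL; a small sketch of producing one from a local file (the function name and MIME fallback are my own, not from this thread):

```python
import base64
import mimetypes


def image_to_data_url(path: str) -> str:
    # OpenAI's vision API accepts either an https:// URL or a data: URL like this
    mime = mimetypes.guess_type(path)[0] or "image/jpeg"
    with open(path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("ascii")
    return f"data:{mime};base64,{encoded}"


# Used as: {"type": "image_url", "image_url": {"url": image_to_data_url("image.jpg")}}
```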
I made a simple CLI for vision, if anyone needs it before llm-vision is ready. Only supports GPT-4 for now. :( It supports specifying an output format that prompts the model to generate markdown or JSON in addition to plain text.

One thing odd about gpt-4-vision is that it doesn't know you have given it an image, and sometimes doesn't believe it has vision capabilities unless you give it a phrase like 'describe the image'. But if you want to extract an image to JSON, then a text description isn't very useful. So I prompt it with 'describe the image in your head, then write the json document'.

There's also a work-in-progress gpt4-vision-screen-compare.py - this takes a screenshot every few seconds, compares the similarity with the last screenshot, and if different enough it sends it to the model asking it to explain the changes between them.

And here's a demo of what you can do with it: https://twitter.com/xundecidability/status/1763219017160867840

Solution: A little bash script that:
Current status:

- Branch has -i support
- I have GPT-4 Vision support, plus branches of llm-gemini and llm-claude-3

The main sticking point is what to do with the SQLite logging mechanism. It's important that llm -c "..." works for sending follow-up prompts. This means it needs to be able to send the image again.

Some ways that could work:

- For images on disk, store the path to that image on disk. Use that again in follow-up prompts, and throw a hard error if the file is no longer visible.
- Some models support URLs. For public URLs to images I can store those URLs, and let the APIs themselves error if the URLs are 404ing.
- Images fed in to standard input could be stored in the database, maybe as BLOB columns.
- But since being able to compare prompts and responses is so useful, maybe I should store images from disk in BLOBs too? The cost in terms of SQLite space taken up may be worth it.
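One way the BLOB option could look in SQLite; this schema is purely illustrative (table and column names are made up, not what LLM ended up doing):

```python
import sqlite3

conn = sqlite3.connect("logs.db")
conn.execute(
    """
    CREATE TABLE IF NOT EXISTS images (
        id INTEGER PRIMARY KEY,
        response_id INTEGER,   -- which logged prompt/response the image belongs to
        source TEXT,           -- 'url', 'path' or 'stdin'
        url_or_path TEXT,      -- recorded for URL and file inputs
        content BLOB           -- raw bytes, for piped-in images (and optionally files)
    )
    """
)
with open("image.jpg", "rb") as f:
    conn.execute(
        "INSERT INTO images (response_id, source, content) VALUES (?, ?, ?)",
        (1, "stdin", f.read()),
    )
conn.commit()
```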
Very nice! I'm not sure I'd want to include the image in every turn, though. I send a lot of full screenshots and my poor connection doesn't help. What I do currently is generate the description with a python script and pipe that to llm to chat about it. If it's important I might include the file path in the prompt. Then the llm can act on the file, and I can search for the file in the logs DB.

Cheers,
Thomas
@simonw Just add an option […]
Another open question: how should this work in chat? I'm inclined to add […]

But then should it be submitted the moment you hit enter, or should you get the opportunity to add a prompt afterwards? I think adding a prompt afterwards makes sense.

Also should […]
Yeah, I'm beginning to think I may need to add a whole settings/preferences mechanism to help solve this.
Perhaps you can use a TUI hotkey? E.g., Ctrl-i for inserting images. The ideal case is to be able to just paste, and detect images from the clipboard. But this seems impossible to do using native paste. Perhaps you can add a custom hotkey for pasting that checks the clipboard. I have some functions for macOS that paste images, e.g., […]
For pasting I think I'll hold off until I have a web UI working - much easier to handle paste there (e.g. https://tools.simonwillison.net/ocr does that) than figure it out for the terminal.

It would be good to get this working though:

```
pbpaste | llm -m claude-3-opus 'describe this image' -i -
```

Oh, that's frustrating: it looks like […]

ChatGPT did come up with this recipe which seems to work:

```
osascript -e 'set theImage to the clipboard as «class PNGf»' \
  -e 'set theFile to open for access POSIX file "/tmp/clipboard.png" with write permission' \
  -e 'write theImage to theFile' \
  -e 'close access theFile' \
  && cat /tmp/clipboard.png && rm /tmp/clipboard.png
```

I imagine there are cleaner implementations than that. Would be easy to wrap one into a little […]
I saved this in […]:

```
#!/bin/zsh

# Generate a unique temporary filename
tempfile=$(mktemp -t clipboard.XXXXXXXXXX.png)

# Save the clipboard image to the temporary file
osascript -e 'set theImage to the clipboard as «class PNGf»' \
  -e "set theFile to open for access POSIX file \"$tempfile\" with write permission" \
  -e 'write theImage to theFile' \
  -e 'close access theFile'

# Output the image data to stdout
cat "$tempfile"

# Delete the temporary file
rm "$tempfile"
```

Opus conversation here: https://gist.github.com/simonw/736bcc9bcfaef40a55deaa959fd57ca8
Turned that into a TIL: https://til.simonwillison.net/macos/impaste
@simonw I was inspired by your TIL to try a little Swift. Here's an executable that does roughly the same thing: https://github.com/paulsmith/pbimg. Also used Claude Opus to help get started.
OK, design decision regarding logging of images. All models will support URL input. If the model can handle URLs directly, those will be passed to the model - for models that can't retrieve URLs themselves, LLM will fetch the content and pass it to the model.

If you provide a URL, then just that URL string will be logged to the database. If you provide a path to a file on disk, the full resolved path will be stored. If you pipe an image into the tool (with […]

You can also pass image file names and use a --save-images option to write them to that table too. This is mainly useful if you are building a research database of prompts and responses and want to pass that around.
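The decision above boils down to a small dispatch on the input form; an illustrative sketch only (the function and field names are mine, not LLM's actual code):

```python
from pathlib import Path


def logged_image_value(image_arg: str | None, piped_bytes: bytes | None = None) -> dict:
    # Mirrors the rule described above: URLs are logged as plain strings, file
    # paths as fully resolved paths, and piped-in images as the bytes themselves.
    if piped_bytes is not None:
        return {"kind": "blob", "content": piped_bytes}
    if image_arg.startswith(("http://", "https://")):
        return {"kind": "url", "value": image_arg}
    return {"kind": "path", "value": str(Path(image_arg).resolve())}
```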
@simonw I guess you should add a command to clean image blobs out of the database, and automatically purge blobs older than […]
The option for storing the images should be […] (see line 145 in 12e027d).
I'm not sure if the discussion about ways to pass in multiple images is still open, but what about just using markdown? E.g. if […]

This has the advantage that a lot of existing markdown files in the wild could just be passed in without modification, and the llm would "see" what the human sees.
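A minimal sketch of what detecting markdown image references in a prompt could look like (the regex and function are illustrative only):

```python
import re

MARKDOWN_IMAGE = re.compile(r"!\[([^\]]*)\]\(([^)]+)\)")


def extract_images(prompt: str) -> list[str]:
    # Returns the URL or file path of every ![alt](target) reference in the prompt
    return [target for _alt, target in MARKDOWN_IMAGE.findall(prompt)]


extract_images("Compare ![a](photo1.jpg) with ![b](https://example.com/photo2.png)")
# -> ['photo1.jpg', 'https://example.com/photo2.png']
```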
@cmungall this is the simplest approach for sure. Could even support tags or raw URLs/file paths. The downside is you're now making network calls based on the input text. You also need a way of turning the feature off, and also escaping whatever syntax is used.
I'm going to change this to […]
... or maybe not. I don't actually know how all of the models that handle images/audio/video are going to work. If they need me to pass them inputs as a specific type - a URL to a video that's marked as video, or to audio that's marked as audio - then bundling everything together as […]

So maybe I do […]
Yes, I think we need to handle the many different attachment file extensions. Having just --attachment is good, but I think we need to implement a filter that chooses the right processor for each extension.
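A sketch of the extension-based dispatch being suggested, using Python's mimetypes module to pick a coarse processor category (purely illustrative, not anything implemented in LLM):

```python
import mimetypes


def attachment_kind(filename: str) -> str:
    # Map a file name to a coarse category that a processor could be chosen from
    mime, _ = mimetypes.guess_type(filename)
    return mime.split("/")[0] if mime else "unknown"


attachment_kind("photo.png")  # -> "image"
attachment_kind("clip.mp3")   # -> "audio"
attachment_kind("talk.mp4")   # -> "video"
```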
I also like […]
No update on this? The lack of multimodality is really a major reason I'm not using llm as much anymore :/
Really want multi-modal as well
I'm going to do this here instead: […]
https://platform.openai.com/docs/guides/vision

I think this is best handled by command line options --image and --image-urls to either encode and pass as base64, or to pass a URL.