ImageToText & AnswerToImage #2444

Closed
ZanSara opened this issue Apr 21, 2022 · 9 comments
Labels
topic:images type:feature New feature or request

Comments

@ZanSara
Contributor

ZanSara commented Apr 21, 2022

[Part of #2418]

What
ImageToText would be a node that takes a list of paths to images and captions them. The captions are then stored as Documents, with the path to the image in their metadata. The captions will be processed as regular documents, so no radical changes are expected in the core of the framework.

Why
ImageToText could be a nice test of how Haystack could take images as input in indexing pipelines, and help open the path for image support in general.

These changes should be added as separate PRs.

@ZanSara ZanSara self-assigned this Apr 21, 2022
@ZanSara ZanSara added type:feature New feature or request topic:images labels Apr 21, 2022
@ZanSara ZanSara changed the title ImageToText ImageToText & AnswerToImage Apr 21, 2022
@ZanSara ZanSara mentioned this issue Apr 21, 2022
8 tasks
@masci masci assigned vblagoje and unassigned ZanSara Aug 3, 2022
@anakin87
Member

As I was starting to familiarize myself with images/multimodal support, I ran into this issue.

Are these features still desirable?
Has any detail changed?

@masci
Contributor

masci commented Nov 25, 2022

@anakin87 yes, still something we want to do. We'll keep this issue updated should we change anything.

@anakin87
Member

anakin87 commented Dec 1, 2022

As you can see in this Space, Transformers models for image captioning are available nowadays.

@ZanSara if you can provide more details about the architecture/design of this node, it shouldn't be too difficult to develop it.

Bonus point: there are also several Transformers models for OCR.
So in the future, this node could also be adapted for this task, if the user wants to use such models.

@ZanSara
Contributor Author

ZanSara commented Dec 5, 2022

Hello @anakin87! So the idea behind these two nodes was fairly basic.

ImageToText

This is indeed a simple node doing image captioning. Inputs are image Documents and outputs should be text Documents with a path to the source image in their metadata. Not a lot of magic involved: any image captioning model would do.
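The input/output contract described above can be sketched roughly as follows. This is only an illustration under stated assumptions: `Document` here is a hypothetical stand-in dataclass (not the real Haystack class), the image path is assumed to live in `content`, the metadata key `image_path` is made up, and `fake_captioner` is a placeholder for whatever captioning model is plugged in.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Document:
    # Hypothetical stand-in for a Haystack Document, not the real class.
    content: str
    content_type: str = "text"
    meta: Dict[str, str] = field(default_factory=dict)


def image_to_text(image_docs: List[Document], captioner: Callable[[str], str]) -> List[Document]:
    """Caption each image Document and return text Documents that point back at the image."""
    captions = []
    for doc in image_docs:
        if doc.content_type != "image":
            raise ValueError(f"expected an image Document, got {doc.content_type!r}")
        image_path = doc.content  # assumption: image Documents store the file path as content
        captions.append(
            Document(
                content=captioner(image_path),
                content_type="text",
                meta={"image_path": image_path},  # assumed metadata key
            )
        )
    return captions


def fake_captioner(path: str) -> str:
    # Placeholder so the sketch runs without a model; swap in any captioning model.
    return f"a caption for {path}"


docs = image_to_text([Document(content="images/cat.jpg", content_type="image")], fake_captioner)
```

Passing the captioner in as a plain callable keeps the node independent of any specific model, which matches the "any image captioning model would do" idea.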

About OCR:

I had imagined this node to do image captioning on pictures that contain way more than text, but as you mention, OCR also fits this description. I would not optimize for OCR here: that would be more the task of a DocumentConverter. However, I don't think we should explicitly ban this usage. Maybe someone will find it useful.

(Generative)AnswerToImage

This one is more fun 😁 The basic idea is that if I put this node at the end of my pipeline, I want it to produce an image, even though the answer I got was text. So this node would take an Answer, pass the answer's text to an image generation model (DALL-E, Stable Diffusion, etc.) and return the same Answer with a path to the generated image in its metadata. Since diffusion models usually need quite a bit of prompt tweaking, consider letting users prefix and suffix the prompt with style directives and other tweaks.
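A minimal sketch of the prefix/suffix idea, assuming a hypothetical `Answer` dataclass and an injected `generate_image` callable that returns a file path. None of these names come from Haystack, and the metadata keys `image_path` and `image_prompt` are invented for the example.

```python
from dataclasses import dataclass, field, replace
from typing import Callable, Dict


@dataclass
class Answer:
    # Hypothetical stand-in for a Haystack Answer, not the real class.
    answer: str
    meta: Dict[str, str] = field(default_factory=dict)


def build_prompt(text: str, prefix: str = "", suffix: str = "") -> str:
    """Wrap the answer text in optional style directives, skipping empty parts."""
    return " ".join(part.strip() for part in (prefix, text, suffix) if part.strip())


def answer_to_image(
    answer: Answer,
    generate_image: Callable[[str], str],
    prefix: str = "",
    suffix: str = "",
) -> Answer:
    """Return a copy of the Answer with the generated image path added to its metadata."""
    prompt = build_prompt(answer.answer, prefix, suffix)
    image_path = generate_image(prompt)  # assumption: the generator saves a file and returns its path
    return replace(answer, meta={**answer.meta, "image_path": image_path, "image_prompt": prompt})


def fake_generator(prompt: str) -> str:
    # Placeholder for DALL-E, Stable Diffusion, etc.; no image is actually produced.
    return "generated/image.png"


result = answer_to_image(
    Answer(answer="Berlin at night"),
    fake_generator,
    prefix="a painting of",
    suffix="in watercolor style",
)
```

Storing the final prompt next to the image path is a design choice of this sketch; it makes it easier to see which style directives produced a given image.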

Future steps

In the future we should consider having nodes like ExtractiveAnswerToImage that would support VisualQA. Something like: if the Answer object contains a mask, ExtractiveAnswerToImage should be able to crop the source image and return only the content of the mask. However, we have no nodes generating image masks for VisualQA right now, so such a node is premature.
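To make the mask-cropping idea concrete, here is a toy sketch. It assumes the image and the mask are same-sized nested lists (rows of pixel values and rows of 0/1 flags); a real node would more likely work on arrays or an imaging library, and the function names are hypothetical.

```python
from typing import List, Tuple

Image = List[List[int]]
Mask = List[List[int]]


def mask_bounding_box(mask: Mask) -> Tuple[int, int, int, int]:
    """Return (top, left, bottom, right) of the smallest box containing every set mask cell."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    if not rows:
        raise ValueError("mask is empty")
    cols = [c for c in range(len(mask[0])) if any(row[c] for row in mask)]
    return rows[0], cols[0], rows[-1] + 1, cols[-1] + 1


def crop_to_mask(image: Image, mask: Mask, fill: int = 0) -> Image:
    """Crop the image to the mask's bounding box, blanking pixels outside the mask."""
    top, left, bottom, right = mask_bounding_box(mask)
    return [
        [image[r][c] if mask[r][c] else fill for c in range(left, right)]
        for r in range(top, bottom)
    ]


image = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
]
mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 0],
]
cropped = crop_to_mask(image, mask)  # [[6, 7], [0, 11]]
```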

If you want to play more with image-based pipelines there are many more ideas to consider (such as implementing VisualQA), so let me know!

@anakin87
Member

Hey @ZanSara!

Speaking of the ImageToText node, what should the input be in your opinion?

  • an image Document, already imported in Haystack
  • an image file path

@ZanSara
Contributor Author

ZanSara commented Jan 12, 2023

Hello @anakin87! That's a really good question... but maybe we don't have to choose? 😁 Do you think we can make it work for both? I'd imagine it can, but let me know if you face issues or you don't like the idea.

If we need to select just one, however, I'd lean towards Document.
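One way to support both input kinds is to normalize everything to paths up front. The sketch below assumes a hypothetical `Document` dataclass whose `content` holds the image path; it is not the real Haystack class, and `to_image_paths` is an invented helper name.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Dict, Iterable, List, Union


@dataclass
class Document:
    # Hypothetical stand-in for a Haystack Document, not the real class.
    content: str
    content_type: str = "image"
    meta: Dict[str, str] = field(default_factory=dict)


def to_image_paths(inputs: Iterable[Union[str, Path, Document]]) -> List[Path]:
    """Accept image Documents and raw file paths alike, returning a uniform list of Paths."""
    paths = []
    for item in inputs:
        if isinstance(item, Document):
            if item.content_type != "image":
                raise ValueError(f"expected an image Document, got {item.content_type!r}")
            paths.append(Path(item.content))  # assumption: content holds the image path
        elif isinstance(item, (str, Path)):
            paths.append(Path(item))
        else:
            raise TypeError(f"unsupported input type: {type(item).__name__}")
    return paths


paths = to_image_paths(["a.png", Path("b.jpg"), Document(content="c.gif")])
```

With this in place, the captioning logic only ever deals with paths, so supporting both inputs costs a single conversion step.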

@ZanSara
Contributor Author

ZanSara commented Mar 15, 2023

ImageToText is done, AnswerToImage left to do! Contributions are welcome 😊

@anakin87
Member

ImageToText is done, AnswerToImage left to do! Contributions are welcome 😊

Implementation related to AnswerToImage in fastRAG: https://github.com/IntelLabs/fastRAG#retrieval-oriented-answer-image-generation

@TuanaCelik played with it recently.

@masci
Contributor

masci commented Dec 13, 2023

Closing as superseded by the changes introduced in Haystack 2.x

@masci masci closed this as completed Dec 13, 2023