`ImageToText` & `AnswerToImage` #2444

ZanSara · 2022-04-21T15:19:58Z

[Part of #2418]

What
ImageToText would be a node that takes a list of paths to images and captions them. The captions are then stored as Documents, with the path to the image in their metadata. The captions will be processed as regular documents, so no radical changes are expected in the core of the framework.
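A minimal sketch of that flow, assuming nothing about the final node API: `Document` here is a hypothetical stand-in dataclass (not the Haystack class), `caption_images` is an illustrative helper, and `fake_captioner` is a placeholder for whatever image captioning model ends up being used.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Document:
    # Hypothetical stand-in for a Haystack Document, used only for illustration.
    content: str
    meta: Dict[str, str] = field(default_factory=dict)


def caption_images(image_paths: List[str], captioner: Callable[[str], str]) -> List[Document]:
    """Caption each image and keep the source path in the Document metadata."""
    return [
        Document(content=captioner(path), meta={"image_path": path})
        for path in image_paths
    ]


def fake_captioner(path: str) -> str:
    # Placeholder: a real node would run an image captioning model here.
    return f"a picture stored at {path}"


docs = caption_images(["cat.png", "dog.jpg"], fake_captioner)
for doc in docs:
    print(doc.content, doc.meta)
```

Because the output is an ordinary text Document, the rest of an indexing pipeline could treat the captions like any other text.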

Why
ImageToText could be a nice test of how Haystack could take images as input in indexing pipelines, and could help open the path for image support in general.

These changes should be added as separate PRs.


anakin87 · 2022-11-24T11:16:23Z

As I was starting to familiarize myself with images/multimodal support, I ran into this issue.

Are these features still desirable?
Has any detail changed?

masci · 2022-11-25T09:31:12Z

@anakin87 yes, still something we want to do. We'll keep this issue updated should we change anything.

anakin87 · 2022-12-01T15:40:35Z

As you can see in this Space, Transformers models for image captioning are available nowadays.

@ZanSara if you can provide more details about the architecture/design of this node, it shouldn't be too difficult to develop it.

Bonus point: there are also several Transformers models for OCR.
So in the future, this node could also be adapted for this task, if the user wants to use such models.

ZanSara · 2022-12-05T10:28:46Z

Hello @anakin87! The idea behind these two nodes was fairly basic.

`ImageToText`

This is indeed a simple node doing image captioning. The inputs are image Documents and the outputs should be text Documents with a path to the source image in their metadata. Not a lot of magic involved: any image captioning model would do.

About OCR:

I had imagined this node to do image captioning on pictures that contain way more than text, but as you mention, OCR also fits this description. I would not optimize for OCR here: that would be more the task of a DocumentConverter. However, I don't think we should explicitly ban this usage. Maybe someone will find it useful.

`(Generative)AnswerToImage`

This one, instead, is more fun 😁 The base idea is that if I put this node at the end of my pipeline, I want it to produce an image, even though the answer I got was text. So this node would take an Answer, pass the answer's text to an image generation model (DALL-E, Stable Diffusion, etc.) and return the same Answer with a path to the generated image in the metadata. Since prompts for diffusion models usually need quite a bit of tweaking, consider letting users prefix and suffix the prompt with style directives and other tweaks.
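As a rough sketch of that idea, under stated assumptions: `Answer` is a hypothetical stand-in dataclass, `build_prompt` and `answer_to_image` are illustrative names, the comma separator is an arbitrary choice, and `fake_generator` only returns a path without generating anything. A real node would call an image generation model at that point.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class Answer:
    # Hypothetical stand-in for a Haystack Answer, used only for illustration.
    answer: str
    meta: Dict[str, str] = field(default_factory=dict)


def build_prompt(text: str, prefix: str = "", suffix: str = "") -> str:
    """Join the optional style prefix, the answer text and the optional suffix."""
    parts = [prefix.strip(), text.strip(), suffix.strip()]
    return ", ".join(part for part in parts if part)


def answer_to_image(
    answer: Answer,
    generate_image: Callable[[str], str],
    prefix: str = "",
    suffix: str = "",
) -> Answer:
    """Return a copy of the Answer with the generated image path in its metadata."""
    prompt = build_prompt(answer.answer, prefix, suffix)
    image_path = generate_image(prompt)
    new_meta = {**answer.meta, "image_path": image_path, "image_prompt": prompt}
    return Answer(answer=answer.answer, meta=new_meta)


def fake_generator(prompt: str) -> str:
    # Placeholder: no image is written; a real model would save one and return its path.
    return f"generated_{len(prompt)}.png"


result = answer_to_image(
    Answer("a red fox in the snow"),
    fake_generator,
    prefix="oil painting",
    suffix="high detail",
)
print(result.meta["image_prompt"])
print(result.meta["image_path"])
```

Keeping the image generator behind a plain callable means the same node logic could wrap DALL-E, Stable Diffusion or any other backend.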

Future steps

In the future we should consider having nodes like ExtractiveAnswerToImage that would support VisualQA. Something like: if the Answer object contains a mask, ExtractiveAnswerToImage should be able to crop the source image and return only the content of the mask. However, we have no nodes generating image masks for VisualQA right now, so such a node is premature.
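To make the cropping idea concrete, here is a small sketch. The mask format is an assumption (a 2D grid of booleans with the same shape as the image), since no node produces masks yet; `mask_bounding_box` and `crop_to_mask` are hypothetical helpers, and the image is a plain nested list so the example has no dependencies.

```python
from typing import List, Optional, Tuple

Image = List[List[int]]   # assumed: rows of pixel values
Mask = List[List[bool]]   # assumed: same shape as the image, True inside the answer region


def mask_bounding_box(mask: Mask) -> Optional[Tuple[int, int, int, int]]:
    """Return (top, left, bottom, right) of the masked area, or None if the mask is empty."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    if not rows:
        return None
    cols = [c for row in mask for c, value in enumerate(row) if value]
    return rows[0], min(cols), rows[-1] + 1, max(cols) + 1


def crop_to_mask(image: Image, mask: Mask) -> Image:
    """Crop the image to the bounding box of the mask."""
    box = mask_bounding_box(mask)
    if box is None:
        return []
    top, left, bottom, right = box
    return [row[left:right] for row in image[top:bottom]]


image = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
]
mask = [
    [False, False, False, False],
    [False, True, True, False],
    [False, True, False, False],
]
print(crop_to_mask(image, mask))  # [[6, 7], [10, 11]]
```

This crops to the bounding box only; blanking out pixels that fall inside the box but outside the mask would be a separate step.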

If you want to play more with image-based pipelines, there are many more ideas to consider (namely, implementing VisualQA), so let me know!

anakin87 · 2023-01-11T20:17:36Z

Hey @ZanSara!

Speaking of the ImageToText node, what should the input be in your opinion?

- an image Document, already imported in Haystack
- an image file path

ZanSara · 2023-01-12T10:31:27Z

Hello @anakin87! That's a really good question... but maybe we don't have to choose? 😁 Do you think we can make it work for both? I'd imagine it can, but let me know if you face issues or you don't like the idea.

If we need to select just one, however, I'd lean towards Document.
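One way supporting both could look, as a sketch only: normalize every input to a file path before captioning. `ImageDocument` is a hypothetical stand-in whose `content` is assumed to hold the image path, and `to_image_paths` is an illustrative helper name.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union


@dataclass
class ImageDocument:
    # Hypothetical stand-in: content is assumed to hold the path to the image file.
    content: str
    content_type: str = "image"
    meta: Dict[str, str] = field(default_factory=dict)


def to_image_paths(inputs: List[Union[str, ImageDocument]]) -> List[str]:
    """Accept file paths and image Documents alike, and return plain file paths."""
    paths = []
    for item in inputs:
        if isinstance(item, str):
            paths.append(item)
        elif isinstance(item, ImageDocument):
            if item.content_type != "image":
                raise ValueError(f"Expected an image Document, got {item.content_type!r}")
            paths.append(item.content)
        else:
            raise TypeError(f"Unsupported input type: {type(item).__name__}")
    return paths


paths = to_image_paths(["a.png", ImageDocument(content="b.jpg")])
print(paths)  # ['a.png', 'b.jpg']
```

With this kind of normalization the captioning logic itself only ever deals with paths, whichever form the caller passes in.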

ZanSara · 2023-03-15T10:28:02Z

ImageToText is done, AnswerToImage left to do! Contributions are welcome 😊

anakin87 · 2023-03-15T10:56:08Z

> ImageToText is done, AnswerToImage left to do! Contributions are welcome 😊

Implementation related to AnswerToImage in fastRAG: https://github.com/IntelLabs/fastRAG#retrieval-oriented-answer-image-generation

@TuanaCelik played with it recently.

masci · 2023-12-13T18:07:22Z

Closing as superseded by the changes introduced in Haystack 2.x

ZanSara self-assigned this Apr 21, 2022

ZanSara added the type:feature and topic:images labels Apr 21, 2022

ZanSara changed the title ~~ImageToText~~ ImageToText & AnswerToImage Apr 21, 2022

ZanSara mentioned this issue Apr 21, 2022

Add support for images #2418

Closed

8 tasks

masci assigned vblagoje and unassigned ZanSara Aug 3, 2022

masci unassigned vblagoje Oct 19, 2022

anakin87 mentioned this issue Jan 14, 2023

feat: ImageToText (caption generator) #3859

Merged

6 tasks

masci closed this as completed Dec 13, 2023
