
[Pipelines] Problems with an image-to-text fine-tuned model #21514

Closed
sayakpaul opened this issue Feb 8, 2023 · 25 comments
Labels
Core: Pipeline Internals of the library; Pipeline.

Comments

@sayakpaul
Member

I have fine-tuned the microsoft/git-base model on image captioning (Colab Notebook).

I am trying to use the model with 🤗 Pipelines:

from transformers import pipeline

captioner = pipeline(model="sayakpaul/git-base-pokemon", tokenizer=processor.tokenizer, feature_extractor=processor)
captioner("https://huggingface.co/datasets/sayakpaul/sample-datasets/resolve/main/pokemon.png")

It only spits out:

[{'generated_text': 'https://huggingface.co/datasets/sayakpaul/sample-datasets/resolve/main/pokemon.png'}]

If you check the Colab Notebook, you will notice that it works okay when the inference is performed explicitly, i.e., without pipelines. Is it because the architecture tagged with the model is GitForCausalLM?
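For reference, the explicit path is essentially the standard GIT captioning recipe, roughly (a sketch based on the notebook linked above; the exact generation settings are assumptions):

import requests
from PIL import Image
from transformers import AutoProcessor, AutoModelForCausalLM

processor = AutoProcessor.from_pretrained("sayakpaul/git-base-pokemon")
model = AutoModelForCausalLM.from_pretrained("sayakpaul/git-base-pokemon")

url = "https://huggingface.co/datasets/sayakpaul/sample-datasets/resolve/main/pokemon.png"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

# Only pixel_values are passed, so the model generates a caption from scratch.
pixel_values = processor(images=image, return_tensors="pt").pixel_values
generated_ids = model.generate(pixel_values=pixel_values, max_length=50)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])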

Also note that on the model repo, there is a tag "Image To Text" WHICH I HAVE MANUALLY ADDED to see if that has any effect. By default, the model gets tagged as a text generation model.

@Narsil is it out of scope to support this model in an image-to-text generation pipeline?

@sayakpaul sayakpaul added the Core: Pipeline Internals of the library; Pipeline. label Feb 8, 2023
@Narsil
Contributor

Narsil commented Feb 8, 2023

I'm not well versed with Git as a model.

Pipelines are usually agnostic to actual models. As long as model X is an AutoModelForVision2Seq model, it should work out of the box.

If the architecture is different, we can discuss what's done and how to implement.
The rule of thumb:

  • Most models should get supported out of the box through sheer consistency with other models (the pipeline being agnostic).
  • The more popular a model, the more likely we are to add support (who would have guessed).
  • A pipeline is defined by its I/O, without parameters; as long as the model can provide that, we can support it.
  • Extra parameters are to be added based on the task at hand, not on what a particular model does. (For instance, timestamps in audio recognition are useful for automated subtitling; it doesn't matter how models actually implement them.)
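To make the "agnostic" part concrete, the only contract the image-to-text pipeline relies on is roughly this (a sketch; the model id is purely illustrative):

from transformers import AutoModelForVision2Seq

# Any checkpoint whose architecture is registered in the Vision2Seq mapping can be
# loaded this way, and its generate() accepts pixel_values -- no model-specific code needed.
model = AutoModelForVision2Seq.from_pretrained("nlpconnect/vit-gpt2-image-captioning")
print(type(model).__name__)  # expected: VisionEncoderDecoderModel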

@sayakpaul
Member Author

Fair enough!

I guess the main reason the pipeline is acting weird is that the model is loaded via AutoModelForCausalLM. Looking at the source code, it's not implemented for AutoModelForVision2Seq, I think.

@Narsil
Contributor

Narsil commented Feb 8, 2023

Yes, that's exactly it. In the absence of tags, the Hub will check the config and assign a pipeline based on the architecture name (ForXXX), just like the pipeline does.
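For illustration, the architecture the Hub (and the pipeline) looks at can be inspected directly (a sketch; the expected output is inferred from the discussion above):

from transformers import AutoConfig

config = AutoConfig.from_pretrained("sayakpaul/git-base-pokemon")
# Expected to print ['GitForCausalLM'], which is why the repo is tagged as
# text-generation rather than image-to-text by default.
print(config.architectures)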

@Narsil
Contributor

Narsil commented Feb 8, 2023

Do you have a sample script to make it work for captioning?

@sayakpaul
Member Author

Do you have a sample script to make it work for captioning?

If you check the Colab Notebook I linked above, you will see it at the end (the inference section).

@NielsRogge
Contributor

The pipeline currently only supports classes that are instances of VisionEncoderDecoderModel.

GitForCausalLM isn't supported, as that would require extending the image-to-text pipeline.

@Narsil
Contributor

Narsil commented Feb 8, 2023

Seems to me that the Colab does pretty much what the pipeline does:

https://github.com/huggingface/transformers/blob/main/src/transformers/pipelines/image_to_text.py#L114

Any reason not to implement ForVision2Seq?

@sayakpaul
Member Author

Yeah, that is my understanding as well. Maybe @NielsRogge can provide more on:

Any reason not to implement ForVision2Seq?

@NielsRogge
Contributor

NielsRogge commented Feb 8, 2023

Any reason not to implement ForVision2Seq?

The image-to-text pipeline currently only supports the MODEL_FOR_VISION_2_SEQ_MAPPING as seen here (hence the AutoModelForVision2Seq class); however, GIT is a special model that is part of the MODEL_FOR_CAUSAL_LM_MAPPING. The reason it's only defined in that mapping is that GitForCausalLM can technically also be used as a text-only generative model, like GPT-2.

But in practice, GitForCausalLM is only used for image (and video)-to-text use cases. It is a custom model but has the same API as the AutoModelForVision2Seq class (it takes pixel_values as input to the generate method). So it wouldn't make sense to add the MODEL_FOR_CAUSAL_LM_MAPPING to this pipeline; rather, it probably makes sense to have GitForCausalLM as a standalone class in this pipeline, but I'm not sure this is feasible/allowed.
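That split can be checked directly against the auto mappings (a sketch; the second result reflects the state of the library at the time of this issue):

from transformers import GitConfig, MODEL_FOR_CAUSAL_LM_MAPPING, MODEL_FOR_VISION_2_SEQ_MAPPING

# GIT is registered as a causal LM...
print(GitConfig in MODEL_FOR_CAUSAL_LM_MAPPING)      # True
# ...but not as a vision-to-text model, so the image-to-text pipeline won't accept it.
print(GitConfig in MODEL_FOR_VISION_2_SEQ_MAPPING)   # False at the time of this issue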

@Narsil
Contributor

Narsil commented Feb 8, 2023

It is a custom model but has the same API as the AutoModelForVision2Seq class

So make it ForVision2Seq, no? As long as it upholds the invariant (signature + return type), then by definition it's OK to add it...

@NielsRogge
Contributor

I've added BLIP and BLIP-2 to the ForVision2Seq mapping, making them usable with the image-to-text pipeline: #21802.

However, GIT can't be added out-of-the-box, due to GitForCausalLM not having pixel_values as the first argument in its forward call, but rather input_ids (unlike models like VisionEncoderDecoderModel, BlipForConditionalGeneration, etc).
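For reference, with #21802 merged, BLIP captioning through the pipeline works out of the box, e.g. (model id used purely for illustration):

from transformers import pipeline

captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")
print(captioner("https://huggingface.co/datasets/sayakpaul/sample-datasets/resolve/main/pokemon.png"))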

@Narsil
Contributor

Narsil commented Feb 27, 2023

What do those input_ids correspond to?

If they are LongTensors like regular input ids, where does the image go? Does it need a combination or not?

@NielsRogge
Contributor

GIT is a bit special in the sense that it can be viewed as a GPT-2 model, taking input_ids as input (the text prompt), and one can optionally provide pixel_values to condition the model on an image.

  • If you only provide pixel_values, then it will caption the image.
  • If you only provide input_ids, then it will behave like GPT-2.
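A minimal sketch of those two modes (the base model id, prompt, and generation settings are only illustrative):

import requests
from PIL import Image
from transformers import AutoProcessor, AutoModelForCausalLM

processor = AutoProcessor.from_pretrained("microsoft/git-base")
model = AutoModelForCausalLM.from_pretrained("microsoft/git-base")

url = "https://huggingface.co/datasets/sayakpaul/sample-datasets/resolve/main/pokemon.png"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

# 1) pixel_values only -> image captioning
pixel_values = processor(images=image, return_tensors="pt").pixel_values
caption_ids = model.generate(pixel_values=pixel_values, max_length=50)
print(processor.batch_decode(caption_ids, skip_special_tokens=True)[0])

# 2) input_ids only -> plain left-to-right text generation, GPT-2 style
input_ids = processor(text="a drawing of", return_tensors="pt").input_ids
text_ids = model.generate(input_ids=input_ids, max_length=50)
print(processor.batch_decode(text_ids, skip_special_tokens=True)[0])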

@Narsil
Contributor

Narsil commented Feb 27, 2023

Shouldn't we implement GitForVision2Seq in the first case? It's a classic encoder-decoder case, correct?

@NielsRogge
Contributor

NielsRogge commented Feb 27, 2023

No, GitForCausalLM should ideally be added directly to the Vision2Seq mapping. It's a decoder-only Transformer. The only reason we can't add it directly is that it doesn't take pixel_values as its first keyword argument, which is what the image-to-text pipeline expects.

@Narsil
Contributor

Narsil commented Feb 27, 2023

It seems input_ids is not necessary: https://colab.research.google.com/drive/1sLiSY2ixv52yqlw9EUW6BPwCq9XPLc9R#scrollTo=b3XKBvEcU_PR&line=2&uniqifier=1

No?
If it's a regular decoder for the text, then the decoder_input_ids should automatically be set by generate, making GitForVision2Seq possible. No?

@NielsRogge
Contributor

NielsRogge commented Feb 27, 2023

Yes, correct: input_ids is an optional argument, which can serve as a text prompt. If you don't provide one, the model will start generating text from the BOS token with id = 1.

But the problem is here: the inputs, prepared using the image processor, will be pixel_values, whereas GitForCausalLM has input_ids as its first keyword argument.
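The mismatch is easy to see by inspecting the forward signatures (a sketch; the outputs in the comments are what the discussion above implies, not guaranteed across versions):

import inspect
from transformers import GitForCausalLM, VisionEncoderDecoderModel

print(list(inspect.signature(GitForCausalLM.forward).parameters)[:3])
# e.g. ['self', 'input_ids', 'attention_mask'] -- input_ids comes first
print(list(inspect.signature(VisionEncoderDecoderModel.forward).parameters)[:3])
# e.g. ['self', 'pixel_values', 'decoder_input_ids'] -- pixel_values comes first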

@Narsil
Contributor

Narsil commented Feb 27, 2023

Oh, this code is already not looking pretty; there could be a way to make it better.

But we could always add

class GitForVision2Seq(GitForCausalLM):
    def forward(self, pixel_values, **kwargs):
        return super().forward(pixel_values=pixel_values, **kwargs)

for instance?

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@sgugger
Collaborator

sgugger commented Mar 27, 2023

I have removed the Good first issue label as there is no clear plan explaining to a beginner what to do to solve this issue. Please add one if you want to re-add that label.

@NielsRogge
Contributor

If @Narsil agrees, we can add the GitForVision2Seq class to make GIT supported by the pipeline.

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@github-actions github-actions bot closed this as completed May 3, 2023
@NielsRogge NielsRogge reopened this May 10, 2023
@NielsRogge
Contributor

NielsRogge commented May 10, 2023

To fix this issue, one can:

  • add GitForVision2Seq as explained above to transformers/models/git/modeling_git.py
  • add this class to the MODEL_FOR_VISION_2_SEQ_MAPPING here
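A sketch of what those two steps could look like (names follow the discussion above; the fix that was eventually merged, #23362, may differ in the details):

from transformers import GitForCausalLM  # in modeling_git.py the class is available locally

# 1) transformers/models/git/modeling_git.py: a thin wrapper whose forward accepts
#    pixel_values as the first keyword argument, as the image-to-text pipeline expects.
class GitForVision2Seq(GitForCausalLM):
    def forward(self, pixel_values=None, input_ids=None, **kwargs):
        return super().forward(input_ids=input_ids, pixel_values=pixel_values, **kwargs)

# 2) transformers/models/auto/modeling_auto.py: register the class under the "git"
#    model type in MODEL_FOR_VISION_2_SEQ_MAPPING_NAMES, i.e. add an entry like
#    ("git", "GitForVision2Seq").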

@github-actions

github-actions bot commented Jun 4, 2023

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@NielsRogge
Contributor

This was fixed by #23362
