Add support for BLIP and GIT in image-to-text and VQA pipelines #21110
Comments
Hi @NielsRogge, can I work on it? |
Sure @atturaioe feel free to start working on it |
I am writing to inquire about the possibility of starting work on this issue. @NielsRogge, can I contribute? |
@NielsRogge is this issue still open for contribution? |
Yes |
@NielsRogge If nobody is working on it, I would like to pick up the issue. |
I would like to pick up the issue if it's still available. |
@NielsRogge is this issue still open for contribution? I would like to work on it. |
Support for BLIP in the image-to-text pipeline has been added in #21904. GIT can be added as explained in this comment, feel free to open a PR. Support for the VQA pipeline still needs to be added for both; contributions are welcome there as well. |
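For reference, a minimal sketch of BLIP in the image-to-text pipeline, assuming the Salesforce/blip-image-captioning-base checkpoint (the exact output text will vary):

```python
from transformers import pipeline

# BLIP support was added to the image-to-text pipeline in #21904,
# so a checkpoint can be loaded directly by name.
captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")

# The pipeline accepts a local path, a PIL image, or a URL.
print(captioner("http://images.cocodataset.org/val2017/000000039769.jpg"))
# e.g. [{'generated_text': 'two cats sleeping on a couch'}]
```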
@NielsRogge can I work on this issue?? |
Hello @NielsRogge! I would like to work on this issue (add VQA support to the GIT model) as a first contribution. But before I start, I have a question: currently the only model implementing the VQA pipeline is ViLT, which treats VQA as a classification problem. However, in the GIT paper they describe VQA generatively, with the question given as a text prefix and the answer produced by the language model.
So I wonder if I should implement it as a classifier, or should I follow the paper? Thanks |
Hi @marechaux, we will need to implement the 2 different approaches in the VQA pipeline. ViLT and GIT indeed solve VQA entirely differently (ViLT is a classifier whereas GIT is a generative GPT-like model). |
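To illustrate the difference, here is a rough sketch of the two approaches at the model level, assuming the dandelin/vilt-b32-finetuned-vqa and microsoft/git-base-textvqa checkpoints (the exact wiring may differ from the eventual pipeline implementation):

```python
import requests
import torch
from PIL import Image
from transformers import (
    AutoModelForCausalLM,
    AutoProcessor,
    ViltForQuestionAnswering,
    ViltProcessor,
)

image = Image.open(
    requests.get("http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw
)
question = "How many cats are there?"

# ViLT: VQA as classification over a fixed answer vocabulary.
vilt_processor = ViltProcessor.from_pretrained("dandelin/vilt-b32-finetuned-vqa")
vilt = ViltForQuestionAnswering.from_pretrained("dandelin/vilt-b32-finetuned-vqa")
vilt_inputs = vilt_processor(image, question, return_tensors="pt")
logits = vilt(**vilt_inputs).logits
print("ViLT:", vilt.config.id2label[logits.argmax(-1).item()])

# GIT: VQA as conditional generation, with the question as a text prefix.
git_processor = AutoProcessor.from_pretrained("microsoft/git-base-textvqa")
git = AutoModelForCausalLM.from_pretrained("microsoft/git-base-textvqa")
pixel_values = git_processor(images=image, return_tensors="pt").pixel_values
input_ids = git_processor(text=question, add_special_tokens=False).input_ids
input_ids = torch.tensor([[git_processor.tokenizer.cls_token_id] + input_ids])
generated_ids = git.generate(pixel_values=pixel_values, input_ids=input_ids, max_length=50)
# Note: the decoded string contains the question prefix followed by the generated answer.
print("GIT:", git_processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```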
Hey @NielsRogge, took a shot at this. Am I correct in understanding that the ideal implementation of "microsoft/git-base" in the image-to-text pipeline would look something like this?

```python
from transformers import AutoProcessor, GitForVision2Seq, pipeline

processor = AutoProcessor.from_pretrained("microsoft/git-base")
model = GitForVision2Seq.from_pretrained("microsoft/git-base")
pipe = pipeline("image-to-text", model=model, image_processor=processor.image_processor, tokenizer=processor.tokenizer)
print(pipe("https://www.wideopenpets.com/wp-content/uploads/sites/6/2021/12/Popular-Horse-Feature-Image.png"))
```

If so, I got this to work by:
So the GITImageProcessor.preprocess method ends with this:

```python
data = {"pixel_values": images}
return_data = BatchFeature(data=data, tensor_type=return_tensors)
return_data['input_ids'] = None
return return_data
```

rather than the CLIPImageProcessor.preprocess method returning this:

```python
data = {"pixel_values": images}
return BatchFeature(data=data, tensor_type=return_tensors)
```

Curious about your thoughts on this approach. How would this affect other GIT image processing workflows (i.e. VQA, etc.)? Could we use a conditional to account for those? |
Thanks for taking a stab at this. I'm fine with adding a GITImageProcessor, but would it differ from CLIPImageProcessor only to include "input_ids" which are set to None? |
Exactly this - 'only to include "input_ids" which are set to None?'

I see how adding an entirely new GITImageProcessor seems excessive when all it would do is add the input_ids: None key-value pair to the data returned from the .preprocess method. As you describe here, #21514 (comment), once we hit the preprocess method in ImageToTextPipeline and map the model to GIT, the model_inputs are returned (via the CLIPImageProcessor through the GITProcessor in processing_git.py) without the input_ids key. So AFAIK, the best we can do is modify the return value of the CLIPImageProcessor.preprocess method without changing the CLIPImageProcessor class, by replicating the CLIPImageProcessor, rebranding it as a GITImageProcessor, and modifying the .preprocess method. Let me know if that works or if you feel there is a better approach. Is the idea that there would be some way to do this within GitForVision2Seq?

As an aside, I read some best practices for working in the transformers library (https://huggingface.co/transformers/v4.10.1/add_new_model.html#general-overview-of-transformers). Would it be preferable to copy the entire CLIPImageProcessor class as GITImageProcessor within processing_git.py, or do something more like this within processing_git.py?

```python
class GITImageProcessor(CLIPImageProcessor):
    def preprocess(self, *args, **kwargs):
        # Call the original preprocess method
        return_data = super().preprocess(*args, **kwargs)
        # Add 'input_ids' key to the data
        return_data['input_ids'] = None
        return return_data
```
|
Hmm, I don't get why the image processor would need to return input_ids at all. This example shows that you only need to pass pixel_values to the model's generate method for captioning. |
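As a concrete sketch of that point, assuming the microsoft/git-base-coco checkpoint (not necessarily how the pipeline will wire it up), captioning works with pixel_values alone:

```python
import requests
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

processor = AutoProcessor.from_pretrained("microsoft/git-base-coco")
model = AutoModelForCausalLM.from_pretrained("microsoft/git-base-coco")

image = Image.open(
    requests.get("http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw
)
pixel_values = processor(images=image, return_tensors="pt").pixel_values

# No input_ids are passed: generation starts from the BOS/CLS token internally.
generated_ids = model.generate(pixel_values=pixel_values, max_length=50)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```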
Hello, it seems that BLIP support for the image-to-text pipeline has been completed; however, the VQA pipeline for both BLIP & GIT is still incomplete, as is the image-to-text pipeline for GIT. @marechaux how is the VQA work going for GIT? |
Hi! I'm also interested in helping out if we can divide the work :) |
Hey @NielsRogge, I was working on the VQA pipeline for BLIP but I am confused about how I can give |
Hi @Tanmaypatil123 there's already this PR: #23348. Feel free to take it over/improve it |
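For context, a rough sketch of how BLIP handles VQA at the model level, which is what the pipeline PR needs to wrap (assuming the Salesforce/blip-vqa-base checkpoint):

```python
import requests
from PIL import Image
from transformers import BlipForQuestionAnswering, BlipProcessor

processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base")

image = Image.open(
    requests.get("http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw
)
question = "How many cats are there?"

# The processor prepares both the image and the question text.
inputs = processor(image, question, return_tensors="pt")
out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))
```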
Hello, can I work on this? |
Hi Team, Can I start working on it ? |
Hi @NielsRogge, I would like to try to add GIT support for VQA as my first contribution, is that ok? I believe the input_ids (or their length) could be used in the postprocess of VisualQuestionAnsweringPipeline to remove the prompt/prefix, like in TextGenerationPipeline, but it will require refactoring _forward in VisualQuestionAnsweringPipeline to also return the input_ids. e.g. transformers/src/transformers/pipelines/text_generation.py Lines 294 to 305 in fe3c8ab |
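A minimal sketch of that prompt-stripping idea (the model_outputs keys and the postprocess signature here are hypothetical, assuming _forward is refactored to also return the prompt's input_ids):

```python
def postprocess(model_outputs, tokenizer):
    # Hypothetical postprocess step: remove the question prefix from GIT's output,
    # similar to how TextGenerationPipeline handles return_full_text=False.
    generated_ids = model_outputs["generated_ids"]  # (batch, seq_len), assumed key
    input_ids = model_outputs["input_ids"]          # prompt ids returned by _forward, assumed key
    prompt_length = input_ids.shape[1]
    # GIT echoes the prompt tokens before the answer, so slice them off.
    answer_ids = generated_ids[:, prompt_length:]
    return [
        {"answer": tokenizer.decode(ids, skip_special_tokens=True)}
        for ids in answer_ids
    ]
```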
Is this still open for contribution? Would love to help out. |
Hi @astern21, I started a draft PR, but didn't get to finish it, and now I'm not really working on it. |
Feature request
BLIP and GIT are two recent additions to the library, providing state-of-the-art performance for tasks like image captioning and visual question answering (VQA). GIT is even capable of video captioning and video QA.
Hence it makes sense to support them in our image-to-text and VQA pipelines.
Motivation
Having support for better models in the pipelines is highly desirable!
See also a request for it here: https://discuss.huggingface.co/t/support-for-different-models-in-text-to-image-pipeline/29504
Your contribution
I can assist in adding support, see #18446 as a very similar case