
Impossible to use any of the GIT models from Microsoft #4301

Closed
1 task done
danielbichuetti opened this issue Feb 28, 2023 · 5 comments

@danielbichuetti
Contributor

danielbichuetti commented Feb 28, 2023

Describe the bug
When attempting to use the image-to-text node from Haystack, we get an error saying KeyError: git.

Error message
KeyError: git

Expected behavior
Use the GIT models from Microsoft

Additional context
Add any other context about the problem here, like document types / preprocessing steps / settings of reader etc.

To Reproduce

from haystack.nodes.image_to_text.transformers import TransformersImageToText

converter = TransformersImageToText(model_name_or_path='microsoft/git-large-textcaps')
data = converter.generate_captions(['pic.jpg'])

FAQ Check

System:

  • OS:
  • GPU/CPU:
  • Haystack version (commit or version number):
  • DocumentStore:
  • Reader:
  • Retriever:
@danielbichuetti danielbichuetti changed the title Impossible use any of the GIT models from Microsoft Impossible to use any of the GIT models from Microsoft Feb 28, 2023
@anakin87
Member

Hello @danielbichuetti!

  • Transformers: transformers.ImageToTextPipeline doesn't support some ImageToText models.
    I asked on the forum for support for BLIP and GIT, and @NielsRogge opened Add support for BLIP and GIT in image-to-text and VQA pipelines huggingface/transformers#21110 (which is in progress).

  • I implemented a check in TransformersImageToText:

    self.model = pipeline(
        task="image-to-text",
        model=model_name_or_path,
        revision=model_version,
        device=self.devices[0],
        use_auth_token=use_auth_token,
    )
    model_class_name = self.model.model.__class__.__name__
    if model_class_name not in SUPPORTED_MODELS_CLASSES:
        raise ValueError(
            f"The model of class '{model_class_name}' is not supported for ImageToText. "
            f"The supported classes are: {SUPPORTED_MODELS_CLASSES}."
        )
    but it only runs after loading the model
    (it does prevent some other mistakes, like using models that are not suitable for ImageToText at all:
    try, for example, converter = TransformersImageToText(model_name_or_path="deepset/minilm-uncased-squad2")).

So the main question is: how to implement a check before loading the model?
This has probably already been addressed elsewhere in Haystack.
To get some historical memory 🙂 and great advice, I also tag @ZanSara

@danielbichuetti
Contributor Author

Oh. We use it in a custom node here: GIT and BLIP2 for Visual Question Answering.

But we are using it like this:

class QAImageReader(BaseImageReader):
    ...
    # load the GIT model directly, without a pipeline
    self._model = AutoModelForCausalLM.from_pretrained(model)
    ...
    # build pixel values for the images and prompt input ids for the question
    pixel_values = self._processor(images=images_dataset, return_tensors="pt").pixel_values.to(self._device)
    input_ids = self._processor(text=question, add_special_tokens=False).input_ids
    input_ids = [self._processor.tokenizer.cls_token_id] + input_ids
    input_ids = torch.tensor(input_ids).unsqueeze(0).to(self._device)
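
For completeness, the generation step that follows looks roughly like this (the max_length value here is illustrative, not our exact setting):

    generated_ids = self._model.generate(pixel_values=pixel_values, input_ids=input_ids, max_length=50)
    answer = self._processor.batch_decode(generated_ids, skip_special_tokens=True)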

BLIP2 has some differences:

    self._model = Blip2ForConditionalGeneration.from_pretrained(model, torch_dtype=torch.float16)
    ...

I usually don't make use of the pipelines. Interesting.

@anakin87
Member

I'd prefer to use a pipeline for this task, to stay agnostic about the underlying details (models, architectures, processors),
which works well when the models are supported.

I see that somewhere in Haystack `AutoConfig` is used to get information on a model (without loading it):

try:
    config = AutoConfig.from_pretrained(
        pretrained_model_name_or_path=model_name_or_path,
        use_auth_token=use_auth_token,
        revision=revision,
        **(autoconfig_kwargs or {}),
    )
    model_type = config.model_type
    # if unsupported model, try to infer from config.architectures
    if not is_supported_model(model_type) and config.architectures:
        model_type = config.architectures[0] if is_supported_model(config.architectures[0]) else None

Unfortunately, in this case AutoConfig raises the mentioned KeyError (AutoModel too).

Currently, the pipeline only supports VisionEncoderDecoderModel models, but there is no easy way to filter those models on the Hugging Face Hub.


If we still want to use the pipeline (I'm in favor),
we can help the users in several ways (while waiting for more models to be supported):

  • specifying in the docstrings that this node only supports VisionEncoderDecoderModel models
  • checking whether the chosen model name contains "blip" or "git" and raising an understandable exception (see the sketch after this list)
  • keeping the current check as well
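
A minimal sketch of what that name-based check could look like (the constant and helper names here are just illustrative, not an existing Haystack API):

    # hypothetical, for illustration only: model families the pipeline cannot load yet
    KNOWN_UNSUPPORTED_PREFIXES = ("blip", "git")

    def _check_model_name(model_name_or_path: str):
        # fail fast, before pipeline() tries (and fails) to load the model
        name = model_name_or_path.split("/")[-1].lower()
        if any(name.startswith(prefix) for prefix in KNOWN_UNSUPPORTED_PREFIXES):
            raise ValueError(
                f"The model '{model_name_or_path}' is not yet supported by the Hugging Face "
                "image-to-text pipeline. Only VisionEncoderDecoderModel models are supported for now."
            )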

WDYT?

@danielbichuetti
Contributor Author

Using the pipeline provides much cleaner code.

I noticed this because we are merging our company codebase with Haystack. That mainly consists of removing redundant nodes and functions, and patching Haystack Document with our small modification (we like the ability to use IDs calculated from specific meta fields, not all of them).

Currently the only remaining nodes are:

  • MediaTranscriber
  • some LLM integration
  • ImageReader (ImageToText equivalent, but we only use GIT, BLIP2 and ViT as possible models)
  • QAImageReader (VQA component)

We check the start of the model name (git-* and blip2-*; if neither matches, we check for ViT in the middle of the name) and then try to load the model. As our environment is constrained, it has been a good workaround.

Maybe we could implement a temporary workaround that checks whether the model name starts with git- or blip2- (BLIP has some differences when building the inputs), and, if not, checks whether it is on the pipeline's allowed list?

Furthermore, I noticed that transformers is pinned at 4.25, which might be an issue.

@ZanSara
Contributor

ZanSara commented Mar 1, 2023

Hey @anakin87!

So the main question is: how to implement a check before loading the model?

Unfortunately, this has been a notoriously hard task to accomplish. We have mostly used ad-hoc solutions depending on the model type, sometimes going all the way down to checking the model names for cues 🙈

If AutoConfig fails, the only thing I can recommend is hf_hub_download: I've used it in the MultiModal Retriever to try to distinguish transformers and sentence-transformers models, but it might come in handy here because it lets you "bypass" AutoConfig and just read the config yourself. An annoying approach, but it might be a decent choice here.
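
For example, something along these lines (a rough sketch, not the exact MultiModal Retriever code):

    import json
    from huggingface_hub import hf_hub_download

    # fetch only config.json from the Hub instead of going through AutoConfig
    config_path = hf_hub_download(repo_id="microsoft/git-large-textcaps", filename="config.json")
    with open(config_path) as f:
        config = json.load(f)

    model_type = config.get("model_type")        # e.g. "git"
    architectures = config.get("architectures")  # e.g. ["GitForCausalLM"]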
