
Impossible to use any of the GIT models from Microsoft #4301

Closed
1 task done
danielbichuetti opened this issue Feb 28, 2023 · 5 comments

@danielbichuetti
Contributor

danielbichuetti commented Feb 28, 2023

Describe the bug
When attempting to use the image-to-text node from Haystack, we get an error saying KeyError: git.

Error message
KeyError: git

Expected behavior
Use the GIT models from Microsoft

Additional context
Add any other context about the problem here, like document types / preprocessing steps / settings of reader etc.

To Reproduce

from haystack.nodes.image_to_text.transformers import TransformersImageToText

converter = TransformersImageToText(model_name_or_path='microsoft/git-large-textcaps')
data = converter.generate_captions(['pic.jpg'])

FAQ Check

System:

  • OS:
  • GPU/CPU:
  • Haystack version (commit or version number):
  • DocumentStore:
  • Reader:
  • Retriever:
@danielbichuetti danielbichuetti changed the title Impossible use any of the GIT models from Microsoft Impossible to use any of the GIT models from Microsoft Feb 28, 2023
@anakin87
Member

Hello @danielbichuetti!

  • Transformers: transformers.ImageToTextPipeline doesn't support some ImageToText models.
    I asked on the forum for support for BLIP and GIT, and @NielsRogge opened Add support for BLIP and GIT in image-to-text and VQA pipelines huggingface/transformers#21110 (which is in progress).

  • I implemented a check in TransformersImageToText:

    self.model = pipeline(
        task="image-to-text",
        model=model_name_or_path,
        revision=model_version,
        device=self.devices[0],
        use_auth_token=use_auth_token,
    )
    model_class_name = self.model.model.__class__.__name__
    if model_class_name not in SUPPORTED_MODELS_CLASSES:
        raise ValueError(
            f"The model of class '{model_class_name}' is not supported for ImageToText. "
            f"The supported classes are: {SUPPORTED_MODELS_CLASSES}."
        )
    but it only runs after loading the model
    (it does prevent some other mistakes, like using models that are not suitable for ImageToText at all:
    try, for example, converter = TransformersImageToText(model_name_or_path="deepset/minilm-uncased-squad2")).

So the main question is: how to implement a check before loading the model?
This has probably already been addressed elsewhere in Haystack.
To get some historical memory 🙂 and great advice, I also tag @ZanSara

@danielbichuetti
Contributor Author

Oh. We use it in a custom node here: GIT and BLIP2 for Visual Question Answering.

But we are using it like this:

class QAImageReader(BaseImageReader):
    ...
    # load the GIT model directly, without a pipeline
    self._model = AutoModelForCausalLM.from_pretrained(model)
    ...
    # build pixel values for the images and prompt input ids for the question
    pixel_values = self._processor(images=images_dataset, return_tensors="pt").pixel_values.to(self._device)
    input_ids = self._processor(text=question, add_special_tokens=False).input_ids
    input_ids = [self._processor.tokenizer.cls_token_id] + input_ids
    input_ids = torch.tensor(input_ids).unsqueeze(0).to(self._device)
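
For completeness, the generation step that follows looks roughly like this (the max_length value here is illustrative, not our exact setting):

    generated_ids = self._model.generate(pixel_values=pixel_values, input_ids=input_ids, max_length=50)
    answer = self._processor.batch_decode(generated_ids, skip_special_tokens=True)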

BLIP2 has some differences:

    self._model = Blip2ForConditionalGeneration.from_pretrained(model, torch_dtype=torch.float16)
    ...

I usually don't make use of the pipelines. Interesting.

@anakin87
Member

I'd prefer to use a pipeline for this task, to stay agnostic about the underlying details (models, architectures, processors),
which works well when the models are supported.

I see that somewhere in Haystack `AutoConfig` is used to get information on a model (without loading it):

try:
    config = AutoConfig.from_pretrained(
        pretrained_model_name_or_path=model_name_or_path,
        use_auth_token=use_auth_token,
        revision=revision,
        **(autoconfig_kwargs or {}),
    )
    model_type = config.model_type
    # if unsupported model, try to infer from config.architectures
    if not is_supported_model(model_type) and config.architectures:
        model_type = config.architectures[0] if is_supported_model(config.architectures[0]) else None

Unfortunately, in this case AutoConfig raises the mentioned KeyError (AutoModel too).

Currently, the pipeline only supports VisionEncoderDecoderModel models, but there is no easy way to filter those models on the Hugging Face Hub.


If we still want to use the pipeline (I'm in favor),
we can help the users in several ways (while waiting for more models to be supported):

  • specifying in the docstrings that this node only supports VisionEncoderDecoderModel models
  • checking whether the chosen model name contains "blip" or "git" and raising an understandable exception (see the sketch after this list)
  • keeping the current check as well
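
A minimal sketch of what that name-based check could look like (the constant and helper names here are just illustrative, not an existing Haystack API):

    # hypothetical, for illustration only: model families the pipeline cannot load yet
    KNOWN_UNSUPPORTED_PREFIXES = ("blip", "git")

    def _check_model_name(model_name_or_path: str):
        # fail fast, before pipeline() tries (and fails) to load the model
        name = model_name_or_path.split("/")[-1].lower()
        if any(name.startswith(prefix) for prefix in KNOWN_UNSUPPORTED_PREFIXES):
            raise ValueError(
                f"The model '{model_name_or_path}' is not yet supported by the Hugging Face "
                "image-to-text pipeline. Only VisionEncoderDecoderModel models are supported for now."
            )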

WDYT?

@danielbichuetti
Contributor Author

Using the pipeline provides much cleaner code.

I noticed this because we are merging our company codebase with Haystack. That mainly consists of removing redundant nodes and functions, and patching Haystack Document with our small modification (we like the ability to use IDs calculated from specific meta fields, not all of them).

Currently the only remaining nodes are:

  • MediaTranscriber
  • some LLM integration
  • ImageReader (ImageToText equivalent, but we only use GIT, BLIP2 and ViT as possible models)
  • QAImageReader (VQA component)

We check the start of the model name (git-* and blip2-*; if neither matches, we check for ViT in the middle of the name) and then try to load the model. As our environment is constrained, it has been a good workaround.

Maybe we could implement a temporary workaround that checks whether the model name starts with git- or blip2- (BLIP has some differences when building the inputs), and, if not, checks whether it is on the pipeline's allowed list?

Furthermore, I noticed that transformers is pinned at 4.25, which might be an issue.

@ZanSara
Contributor

ZanSara commented Mar 1, 2023

Hey @anakin87!

So the main question is: how to implement a check before loading the model?

Unfortunately, this has been a notoriously hard task to accomplish. We have mostly used ad-hoc solutions depending on the model type, sometimes going all the way down to checking the model names for cues 🙈

If AutoConfig fails, the only thing I can recommend is hf_hub_download: I've used it in the MultiModal Retriever to try to distinguish transformers and sentence-transformers models, but it might come in handy here because it lets you "bypass" AutoConfig and just read the config yourself. An annoying approach, but it might be a decent choice here.
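
For example, something along these lines (a rough sketch, not the exact MultiModal Retriever code):

    import json
    from huggingface_hub import hf_hub_download

    # fetch only config.json from the Hub instead of going through AutoConfig
    config_path = hf_hub_download(repo_id="microsoft/git-large-textcaps", filename="config.json")
    with open(config_path) as f:
        config = json.load(f)

    model_type = config.get("model_type")        # e.g. "git"
    architectures = config.get("architectures")  # e.g. ["GitForCausalLM"]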
