-
Notifications
You must be signed in to change notification settings - Fork 27.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add video modality for InstrucBLIP #30182
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
src/transformers/models/instructblip/image_processing_instructblip.py
Outdated
Show resolved
Hide resolved
src/transformers/models/instructblip/image_processing_instructblip.py
Outdated
Show resolved
Hide resolved
|
||
model_input_names = ["pixel_values"] | ||
|
||
def __init__( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can perhaps also be Copied from
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, let me try. Then I have to add "copy ignore" on the preprocess probably
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No that's only if you add Copied from
to the class. In that case you can add "Ignore copy" above methods that you don't want to copy
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wow that's awesome, thanks for working on that!
I'm just concerned about 2 things:
- making sure that we have a robust API for multimodal processors that is consistent
- the current InstructBLIP models on the hub all use BlipImageProcessor. This PR would introduce a new image processor, I guess we would then need to update the auto mapping to make sure
AutoImageProcessor
still works as expected.
@@ -57,6 +57,7 @@ def __init__(self, image_processor, tokenizer, qformer_tokenizer): | |||
def __call__( | |||
self, | |||
images: ImageInput = None, | |||
videos: ImageInput = None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc @molbap since we'd like to standardize multimodal processors, this one isn't making it easier 😅
at some point we will have a VideoTextToTextPipeline
, and we'll need to make sure they all have the same API.
See also the ImageTextToTextPipeline
which is worked on at #29572. Although technically it could work if we just expect the following to work for any video-text-to-text model:
inputs = processor(videos=..., text=..., return_tensors="pt")
generated_ids = model.generate(**inputs, max_new_tokens=20)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure! Re to normalizing processors, for models taking video inputs, vivit
, videomae
, tvlt
have videos: ImageInput
but tvp
has videos: Union[ImageInput, List[ImageInput], List[List[ImageInput]]],
. However x_clip
reuses videomae
's processor.
Overall ImageInput
is defined as a type union of Images or list of images. Looks like in the future we might prefer supporting at least list of list of images so a VideoInput defined as such could make sense, or an union of types as done in x_clip.
Unfortunately video llava processing is going a different way to easily be able to interleave modalities. I guess that is part of what is being discussed internally in slack
Yep, seems like now it works only if calling specifically "InstructBlipImageProcessor" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for working on adding this capability!
Two general comments:
- We have to be careful here with the mappings wrt backwards compatibility and expected behaviour. As a user, I should be able to do:
image_processor = InstructBlipImageProcessor()
images = image_processor(images, return_tensors="pt")
and get exactly the same output as I was getting before with the blip image processor
- We should avoid adding lots of if-statements in existing modeling code and instead add a new model
@@ -1368,11 +1368,46 @@ def forward( | |||
>>> generated_text = processor.batch_decode(outputs, skip_special_tokens=True)[0].strip() | |||
>>> print(generated_text) | |||
The unusual aspect of this image is that a man is ironing clothes on the back of a yellow SUV, which is parked in the middle of a busy city street. This is an unconventional approach to ironing clothes, as it requires the man to balance himself and his ironing equipment on top of the vehicle while navigating through traffic. Additionally, the presence of taxis and other vehicles in the scene further emphasizes the unusual nature of this situation. | |||
|
|||
# To generate from video input |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The addition of all the if statements here indicate this should really be split into a different model. For things like torch.compile we really want to avoid models being able to have inputs with a varying number of dimensions e.g. adding InstructBLIPForVideoQuestionAnswering
instead.
if images is not None or videos is not None: | ||
image_encoding = self.image_processor(images=images, videos=videos, return_tensors=return_tensors) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is going to break things. Either:
- We add the new image processor to auto map for existing checkpoints. This might lead to differences in output as the image processor used is different. AFAICT processing steps are the same but output shape isn't.
- We load the old image processors, which will break or emit warnings with the
videos
input
size = get_size_dict(size, default_to_square=False) | ||
|
||
if (images is None) ^ (videos is not None): | ||
raise ValueError("InstructBLIP currently does not support both images and videos as input") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it ever going to support both? Otherwise this message can be misleading
# Ignore copy | ||
def preprocess( | ||
self, | ||
images: ImageInput = None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we're adding an image processor for InstructBlip, and it can process images, then it should process images consistently with how they were processed for the previous image processor (blip).
Whilst the processing steps are the same i.e. the processed image is consistent, the output shape won't be because this will add an extra axis to the output images i.e. they become (batch_size, num_frames, num_channels, height, width)
. Instead, we should keep the same output shape for images, this will allow us to add this to the image_processing_auto.py
mapping for instruct blip
@amyeroberts What if:
|
@amyeroberts ping |
@zucchini-nlp one thing related that I'll merge this week, regarding processors: #30511 I added a |
@zucchini-nlp Sorry for the late reply here.
This is a neat solution. My main question would be, what does this do to the shapes of the returned tensors e.g. If they're not the same, then we'll need to add a new class in the modeling file e.g. In terms of the image processor:
My vote would be for a new model. Having to handle video and images within the same image processor is a good indication they should be separated |
Thanks for detailed explanations! At this point it should be possible to go with the first option, I will have to go back and check. If not, making a separate model sounds good since BLIP will never work with both modalities at the same time. @molbap i see, probably having separate VideoProcessor files can be a better solutions for mutli-modal models, instead of packing it all in ImageProcessor |
I went for the "let's keep one processing and modeling files" way. Currently the following changes are applied:
Slow tests are passing locally + added a few tests for the video generation. |
Thanks for the continued work on this and apologies for the delay in my review. Skimming over the PR, and thinking about this more I think we should split this up such that we have a video image processor and a video model. We want to avoid conditional outputs from our base processing objects as well as conditional logic within our model's forward passes. As we won't interleave images and videos i.e. we don't expect a user to both be using video and images at the same time then we don't need to be able to handle these all at once |
@amyeroberts oke, I made Video InstructBlip its own model with its separate modeling and image processing files. Added some tests and indicated in the model-doc that it's the same model as InstructBlip except for the video processing capability. Ready for review @ArthurZucker I made more changes into diff converter in this PR, as it didn't work in some cases. Specifically:
Plus these are the changes in LLaVa-NeXT-Video PR, just duplicating to have all written down somewhere :)
Still have to fix cases when only "init" docstring is changed, I couldn't make it work yet |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great - thanks for splitting this up and adding this model!
Just a few comment and some questions about the changes to the diff_converter. Main comment is about making sure generate is properly tested
src/transformers/models/instructblipvideo/diff_instructblipvideo.py
Outdated
Show resolved
Hide resolved
class InstructBlipVideoForConditionalGenerationDecoderOnlyTest( | ||
ModelTesterMixin, GenerationTesterMixin, unittest.TestCase | ||
): | ||
all_model_classes = (InstructBlipVideoForConditionalGeneration,) if is_torch_available() else () |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
don't we need to define all_generative_models
here too to properly test the generation mixin?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
GenerationMixin still can't work for VLMs, and I am planning to properly add it after some unification of VLM processors. Otherwise we'll have so many conditional checks inside testing
AFAIK all VLMs currently are tested in IntegrationTests for that
utils/diff_model_converter.py
Outdated
# fmt: off | ||
self.python_module = python_module # we store the original module to use `code_for_node` | ||
self.transformers_imports = {} # maps the imports name like "from transformers.models.xxx" to the parsed AST module | ||
self.imported_mapping = {} # stores the name of the imported classes, with their source {"LlamaModel":"transformers.model.llama.modeling_llama"} | ||
self.visited_module = {} # modules visited like "transformers.models.llama.modeling_llama" | ||
self.visited_module_config = {} # modules visited like "transformers.models.llama.modeling_llama" in config file, needed to not mix config vs modeling imports |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment isn't completely clear to me as transformers.models.llama.modeling_llama"
is a modeling import and not a config import and the instructblip configuration file doesn't import transformers.models.llama.modeling_llama
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, maybe needs another example. This was needed for me because the auto-generated config wasn't getting imports from its "super config". In other words, it was only copying imports from diff files and removing unused ones by ruff.
One solution may be to indicate all imports in diff, but if they aren't used ruff removes them eventually. In my case PreTrainedConfig
wasn't being imported for ex
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep this is also something I noticed, mixing import was not super well done
utils/diff_model_converter.py
Outdated
@@ -457,13 +463,18 @@ def leave_ClassDef(self, original_node, updated_node): | |||
f"Tried parsing the name of the imported package from {super_file_name}, could not extract the model name" | |||
) | |||
|
|||
if super_file_name not in self.visited_module: # only extract classes once | |||
visited_module = self.visited_module_config if "Config" in class_name else self.visited_module |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How well does this extend to other files we might have under the model folder e.g. do we need flags for visited_module_processor
etc?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if processors work for diff. I tried actually to add image-processor to diff but it messed up, so I believe it doesn't yet support that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not yet supported but planned for sure.
parser.add_argument( | ||
"--old_model_name", | ||
required=False, | ||
help="The name of the model from which the copying is done in CamelCase. If not provided is inferred from diff-file", | ||
) | ||
parser.add_argument( | ||
"--new_model_name", | ||
required=False, | ||
help="The name of the new model being added in CamelCase. If not provided is inferred from diff-file", | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this for models with composite config files e.g. CLIPVisionConfig which we don't want to infer as CLIPVision
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is for models that do camel case without underscores, like InstructBlip (instructblip) vs LlavaNext (llava_next). In second case we can infer where to make a capital letter, while in former it's impossible so I decided to give users freedom passing model names
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that is a great addition indeed! WOuld even add this comment in the help 😉
@@ -474,7 +485,7 @@ def leave_ClassDef(self, original_node, updated_node): | |||
start_insert_idx = self.global_scope_index | |||
for dependency, _ in list_dependencies: | |||
node = class_finder.global_nodes.get(dependency, None) | |||
if node is not None: | |||
if node is not None and "Config" not in class_name: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same q here - do we need to account for all other classes e.g. "Processor" not in class_name
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is added because diff importing configs to the configuration files, even though they are defined as a class a few lines below
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would be good to get a second review of the changes here in this file from @ArthurZucker
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
overall good to me!
Separating config imports is the way to go, and further separating process import later on will be needed as well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
diff converter changes look nice IMO, but we should not need to import all the classes. New diff converter is able to parse dependencies so tell me if this is not the case!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am late to the party here, bug normally you should not have to import all these classes. The diff converter will automatically detect dependencies and copy classes that are required. !
Unless this is an edge case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We discussed this on Slack and decided we shouldn't have separate imports for each file, and let ruff clean-out unnecessary ones. So I'm manually filtering the issue with configs and adding all imports. That worked for InstructBlip
utils/diff_model_converter.py
Outdated
# fmt: off | ||
self.python_module = python_module # we store the original module to use `code_for_node` | ||
self.transformers_imports = {} # maps the imports name like "from transformers.models.xxx" to the parsed AST module | ||
self.imported_mapping = {} # stores the name of the imported classes, with their source {"LlamaModel":"transformers.model.llama.modeling_llama"} | ||
self.visited_module = {} # modules visited like "transformers.models.llama.modeling_llama" | ||
self.visited_module_config = {} # modules visited like "transformers.models.llama.modeling_llama" in config file, needed to not mix config vs modeling imports |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep this is also something I noticed, mixing import was not super well done
utils/diff_model_converter.py
Outdated
@@ -457,13 +463,18 @@ def leave_ClassDef(self, original_node, updated_node): | |||
f"Tried parsing the name of the imported package from {super_file_name}, could not extract the model name" | |||
) | |||
|
|||
if super_file_name not in self.visited_module: # only extract classes once | |||
visited_module = self.visited_module_config if "Config" in class_name else self.visited_module |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not yet supported but planned for sure.
parser.add_argument( | ||
"--old_model_name", | ||
required=False, | ||
help="The name of the model from which the copying is done in CamelCase. If not provided is inferred from diff-file", | ||
) | ||
parser.add_argument( | ||
"--new_model_name", | ||
required=False, | ||
help="The name of the new model being added in CamelCase. If not provided is inferred from diff-file", | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that is a great addition indeed! WOuld even add this comment in the help 😉
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
overall good to me!
Separating config imports is the way to go, and further separating process import later on will be needed as well
@amyeroberts this one is ready for the final review I guess |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another great addition - thanks!
As with Llava Next Video, we just need a final run of the slow tests and we're good to merge ❤️
Got the CI green, including slow tests. Will merge the PR |
What does this PR do?
I made these changes a month ago and forgot contributing. This PR adds video processing capabilities for InstructBLIP models. The paper states InstructBLIP was trained and evaluated on video, along with images and the original repo has some code on how video inference works.
Seems like this feature has some interest from the community (see here), so I believe we can add it.