-
Notifications
You must be signed in to change notification settings - Fork 27.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add LLaVa NeXT Video #31252
Add LLaVa NeXT Video #31252
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
1548ded
to
37aa822
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the import needs to be like this I believe
src/transformers/models/llava_next_video/diff_llava_next_video.py
Outdated
Show resolved
Hide resolved
af53791
to
a144d8d
Compare
ready for review! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for all the work adding this model!
Most comments are about copied from and diff file choices
src/transformers/models/llava_next_video/configuration_llava_next_video.py
Outdated
Show resolved
Hide resolved
tests/models/llava_next_video/test_modeling_llava_next_video.py
Outdated
Show resolved
Hide resolved
tests/models/llava_next_video/test_modeling_llava_next_video.py
Outdated
Show resolved
Hide resolved
tests/models/llava_next_video/test_modeling_llava_next_video.py
Outdated
Show resolved
Hide resolved
raise ValueError(f"Could not make batched video from {videos}") | ||
|
||
|
||
class LlavaNextVideoVideoProcessor(BaseImageProcessor): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be consistent with other models, this should really be LlavaNextVideoImageProcessor
.
I'd say we should for do this for this model, and then in a separate PR we can consider introduce a VideoProcessor class, which we might want to use for new models, and as a possible alias for the image processor for the other video models.
The outstanding question would be for video llava, which takes both images and videos. Given the model, I'd say it should probably be a VideoProcessor
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oke, but in case of this model, it also accepts both as input. Right now it uses simple "LLaVaNext" as image processor and adds a new class LlavaNextVideoImageProcessor
for video processing. So, for VideoLlava we can separate into two classes in this same way, one for image and another for video processing.
The only major change is that arg name for images will be pixel_values
and not pixel_values_images
if we do so.
I will call this one LlavaNextVideoImageProcessor
processor then and try to unify video-image models in separate PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some initial comments, in particular the diff should probably be entirely defined in diff_llava_next_video.py, which then populates the config and modeling files
src/transformers/models/llava_next_video/configuration_llava_next_video.py
Outdated
Show resolved
Hide resolved
src/transformers/models/llava_next_video/modeling_llava_next_video.py
Outdated
Show resolved
Hide resolved
src/transformers/models/llava_next_video/processing_llava_next_video.py
Outdated
Show resolved
Hide resolved
@amyeroberts ready for re-review
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🔥 kudos for using the diff file so well
src/transformers/models/llava_next_video/convert_llava_next_video_weights_to_hf.py
Show resolved
Hide resolved
utils/diff_model_converter.py
Outdated
# Iterate directky from node.body as there can be property/setters with same names which are overwritten when we use a dict | ||
for func in original_node.body.body: | ||
name = func.name.value if hasattr(func, "name") else func |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so you mean original_methods.items() is overwriding some things?
Otherwise the goal is indeed to overwrite the methods using the func.with_changes!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This happens only when original class has two methods with identical naming, in my case property and it setter. If we use dict, only one is retained as dict can't hold two keys with same naming.
Not sure if there're cases when directly iterating might cause errors 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hhhhhha I see!
Then it means we are not taking into account the decorator. Indeed that is a good catch and was probably affectingh @younesbelkada in his cohere porting
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wow - looks great! 🔥
Some comments relating to the diff logic and the generated model file, but overall looks great
src/transformers/models/llava_next_video/processing_llava_next_video.py
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ArthurZucker I can forsee hitting potential issues when e.g. we have more than one modeling file. For example, the models under data2vec.
Seeing the config file here made me realise this. Rather than try to handle which imports belong to which file e.g. configuration_llava_next
vs. modeling_llava_next
we could make this simpler (and less prone to error) by having diff files for each e.g. diff_modeling_llava_next_video.py
, diff_configuration_llava_next_video.py
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, but there should be an easy way to solve importing stuff. Basically pasting all imports everywhere and removing what is not used
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Keeping the config and tokenizer and whatever else in a single diff file for now will be simpler IMO !
|
||
|
||
@dataclass | ||
# Copied from transformers.models.idefics.modeling_idefics.IdeficsCausalLMOutputWithPast with Idefics->LlavaNextVideo |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we want to copy across the copied from statements from the original file 👀
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hehe, this is diff-generated but I'll see if that can be fixed + ping @ArthurZucker for visibility
UPD: Found how to remove those and can push a commit if needed. Btw, make fix-copies
currently doesn't check for diff-files, so imo "copied from" statements are needed to catch and propagate code changes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep we could / should add the check into make fix-copies to test the hash of the file vs hash of the diff converter generated file
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll leave this one for next the PR (to whoever wants to add the checks) as it's not related to LLaVa-NeXT :)
Edit: will open an issue to not forget
Model tester for `LlavaNextVideoForConditionalGeneration`. | ||
""" | ||
|
||
all_model_classes = (LlavaNextVideoForConditionalGeneration,) if is_torch_available() else () |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to define all_generative_models
here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as in another PR, VLMs cannot do GenerationTesterMixin out of the box due to pixel_values
, custom generation code or partially having generation related code in modeling (e.g. expand_attn_mask). It's on my todo list, but requires more unification of VLMs before
8ce7ecd
to
3947ad8
Compare
@amyeroberts this one is also ready for the last review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work - thanks for adding the model and for testing the diff converter!
src/transformers/models/llava_next_video/processing_llava_next_video.py
Outdated
Show resolved
Hide resolved
@zucchini-nlp Only thing before merge is to do a run on the slow tests for the model with a |
…_video.py Co-authored-by: amyeroberts <[email protected]>
Co-authored-by: amyeroberts <[email protected]>
Hmm I am not sure slow tests are working, they were skipped in here and in BlipVideo. I think the commit message format is correct |
Ah, we didn't have the |
b0433ed
to
bc51d0c
Compare
What does this PR do?
Adds a new model for Video LLMs built by tuning LLaVa-NeXT on video dataset, current SOTA for videos on VideoMME bench.
This PR introduces some changes to how we usually handled interleaved multimodal data. I need @amyeroberts opinion on that, of it's a viable option for these kinds of models. I added a separate videoProcessor class which has all same parameters as ImageProcessor but a different logic. The reason: imageProcessor file was already full of extra helper functions, and adding more conditions seemed to degrade readbility
Also, I added a chat template, if you can verify that it's correct @NielsRogge 😄 (it's in the
convert_model.py
). I will push all templates to the hub later, currently only 7B model has its template workingFixes #31164