
Qwen2-VL: clean-up and add more tests #33354

Merged (12 commits) on Sep 12, 2024

Conversation


@zucchini-nlp zucchini-nlp commented Sep 6, 2024

What does this PR do?

This PR standardizes the Qwen2-VL processors and adds tests for processing and generation. One thing to note: the generation tests currently operate on the text-only modality. I will add support for multimodal tests very soon.

TODO:

  • add video processor tests


Collaborator

@amyeroberts amyeroberts left a comment

Thanks for adding this!

Maximum length of the returned list and optionally padding length (see above).
truncation (`bool`, *optional*):
Activates truncation to cut input sequences longer than `max_length` to `max_length`.
return_tensors (`str` or [`~utils.TensorType`], *optional*):
Collaborator

Let's keep the return tensors in the docstring for now. Although we're removing it as a defined kwarg, users need to set it to be able to use the processor properly for training.

tests/models/qwen2_vl/test_processing_qwen2_vl.py (outdated; resolved)
all_kwargs = {
"common_kwargs": {"return_tensors": "pt"},
"images_kwargs": {"size": {"height": 214, "width": 214}},
"videos_kwargs": {"size": {"height": 214, "width": 214}},
Collaborator

I'm guessing the processing doesn't work if the video and image have different image sizes set as we can't batch?

Could we test passing one and not the other?

Member Author

Yes, passing only one worked for me, but I'll add better tests as well.
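For context, the `all_kwargs` dict quoted above splits kwargs by modality. Here is a minimal sketch of how such a structure could collapse into per-modality call kwargs; this is an illustration, not the actual transformers implementation, and the helper name `merge_modality_kwargs` is made up:

```python
# Hypothetical helper illustrating the structured-kwargs idea from the test
# snippet above: common_kwargs apply to every modality, and modality-specific
# entries override or extend them.
def merge_modality_kwargs(all_kwargs: dict, modality: str) -> dict:
    merged = dict(all_kwargs.get("common_kwargs", {}))
    merged.update(all_kwargs.get(f"{modality}_kwargs", {}))
    return merged


all_kwargs = {
    "common_kwargs": {"return_tensors": "pt"},
    "images_kwargs": {"size": {"height": 214, "width": 214}},
    "videos_kwargs": {"size": {"height": 214, "width": 214}},
}

# images_kwargs now holds both the shared return_tensors and the image size
images_kwargs = merge_modality_kwargs(all_kwargs, "images")
```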

Member Author

OK, I found that the size wasn't being used at all; my bad for missing that when reviewing. Since Qwen2-VL can't resize to an arbitrary given size (it needs to follow some logic to smart-resize and keep as much resolution as possible), the size dict will be different from other models'. Added that to the docstring.

Also modified all tests to follow the new size format and actually use it if users pass it with the correct dict keys. WDYT? Should we also find a way to support tuple-size? I'll also update the model doc page soon.
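As an illustration of the "smart resize" constraint described above, here is a rough, self-contained sketch. This is not the exact Qwen2-VL implementation; the factor and pixel bounds are assumed defaults. The idea: keep the aspect ratio, round each side to a multiple of the patch factor, and pull the total pixel count back into [min_pixels, max_pixels].

```python
import math

# Illustrative sketch of "smart resize": round each side to a multiple of
# `factor` (the patch grid), then rescale so the total pixel count stays
# within [min_pixels, max_pixels] while roughly preserving the aspect ratio.
def smart_resize(height, width, factor=28, min_pixels=56 * 56, max_pixels=14 * 14 * 4 * 1280):
    h = round(height / factor) * factor
    w = round(width / factor) * factor
    if h * w > max_pixels:
        # Too many pixels: shrink both sides by the same scale, rounding down.
        scale = math.sqrt(height * width / max_pixels)
        h = math.floor(height / scale / factor) * factor
        w = math.floor(width / scale / factor) * factor
    elif h * w < min_pixels:
        # Too few pixels: grow both sides by the same scale, rounding up.
        scale = math.sqrt(min_pixels / (height * width))
        h = math.ceil(height * scale / factor) * factor
        w = math.ceil(width * scale / factor) * factor
    return h, w
```

With these assumed defaults, any input resolves to sides divisible by 28 with a pixel count inside the configured bounds.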

Member Author

Updated the docs and added a usage example showing how users can change the max/min resolution at the processor's call time.

Collaborator

Should we also find a way to support tuple-size?

No. Previously size was a tuple and this caused issues as:

  • some models had it stored as (width, height)
  • some models just stored an int. Within that set of models, some would then resize to (int, int), others to (int, int * height / width), etc.
  • some models used a tuple to denote the longest and shortest edges

There was a huge level of ambiguity and inconsistency across the configs which made image processor behaviour difficult to predict without having to dig into the code.

Technically, tuples are supported using get_size_dict, but this is for backwards compatibility rather than an encouraged way to pass in inputs.

Adding min_pixels and max_pixels isn't a light decision, as this is something which sets in stone how this will be expressed for all configs going forward. I think this change is OK but we should be cautious about just adding keys to size dict. You'll also need to update some other code which verifies the keys in the size dicts -- namely VALID_SIZE_DICT_KEYS
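To make this concrete, here is a toy sketch of the kind of key-set validation `VALID_SIZE_DICT_KEYS` enables. The real constant lives in `image_processing_utils.py`; the key sets and helper below are illustrative assumptions, not the library's code:

```python
# Illustrative stand-in for VALID_SIZE_DICT_KEYS: each entry is one allowed
# combination of keys for a `size` dict. The last entry is the new key set
# discussed in this thread.
VALID_SIZE_DICT_KEYS = (
    {"height", "width"},
    {"shortest_edge"},
    {"shortest_edge", "longest_edge"},
    {"min_pixels", "max_pixels"},
)


def is_valid_size_dict(size) -> bool:
    # A size dict is valid when its key set matches one allowed combination.
    return isinstance(size, dict) and set(size) in VALID_SIZE_DICT_KEYS
```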

Member Author

No. Previously size was a tuple and this caused issues as:

I see, that makes it easier in general to use max/min pixels now, because I didn't want to add the new possible keys to VALID_SIZE_DICT_KEYS. Qwen is, and will probably remain, the only model accepting such kwargs. But then we couldn't call get_size_dict to validate that the size dict is passed correctly.

OK, now I added it to VALID_SIZE_DICT_KEYS and added one more test with incorrect sizes passed as kwargs, in which case the defaults from self.min_pixels have to be used.

Ready for re-review

Collaborator

@amyeroberts amyeroberts left a comment

Left a comment re the size dict -- just some updates needed in image_processing_utils.py to reflect this addition

docs/source/en/model_doc/qwen2_vl.md (outdated; resolved)
docs/source/en/model_doc/qwen2_vl.md (outdated; resolved)

@amyeroberts amyeroberts self-requested a review September 12, 2024 14:52
@@ -192,13 +196,14 @@ def __init__(
self.patch_size = patch_size
self.temporal_patch_size = temporal_patch_size
self.merge_size = merge_size
- self.size = {"min_pixels": min_pixels, "max_pixels": max_pixels}
+ self.size = size if size is not None else {"min_pixels": min_pixels, "max_pixels": max_pixels}
Collaborator

@amyeroberts amyeroberts Sep 12, 2024

One last thing: if size can now be set in the input, i.e. the values of min_pixels and max_pixels in the size dict take precedence over the min_pixels and max_pixels input arguments, then we should pop max_pixels and min_pixels in the to_dict method so that only size is saved out into the config.

That is, if someone were to look at the config file, it wouldn't be clear which value takes precedence; you'd have to look at the code to understand, which we should avoid.
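A minimal sketch of the serialization behaviour suggested here, using a stand-in class (not the real image processor): the redundant attributes are popped so that only `size` reaches the saved config.

```python
# Simplified stand-in class illustrating the suggested to_dict behaviour:
# only `size` is serialized, so a reader of the config file sees a single
# source of truth for the pixel bounds.
class FakeImageProcessor:
    def __init__(self, min_pixels=56 * 56, max_pixels=28 * 28 * 1280, size=None):
        self.min_pixels = min_pixels
        self.max_pixels = max_pixels
        self.size = size if size is not None else {"min_pixels": min_pixels, "max_pixels": max_pixels}

    def to_dict(self):
        output = dict(self.__dict__)
        # Drop the attributes that duplicate what `size` already carries.
        output.pop("min_pixels", None)
        output.pop("max_pixels", None)
        return output
```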

Member Author

Hmm, that's a good point. But I think priority should be given to max_pixels and min_pixels rather than the size dict, since that is the default setting for Qwen2-VL.

Collaborator

OK, in that case what I'd recommend is just not accepting the size argument and sticking with min_pixels and max_pixels.

Comment on lines 280 to 281
min_pixels = size.get("min_pixels", self.min_pixels)
max_pixels = size.get("max_pixels", self.max_pixels)
Collaborator

This is funny: it implies size can be a dict without the min_pixels and max_pixels values, but that isn't compatible with this image processor, e.g. "height" and "width" from the size dict wouldn't be used. It would be better to raise an error if these keys are missing rather than default to self.min_pixels or self.max_pixels, which creates ambiguity in the behaviour and can mask improper initialization in the init.

Member Author

@zucchini-nlp zucchini-nlp Sep 12, 2024

Yeah, this was the first thing I thought of, but I didn't want to raise errors for calls that seemed to run smoothly earlier. We can raise an error and explain how to pass size; that is also the place where we can say that passing a tuple isn't good either (related to the comment below). Since we added min/max pixels to VALID_SIZE_DICT_KEYS, I think we did it to be able to run get_size_dict and that way validate that no weird keys are used.

I am pro checking the size dict, but maybe we can also check it within the Qwen2 code only and not add it to VALID_SIZE_DICT_KEYS.
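A sketch of the stricter check discussed in this thread: raise instead of silently defaulting when the expected keys are missing. The function name and error message are made up for illustration; they are not the code that was merged.

```python
# Hypothetical validator for Qwen2-VL-style size dicts: reject anything that
# does not carry the min_pixels / max_pixels keys this processor actually uses,
# instead of silently falling back to instance defaults.
def validate_qwen2_vl_size(size):
    if not isinstance(size, dict) or not {"min_pixels", "max_pixels"} <= set(size):
        raise ValueError(
            "Qwen2-VL expects `size` as a dict with `min_pixels` and `max_pixels` keys, "
            f"got {size!r}. Tuples and height/width dicts are not supported."
        )
    return size["min_pixels"], size["max_pixels"]
```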

@@ -382,6 +395,7 @@ def preprocess(
"""
do_resize = do_resize if do_resize is not None else self.do_resize
size = size if size is not None else self.size
size = get_size_dict(size)
Collaborator

This won't work: it will produce a dictionary with "height"/"width", "shortest_edge", or "shortest_edge"/"longest_edge" keys, which are not compatible with this image processor.

Collaborator

@amyeroberts amyeroberts left a comment

Thanks for iterating on this and the clean up!

@zucchini-nlp zucchini-nlp merged commit 2f611d3 into huggingface:main Sep 12, 2024
18 checks passed
BernardZach pushed a commit to BernardZach/transformers that referenced this pull request Dec 5, 2024
* clean-up on qwen2-vl and add generation tests

* add video tests

* Update tests/models/qwen2_vl/test_processing_qwen2_vl.py

Co-authored-by: amyeroberts <[email protected]>

* fix and add better tests

* Update src/transformers/models/qwen2_vl/image_processing_qwen2_vl.py

Co-authored-by: amyeroberts <[email protected]>

* update docs and address comments

* Update docs/source/en/model_doc/qwen2_vl.md

Co-authored-by: amyeroberts <[email protected]>

* Update docs/source/en/model_doc/qwen2_vl.md

Co-authored-by: amyeroberts <[email protected]>

* update

* remove size at all

---------

Co-authored-by: amyeroberts <[email protected]>

4 participants