Qwen2-VL: clean-up and add more tests #33354
Conversation
Thanks for adding this!
```python
        Maximum length of the returned list and optionally padding length (see above).
    truncation (`bool`, *optional*):
        Activates truncation to cut input sequences longer than `max_length` to `max_length`.
    return_tensors (`str` or [`~utils.TensorType`], *optional*):
```
Let's keep `return_tensors` in the docstring for now. Although we're removing it as a defined kwarg, users need to set it to be able to properly use the processor for training.
```python
all_kwargs = {
    "common_kwargs": {"return_tensors": "pt"},
    "images_kwargs": {"size": {"height": 214, "width": 214}},
    "videos_kwargs": {"size": {"height": 214, "width": 214}},
}
```
I'm guessing the processing doesn't work if the video and image have different sizes set, as we can't batch? Could we test passing one and not the other?
Yes, passing only one worked for me, but I'll also add better tests.
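The "passing one and not the other" case boils down to how per-modality kwargs get merged with processor defaults. A minimal sketch of that merging logic (names are illustrative, not the actual transformers internals):

```python
# Hypothetical sketch: merge user-supplied per-modality kwargs over the
# processor defaults, so that supplying only "images_kwargs" still leaves
# "videos_kwargs" at its defaults.
def merge_processor_kwargs(defaults, overrides):
    merged = {modality: dict(kwargs) for modality, kwargs in defaults.items()}
    for modality, kwargs in overrides.items():
        merged.setdefault(modality, {}).update(kwargs)
    return merged

defaults = {
    "images_kwargs": {"size": {"min_pixels": 56 * 56, "max_pixels": 28 * 28 * 1280}},
    "videos_kwargs": {"size": {"min_pixels": 56 * 56, "max_pixels": 28 * 28 * 1280}},
}

# Pass only an image size override; videos keep their defaults.
only_images = merge_processor_kwargs(
    defaults, {"images_kwargs": {"size": {"height": 214, "width": 214}}}
)
```

A test for this scenario would then just assert that the untouched modality still carries the default size.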
Okay, I found that `size` wasn't being used at all - my bad for missing that when reviewing. Since Qwen2-VL can't resize to an arbitrary given size (it has to smart-resize to keep as much resolution as possible), its size dict will differ from other models'. I added that to the docstring.

I also modified all tests to follow the new `size` format and to actually use it when users pass it with the correct dict keys. WDYT? Should we also find a way to support tuple-size? I'll also update the model doc page soon.
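For context, the smart-resize constraint is roughly: round each side to a multiple of the patch grid, then rescale so the total pixel count lands between `min_pixels` and `max_pixels`. A simplified sketch of this logic (the exact factors and rounding in the model's image processor may differ):

```python
import math

# Simplified sketch of Qwen2-VL-style smart resizing: dimensions are
# snapped to multiples of `factor` (patch_size * merge_size) and the total
# area is kept within [min_pixels, max_pixels]. Constants are illustrative.
def smart_resize(height, width, factor=28, min_pixels=56 * 56, max_pixels=14 * 14 * 4 * 1280):
    h_bar = round(height / factor) * factor
    w_bar = round(width / factor) * factor
    if h_bar * w_bar > max_pixels:
        # Shrink proportionally so the area fits under max_pixels.
        beta = math.sqrt((height * width) / max_pixels)
        h_bar = math.floor(height / beta / factor) * factor
        w_bar = math.floor(width / beta / factor) * factor
    elif h_bar * w_bar < min_pixels:
        # Grow proportionally so the area reaches at least min_pixels.
        beta = math.sqrt(min_pixels / (height * width))
        h_bar = math.ceil(height * beta / factor) * factor
        w_bar = math.ceil(width * beta / factor) * factor
    return h_bar, w_bar
```

This is why a plain `{"height": ..., "width": ...}` request can't be honored exactly: the output must stay on the patch grid and within the pixel budget.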
Updated the docs and added a usage example showing how users can change the min/max resolution at processor call time.
> Should we also find a way to support tuple-size?

No. Previously `size` was a tuple, and this caused issues:

- some models stored it as (width, height)
- some models stored just an int; within that set of models, some would then resize to (int, int), others to (int, int * height / width), etc.
- some models used a tuple to denote the longest and shortest edges
There was a huge level of ambiguity and inconsistency across the configs which made image processor behaviour difficult to predict without having to dig into the code.
Technically, tuples are supported via `get_size_dict`, but this is for backwards compatibility rather than an encouraged way to pass in inputs.
Adding `min_pixels` and `max_pixels` isn't a light decision, as it sets in stone how this will be expressed for all configs going forward. I think this change is OK, but we should be cautious about just adding keys to the size dict. You'll also need to update the other code which verifies the keys in size dicts - namely `VALID_SIZE_DICT_KEYS`.
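To make the suggestion concrete, here is a simplified sketch of what such key-set validation could look like, with the Qwen2-VL-style `{"min_pixels", "max_pixels"}` set added; this is an illustration, not the actual transformers constant:

```python
# Simplified sketch of size-dict key validation. Each entry is a complete
# set of keys a size dict may use; the hypothetical Qwen2-VL entry is last.
VALID_SIZE_DICT_KEYS = (
    {"height", "width"},
    {"shortest_edge"},
    {"shortest_edge", "longest_edge"},
    {"min_pixels", "max_pixels"},  # hypothetical Qwen2-VL addition
)

def is_valid_size_dict(size_dict):
    # The dict's key set must exactly match one of the allowed sets.
    return isinstance(size_dict, dict) and set(size_dict) in VALID_SIZE_DICT_KEYS
```

With this, a mixed or misspelled dict such as `{"min_pixels": ..., "height": ...}` is rejected rather than silently half-used.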
> No. Previously size was a tuple and this caused issues as:

I see - that actually makes it easier to use min/max pixels now, because I didn't want to add the new possible keys to `VALID_SIZE_DICT_KEYS`; Qwen is, and will probably remain, the only model accepting such kwargs. But then we couldn't call `get_size_dict` to validate that the size dict is passed correctly.

Okay, I've now added the keys to `VALID_SIZE_DICT_KEYS` and added one more test with incorrect sizes passed as kwargs, in which case the defaults from `self.min_pixels` have to be used.

Ready for re-review.
Co-authored-by: amyeroberts <[email protected]>
Left a comment re the `size` dict - just some updates needed in `image_processing_utils.py` to reflect this addition.
```diff
@@ -192,13 +196,14 @@ def __init__(
         self.patch_size = patch_size
         self.temporal_patch_size = temporal_patch_size
         self.merge_size = merge_size
-        self.size = {"min_pixels": min_pixels, "max_pixels": max_pixels}
+        self.size = size if size is not None else {"min_pixels": min_pixels, "max_pixels": max_pixels}
```
One last thing - if `size` can now be set in the input, i.e. the `min_pixels` and `max_pixels` values in the size dict take precedence over the `min_pixels` and `max_pixels` input arguments, then we should pop `max_pixels` and `min_pixels` in the `to_dict` method so that only `size` is saved out into the config.

That is, if someone were to look at the config file, it wouldn't be clear which value takes precedence; they'd have to read the code to understand, which we should avoid.
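The suggested `to_dict` change could look roughly like this; the class here is a toy stand-in for the image processor, not the real implementation:

```python
# Hedged sketch of the suggested serialization behaviour: drop the
# redundant min_pixels/max_pixels attributes so only the `size` dict is
# written to the config, leaving a single source of truth.
class ToyImageProcessor:
    def __init__(self, min_pixels=3136, max_pixels=1003520, size=None):
        self.min_pixels = min_pixels
        self.max_pixels = max_pixels
        self.size = size if size is not None else {"min_pixels": min_pixels, "max_pixels": max_pixels}

    def to_dict(self):
        output = dict(self.__dict__)
        # Avoid saving two competing sources of truth into the config.
        output.pop("min_pixels", None)
        output.pop("max_pixels", None)
        return output
```

Anyone inspecting the saved config then sees only `size`, with no ambiguity about precedence.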
Hmm, that's a good point. But I think priority should be given to `max_pixels` and `min_pixels` rather than the size dict, since that's the default setting for Qwen2-VL.
OK - what I'd recommend then is just not accepting the `size` argument and sticking with `min_pixels` and `max_pixels`.
```python
min_pixels = size.get("min_pixels", self.min_pixels)
max_pixels = size.get("max_pixels", self.max_pixels)
```
This is funny - it implies `size` can be a dict without the `min_pixels` and `max_pixels` values, but such a dict isn't compatible with this image processor; e.g. "height" and "width" keys from the size dict wouldn't be used. It would be better to raise an error when these keys are missing rather than default to `self.min_pixels` or `self.max_pixels`, which creates ambiguity in the behaviour and can mask improper initialization in the init.
Yeah, this was the first thing I thought of, but I didn't want to raise errors for code that seemed to run smoothly before. We can raise an error and explain how to pass `size`; that's also the place to say that passing a tuple isn't great either (related to the comment below). Since we added min/max pixels to `VALID_SIZE_DICT_KEYS`, I think we did it to be able to run `get_size_dict` and thereby validate that no unexpected keys are used.

I'm pro checking the size dict, but we could also check it within the Qwen2 code only and not add it to `VALID_SIZE_DICT_KEYS`.
```diff
@@ -382,6 +395,7 @@ def preprocess(
         """
         do_resize = do_resize if do_resize is not None else self.do_resize
         size = size if size is not None else self.size
+        size = get_size_dict(size)
```
This won't work - it will produce a dictionary with "height"/"width" or "shortest_edge"/"longest_edge" keys, which are not compatible with this image processor.
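To illustrate why: `get_size_dict` normalizes ints and tuples into the standard key sets. A much-simplified sketch of its behaviour (the real function handles more cases and lives in `image_processing_utils.py`):

```python
# Simplified sketch of get_size_dict's normalization: plain ints and
# tuples become height/width or shortest_edge dicts - exactly the keys
# the Qwen2-VL image processor cannot consume.
def get_size_dict_sketch(size, default_to_square=True):
    if isinstance(size, int):
        if default_to_square:
            return {"height": size, "width": size}
        return {"shortest_edge": size}
    if isinstance(size, (tuple, list)):
        return {"height": size[0], "width": size[1]}
    return size
```

So any legacy input routed through it comes out with keys that `min_pixels`/`max_pixels`-based preprocessing would simply ignore.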
Thanks for iterating on this and the clean up!
* clean-up on qwen2-vl and add generation tests
* add video tests
* Update tests/models/qwen2_vl/test_processing_qwen2_vl.py
* fix and add better tests
* Update src/transformers/models/qwen2_vl/image_processing_qwen2_vl.py
* update docs and address comments
* Update docs/source/en/model_doc/qwen2_vl.md
* Update docs/source/en/model_doc/qwen2_vl.md
* update
* remove size at all

Co-authored-by: amyeroberts <[email protected]>
What does this PR do?
This PR standardizes the Qwen2-VL processors and adds tests for processing and generation. One thing to note: the generation tests currently operate on the text-only modality. I will add support for multimodal tests very soon.
TODO: