Add DetrImageProcessorFast #34063
Conversation
Very nice work! I added a couple of comments, in addition to one on the page you linked regarding the performance of `.to(device).to(dtype)` vs `.to(device, dtype)`: I found the second one to always be faster when moving to a CUDA device, fwiw.
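For reference, a minimal timing sketch of the two patterns being compared (my own illustration, not code from this PR; numbers will vary by hardware and tensor shapes):

```python
import time
import torch

x = torch.randn(8, 3, 1333, 1333)  # roughly a DETR-sized padded batch

if torch.cuda.is_available():
    # Pattern 1: two separate calls, device first then dtype
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(50):
        _ = x.to("cuda").to(torch.float16)
    torch.cuda.synchronize()
    print(f".to(device).to(dtype): {time.perf_counter() - start:.3f}s")

    # Pattern 2: a single call with both device and dtype
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(50):
        _ = x.to("cuda", torch.float16)
    torch.cuda.synchronize()
    print(f".to(device, dtype):    {time.perf_counter() - start:.3f}s")
```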
5: F.InterpolationMode.HAMMING,
}
This method doesn't have any tensor <-> array change right?
# Copied from transformers.models.detr.image_processing_detr.get_size_with_aspect_ratio
Yes indeed. I'm not sure if I should import those functions directly from image_processing_detr.py instead of using `# Copied from` at this point, wdyt?
We typically don't import from another file, but it'd be nice to do it, yes - in general we should be able to import more generic methods from a ...utils file (say processing_utils.py, image_processing_utils.py, image_processing_utils_fast.py) instead of redefining them for each model. cc @ArthurZucker for this pattern type as we discussed it before.
Yep I think it's good to think about how we can refactor a bit!
size = int(round(raw_size))

if (height <= width and height == size) or (width <= height and width == size):
    oh, ow = height, width
And btw (for later), these methods could use a bit of cleanup 🧹 🧹 regarding the variable naming for oh + ow.
if "max_size" in kwargs: | ||
logger.warning_once( | ||
"The `max_size` argument is deprecated and will be removed in a future version, use" | ||
" `size['longest_edge']` instead." | ||
) | ||
size = kwargs.pop("max_size") |
we can use the deprecate_kwarg util here
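For illustration, a rough sketch of what that could look like (assuming `deprecate_kwarg` is importable from transformers.utils.deprecation and takes the old kwarg name, a target version, and the new name — the helper function and version number below are made up, not the PR's code):

```python
from transformers.utils.deprecation import deprecate_kwarg


# Hypothetical helper: callers passing `max_size` get a deprecation warning and the
# value is forwarded to `size`, replacing the manual warning_once/pop shown above.
@deprecate_kwarg("max_size", version="5.0.0", new_name="size")
def get_resize_output_size(image_size, size=None):
    height, width = image_size
    scale = size / max(height, width)
    return int(round(height * scale)), int(round(width * scale))


print(get_resize_output_size((480, 640), max_size=1333))  # warns, then behaves as size=1333
```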
I agree, but using it causes test_image_processor_preprocess_arguments to fail. I'm guessing this test does not work well with the way the decorator calls the function and passes the arguments.
Dove a bit here, nice refactor and improvement - I think if we can move to a more modular way of building processors, as we try to do with models, it'll reduce the LOC count greatly and improve readability. A modular_image_processing_detr_fast that inherits the necessary funcs and then builds image_processing_detr_fast automatically would be amazing, for instance. cc @ArthurZucker for visibility (not a requirement on this PR imo).
Yes I was thinking about this too, because the LOC count is huge compared to the actual changes. I might be wrong, but it looks like something similar was done to reduce the LOC count for fast tokenizers? Not sure who to ping here for clarification.
Yeah! Kind of like we have a PreTrainedModel, we could have an ImageProcessorMixin that would be a bit more powerful! For fast tokenizers, we just rely on the tokenizers library directly, so it's a bit different.
).T
self.assertTrue(torch.allclose(encoding["labels"][0]["boxes"], expected_boxes_0, rtol=1))
self.assertTrue(torch.allclose(encoding["labels"][1]["boxes"], expected_boxes_1, rtol=1))
for image_processing_class in self.image_processor_list:
✅
It's interesting the git diff is so large here - unless I'm wrong we are generalizing the tests to integrate a list of processors, fast and slow, but the tests do not change?
Yes, I'm guessing it's because of the change in indentation, but all I changed is testing a list of processors instead of only one.
Force-pushed from 91f050d to d4ed4ee
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
LGTM in general, but having 1600 LOC for an image processor is quite intense!
Great performance gains tho
This is kind of void of copied from, which I am a bit surprised about!
Most methods are very similar to the ones in image_processing_detr, but slightly changed to work with tensors instead of numpy arrays. I haven't changed the post-processing methods at all (as they are not as much of a speed bottleneck), so they are all added with copied from (end of the file)
ah okay, got it!
Force-pushed from 4f6d93f to 8f57e14
Force-pushed from 8f57e14 to 69bde86
* add fully functionning image_processing_detr_fast
* Create tensors on the correct device
* fix copies
* fix doc
* add tests equivalence cpu gpu
* fix doc en
* add relative imports and copied from
* Fix copies and nit
@yonigozlan Hi, super excited to see this. Just left two questions.
Hi @SangbumChoi !
@yonigozlan thanks! I would love to support this task if you need some help. Also, if you could share the comparison script for compile + visualization plots etc., it would be very helpful!
What does this PR do?
Adds a fast image processor for DETR. Follows issue #33810.
This image processor is the result of this work on comparing different image processing methods.
The processing methods use only torchvision transforms (either v1 or v2, depending on the torchvision version) and torch tensors.
Just like the current DETR image processor, this processor can also process object detection or segmentation annotations. This processing also uses only torch tensors and torchvision transforms.
The post-processing methods have not been modified from the original image processor.
Implementation
A previous fast image processor implementation for ViT (link) uses torchvision transform classes and Compose to create a one-step processing class. However, this poses two problems:

So this implementation uses the functional forms of torchvision transforms instead, and its structure is very similar to the current DETR image processor.
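As a rough illustration of this functional style (and of the torchvision v1/v2 fallback mentioned above) — a sketch of my own, not the PR's actual code:

```python
import torch

# Prefer the torchvision v2 functional API when available, fall back to v1 otherwise.
try:
    from torchvision.transforms.v2 import functional as F
except ImportError:
    from torchvision.transforms import functional as F


def resize_and_normalize(image: torch.Tensor) -> torch.Tensor:
    # Functional calls are composed explicitly instead of going through a Compose
    # pipeline, which keeps per-image control flow (e.g. aspect-ratio-dependent
    # output sizes) straightforward.
    image = F.resize(image, [800, 1066], antialias=True)
    image = image.to(torch.float32) / 255.0  # rescale uint8 -> [0, 1]
    return F.normalize(image, mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])


print(resize_and_normalize(torch.randint(0, 256, (3, 480, 640), dtype=torch.uint8)).shape)
```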
All the numpy/PIL operations have been converted to torch or torchvision operations, and like the ViT fast image processor, this processor only accepts return_tensors="pt".
The processor call function accepts a device kwarg, as processing can be performed on both CPU and GPU, but is much faster on GPU. I wanted to add device as an init argument, but that would make the signatures of the fast and slow processors different, which makes some tests fail.

Usage
Except for the fact that it only returns torch tensors, this fast processor is fully compatible with the current one.
It can be instantiated through AutoImageProcessor with use_fast=True, or through the class directly:
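For example (using the facebook/detr-resnet-50 checkpoint for illustration; the fast class needs torchvision installed):

```python
from transformers import AutoImageProcessor, DetrImageProcessorFast

# Through AutoImageProcessor, opting into the fast implementation...
processor = AutoImageProcessor.from_pretrained("facebook/detr-resnet-50", use_fast=True)

# ...or through the class directly.
processor = DetrImageProcessorFast.from_pretrained("facebook/detr-resnet-50")
```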
Usage is the same as the current processor, except for the device kwarg:
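A minimal usage sketch (a blank PIL image stands in for real data here):

```python
import torch
from PIL import Image
from transformers import DetrImageProcessorFast

processor = DetrImageProcessorFast.from_pretrained("facebook/detr-resnet-50")
image = Image.new("RGB", (640, 480))  # placeholder image for illustration

# The call is identical to the slow processor, plus an optional device kwarg.
device = "cuda" if torch.cuda.is_available() else "cpu"
inputs = processor(images=image, return_tensors="pt", device=device)
print(inputs["pixel_values"].shape, inputs["pixel_values"].device)
```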
If device is not specified:

Performance gains
Main Takeaways

Processing speedup

Inference pass speedup (GPU)

- batch_size=8. Padding needed, as the images have different sizes and the DETR processor resizes them using "shortest_edge"/"longest_edge", resulting in differently sized resized images.
- batch_size=8. Forcing padding to 1333x1333 (="longest_edge"), as otherwise torch.compile needs to recompile if different batches have different max sizes. (I'm not sure what is going wrong when using the compiled model with the current processor.)
- batch_size=1. Forcing padding to 1333x1333 for comparison with batched inputs.

Tests
Looking forward to your feedback!
I was also wondering if we should adopt a more modular approach to the fast image processors, as there is quite a lot of repetition with the "slow" processor for now. It looks like something like this was done for fast tokenizers? If someone who worked on fast tokenizers has any advice on that, I'll gladly hear it 🤗.
There will also be the question of how to advertise this "use_fast" option to users, and whether we want to make it the default eventually when torchvision is available.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.