
Add DetrImageProcessorFast #34063

Merged
merged 8 commits into huggingface:main from add-detr-image-processor-fast on Oct 21, 2024

Conversation

@yonigozlan (Member) commented Oct 10, 2024

What does this PR do?

Adds a fast image processor for DETR, following issue #33810.
This image processor is a result of this work on comparing different image processing methods.

The processing methods use only torchvision transforms (either v1 or v2, depending on the torchvision version) and torch tensors.
Just like the current DETR image processor, this processor can also process object detection or segmentation annotations (see the sketch below). This annotation processing likewise uses only torch tensors and torchvision transforms.
The post-processing methods have not been modified from the original image processor.
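To illustrate the annotation handling mentioned above, here is a hedged sketch mirroring the slow processor's API (the file name and annotation values are made up for the example):

from transformers import DetrImageProcessorFast
from PIL import Image

# Hypothetical local image; the COCO-style annotation values below are illustrative only
image = Image.open("example.jpg")
annotations = {
    "image_id": 0,
    "annotations": [
        {"bbox": [10, 20, 50, 80], "category_id": 1, "area": 4000.0, "iscrowd": 0}
    ],
}

processor = DetrImageProcessorFast.from_pretrained("facebook/detr-resnet-50")
# Annotations are converted alongside the image, using only torch/torchvision ops
encoding = processor(images=image, annotations=annotations, return_tensors="pt")
print(encoding.keys())  # e.g. pixel_values, pixel_mask, labels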

Implementation

A previous fast image processor implementation for ViT (link) uses torchvision transform classes and Compose to create a one-step processing class. However, this poses two problems:

  • The torchvision v2 transforms are only torch.compile/scripting compatible in their functional form, not in their class form (source).
  • A one-step processing class is not possible when the processing depends on the input, as is the case for DETR's resizing and padding.

So this implementation uses the functional forms of torchvision transforms, and its structure is very similar to the current DETR image processor's.
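For illustration, a hedged sketch of the functional style (using the v1 functional API here; the exact ops and size logic in the PR differ):

import torch
from torchvision.transforms import InterpolationMode
from torchvision.transforms import functional as F

def preprocess(image: torch.Tensor) -> torch.Tensor:
    # Functional ops stay torch.compile-friendly and let the resize target
    # depend on the input size, which a fixed Compose pipeline cannot do.
    height, width = image.shape[-2:]
    scale = 800 / min(height, width)  # DETR-style "shortest_edge" resize (values illustrative)
    new_size = [int(round(height * scale)), int(round(width * scale))]
    image = F.resize(image, new_size, interpolation=InterpolationMode.BILINEAR)
    image = F.convert_image_dtype(image, torch.float32)  # uint8 [0, 255] -> float [0, 1]
    return F.normalize(image, mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])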

All the numpy/PIL operations have been converted to torch or torchvision operations, and like the ViT fast image processor, this processor only accepts return_tensors="pt".

The processor's call function accepts a device kwarg, as processing can be performed on both CPU and GPU, but it is much faster on GPU.
I wanted to add device as an init argument, but that would make the signatures of the fast and slow processors differ, which makes some tests fail.

Usage

Except for the fact that it only returns torch tensors, this fast processor is fully compatible with the current one.
It can be instantiated through AutoImageProcessor with use_fast=True, or through the class directly:

from transformers import AutoImageProcessor

processor = AutoImageProcessor.from_pretrained("facebook/detr-resnet-50", use_fast=True)

# or, equivalently:
from transformers import DetrImageProcessorFast

processor = DetrImageProcessorFast.from_pretrained("facebook/detr-resnet-50")

Usage is the same as the current processor, except for the device kwarg:

from torchvision.io import read_image

images = read_image(image_path)
processor = DetrImageProcessorFast.from_pretrained("facebook/detr-resnet-50")
images_processed = processor(images, return_tensors="pt", device="cuda")

If device is not specified:

  • If the input images are tensors, the processing will be done on the device of the images.
  • If the inputs are PIL or Numpy images, the processing is done on CPU.
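As a small sketch of this default device selection (shapes and values are illustrative):

import numpy as np
import torch
from transformers import DetrImageProcessorFast

processor = DetrImageProcessorFast.from_pretrained("facebook/detr-resnet-50")

# Tensor input already on the GPU: processing runs on the images' device
images_gpu = torch.randint(0, 256, (3, 480, 640), dtype=torch.uint8, device="cuda")
out = processor(images_gpu, return_tensors="pt")
print(out["pixel_values"].device)  # expected: cuda:0

# Numpy (or PIL) input with no device kwarg: processing runs on the CPU
images_np = np.random.randint(0, 256, (480, 640, 3), dtype=np.uint8)
out = processor(images_np, return_tensors="pt")
print(out["pixel_values"].device)  # expected: cpu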

Performance gains

Main Takeaways

Processing speedup

  • ~60x faster processing on GPU (single image)
  • ~80x faster processing on GPU (batch_size=8)
  • ~5x faster processing on CPU (single image)
  • ~2.6x faster processing on CPU (batch_size=8)

Inference pass speedup (GPU)

  • ~2.2x speedup on whole model inference pass (single image, eager)
  • ~3.2x speedup on whole model inference pass (single image, compiled)
  • ~2.4x speedup on whole model inference pass (batch_size=8, eager)

  • Average over 100 runs on the same 480x640 image. No padding needed, as "all" the images have the same size.

[figure: benchmark_results_full_pipeline_detr_fast]


  • Average over 10% of the COCO 2017 validation dataset, with batch_size=8. Padding needed, as the images have different sizes and the DETR processor resizes them using "shortest_edge"/"longest_edge", resulting in differently sized resized images.

[figure: benchmark_results_full_pipeline_detr_fast_batched]


  • Average over 10% of the COCO 2017 validation dataset, with batch_size=8. Forcing padding to 1333x1333 (= "longest_edge"), as otherwise torch.compile needs to recompile whenever batches have different max sizes (a sketch of this fixed-size padding follows the benchmark figures below).
    (I'm not sure what is going wrong when using the compiled model with the current processor)

[figure: benchmark_results_full_pipeline_detr_fast_batched_compiled]


  • Average over 10% of the COCO 2017 validation dataset, with batch_size=1. Forcing padding to 1333x1333 for comparison with batched inputs

[figure: benchmark_results_full_pipeline_detr_fast_padded]
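As referenced above, a hedged sketch of padding every image to a fixed canvas so torch.compile sees a static input shape across batches (the helper name and default sizes are illustrative, not the PR's actual code):

import torch
import torch.nn.functional as F

def pad_to_fixed_size(image: torch.Tensor, target_h: int = 1333, target_w: int = 1333) -> torch.Tensor:
    # Pad on the right and bottom so every batch has shape (..., 1333, 1333),
    # avoiding torch.compile recompiles triggered by varying max sizes.
    h, w = image.shape[-2:]
    return F.pad(image, (0, target_w - w, 0, target_h - h), value=0)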


Tests

  • The new image processor runs against all the tests of the current processor.
  • I have also added two consistency tests (panoptic and detection) comparing processing on GPU vs CPU (a sketch of the idea follows below).
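A hedged sketch of the idea behind such a CPU/GPU consistency check (names are illustrative, not the actual test code):

import torch
from transformers import DetrImageProcessorFast

processor = DetrImageProcessorFast.from_pretrained("facebook/detr-resnet-50")
image = torch.randint(0, 256, (3, 480, 640), dtype=torch.uint8)

# Process the same input on both devices and compare the outputs
out_cpu = processor(image, return_tensors="pt", device="cpu")
out_gpu = processor(image, return_tensors="pt", device="cuda")
torch.testing.assert_close(out_cpu["pixel_values"], out_gpu["pixel_values"].cpu())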

Looking forward to your feedback!
I was also wondering if we should adopt a more modular approach to the fast image processors, as there is quite a lot of repetition with the "slow" processor for now. It looks like something like this was done for fast tokenizers? If someone who worked on fast tokenizers has any advice on that, I'll gladly hear it 🤗.
There is also the question of how to advertise this "use_fast" option to users, and whether we eventually want to make it the default when torchvision is available.

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@yonigozlan yonigozlan changed the title add fully functionning image_processing_detr_fast Add DetrImageProcessorFast Oct 10, 2024
@yonigozlan yonigozlan marked this pull request as ready for review October 10, 2024 12:42
@yonigozlan yonigozlan requested review from qubvel and molbap October 10, 2024 12:42
@molbap (Contributor) left a comment

Very nice work! I added a couple of comments, in addition to one on the page you linked regarding the performance of .to(device).to(dtype) vs .to(device, dtype); I found the second to always be faster when moving to a CUDA device, fwiw
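For reference, a minimal illustration of the two styles being compared (standard PyTorch tensor semantics; timings not shown):

import torch

x = torch.randn(3, 800, 800)
a = x.to("cuda").to(torch.float16)  # two calls: move, then cast
b = x.to("cuda", torch.float16)     # one fused call, reported faster above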

src/transformers/models/detr/image_processing_detr_fast.py (outdated)
5: F.InterpolationMode.HAMMING,
}


Contributor:

This method doesn't have any tensor <-> array change right?

Suggested change (remove this line):
# Copied from transformers.models.detr.image_processing_detr.get_size_with_aspect_ratio

Member Author:

Yes indeed. I'm not sure if I should import those functions directly from image_processing_detr.py instead of using # Copied from at this point, wdyt?

Contributor:

We typically don't import from another file, but it'd be nice to do it, yes. In general we should be able to import more generic methods from a ...utils file (say processing_utils.py, image_processing_utils.py, image_processing_utils_fast.py) instead of redefining them for each model. cc @ArthurZucker for this pattern type, as we discussed it before

Collaborator:

Yep I think it's good to think about how we can refactor a bit!

src/transformers/models/detr/image_processing_detr_fast.py (outdated)
src/transformers/models/detr/image_processing_detr_fast.py (outdated)
size = int(round(raw_size))

if (height <= width and height == size) or (width <= height and width == size):
    oh, ow = height, width
Contributor:

And btw (for later), these methods could use some cleanup 🧹 🧹 regarding the variable naming for oh + ow

src/transformers/models/detr/image_processing_detr_fast.py (outdated)
Comment on lines +1025 to +833
if "max_size" in kwargs:
logger.warning_once(
"The `max_size` argument is deprecated and will be removed in a future version, use"
" `size['longest_edge']` instead."
)
size = kwargs.pop("max_size")
Contributor:

we can use the deprecate_kwarg util here
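For illustration, a hedged sketch of what this could look like, assuming the deprecate_kwarg helper from transformers.utils.deprecation (the decorator arguments and version string are illustrative):

from transformers.utils.deprecation import deprecate_kwarg

@deprecate_kwarg("max_size", version="5.0")  # version is illustrative
def resize(image, size=None, max_size=None):
    # Callers passing `max_size` get a deprecation warning from the decorator,
    # so the hand-written warning/pop logic above would no longer be needed.
    ...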

Member Author:

I agree, but using it causes test_image_processor_preprocess_arguments to fail. I'm guessing this test does not work well with the way the decorator calls the function and passes the arguments

Contributor:

Dove in a bit here; nice refactor and improvement. I think if we can move to a more modular way of building processors, as we try to do with models, it'll greatly reduce the LOC count and improve readability. A modular_image_processing_detr_fast that inherits the necessary funcs and then automatically builds image_processing_detr_fast would be amazing, for instance. cc @ArthurZucker for visibility (not a requirement on this PR imo)

Member Author:

Yes, I was thinking about this too, because the LOC count is huge compared to the actual changes. I might be wrong, but it looks like something similar was done to reduce the LOC count for fast tokenizers? Not sure who to ping here for clarification

Collaborator:

Yeah! Kind of like we have a PreTrainedModel, we could have an ImageProcessorMixin that would be a bit more powerful!

For fast tokenizers, we just rely on the tokenizers library directly, so it's a bit different

).T
self.assertTrue(torch.allclose(encoding["labels"][0]["boxes"], expected_boxes_0, rtol=1))
self.assertTrue(torch.allclose(encoding["labels"][1]["boxes"], expected_boxes_1, rtol=1))
for image_processing_class in self.image_processor_list:
Contributor:

It's interesting that the git diff is so large here - unless I'm wrong, we are generalizing the tests to iterate over a list of processors, fast and slow, but the tests themselves do not change?

Member Author:

Yes, I'm guessing it's because of the change in indentation, but all I changed is testing a list of processors instead of only one.

@yonigozlan yonigozlan force-pushed the add-detr-image-processor-fast branch from 91f050d to d4ed4ee Compare October 14, 2024 21:09
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@ArthurZucker (Collaborator) left a comment

LGTM in general, but having 1600 LOC for an image processor is quite intense!
Great performance gains tho


Collaborator:

This is kind of devoid of Copied from, which I am a bit surprised about!

Member Author:

Most methods are very similar to the ones in image_processing_detr, but slightly changed to work with tensors instead of numpy arrays. I haven't changed the post-processing methods at all (as they are not as much of a speed bottleneck), so they are all added with Copied from (at the end of the file)

Collaborator:

ah okay, got it!

@yonigozlan yonigozlan force-pushed the add-detr-image-processor-fast branch from 4f6d93f to 8f57e14 Compare October 21, 2024 12:28
@yonigozlan yonigozlan force-pushed the add-detr-image-processor-fast branch from 8f57e14 to 69bde86 Compare October 21, 2024 12:56
@yonigozlan yonigozlan merged commit a412281 into huggingface:main Oct 21, 2024
24 of 26 checks passed
stevhliu pushed a commit to stevhliu/transformers that referenced this pull request Oct 21, 2024
* add fully functionning image_processing_detr_fast

* Create tensors on the correct device

* fix copies

* fix doc

* add tests equivalence cpu gpu

* fix doc en

* add relative imports and copied from

* Fix copies and nit
@SangbumChoi (Contributor) commented Oct 22, 2024

@yonigozlan Hi, super excited to see this. Just left a few questions:

  1. The figure says RTDetr, but this is Detr, right?
  2. I think this can also be adapted to all vision processors?
  3. Can you also elaborate on the computing device?

@yonigozlan (Member Author)

Hi @SangbumChoi !

  1. Oops, thanks for pointing this out. Yep, it's a typo: this is indeed Detr (but I'm also working on a fast image processor for RTDetr :) ).
  2. Yes, absolutely, this is the plan. There are quite a few other image processors based on the Detr one, so their fast image processors should be out soon.
  3. This was on an A10 GPU!

@SangbumChoi (Contributor)

@yonigozlan thanks! I would love to support this task if you need some help. Also, if you can share the comparison script for compile + the visualization plots etc., it would be very helpful!

BernardZach pushed a commit to BernardZach/transformers that referenced this pull request Dec 5, 2024