
Add DetrImageProcessorFast #34063

Merged
merged 8 commits into huggingface:main from add-detr-image-processor-fast on Oct 21, 2024

Conversation

@yonigozlan (Member) commented Oct 10, 2024

What does this PR do?

Adds a fast image processor for DETR, following issue #33810.
This image processor is a result of this work on comparing different image processing methods.

The processing methods use only torchvision transforms (either v1 or v2, depending on the torchvision version) and torch tensors.
Just like the current DETR image processor, this processor can also process object detection or segmentation annotations (see the sketch below). This annotation processing likewise uses only torch tensors and torchvision transforms.
The post-processing methods have not been modified from the original image processor.
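To illustrate the annotation handling mentioned above, here is a hedged sketch mirroring the slow processor's API (the file name and annotation values are made up for the example):

from transformers import DetrImageProcessorFast
from PIL import Image

# Hypothetical local image; the COCO-style annotation values below are illustrative only
image = Image.open("example.jpg")
annotations = {
    "image_id": 0,
    "annotations": [
        {"bbox": [10, 20, 50, 80], "category_id": 1, "area": 4000.0, "iscrowd": 0}
    ],
}

processor = DetrImageProcessorFast.from_pretrained("facebook/detr-resnet-50")
# Annotations are converted alongside the image, using only torch/torchvision ops
encoding = processor(images=image, annotations=annotations, return_tensors="pt")
print(encoding.keys())  # e.g. pixel_values, pixel_mask, labels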

Implementation

A previous fast image processor implementation for ViT (link) uses torchvision transform classes and Compose to create a one-step processing class. However, this poses two problems:

  • The torchvision v2 transforms are only torch.compile/scripting compatible in their functional form, not in their class form (source).
  • A one-step processing class is not possible when the processing depends on the input, as is the case for DETR's resizing and padding.

So this implementation uses the functional forms of torchvision transforms, and its structure is very similar to the current DETR image processor's.
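For illustration, a hedged sketch of the functional style (using the v1 functional API here; the exact ops and size logic in the PR differ):

import torch
from torchvision.transforms import InterpolationMode
from torchvision.transforms import functional as F

def preprocess(image: torch.Tensor) -> torch.Tensor:
    # Functional ops stay torch.compile-friendly and let the resize target
    # depend on the input size, which a fixed Compose pipeline cannot do.
    height, width = image.shape[-2:]
    scale = 800 / min(height, width)  # DETR-style "shortest_edge" resize (values illustrative)
    new_size = [int(round(height * scale)), int(round(width * scale))]
    image = F.resize(image, new_size, interpolation=InterpolationMode.BILINEAR)
    image = F.convert_image_dtype(image, torch.float32)  # uint8 [0, 255] -> float [0, 1]
    return F.normalize(image, mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])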

All the numpy/PIL operations have been converted to torch or torchvision operations, and like the ViT fast image processor, this processor only accepts return_tensors="pt".

The processor's call function accepts a device kwarg, as processing can be performed on both CPU and GPU, but it is much faster on GPU.
I wanted to add device as an init argument, but that would make the signatures of the fast and slow processors differ, which makes some tests fail.

Usage

Except for the fact that it only returns torch tensors, this fast processor is fully compatible with the current one.
It can be instantiated through AutoImageProcessor with use_fast=True, or through the class directly:

from transformers import AutoImageProcessor

processor = AutoImageProcessor.from_pretrained("facebook/detr-resnet-50", use_fast=True)

# or, equivalently:
from transformers import DetrImageProcessorFast

processor = DetrImageProcessorFast.from_pretrained("facebook/detr-resnet-50")

Usage is the same as the current processor, except for the device kwarg:

from torchvision.io import read_image

images = read_image(image_path)
processor = DetrImageProcessorFast.from_pretrained("facebook/detr-resnet-50")
images_processed = processor(images, return_tensors="pt", device="cuda")

If device is not specified:

  • If the input images are tensors, the processing will be done on the device of the images.
  • If the inputs are PIL or Numpy images, the processing is done on CPU.
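As a small sketch of this default device selection (shapes and values are illustrative):

import numpy as np
import torch
from transformers import DetrImageProcessorFast

processor = DetrImageProcessorFast.from_pretrained("facebook/detr-resnet-50")

# Tensor input already on the GPU: processing runs on the images' device
images_gpu = torch.randint(0, 256, (3, 480, 640), dtype=torch.uint8, device="cuda")
out = processor(images_gpu, return_tensors="pt")
print(out["pixel_values"].device)  # expected: cuda:0

# Numpy (or PIL) input with no device kwarg: processing runs on the CPU
images_np = np.random.randint(0, 256, (480, 640, 3), dtype=np.uint8)
out = processor(images_np, return_tensors="pt")
print(out["pixel_values"].device)  # expected: cpu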

Performance gains

Main Takeaways

Processing speedup

  • ~60x faster processing on GPU (single image)
  • ~80x faster processing on GPU (batch_size=8)
  • ~5x faster processing on CPU (single image)
  • ~2.6x faster processing on CPU (batch_size=8)

Inference pass speedup (GPU)

  • ~2.2x speedup on whole model inference pass (single image, eager)
  • ~3.2x speedup on whole model inference pass (single image, compiled)
  • ~2.4x speedup on whole model inference pass (batch_size=8, eager)

  • Average over 100 runs on the same 480x640 image. No padding needed, as "all" the images have the same size.

[figure: benchmark_results_full_pipeline_detr_fast]


  • Average over 10% of the COCO 2017 validation dataset, with batch_size=8. Padding needed, as the images have different sizes and the DETR processor resizes them using "shortest_edge"/"longest_edge", resulting in differently sized resized images.

[figure: benchmark_results_full_pipeline_detr_fast_batched]


  • Average over 10% of the COCO 2017 validation dataset, with batch_size=8. Forcing padding to 1333x1333 (= "longest_edge"), as otherwise torch.compile needs to recompile whenever batches have different max sizes (a sketch of this fixed-size padding follows the benchmark figures below).
    (I'm not sure what is going wrong when using the compiled model with the current processor)

[figure: benchmark_results_full_pipeline_detr_fast_batched_compiled]


  • Average over 10% of the COCO 2017 validation dataset, with batch_size=1. Forcing padding to 1333x1333 for comparison with batched inputs

[figure: benchmark_results_full_pipeline_detr_fast_padded]
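As referenced above, a hedged sketch of padding every image to a fixed canvas so torch.compile sees a static input shape across batches (the helper name and default sizes are illustrative, not the PR's actual code):

import torch
import torch.nn.functional as F

def pad_to_fixed_size(image: torch.Tensor, target_h: int = 1333, target_w: int = 1333) -> torch.Tensor:
    # Pad on the right and bottom so every batch has shape (..., 1333, 1333),
    # avoiding torch.compile recompiles triggered by varying max sizes.
    h, w = image.shape[-2:]
    return F.pad(image, (0, target_w - w, 0, target_h - h), value=0)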


Tests

  • The new image processor runs against all the tests of the current processor.
  • I have also added two consistency tests (panoptic and detection) comparing processing on GPU vs CPU (a sketch of the idea follows below).
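A hedged sketch of the idea behind such a CPU/GPU consistency check (names are illustrative, not the actual test code):

import torch
from transformers import DetrImageProcessorFast

processor = DetrImageProcessorFast.from_pretrained("facebook/detr-resnet-50")
image = torch.randint(0, 256, (3, 480, 640), dtype=torch.uint8)

# Process the same input on both devices and compare the outputs
out_cpu = processor(image, return_tensors="pt", device="cpu")
out_gpu = processor(image, return_tensors="pt", device="cuda")
torch.testing.assert_close(out_cpu["pixel_values"], out_gpu["pixel_values"].cpu())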

Looking forward to your feedback!
I was also wondering if we should adopt a more modular approach to the fast image processors, as there is quite a lot of repetition with the "slow" processor for now. It looks like something like this was done for fast tokenizers? If someone who worked on fast tokenizers has any advice on that, I'll gladly hear it 🤗.
There is also the question of how to advertise this "use_fast" option to users, and whether we eventually want to make it the default when torchvision is available.

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@yonigozlan yonigozlan changed the title add fully functionning image_processing_detr_fast Add DetrImageProcessorFast Oct 10, 2024
@yonigozlan yonigozlan marked this pull request as ready for review October 10, 2024 12:42
@yonigozlan yonigozlan requested review from qubvel and molbap October 10, 2024 12:42
@molbap (Contributor) left a comment

Very nice work! I added a couple of comments, in addition to one on the page you linked regarding the performance of .to(device).to(dtype) vs .to(device, dtype); I found the second to always be faster when moving to a CUDA device, fwiw
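For reference, a minimal illustration of the two styles being compared (standard PyTorch tensor semantics; timings not shown):

import torch

x = torch.randn(3, 800, 800)
a = x.to("cuda").to(torch.float16)  # two calls: move, then cast
b = x.to("cuda", torch.float16)     # one fused call, reported faster above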

src/transformers/models/detr/image_processing_detr_fast.py (outdated)
5: F.InterpolationMode.HAMMING,
}


Contributor:

This method doesn't have any tensor <-> array change right?

Suggested change (remove this line):
# Copied from transformers.models.detr.image_processing_detr.get_size_with_aspect_ratio

Member Author:

Yes indeed. I'm not sure if I should import those functions directly from image_processing_detr.py instead of using # Copied from at this point, wdyt?

Contributor:

We typically don't import from another file, but it'd be nice to do it, yes. In general we should be able to import more generic methods from a ...utils file (say processing_utils.py, image_processing_utils.py, image_processing_utils_fast.py) instead of redefining them for each model. cc @ArthurZucker for this pattern type, as we discussed it before

Collaborator:

Yep I think it's good to think about how we can refactor a bit!

src/transformers/models/detr/image_processing_detr_fast.py (outdated)
src/transformers/models/detr/image_processing_detr_fast.py (outdated)
size = int(round(raw_size))

if (height <= width and height == size) or (width <= height and width == size):
    oh, ow = height, width
Contributor:

And btw (for later), these methods could use some cleanup 🧹 🧹 regarding the variable naming for oh + ow

src/transformers/models/detr/image_processing_detr_fast.py (outdated)
Comment on lines +1025 to +833
if "max_size" in kwargs:
logger.warning_once(
"The `max_size` argument is deprecated and will be removed in a future version, use"
" `size['longest_edge']` instead."
)
size = kwargs.pop("max_size")
Contributor:

we can use the deprecate_kwarg util here
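For illustration, a hedged sketch of what this could look like, assuming the deprecate_kwarg helper from transformers.utils.deprecation (the decorator arguments and version string are illustrative):

from transformers.utils.deprecation import deprecate_kwarg

@deprecate_kwarg("max_size", version="5.0")  # version is illustrative
def resize(image, size=None, max_size=None):
    # Callers passing `max_size` get a deprecation warning from the decorator,
    # so the hand-written warning/pop logic above would no longer be needed.
    ...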

Member Author:

I agree, but using it causes test_image_processor_preprocess_arguments to fail. I'm guessing this test does not work well with the way the decorator calls the function and passes the arguments

Contributor:

Dove in a bit here; nice refactor and improvement. I think if we can move to a more modular way of building processors, as we try to do with models, it'll greatly reduce the LOC count and improve readability. A modular_image_processing_detr_fast that inherits the necessary funcs and then automatically builds image_processing_detr_fast would be amazing, for instance. cc @ArthurZucker for visibility (not a requirement on this PR imo)

Member Author:

Yes, I was thinking about this too, because the LOC count is huge compared to the actual changes. I might be wrong, but it looks like something similar was done to reduce the LOC count for fast tokenizers? Not sure who to ping here for clarification

Collaborator:

Yeah! Kind of like we have a PreTrainedModel, we could have an ImageProcessorMixin that would be a bit more powerful!

For fast tokenizers, we just rely on the tokenizers library directly, so it's a bit different

).T
self.assertTrue(torch.allclose(encoding["labels"][0]["boxes"], expected_boxes_0, rtol=1))
self.assertTrue(torch.allclose(encoding["labels"][1]["boxes"], expected_boxes_1, rtol=1))
for image_processing_class in self.image_processor_list:
Contributor:

It's interesting that the git diff is so large here - unless I'm wrong, we are generalizing the tests to iterate over a list of processors, fast and slow, but the tests themselves do not change?

Member Author:

Yes, I'm guessing it's because of the change in indentation, but all I changed is testing a list of processors instead of only one.

@yonigozlan yonigozlan force-pushed the add-detr-image-processor-fast branch from 91f050d to d4ed4ee Compare October 14, 2024 21:09
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@ArthurZucker (Collaborator) left a comment

LGTM in general, but having 1600 LOC for an image processor is quite intense!
Great performance gains tho


Collaborator:

This is kind of devoid of Copied from, which I am a bit surprised about!

Member Author:

Most methods are very similar to the ones in image_processing_detr, but slightly changed to work with tensors instead of numpy arrays. I haven't changed the post-processing methods at all (as they are not as much of a speed bottleneck), so they are all added with Copied from (at the end of the file)

Collaborator:

ah okay, got it!

@yonigozlan yonigozlan force-pushed the add-detr-image-processor-fast branch from 4f6d93f to 8f57e14 Compare October 21, 2024 12:28
@yonigozlan yonigozlan force-pushed the add-detr-image-processor-fast branch from 8f57e14 to 69bde86 Compare October 21, 2024 12:56
@yonigozlan yonigozlan merged commit a412281 into huggingface:main Oct 21, 2024
24 of 26 checks passed
stevhliu pushed a commit to stevhliu/transformers that referenced this pull request Oct 21, 2024
* add fully functionning image_processing_detr_fast

* Create tensors on the correct device

* fix copies

* fix doc

* add tests equivalence cpu gpu

* fix doc en

* add relative imports and copied from

* Fix copies and nit
@SangbumChoi (Contributor) commented Oct 22, 2024

@yonigozlan Hi, super excited to see this. Just left a few questions:

  1. The figure says RTDetr, but this is Detr, right?
  2. I think this can also be adapted to all vision processors?
  3. Can you also elaborate on the computing device?

@yonigozlan (Member Author)

Hi @SangbumChoi !

  1. Oops, thanks for pointing this out. Yep, it's a typo: this is indeed Detr (but I'm also working on a fast image processor for RTDetr :) ).
  2. Yes, absolutely, this is the plan. There are quite a few other image processors based on the Detr one, so their fast image processors should be out soon.
  3. This was on an A10 GPU!

@SangbumChoi (Contributor)

@yonigozlan thanks! I would love to support this task if you need some help. Also, if you can share the comparison script for compile + the visualization plots etc., it would be very helpful!

BernardZach pushed a commit to BernardZach/transformers that referenced this pull request Dec 5, 2024