Adding GPU acceleration to encode_jpeg #8391

deekay42 · 2024-04-23T05:20:10Z

Summary:
I'm adding GPU support to the existing torchvision.io.encode_jpeg function. If the input tensors are on the GPU, the CUDA version will be used and the CPU version otherwise.

Performance numbers indicate over 5000 imgs/s on 1 A100 GPU:

Processor: x86_64
Platform: Linux-5.12.0-0_fbk7_zion_6511_gd766966f605a-x86_64-with-glibc2.34
Logical CPUs: 192

CUDA device: NVIDIA PG509-210
Total Memory: 84.99 GB

Mean image size: 551x676
[----------------------------------------------------- Image Encoding -----------------------------------------------------]
                                                                                  |  1 images  |  100 images  |  1000 images
1 threads: -----------------------------------------------------------------------------------------------------------------
      CPU (unfused): [torchvision.io.encode_jpeg(img) for img in batch_input]     |   2466.1   |   219121.6   |   2169960.9
      CPU (fused): torchvision.io.encode_jpeg(batch_input)                        |   2627.3   |   221350.3   |   2098801.9
      CUDA:7 (unfused): [torchvision.io.encode_jpeg(img) for img in batch_input]  |    256.8   |    21060.6   |    212853.1
      CUDA:7 (fused): torchvision.io.encode_jpeg(batch_input)                     |    223.8   |    16829.9   |    193673.9
12 threads: ----------------------------------------------------------------------------------------------------------------
      CPU (unfused): [torchvision.io.encode_jpeg(img) for img in batch_input]     |   2512.9   |   216763.1   |   2161373.3
      CPU (fused): torchvision.io.encode_jpeg(batch_input)                        |   2608.3   |   223391.8   |   2152523.7
      CUDA:7 (unfused): [torchvision.io.encode_jpeg(img) for img in batch_input]  |    220.8   |    24009.1   |    245279.8
      CUDA:7 (fused): torchvision.io.encode_jpeg(batch_input)                     |    219.3   |    16971.8   |    175770.5
24 threads: ----------------------------------------------------------------------------------------------------------------
      CPU (unfused): [torchvision.io.encode_jpeg(img) for img in batch_input]     |   2549.4   |   215578.0   |   2195501.8
      CPU (fused): torchvision.io.encode_jpeg(batch_input)                        |   2427.9   |   225574.3   |   2139114.4
      CUDA:7 (unfused): [torchvision.io.encode_jpeg(img) for img in batch_input]  |    219.9   |    21726.3   |    214842.8
      CUDA:7 (fused): torchvision.io.encode_jpeg(batch_input)                     |    212.6   |    17351.6   |    172555.7

Times are in microseconds (us).```

Test Plan:
1. pytest test -vvv
2. ufmt format torchvision
3. flake8 torchvision

Reviewers:

Subscribers:

Tasks:

Tags:

<!-- Before submitting a PR, please make sure to check our contributing guidelines regarding code formatting, tests, and documentation: https://github.com/pytorch/vision/blob/main/CONTRIBUTING.md -->

Summary: I'm adding GPU support to the existing torchvision.io.encode_jpeg function. If the input tensors are on the GPU, the CUDA version will be used and the CPU version otherwise. Additionally, I'm adding a new function torchvision.io.encode_jpegs (plural) with uses a fused kernel and may be faster than successive calls to the singular version which incurs kernel launch overhead for each call. If it's alright, I'll be happy to refactor decode_jpeg to follow this convention in a follow up PR. Test Plan: 1. pytest test -vvv 2. ufmt format torchvision 3. flake8 torchvision Reviewers: Subscribers: Tasks: Tags:

pytorch-bot · 2024-04-23T05:20:13Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/vision/8391

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❌ 7 New Failures

As of commit 21eca4c with merge base f96c42f ():

NEW FAILURES - The following jobs have failed:

Lint / bc (gh)
Process completed with exit code 1.
Tests / unittests-linux (3.12, linux.12xlarge, cpu) / linux-job (gh)
test/test_ops.py::TestRoIAlign::test_autocast_cpu[rois_dtype1-x_dtype1-False-True]
Tests / unittests-macos (3.8, macos-m1-stable) / macos-job (gh)
test/test_ops.py::TestRoIAlign::test_autocast_cpu[rois_dtype1-x_dtype1-False-False]
Tests / unittests-windows (3.11, windows.4xlarge, cpu) / windows-job (gh)
The process 'C:\Program Files\Git\cmd\git.exe' failed with exit code 128
Tests / unittests-windows (3.12, windows.4xlarge, cpu) / windows-job (gh)
The process 'C:\Program Files\Git\cmd\git.exe' failed with exit code 128
Tests / unittests-windows (3.8, windows.4xlarge, cpu) / windows-job (gh)
The process 'C:\Program Files\Git\cmd\git.exe' failed with exit code 128
Tests / unittests-windows (3.9, windows.4xlarge, cpu) / windows-job (gh)
The process 'C:\Program Files\Git\cmd\git.exe' failed with exit code 128

This comment was automatically generated by Dr. CI and updates every 15 minutes.

NicolasHug

Thanks a lot @deekay42 . I made another pass but this looks good!

test/test_image.py

torchvision/io/image.py

torchvision/csrc/io/image/cuda/encode_jpeg_cuda.cpp

ahmadsharif1

Hi @deekay42,

I work on the video decoder in C++ so @NicolasHug thought that my comments may be useful for this PR.

I hope you find my comments useful, and feel free to push back.

I am also curious if you did any benchmarking to see how much speedup we get using hardware decoding or encoding?

torchvision/csrc/io/image/cuda/encode_decode_jpeg_cuda.h

ahmadsharif1 · 2024-05-01T15:47:48Z

torchvision/csrc/io/image/cuda/encode_jpeg_cuda.cpp

+#include <c10/cuda/CUDAGuard.h>
+#include <nvjpeg.h>
+
+nvjpegHandle_t nvjpeg_handle = nullptr;


Nit: perhaps rename this to g_nvjpeg_handle so it is clear this is a global variable?

Same for nvjpeg_handle_creation_flag below.

ahmadsharif1 · 2024-05-01T15:48:54Z

torchvision/csrc/io/image/cuda/encode_jpeg_cuda.cpp

+        "The number of channels should be 3, got: ",
+        image.size(0));
+
+    // nvjpeg requires images to be contiguous


Nit: add a citation link if you can.

ahmadsharif1 · 2024-05-01T15:51:47Z

torchvision/csrc/io/image/cuda/encode_decode_jpeg_cuda.h

+    ImageReadMode mode,
+    torch::Device device);
+
+C10_EXPORT std::vector<torch::Tensor> encode_jpeg_cuda(


Nit: perhaps the name itself should indicate this is a plurality of images, like maybe encode_jpegs_cuda?

ahmadsharif1 · 2024-05-01T15:52:50Z

torchvision/csrc/io/image/cuda/encode_decode_jpeg_cuda.h

+
+C10_EXPORT std::vector<torch::Tensor> encode_jpeg_cuda(
+    const std::vector<torch::Tensor>& images,
+    const int64_t quality);


Nit: add a comment about quality. Is higher better or lower? What is the range/min/max here?

ahmadsharif1 · 2024-05-01T18:32:27Z

torchvision/csrc/io/image/cuda/encode_jpeg_cuda.cpp

+
+  for (int c = 0; c < channels; c++) {
+    target_image.channel[c] = src_image[c].data_ptr<uint8_t>();
+    // this is why we need contiguous tensors


Nit: maybe add a CHECK here to make sure the tensor is contiguous?

ahmadsharif1 · 2024-05-01T18:34:34Z

torchvision/csrc/io/image/cuda/encode_jpeg_cuda.cpp

+  }
+}
+
+torch::Tensor encode_single_jpeg(


Nit: put this in an anonymous namespace since this function is not public?

ahmadsharif1 · 2024-05-01T18:36:41Z

torchvision/csrc/io/image/cuda/encode_jpeg_cuda.cpp

+  }
+}
+
+torch::Tensor encode_single_jpeg(


Nit: this declaration can be omitted entirely if you move the implementation of this function above in an anonymous namespace, right?

ahmadsharif1 · 2024-05-01T18:44:07Z

torchvision/csrc/io/image/cuda/encode_jpeg_cuda.cpp

+      getStreamState);
+
+  // Synchronize the stream to ensure that the encoded image is ready
+  cudaError_t syncState = cudaStreamSynchronize(stream);


I don't know the answer to this question and I am curious if you know -- is there a way to just do a single streamSynchronize per batch instead of per image? That way we can pipeline some work for some extra speedup when handling a batch of images.

ahmadsharif1 · 2024-05-01T18:45:54Z

torchvision/csrc/io/image/cuda/encode_jpeg_cuda.cpp

+  size_t length;
+  nvjpegStatus_t getStreamState = nvjpegEncodeRetrieveBitstreamDevice(
+      nvjpeg_handle, nv_enc_state, NULL, &length, stream);
+  TORCH_CHECK(


Nit: maybe CHECK for the length > 0?

NicolasHug · 2024-05-31T12:33:48Z

torchvision/csrc/io/image/cuda/encode_decode_jpeg_cuda.h

+    const std::vector<torch::Tensor>& images,
+    const int64_t quality);
+
+void nvjpeg_init();


Since we're not exposing this one, should we put it in a different namespace than in vision::image?

NicolasHug · 2024-05-31T12:35:10Z

torchvision/csrc/io/image/cuda/encode_jpeg_cuda.cpp

+
+#else
+
+void nvjpeg_init() {


This probably doesn't matter too much but nvjpeg_init() is declared in encode_decode_jpeg_cuda.h no matter what NVJPEG_FOUND is, but it is only defined here if NVJPEG_FOUND is defined.

ahmadsharif1

lgtm module nits. And sorry I don't understand the comment about not waiting for each image when the code seems to wait for every image.

ahmadsharif1 · 2024-06-06T20:00:43Z

torchvision/csrc/io/image/cuda/encode_jpegs_cuda.cpp

+    // gets destroyed, the CUDA runtime may already be shut down, rendering all
+    // destroy* calls in the encoder destructor invalid. Instead, we use an
+    // atexit hook which executes after main() finishes, but before CUDA shuts
+    // down when the program exits.


What's the guarantee that CUDA shuts down before us?

AFAICt, std::atexit runs these functions in reverse order of when they are called. Is CUDA using atexit() also? If so we need to make sure that is registered before us.

If so, add a comment to that effect.

There is no guarantee. There are a few mentions of using atexit in nvidia forums and stackoverflow https://forums.developer.nvidia.com/t/correct-placement-of-cudadevicereset-for-large-c-application/41104
https://stackoverflow.com/questions/19184865/cuda-context-destruction-at-host-process-termination
but CUDA shutdown in general is kept quite vague. Everything works fine when on my machine which means CUDA is indeed shutting down after atexit handlers are being called, but I'm adding some additional logic for good measure to make sure that if CUDA is already shut down we don't attempt to run cleanup.

ahmadsharif1 · 2024-06-06T20:02:46Z

torchvision/csrc/io/image/cuda/encode_jpegs_cuda.h

+
+  torch::Tensor encode_jpeg(const torch::Tensor& src_image);
+
+  void setQuality(const int64_t);


Add a parameter name similar to encode_jpeg above?

ahmadsharif1 · 2024-06-06T20:03:09Z

torchvision/csrc/io/image/cuda/encode_jpegs_cuda.h

+  CUDAJpegEncoder(const torch::Device& device);
+  ~CUDAJpegEncoder();
+
+  torch::Tensor encode_jpeg(const torch::Tensor& src_image);


The name here has underscores while below is using camelCase. Make them consistent?

ahmadsharif1 · 2024-06-06T20:05:58Z

torchvision/csrc/io/image/cuda/encode_decode_jpegs_cuda.h

@@ -11,5 +12,9 @@ C10_EXPORT torch::Tensor decode_jpeg_cuda(
    ImageReadMode mode,
    torch::Device device);

+C10_EXPORT std::vector<torch::Tensor> encode_jpegs_cuda(


Add a comment here or somewhere for the user to say that it only supports contiguous tensors?

Line 87 in encode_jpegs_cuda.cpp should takes care of handling non-contiguous images.

ahmadsharif1 · 2024-06-06T20:07:56Z

torchvision/csrc/io/image/cuda/encode_jpegs_cuda.cpp

+  // on the current stream of the calling context when this function returns. We
+  // use a blocking event to ensure that this is indeed the case. Crucially, we
+  // do not want to block the host (which is what cudaStreamSynchronize would
+  // do) Events allow us to synchronize the streams without blocking the host


Add periods here for punctuation.

ahmadsharif1 · 2024-06-06T20:15:24Z

torchvision/csrc/io/image/cuda/encode_jpegs_cuda.cpp

+  // on the current stream of the calling context when this function returns. We
+  // use a blocking event to ensure that this is indeed the case. Crucially, we
+  // do not want to block the host (which is what cudaStreamSynchronize would
+  // do) Events allow us to synchronize the streams without blocking the host


I don't understand this comment.

You are saying we are not blocking the host -- yet I do see there is a cudaEventSynchronize() call in encode_jpeg(). So it appears you are pausing the host every iteration of the for loop. Why does the comment say we are not blocking the host?

It's a micro-optimization. At certain points during the execution of the overall operator we have to synchronize because there is simply no other way, but at this particular point we only need to sync with the current stream and not the host itself.

This reverts commit c5810ff.

…nto add_gpu_encode

deekay42 · 2024-06-11T11:50:57Z

lgtm module nits. And sorry I don't understand the comment about not waiting for each image when the code seems to wait for every image.

Thanks for the review!
Sorry, which comment are you referring to, specifically?

NicolasHug

Thanks a ton for the great work @deekay42 !

Reviewed By: vmoens Differential Revision: D60596235 fbshipit-source-id: 0c76dea583ed1cfbc49996651ee0fee57b9e4ae1 Co-authored-by: Nicolas Hug <[email protected]> Co-authored-by: Nicolas Hug <[email protected]>

facebook-github-bot added the cla signed label Apr 23, 2024

deekay42 added 3 commits April 23, 2024 12:06

fix test cases

4cc30cb

fix lints

2db02f0

fix lints2

6acef83

NicolasHug reviewed Apr 29, 2024

View reviewed changes

deekay42 added 2 commits April 29, 2024 15:59

latest round of updates

ae0450d

fix lints

a799c53

NicolasHug requested a review from ahmadsharif1 May 1, 2024 15:13

ahmadsharif1 reviewed May 1, 2024

View reviewed changes

NicolasHug added 2 commits May 31, 2024 05:26

Ignore mypy

c5810ff

Add comment

ff40253

NicolasHug reviewed May 31, 2024

View reviewed changes

NicolasHug and others added 4 commits May 31, 2024 06:25

minor test refactor

0972863

Merge branch 'main' of github.com:pytorch/vision into add_gpu_encode

4ce658d

Merge branch 'pytorch:main' into add_gpu_encode

65372a3

Caching nvjpeg vars across calls

62e072a

deekay42 force-pushed the add_gpu_encode branch from 7aef769 to b3d06cb Compare June 5, 2024 17:06

Update if nvjpeg not found

f190d99

deekay42 force-pushed the add_gpu_encode branch from b3d06cb to f190d99 Compare June 5, 2024 23:15

ahmadsharif1 reviewed Jun 6, 2024

View reviewed changes

NicolasHug and others added 5 commits June 10, 2024 02:46

Merge branch 'main' of github.com:pytorch/vision into add_gpu_encode

b5eaa89

Revert "Ignore mypy"

5051050

This reverts commit c5810ff.

Add comment

136f790

minor changes to address ahmad's comments

0a88d27

Merge branch 'add_gpu_encode' of https://github.com/deekay42/vision i…

df60183

…nto add_gpu_encode

add dtor log messages

f3c8a72

deekay42 force-pushed the add_gpu_encode branch from 075b717 to f3c8a72 Compare June 12, 2024 20:49

Skip CUDA cleanup altogether

117d1f1

NicolasHug approved these changes Jun 13, 2024

View reviewed changes

Merge branch 'main' into add_gpu_encode

21eca4c

NicolasHug merged commit 143d078 into pytorch:main Jun 13, 2024
66 of 73 checks passed

NicolasHug added enhancement module: io labels Jun 13, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Adding GPU acceleration to encode_jpeg #8391

Adding GPU acceleration to encode_jpeg #8391

deekay42 commented Apr 23, 2024 •

edited

Loading

pytorch-bot bot commented Apr 23, 2024 •

edited

Loading

NicolasHug left a comment

ahmadsharif1 left a comment

ahmadsharif1 May 1, 2024

ahmadsharif1 May 1, 2024

ahmadsharif1 May 1, 2024

ahmadsharif1 May 1, 2024

ahmadsharif1 May 1, 2024

ahmadsharif1 May 1, 2024

ahmadsharif1 May 1, 2024

ahmadsharif1 May 1, 2024

ahmadsharif1 May 1, 2024

NicolasHug May 31, 2024

NicolasHug May 31, 2024

ahmadsharif1 left a comment

ahmadsharif1 Jun 6, 2024

deekay42 Jun 11, 2024

ahmadsharif1 Jun 6, 2024

ahmadsharif1 Jun 6, 2024

deekay42 Jun 11, 2024

ahmadsharif1 Jun 6, 2024

deekay42 Jun 11, 2024

ahmadsharif1 Jun 6, 2024

deekay42 Jun 11, 2024

ahmadsharif1 Jun 6, 2024

deekay42 Jun 11, 2024

deekay42 commented Jun 11, 2024

NicolasHug left a comment


		torch::Tensor encode_jpeg(const torch::Tensor& src_image);

		void setQuality(const int64_t);

Adding GPU acceleration to encode_jpeg #8391

Adding GPU acceleration to encode_jpeg #8391

Conversation

deekay42 commented Apr 23, 2024 • edited Loading

pytorch-bot bot commented Apr 23, 2024 • edited Loading

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/vision/8391

❌ 7 New Failures

NicolasHug left a comment

Choose a reason for hiding this comment

ahmadsharif1 left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ahmadsharif1 left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

deekay42 commented Jun 11, 2024

NicolasHug left a comment

Choose a reason for hiding this comment

deekay42 commented Apr 23, 2024 •

edited

Loading

pytorch-bot bot commented Apr 23, 2024 •

edited

Loading