Support for decoding jpegs on GPU with nvjpeg #3792
Conversation
It will be included in the CUDA SDK lib and include paths set up by CUDAExtension.
Thanks for the PR!
I'll have a more thorough look next week. I only had a quick check and left a few comments for discussion. Let me know your thoughts.
Edit: I had a second look. Overall the PR looks great. I flagged two more things to discuss prior to merging. With the exception of the potential memory leak that we need to investigate, all other comments are questions around the API, so there is no need to modify your code.
```cpp
#else

static nvjpegHandle_t nvjpeg_handle = nullptr;
```
Consider adding this in an anonymous namespace, as it looks like an internal detail of the implementation. Also just checking whether this should be released at any point to avoid memory leaks.
> Also just checking whether this should be released at any point to avoid memory leaks
Yeah this is a good point. Creating / allocating it at each call has some severe overhead so it makes sense to declare this as a global variable (related discussion: #2786 (comment)). But this means we never really know when to release it, and the memory will only be freed when the process is killed.
For reference: this is thread-safe
Good that it's thread-safe but it's still unclear to me whether we have to find a way to release it or if we can leave it be. We don't have such an idiom at TorchVision but I wonder if there are examples of resources on PyTorch core that are never released.
@ezyang how do you handle situations like this on core?
We leak a bunch of global statically initialized objects, as it is the easiest way to avoid destructor ordering problems. If nvjpeg is a very simple library, you might be able to write an RAII class for this object and have it destruct properly on shutdown.
Precedent in PyTorch is the cudnn convolution cache, look at that for some inspiration.
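To make the RAII suggestion above concrete, here is a rough Python analogue of the idea (a context-manager sketch with dictionary stand-ins for the real handle; none of this is actual nvjpeg API):

```python
class NvjpegHandleGuard:
    """RAII-style wrapper: acquire on construction, release deterministically."""

    def __init__(self):
        # Stand-in for nvjpegCreateSimple(&handle).
        self.handle = {"open": True}

    def close(self):
        # Stand-in for nvjpegDestroy(handle); safe to call more than once.
        if self.handle is not None:
            self.handle["open"] = False
            self.handle = None

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        self.close()
```

In C++ the same shape would be a class whose constructor calls the create function and whose destructor calls the destroy function, so a static instance is torn down at shutdown instead of leaking.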
@ezyang Thanks for the insights.
@NicolasHug Given Ed's input, I think we can mark this as resolved if you want.
```diff
@@ -21,7 +21,8 @@ static auto registry = torch::RegisterOperators()
     .op("image::encode_jpeg", &encode_jpeg)
     .op("image::read_file", &read_file)
     .op("image::write_file", &write_file)
-    .op("image::decode_image", &decode_image);
+    .op("image::decode_image", &decode_image)
+    .op("image::decode_jpeg_cuda", &decode_jpeg_cuda);
```
Not sure if the dispatcher would make sense here. Since this is the first IO method we add for GPU, it might be worth checking the naming conventions (`_cuda`), as this will be reproduced in the near future in other places. Thoughts @fmassa ?
I think using the dispatcher would be good, but I'm not sure how it handles constructor functions (like `torch.empty` / `torch.rand`).
Indeed, this function always takes CPU tensors, and it's up to a `device` argument to decide whether we should dispatch to the CPU or the CUDA version.
@ezyang do you know if we can use the dispatcher to dispatch taking a `torch.device` into account, knowing that all tensors live on the CPU?
@fmassa How about reading the data on the CPU, since that's needed anyway, and then calling `.to()` to move it to the right device? This can happen on the Python side of things and remain hidden. Then, with the binary data living on the GPU, the dispatcher can be used as normal to decide whether the decoding should happen on the GPU or CPU. Thoughts on this?
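A rough Python sketch of the alternative being discussed, dispatching on a `device` argument rather than on the input tensor's device (all names are hypothetical stand-ins; the encoded bytes stay on the CPU in both branches):

```python
def _decode_jpeg_cpu(data):
    # Stand-in for the libjpeg-based CPU decoder.
    return ("cpu", data)


def _decode_jpeg_cuda(data):
    # Stand-in for the nvjpeg-based decoder; in the real code the input
    # bytes still live on the CPU and only the decoded output is on the GPU.
    return ("cuda", data)


def decode_jpeg(data, device="cpu"):
    # Dispatch on the requested device string, not on where `data` lives.
    if device.split(":")[0] == "cuda":
        return _decode_jpeg_cuda(data)
    return _decode_jpeg_cpu(data)
```

This is essentially what a `device` parameter on `io.image.decode_jpeg` does, minus the real decoders.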
Still, nvjpeg requires the input data to live on the CPU, so we would need to move it back to the CPU again within the function, which would be inefficient. I would have preferred if we could pass the tensor directly as a CUDA tensor as well, but I'm not sure this is possible without further overheads.
Thanks for the clarifications concerning nvjpeg. I think that we can investigate on future PRs how we could do this more elegantly. No need to block this PR.
Looks great to me, thanks!
I've made a few comments, none of which are merge-blocking I think.
```python
img_nvjpeg = f(data, mode=mode, device='cuda')

# Some difference expected between jpeg implementations
tester.assertTrue((img.float() - img_nvjpeg.cpu().float()).abs().mean() < 2)
```
Do we want to consider the `mean` difference or the `max` difference here? What would be the minimum value so that `max` tests pass here?
The max error can be quite high unfortunately; the minimum threshold for all tests to pass seems to be 52, after which some tests start failing.
In `test_decode_jpeg`, we also test for the MAE (with the same threshold=2).
Hmm, it looks suspicious that the decoding gives such large differences. Something to keep an eye on.
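To illustrate why the mean and max criteria diverge so much, here is a small synthetic numpy sketch (random data standing in for two decoder outputs, not actual nvjpeg/libjpeg results): a handful of large localized disagreements dominate the max error while barely moving the mean.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two fake "decoded" images that agree within +/-1 almost everywhere...
a = rng.integers(0, 256, size=(64, 64), dtype=np.int16)
b = a + rng.integers(-1, 2, size=a.shape, dtype=np.int16)

# ...plus a few pixels that differ a lot (e.g. near sharp edges).
b[:3, :3] += 50

diff = np.abs(a - b).astype(np.float64)
mean_err = diff.mean()  # diluted over all pixels, stays small
max_err = diff.max()    # dominated by the worst pixel, stays large
```

With 9 bad pixels out of 4096, `mean_err` stays well under the MAE threshold of 2 while `max_err` is around 50, which matches the behaviour reported above.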
```cpp
nvjpegImage_t out_image;

for (int c = 0; c < num_channels_output; c++) {
  out_image.channel[c] = out_tensor[c].data_ptr<uint8_t>();
```
nit: this is fine for now, but it adds extra overhead, as we need to create a full Tensor just to extract the data pointer (and Tensor construction is heavy). Given that we generally only have 3 channels this shouldn't be much of an issue, but it's still good to keep in mind.
An alternative would be to use the raw data_ptr directly with the correct offsets, like

```cpp
uint8_t* out_tensor_ptr = out_tensor.data_ptr<uint8_t>();
...
out_image.channel[c] = out_tensor_ptr + c * height * width;
```

Also, interesting that nvjpeg accepts decoding images in both `CHW` and `HWC` formats -- I wonder if there are any performance implications to decoding in `CHW`?
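The pointer-offset arithmetic above can be mimicked in numpy: for a planar `CHW` layout, channel `c` starts at offset `c * height * width` into the flat buffer (a toy sketch of the indexing, not nvjpeg code):

```python
import numpy as np

height, width, channels = 2, 3, 3

# Flat uint8 buffer holding a CHW (planar) image; values are just
# 0..17 so the offsets are easy to check.
flat = np.arange(channels * height * width, dtype=np.uint8)

# Equivalent of `out_tensor_ptr + c * height * width` for each channel:
planes = [
    flat[c * height * width : (c + 1) * height * width].reshape(height, width)
    for c in range(channels)
]
```

Each slice is a view into the same buffer, just as each `out_image.channel[c]` pointer aliases into the single output tensor's storage.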
```cpp
TORCH_CHECK(
    create_status == NVJPEG_STATUS_SUCCESS,
    "nvjpegCreateSimple failed: ",
    create_status);
```
I'm wondering if we should clear this if the creation fails; otherwise we might not be able to run this function anymore, as the handle will be invalid?
LGTM!
CI is green(ish) so I'll merge. Thanks everyone for the reviews and especially @jamt9000 for the initial work!
```cpp
if (create_status != NVJPEG_STATUS_SUCCESS) {
  // Reset handle so that one can still call the function again in the
  // same process if there was a failure
  free(nvjpeg_handle);
```
Since it's an opaque handle, I think the use of `free()` may not be correct unless it's documented as being supported.
(I would hope that it simply leaves the handle as null if initialisation fails, although I don't see that in the docs; here it just re-inits the handle without any freeing when the hw backend fails, though.)
Hm, that's a good point. What would you recommend instead of `free`?
Before pushing this I did a quick test by inserting

```cpp
nvjpegStatus_t create_status = nvjpegCreateSimple(&nvjpeg_handle);
create_status = NVJPEG_STATUS_NOT_INITIALIZED; // <- this
```

and all the tests failed gracefully with `E RuntimeError: nvjpegCreateSimple failed: 1`. Since I was running the tests with `pytest test/test_image.py -k cuda`, they were all in the same process and pytest was just catching the `RuntimeError`s, so I assumed it was OK.
Perhaps `nvjpegDestroy(nvjpeg_handle)`? See: https://github.com/NVIDIA/CUDALibrarySamples/blob/master/nvJPEG/Image-Resize/imageResize.cpp#L426
I didn't use it because I was wondering whether `nvjpegDestroy` would work properly with a bad handle.
I guess if you assume anything can happen when initialisation fails, then the handle might end up being an arbitrary value like 0xDEADBEEF, and all you can do is reset it to null.
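The retry behaviour being discussed can be sketched in Python: if creation fails, reset the cached handle to null/None so a later call in the same process can try again (toy stand-ins, not the nvjpeg API):

```python
_handle = None


def _create(fail=False):
    # Stand-in for nvjpegCreateSimple; `fail` simulates a bad status code.
    if fail:
        raise RuntimeError("nvjpegCreateSimple failed: 1")
    return object()


def get_handle(fail=False):
    """Lazily create the global handle; on failure, reset it so a
    subsequent call can retry instead of reusing an invalid handle."""
    global _handle
    if _handle is None:
        try:
            _handle = _create(fail=fail)
        except RuntimeError:
            _handle = None  # reset so the next call can retry
            raise
    return _handle
```

This mirrors the intent of the C++ snippet above: surface the error, but leave the global in a state where the function remains usable.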
Summary: Co-authored-by: James Thewlis <[email protected]> Reviewed By: datumbox Differential Revision: D28473331 fbshipit-source-id: d82d415e81876b660e599997860c737848d9afc0
Closes #2786
Closes #2742
This is based on @jamt9000's great initial work in #2786. I mostly made some minor clean ups and added some tests.
In terms of usage, the currently supported API is to add a `device` parameter to `io.image.decode_jpeg`. For benchmarks, see #2786 (comment). Overall this seems to offer a 2-3x speedup over CPU decoding (without libjpeg-turbo).
Note: We use `nvjpegCreateSimple()`, which is only available in CUDA >= 10.1. So even though nvjpeg exists for 10.0, this won't compile. I assume this is OK since our CI only tests 10.1 and upwards anyway.

While we can move forward with this simple basic version, there seems to be room for improvement. In particular: