
Adding GPU acceleration to encode_jpeg #8391

Merged: 21 commits, Jun 13, 2024

Conversation

@deekay42 (Contributor) commented on Apr 23, 2024

Summary:
I'm adding GPU support to the existing torchvision.io.encode_jpeg function: if the input tensors are on a GPU, the CUDA version is used; otherwise the CPU version is used. Additionally, I'm adding a new function torchvision.io.encode_jpegs (plural) which uses a fused kernel and may be faster than successive calls to the singular version, each of which incurs kernel launch overhead. If it's alright, I'd be happy to refactor decode_jpeg to follow this convention in a follow-up PR.
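For illustration, a minimal sketch of the device-based dispatch described above; the helper names here are hypothetical, not the PR's actual symbols:

```cpp
#include <torch/torch.h>

// Hypothetical backend entry points (illustrative only):
torch::Tensor encode_jpeg_cuda_impl(const torch::Tensor& image, int64_t quality);
torch::Tensor encode_jpeg_cpu_impl(const torch::Tensor& image, int64_t quality);

// Route CUDA tensors to the nvJPEG path and everything else to the CPU path.
torch::Tensor encode_jpeg_dispatch(const torch::Tensor& image, int64_t quality) {
  return image.device().is_cuda()
      ? encode_jpeg_cuda_impl(image, quality)
      : encode_jpeg_cpu_impl(image, quality);
}
```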

Performance numbers indicate over 5000 imgs/s on 1 A100 GPU:

Processor: x86_64
Platform: Linux-5.12.0-0_fbk7_zion_6511_gd766966f605a-x86_64-with-glibc2.34
Logical CPUs: 192

CUDA device: NVIDIA PG509-210
Total Memory: 84.99 GB

Mean image size: 551x676
```
[----------------------------------------------------- Image Encoding -----------------------------------------------------]
                                                                                  |  1 images  |  100 images  |  1000 images
1 threads: -----------------------------------------------------------------------------------------------------------------
      CPU (unfused): [torchvision.io.encode_jpeg(img) for img in batch_input]     |   2466.1   |   219121.6   |   2169960.9
      CPU (fused): torchvision.io.encode_jpeg(batch_input)                        |   2627.3   |   221350.3   |   2098801.9
      CUDA:7 (unfused): [torchvision.io.encode_jpeg(img) for img in batch_input]  |    256.8   |    21060.6   |    212853.1
      CUDA:7 (fused): torchvision.io.encode_jpeg(batch_input)                     |    223.8   |    16829.9   |    193673.9
12 threads: ----------------------------------------------------------------------------------------------------------------
      CPU (unfused): [torchvision.io.encode_jpeg(img) for img in batch_input]     |   2512.9   |   216763.1   |   2161373.3
      CPU (fused): torchvision.io.encode_jpeg(batch_input)                        |   2608.3   |   223391.8   |   2152523.7
      CUDA:7 (unfused): [torchvision.io.encode_jpeg(img) for img in batch_input]  |    220.8   |    24009.1   |    245279.8
      CUDA:7 (fused): torchvision.io.encode_jpeg(batch_input)                     |    219.3   |    16971.8   |    175770.5
24 threads: ----------------------------------------------------------------------------------------------------------------
      CPU (unfused): [torchvision.io.encode_jpeg(img) for img in batch_input]     |   2549.4   |   215578.0   |   2195501.8
      CPU (fused): torchvision.io.encode_jpeg(batch_input)                        |   2427.9   |   225574.3   |   2139114.4
      CUDA:7 (unfused): [torchvision.io.encode_jpeg(img) for img in batch_input]  |    219.9   |    21726.3   |    214842.8
      CUDA:7 (fused): torchvision.io.encode_jpeg(batch_input)                     |    212.6   |    17351.6   |    172555.7

Times are in microseconds (us).
```

Test Plan:
1. pytest test -vvv
2. ufmt format torchvision
3. flake8 torchvision


pytorch-bot bot commented Apr 23, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/vision/8391

Note: Links to docs will display an error until the docs builds have been completed.

❌ 7 New Failures

As of commit 21eca4c with merge base f96c42f:


This comment was automatically generated by Dr. CI and updates every 15 minutes.

@NicolasHug (Member) left a comment:

Thanks a lot @deekay42. I made another pass, but this looks good!

Resolved review threads: test/test_image.py (×5), torchvision/io/image.py (×3), torchvision/csrc/io/image/cuda/encode_jpeg_cuda.cpp (×2).
@NicolasHug requested a review from ahmadsharif1 on May 1, 2024.
@ahmadsharif1 (Contributor) left a comment:

Hi @deekay42,

I work on the video decoder in C++, so @NicolasHug thought that my comments may be useful for this PR.

I hope you find my comments useful, and feel free to push back.

I am also curious whether you did any benchmarking to see how much speedup we get from hardware decoding or encoding.

Resolved review thread: torchvision/csrc/io/image/cuda/encode_decode_jpeg_cuda.h.
```
#include <c10/cuda/CUDAGuard.h>
#include <nvjpeg.h>

nvjpegHandle_t nvjpeg_handle = nullptr;
```
Contributor:

Nit: perhaps rename this to g_nvjpeg_handle so it is clear this is a global variable?

Same for nvjpeg_handle_creation_flag below.

"The number of channels should be 3, got: ",
image.size(0));

// nvjpeg requires images to be contiguous
Contributor:

Nit: add a citation link if you can.

```
    ImageReadMode mode,
    torch::Device device);

C10_EXPORT std::vector<torch::Tensor> encode_jpeg_cuda(
```
Contributor:

Nit: perhaps the name itself should indicate this is a plurality of images, like maybe encode_jpegs_cuda?


```
C10_EXPORT std::vector<torch::Tensor> encode_jpeg_cuda(
    const std::vector<torch::Tensor>& images,
    const int64_t quality);
```
Contributor:

Nit: add a comment about quality. Is higher better or lower? What is the range/min/max here?
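For reference, torchvision's existing encode_jpeg takes quality as an integer in [1, 100], where higher means better visual quality and larger output; a sketch of the requested comment:

```cpp
C10_EXPORT std::vector<torch::Tensor> encode_jpeg_cuda(
    const std::vector<torch::Tensor>& images,
    // JPEG quality in [1, 100]; higher values yield better visual quality
    // and larger encoded output, matching the existing CPU encoder.
    const int64_t quality);
```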


```
for (int c = 0; c < channels; c++) {
  target_image.channel[c] = src_image[c].data_ptr<uint8_t>();
  // this is why we need contiguous tensors
```
Contributor:

Nit: maybe add a CHECK here to make sure the tensor is contiguous?
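A sketch of the suggested assertion, assuming it sits just before the channel loop; TORCH_CHECK raises a c10::Error with the given message when the condition is false:

```cpp
TORCH_CHECK(
    src_image.is_contiguous(),
    "nvjpeg requires contiguous input tensors, got a non-contiguous tensor");
```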

```
}
}

torch::Tensor encode_single_jpeg(
```
Contributor:

Nit: put this in an anonymous namespace since this function is not public?

```
}
}

torch::Tensor encode_single_jpeg(
```
Contributor:

Nit: this declaration can be omitted entirely if you move the implementation of this function above in an anonymous namespace, right?
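A sketch of the suggested restructuring, assuming the existing body can simply move above its first caller:

```cpp
namespace {

// Internal linkage: visible only in this translation unit, so the separate
// forward declaration becomes unnecessary.
torch::Tensor encode_single_jpeg(const torch::Tensor& src_image) {
  // ... existing encoding logic moves here unchanged ...
  return src_image;  // placeholder standing in for the real implementation
}

} // namespace
```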

```
    getStreamState);

// Synchronize the stream to ensure that the encoded image is ready
cudaError_t syncState = cudaStreamSynchronize(stream);
```
Contributor:

I don't know the answer to this question and I am curious if you know -- is there a way to just do a single streamSynchronize per batch instead of per image? That way we can pipeline some work for some extra speedup when handling a batch of images.
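For concreteness, the batched pattern being asked about might look like this; enqueue_encode and batch are hypothetical stand-ins, not the PR's API:

```cpp
cudaStream_t stream;
cudaStreamCreate(&stream);

// Launch all per-image encode work asynchronously on one stream...
for (const auto& img : batch) {
  enqueue_encode(img, stream);  // hypothetical: queues work, does not block
}

// ...then pay the host-side synchronization cost once per batch.
cudaError_t err = cudaStreamSynchronize(stream);
TORCH_CHECK(err == cudaSuccess, cudaGetErrorString(err));
cudaStreamDestroy(stream);
```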

```
size_t length;
nvjpegStatus_t getStreamState = nvjpegEncodeRetrieveBitstreamDevice(
    nvjpeg_handle, nv_enc_state, NULL, &length, stream);
TORCH_CHECK(
```
Contributor:

Nit: maybe add a CHECK that length > 0?

```
    const std::vector<torch::Tensor>& images,
    const int64_t quality);

void nvjpeg_init();
```
Member:

Since we're not exposing this one, should we put it in a different namespace than vision::image?


```
#else

void nvjpeg_init() {
```
Member:

This probably doesn't matter too much, but nvjpeg_init() is declared in encode_decode_jpeg_cuda.h regardless of NVJPEG_FOUND, while it is only defined here when NVJPEG_FOUND is defined.
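One way to keep the two consistent would be a stub in the !NVJPEG_FOUND branch that fails loudly; this is a sketch, not necessarily what the PR does:

```cpp
#if NVJPEG_FOUND
void nvjpeg_init() {
  // ... real initialization of the global nvjpegHandle_t ...
}
#else
void nvjpeg_init() {
  TORCH_CHECK(
      false, "encode_jpeg on CUDA requires torchvision built with nvJPEG");
}
#endif
```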

@ahmadsharif1 (Contributor) left a comment:

LGTM modulo nits. And sorry, I don't understand the comment about not waiting for each image when the code seems to wait for every image.

```
// gets destroyed, the CUDA runtime may already be shut down, rendering all
// destroy* calls in the encoder destructor invalid. Instead, we use an
// atexit hook which executes after main() finishes, but before CUDA shuts
// down when the program exits.
```
Contributor:

What's the guarantee that CUDA shuts down before us?

AFAICT, std::atexit runs these functions in reverse order of registration. Is CUDA using atexit() also? If so, we need to make sure that it is registered before ours.

If so, add a comment to that effect.

Contributor Author:

There is no guarantee. There are a few mentions of using atexit on the NVIDIA forums and Stack Overflow (https://forums.developer.nvidia.com/t/correct-placement-of-cudadevicereset-for-large-c-application/41104, https://stackoverflow.com/questions/19184865/cuda-context-destruction-at-host-process-termination), but CUDA shutdown in general is kept quite vague. Everything works fine on my machine, which means CUDA is indeed shutting down after the atexit handlers are called, but I'm adding some additional logic for good measure to make sure that we don't attempt to run cleanup if CUDA has already shut down.
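A minimal sketch of such a guard; destroy_nvjpeg_resources is a hypothetical name for the cleanup routine:

```cpp
#include <cstdlib>
#include <cuda_runtime.h>

void destroy_nvjpeg_resources();  // hypothetical cleanup routine

void cleanup_at_exit() {
  // A failing probe (e.g. cudaErrorCudartUnloading) signals the runtime is
  // already tearing down, in which case the destroy* calls would be invalid.
  if (cudaDeviceSynchronize() != cudaSuccess) {
    return;
  }
  destroy_nvjpeg_resources();
}

// Registered once, e.g. when the encoder singleton is first created:
// std::atexit(cleanup_at_exit);
```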


```
torch::Tensor encode_jpeg(const torch::Tensor& src_image);

void setQuality(const int64_t);
```
Contributor:

Add a parameter name similar to encode_jpeg above?

```
CUDAJpegEncoder(const torch::Device& device);
~CUDAJpegEncoder();

torch::Tensor encode_jpeg(const torch::Tensor& src_image);
```
Contributor:

The name here uses underscores while the one below uses camelCase. Make them consistent?

Contributor Author:

yup

```
@@ -11,5 +12,9 @@ C10_EXPORT torch::Tensor decode_jpeg_cuda(
    ImageReadMode mode,
    torch::Device device);

C10_EXPORT std::vector<torch::Tensor> encode_jpegs_cuda(
```
Contributor:

Add a comment here or somewhere for the user to say that it only supports contiguous tensors?

Contributor Author:

Line 87 in encode_jpegs_cuda.cpp should take care of handling non-contiguous images.
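That handling presumably relies on Tensor::contiguous(), which returns the tensor itself when it is already contiguous and a contiguous copy otherwise:

```cpp
// Callers may pass either layout; normalize before handing off to nvjpeg.
torch::Tensor contig_image = image.contiguous();
```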

```
// on the current stream of the calling context when this function returns. We
// use a blocking event to ensure that this is indeed the case. Crucially, we
// do not want to block the host (which is what cudaStreamSynchronize would
// do) Events allow us to synchronize the streams without blocking the host
```
Contributor:

Add periods here for punctuation.

Contributor Author:

.

```
// on the current stream of the calling context when this function returns. We
// use a blocking event to ensure that this is indeed the case. Crucially, we
// do not want to block the host (which is what cudaStreamSynchronize would
// do) Events allow us to synchronize the streams without blocking the host
```
Contributor:

I don't understand this comment.

You are saying we are not blocking the host -- yet I do see there is a cudaEventSynchronize() call in encode_jpeg(). So it appears you are pausing the host every iteration of the for loop. Why does the comment say we are not blocking the host?

Contributor Author:

It's a micro-optimization. At certain points during the execution of the overall operator we have to synchronize because there is simply no other way, but at this particular point we only need to sync with the current stream and not the host itself.
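For context, a GPU-side stream-to-stream synchronization without a host wait typically looks like this; the stream variable names are assumed:

```cpp
cudaEvent_t done;
cudaEventCreateWithFlags(&done, cudaEventDisableTiming);

// Mark the point on the encoder's stream that the caller must not run past...
cudaEventRecord(done, encoder_stream);
// ...and make the caller's stream wait for it. This queues a GPU-side
// dependency; the host thread continues immediately.
cudaStreamWaitEvent(caller_stream, done, 0);

cudaEventDestroy(done);  // safe: the queued wait retains its reference
```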

@deekay42 (Contributor Author):

> LGTM modulo nits. And sorry, I don't understand the comment about not waiting for each image when the code seems to wait for every image.

Thanks for the review!
Sorry, which comment are you referring to, specifically?

@NicolasHug (Member) left a comment:

Thanks a ton for the great work @deekay42!

@NicolasHug merged commit 143d078 into pytorch:main on Jun 13, 2024.
66 of 73 checks passed
facebook-github-bot pushed a commit that referenced this pull request on Aug 2, 2024:
Reviewed By: vmoens

Differential Revision: D60596235

fbshipit-source-id: 0c76dea583ed1cfbc49996651ee0fee57b9e4ae1

Co-authored-by: Nicolas Hug <[email protected]>