[CUDA] StableDiffusion XL demo with CUDA EP #17997
Merged
+379
−196
Conversation
onnxruntime/python/tools/transformers/models/stable_diffusion/pipeline_stable_diffusion.py
onnxruntime/python/tools/transformers/models/stable_diffusion/ort_utils.py
kunal-vaishnavi approved these changes on Oct 18, 2023
jchen351 pushed a commit that referenced this pull request on Oct 18, 2023
tianleiwu added a commit that referenced this pull request on Oct 31, 2023
tianleiwu removed the triage:approved (Approved for cherrypicks for release) and release:1.16.2 labels on Nov 1, 2023
kleiti pushed a commit to kleiti/onnxruntime that referenced this pull request on Mar 22, 2024
Description
Add CUDA EP to the StableDiffusion XL demo, including:
(1) Add fp16 VAE support for the CUDA EP.
(2) Configure each model separately (for example, some models can run with CUDA graph while others cannot); a sketch of such a configuration follows this list.
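A minimal sketch of what per-model configuration could look like (the names and values here are illustrative, not the demo's actual API):

```python
# Hypothetical per-model configuration for the SDXL pipeline with the CUDA EP.
# Each model gets its own flags because, e.g., the VAE can run with CUDA graph
# while Clip2 and UNet currently cannot (parts of them fall back to CPU).
MODEL_CONFIG = {
    "clip":   {"fp16": True, "use_cuda_graph": True},
    "clip2":  {"fp16": True, "use_cuda_graph": False},  # partly partitioned to CPU
    "unetxl": {"fp16": True, "use_cuda_graph": False},  # partly partitioned to CPU
    "vae":    {"fp16": True, "use_cuda_graph": True},   # fp16 VAE added in this PR
}
```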
Some remaining work will boost performance further later:
(1) Enable CUDA Graph for Clip2 and UNet. Currently, part of the graph is partitioned to the CPU, which blocks CUDA graph capture; see the session-level sketch after this list.
(2) Update the GroupNorm CUDA kernel for the refiner. Currently, the CUDA kernel only supports a limited number of channels used by the refiner, so we should see some gain there once the limitation is removed.
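For context, here is a minimal sketch of enabling CUDA graph for a single ONNX Runtime session via the CUDA EP's enable_cuda_graph provider option. The model path, input/output names, and shapes below are placeholders; CUDA graph requires binding inputs and outputs to fixed GPU buffers with IOBinding so that replayed graphs see stable addresses.

```python
import numpy as np
import onnxruntime as ort

# Placeholder model: an SDXL VAE decoder taking a (1, 4, 128, 128) fp16 latent
# and producing a (1, 3, 1024, 1024) fp16 image.
providers = [
    ("CUDAExecutionProvider", {"enable_cuda_graph": "1"}),
    "CPUExecutionProvider",
]
session = ort.InferenceSession("vae_decoder.onnx", providers=providers)

# Preallocate fixed GPU buffers; CUDA graph replay requires stable addresses.
latent = ort.OrtValue.ortvalue_from_numpy(
    np.zeros((1, 4, 128, 128), dtype=np.float16), "cuda", 0)
image = ort.OrtValue.ortvalue_from_numpy(
    np.zeros((1, 3, 1024, 1024), dtype=np.float16), "cuda", 0)

io = session.io_binding()
io.bind_ortvalue_input("latent", latent)   # placeholder input name
io.bind_ortvalue_output("images", image)   # placeholder output name

session.run_with_iobinding(io)  # first run captures the CUDA graph
session.run_with_iobinding(io)  # subsequent runs replay the captured graph
```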
Some extra work that is nice to have (thus lower priority):
(3) Support denoising_end to ensemble base and refiner.
(4) Support classifier-free guidance (the idea is from https://www.baseten.co/blog/sdxl-inference-in-under-2-seconds-the-ultimate-guide-to-stable-diffusion-optimiza/). An illustrative sketch of both items follows this list.
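An illustrative sketch of both items, using hypothetical names rather than the demo's code: denoising_end splits the timestep schedule between base and refiner, and classifier-free guidance combines conditional and unconditional noise predictions.

```python
import torch

num_steps = 40        # hypothetical total step count
denoising_end = 0.8   # base handles the first 80% of the schedule
split = int(num_steps * denoising_end)
base_steps = range(0, split)             # run the base UNet on these steps
refiner_steps = range(split, num_steps)  # the refiner finishes denoising

def guided_noise(unet, latents, t, text_emb, uncond_emb, scale=7.5):
    """Classifier-free guidance: eps = eps_uncond + scale * (eps_text - eps_uncond)."""
    # One batched UNet call evaluates the unconditional and conditional branches.
    eps_uncond, eps_text = unet(
        torch.cat([latents, latents]), t, torch.cat([uncond_emb, text_emb])
    ).chunk(2)
    return eps_uncond + scale * (eps_text - eps_uncond)
```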
Performance on A100-SXM4-80GB
Example commands to test an engine built with static shape or dynamic shape:
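```
engine_name=ORT_CUDA
python demo_txt2img_xl.py --engine $engine_name "some prompt"
python demo_txt2img_xl.py --engine $engine_name --disable-cuda-graph --build-dynamic-batch --build-dynamic-shape "some prompt"
```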
An engine built with dynamic shape supports a range of batch sizes (1 to 4 for TRT; 1 to 16 for CUDA) and image sizes (256x256 to 1024x1024). An engine built with static shape only supports a fixed batch size (1) and image size (1024x1024).
The latency (ms) of generating an image of size 1024x1024 (sorted by total latency):

Engine | Base (30 Steps)* | Refiner (9 Steps) | Total Latency (ms)
-- | -- | -- | --
ORT_TRT (static shape) | 2467 | 1033 | 3501
TRT (static shape) | 2507 | 1048 | 3555
ORT_CUDA (static shape) | 2630 | 1015 | 3645
ORT_CUDA (dynamic shape) | 2639 | 1016 | 3654
TRT (dynamic shape) | 2777 | 1099 | 3876
ORT_TRT (dynamic shape) | 2890 | 1166 | 4057
* The VAE decoder is not used in the base stage, since the base outputs a latent that the refiner consumes to produce the final image.
We can see that ORT_CUDA is the fastest with dynamic shape, while slower with static shape (the cause is that Clip2 and UNet cannot run with CUDA Graph right now; we will address this later).
Motivation and Context
Follow-up of #17536.