[CUDA] StableDiffusion XL demo with CUDA EP #17997
Merged
+379
−196
Conversation
onnxruntime/python/tools/transformers/models/stable_diffusion/pipeline_stable_diffusion.py
onnxruntime/python/tools/transformers/models/stable_diffusion/ort_utils.py
kunal-vaishnavi approved these changes on Oct 18, 2023
jchen351 pushed a commit that referenced this pull request on Oct 18, 2023
tianleiwu added a commit that referenced this pull request on Oct 31, 2023
tianleiwu removed the triage:approved (Approved for cherrypicks for release) and release:1.16.2 labels on Nov 1, 2023
kleiti pushed a commit to kleiti/onnxruntime that referenced this pull request on Mar 22, 2024
Description
Add CUDA EP to the StableDiffusion XL demo, including:
(1) Add fp16 VAE support for the CUDA EP.
(2) Configure each model separately (for example, some models can run with CUDA graph while others cannot); a sketch of such a configuration follows this list.
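A minimal sketch of what per-model configuration could look like (the names and values here are illustrative, not the demo's actual API):

```python
# Hypothetical per-model configuration for the SDXL pipeline with the CUDA EP.
# Each model gets its own flags because, e.g., the VAE can run with CUDA graph
# while Clip2 and UNet currently cannot (parts of them fall back to CPU).
MODEL_CONFIG = {
    "clip":   {"fp16": True, "use_cuda_graph": True},
    "clip2":  {"fp16": True, "use_cuda_graph": False},  # partly partitioned to CPU
    "unetxl": {"fp16": True, "use_cuda_graph": False},  # partly partitioned to CPU
    "vae":    {"fp16": True, "use_cuda_graph": True},   # fp16 VAE added in this PR
}
```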
Some remaining work will boost performance further later:
(1) Enable CUDA Graph for Clip2 and UNet. Currently, part of the graph is partitioned to the CPU, which blocks CUDA graph capture; see the session-level sketch after this list.
(2) Update the GroupNorm CUDA kernel for the refiner. Currently, the CUDA kernel only supports a limited number of channels used by the refiner, so we should see some gain there once the limitation is removed.
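For context, here is a minimal sketch of enabling CUDA graph for a single ONNX Runtime session via the CUDA EP's enable_cuda_graph provider option. The model path, input/output names, and shapes below are placeholders; CUDA graph requires binding inputs and outputs to fixed GPU buffers with IOBinding so that replayed graphs see stable addresses.

```python
import numpy as np
import onnxruntime as ort

# Placeholder model: an SDXL VAE decoder taking a (1, 4, 128, 128) fp16 latent
# and producing a (1, 3, 1024, 1024) fp16 image.
providers = [
    ("CUDAExecutionProvider", {"enable_cuda_graph": "1"}),
    "CPUExecutionProvider",
]
session = ort.InferenceSession("vae_decoder.onnx", providers=providers)

# Preallocate fixed GPU buffers; CUDA graph replay requires stable addresses.
latent = ort.OrtValue.ortvalue_from_numpy(
    np.zeros((1, 4, 128, 128), dtype=np.float16), "cuda", 0)
image = ort.OrtValue.ortvalue_from_numpy(
    np.zeros((1, 3, 1024, 1024), dtype=np.float16), "cuda", 0)

io = session.io_binding()
io.bind_ortvalue_input("latent", latent)   # placeholder input name
io.bind_ortvalue_output("images", image)   # placeholder output name

session.run_with_iobinding(io)  # first run captures the CUDA graph
session.run_with_iobinding(io)  # subsequent runs replay the captured graph
```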
Some extra work that is nice to have (thus lower priority):
(3) Support denoising_end to ensemble base and refiner.
(4) Support classifier-free guidance (the idea is from https://www.baseten.co/blog/sdxl-inference-in-under-2-seconds-the-ultimate-guide-to-stable-diffusion-optimiza/). An illustrative sketch of both items follows this list.
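An illustrative sketch of both items, using hypothetical names rather than the demo's code: denoising_end splits the timestep schedule between base and refiner, and classifier-free guidance combines conditional and unconditional noise predictions.

```python
import torch

num_steps = 40        # hypothetical total step count
denoising_end = 0.8   # base handles the first 80% of the schedule
split = int(num_steps * denoising_end)
base_steps = range(0, split)             # run the base UNet on these steps
refiner_steps = range(split, num_steps)  # the refiner finishes denoising

def guided_noise(unet, latents, t, text_emb, uncond_emb, scale=7.5):
    """Classifier-free guidance: eps = eps_uncond + scale * (eps_text - eps_uncond)."""
    # One batched UNet call evaluates the unconditional and conditional branches.
    eps_uncond, eps_text = unet(
        torch.cat([latents, latents]), t, torch.cat([uncond_emb, text_emb])
    ).chunk(2)
    return eps_uncond + scale * (eps_text - eps_uncond)
```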
Performance on A100-SXM4-80GB
Example commands to test an engine built with static shape or dynamic shape:
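```
engine_name=ORT_CUDA
python demo_txt2img_xl.py --engine $engine_name "some prompt"
python demo_txt2img_xl.py --engine $engine_name --disable-cuda-graph --build-dynamic-batch --build-dynamic-shape "some prompt"
```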
An engine built with dynamic shape supports a range of batch sizes (1 to 4 for TRT; 1 to 16 for CUDA) and image sizes (256x256 to 1024x1024). An engine built with static shape only supports a fixed batch size (1) and image size (1024x1024).
The latency (ms) of generating an image of size 1024x1024 (sorted by total latency):

Engine | Base (30 Steps)* | Refiner (9 Steps) | Total Latency (ms)
-- | -- | -- | --
ORT_TRT (static shape) | 2467 | 1033 | 3501
TRT (static shape) | 2507 | 1048 | 3555
ORT_CUDA (static shape) | 2630 | 1015 | 3645
ORT_CUDA (dynamic shape) | 2639 | 1016 | 3654
TRT (dynamic shape) | 2777 | 1099 | 3876
ORT_TRT (dynamic shape) | 2890 | 1166 | 4057
* The VAE decoder is not used in the base stage, since the base outputs a latent that the refiner consumes to produce the final image.
We can see that ORT_CUDA is the fastest with dynamic shape, while slower with static shape (the cause is that Clip2 and UNet cannot run with CUDA Graph right now; we will address this later).
Motivation and Context
Follow-up of #17536.