[CUDA] StableDiffusion XL demo with CUDA EP (#17997)
Add CUDA EP to the StableDiffusion XL demo, including:
(1) fp16 VAE support for CUDA EP.
(2) Separate configuration for each model (for example, some models can run with CUDA graph but some cannot).

Remaining work that will boost performance further:
(1) Enable CUDA Graph for Clip2 and UNet. Currently, part of the graph is partitioned to CPU, which blocks CUDA graph.
(2) Update the GroupNorm CUDA kernel for the refiner. Currently, the CUDA kernel supports only a limited number of channels in the refiner, so we should see some gain once that limitation is removed.

Extra work that is nice to have (thus lower priority):
(3) Support denoising_end to ensemble base and refiner.
(4) Support classifier free guidance (the idea is from https://www.baseten.co/blog/sdxl-inference-in-under-2-seconds-the-ultimate-guide-to-stable-diffusion-optimiza/).

#### Performance on A100-SXM4-80GB

Example commands to test an engine built with static shape or dynamic shape:

```
engine_name=ORT_CUDA
python demo_txt2img_xl.py --engine $engine_name "some prompt"
python demo_txt2img_xl.py --engine $engine_name --disable-cuda-graph --build-dynamic-batch --build-dynamic-shape "some prompt"
```

An engine built with dynamic shape supports different batch sizes (1 to 4 for TRT; 1 to 16 for CUDA) and image sizes (256x256 to 1024x1024).
An engine built with static shape supports only a fixed batch size (1) and image size (1024x1024).

The latency (ms) of generating an image of size 1024x1024 (sorted by total latency):

Engine | Base (30 Steps)* | Refiner (9 Steps) | Total Latency (ms)
-- | -- | -- | --
ORT_TRT (static shape) | 2467 | 1033 | 3501
TRT (static shape) | 2507 | 1048 | 3555
ORT_CUDA (static shape) | 2630 | 1015 | 3645
ORT_CUDA (dynamic shape) | 2639 | 1016 | 3654
TRT (dynamic shape) | 2777 | 1099 | 3876
ORT_TRT (dynamic shape) | 2890 | 1166 | 4057

\* The VAE decoder is not used in Base, since the output of the base model is a latent that the refiner consumes to produce the image.
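To make the comparison concrete, here is a minimal sketch that recomputes relative differences in total latency from the numbers reported for the 1024x1024 case. The dictionary and helper names are illustrative only and are not part of the demo.

```python
# Total latency (ms) for a 1024x1024 image, copied from the reported results.
TOTAL_LATENCY_MS = {
    ("ORT_TRT", "static"): 3501,
    ("TRT", "static"): 3555,
    ("ORT_CUDA", "static"): 3645,
    ("ORT_CUDA", "dynamic"): 3654,
    ("TRT", "dynamic"): 3876,
    ("ORT_TRT", "dynamic"): 4057,
}


def fastest(shape):
    """Return (engine, latency_ms) with the lowest total latency for a shape mode."""
    candidates = {e: ms for (e, s), ms in TOTAL_LATENCY_MS.items() if s == shape}
    engine = min(candidates, key=candidates.get)
    return engine, candidates[engine]


def slowdown_vs_fastest(engine, shape):
    """Relative extra latency of `engine` versus the fastest engine for `shape`."""
    _, best = fastest(shape)
    return TOTAL_LATENCY_MS[(engine, shape)] / best - 1.0


if __name__ == "__main__":
    for shape in ("static", "dynamic"):
        name, ms = fastest(shape)
        print(f"{shape}: fastest is {name} at {ms} ms")
        for (engine, s) in sorted(TOTAL_LATENCY_MS):
            if s == shape:
                print(f"  {engine}: +{slowdown_vs_fastest(engine, shape):.1%}")
```

These are single reported totals, so the percentages are only a rough guide to the gap between engines.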
We can see that ORT_CUDA is faster with dynamic shape but slower with static shape. The cause is that Clip2 and UNet cannot run with CUDA Graph right now; we will address the issue later.

### Motivation and Context

Follow up of #17536
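As a rough illustration of the per-model configuration described in the summary, the sketch below builds an ONNX Runtime provider list in which CUDA graph is switched on only for models assumed to support it. The model names, the `CUDA_GRAPH_SUPPORTED` table, and the `cuda_providers` helper are hypothetical and are not taken from the demo code; `enable_cuda_graph` and `device_id` are CUDA execution provider options in ONNX Runtime.

```python
# Hypothetical per-model settings (an assumption for illustration only).
# The commit message states that Clip2 and UNet cannot use CUDA graph yet;
# treating the other models as graph-capable is an inference, not a fact.
CUDA_GRAPH_SUPPORTED = {
    "clip": True,
    "clip2": False,   # part of the graph is partitioned to CPU
    "unetxl": False,  # same limitation
    "vae": True,
}


def cuda_providers(model_name, device_id=0, allow_cuda_graph=True):
    """Build a providers list for one model, enabling CUDA graph only if allowed."""
    use_graph = allow_cuda_graph and CUDA_GRAPH_SUPPORTED.get(model_name, False)
    cuda_options = {
        "device_id": str(device_id),
        "enable_cuda_graph": "1" if use_graph else "0",
    }
    # CPU is listed last as a fallback for nodes the CUDA EP cannot run.
    return [("CUDAExecutionProvider", cuda_options), "CPUExecutionProvider"]


if __name__ == "__main__":
    for name in CUDA_GRAPH_SUPPORTED:
        print(name, cuda_providers(name))
```

A list built this way could be passed as the `providers` argument of `onnxruntime.InferenceSession`. Passing `allow_cuda_graph=False` mirrors the effect of a global switch such as `--disable-cuda-graph`. ONNX Runtime's documentation also describes I/O binding requirements for CUDA graph, which this sketch does not cover.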
Showing 14 changed files with 379 additions and 196 deletions.