Optimization Fixes and Improvements #575
Conversation
With the addition of memory efficient attention and VAE tiling it's possible to create some quite large images without requiring an equally enormous amount of VRAM. Great for textures and backgrounds, as long as the subject can be repeated easily and you're willing to wait.

VAE tiling can have some issues with color and detail accuracy. AFAIK this is caused by normalization operations: when the latents are tiled, each section is normalized independently, so the tiles don't come out normalized the same way. Tile size was set to 128 and blend to 0 for demonstration. While it can save a little memory on standard size images, I don't recommend it there. Some prompts may not produce usable results at high resolutions. These were meant to be people.

Interestingly, tiled VAE decoding and encoding was merged into diffusers recently (huggingface/diffusers#1441). It doesn't support seamless axis blending like the implementation here. The tiled encoding for img2img and inpainting is a nice idea and could be added in a later PR. There is also some discussion of improving tiling in that link, which might be worth looking into at some point.
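For reference, a minimal sketch of how the upstream diffusers tiled VAE could be enabled for a large render, assuming a diffusers release that includes the tiling support from #1441; the model id, prompt, and resolution are placeholders, not this add-on's defaults:

```python
# Rough sketch only: large image generation with tiled VAE decode in diffusers.
# Assumes a diffusers version that ships enable_vae_tiling() (huggingface/diffusers#1441).
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

pipe.enable_vae_tiling()         # decode the latents tile by tile to cap VRAM use
pipe.enable_attention_slicing()  # optional: trade a little speed for more memory headroom

image = pipe("seamless rock texture", width=2048, height=2048).images[0]
image.save("texture.png")
```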
Would you be interested in trying out the new attention mechanisms built into PyTorch 2.0 for this? It should be automatically enabled in diffusers if PyTorch 2 is installed.
Nice to see that PyTorch 2.0 just got released. I have already tried it out in previous nightly builds and saw similar performance to xFormers attention, but looking at that blog post it ought to do better for newer GPUs.
Hope that means there'll be some meaningful MPS and CPU improvements as well. I'll see about adding this in soon.
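As far as I can tell, diffusers decides this with a simple attribute check on `torch.nn.functional`; a tiny standalone sketch of that check (not code from this repo):

```python
# Illustration only: detect whether the PyTorch 2.0 fused attention kernel exists.
import torch
import torch.nn.functional as F

if hasattr(F, "scaled_dot_product_attention"):
    q = k = v = torch.randn(1, 8, 64, 40)          # (batch, heads, tokens, head_dim)
    out = F.scaled_dot_product_attention(q, k, v)  # fused SDP attention, PyTorch >= 2.0
    print("SDP attention available:", out.shape)
else:
    print("PyTorch < 2.0: fall back to xFormers or attention slicing")
```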
👍 Everything looks good overall.
If PyTorch 2's `scaled_dot_product_attention` is similar enough in performance to xformers, I'd rather just upgrade that dependency than add another, though.
That certainly didn't go as planned. Anywho, got the PyTorch 2.0 SDP attention optimization added and set to be on by default, since that's how it is in diffusers and it appears to alter the image less than xFormers attention. I have also tested it on CPU, but I don't see any difference in speed or memory usage, so I'm not so sure it'll help on MPS now. DirectML is also not compatible for the time being; at minimum, a build targeting PyTorch 2.0 or newer would need to be released first.
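For anyone following along, switching the processor explicitly in diffusers looks roughly like this; it assumes a diffusers version that exposes `AttnProcessor2_0` (the exact module path may differ between releases), and the model id is a placeholder. On PyTorch 2.x diffusers already picks the SDP processor by default, so the explicit calls are only for demonstration or for opting back out:

```python
# Sketch of switching attention processors in diffusers; not this add-on's exact code.
from diffusers import StableDiffusionPipeline
from diffusers.models.attention_processor import AttnProcessor, AttnProcessor2_0

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")

pipe.unet.set_attn_processor(AttnProcessor2_0())  # PyTorch 2.0 scaled_dot_product_attention
# pipe.unet.set_attn_processor(AttnProcessor())   # plain attention, for comparison
```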
LGTM 👍
Bugs
Fixes a bug where most speed optimizations were not visible on CUDA, and sequential CPU offload was wrongly visible on macOS. Also fixes a bug with CPU offloading that was supposed to already be fixed, but I believe it was accidentally reintroduced and missed during a merge at some point.
Improvements
I've extracted the device checking functionality from `Optimizations.can_use()` into its own classmethod `device_supports()` to simplify it and make optimization filtering clearer (a rough sketch of the split is below). Descriptions have been added to all optimizations with some indication of what each does and whether it has device limitations.
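A rough illustration of the refactor; the names, attributes, and structure here are assumed for the sketch and are not the actual add-on code:

```python
# Hypothetical sketch only; the real Optimizations class in this repo differs.
class Optimizations:
    cpu_only = {"sequential_cpu_offload"}   # assumed example data
    cuda_only = {"xformers_attention"}

    @classmethod
    def device_supports(cls, property_name: str, device: str) -> bool:
        """Return True if the named optimization is usable on the given device."""
        if device == "cpu":
            return property_name not in cls.cuda_only
        return property_name not in cls.cpu_only

    def can_use(self, property_name: str, device: str) -> bool:
        # The per-instance check now just combines the user's toggle with the
        # device filter instead of duplicating the device logic.
        return getattr(self, property_name, False) and self.device_supports(property_name, device)
```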
Removed AMP
Automatic mixed precision was removed due to 🤗 diffusers recommending against it. Any memory savings it provided can be achieved better with half precision, which is often faster and gives the same image quality.
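As an illustration of the recommended alternative (model id is a placeholder, not this add-on's loader): loading the pipeline in float16 covers the memory savings AMP was providing, without an autocast context.

```python
# Half precision instead of autocast; minimal example.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,  # weights and activations in fp16
).to("cuda")

image = pipe("a photo of an astronaut riding a horse").images[0]
```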
New Optimizations
Memory efficient attention from xFormers saves VRAM and inference time. I've noticed memory savings around the same as attention slicing at size 1, and a roughly 20% increase in it/s for normal image generation. It can overtake attention slicing's memory savings when the image is larger than usual and when upscaling. xFormers may automatically select different attention implementations for some GPUs/generations, so improvements will vary.
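For reference, the minimal way to turn this on through diffusers (not the add-on's exact wiring; model id is a placeholder, and the xformers package must be installed):

```python
# Enable xFormers memory efficient attention on a diffusers pipeline.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

pipe.enable_xformers_memory_efficient_attention()
# pipe.disable_xformers_memory_efficient_attention()  # revert if images change too much
```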
CPU offloading has been split into model and submodule options; submodule is a rename of sequential CPU offload. Model offloading is a lighter version of submodule offloading: it saves less memory, but it also doesn't slow inference as severely.
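The diffusers calls the two options map onto look roughly like this (model id is a placeholder; model offloading needs accelerate 0.17.0, as noted in the Discussion below):

```python
# The two offload levels as exposed by diffusers; pick one, not both.
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")

# "Model" offload: whole submodels (UNet, VAE, text encoder) move between CPU and
# GPU as needed. Requires accelerate >= 0.17.0. Moderate savings, small slowdown.
pipe.enable_model_cpu_offload()

# "Submodule" offload (formerly "sequential CPU offload"): individual layers are
# streamed to the GPU on demand. Biggest savings, much slower inference.
# pipe.enable_sequential_cpu_offload()
```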
Discussion
xFormers is often recommended for use with diffusers and may be suitable to enable by default. One issue is that I've heard some GPUs won't generate quite the same image with the same settings while using xFormers, which could cause unneeded confusion.
It also causes a warning message when installed, "A matching Triton is not available, some optimizations will not be enabled.", which should be suppressed, though I don't know a good method for doing so (one possible approach is sketched below). Triton is only officially available for Linux and could be added to a Linux-specific requirements.txt.

Model offloading requires accelerate 0.17.0, which isn't released yet. The Windows/Linux requirements.txt could be updated to install it from GitHub, or we could wait for it to be officially released (and based on release history that could likely be soon).
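One possible way to quiet that warning, assuming (as it appears) xFormers emits it through Python's logging module; this is a guess at a workaround, not something verified in this PR:

```python
# Possible workaround (unverified): raise the xformers logger level before the
# library is imported so the missing-Triton warning is not printed.
import logging

logging.getLogger("xformers").setLevel(logging.ERROR)

import xformers  # noqa: E402  (imported after adjusting the logger on purpose)
```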