Optimization Fixes and Improvements #575
Conversation
With the addition of memory efficient attention and VAE tiling it's possible to create some quite large images without requiring an equally enormous amount of VRAM. Great for textures and backgrounds, as long as the subject can be repeated easily and you're willing to wait.

VAE tiling can have some issues with color and detail accuracy. AFAIK this is caused by normalization operations: when the latents are tiled, each section is normalized independently, so the tiles don't come out normalized the same way. Tile size was set to 128 and blend to 0 for demonstration. While it can save a little memory on standard size images, I don't recommend it there. Some prompts may not produce usable results at high resolutions. These were meant to be people.

Interestingly, tiled VAE decoding and encoding was merged into diffusers recently (huggingface/diffusers#1441). It doesn't support seamless axis blending like the implementation here. The tiled encoding for img2img and inpainting is a nice idea and could be added in a later PR. There is also some discussion of improving tiling in that link, which might be worth looking into at some point.
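For reference, a minimal sketch of how the upstream diffusers tiled VAE could be enabled for a large render, assuming a diffusers release that includes the tiling support from #1441; the model id, prompt, and resolution are placeholders, not this add-on's defaults:

```python
# Rough sketch only: large image generation with tiled VAE decode in diffusers.
# Assumes a diffusers version that ships enable_vae_tiling() (huggingface/diffusers#1441).
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

pipe.enable_vae_tiling()         # decode the latents tile by tile to cap VRAM use
pipe.enable_attention_slicing()  # optional: trade a little speed for more memory headroom

image = pipe("seamless rock texture", width=2048, height=2048).images[0]
image.save("texture.png")
```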
Would you be interested in trying out the new attention mechanisms built into PyTorch 2.0 for this? It should be automatically enabled in diffusers if PyTorch 2 is installed.
Nice to see that PyTorch 2.0 just got released. I have already tried it out in previous nightly builds and saw similar performance to xFormers attention, but looking at that blog post it ought to do better for newer GPUs.
Hope that means there'll be some meaningful MPS and CPU improvements as well. I'll see about adding this in soon.
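As far as I can tell, diffusers decides this with a simple attribute check on `torch.nn.functional`; a tiny standalone sketch of that check (not code from this repo):

```python
# Illustration only: detect whether the PyTorch 2.0 fused attention kernel exists.
import torch
import torch.nn.functional as F

if hasattr(F, "scaled_dot_product_attention"):
    q = k = v = torch.randn(1, 8, 64, 40)          # (batch, heads, tokens, head_dim)
    out = F.scaled_dot_product_attention(q, k, v)  # fused SDP attention, PyTorch >= 2.0
    print("SDP attention available:", out.shape)
else:
    print("PyTorch < 2.0: fall back to xFormers or attention slicing")
```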
👍 Everything looks good overall.
If PyTorch 2's `scaled_dot_product_attention` is similar enough in performance to xformers, I'd rather just upgrade that dependency than add another, though.
That certainly didn't go as planned. Anywho, got the PyTorch 2.0 SDP attention optimization added and set to be on by default, since that's how it is in diffusers and it appears to alter the image less than xFormers attention. I have also tested it on CPU, but I don't see any difference in speed or memory usage, so I'm not so sure it'll help on MPS now. DirectML is also not compatible for the time being; at minimum, a build targeting PyTorch 2.0 or newer would need to be released first.
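For anyone following along, switching the processor explicitly in diffusers looks roughly like this; it assumes a diffusers version that exposes `AttnProcessor2_0` (the exact module path may differ between releases), and the model id is a placeholder. On PyTorch 2.x diffusers already picks the SDP processor by default, so the explicit calls are only for demonstration or for opting back out:

```python
# Sketch of switching attention processors in diffusers; not this add-on's exact code.
from diffusers import StableDiffusionPipeline
from diffusers.models.attention_processor import AttnProcessor, AttnProcessor2_0

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")

pipe.unet.set_attn_processor(AttnProcessor2_0())  # PyTorch 2.0 scaled_dot_product_attention
# pipe.unet.set_attn_processor(AttnProcessor())   # plain attention, for comparison
```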
LGTM 👍
Bugs
Fixes a bug where most speed optimizations were not visible on CUDA, and sequential CPU offload was wrongly visible on macOS. Also fixes a bug with CPU offloading that was supposed to already be fixed, but I believe it was accidentally reintroduced and missed during a merge at some point.
Improvements
I've extracted the device checking functionality from `Optimizations.can_use()` into its own classmethod `device_supports()` to simplify it and make optimization filtering clearer (a rough sketch of the split is below). Descriptions have been added to all optimizations with some indication of what each does and whether it has device limitations.
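A rough illustration of the refactor; the names, attributes, and structure here are assumed for the sketch and are not the actual add-on code:

```python
# Hypothetical sketch only; the real Optimizations class in this repo differs.
class Optimizations:
    cpu_only = {"sequential_cpu_offload"}   # assumed example data
    cuda_only = {"xformers_attention"}

    @classmethod
    def device_supports(cls, property_name: str, device: str) -> bool:
        """Return True if the named optimization is usable on the given device."""
        if device == "cpu":
            return property_name not in cls.cuda_only
        return property_name not in cls.cpu_only

    def can_use(self, property_name: str, device: str) -> bool:
        # The per-instance check now just combines the user's toggle with the
        # device filter instead of duplicating the device logic.
        return getattr(self, property_name, False) and self.device_supports(property_name, device)
```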
Removed AMP
Automatic mixed precision was removed due to 🤗 diffusers recommending against it. Any memory savings it provided can be achieved better with half precision, which is often faster and gives the same image quality.
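As an illustration of the recommended alternative (model id is a placeholder, not this add-on's loader): loading the pipeline in float16 covers the memory savings AMP was providing, without an autocast context.

```python
# Half precision instead of autocast; minimal example.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,  # weights and activations in fp16
).to("cuda")

image = pipe("a photo of an astronaut riding a horse").images[0]
```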
New Optimizations
Memory efficient attention from xFormers saves VRAM and inference time. I've noticed memory savings around the same as attention slicing at size 1, and a roughly 20% increase in it/s for normal image generation. It can overtake attention slicing's memory savings when the image is larger than usual and when upscaling. xFormers may automatically select different attention implementations for some GPUs/generations, so improvements will vary.
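For reference, the minimal way to turn this on through diffusers (not the add-on's exact wiring; model id is a placeholder, and the xformers package must be installed):

```python
# Enable xFormers memory efficient attention on a diffusers pipeline.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

pipe.enable_xformers_memory_efficient_attention()
# pipe.disable_xformers_memory_efficient_attention()  # revert if images change too much
```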
CPU offloading has been split into model and submodule options; submodule is a rename of sequential CPU offload. Model offloading is a lighter version of submodule offloading: it saves less memory, but it also doesn't slow inference as severely.
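The diffusers calls the two options map onto look roughly like this (model id is a placeholder; model offloading needs accelerate 0.17.0, as noted in the Discussion below):

```python
# The two offload levels as exposed by diffusers; pick one, not both.
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")

# "Model" offload: whole submodels (UNet, VAE, text encoder) move between CPU and
# GPU as needed. Requires accelerate >= 0.17.0. Moderate savings, small slowdown.
pipe.enable_model_cpu_offload()

# "Submodule" offload (formerly "sequential CPU offload"): individual layers are
# streamed to the GPU on demand. Biggest savings, much slower inference.
# pipe.enable_sequential_cpu_offload()
```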
Discussion
xFormers is often recommended for use with diffusers and may be suitable to enable by default. One issue is that I've heard some GPUs won't generate quite the same image with the same settings while using xFormers, which could cause unneeded confusion.
It also causes a warning message when installed, "A matching Triton is not available, some optimizations will not be enabled.", which should be suppressed, though I don't know a good method for doing so (one possible approach is sketched below). Triton is only officially available for Linux and could be added to a Linux-specific requirements.txt.

Model offloading requires accelerate 0.17.0, which isn't released yet. The Windows/Linux requirements.txt could be updated to install it from GitHub, or we could wait for it to be officially released (and based on release history that could likely be soon).
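One possible way to quiet that warning, assuming (as it appears) xFormers emits it through Python's logging module; this is a guess at a workaround, not something verified in this PR:

```python
# Possible workaround (unverified): raise the xformers logger level before the
# library is imported so the missing-Triton warning is not printed.
import logging

logging.getLogger("xformers").setLevel(logging.ERROR)

import xformers  # noqa: E402  (imported after adjusting the logger on purpose)
```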