Allow FP16 accumulation with --fast
#6453
Conversation
Currently only applies to PyTorch nightly releases. (>=20250112)
pytorch version checks are probably needed so this patch doesn't break old pytorch installations?
pytorch version: 2.7.0.dev20250112+cu126
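(Not part of this PR, just a sketch of the kind of guard being asked about: check that the attribute exists before touching it, so older PyTorch builds are left alone.)

```python
import torch

# Hypothetical guard, not the PR's code: only set the flag when this PyTorch
# build actually exposes it, so older installations keep working unchanged.
if hasattr(torch.backends.cuda.matmul, "allow_fp16_accumulation"):
    torch.backends.cuda.matmul.allow_fp16_accumulation = True
```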
I have applied this on reforge and the speed bump seems to vary with batch size. On my 3090/4090, with batch size 1 I see about a 10-15% improvement, and then 25-33% at batch size 2-3 or more. I guess with single images you can hit a RAM/CPU bottleneck.
try:
    if is_nvidia() and args.fast:
        torch.backends.cuda.matmul.allow_fp16_accumulation = True
except:
    pass
You can catch `AttributeError` here, it doesn't need to be a wildcard `except`.
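For illustration, a minimal sketch of the narrower handler being suggested, assuming the surrounding model_management.py context where `is_nvidia()` and `args` already exist:

```python
try:
    if is_nvidia() and args.fast:
        torch.backends.cuda.matmul.allow_fp16_accumulation = True
except AttributeError:
    # The flag only exists on recent PyTorch nightlies; older builds just skip it.
    pass
```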
I was just following the style of the other code in this file (which could use this type of improvement too...)
Definitely. For whatever my opinion is worth (which probably isn't much!) I think it would be good if new changes avoided "code smells" linters will complain about. Seems like the ComfyUI codebase is slowly trying to modernize with adding type annotations and stuff.
Looks like the change got reverted in pytorch so I'll merge this once they add it back.
It got merged again: pytorch/pytorch@a6763b7. So this PR will work if building from source now, or with the nightly build in the next 1-2 days. EDIT: It got reverted again.
Looks like it got merged in pytorch for real this time. The latest pytorch nightly works and gives a performance improvement.
Great! Let's hope it stays enabled; it was reverted like 3-4 times in the past 3 weeks, but this time it seems to be sticking around longer.
I successfully installed torch 2.7.0.dev20250208. The test command is as follows.
Then I used the command `python main.py --fast` to start ComfyUI. I already updated ComfyUI, and I can find the following code in model_management.py.
But it seems there is no speedup for Flux default or fp8 using --fast compared with no --fast. However, comparing torch 2.5.1+cu124+py3.11 with torch 2.7.0dev+cu124+py3.12, torch 2.7.0dev is approximately 20% faster. Both runs use wavespeed caching for speedup.
This will only speed up models when using fp16. Flux is a model that uses bf16 by default, and this won't speed up fp8_fast or bf16. ComfyUI does contain a hack to make Flux work in fp16, but it doesn't work with some of the Flux variants like Flux Fill.
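To isolate the raw GEMM effect from any particular model, here is a rough micro-benchmark sketch (my own, not from this PR). It assumes a PyTorch nightly that exposes the flag and an Ampere-or-newer NVIDIA GPU, and it only exercises fp16 matmuls, which is the one case the flag changes; bf16 and fp8 paths are untouched.

```python
import time
import torch

def bench(fp16_accum: bool, n: int = 8192, iters: int = 50) -> float:
    # Toggle the nightly-only flag, then time repeated fp16 matmuls.
    torch.backends.cuda.matmul.allow_fp16_accumulation = fp16_accum
    a = torch.randn(n, n, device="cuda", dtype=torch.float16)
    b = torch.randn(n, n, device="cuda", dtype=torch.float16)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        a @ b
    torch.cuda.synchronize()
    return time.perf_counter() - start

print("fp32 accumulation:", bench(False))
print("fp16 accumulation:", bench(True))
```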
Are there any metrics on the precision loss/quality degradation? Is my understanding correct that this loss stems from downcasting operations typically done in fp32 to fp16? Is accumulation typically an fp32 step, regardless of model precision?
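Not an authoritative answer, but a crude CPU illustration of why accumulating in fp16 rather than fp32 loses precision: once the running sum grows large relative to each addend, fp16 rounds the additions away entirely. (This emulates only the accumulator dtype; real GEMMs accumulate inside the tensor cores.)

```python
import torch

vals = torch.full((10_000,), 0.01, dtype=torch.float16)

acc16 = torch.tensor(0.0, dtype=torch.float16)
for v in vals:
    acc16 = acc16 + v                 # fp16 accumulator: stalls near 32

acc32 = vals.to(torch.float32).sum()  # fp32 accumulator: ~100, as expected

print(float(acc16), float(acc32))
```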
This PR enables a new PyTorch flag, currently only available in nightly releases as of ~~2025-01-12~~ 2025-02-07, that enables FP16 accumulation in matmul ops for NVIDIA GPUs. On my system with an RTX 3090, running the default provided image generation workflow at a resolution of 1024x1024, this gives an it/s bump from 4 it/s to 5 it/s.

I've opted to only enable this when `--fast` is used, since it seems to be "potentially quality deteriorating", as the `--fast` arg suggests.

Note that the performance improvement only really applies to the 3090 or newer GPUs (i.e. 4000 series). Older cards will likely see no performance improvement.
For reference:
pytorch/pytorch#144441
https://docs-preview.pytorch.org/pytorch/pytorch/144441/notes/cuda.html#full-fp16-accmumulation-in-fp16-gemms
pytorch/pytorch#144441 (comment)
For future work it may also be worth looking at this (to retain the performance improvement without the quality deterioration):
pytorch/pytorch#144441 (comment)