
Allow FP16 accumulation with --fast #6453

Merged

Conversation

@catboxanon (Contributor) commented Jan 13, 2025

This PR enables a new PyTorch flag, currently only available in nightly releases (as of 2025-02-07), that enables FP16 accumulation in matmul ops on NVIDIA GPUs. On my system with an RTX 3090, running the default provided image generation workflow at a resolution of 1024x1024, this bumps throughput from 4 it/s to 5 it/s.

I've opted to enable this only when --fast is used, since it seems to be "potentially quality deteriorating", which is exactly what the --fast arg is for.

Note that the performance improvement only really applies to the 3090 and newer GPUs (e.g. the 4000 series). Older cards will likely see no improvement.
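If you want to reproduce the gain outside of ComfyUI, here is a minimal micro-benchmark sketch; it assumes a PyTorch nightly that exposes the new flag and a CUDA-capable NVIDIA GPU, and the timings are illustrative only:

import time
import torch

a = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
b = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)

def bench(label):
    # Time 100 fp16 matmuls; synchronize so the GPU work is actually measured.
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(100):
        _ = a @ b
    torch.cuda.synchronize()
    print(f"{label}: {time.perf_counter() - start:.3f}s")

bench("default (fp32 accumulation)")
torch.backends.cuda.matmul.allow_fp16_accumulation = True  # the nightly-only flag this PR toggles
bench("fp16 accumulation")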


For reference:
pytorch/pytorch#144441
https://docs-preview.pytorch.org/pytorch/pytorch/144441/notes/cuda.html#full-fp16-accmumulation-in-fp16-gemms
pytorch/pytorch#144441 (comment)

3090/4090 users would also likely benefit from fp16 accumulation inside attention and convolution operations. With this, perhaps the 4090 could approximate the speed of an A100.

For future work it may also be worth looking at this (retaining the performance improvement without the quality deterioration):
pytorch/pytorch#144441 (comment)

Currently this only applies to PyTorch nightly releases (>= 20250112).
@liesened

PyTorch version checks are probably needed so this patch doesn't break old PyTorch installations?

@ao899 commented Jan 13, 2025

PyTorch version: 2.7.0.dev20250112+cu126. I'm not entirely confident it's working properly, but I measured a 12.09% speed improvement.

@Panchovix

I have applied this on reForge and the speed bump seems to vary with batch size.

On my 3090/4090, with batch size 1 I see about a 10-15% improvement, and then 25-33% at batch size 2-3 or more.

I guess with single images you can hit a RAM/CPU bottleneck.

try:
    if is_nvidia() and args.fast:
        torch.backends.cuda.matmul.allow_fp16_accumulation = True
except:
    pass
Contributor

You can catch AttributeError here; it doesn't need to be a wildcard except.
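A minimal sketch of that suggestion, reusing the is_nvidia() helper and args from the surrounding module:

try:
    if is_nvidia() and args.fast:
        torch.backends.cuda.matmul.allow_fp16_accumulation = True
except AttributeError:
    # Older PyTorch releases don't expose the flag; ignore only that case.
    pass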

@catboxanon (Author) commented Jan 14, 2025

I was just following the style of the other code in this file (which could use this type of improvement too...)

Contributor

Definitely. For whatever my opinion is worth (which probably isn't much!), I think it would be good if new changes avoided "code smells" that linters will complain about. It seems like the ComfyUI codebase is slowly modernizing, adding type annotations and such.

@Dahvikiin

Isn't the --fast argument only for Ampere+? The PyTorch documentation ("Full FP16 Accmumulation in FP16 GEMMs") says it works on Turing and Volta too (compute capability 7.0+).
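For what it's worth, a capability-based gate (not part of this PR, just a sketch based on the 7.0+ note in the PyTorch docs) could look like this:

import torch

if torch.cuda.is_available():
    # Per the PyTorch docs, fp16 accumulation in GEMMs needs compute capability 7.0+
    # (Volta/Turing or newer), so gate on the device rather than on --fast alone.
    major, minor = torch.cuda.get_device_capability()
    if (major, minor) >= (7, 0) and hasattr(torch.backends.cuda.matmul, "allow_fp16_accumulation"):
        torch.backends.cuda.matmul.allow_fp16_accumulation = True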

@comfyanonymous (Owner)

Looks like the change got reverted in PyTorch, so I'll merge this once they add it back.

@Panchovix commented Jan 15, 2025

It got merged again: pytorch/pytorch@a6763b7

So this PR will work when building from source now, or with the nightly build in the next 1-2 days.

EDIT: It got reverted again

@ao899 commented Jan 23, 2025

pytorch/pytorch@de945d7

@comfyanonymous merged commit 43a74c0 into comfyanonymous:master on Feb 8, 2025
5 checks passed
@comfyanonymous (Owner)

Looks like it got merged in PyTorch for real this time. The latest PyTorch nightly works and gives a performance improvement.

@Panchovix

Great! Let's hope it stays in; it was reverted 3-4 times in the past 3 weeks, but this time it seems to be sticking around longer.

@catboxanon deleted the feat/fp16-accumulation branch on February 8, 2025, 23:19
@codexq123 commented Feb 10, 2025

I successfully installed torch 2.7.0.dev20250208. The test commands are as follows.

>>> import torch
>>> torch.cuda.is_available()
True
>>> torch.__version__
'2.7.0.dev20250208+cu124'
>>> torch.backends.cuda.matmul.allow_fp16_accumulation = True
>>> torch.backends.cuda.matmul.allow_fp16_accumulation
True
Then I used the command 'python main.py --fast' to start ComfyUI. I already updated ComfyUI, and I can find the following code in model_management.py:

try:
    if is_nvidia() and args.fast:
        torch.backends.cuda.matmul.allow_fp16_accumulation = True
except:
    pass

But it seems there is no speedup for Flux default or fp8 when using --fast compared with no --fast.

BUT, comparing torch 2.5.1+cu124+py3.11 and torch 2.7.0dev+cu124+py3.12, torch 2.7.0dev is approximately 20% faster. Both use WaveSpeed for caching speedup.

@comfyanonymous (Owner)

This will only speed up models running in fp16. Flux uses bf16 by default, and this won't speed up fp8_fast or bf16.

ComfyUI does contain a hack to make Flux work in fp16, but it doesn't work with some of the Flux variants like Flux Fill.
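To illustrate the point (a sketch, not code from this PR): the flag only changes how fp16 matmuls accumulate, so GEMMs running in bf16 or fp8 are untouched.

import torch

torch.backends.cuda.matmul.allow_fp16_accumulation = True  # the flag enabled by --fast in this PR

x_fp16 = torch.randn(2048, 2048, device="cuda", dtype=torch.float16)
x_bf16 = torch.randn(2048, 2048, device="cuda", dtype=torch.bfloat16)

_ = x_fp16 @ x_fp16  # eligible for the faster fp16-accumulation GEMM kernels
_ = x_bf16 @ x_bf16  # bf16 still accumulates in fp32; no speedup from the flag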

@scottmudge

Are there any metrics on the precision loss/quality degradation? Is my understanding correct that this loss stems from downcasting an operation typically done in fp32 to fp16? Is accumulation normally an fp32 step, regardless of model precision?
