FLUX Speed Improvements (~10% speedup) #7399

RyanJDick · 2024-11-29T15:52:29Z

Summary

This PR includes several small speed improvements to FLUX inference:

Use torch.nn.functional.rms_norm(...) rather than the custom implementation
Reduce tensor type casting in apply_rope(...)
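Use .view(...) rather than .reshape(...) so that the result is guaranteed to share the underlying tensor data (.view(...) raises an error instead of silently copying when the memory layout is incompatible).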

After these changes, some operations are now run at a lower precision than before, which results in slight differences in the generated images.
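
The effect of lower-precision arithmetic can be sketched with a simplified rotary-style rotation. This is not the apply_rope(...) from the codebase: rotate_pairs, the tensor shapes, and the random angles are all assumptions chosen for illustration. The same bf16 input is rotated once with a float32 upcast and once entirely in bf16, and the two results are compared.

```python
import torch


def rotate_pairs(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor,
                 compute_dtype: torch.dtype) -> torch.Tensor:
    """Rotate consecutive (even, odd) feature pairs by a per-pair angle.

    Hypothetical helper: the arithmetic runs in compute_dtype and the result
    is cast back to the input dtype.
    """
    out_dtype = x.dtype
    x = x.to(compute_dtype)
    cos, sin = cos.to(compute_dtype), sin.to(compute_dtype)
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    rot_even = x_even * cos - x_odd * sin
    rot_odd = x_even * sin + x_odd * cos
    # Interleave the rotated halves back into the original feature layout.
    out = torch.stack((rot_even, rot_odd), dim=-1).flatten(-2)
    return out.to(out_dtype)


torch.manual_seed(0)
x = torch.randn(4, 16, 64, dtype=torch.bfloat16)
angles = torch.rand(16, 32) * 6.28  # one angle per (position, feature pair)
cos, sin = angles.cos(), angles.sin()

out_fp32 = rotate_pairs(x, cos, sin, torch.float32)   # upcast, compute, downcast
out_bf16 = rotate_pairs(x, cos, sin, torch.bfloat16)  # stay in bf16 throughout

max_abs_diff = (out_fp32.float() - out_bf16.float()).abs().max().item()
print(f"max abs difference between the two paths: {max_abs_diff:.4f}")
```

Skipping the upcast saves two dtype conversions per call, at the cost of rounding intermediate products to bf16. Small per-element differences of this kind accumulate over the denoising steps, which is consistent with the slight image changes described above.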

Speedup

Configuration: 1024x1024, 15 steps
Before:

bf16: 0.481 secs / iter
BnB int8: 0.521 secs / iter

After:

bf16: 0.435 secs / iter (9.6% speedup)
BnB int8: 0.468 secs / iter (9.1% speedup)
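
For reference, a percentage can be derived from per-iteration timings in two common ways: as a reduction in time per iteration, or as a gain in iterations per second. The sketch below applies both to the timings listed above. The function names are hypothetical, and which definition was used for the quoted percentages is an assumption.

```python
def time_reduction_pct(before: float, after: float) -> float:
    """Percentage decrease in seconds per iteration."""
    return (before - after) / before * 100.0


def throughput_gain_pct(before: float, after: float) -> float:
    """Percentage increase in iterations per second."""
    return (before / after - 1.0) * 100.0


# (before, after) in seconds per iteration, as listed above.
timings = {"bf16": (0.481, 0.435), "BnB int8": (0.521, 0.468)}

for name, (before, after) in timings.items():
    print(
        f"{name}: {time_reduction_pct(before, after):.1f}% less time per iteration, "
        f"{throughput_gain_pct(before, after):.1f}% more iterations per second"
    )
```

With the rounded timings shown, the bf16 time reduction comes out at about 9.6%, matching the figure above. The BnB int8 timings come out at about 10.2% rather than 9.1%, so that figure may have been computed from unrounded or different measurements.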

Image Change

Left=before, Right=after

Prompt: "An architecture rendering of the reception area of a corporate office with modern decor."

Prompt: "A pixar cartoon rendering of a frog with big eyes watching a frog."

Prompt: "A portrait photo of a man with blonde hair and glasses wearing a suit and tie."

QA Instructions

Generated before / after comparison images with the same seed as shown above.
bf16 inference on CUDA
BnB int8 inference on CUDA
Test on MacOS

Checklist

The PR has a short but descriptive title, suitable for a changelog
Tests added / updated (if applicable)
Documentation added / updated (if applicable)
Updated What's New copy (if doing a release after this PR)


Summary

#7422

As reported in the above ticket, a recent FLUX performance improvement caused a regression on MacOS. This PR reverts the offending part of the change.

Related Issues / Discussions

Closes #7422
Original perf improvement: #7399

QA Instructions

I don't have a Mac capable of running this test, so trusting the report in #7422 that this fixes the problem.

Checklist

[x] The PR has a short but descriptive title, suitable for a changelog
[x] Tests added / updated (if applicable)
[x] Documentation added / updated (if applicable)
[ ] Updated What's New copy (if doing a release after this PR)

RyanJDick requested review from lstein, blessedcoolant, brandonrising and hipsterusername as code owners November 29, 2024 15:52

github-actions bot added the python (PRs that change python files) and backend (PRs that change backend files) labels Nov 29, 2024

hipsterusername approved these changes Nov 29, 2024


RyanJDick added 3 commits November 29, 2024 12:24

Replace custom RMSNorm implementation with torch.nn.functional.rms_norm(...) for improved speed.

2117101

Use view() instead of rearrange() for better performance.

f0672ac

Avoid unnecessary dtype conversions with rope encodings.

a03721d

hipsterusername force-pushed the ryan/flux-speed-improvements branch from 184e0f3 to a03721d November 29, 2024 17:24

hipsterusername enabled auto-merge (rebase) November 29, 2024 17:30

hipsterusername merged commit 021552f into main Nov 29, 2024
14 checks passed

hipsterusername deleted the ryan/flux-speed-improvements branch November 29, 2024 17:32

RyanJDick mentioned this pull request Dec 3, 2024

Revert FLUX performance improvement that fails on MacOS #7423

Merged

4 tasks
