LinFusion: 1 GPU, 1 Minute, 16K Image
Songhua Liu, Weihao Yu, Zhenxiong Tan, and Xinchao Wang
Learning and Vision Lab, National University of Singapore
[2024/12/23] We release CLEAR, a dedicated solution for pre-trained DiTs like FLUX and SD3 to accelerate high-resolution generation. Please check out our paper and code.
[2024/11/24] LinFusion now has a Triton implementation, which is much more efficient than the previous naive one! We would like to extend our sincere gratitude to @hp-133 for the amazing work!
[2024/09/28] We release evaluation code for the COCO benchmark!
[2024/09/27] We have successfully integrated LinFusion into DistriFusion, an effective and efficient strategy for generating an image in parallel, and achieved more significant acceleration! Please refer to the example here!
[2024/09/26] We enable 16K image generation with merely 24G video memory! Please refer to the example here!
[2024/09/20] We release a more advanced pipeline for ultra-high-resolution image generation using SD-XL! It can be used for text-to-image generation and image super-resolution!
[2024/09/20] We release training code for Stable Diffusion XL here!
[2024/09/13] We release LinFusion models for Stable Diffusion v-2.1 and Stable Diffusion XL!
[2024/09/13] We release training code for Stable Diffusion v-1.5, v-2.1, and their variants here!
[2024/09/08] We release code for 16K image generation here!
[2024/09/05] Gradio demo for SD-v1.5 is released! Text-to-image, image-to-image, and IP-Adapter are supported currently.
- Yuanshi/LinFusion-1-5: For Stable Diffusion v-1.5 and its variants.
- Yuanshi/LinFusion-2-1: For Stable Diffusion v-2.1 and its variants.
- Yuanshi/LinFusion-XL: For Stable Diffusion XL and its variants.
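The correspondence above can be captured in a small lookup. This is only an illustrative sketch: the helper name `linfusion_checkpoint_for` and the family labels (`"sd-v1.5"`, `"sd-v2.1"`, `"sdxl"`) are hypothetical and not part of the repository; only the three checkpoint names come from the list.

```python
# Hypothetical helper: restates the checkpoint list above as a lookup table.
# The dictionary keys are labels invented for this sketch.
LINFUSION_CHECKPOINTS = {
    "sd-v1.5": "Yuanshi/LinFusion-1-5",
    "sd-v2.1": "Yuanshi/LinFusion-2-1",
    "sdxl": "Yuanshi/LinFusion-XL",
}


def linfusion_checkpoint_for(base_family: str) -> str:
    """Return the LinFusion checkpoint name for a base-model family label."""
    try:
        return LINFUSION_CHECKPOINTS[base_family]
    except KeyError:
        supported = ", ".join(sorted(LINFUSION_CHECKPOINTS))
        raise ValueError(
            f"Unsupported base family {base_family!r}; expected one of: {supported}"
        ) from None


if __name__ == "__main__":
    print(linfusion_checkpoint_for("sdxl"))
```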
- Clone this repo to your project directory:

    git clone https://github.com/Huage001/LinFusion.git
- You only need two lines!

    from diffusers import AutoPipelineForText2Image
    import torch

    from src.linfusion import LinFusion  # LinFusion: added line 1 of 2

    sd_repo = "Lykon/dreamshaper-8"

    pipeline = AutoPipelineForText2Image.from_pretrained(
        sd_repo, torch_dtype=torch.float16, variant="fp16"
    ).to(torch.device("cuda"))

    linfusion = LinFusion.construct_for(pipeline)  # LinFusion: added line 2 of 2

    image = pipeline(
        "An astronaut floating in space. Beautiful view of the stars and the universe in the background.",
        generator=torch.manual_seed(123)
    ).images[0]

  LinFusion.construct_for(pipeline) returns a LinFusion model that matches the pipeline's structure. This LinFusion model is automatically mounted to the pipeline's forward function.
- examples/inference/basic_usage.ipynb shows a basic text-to-image example.
- You can currently try LinFusion for SD-v1.5 online here. Text-to-image, image-to-image, and IP-Adapter are supported.
- We are building Gradio local demos for more base models and applications, so that everyone can deploy the demos locally.
From the perspective of efficiency, our method supports high-resolution generation such as 16K images. Nevertheless, directly applying diffusion models trained on low resolutions to higher-resolution generation can result in content distortion and duplication. To tackle this challenge, we apply the following techniques:
- SDEdit. The basic idea is to first generate a low-resolution result and then gradually upscale the image based on it. Please refer to examples/inference/ultra_text2image_w_sdedit.ipynb for an example.
- DemoFusion. It also generates high-resolution images from low-resolution results. Latents of the low-resolution generation are reused for high-resolution generation, and dilated convolutions are involved. Compared with the original version:
  - We can generate high-resolution images directly with the help of LinFusion instead of using patch-wise strategies.
  - Insights from SDEdit are also applied here, so that the high-resolution branch does not need to go through the full denoising steps.
  - Images are upscaled to 2x, 4x, 8x, ... resolutions instead of 1x, 2x, 3x, ...

  Please refer to examples/inference/ultra_text2image_sdxl.ipynb for an example of high-resolution text-to-image generation (first generate at 1024 resolution, then at 2048, 4096, 8192, etc.) and examples/inference/superres_sdxl.ipynb for an example of image super-resolution (directly upscale to the target resolution; 2x is generally recommended, applied multiple times if you want higher scales).
- The above code for 16K image generation requires a GPU with 80G video memory. If you encounter OOM issues, you may consider examples/inference/superres_sdxl_low_w_mem.ipynb, which requires only 24G video memory. We achieve this by 1) chunked forward of classifier-free guidance inference, 2) chunked forward of the feed-forward networks in Transformer blocks, 3) in-place activation functions in ResNets, and 4) caching UNet residuals on CPU.
- DistriFusion. Alternatively, if you have multiple GPU cards, you can try integrating LinFusion into DistriFusion, which achieves more significant acceleration thanks to LinFusion's linearity and the resulting almost constant communication cost. You can run a minimal example with:

    torchrun --nproc_per_node=$N_GPUS -m examples.inference.sdxl_distrifusion_example

- We are working on integrating LinFusion with more advanced approaches that are dedicated to high-resolution extension! Feel free to create pull requests if you come up with better solutions!
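The progressive upscaling described in the SDEdit and DemoFusion items above can be sketched as a simple plan: double the resolution at each stage, and let every stage after the first start from a noised copy of the previous result so that it runs only part of the denoising schedule. This is a rough illustration, not the notebooks' implementation. The function names, the `strength` parameter (in the style of common image-to-image pipelines), the default of 50 steps, and the reading of "16K" as 16384 pixels are all assumptions.

```python
from typing import List, Tuple


def upscale_schedule(base: int, target: int, factor: int = 2) -> List[int]:
    """List the resolutions visited when repeatedly upscaling by `factor`."""
    if base <= 0 or target < base or factor < 2:
        raise ValueError("need base > 0, target >= base, and factor >= 2")
    resolutions = [base]
    while resolutions[-1] < target:
        resolutions.append(resolutions[-1] * factor)
    if resolutions[-1] != target:
        raise ValueError("target must equal base * factor**k for some integer k")
    return resolutions


def sdedit_steps(num_inference_steps: int, strength: float) -> int:
    """Denoising steps actually run when starting from a noised image.

    strength=1.0 corresponds to starting from pure noise (all steps run);
    smaller values keep more of the input and skip the noisiest steps.
    """
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be in [0, 1]")
    return min(int(num_inference_steps * strength), num_inference_steps)


def upscaling_plan(
    base: int, target: int, num_inference_steps: int = 50, strength: float = 0.5
) -> List[Tuple[int, int]]:
    """Pair each resolution with the number of denoising steps run at that stage."""
    stages = upscale_schedule(base, target)
    # The first stage starts from pure noise, so it runs every step.
    plan = [(stages[0], num_inference_steps)]
    # Later stages start from the upscaled previous result plus noise.
    partial = sdedit_steps(num_inference_steps, strength)
    plan.extend((resolution, partial) for resolution in stages[1:])
    return plan


if __name__ == "__main__":
    for resolution, steps in upscaling_plan(1024, 16384):
        print(f"{resolution}x{resolution}: {steps} denoising steps")
```

Under these assumed settings, the stages above the base resolution each run half of the schedule, which is where the saving from the SDEdit insight comes from.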
- Before training, make sure you have the packages listed in requirements.txt installed:

    pip install -r requirements.txt

- Training code for Stable Diffusion v-1.5, v-2.1, and their variants is released in src/train/distill.py. We present an example running script in examples/train/distill.sh. You can run it on an 8-GPU machine via:

    bash ./examples/training/distill.sh

  If the bhargavsdesai/laion_improved_aesthetics_6.5plus_with_images dataset is not already present, the code downloads it automatically, by default to the ~/.cache directory. The dataset contains 169k images and requires ~75 GB of disk space.

  We use fp16 precision and 512 resolution for Stable Diffusion v-1.5, and bf16 precision and 768 resolution for Stable Diffusion v-2.1.
- Training code for Stable Diffusion XL is released in src/train/distill_xl.py. We present an example running script in examples/train/distill_xl.sh. You can run it on an 8-GPU machine via:

    bash ./examples/training/distill_xl.sh
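Because the training dataset mentioned above needs roughly 75 GB under ~/.cache, it can be worth checking free disk space before launching a run. The following is a minimal sketch using only the standard library; the helper name `has_free_space` is hypothetical and the check is not something the training scripts are known to perform.

```python
import shutil
from pathlib import Path

REQUIRED_GB = 75  # approximate dataset size stated in the training notes


def has_free_space(path: str, required_gb: float = REQUIRED_GB) -> bool:
    """Return True if the filesystem holding `path` has at least `required_gb` GB free."""
    target = Path(path).expanduser()
    # Walk up to the nearest existing directory; the cache folder may not exist yet.
    while not target.exists():
        target = target.parent
    free_bytes = shutil.disk_usage(target).free
    return free_bytes >= required_gb * 1000**3


if __name__ == "__main__":
    if has_free_space("~/.cache"):
        print("Enough free space for the dataset download.")
    else:
        print(f"Less than {REQUIRED_GB} GB free under ~/.cache.")
```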
Following GigaGAN, we use 30,000 COCO captions to generate 30,000 images for evaluation. FID against COCO val2014 is reported as a metric, and CLIP text cosine similarity is used to reflect the text-image alignment.
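As a reminder of what the text-image alignment metric measures, here is a minimal sketch of cosine similarity averaged over paired image and caption embeddings. It uses toy vectors in pure Python; in practice the embeddings would come from CLIP's image and text encoders, and the exact CLIP model and any score scaling used by the evaluation scripts are not specified here. The function names are hypothetical.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


def mean_alignment(
    image_embeddings: Sequence[Sequence[float]],
    text_embeddings: Sequence[Sequence[float]],
) -> float:
    """Average cosine similarity over paired (image, caption) embeddings."""
    if not image_embeddings or len(image_embeddings) != len(text_embeddings):
        raise ValueError("need the same non-zero number of image and text embeddings")
    scores = [
        cosine_similarity(img, txt)
        for img, txt in zip(image_embeddings, text_embeddings)
    ]
    return sum(scores) / len(scores)
```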
- To evaluate LinFusion, first install the required packages:

    pip install git+https://github.com/openai/CLIP.git
    pip install click clean-fid open_clip_torch

- Download and unzip COCO val2014 to /path/to/coco:

    wget http://images.cocodataset.org/zips/val2014.zip
    unzip val2014.zip -d /path/to/coco

- Run examples/eval/eval.sh to generate images for evaluation. You may need to specify outdir, repo_id, resolution, etc.

    bash examples/eval/eval.sh

- Run examples/eval/calculate_metrics.sh to calculate the metrics. You may need to specify /path/to/coco, fake_dir, etc.

    bash examples/eval/calculate_metrics.sh
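Before calculating metrics, it can help to confirm that the generation step produced the expected 30,000 images. The sketch below counts image files under a directory. It assumes the generated images are stored as .png, .jpg, or .jpeg files somewhere under the output directory, which is an assumption about the scripts' output layout, and the helper names are hypothetical.

```python
from pathlib import Path

IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}  # assumed output formats


def count_images(directory: str) -> int:
    """Count image files (by extension) anywhere under `directory`."""
    root = Path(directory)
    if not root.is_dir():
        raise NotADirectoryError(str(root))
    return sum(
        1
        for path in root.rglob("*")
        if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES
    )


def has_expected_samples(directory: str, expected: int = 30_000) -> bool:
    """Return True if `directory` holds exactly `expected` image files."""
    return count_images(directory) == expected
```

For example, `has_expected_samples("/path/to/fake_dir")` could be run on the directory passed as fake_dir before launching the metrics script.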
- Stable Diffusion 1.5 support.
- Stable Diffusion 2.1 support.
- Stable Diffusion XL support.
- Release training code for LinFusion.
- Release evaluation code for LinFusion.
- Gradio local interface.
- We extend our gratitude to the authors of SDEdit, DemoFusion, and DistriFusion for their contributions, which inspired us a lot in applying LinFusion to high-resolution generation.
- Our evaluation code is adapted from SiD-LSG and GigaGAN.
- We thank @Adamdad, @yu-rp, and @czg1225 for valuable discussions.
If you find this repo helpful, please consider citing:
@article{liu2024linfusion,
title = {LinFusion: 1 GPU, 1 Minute, 16K Image},
author = {Liu, Songhua and Yu, Weihao and Tan, Zhenxiong and Wang, Xinchao},
year = {2024},
eprint = {2409.02097},
archivePrefix={arXiv},
primaryClass={cs.CV}
}