This repo is the official PyTorch implementation of [Improving Long-Text Alignment for Text-to-Image Diffusion Models](https://arxiv.org/abs/2410.11817) (LongAlign)
by Luping Liu<sup>1,2</sup>, Chao Du<sup>2</sup>, Tianyu Pang<sup>2</sup>, Zehan Wang<sup>2,4</sup>, Chongxuan Li<sup>3</sup>, Dong Xu<sup>1</sup>.

<sup>1</sup>The University of Hong Kong; <sup>2</sup>Sea AI Lab, Singapore; <sup>3</sup>Renmin University of China; <sup>4</sup>Zhejiang University
To improve long-text alignment for T2I diffusion models, we propose LongAlign, which features a segment-level encoding method for processing long texts and a decomposed preference optimization method for effective alignment training. For decomposed preference optimization, we find that CLIP-based preference models can be decomposed into two components: a text-relevant part and a text-irrelevant part. We then propose a reweighting strategy that assigns different weights to these two parts, reducing overfitting and enhancing alignment.
- (a) Schematic results for text embeddings. (b) Statistics of the projection scalar $\eta$ for three CLIP-based preference models. (c) The relationship between the original preference score and the two scores after decomposition.
- Generation results using our LongAlign and baselines. We highlight three key facts for each prompt and provide the evaluation results at the end.
- Generation results using different preference models, with and without our reweighting strategy.
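
To make the decomposition concrete, here is a minimal sketch of how a CLIP-style preference score can be split, via the projection scalar $\eta$, into a text-irrelevant part (along a direction shared across prompts) and a text-relevant remainder, and then reweighted. The shared direction, the weight values, and the function name are illustrative assumptions, not the exact implementation in this repo.

```python
import torch
import torch.nn.functional as F

def reweighted_preference_score(img_emb, txt_emb, common_dir, w_rel=1.0, w_irrel=0.2):
    """Illustrative decomposition of a CLIP-style preference score.

    `common_dir` is a direction shared across prompts (e.g. the mean of many
    text embeddings); `w_rel` / `w_irrel` are placeholder weights, not the
    values used in the paper.
    """
    common_dir = F.normalize(common_dir, dim=-1)
    eta = (txt_emb * common_dir).sum(-1, keepdim=True)  # projection scalar (eta)
    txt_irrel = eta * common_dir                         # text-irrelevant component
    txt_rel = txt_emb - txt_irrel                        # text-relevant component
    s_irrel = (img_emb * txt_irrel).sum(-1)              # score from the shared component
    s_rel = (img_emb * txt_rel).sum(-1)                  # score from the prompt-specific component
    # The original score equals s_rel + s_irrel; reweighting the two parts
    # reduces overfitting to the text-irrelevant component during training.
    return w_rel * s_rel + w_irrel * s_irrel
```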
```bash
pip install -r requirements.txt
# if you encounter an error with LoRA, please run `pip uninstall peft`
```
- 2-million-sample long-text & image dataset (the raw images need to be downloaded separately): https://huggingface.co/datasets/luping-liu/LongAlign
- Stable Diffusion v1.5 (downloads automatically): https://huggingface.co/stable-diffusion-v1-5/stable-diffusion-v1-5
- T5-adapter (please download this to `./model/LaVi-Bridge`): https://huggingface.co/shihaozhao/LaVi-Bridge/tree/main/t5_unet/adapter
- Denscore (downloads automatically): https://huggingface.co/luping-liu/Denscore
- longSD (please download this to `./model/longSD` or train it yourself; a download sketch using `huggingface_hub` follows this list): https://huggingface.co/luping-liu/LongSD
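
For the assets that are not fetched automatically, the sketch below pulls them with `huggingface_hub`. The `allow_patterns` filter, the dataset target path, and the resulting subfolder layout are assumptions; adjust them so the files end up where `run_unet.sh` expects.

```python
from huggingface_hub import snapshot_download

# T5-adapter weights from LaVi-Bridge -> ./model/LaVi-Bridge
snapshot_download(
    repo_id="shihaozhao/LaVi-Bridge",
    allow_patterns=["t5_unet/adapter/*"],
    local_dir="./model/LaVi-Bridge",
)

# longSD checkpoint -> ./model/longSD
snapshot_download(repo_id="luping-liu/LongSD", local_dir="./model/longSD")

# long-text & image dataset metadata (raw images are downloaded separately;
# the local path here is an assumption)
snapshot_download(
    repo_id="luping-liu/LongAlign",
    repo_type="dataset",
    local_dir="./data/LongAlign",
)
```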
```bash
# support long-text inputs
bash run_unet.sh align ct5f
# please move {args.output_dir}/s{global_step_}_lora_vis.pt --> {args.output_dir}/lora_vis.pt and so on

# preference optimization for long-text alignment
bash run_unet.sh reward test

# support LCM sampling
bash run_unet.sh lcm ct5f

# preference optimization for long-text alignment
bash run_unet.sh reward_lcm test
```
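
The long-text support above builds on segment-level encoding. As a rough sketch (not the repo's exact implementation, which additionally uses a T5 adapter and its own segment merging), a long prompt can be split into sentence-level segments, each segment encoded separately under CLIP's 77-token limit, and the per-segment embeddings concatenated:

```python
import torch
from transformers import CLIPTextModel, CLIPTokenizer

repo = "stable-diffusion-v1-5/stable-diffusion-v1-5"
tokenizer = CLIPTokenizer.from_pretrained(repo, subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained(repo, subfolder="text_encoder")

def encode_long_prompt(prompt: str) -> torch.Tensor:
    # naive sentence-level split, for illustration only
    segments = [s.strip() for s in prompt.split(".") if s.strip()]
    embeddings = []
    with torch.no_grad():
        for seg in segments:
            tokens = tokenizer(seg, truncation=True, max_length=77,
                               padding="max_length", return_tensors="pt")
            embeddings.append(text_encoder(tokens.input_ids)[0])  # (1, 77, dim)
    # concatenate per-segment embeddings along the sequence dimension
    return torch.cat(embeddings, dim=1)                           # (1, 77 * n_segments, dim)
```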
If you find this work useful for your research, please consider citing:
```bibtex
@article{liu2024improving,
  title={Improving Long-Text Alignment for Text-to-Image Diffusion Models},
  author={Luping Liu and Chao Du and Tianyu Pang and Zehan Wang and Chongxuan Li and Dong Xu},
  year={2024},
  journal={arXiv preprint arXiv:2410.11817},
}
```
This code is mainly built upon the diffusers and LaVi-Bridge repositories, which you might also find interesting.