This repo is the official PyTorch implementation of [Improving Long-Text Alignment for Text-to-Image Diffusion Models](https://arxiv.org/abs/2410.11817) (LongAlign)
by Luping Liu<sup>1,2</sup>, Chao Du<sup>2</sup>, Tianyu Pang<sup>2</sup>, Zehan Wang<sup>2,4</sup>, Chongxuan Li<sup>3</sup>, Dong Xu<sup>1</sup>.

<sup>1</sup>The University of Hong Kong; <sup>2</sup>Sea AI Lab, Singapore; <sup>3</sup>Renmin University of China; <sup>4</sup>Zhejiang University
To improve long-text alignment for T2I diffusion models, we propose LongAlign, which features a segment-level encoding method for processing long texts and a decomposed preference optimization method for effective alignment training. For decomposed preference optimization, we find that CLIP-based preference models can be decomposed into two components: a text-relevant part and a text-irrelevant part. We then propose a reweighting strategy that assigns different weights to these two parts, reducing overfitting and enhancing alignment.
- (a) Schematic results for text embeddings. (b) Statistics of the projection scalar $\eta$ for three CLIP-based preference models. (c) The relationship between the original preference score and the two scores after decomposition.
- Generation results using our LongAlign and baselines. We highlight three key facts for each prompt and provide the evaluation results at the end.
- Generation results using different preference models, with and without our reweighting strategy.
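
To make the decomposition concrete, here is a minimal sketch of how a CLIP-style preference score can be split, via the projection scalar $\eta$, into a text-irrelevant part (along a direction shared across prompts) and a text-relevant remainder, and then reweighted. The shared direction, the weight values, and the function name are illustrative assumptions, not the exact implementation in this repo.

```python
import torch
import torch.nn.functional as F

def reweighted_preference_score(img_emb, txt_emb, common_dir, w_rel=1.0, w_irrel=0.2):
    """Illustrative decomposition of a CLIP-style preference score.

    `common_dir` is a direction shared across prompts (e.g. the mean of many
    text embeddings); `w_rel` / `w_irrel` are placeholder weights, not the
    values used in the paper.
    """
    common_dir = F.normalize(common_dir, dim=-1)
    eta = (txt_emb * common_dir).sum(-1, keepdim=True)  # projection scalar (eta)
    txt_irrel = eta * common_dir                         # text-irrelevant component
    txt_rel = txt_emb - txt_irrel                        # text-relevant component
    s_irrel = (img_emb * txt_irrel).sum(-1)              # score from the shared component
    s_rel = (img_emb * txt_rel).sum(-1)                  # score from the prompt-specific component
    # The original score equals s_rel + s_irrel; reweighting the two parts
    # reduces overfitting to the text-irrelevant component during training.
    return w_rel * s_rel + w_irrel * s_irrel
```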
```bash
pip install -r requirements.txt
# if you encounter an error with LoRA, please run `pip uninstall peft`
```
- 2-million-sample long-text & image dataset (the raw images need to be downloaded separately): https://huggingface.co/datasets/luping-liu/LongAlign
- Stable Diffusion v1.5 (downloads automatically): https://huggingface.co/stable-diffusion-v1-5/stable-diffusion-v1-5
- T5-adapter (please download this to `./model/LaVi-Bridge`): https://huggingface.co/shihaozhao/LaVi-Bridge/tree/main/t5_unet/adapter
- Denscore (downloads automatically): https://huggingface.co/luping-liu/Denscore
- longSD (please download this to `./model/longSD` or train it yourself; a download sketch using `huggingface_hub` follows this list): https://huggingface.co/luping-liu/LongSD
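
For the assets that are not fetched automatically, the sketch below pulls them with `huggingface_hub`. The `allow_patterns` filter, the dataset target path, and the resulting subfolder layout are assumptions; adjust them so the files end up where `run_unet.sh` expects.

```python
from huggingface_hub import snapshot_download

# T5-adapter weights from LaVi-Bridge -> ./model/LaVi-Bridge
snapshot_download(
    repo_id="shihaozhao/LaVi-Bridge",
    allow_patterns=["t5_unet/adapter/*"],
    local_dir="./model/LaVi-Bridge",
)

# longSD checkpoint -> ./model/longSD
snapshot_download(repo_id="luping-liu/LongSD", local_dir="./model/longSD")

# long-text & image dataset metadata (raw images are downloaded separately;
# the local path here is an assumption)
snapshot_download(
    repo_id="luping-liu/LongAlign",
    repo_type="dataset",
    local_dir="./data/LongAlign",
)
```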
```bash
# support long-text inputs
bash run_unet.sh align ct5f
# please move {args.output_dir}/s{global_step_}_lora_vis.pt --> {args.output_dir}/lora_vis.pt and so on

# preference optimization for long-text alignment
bash run_unet.sh reward test

# support LCM sampling
bash run_unet.sh lcm ct5f

# preference optimization for long-text alignment
bash run_unet.sh reward_lcm test
```
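
The long-text support above builds on segment-level encoding. As a rough sketch (not the repo's exact implementation, which additionally uses a T5 adapter and its own segment merging), a long prompt can be split into sentence-level segments, each segment encoded separately under CLIP's 77-token limit, and the per-segment embeddings concatenated:

```python
import torch
from transformers import CLIPTextModel, CLIPTokenizer

repo = "stable-diffusion-v1-5/stable-diffusion-v1-5"
tokenizer = CLIPTokenizer.from_pretrained(repo, subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained(repo, subfolder="text_encoder")

def encode_long_prompt(prompt: str) -> torch.Tensor:
    # naive sentence-level split, for illustration only
    segments = [s.strip() for s in prompt.split(".") if s.strip()]
    embeddings = []
    with torch.no_grad():
        for seg in segments:
            tokens = tokenizer(seg, truncation=True, max_length=77,
                               padding="max_length", return_tensors="pt")
            embeddings.append(text_encoder(tokens.input_ids)[0])  # (1, 77, dim)
    # concatenate per-segment embeddings along the sequence dimension
    return torch.cat(embeddings, dim=1)                           # (1, 77 * n_segments, dim)
```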
If you find this work useful for your research, please consider citing:
```bibtex
@article{liu2024improving,
  title={Improving Long-Text Alignment for Text-to-Image Diffusion Models},
  author={Luping Liu and Chao Du and Tianyu Pang and Zehan Wang and Chongxuan Li and Dong Xu},
  year={2024},
  journal={arXiv preprint arXiv:2410.11817},
}
```
This code is mainly built upon the diffusers and LaVi-Bridge repositories, which you might also find interesting.