
LiFT: Leveraging Human Feedback for Text-to-Video Model Alignment

Yibin Wang, Zhiyu Tan, Junyan Wang, Xiaomeng Yang, Cheng Jin†, Hao Li†

[Fudan University]

[Shanghai Academy of Artificial Intelligence for Science]

[Australian Institute for Machine Learning, The University of Adelaide]

(†corresponding author)

Paper PDF | Project Page | Hugging Face Spaces

🔥 News

  • [2024/12/25] 🔥 We released the supplementary material of our paper.
  • [2024/12/24] 🔥🔥 We have updated our LiFT-HRA 10K/20K dataset and LiFT-Critic-v1.5. You are welcome to download the latest version!
  • [2024/12/20] 🔥 The supplementary material of our paper will be available on arXiv soon.
  • [2024/12/17] 🔥 We released our optimized evaluation prompts derived from VBench in Vbench/Vbench_full_info_opt.json so that users can reproduce the results in our paper.
  • [2024/12/17] 🔥🔥 We released our LiFT-HRA dataset (10K/20K) and the enhanced LiFT-Critic-v1.5!
  • [2024/12/16] 🔥 Our LiFT-HRA dataset (10K/20K) and the enhanced LiFT-Critic-v1.5 are coming soon!
  • [2024/12/10] 🔥🔥 We released the training and inference code.
  • [2024/12/9] 🔥 We released LiFT-Critic-v1.0 and CogVideoX-2B-LiFT. Our code is coming soon!
  • [2024/12/9] 🔥 We released the paper.
  • [2024/12/6] 🔥 We launched the project page.

📖 Abstract

Recent advancements in text-to-video (T2V) generative models have shown impressive capabilities. However, these models are still inadequate at aligning synthesized videos with human preferences (e.g., accurately reflecting text descriptions), which is particularly difficult to address because human preferences are inherently subjective and hard to formalize as objective functions. This paper therefore proposes LiFT, a novel fine-tuning method that leverages human feedback for T2V model alignment. Specifically, we first construct a Human Rating Annotation dataset, LiFT-HRA, consisting of approximately 10k human annotations, each comprising a score and the corresponding rationale. Based on this dataset, we train a reward model, LiFT-Critic, to learn a human feedback-based reward function effectively, which serves as a proxy for human judgment, measuring the alignment between given videos and human expectations. Finally, we use the learned reward function to align the T2V model by maximizing the reward-weighted likelihood. As a case study, we apply our pipeline to CogVideoX-2B and show that the fine-tuned model outperforms CogVideoX-5B across all 16 metrics, highlighting the potential of human feedback for improving the alignment and quality of synthesized videos.

teaser
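The alignment step maximizes the reward-weighted likelihood. The following is a minimal PyTorch sketch of that objective only, not the authors' training code; tensor shapes and names are illustrative assumptions.

# Sketch of the reward-weighted likelihood objective described above.
# Illustrative only; shapes and names are assumptions, not the official code.
import torch

def reward_weighted_loss(log_probs: torch.Tensor, rewards: torch.Tensor) -> torch.Tensor:
    # log_probs: per-sample log-likelihood of each generated video given its prompt, shape (batch,)
    # rewards:   critic scores for the same samples, shape (batch,)
    # Maximizing the reward-weighted likelihood == minimizing its negative.
    return -(rewards.detach() * log_probs).mean()

# Toy usage with random values
log_probs = torch.randn(4, requires_grad=True)
rewards = torch.tensor([0.2, 0.8, 0.5, 1.0])
reward_weighted_loss(log_probs, rewards).backward()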

🔧 Installation

  1. Clone this repository and navigate to the LiFT folder
git clone https://github.com/CodeGoat24/LiFT.git
cd LiFT
  2. Install packages
bash ./environment_setup.sh lift
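After the setup script finishes, a quick sanity check can confirm that PyTorch sees the GPUs. This is a minimal sketch and assumes the environment installs PyTorch with CUDA support:

# check_env.py -- optional sanity check (assumes PyTorch with CUDA is installed by the setup script)
import torch

print("torch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
print("visible GPUs:", torch.cuda.device_count())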

🚀 Inference

LiFT-Critic-13b/40b-lora Weights

Please download all public LiFT-Critic checkpoints from Hugging Face.
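One way to fetch a checkpoint is with huggingface_hub, as in the sketch below; the repo_id is a placeholder, so substitute the actual repository name listed on the Hugging Face page.

# Sketch: download a critic checkpoint with huggingface_hub.
# The repo_id below is a placeholder, not the real repository name.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="<hf-org>/LiFT-Critic-13b-lora",  # placeholder
    local_dir="./LiFT-Critic-13b-lora",       # matches --model-path used below
)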

Run

We provide some synthesized videos for quick inference in the ./demo directory.

LiFT-Critic-13b:

python LiFT-Critic/test/run_critic_13b.py --model-path ./LiFT-Critic-13b-lora

LiFT-Critic-40b:

python LiFT-Critic/test/run_critic_40b.py --model-path ./LiFT-Critic-40b-lora
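To compare the two critics on the same demo videos, the provided scripts can be driven from a small wrapper. This sketch only reuses the commands above; the output file names are illustrative.

# Sketch: run both public critic checkpoints via the provided scripts and
# save their console output for side-by-side inspection.
import subprocess

for size in ("13b", "40b"):
    cmd = [
        "python", f"LiFT-Critic/test/run_critic_{size}.py",
        "--model-path", f"./LiFT-Critic-{size}-lora",
    ]
    with open(f"critic_{size}_output.txt", "w") as log:
        subprocess.run(cmd, stdout=log, check=True)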

Examples

critic_case

💻 Training

LiFT-Critic is trained on 8 H100 GPUs, each with 80 GB of memory.

Dataset

Please download our LiFT-HRA dataset and the 1K subset of VIDGEN-1M (derived from HD-VILA) that we used in our paper.

Please put them under the ./dataset directory. The directory structure should look like this:

dataset
├── LiFT-HRA
│  ├── LiFT-HRA-data.json
│  ├── videos
├── VIDGEN
│  ├── vidgen-data.json
│  ├── videos
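Before launching training, a quick check can confirm that the files landed where the layout above expects them. This is a minimal sketch based only on that layout:

# Sketch: verify the expected dataset layout shown above.
from pathlib import Path

expected = [
    "dataset/LiFT-HRA/LiFT-HRA-data.json",
    "dataset/LiFT-HRA/videos",
    "dataset/VIDGEN/vidgen-data.json",
    "dataset/VIDGEN/videos",
]
for path in expected:
    print(("ok     " if Path(path).exists() else "missing"), path)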

Training

LiFT-Critic-13b

bash LiFT_Critic/train/train_critic_13b.sh

LiFT-Critic-40b

bash LiFT_Critic/train/train_critic_40b.sh

πŸ—“οΈ TODO

  • βœ… Release project page
  • βœ… Release paper
  • βœ… Release LiFT-Critic 13B/40B-v1.0
  • βœ… Release CogVideoX-2B-LiFT
  • βœ… Release inference code
  • βœ… Release training code
  • βœ… Release LiFT-Critic 13B/40B-v1.5
  • βœ… Release dataset LiFT-HRA 10K
  • βœ… Release dataset LiFT-HRA 20K
  • Release CogVideoX-5B-LiFT
  • Release LiFT-Critic 13B/40B-v2.0

📧 Contact

If you have any comments or questions, please open a new issue or feel free to contact Yibin Wang.

πŸ–ŠοΈ Citation

🌟 If you find our work helpful, please leave us a star and cite our paper.

@article{LiFT,
  title={LiFT: Leveraging Human Feedback for Text-to-Video Model Alignment},
  author={Wang, Yibin and Tan, Zhiyu and Wang, Junyan and Yang, Xiaomeng and Jin, Cheng and Li, Hao},
  journal={arXiv preprint arXiv:2412.04814},
  year={2024}
}

πŸ–ΌοΈ Results

CogVideoX-2B vs. CogVideoX-2B-LiFT (Ours)

cogx-1.mp4 | LiFT-1.mp4
cogx-2.mp4 | LiFT-2.mp4
cogx-3.mp4 | LiFT-3.mp4
cogx-4.mp4 | LiFT-4.mp4
cogx-5.mp4 | LiFT-5.mp4
cogx-6.mp4 | LiFT-6.mp4
cogx-7.mp4 | LiFT-7.mp4

πŸ™ Acknowledgement

Our work is based on LLaVA and VILA, thanks to all the contributors!
