
LiFT: Leveraging Human Feedback for Text-to-Video Model Alignment

Yibin Wang, Zhiyu Tan, Junyan Wang, Xiaomeng Yang, Cheng Jin†, Hao Li†

[Fudan University]

[Shanghai Academy of Artificial Intelligence for Science]

[Australian Institute for Machine Learning, The University of Adelaide]

(†corresponding author)

Paper PDF | Project Page | Hugging Face Spaces

🔥 News

  • [2024/12/25] 🔥 We released the supplementary material of our paper.
  • [2024/12/24] 🔥🔥 We have updated our LiFT-HRA 10K/20K dataset and LiFT-Critic-v1.5. You are welcome to download the latest version!
  • [2024/12/20] 🔥 The supplementary material of our paper will be available on arXiv soon.
  • [2024/12/17] 🔥 We released our optimized evaluation prompts derived from VBench in Vbench/Vbench_full_info_opt.json so that users can reproduce the results in our paper.
  • [2024/12/17] 🔥🔥 We released our LiFT-HRA dataset (10K/20K) and the enhanced LiFT-Critic-v1.5!
  • [2024/12/16] 🔥 Our LiFT-HRA dataset (10K/20K) and the enhanced LiFT-Critic-v1.5 are coming soon!
  • [2024/12/10] 🔥🔥 We released the training and inference code.
  • [2024/12/9] 🔥 We released LiFT-Critic-v1.0 and CogVideoX-2B-LiFT. Our code is coming soon!
  • [2024/12/9] 🔥 We released the paper.
  • [2024/12/6] 🔥 We launched the project page.

📖 Abstract

Recent advancements in text-to-video (T2V) generative models have shown impressive capabilities. However, these models are still inadequate at aligning synthesized videos with human preferences (e.g., accurately reflecting text descriptions), which is particularly difficult to address because human preferences are inherently subjective and hard to formalize as objective functions. This paper therefore proposes LiFT, a novel fine-tuning method that leverages human feedback for T2V model alignment. Specifically, we first construct a Human Rating Annotation dataset, LiFT-HRA, consisting of approximately 10k human annotations, each comprising a score and the corresponding rationale. Based on this dataset, we train a reward model, LiFT-Critic, to learn a human feedback-based reward function effectively, which serves as a proxy for human judgment, measuring the alignment between given videos and human expectations. Finally, we use the learned reward function to align the T2V model by maximizing the reward-weighted likelihood. As a case study, we apply our pipeline to CogVideoX-2B and show that the fine-tuned model outperforms CogVideoX-5B across all 16 metrics, highlighting the potential of human feedback for improving the alignment and quality of synthesized videos.

teaser
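The alignment step maximizes the reward-weighted likelihood. The following is a minimal PyTorch sketch of that objective only, not the authors' training code; tensor shapes and names are illustrative assumptions.

# Sketch of the reward-weighted likelihood objective described above.
# Illustrative only; shapes and names are assumptions, not the official code.
import torch

def reward_weighted_loss(log_probs: torch.Tensor, rewards: torch.Tensor) -> torch.Tensor:
    # log_probs: per-sample log-likelihood of each generated video given its prompt, shape (batch,)
    # rewards:   critic scores for the same samples, shape (batch,)
    # Maximizing the reward-weighted likelihood == minimizing its negative.
    return -(rewards.detach() * log_probs).mean()

# Toy usage with random values
log_probs = torch.randn(4, requires_grad=True)
rewards = torch.tensor([0.2, 0.8, 0.5, 1.0])
reward_weighted_loss(log_probs, rewards).backward()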

🔧 Installation

  1. Clone this repository and navigate to the LiFT folder
git clone https://github.com/CodeGoat24/LiFT.git
cd LiFT
  2. Install packages
bash ./environment_setup.sh lift
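After the setup script finishes, a quick sanity check can confirm that PyTorch sees the GPUs. This is a minimal sketch and assumes the environment installs PyTorch with CUDA support:

# check_env.py -- optional sanity check (assumes PyTorch with CUDA is installed by the setup script)
import torch

print("torch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
print("visible GPUs:", torch.cuda.device_count())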

🚀 Inference

LiFT-Critic-13b/40b-lora Weights

Please download all public LiFT-Critic checkpoints from Hugging Face.
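One way to fetch a checkpoint is with huggingface_hub, as in the sketch below; the repo_id is a placeholder, so substitute the actual repository name listed on the Hugging Face page.

# Sketch: download a critic checkpoint with huggingface_hub.
# The repo_id below is a placeholder, not the real repository name.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="<hf-org>/LiFT-Critic-13b-lora",  # placeholder
    local_dir="./LiFT-Critic-13b-lora",       # matches --model-path used below
)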

Run

We provide some synthesized videos for quick inference in the ./demo directory.

LiFT-Critic-13b:

python LiFT-Critic/test/run_critic_13b.py --model-path ./LiFT-Critic-13b-lora

LiFT-Critic-40b:

python LiFT-Critic/test/run_critic_40b.py --model-path ./LiFT-Critic-40b-lora
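To compare the two critics on the same demo videos, the provided scripts can be driven from a small wrapper. This sketch only reuses the commands above; the output file names are illustrative.

# Sketch: run both public critic checkpoints via the provided scripts and
# save their console output for side-by-side inspection.
import subprocess

for size in ("13b", "40b"):
    cmd = [
        "python", f"LiFT-Critic/test/run_critic_{size}.py",
        "--model-path", f"./LiFT-Critic-{size}-lora",
    ]
    with open(f"critic_{size}_output.txt", "w") as log:
        subprocess.run(cmd, stdout=log, check=True)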

Examples

critic_case

💻 Training

LiFT-Critic is trained on 8 H100 GPUs, each with 80 GB of memory.

Dataset

Please download our LiFT-HRA dataset and the 1K subset of VIDGEN-1M (derived from HD-VILA) that we used in our paper.

Please put them under the ./dataset directory. The directory structure should look like this:

dataset
├── LiFT-HRA
│  ├── LiFT-HRA-data.json
│  ├── videos
├── VIDGEN
│  ├── vidgen-data.json
│  ├── videos
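Before launching training, a quick check can confirm that the files landed where the layout above expects them. This is a minimal sketch based only on that layout:

# Sketch: verify the expected dataset layout shown above.
from pathlib import Path

expected = [
    "dataset/LiFT-HRA/LiFT-HRA-data.json",
    "dataset/LiFT-HRA/videos",
    "dataset/VIDGEN/vidgen-data.json",
    "dataset/VIDGEN/videos",
]
for path in expected:
    print(("ok     " if Path(path).exists() else "missing"), path)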

Training

LiFT-Critic-13b

bash LiFT_Critic/train/train_critic_13b.sh

LiFT-Critic-40b

bash LiFT_Critic/train/train_critic_40b.sh

πŸ—“οΈ TODO

  • βœ… Release project page
  • βœ… Release paper
  • βœ… Release LiFT-Critic 13B/40B-v1.0
  • βœ… Release CogVideoX-2B-LiFT
  • βœ… Release inference code
  • βœ… Release training code
  • βœ… Release LiFT-Critic 13B/40B-v1.5
  • βœ… Release dataset LiFT-HRA 10K
  • βœ… Release dataset LiFT-HRA 20K
  • Release CogVideoX-5B-LiFT
  • Release LiFT-Critic 13B/40B-v2.0

📧 Contact

If you have any comments or questions, please open a new issue or feel free to contact Yibin Wang.

πŸ–ŠοΈ Citation

🌟 If you find our work helpful, please leave us a star and cite our paper.

@article{LiFT,
  title={LiFT: Leveraging Human Feedback for Text-to-Video Model Alignment},
  author={Wang, Yibin and Tan, Zhiyu and Wang, Junyan and Yang, Xiaomeng and Jin, Cheng and Li, Hao},
  journal={arXiv preprint arXiv:2412.04814},
  year={2024}
}

πŸ–ΌοΈ Results

CogVideoX-2B vs. CogVideoX-2B-LiFT (Ours)

cogx-1.mp4 | LiFT-1.mp4
cogx-2.mp4 | LiFT-2.mp4
cogx-3.mp4 | LiFT-3.mp4
cogx-4.mp4 | LiFT-4.mp4
cogx-5.mp4 | LiFT-5.mp4
cogx-6.mp4 | LiFT-6.mp4
cogx-7.mp4 | LiFT-7.mp4

πŸ™ Acknowledgement

Our work is based on LLaVA and VILA, thanks to all the contributors!
