🎬🎨 VideoRepair: Improving Text-to-Video Generation via Misalignment Evaluation and Localized Refinement
🚨 All code will be released by the first week of December. Stay tuned!
✨ VideoRepair can (1) detect misalignments by generating fine-grained evaluation questions and answering them, (2) plan the refinement, (3) decompose the video into regions, and finally (4) conduct localized refinement.
You can install all required packages from requirements.txt:
conda create -n videorepair python==3.10
conda activate videorepair
pip install -r requirements.txt
Additionally, for Semantic-SAM, you need to install detectron2 as below:
python -m pip install 'git+https://github.com/MaureenZOU/detectron2-xyz.git'
VideoRepair is based on GPT-4 / GPT-4o, so you need to set up your Azure OpenAI API config in the files below. You can find your keys in the Azure Portal. We recommend using python-dotenv to store and load your keys.
DSG/openai_utils.py
DSG/dsg_questions_gen.py
DSG/query_utils.py
DSG/vqa_utils.py
client = AzureOpenAI(
    azure_endpoint="<your Azure OpenAI endpoint>",
    api_key="<your API key>",
    api_version="<your API version>",
)
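If you use python-dotenv as recommended above, a minimal sketch looks like this (the .env variable names below are our own placeholders, not names required by the repo):

# .env (keep out of version control)
# AZURE_OPENAI_ENDPOINT=...
# AZURE_OPENAI_API_KEY=...
# AZURE_OPENAI_API_VERSION=...

import os
from dotenv import load_dotenv
from openai import AzureOpenAI

load_dotenv()  # load variables from a local .env file into the environment

client = AzureOpenAI(
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
    api_key=os.getenv("AZURE_OPENAI_API_KEY"),
    api_version=os.getenv("AZURE_OPENAI_API_VERSION"),
)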
Place all downloaded models in the ./checkpoints directory. The directory structure should look like below:
./checkpoints
├── blip2-opt-2.7b
├── t2v-turbo
│ ├── unet_lora.pt
│ ├── inference_t2v_512_v2.0.yaml # downloaded from T2V-turbo official repo
├── VideoCrafter
│ ├── model.ckpt
├── ssam
│ ├── swinl_only_sam_many2many.pth
You can download pre-trained models as below:
- T2V-turbo
- VideoCrafter2
- MolmoE-1B-0924
- Semantic-SAM (L)
- BLIP-BLEU for Video Ranking
git lfs install
git clone https://huggingface.co/Salesforce/blip2-opt-2.7b
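If you prefer to fetch the BLIP-2 checkpoint programmatically instead of via git-lfs, a sketch using huggingface_hub is shown below (the target path simply mirrors the ./checkpoints layout above; the other checkpoints come from their respective repos):

from huggingface_hub import snapshot_download

# Download blip2-opt-2.7b directly into the expected ./checkpoints layout.
snapshot_download(
    repo_id="Salesforce/blip2-opt-2.7b",
    local_dir="./checkpoints/blip2-opt-2.7b",
)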
We provide a demo script (run_demo.sh) for your own prompt! The demo uses main_iter_demo.py:
output_root="your output root"
prompt="your own prompt"

# --model: base T2V model
# --seed=123: global random seed (used for the initial video generation)
# --selection_score: video ranking method
# --round: number of refinement rounds
# --seed=369: localized generation seed
CUDA_VISIBLE_DEVICES=1,2 python main_iter_demo.py --prompt="$prompt" \
    --model="t2vturbo" \
    --output_root="$output_root" \
    --seed=123 \
    --load_molmo \
    --selection_score='dsg_blip' \
    --round=1 \
    --seed=369
VideoRepair is tested on EvalCrafter and T2V-CompBench.
We provide our question sets in ./datasets. The structure looks like below:
./datasets
├── compbench
│ ├── consistent_attr.json
│ ├── numeracy.json
│ ├── spatial_relationship.json
├── evalcrafter
│ ├── dsg_action.json
│ ├── dsg_color.json
│ ├── dsg_count.json
│ ├── dsg_none.json
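As a quick sanity check, you can load one of the provided question-set files as below (the exact JSON schema is not documented here, so this sketch only inspects it generically):

import json

# Peek at one of the provided question-set files (path from the tree above).
with open("./datasets/evalcrafter/dsg_count.json") as f:
    data = json.load(f)

# Inspect the top-level structure without assuming a schema.
print(type(data))
if isinstance(data, dict):
    print(list(data)[:3])   # first few keys
elif isinstance(data, list):
    print(data[:1])         # first entry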
Based on the above question sets, you can run the benchmarks as follows:
output_root="your output path"              # output path
eval_sections=("count" "action" "color")    # evaluation dimensions for each benchmark

# --model: T2V model backbone
# --selection_score: video ranking metric
# --seed: random seed
# --round: iteration round
# --k: number of video candidates
# --div_seeds: use a different seed per iterative round
for section in "${eval_sections[@]}"
do
    CUDA_VISIBLE_DEVICES=1,2,3 python main_iter.py \
        --output_root="$output_root" \
        --eval_section="$section" \
        --model='t2vturbo' \
        --load_molmo \
        --selection_score='dsg_blip' \
        --seed=123 \
        --round=1 \
        --k=10 \
        --div_seeds
done
- Release the full code.
💗 If you find VideoRepair useful, citing our paper would be the best way to support us!
@article{lee2024videorepair,
title={VideoRepair: Improving Text-to-Video Generation via Misalignment Evaluation and Localized Refinement},
author={Lee, Daeun and Yoon, Jaehong and Cho, Jaemin and Bansal, Mohit},
journal={arXiv preprint arXiv:2411.15115},
year={2024}
}