🎬🎨 VideoRepair: Improving Text-to-Video Generation via Misalignment Evaluation and Localized Refinement
🚨 All code will be released by the first week of December. Stay tuned!
✨ VideoRepair can (1) detect misalignments by generating fine-grained evaluation questions and answering them, (2) plan the refinement, (3) decompose the video into regions, and finally (4) conduct localized refinement.
You can install all required packages from requirements.txt:
conda create -n videorepair python==3.10
conda activate videorepair
pip install -r requirements.txt
Additionally, for Semantic-SAM, you need to install detectron2 as below:
python -m pip install 'git+https://github.com/MaureenZOU/detectron2-xyz.git'
VideoRepair is based on GPT-4 / GPT-4o, so you need to set up your Azure OpenAI API config in the files below. You can find your keys in the Azure Portal. We recommend using python-dotenv to store and load your keys.
DSG/openai_utils.py
DSG/dsg_questions_gen.py
DSG/query_utils.py
DSG/vqa_utils.py
client = AzureOpenAI(
    azure_endpoint="<your Azure OpenAI endpoint>",
    api_key="<your API key>",
    api_version="<your API version>",
)
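If you use python-dotenv as recommended above, a minimal sketch looks like this (the .env variable names below are our own placeholders, not names required by the repo):

# .env (keep out of version control)
# AZURE_OPENAI_ENDPOINT=...
# AZURE_OPENAI_API_KEY=...
# AZURE_OPENAI_API_VERSION=...

import os
from dotenv import load_dotenv
from openai import AzureOpenAI

load_dotenv()  # load variables from a local .env file into the environment

client = AzureOpenAI(
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
    api_key=os.getenv("AZURE_OPENAI_API_KEY"),
    api_version=os.getenv("AZURE_OPENAI_API_VERSION"),
)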
Place all downloaded models in the ./checkpoints directory. The directory structure should look like below:
./checkpoints
├── blip2-opt-2.7b
├── t2v-turbo
│ ├── unet_lora.pt
│ ├── inference_t2v_512_v2.0.yaml # downloaded from T2V-turbo official repo
├── VideoCrafter
│ ├── model.ckpt
├── ssam
│ ├── swinl_only_sam_many2many.pth
You can download pre-trained models as below:
- T2V-turbo
- VideoCrafter2
- MolmoE-1B-0924
- Semantic-SAM (L)
- BLIP-BLEU for Video Ranking
git lfs install
git clone https://huggingface.co/Salesforce/blip2-opt-2.7b
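If you prefer to fetch the BLIP-2 checkpoint programmatically instead of via git-lfs, a sketch using huggingface_hub is shown below (the target path simply mirrors the ./checkpoints layout above; the other checkpoints come from their respective repos):

from huggingface_hub import snapshot_download

# Download blip2-opt-2.7b directly into the expected ./checkpoints layout.
snapshot_download(
    repo_id="Salesforce/blip2-opt-2.7b",
    local_dir="./checkpoints/blip2-opt-2.7b",
)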
We provide a demo script (run_demo.sh) for your own prompt! The demo uses main_iter_demo.py:
output_root="your output root"
prompt="your own prompt"

# --model: base T2V model
# --seed=123: global random seed (used for the initial video generation)
# --selection_score: video ranking method
# --round: number of refinement rounds
# --seed=369: localized generation seed
CUDA_VISIBLE_DEVICES=1,2 python main_iter_demo.py --prompt="$prompt" \
    --model="t2vturbo" \
    --output_root="$output_root" \
    --seed=123 \
    --load_molmo \
    --selection_score='dsg_blip' \
    --round=1 \
    --seed=369
VideoRepair is tested on EvalCrafter and T2V-CompBench.
We provide our question sets in ./datasets. The structure looks like below:
./datasets
├── compbench
│ ├── consistent_attr.json
│ ├── numeracy.json
│ ├── spatial_relationship.json
├── evalcrafter
│ ├── dsg_action.json
│ ├── dsg_color.json
│ ├── dsg_count.json
│ ├── dsg_none.json
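As a quick sanity check, you can load one of the provided question-set files as below (the exact JSON schema is not documented here, so this sketch only inspects it generically):

import json

# Peek at one of the provided question-set files (path from the tree above).
with open("./datasets/evalcrafter/dsg_count.json") as f:
    data = json.load(f)

# Inspect the top-level structure without assuming a schema.
print(type(data))
if isinstance(data, dict):
    print(list(data)[:3])   # first few keys
elif isinstance(data, list):
    print(data[:1])         # first entry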
Based on the above question sets, you can run the benchmarks as follows:
output_root="your output path"              # output path
eval_sections=("count" "action" "color")    # evaluation dimensions for each benchmark

# --model: T2V model backbone
# --selection_score: video ranking metric
# --seed: random seed
# --round: iteration round
# --k: number of video candidates
# --div_seeds: use a different seed per iterative round
for section in "${eval_sections[@]}"
do
    CUDA_VISIBLE_DEVICES=1,2,3 python main_iter.py \
        --output_root="$output_root" \
        --eval_section="$section" \
        --model='t2vturbo' \
        --load_molmo \
        --selection_score='dsg_blip' \
        --seed=123 \
        --round=1 \
        --k=10 \
        --div_seeds
done
- Release the full code.
💗 If you find VideoRepair useful, citing our paper would be the best way to support us!
@article{lee2024videorepair,
title={VideoRepair: Improving Text-to-Video Generation via Misalignment Evaluation and Localized Refinement},
author={Lee, Daeun and Yoon, Jaehong and Cho, Jaemin and Bansal, Mohit},
journal={arXiv preprint arXiv:2411.15115},
year={2024}
}