Motion Inversion for Video Customization

Customize the motion in your videos under 10 minutes of training time.

Luozhou Wang*, Ziyang Mai*, Guibao Shen, Yixun Liang, Xin Tao, Pengfei Wan, Di Zhang, Yijun Li, Yingcong Chen†

HKUST(GZ), HKUST, Kuaishou Technology, Adobe Research.

*Indicates Equal Contribution. †Indicates Corresponding Author.

In this work, we present a novel approach for motion customization in video generation, addressing the widespread gap in the exploration of motion representation within video generative models. Recognizing the unique challenges posed by the spatiotemporal nature of video, our method introduces Motion Embeddings, a set of explicit, temporally coherent embeddings derived from a given video. These embeddings are designed to integrate seamlessly with the temporal transformer modules of video diffusion models, modulating self-attention computations across frames without compromising spatial integrity. Our approach provides a compact and efficient solution to motion representation, utilizing two types of embeddings: a Motion Query-Key Embedding to modulate the temporal attention map and a Motion Value Embedding to modulate the attention values. Additionally, we introduce an inference strategy that excludes spatial dimensions from the Motion Query-Key Embedding and applies a differential operation to the Motion Value Embedding, both designed to debias appearance and ensure the embeddings focus solely on motion. Our contributions include the introduction of a tailored motion embedding for customization tasks and a demonstration of the practical advantages and effectiveness of our method through extensive experiments.

📰 News

[2024.10.22] The website of Motion Inversion is ready!
[2024.10.20] Hugging face demo is ready. Click here.
[2024.10.15] We improve the structure of motion embedding and obtain better performance. Check our latest paper.
[2024.04.03] We released the configuration files, inference code sample.
[2024.04.01] We will soon release the configuration files, inference code, and motion embedding weights. Please stay tuned for updates!
[2024.03.31] We have released the project page, arXiv paper, and training code.

🚧 Todo List

Released code for the UNet3D model (ZeroScope, ModelScope, VideoCrafter2).
Release detailed guidance for training and inference.
Release Gradio demo.
Launch Website

Installation

pip install -r requirements.txt

Also, xformers is required in this repository. Please check here for detailed installation guidance.

Training

To start training, first download the ZeroScope weights and specify the path in the config file. Then, run the following commands to begin training:

python train.py --config ./configs/config.yaml

We provide a sample config file in config.py. Note for various motion types and editing requirements, selecting the appropriate loss function impacts the outcome. In scenarios where only the camera motion from the source video is desired, without the need to retain information about the objects in the source, it is advisable to employ DebiasedHybridLoss. Similarly, when editing objects that undergo significant deformation, DebiasedTemporalLoss is recommended. For straightforward cross-categorical editing, as described in DMT, utilizing BaseLoss function suffices.

Inference

After training, you can easily load motion embeddings for video generation as follows:

python inference.py

If you want to test more checkpoints, don't forget to modify the paths asigned in the code, including embedding_dir and video_round .

Acknowledgement

MotionDirector: We followed their implementation of loss design and techniques to reduce computation resources.
ZeroScope: The pretrained video checkpoint we used in our main paper.
AnimateDiff: The pretrained video checkpoint we used in our main paper.
Latte: A video generation model with a similar architecture to Sora.
Open-Sora: A video generation model with a similar architecture to Sora.
VideoCrafter2: A video generation model with a similar architect ure to ZeroScope and stronger ability for high-quality generation.
VideoCrafter2 (diffuser checkpoint): A checkpoint file that can seamlessly integrate in our framework with one line code.
Textual Inversion: A image generation method that provides simple yet efficient way for attention injection.

We are grateful for their exceptional work and generous contribution to the open-source community.

Citation

@misc{wang2024motioninversionvideocustomization,
     title={Motion Inversion for Video Customization}, 
     author={Luozhou Wang and Ziyang Mai and Guibao Shen and Yixun Liang and Xin Tao and Pengfei Wan and Di Zhang and Yijun Li and Yingcong Chen},
     year={2024},
     eprint={2403.20193},
     archivePrefix={arXiv},
     primaryClass={cs.CV},
     url={https://arxiv.org/abs/2403.20193}, 
}

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

Motion Inversion for Video Customization

Customize the motion in your videos under 10 minutes of training time.

📰 News

🚧 Todo List

Contents

Installation

Training

Inference

Acknowledgement

Citation

Files

README.md

Latest commit

History

README.md

File metadata and controls

Motion Inversion for Video Customization

Customize the motion in your videos under 10 minutes of training time.

📰 News

🚧 Todo List

Contents

Installation

Training

Inference

Acknowledgement

Citation