Paper | Project Page | Dataset | Video
Text2Performer synthesizes human videos from text descriptions alone.
📖 For more visual results, please check out our project page.
Clone this repo:
git clone https://github.com/yumingj/Text2Performer.git
cd Text2Performer
Dependencies:
conda env create -f env.yaml
conda activate text2performer
In this work, we contribute the Fashion-Text2Video Dataset, a human video dataset with rich label and text annotations.
You can download our processed dataset from this Google Drive. After downloading, unzip the file and put the contents under the datasets folder with the following structure:
./datasets
├── FashionDataset_frames_crop
│   ├── xxxxxx
│   │   ├── 000.png
│   │   ├── 001.png
│   │   └── ...
│   ├── xxxxxx
│   └── xxxxxx
├── train_frame_num.txt
├── val_frame_num.txt
├── test_frame_num.txt
├── moving_frames.npy
├── captions_app.json
├── caption_motion_template.json
├── action_label
│   ├── xxxxxx.txt
│   ├── xxxxxx.txt
│   ├── ...
│   └── xxxxxx.txt
└── shhq_dataset  % optional
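The layout above can be sanity-checked after unzipping. This is a minimal sketch, not part of the official codebase: the entry names come from the tree above, and `missing_entries` is a hypothetical helper (shhq_dataset is optional and therefore not checked).

```python
import os

# Top-level entries expected under ./datasets, per the tree above.
REQUIRED = [
    "FashionDataset_frames_crop",
    "train_frame_num.txt",
    "val_frame_num.txt",
    "test_frame_num.txt",
    "moving_frames.npy",
    "captions_app.json",
    "caption_motion_template.json",
    "action_label",
]

def missing_entries(root="./datasets", required=REQUIRED):
    """Return the expected entries that are absent under `root`."""
    return [name for name in required
            if not os.path.exists(os.path.join(root, name))]

if __name__ == "__main__":
    missing = missing_entries()
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("Dataset layout looks complete.")
```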
Pretrained models can be downloaded from Google Drive. Unzip the file and put the checkpoints under the pretrained_models folder with the following structure:
pretrained_models
├── sampler_high_res.pth
├── video_trans_high_res.pth
└── vqgan_decomposed_high_res.pth
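Before running inference, it can help to confirm the three checkpoints unpacked correctly. A minimal sketch (the file names come from the tree above; `checkpoint_report` is a hypothetical helper, and the folder location is the default assumed here):

```python
import os

# The three checkpoints listed in the tree above.
CHECKPOINTS = [
    "sampler_high_res.pth",
    "video_trans_high_res.pth",
    "vqgan_decomposed_high_res.pth",
]

def checkpoint_report(root="./pretrained_models"):
    """Map each expected checkpoint to its size in bytes, or None if absent."""
    report = {}
    for name in CHECKPOINTS:
        path = os.path.join(root, name)
        report[name] = os.path.getsize(path) if os.path.isfile(path) else None
    return report

if __name__ == "__main__":
    for name, size in checkpoint_report().items():
        print(name, "->", f"{size} bytes" if size is not None else "MISSING")
```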
After downloading the pretrained models, you can use generate_long_video.ipynb to generate videos.
Train the decomposed VQGAN. If you want to skip the training of this network, you can download our pretrained model from here.
For better performance, we also use data from the SHHQ dataset to train this stage.
python -m torch.distributed.launch --nproc_per_node=4 --master_port=29596 train_vqvae_iter_dist.py -opt ./configs/vqgan/vqgan_decompose_high_res.yml --launcher pytorch
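On recent PyTorch versions, `torch.distributed.launch` is deprecated in favor of `torchrun`. An equivalent invocation would look like the sketch below; this is an assumption, not a command from the repo, and it only works if the training script reads the local rank from the `LOCAL_RANK` environment variable rather than a `--local_rank` argument. The same substitution applies to the other distributed training command.

```shell
torchrun --nproc_per_node=4 --master_port=29596 train_vqvae_iter_dist.py \
    -opt ./configs/vqgan/vqgan_decompose_high_res.yml --launcher pytorch
```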
Train the video transformer. If you want to skip the training of this network, you can download our pretrained model from here.
python -m torch.distributed.launch --nproc_per_node=4 --master_port=29596 train_dist.py -opt ./configs/video_transformer/video_trans_high_res.yml --launcher pytorch
Train the appearance transformer. If you want to skip the training of this network, you can download our pretrained model from here.
python train_sampler.py -opt ./configs/sampler/sampler_high_res.yml
If you find this work useful for your research, please consider citing our paper:
@inproceedings{jiang2023text2performer,
  title={Text2Performer: Text-Driven Human Video Generation},
  author={Jiang, Yuming and Yang, Shuai and Koh, Tong Liang and Wu, Wayne and Loy, Chen Change and Liu, Ziwei},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  year={2023}
}
Distributed under the S-Lab License. See LICENSE for more information.