- Input: added text semantics, replaced style with speaker ID, and updated the feature representations of audio and gestures
- Model: optimized the network architecture; seed gesture frames are used without speech
- Code: data is read from `h5` files instead of `lmdb`, since `lmdb` is very space-consuming; updated the configuration file
- Evaluation: We submitted the system to the GENEA Challenge 2023 to be evaluated alongside other models. DiffuseStyleGesture+ is competitive. We further analyze the challenge results in the paper.
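For orientation only, the sketch below shows what reading cached features from a single h5 file looks like; the file name and the keys (`audio`, `gesture`) are hypothetical placeholders, not the repo's actual layout.

```python
# Minimal sketch of reading cached features from an h5 file (the space-friendly
# replacement for lmdb). File name and keys are hypothetical placeholders.
import h5py

with h5py.File("BEAT_dataset/processed/train_v0.h5", "r") as h5:
    for clip_name in list(h5.keys())[:3]:      # peek at the first few clips
        clip = h5[clip_name]
        audio = clip["audio"][:]               # e.g. per-frame audio features
        gesture = clip["gesture"][:]           # e.g. per-frame pose features
        print(clip_name, audio.shape, gesture.shape)
```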
In addition to the requirements of DiffuseStyleGesture, you need to install the following packages:
pip install pydub praat-parselmouth essentia TextGrid h5py
We tested the code on an NVIDIA GeForce RTX 4090 with an updated torch version (optional):
pip install torch==1.9.0+cu111 -f https://download.pytorch.org/whl/torch_stable.html
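A quick check (not specific to this repo) confirms that the installed torch build can see the GPU:

```python
import torch

# Verify the torch build and that a CUDA device is visible.
print(torch.__version__)
print(torch.cuda.is_available())
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))
```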
We describe the pipeline on two datasets (BEAT and TWH); you can choose whichever dataset you wish.
Download the files, such as the pre-trained models, from Baidu cloud or Google cloud.
cd ./BEAT-TWH-main/mydiffusion_beat_twh
The BEAT dataset is so big that we only selected speaker 2 (male) and speaker 10 (female) for training.
- Put `model001080000.pt` in `./BEAT_mymodel4_512_v0/`.
- Put `audio_BEAT`, `gesture_BEAT`, and `text_BEAT` from `./BEAT/test_data/` of the downloaded files into `./BEAT_dataset/processed/`.
Run:
python sample.py --config=./configs/DiffuseStyleGesture.yml --gpu 0 --model_path './BEAT_mymodel4_512_v0/model001080000.pt' --max_len 0 --tst_prefix '2_scott_0_1_1'
You will get the result at `./BEAT-TWH-main/mydiffusion_beat_twh/BEAT_mymodel4_512_v0/sample_dir_model001080000/2_scott_0_1_1_generated.bvh`.
You can visualize it using Blender to get the following result (to visualize a bvh file with Blender, see this issue and this tutorial video):
0001-1923.mp4
On the left is the ground truth (GT), in the middle is the gesture generated above, and on the right is the result of retraining after extracting motion features with the ZEGGS bvh library's motion data processing method (notice that the endings are off by a segment).
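Before (or instead of) opening the result in Blender, you can sanity-check the generated `.bvh` with a few lines of plain Python. This only parses the standard `Frames:` and `Frame Time:` fields of the BVH MOTION header, assuming the output path from the sampling step above.

```python
# Read the frame count and frame time from the generated BVH file
# (path relative to ./BEAT-TWH-main/mydiffusion_beat_twh).
bvh_path = "./BEAT_mymodel4_512_v0/sample_dir_model001080000/2_scott_0_1_1_generated.bvh"

with open(bvh_path) as f:
    lines = f.readlines()

n_frames = int(next(l for l in lines if l.strip().startswith("Frames:")).split(":")[1])
frame_time = float(next(l for l in lines if l.strip().startswith("Frame Time:")).split(":")[1])
print(f"{n_frames} frames at {1.0 / frame_time:.1f} fps, {n_frames * frame_time:.1f} s of motion")
```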
- Put `model001200000.pt` in `./TWH_mymodel4_512_v0/`.
- Put `audio_TWH`, `gesture_TWH`, `text_TWH`, and `metadata.csv` from `./TWH/test_data/` of the downloaded files into `./TWH_dataset/processed/`.
python sample.py --config=./configs/DiffuseStyleGesture.yml --dataset TWH --gpu 0 --model_path './TWH_mymodel4_512_v0/model001200000.pt' --max_len 0 --tst_prefix 'val_2023_v0_014_main-agent'
You will get the result at `./BEAT-TWH-main/mydiffusion_beat_twh/TWH_mymodel4_512_v0/sample_dir_model001200000/val_2023_v0_014_main-agent.bvh`.
0001-1800.mp4
GT on the left, generated gestures on the right.
Here we use a text file `tts.txt` and a TTS audio file `tts.mp3` generated by Azure as an example.
Refer to here to install gentle, which is used to align the text and audio.
Run:
cd ./BEAT-TWH-main/data
python3 "...your gentle path/gentle/align.py" "tts.mp3" "tts.txt" -o "tts_align.txt"
python process_text.py
ffmpeg -i tts.mp3 tts.wav
You will get `tts.wav`, `tts_align.txt`, and `tts_align_process.tsv`, whose text format is the same as TWH.
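For reference, the aligned transcript is a plain tab-separated file; a minimal reading sketch, assuming one word per row with its start and end times in seconds (the TWH transcript layout), is:

```python
# Print word-level timings from the aligned transcript.
# Assumed columns: start time (s), end time (s), word (as in the TWH .tsv files).
import csv

with open("tts_align_process.tsv") as f:
    for start, end, word in csv.reader(f, delimiter="\t"):
        print(f"{float(start):6.2f}  {float(end):6.2f}  {word}")
```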
Download the WavLM Large and crawl-300d-2M.vec.
You can choose TWH or BEAT dataset.
cd ../mydiffusion_beat_twh
python sample.py --config=./configs/DiffuseStyleGesture.yml --dataset TWH --gpu 0 --model_path './TWH_mymodel4_512_v0/model001200000.pt' --max_len 0 --wav_path ../data/tts.wav --txt_path ../data/tts_align_process.tsv --wavlm_path "... your path/WavLM/WavLM-Large.pt" --word2vector_path "... your path/crawl-300d-2M.vec"
It takes about three minutes to load the word2vector model.
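Most of that time is spent parsing the 2-million-word `crawl-300d-2M.vec` text file. If you want to inspect the vectors on their own, gensim can load the same file; this is only an illustration, not necessarily how the repo loads them.

```python
# Load the fastText vectors (word2vec text format, ~2M words x 300 dims). Slow.
from gensim.models import KeyedVectors

wv = KeyedVectors.load_word2vec_format("crawl-300d-2M.vec", binary=False)
print(wv["hello"].shape)   # (300,)
```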
0001-0247.mp4
On the left are the gestures generated using the BEAT dataset, and on the right are the results trained on TWH.
Here we only use one or two files to illustrate the data processing and training process.
You can get all the data from official BEAT and TWH.
Put the downloaded `./BEAT/train` data into the `./BEAT_dataset/source/` folder.
Run:
cd ./BEAT-TWH-main/process/
python process_BEAT_bvh.py ../../BEAT_dataset/source/ ../../BEAT_dataset/processed/ None None "v0" "step1" "cuda:0"
python process_BEAT_bvh.py ../../BEAT_dataset/source/ ../../BEAT_dataset/processed/ "... your path/WavLM/WavLM-Large.pt" "... your path/crawl-300d-2M.vec" "v0" "step3" "cuda:0"
python process_BEAT_bvh.py ../../BEAT_dataset/source/ ../../BEAT_dataset/processed/ None None "v0" "step4" "cuda:0"
python calculate_gesture_statistics.py --dataset BEAT --version "v0"
Here `step1` checks whether the number of frames recorded in BEAT matches the actual number of motion frames, `step3` processes the gestures, text, and speech to generate the corresponding features, and `step4` integrates all the generated features into one h5 file for easy training. You can modify the `cuda` device as needed.
Then you should get the mean, std, and `.h5` files in `./BEAT-TWH-main/process/`.
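For intuition, the statistics step boils down to a per-dimension mean and std over all training gesture frames. A minimal sketch follows, assuming a hypothetical h5 layout and output file names (not the repo's exact ones):

```python
# Sketch: per-dimension mean/std over all gesture clips in the processed h5 file.
# The "gesture" key and the file names are hypothetical placeholders.
import h5py
import numpy as np

clips = []
with h5py.File("../../BEAT_dataset/processed/train_v0.h5", "r") as h5:
    for name in h5.keys():
        clips.append(h5[name]["gesture"][:])     # (frames, feature_dim) per clip

frames = np.concatenate(clips, axis=0)
np.save("gesture_mean_v0.npy", frames.mean(axis=0))
np.save("gesture_std_v0.npy", frames.std(axis=0) + 1e-6)  # avoid division by zero later
```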
cd ../mydiffusion_beat_twh
python end2end.py --config=./configs/DiffuseStyleGesture.yml --gpu 0
The model and opt are saved in `./BEAT-TWH-main/mydiffusion_beat_twh/BEAT_mymodel4_512_v0/`.
You can adjust the parameter settings in `./configs/DiffuseStyleGesture.yml`, e.g. the batch size.
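If you are unsure which fields the config exposes, a quick peek with PyYAML avoids guessing; this is just an illustration, and the field names depend on the actual YAML.

```python
# List the configurable fields; "batch_size" is only the example mentioned above.
import yaml

with open("./configs/DiffuseStyleGesture.yml") as f:
    cfg = yaml.safe_load(f)
print(sorted(cfg.keys()))
```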
If you want to train on more data from BEAT, make sure to download the latest BEAT data (see here). We found many problems with earlier BEAT releases, such as T-poses facing in different directions, unsynchronized modalities, missing data, and mismatched motion frame counts. `step2` addresses the differing T-pose orientations, but without root normalization the result is not good, so just download the latest BEAT and skip this step.
We trained on all the data in the GENEA Challenge 2023 training dataset; the data has been compiled by the GENEA organizers, see here.
Put the downloaded `./TWH/train` data into the `./TWH_dataset/source/` folder.
Run:
cd ../process/
python process_TWH_bvh.py --dataroot "../../TWH_dataset/source/" --save_path "../../TWH_dataset/processed/" --wavlm_path "... your path/WavLM/WavLM-Large.pt" --word2vector_path "... your path/crawl-300d-2M.vec" --gpu 0 --debug True
python calculate_gesture_statistics.py --dataset TWH --version "v0"
Same as above, you should get the mean, std, and `.h5` files in `./BEAT-TWH-main/process/`. When using the full TWH dataset, set `--debug False`.
cd ../mydiffusion_beat_twh
python end2end.py --config=./configs/DiffuseStyleGesture.yml --gpu 0 --dataset TWH
Similarly, the model and opt are saved in `./BEAT-TWH-main/mydiffusion_beat_twh/TWH_mymodel4_512_v0/`.
- We forgot to normalize the seed gestures in the challenge (fixed here), which resulted in the first segment (the first 120 frames, 4 s) of all the submitted results being a bit strange and deteriorated the performance (a minimal normalization sketch follows this list).
- The data obtained by retargeting BEAT to TWH for the challenge does not train well; the problem is still being investigated.
- There are some discrepancies between the RM in the framework and the paper; thanks to this issue for pointing it out.
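For context on the first issue, normalizing the seed gestures just means applying the same per-dimension standardization used for the training gestures; a minimal sketch, assuming mean/std files like those from the preprocessing sketch above:

```python
# Standardize the seed gesture frames with the training statistics, i.e. the
# normalization that was missing from the submitted challenge system.
import numpy as np

mean = np.load("gesture_mean_v0.npy")        # hypothetical file names
std = np.load("gesture_std_v0.npy")
seed = np.random.randn(120, mean.shape[0])   # stand-in for 120 real seed frames (4 s at 30 fps)
seed_normalized = (seed - mean) / std
```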
If you find this repo useful for your research, please consider citing the following papers :)
@inproceedings{yang2022DiffuseStyleGestureplus,
title={The DiffuseStyleGesture+ entry to the GENEA Challenge 2023},
author={Yang, Sicheng and Xue, Haiwei and Zhang, Zhensong and Li, Minglei and Wu, Zhiyong and Wu, Xiaofei and Xu, Songcen and Dai, Zonghong},
booktitle={Proceedings of the 2023 International Conference on Multimodal Interaction},
year={2023}
}
Feel free to contact us ([email protected] or [email protected]) with any questions or concerns.