This repo provides a solution for the GENEA Challenge 2020.

Data processing

The folder DataProcessing contains scripts for feature extraction, data normalization and generation of output video. The pipeline is based on one of the baselines.

Assume that the challenge training dataset is located in the folder data with the following files:

  • ./data/Audio/Recording_%3d.wav - input raw audio files with recorded speech
  • ./data/Motion/Recording_%3d.bvh - motion data corresponding to audio files
  • ./data/Transcripts/Recording_%3d.json - text transcripts for recorded speech

There are several scripts to prepare data:

  1. - converts motion data into numpy arrays and stores them into *.npy files. Arguments:

    • --src - path to the folder with motion data
    • --dst - path to the folder where the processed arrays will be stored
    • --pipe - (optional, default=./pipe) path where the sklearn pipeline will be stored or read from
    • --bvh - (flag) if set, applies the inverse transform: generates bvh files from npy arrays


    python DataProcessing/ --src data/Motion --dst data/Features
  2. - extracts MFCC features from speech recordings, averages every 5 successive frames to match the motion frame rate and stores the resulting arrays in *.npy files. Arguments:

    • --src - path to the folder with audio files
    • --dst - path to the folder where the extracted MFCC features will be stored


    python DataProcessing/ --src data/Audio --dst data/MFCC
  3. - aligns motion and audio features and stores them in numpy archives with X and Y keys for audio and motion features respectively. Optionally adds a context window to each audio frame. Arguments:

    • --motion_dir - (optional) path to the folder with motion feature arrays. If not set, the contextualized audio features are saved as a numpy array.
    • --audio_dir - path to the folder with audio feature arrays
    • --dst_dir - path to the folder where the aligned data will be stored
    • --with_context - (flag) if set, the audio features are stored with a context window for each frame
    • --context_length - (optional, default=60) context window size


    python DataProcessing/ --motion_dir data/Features --audio_dir data/MFCC --dst_dir data/Ready --with_context
  4. - normalizes motion features to [-1,1]; the maximum and mean values are calculated on the training dataset (all recordings except the first one). Arguments:

    • --src - path to the folder with aligned data
    • --dst - path to the folder where the normalized data will be stored
    • --values - (optional, default=./mean_pose.npz) path to the npz file where the normalizing values will be stored


    python DataProcessing/ --src data/Ready --dst data/Normalized
  5. Splitting into train & valid. We validated on Recording_001; the rest was used for training.

mkdir -p data/dataset/train data/dataset/test
cp data/Ready/* data/dataset/train
mv data/dataset/train/data_001.npz data/dataset/test

After running the 4 scripts listed above, we get numpy archives suitable for training models.
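
As a rough illustration of the audio side of steps 2 and 3, the sketch below averages every 5 successive feature frames and then attaches a context window to each frame. It is not the repository's code: the function names, dropping of leftover frames, the centered window of `context_length // 2` frames on each side, and the zero padding at the edges are all assumptions made for this example.

```python
import numpy as np


def average_frames(features: np.ndarray, group: int = 5) -> np.ndarray:
    """Average every `group` successive frames of a (frames, dims) array.

    Leftover frames at the end are dropped (an assumption of this sketch).
    """
    usable = (features.shape[0] // group) * group
    return features[:usable].reshape(-1, group, features.shape[1]).mean(axis=1)


def add_context(features: np.ndarray, context_length: int = 60) -> np.ndarray:
    """Attach a centered, zero-padded context window to every frame.

    Returns an array of shape (frames, 2 * (context_length // 2) + 1, dims),
    where the middle row of each window is the frame itself.
    """
    half = context_length // 2
    padded = np.pad(features, ((half, half), (0, 0)), mode="constant")
    window = 2 * half + 1
    return np.stack([padded[i:i + window] for i in range(features.shape[0])])


# Toy input: 20 raw frames with 3 feature dimensions.
mfcc = np.arange(20 * 3, dtype=float).reshape(20, 3)
averaged = average_frames(mfcc)                       # shape (4, 3)
contextual = add_context(averaged, context_length=2)  # shape (4, 3, 3)
```

With the default `context_length=60`, each frame in this sketch would carry a window of 61 frames, so the stored arrays grow accordingly.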

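The normalization in step 4 can be pictured with the following minimal sketch. It assumes that features are centered by the training mean and divided by the largest absolute deviation per feature; the function names and the guard for constant features are hypothetical, and the repository's script may compute its values differently.

```python
import numpy as np


def fit_normalizer(train_motion: np.ndarray):
    """Compute per-feature mean and maximum absolute deviation on training data."""
    mean = train_motion.mean(axis=0)
    max_abs = np.abs(train_motion - mean).max(axis=0)
    max_abs[max_abs == 0] = 1.0  # constant features: avoid division by zero
    return mean, max_abs


def normalize(motion: np.ndarray, mean: np.ndarray, max_abs: np.ndarray) -> np.ndarray:
    """Map features to roughly [-1, 1] using the training statistics."""
    return (motion - mean) / max_abs


def denormalize(normalized: np.ndarray, mean: np.ndarray, max_abs: np.ndarray) -> np.ndarray:
    """Invert `normalize`, e.g. before converting predictions back to bvh."""
    return normalized * max_abs + mean


# Toy training data: 3 frames, 2 features (the second one is constant).
train = np.array([[0.0, 10.0], [2.0, 10.0], [4.0, 10.0]])
mean, max_abs = fit_normalizer(train)
scaled = normalize(train, mean, max_abs)
```

Because the statistics come from the training recordings only, validation or test features normalized this way may fall slightly outside [-1, 1].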

There are 2 types of models in this repository, both are based on a seq2seq architecture.

  • ContextSeq2Seq - a seq2seq with a context encoder, which at each step combines words and audio features in an encoder cell. This model is described in section 3.2 of our article.
python ContextSeq2Seq\  --gpus 1 --predicted-poses 20 --previous-poses 10 --serialize-dir new  --max_epochs 100 --stride  --batch_size 50 --embedding .\embeddings\glove.6B.100d.txt --text_folder data\Transcripts
  • WordsSeq2Seq - a seq2seq that uses attention over encoded words & audio features. This model is described in section 3.2 of our work.
python WordsSeq2Seq/ --gpus 1 --predicted-poses 20 --previous-poses 10 --serialize-dir new --max_epochs 100 --stride 1 --batch_size 512 --with_context --embedding embeddings/glove.6B.100d.txt --text_folder data/Transcripts

Both models have common params:

  • Parameters for pytorch lightning Trainer
  • --predicted-poses - number of frames in sequence per instance
  • --previous-poses - number of previous poses to initialize decoder state
  • --serialize-dir - folder to save checkpoints
  • --stride - margin between two successive instances
  • --embedding - path to GLOVE embeddings file
  • --text_folder - path to text transcripts

For the WordsSeq2Seq model, the argument --with_context adds contexts for the audio encoder.
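
To make the interplay of --predicted-poses, --previous-poses and --stride more concrete, here is a small sketch of how a recording could be cut into training instances. The function `instance_bounds` is hypothetical, and the assumption that an instance starts only where a full set of previous poses is available may differ from the actual data loader.

```python
from typing import List, Tuple


def instance_bounds(
    num_frames: int, predicted_poses: int, previous_poses: int, stride: int
) -> List[Tuple[int, int, int]]:
    """Return (previous_start, start, end) frame indices for each instance.

    Frames [previous_start, start) initialize the decoder state and
    frames [start, end) are the poses to predict.
    """
    if stride <= 0:
        raise ValueError("stride must be positive")
    bounds = []
    start = previous_poses
    while start + predicted_poses <= num_frames:
        bounds.append((start - previous_poses, start, start + predicted_poses))
        start += stride
    return bounds


# A 100-frame recording cut with a stride of 30 frames.
bounds = instance_bounds(num_frames=100, predicted_poses=20, previous_poses=10, stride=30)
print(bounds)  # [(0, 10, 30), (30, 40, 60), (60, 70, 90)]
```

With a stride of 1, as in the WordsSeq2Seq command above, successive instances overlap almost completely, which yields many more instances per recording.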


The prediction script in the corresponding model folder should be used. This example predicts motion features for the validation file:

python WordsSeq2Seq/ --src data/dataset/test/data_001.npz --checkpoint new/last.ckpt --dest predictions/data_001.npy --text_folder data/Transcripts

To create bvh or mp4 files from predicted features, use one of the following commands:

python --pred predictions/pred.npy --dest vid.mp4 --smooth --audio data/Audio/Recording_001.wav --mean mean_pose.npz
python --pred predictions --dest results --smooth --mean mean_pose.npz

The first command takes a single npy file (--pred) and generates an mp4 file (--dest) using the visualization server. It also takes the --audio parameter: the path to the input audio file, which is merged with the silent visualization video. The second command takes a folder as input (--pred) and generates a bvh file in the output folder (--dest) for each npy file in the input folder.

Common parameters:

  • --smooth - apply Savitzky-Golay filter to predicted motion features
  • --mean - file with mean values obtained from normalization
  • --pipe - pipeline path obtained from processing motions
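
For readers unfamiliar with the --smooth option, the sketch below applies a Savitzky-Golay filter along the time axis of a feature array using `scipy.signal.savgol_filter`. The window length and polynomial order are assumed values chosen for illustration, and `smooth_motion` is a hypothetical helper; the settings used by the repository's scripts are not stated here.

```python
import numpy as np
from scipy.signal import savgol_filter


def smooth_motion(features: np.ndarray, window_length: int = 9, polyorder: int = 3) -> np.ndarray:
    """Smooth every feature channel of a (frames, dims) array along time (axis 0)."""
    return savgol_filter(features, window_length=window_length, polyorder=polyorder, axis=0)


# Toy "predicted" features: two smooth channels with added jitter.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 200)
clean = np.stack([np.sin(2 * np.pi * t), np.cos(2 * np.pi * t)], axis=1)
noisy = clean + rng.normal(scale=0.05, size=clean.shape)
smoothed = smooth_motion(noisy)  # same shape as the input, with less frame-to-frame jitter
```

A longer window removes more jitter but also flattens fast gestures, so the values would need tuning against the generated videos.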

Processing test dataset

Assume that test audio and transcripts are placed in folders data/Test/Audio and data/Test/Transcripts. To make predictions for the test dataset you need to follow these steps:

  • Get MFCC features from audio:
python DataProcessing/ --src data/Test/Audio --dst data/Test/MFCC
  • Add contexts to Audio features:
python DataProcessing/ --audio_dir data/Test/MFCC --dst_dir data/Test/Ready --with_context
  • Rename transcripts (TestSeq%3d.json -> Recording_%3d.json):
cd data\Test\Transcripts
Dir | Rename-Item -NewName { $_.Name -replace "TestSeq","Recording_" }
  • Predict:
python ContextSeq2Seq/ --src data/Test/Ready/TestSeq001.npy --checkpoint text_encoder/last.ckpt --dest predictions/data_001.npy --text_folder data/Test/Transcripts
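
The rename step above is written for PowerShell. On other systems, the same renaming could be done with a few lines of Python; `rename_transcripts` is a hypothetical helper that is not part of this repository, and the demo runs on a temporary folder rather than on real data.

```python
import tempfile
from pathlib import Path


def rename_transcripts(folder, old_prefix="TestSeq", new_prefix="Recording_"):
    """Rename <old_prefix>*.json files in `folder` and return the new file names."""
    renamed = []
    # sorted() collects all matches first, so renaming does not disturb the iteration.
    for path in sorted(Path(folder).glob(f"{old_prefix}*.json")):
        target = path.with_name(new_prefix + path.name[len(old_prefix):])
        path.rename(target)
        renamed.append(target.name)
    return renamed


# Demo on a temporary folder with two empty transcripts.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "TestSeq001.json").write_text("[]")
    (Path(tmp) / "TestSeq002.json").write_text("[]")
    print(rename_transcripts(tmp))  # ['Recording_001.json', 'Recording_002.json']
```

For the layout assumed in this section, the call would be `rename_transcripts("data/Test/Transcripts")`.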

Visualization is the same as for the training dataset.


Checkpoints and some examples of generated motions can be found here

There are some other approaches we tried in this repo.

  • Folders DenoisingAutoEncoder and SpeechEncoder are our pytorch reimplementation of one of the baselines.
  • VariationalAutoEncoder - our attempt to replace the autoencoder in the baseline mentioned above with a variational autoencoder.
  • Our experiments with adversarial learning are located in branches seq2seq_asmekal and seq2seq_gan.

The code listed above has not been tested, so it may not work.


