TED Expressive Dataset

This folder contains the scripts to build the TED Expressive Dataset. You can download YouTube videos and transcripts, divide the videos into scenes, and extract human poses. Note that this dataset is built upon the TED Gesture Dataset by Yoon et al., which we extend with 3D finger keypoint annotations. Please see the project page and paper for more details.

Project | Paper | Demo

Environment

The scripts were tested on Ubuntu 16.04 LTS with Python 3.5.2.

Dependencies

  • OpenPose (v1.4) for pose estimation
  • ExPose for 3D pose estimation
  • PySceneDetect (v0.5) for video scene segmentation
  • OpenCV (v3.4) for video reading
    • We use FFmpeg. Use the latest pip version of opencv-python or build OpenCV with FFmpeg support.
  • Gentle (Jan. 2019 version) for transcript alignment
    • Download the source code from the Gentle GitHub repository and run ./install.sh. You can then import the gentle library by adding its path to sys.path (see run_gentle.py and the sketch after this list).
    • Add the -vn option to resample.py in Gentle as follows:
      cmd = [
          FFMPEG,
          '-loglevel', 'panic',
          '-y',
      ] + offset + [
          '-i', infile,
      ] + duration + [
          '-vn',  # ADDED (disables the video stream; see the ffmpeg -vn option)
          '-ac', '1', '-ar', '8000',
          '-acodec', 'pcm_s16le',
          outfile
      ]
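
If Gentle is installed outside your Python environment, importing it by path might look like the following minimal sketch (the path is an assumption; adjust it to where you cloned Gentle):

import sys

# Assumption: path to the cloned Gentle repository; adjust to your setup.
sys.path.insert(0, '/path/to/gentle')

import gentle  # resolvable now that the repository root is on sys.path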

A step-by-step guide

  1. Set config

    • Update the paths and the YouTube developer key in config.py (the directories will be created if they do not exist).
    • Update the target channel ID. The scripts are tested on the TED and LaughFactory channels.
  2. Execute download_video.py

    • Download YouTube videos, metadata, and subtitles (./videos_ted/*.mp4, *.json, *.vtt).
  3. Execute run_mp3.py

    • Extract the audio files from the video files with ffmpeg (./audio_ted/*.mp3); a minimal ffmpeg sketch is given after this list.
  4. Execute run_openpose.py

    • Run OpenPose to extract body, hand, and face skeletons for all videos (./temp_skeleton_raw/vid/keypoints/*.json); see the OpenPose sketch after this list.
  5. Execute run_ffmpeg.py

    • Since ExPose requires both the raw images and the OpenPose keypoint JSON files for inference, we first extract all raw image frames with ffmpeg (./temp_skeleton_raw/vid/images/*.png); the ffmpeg sketch after this list also covers frame extraction.
  6. Execute run_expose.py

    • Run ExPose to extract 3D body, hand (including fingers), and face skeletons for all videos (./expose_ted/vid/*.npz).
    • Note that during our implementation we could not set up the open3d environment required by ExPose on a Slurm cluster without sudo, so we modified the inference code to remove that dependency. The output format of ExPose is also slightly changed to better facilitate dataset building (i.e., the estimated camera parameters are additionally saved for the 3D keypoint visualization in Step 10). You can substitute the original inference.py under the ExPose directory with the modified version code.
  7. Execute run_scenedetect.py

    • Run PySceneDetect to divide the videos into scene clips (./clip_ted/*.csv); see the PySceneDetect sketch after this list.
  8. Execute run_gentle.py

    • Run Gentle for word-level alignments (./videos_ted/*_align_results.json); see the Gentle alignment sketch after this list.
    • Skip this step if you use auto-generated subtitles; it is necessary for the TED Talks channel.
  9. Execute run_clip_filtering.py

    • Remove inappropriate clips.
    • Save clips with body skeletons (./filter_res/vid/*.json).
  10. (optional) Execute review_filtered_clips.py

    • Review the filtering results. Note that, unlike the original process, which visualizes the 2D keypoints on the image, we additionally support visualization of the 3D keypoints extracted by ExPose based on the coordinates and camera parameters.
  11. Execute make_ted_dataset.py
    • Do some post-processing and split the data into training, validation, and test sets (./whole_output/*.pickle).
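
A minimal sketch of the ffmpeg calls behind Steps 3 and 5 (the video ID and exact paths are placeholders and the output directories must already exist; run_mp3.py and run_ffmpeg.py are the authoritative versions):

import subprocess

video = './videos_ted/VIDEO_ID.mp4'  # placeholder file name

# Step 3: extract the audio track (-vn drops the video stream).
subprocess.run(['ffmpeg', '-y', '-i', video, '-vn', './audio_ted/VIDEO_ID.mp3'], check=True)

# Step 5: dump every frame as a PNG for ExPose (%06d gives zero-padded frame indices).
subprocess.run(['ffmpeg', '-y', '-i', video, './temp_skeleton_raw/VIDEO_ID/images/%06d.png'], check=True)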
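
For Step 4, invoking the OpenPose demo binary for a single video could look roughly like the sketch below. The installation path and file names are assumptions, and the flags follow the standard OpenPose v1.4 demo CLI; run_openpose.py is the authoritative version:

import subprocess

openpose_dir = '/path/to/openpose'  # assumption: where OpenPose v1.4 was built
cmd = [
    './build/examples/openpose/openpose.bin',
    '--video', '/path/to/videos_ted/VIDEO_ID.mp4',                   # placeholder input video
    '--write_json', '/path/to/temp_skeleton_raw/VIDEO_ID/keypoints', # one JSON per frame
    '--hand', '--face',    # also estimate hand and face keypoints
    '--display', '0',      # no GUI
    '--render_pose', '0',  # skip rendering for speed
]
subprocess.run(cmd, cwd=openpose_dir, check=True)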
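
For Step 7, a minimal sketch using the PySceneDetect v0.5 Python API is shown below (the threshold and file name are placeholders; the actual script may instead drive the scenedetect command-line tool and write the scene list to ./clip_ted/*.csv):

from scenedetect.video_manager import VideoManager
from scenedetect.scene_manager import SceneManager
from scenedetect.detectors import ContentDetector

video_manager = VideoManager(['./videos_ted/VIDEO_ID.mp4'])  # placeholder file name
scene_manager = SceneManager()
scene_manager.add_detector(ContentDetector(threshold=30.0))  # content-change threshold

video_manager.set_downscale_factor()  # downscale frames for faster detection
video_manager.start()
scene_manager.detect_scenes(frame_source=video_manager)

# Each scene is a (start, end) pair of FrameTimecode objects.
for start, end in scene_manager.get_scene_list(video_manager.get_base_timecode()):
    print(start.get_timecode(), end.get_timecode())
video_manager.release()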
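
For Step 8, forced alignment with Gentle's Python API follows the pattern sketched below. This is only a sketch: it assumes the subtitle has already been flattened into a plain-text transcript, and the file names are placeholders; see run_gentle.py for the actual procedure:

import gentle

with open('./videos_ted/VIDEO_ID_transcript.txt') as f:  # placeholder plain-text transcript
    transcript = f.read()

resources = gentle.Resources()
# gentle.resampled() converts the audio to the mono 8 kHz WAV the aligner expects.
with gentle.resampled('./audio_ted/VIDEO_ID.mp3') as wavfile:
    aligner = gentle.ForcedAligner(resources, transcript, nthreads=4)
    result = aligner.transcribe(wavfile)

with open('./videos_ted/VIDEO_ID_align_results.json', 'w') as f:
    f.write(result.to_json(indent=2))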

Note: since the overall data pre-processing is quite time-consuming when executed in a single thread, you can parallelize it manually by splitting the vid range, i.e., process a subset of vid files in each run:

import glob
import os

all_file_list = sorted(glob.glob(path_to_files_that_you_want_to_process), key=os.path.getmtime)
subset_file_list = all_file_list[start_idx:end_idx]
for each_file in subset_file_list:
    pass  # execute the processing code here

In this way you will obtain multiple dataset subset files; you can merge them into a single pickle file and finally convert it into an LMDB dataset, consistent with our paper's implementation. A sample merge script is given in merge_dataset.py; you may need to modify it according to your dataset split implementation.
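
A minimal merge sketch, assuming each subset pickle stores a list of clip entries and the subset files follow a hypothetical naming scheme:

import glob
import pickle

merged = []
for subset_path in sorted(glob.glob('./whole_output/subset_*.pickle')):  # hypothetical naming
    with open(subset_path, 'rb') as f:
        merged.extend(pickle.load(f))  # assumes each subset pickle stores a list of entries

with open('./whole_output/merged.pickle', 'wb') as f:  # hypothetical output name
    pickle.dump(merged, f)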

Pre-built TED gesture dataset

Running the whole data collection pipeline is complex and takes several days, so we provide a pre-built dataset for the videos in the TED channel.

OneDrive Download Link

Download videos and transcripts

We do not provide the videos and transcripts of TED talks due to copyright issues. You should download the actual videos and transcripts yourself as follows:

  1. Download the [video_ids.txt] file, which contains the video IDs, and copy it into the ./videos_ted directory.
  2. Run download_video.py. It downloads the videos and transcripts listed in video_ids.txt. Some videos may not match the extracted poses we provide if the videos have been re-uploaded; please compare the numbers of frames, just in case.

Citation

If you find our code or data useful, please kindly cite our work as:

@inproceedings{liu2022learning,
  title={Learning Hierarchical Cross-Modal Association for Co-Speech Gesture Generation},
  author={Liu, Xian and Wu, Qianyi and Zhou, Hang and Xu, Yinghao and Qian, Rui and Lin, Xinyi and Zhou, Xiaowei and Wu, Wayne and Dai, Bo and Zhou, Bolei},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={10462--10472},
  year={2022}
}

Since the dataset is built upon the previous work of Yoon et al., we also kindly ask you to cite their great paper:

@INPROCEEDINGS{
  yoonICRA19,
  title={Robots Learn Social Skills: End-to-End Learning of Co-Speech Gesture Generation for Humanoid Robots},
  author={Yoon, Youngwoo and Ko, Woo-Ri and Jang, Minsu and Lee, Jaeyeon and Kim, Jaehong and Lee, Geehyuk},
  booktitle={Proc. of The International Conference in Robotics and Automation (ICRA)},
  year={2019}
}

Acknowledgement