This repository contains the basic scripts that are provided for the challenge. The goal is to develop an audio-visual diarization model. We provide the following observations that must be used as input for your model:
- RGB videos
- Sound Source Localization (SSL) heat maps
- Full-body pose estimation
The goal is twofold:
- Perform visual tracking to associate the provided detections through time.
- Predict the speaking activity of each tracked person.
We provide some basic visualization and evaluation scripts. The data are based on the AVDIAR dataset. The data can be downloaded here.
The test data can be downloaded here.
You need to install opencv-python and numpy. The scripts have been tested with opencv (cv2) version 2.4.13 and python 2.7.12.
Refer to the OpenCV documentation for installation instructions.
We provide the following scripts:
- visualize the observations (visual and audio) we provide for one given video
- visualize the prediction for one given video. It can also be used to visualize the ground truth of a video.
We strongly recommend using these scripts as a starting point for the loading functions of your own program.
./data contains one folder per video sequence. Each folder contains:
- video.avi: the video itself.
- ssl.avi: the video of the Sound Source Localization (SSL) heat map (downsampled by a factor of 2). At each pixel, it gives the probability, according to our SSL, that there is a sound source at that location.
- detections.txt: full-body detections for the video. We ran a multi-person detector to obtain the coordinates (x, y) of 18 joints (nose, neck, Rsho, Relb, Rwri, Lsho, Lelb, Lwri, Rhip, Rkne, Rank, Lhip, Lkne, Lank, Leye, Reye, Lear, Rear) for each person in each frame. The file contains one detection per line. The format of each line is the following:
frameNumber x0 y0 x1 y1....x17 y17
When a joint is not detected, the corresponding xi and yi are set to -1.
- groundTruth.txt: full-body detections with an index to identify each person across time and a speaking label. It contains one person per line. The format of each line is the following:
frameNumber personIndex speakingLabel x0 y0 x1 y1....x17 y17
speakingLabel=1 if the person is speaking, 0 otherwise.
- prediction.txt: this is the file you have to generate. We give an example of this file in data/video1/. It must follow the same format as groundTruth.txt so that it can be used by the provided visualization and evaluation scripts.
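The line formats described above can be read and written with a few lines of Python. The following is only a minimal sketch, not the loading code of the provided scripts: the function names `parse_line` and `format_prediction_line` are hypothetical, and it assumes that fields are separated by whitespace and that a joint is missing only when both of its coordinates equal -1.

```python
# Joint order as listed in the description of detections.txt.
JOINT_NAMES = ["nose", "neck", "Rsho", "Relb", "Rwri", "Lsho", "Lelb", "Lwri",
               "Rhip", "Rkne", "Rank", "Lhip", "Lkne", "Lank",
               "Leye", "Reye", "Lear", "Rear"]


def parse_line(line, has_labels):
    """Parse one line of detections.txt (has_labels=False) or of
    groundTruth.txt / prediction.txt (has_labels=True).

    Assumption: fields are whitespace-separated.
    """
    fields = line.split()
    n_header = 3 if has_labels else 1
    expected = n_header + 2 * len(JOINT_NAMES)
    if len(fields) != expected:
        raise ValueError("expected %d fields, got %d" % (expected, len(fields)))

    record = {
        "frame": int(float(fields[0])),
        "person": int(float(fields[1])) if has_labels else None,
        "speaking": int(float(fields[2])) if has_labels else None,
    }
    coords = [float(v) for v in fields[n_header:]]
    joints = {}
    for i, name in enumerate(JOINT_NAMES):
        x, y = coords[2 * i], coords[2 * i + 1]
        # Undetected joints are stored as -1 -1 in the files; map them to None.
        joints[name] = None if (x == -1 and y == -1) else (x, y)
    record["joints"] = joints
    return record


def format_prediction_line(frame, person, speaking, joints):
    """Build one prediction.txt line:
    frameNumber personIndex speakingLabel x0 y0 ... x17 y17
    """
    fields = [str(frame), str(person), str(speaking)]
    for name in JOINT_NAMES:
        xy = joints.get(name)
        x, y = xy if xy is not None else (-1, -1)
        # "{:g}" keeps 6 significant digits, which is an illustrative choice.
        fields += ["{:g}".format(x), "{:g}".format(y)]
    return " ".join(fields)
```

A typical use would be to parse each line of detections.txt, assign a `personIndex` and a `speakingLabel` with your own model, and write the result with `format_prediction_line`.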
To evaluate your tracking results, the MOTA metric is used. See equation (1) of and (2.1.3)
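As a rough illustration of the metric, the sketch below assumes the standard CLEAR MOT definition, MOTA = 1 - (sum over frames of misses + false positives + identity switches) / (sum over frames of ground-truth persons). It also assumes that the per-frame counts have already been obtained by matching predictions to the ground truth. The function name `mota` and its input layout are hypothetical, and the provided evaluation script may compute the score differently.

```python
def mota(frame_counts):
    """Compute MOTA from per-frame error counts.

    frame_counts: iterable of tuples
        (misses, false_positives, identity_switches, num_ground_truth),
    one tuple per frame. The matching step that produces these counts
    is assumed to be done elsewhere.
    """
    total_errors = 0
    total_gt = 0
    for misses, false_positives, id_switches, num_gt in frame_counts:
        total_errors += misses + false_positives + id_switches
        total_gt += num_gt
    if total_gt == 0:
        raise ValueError("MOTA is undefined without ground-truth persons")
    # float() keeps the division exact under Python 2 as well.
    return 1.0 - float(total_errors) / total_gt
```

For example, two frames with four ground-truth persons each, one miss in the first frame, and one false positive plus one identity switch in the second give `mota([(1, 0, 0, 4), (0, 1, 1, 4)])`, that is 1 - 3/8. Note that MOTA can be negative when the number of errors exceeds the number of ground-truth persons.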
- Audio-Visual Speaker Diarization Based on Spatiotemporal Bayesian Fusion:
- Tracking a Varying Number of People with a Visually-Controlled Robotic Head:
- MOT Challenge:
- OpenCV:
- scikit-learn:
- Slides of the introductory presentation by Radu Horaud: