This system performs speaker diarization (speech segmentation and clustering into homogeneous speaker clusters) on a given list of audio files. It is based on the binary key speaker modelling technique. Because the binary key background model (KBM) is trained in-session, the system does not require any external training data, which makes it an easy-to-run and easy-to-tune option for speaker diarization tasks.
This implementation is based on that of Delgado, which is also available for MATLAB. Besides the binary key related code, it includes useful functions for a speaker diarization pipeline. Extra details and functionality were added following our participation, at EURECOM, in the Albayzin 2016 Speaker Diarization Evaluation (described here), the first DIHARD challenge (detailed in the Interspeech 2018 paper), and the IberSPEECH-RTVE Speaker Diarization Evaluation (explained here).
This code is written and tested in Python 3.6 using conda. It relies on a few common packages:
- numpy
- scipy
- scikit-learn
- librosa for audio processing and feature extraction
- py-webrtcvad for voice activity detection
If you are using conda:
$ conda create -n pyBK python=3.6
$ source activate pyBK
$ conda install numpy
$ conda install -c conda-forge librosa
$ pip install webrtcvad
$ git clone https://github.com/josepatino/pyBK.git
Five files from the SAIVT-BNEWS database are included in order to test the system (all rights reserved to their respective owners). These comprise audio files in wav format, speech activity detection (SAD) and unpartitioned evaluation map (UEM) files obtained from the references. For a quick run:
$ cd pyBK
$ python main.py
If no UEM file is found, the complete audio content is considered. If no VAD file is found, automatic VAD based on py-webrtcvad is applied. Automatic VAD can also be enforced in the config file.
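The UEM fallback described above can be sketched as follows. This is only an illustration, not the code used in pyBK: the function name `load_uem_or_full` and its arguments are hypothetical, and the parser assumes the usual four-column UEM layout (file, channel, start, end).

```python
import os


def load_uem_or_full(uem_path, audio_duration):
    """Return a list of (start, end) regions in seconds to be processed.

    Hypothetical helper: if the UEM file is missing, the whole
    recording (0 .. audio_duration) is returned as a single region.
    """
    if not os.path.isfile(uem_path):
        # No UEM file found: consider the complete audio content.
        return [(0.0, audio_duration)]

    regions = []
    with open(uem_path) as handle:
        for line in handle:
            fields = line.split()
            # Skip blank, malformed, and comment (';') lines.
            if len(fields) < 4 or line.lstrip().startswith(";"):
                continue
            # Assumed column order: file, channel, start, end.
            regions.append((float(fields[2]), float(fields[3])))
    return regions
```

The same pattern (check for the file, otherwise fall back to a default) would apply to the VAD files, with automatic VAD taking the place of the full-file default.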
The system is configured through an INI file; the example config.ini file is commented. To use this system on your own data, create a config file of your own and run:
$ python main.py yourconfig.ini
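For readers unfamiliar with INI files, here is a minimal sketch of how such a file can be read with Python's standard `configparser` module. The section and option names (`EXPERIMENT`, `name`, `VAD`, `use_automatic_vad`) are made up for illustration; the actual names are the ones documented in the example config.ini.

```python
import configparser

# Hypothetical INI content; the real option names live in config.ini.
EXAMPLE_INI = """
[EXPERIMENT]
name = my_experiment

[VAD]
; illustrative switch, not necessarily the real option name
use_automatic_vad = no
"""


def load_config(text):
    """Parse INI text and return a ConfigParser instance."""
    config = configparser.ConfigParser()
    config.read_string(text)
    return config


config = load_config(EXAMPLE_INI)
experiment_name = config.get("EXPERIMENT", "name")
# getboolean understands yes/no, true/false, on/off and 1/0.
use_auto_vad = config.getboolean("VAD", "use_automatic_vad", fallback=False)
print(experiment_name, use_auto_vad)
```

To read a file from disk instead of a string, `config.read("yourconfig.ini")` can be used in place of `read_string`.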
Finally, a config file matching our DIHARD submission is also included. Note that this configuration is meant to be used with IIR-CQT Mel-frequency cepstral coefficients (ICMC), which can be reproduced using MATLAB code available here.
The system generates an RTTM file, which you can evaluate with the provided NIST md-eval script,
$ eval-tools/md-eval-v21.pl -c 0.25 -s out/[experiment_name].rttm -r eval-tools/reference.rttm
which should return a 5.32% diarization error rate (DER) using a standard 0.25s collar. With automatic VAD, you should get a 10.04% DER. With the DIHARD config file and ICMC features, this system returns a DER of 30.69% on the DIHARD evaluation set, with a 0s collar.
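As a rough guide to reading these numbers, the sketch below parses RTTM `SPEAKER` lines and applies the usual DER definition (missed speech plus false alarm plus speaker confusion, divided by total reference speech time). It is not a replacement for md-eval, which also handles collars, overlap, and the optimal speaker mapping; the function names and the example segments are illustrative only.

```python
from collections import defaultdict


def speaker_durations(rttm_lines):
    """Sum the speaking time (seconds) per speaker from RTTM lines."""
    totals = defaultdict(float)
    for line in rttm_lines:
        fields = line.split()
        # RTTM SPEAKER records: type, file, channel, onset, duration,
        # <NA>, <NA>, speaker name, ...
        if len(fields) < 8 or fields[0] != "SPEAKER":
            continue
        totals[fields[7]] += float(fields[4])
    return dict(totals)


def diarization_error_rate(missed, false_alarm, confusion, total_speech):
    """Return the DER in percent from its three error components."""
    return 100.0 * (missed + false_alarm + confusion) / total_speech


# Made-up segments, only to show the expected line layout.
example_rttm = [
    "SPEAKER file1 1 0.00 2.50 <NA> <NA> spk1 <NA> <NA>",
    "SPEAKER file1 1 2.50 1.50 <NA> <NA> spk2 <NA> <NA>",
    "SPEAKER file1 1 4.00 1.00 <NA> <NA> spk1 <NA> <NA>",
]
print(speaker_durations(example_rttm))
print(diarization_error_rate(1.0, 0.5, 0.5, 10.0))
```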
Please feel free to contact me with any questions related to this code:
- Jose Patino: patino[at]eurecom[dot]fr
If you use pyBK in your research, please use the following citation:
@inproceedings{patino2018,
author = {Patino, Jose and Delgado, H{\'e}ctor and Evans, Nicholas},
title = {{The EURECOM submission to the first DIHARD Challenge}},
booktitle = {{Interspeech 2018, 19th Annual Conference of the International Speech Communication Association}},
year = {2018},
month = {September},
address = {Hyderabad, India},
}