Online (streaming) speaker change detection model implemented in PyTorch

This repository contains a PyTorch implementation of an online (streaming) speaker change detection model.

The model consists of several 1-D convolutional layers acting as a speech encoder, a multi-layer LSTM that models speaker change, and a final softmax layer. The model uses a step size of 100 ms (i.e., it outputs 10 decisions per second).
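
The architecture described above can be sketched roughly as follows. This is an illustrative sketch only, not the code in this repository: the class name `StreamingSCDSketch`, the 40-dimensional input features at a 10 ms frame rate, the layer sizes, and the kernel/stride choices are all assumptions, picked so that the encoder downsamples by a factor of 10 and the model emits one decision per 100 ms.

```python
import torch
import torch.nn as nn


class StreamingSCDSketch(nn.Module):
    """Hypothetical sketch: 1-D conv encoder -> multi-layer LSTM -> softmax."""

    def __init__(self, n_feats=40, channels=64, hidden=128, lstm_layers=2):
        super().__init__()
        # Assumed encoder: two 1-D convolutions with total stride 2 * 5 = 10,
        # turning 10 ms input frames into 100 ms steps.
        self.encoder = nn.Sequential(
            nn.Conv1d(n_feats, channels, kernel_size=5, stride=2, padding=2),
            nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size=5, stride=5, padding=2),
            nn.ReLU(),
        )
        # Unidirectional LSTM, so each decision depends only on past input.
        self.lstm = nn.LSTM(channels, hidden, num_layers=lstm_layers,
                            batch_first=True)
        # Two classes per step: "no change" and "speaker change".
        self.out = nn.Linear(hidden, 2)

    def forward(self, feats, state=None):
        # feats: (batch, time, n_feats); Conv1d expects (batch, n_feats, time).
        x = self.encoder(feats.transpose(1, 2))
        x, state = self.lstm(x.transpose(1, 2), state)
        # Return per-step class probabilities and the LSTM state.
        return torch.softmax(self.out(x), dim=-1), state


model = StreamingSCDSketch().eval()
feats = torch.randn(1, 1000, 40)  # 10 s of random stand-in features
with torch.no_grad():
    probs, state = model(feats)
print(probs.shape)  # one row of class probabilities per 100 ms step
```

Returning the LSTM state lets a caller pass it back in with the next chunk of audio. In this sketch the padded convolutions introduce small edge effects at chunk boundaries, which a real streaming implementation would need to handle.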

The model is trained using a special version of cross-entropy training that tolerates small errors in the hypothesized speaker change timestamps. As a result, the softmax outputs of the trained model are very peaky, and no local maxima tracking is needed to extract the final speaker turn points. This makes the model suitable for online applications.
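
Because the outputs are peaky, turn points can in principle be read off with a plain per-step threshold, one decision at a time. The following is a minimal sketch of that idea, not the repository's decoding code: the function name `change_points`, the 0.5 threshold, the example probabilities, and the choice to timestamp a step at its start are all assumptions.

```python
STEP_SECONDS = 0.1  # the model emits one decision every 100 ms


def change_points(change_probs, threshold=0.5, step=STEP_SECONDS):
    """Yield a timestamp (in seconds) for each step flagged as a change.

    change_probs is any iterable of per-step "speaker change" probabilities,
    so it can be consumed online as the model produces its outputs.
    """
    for i, p in enumerate(change_probs):
        # No local maxima tracking: each step is judged on its own.
        if p > threshold:
            yield round(i * step, 3)


# Made-up peaky probabilities for eight consecutive 100 ms steps.
example_probs = [0.01, 0.02, 0.97, 0.01, 0.00, 0.03, 0.99, 0.02]
detected = list(change_points(example_probs))
print(detected)  # → [0.2, 0.6]
```

Since each decision depends only on the current step, the generator can be attached directly to a live stream of model outputs without buffering.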

The test directory contains a model trained on Estonian broadcast data.
