Several challenges in speaker diarization:
(1) to segment and separate overlapping speech from two speakers;
(2) to estimate the number of speakers when participants may enter or leave the conversation at any time;
(3) to provide accurate speaker identification on short text-independent utterances;
(4) to track speakers' movement during the conversation;
(5) to detect speaker changes in real time.
Speaker diarization is the process of finding the optimal segmentation of audio by speaker identity and determining the identity of each segment, i.e. "who speaks when". In other words, it matches a list of audio segments to a list of distinct speakers.
We propose a speaker diarization system that effectively incorporates spatial information.
The microphone array contributes to our speaker diarization system in two ways.
(1) The ability to localize sound sources enables the system to find the optimal segmentation points with high accuracy. The location of each segment is an effective complement to speaker embeddings in joint clustering, especially for short segments.
(2) The differential directional microphone array significantly improves the quality of a speaker's voice in far-field, noisy environments, which in turn enhances the representational power of speaker embeddings.
We call our system the speaker diarization system with spatial spectrum (SDSS).
Audio segmentation, including the exact time of each speaker change, is determined jointly by spatial localization and neural-network voice activity detection (NN-VAD).
The output signals of the beamformers are spatially separated from each other.
The Circular Differential Directional Microphone Array is based on a uniform circular array of directional microphones, depicted in Fig. 1. The directional elements are uniformly distributed on a circle and point outward.
The estimated angles are fed one by one into an online clustering step. Whenever an angle falls outside the current cluster, we mark the current frame as a possible speaker change timestamp.
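The angle-based change detection described above can be sketched as follows. This is a minimal illustration, not the system's actual implementation: the function names, the 20-degree threshold, and the use of a circular mean as the cluster center are all assumptions made for the example.

```python
import math

def angular_distance(a, b):
    """Smallest absolute difference between two angles, in degrees (0..180)."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)

def circular_mean(angles):
    """Mean direction of a list of angles in degrees, handling wrap-around."""
    s = sum(math.sin(math.radians(a)) for a in angles)
    c = sum(math.cos(math.radians(a)) for a in angles)
    return math.degrees(math.atan2(s, c)) % 360.0

def detect_change_frames(doa_per_frame, threshold_deg=20.0):
    """Return indices of frames flagged as possible speaker changes.

    threshold_deg is a hypothetical cluster radius, not a value from the source.
    """
    changes = []
    cluster = []  # angles belonging to the current cluster
    for i, angle in enumerate(doa_per_frame):
        if cluster and angular_distance(angle, circular_mean(cluster)) > threshold_deg:
            changes.append(i)  # angle falls outside the current cluster
            cluster = []       # start a new cluster at this frame
        cluster.append(angle)
    return changes

# Three sources near 11, 96 and 356 degrees
print(detect_change_frames([10, 12, 11, 95, 97, 96, 355, 358]))  # [3, 6]
```

Using an angular distance that wraps at 360 degrees keeps a source near 0 degrees from being split into two clusters.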
Consider two utterances
With additional estimated DOA
An online agglomerative hierarchical clustering (AHC) is performed on the audio segments and their source locations, based on the joint conditional probability.
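As a rough illustration of joint clustering on embeddings and locations, the sketch below uses a greedy incremental scheme, which is a simplification of online AHC. It assumes, purely for illustration, that the joint score factors into an embedding term and a location term. The score function, `sigma_deg`, the threshold, and the dictionary layout are hypothetical and are not taken from the source.

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def angular_distance(a, b):
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)

def joint_score(seg, cluster, sigma_deg=15.0):
    """Hypothetical joint score: embedding term times location term."""
    # Map cosine similarity from [-1, 1] to [0, 1]
    emb_term = 0.5 * (1.0 + cosine_similarity(seg["emb"], cluster["emb"]))
    # Gaussian-shaped penalty on the DOA difference (assumed form)
    dist = angular_distance(seg["doa"], cluster["doa"])
    loc_term = math.exp(-0.5 * (dist / sigma_deg) ** 2)
    return emb_term * loc_term

def cluster_online(segments, threshold=0.5):
    """Assign each incoming segment to the best cluster or open a new one."""
    clusters, labels = [], []
    for seg in segments:
        scores = [joint_score(seg, c) for c in clusters]
        best = max(range(len(clusters)), key=scores.__getitem__, default=None)
        if best is not None and scores[best] >= threshold:
            c = clusters[best]
            n = c["n"]
            # Running mean of the embeddings in the cluster
            c["emb"] = [(n * x + y) / (n + 1) for x, y in zip(c["emb"], seg["emb"])]
            c["doa"] = seg["doa"]  # assumption: keep the latest location
            c["n"] = n + 1
        else:
            clusters.append({"emb": list(seg["emb"]), "doa": seg["doa"], "n": 1})
            best = len(clusters) - 1
        labels.append(best)
    return labels

segments = [
    {"emb": [1.0, 0.0], "doa": 10.0},
    {"emb": [0.9, 0.1], "doa": 14.0},
    {"emb": [0.0, 1.0], "doa": 120.0},
    {"emb": [0.1, 0.9], "doa": 118.0},
]
print(cluster_online(segments))  # [0, 0, 1, 1]
```

Because the two terms are multiplied, a short segment with a weak embedding can still be assigned correctly when its location matches a cluster closely.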
Let
Beamforming allows us to separate signals from different DOAs.
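To make the spatial separation concrete, the sketch below computes the narrowband response of a plain delay-and-sum beamformer on a uniform circular array. Note that this is a swapped-in, simpler technique than the differential beamforming with directional microphones described here. The array size, radius, frequency, and function names are assumptions chosen for the example.

```python
import cmath
import math

SPEED_OF_SOUND = 343.0  # m/s, approximate value in air at room temperature

def steering_vector(theta_deg, n_mics=6, radius_m=0.05, freq_hz=2000.0):
    """Far-field narrowband steering vector of a uniform circular array.

    Assumes omnidirectional elements at angles 2*pi*m/n_mics on the circle.
    """
    theta = math.radians(theta_deg)
    k = 2.0 * math.pi * freq_hz / SPEED_OF_SOUND  # wavenumber
    return [
        cmath.exp(1j * k * radius_m * math.cos(theta - 2.0 * math.pi * m / n_mics))
        for m in range(n_mics)
    ]

def beam_response(look_deg, source_deg, **array):
    """Magnitude response of a delay-and-sum beam steered at look_deg
    to a plane wave arriving from source_deg (1.0 means no attenuation)."""
    w = steering_vector(look_deg, **array)
    a = steering_vector(source_deg, **array)
    return abs(sum(wm.conjugate() * am for wm, am in zip(w, a))) / len(w)

# A beam steered at 30 degrees passes a source at 30 degrees unattenuated
# and attenuates a source arriving from the opposite direction.
print(round(beam_response(30.0, 30.0), 3))
print(round(beam_response(30.0, 210.0), 3))
```

Running one such beam per candidate direction gives a set of outputs in which each beam emphasizes the source closest to its look direction.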
Useful References