## Python Speaker Diarization
## Spectral Clustering Python
## Speaker Diarization Spectral Clustering
# Auto-Tuning Spectral Clustering for Speaker Diarization Using Normalized Maximum Eigengap
<img src="./pics/adj_mat.png" width="40%" height="40%">
<img src="./pics/gp_vs_nme.png" width="40%" height="40%">
* Code for the IEEE Signal Processing Letters (SPL) paper ["Auto-Tuning Spectral Clustering for Speaker Diarization Using Normalized Maximum Eigengap"](https://github.com/tango4j/Auto-Tuning-Spectral-Clustering/blob/master/2019_SPL_ATNMESC_tjp.pdf)
* [April/8th/2022] NME-SC (auto-tuning spectral clustering) is now maintained in [NVIDIA NeMo, nmesc_clustering.py](https://github.com/NVIDIA/NeMo/blob/c7a5a33b37c2202e1e0af7ad0eafa7b9c68f1367/nemo/collections/asr/parts/utils/nmesc_clustering.py). NeMo is an open-source conversational AI toolkit. The NeMo version of NME-SC clustering is faster and more accurate, and GPU-based NME-SC is supported.
* You can test speaker diarization with freely accessible pre-trained embeddings and the NME-SC clustering algorithm; check out the [NeMo Speaker Diarization document](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/speaker_diarization/intro.html). In addition, speaker diarization can be combined with ASR.
* Bibtex:

        @article{park2019auto,
          title={Auto-Tuning Spectral Clustering for Speaker Diarization Using Normalized Maximum Eigengap},
          author={Park, Tae Jin and Han, Kyu J and Kumar, Manoj and Narayanan, Shrikanth},
          journal={IEEE Signal Processing Letters},
          year={2019},
          publisher={IEEE}
        }
* Spectral clustering with an auto-tuning approach for speaker diarization tasks.
* Based on Kaldi binaries, Python, and Bash scripts.
## Features of the Auto-Tuning NME-SC Method
The proposed auto-tuning NME-SC method (a rough sketch of the idea follows this list):
* does not need to be tuned on a dev set (unlike PLDA-AHC);
* only requires speaker embeddings (no PLDA or supervised method for distance measurement);
* also estimates the number of speakers in the given session;
* shows better performance than the AHC+PLDA method in general (see the tables below).
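Conceptually, the auto-tuning works as follows: for each candidate pruning parameter p, the affinity matrix is pruned, the eigengap of the resulting graph Laplacian is inspected, and the p whose pruning level is smallest relative to its normalized maximum eigengap is selected; the index of the maximum eigengap also gives the estimated speaker count. The code below is a simplified sketch of this idea under assumed choices (row-wise top-p pruning, an unnormalized Laplacian, and a fixed search grid), not the repository's exact implementation:

```python
import numpy as np

def prune_affinity(A, p_frac):
    """Keep only the largest p_frac fraction of entries in each row, zero out the rest,
    then symmetrize. (Illustrative pruning rule; the repo's exact rule may differ.)"""
    n = A.shape[0]
    k = max(1, int(np.ceil(p_frac * n)))
    P = np.zeros_like(A)
    for i in range(n):
        idx = np.argsort(A[i])[-k:]            # indices of the k largest similarities in row i
        P[i, idx] = A[i, idx]
    return 0.5 * (P + P.T)                     # symmetrize the pruned matrix

def eigengap_stats(A_pruned, max_speaker=8):
    """Return the maximum eigengap (normalized by the largest eigenvalue)
    among the smallest eigenvalues, and the speaker count it implies."""
    L = np.diag(A_pruned.sum(axis=1)) - A_pruned   # unnormalized graph Laplacian
    eigvals = np.linalg.eigvalsh(L)                # eigenvalues in ascending order
    gaps = np.diff(eigvals[:max_speaker + 1])      # gaps among the smallest eigenvalues
    g = np.max(gaps) / (eigvals[-1] + 1e-10)       # normalized maximum eigengap
    est_num_spk = int(np.argmax(gaps)) + 1
    return g, est_num_spk

def nme_auto_tune(A, p_grid=np.linspace(0.05, 0.5, 20), max_speaker=8):
    """Pick the pruning fraction p that minimizes the ratio r(p) = p / g(p)."""
    best = None
    for p in p_grid:
        g, n_spk = eigengap_stats(prune_affinity(A, p), max_speaker)
        r = p / (g + 1e-10)
        if best is None or r < best[0]:
            best = (r, p, n_spk)
    return best[1], best[2]                        # chosen pruning fraction, estimated #speakers
```

The chosen pruning level and speaker count would then feed a standard spectral clustering step on the pruned affinity matrix.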
## Performance Table
* All results are based on the x-vector speaker embeddings from [[1]](https://kaldi-asr.org/models/m6)
* Cosine distance (COS) is used as the distance measure. [[2]](https://github.com/tango4j/Auto-Tuning-Spectral-Clustering/blob/master/2019_SPL_ATNMESC_tjp.pdf)
* Sparse-Search-20 is the current version of the code; it is considerably faster (2 to 10 times) and usually shows similar performance.
### Track 1: Oracle VAD
* SAD is not applied; oracle speaker segments are extracted from the ground truth to factor out SAD performance.
| System | CALLHOME | CHAES-eval | CH109 | RT03(SW) | AMI |
| ----------------------------------------------|:---------:|:------:|:--------:|:------:|:------:|
| Kaldi PLDA + AHC [[1]](https://david-ryan-snyder.github.io/2018/05/04/model_callhome_diarization_v2.html) | 8.39% | 24.27% | 9.72% | 1.73% | - % |
| Spectral Clustering COS+B-SC [[2]](https://github.com/tango4j/Auto-Tuning-Spectral-Clustering/blob/master/2019_SPL_ATNMESC_tjp.pdf) | 8.78% | 4.4% | 2.25% | __0.88%__ | - % |
| __Auto-Tuning COS+NME-SC__ [[2]](https://github.com/tango4j/Auto-Tuning-Spectral-Clustering/blob/master/2019_SPL_ATNMESC_tjp.pdf) | 7.29% | 2.48% | 2.63% | 2.21% | - % |
| __Auto-Tuning COS+NME-SC Sparse-Search-20__ [[2]](https://github.com/tango4j/Auto-Tuning-Spectral-Clustering/blob/master/2019_SPL_ATNMESC_tjp.pdf)| __7.24%__ | __2.48%__ | __2.00%__ | 0.92% | 4.21% |
### Track 2: System VAD
- SAD output is obtained with the [ASpIRE SAD Model](https://kaldi-asr.org/models/m1).
- Results are reported as Speaker Error (%), with Total DER (%) in parentheses.
| System | CALLHOME | CHAES-eval | CH109 | RT03(SW) |
| ----------------------------------------------|:--------------:|:-------------:|:------------:|:-------------:|
| Kaldi PLDA + AHC [[1]](https://david-ryan-snyder.github.io/2018/05/04/model_callhome_diarization_v2.html) | 6.64% <br> (12.96%) | 1.45% <br> (5.52%) | 2.6% <br> (6.89%) | 0.99% <br> (3.53%) |
| Spectral Clustering COS+B-SC [[2]](https://github.com/tango4j/Auto-Tuning-Spectral-Clustering/blob/master/2019_SPL_ATNMESC_tjp.pdf) | 6.91% <br> (13.23%) | 1.00% <br> (5.07%) | 1.46% <br> (5.75%)| **0.56% <br> (3.1%)** |
| __Auto-Tuning COS+NME-SC__ [[2]](https://github.com/tango4j/Auto-Tuning-Spectral-Clustering/blob/master/2019_SPL_ATNMESC_tjp.pdf) | **5.41% <br> (11.73%)** | **0.97% <br> (5.04%)** | **1.32% <br> (5.61%)**| 0.59% <br> (3.13%) |
| __Auto-Tuning COS+NME-SC Sparse-Search-20__ [[2]](https://github.com/tango4j/Auto-Tuning-Spectral-Clustering/blob/master/2019_SPL_ATNMESC_tjp.pdf) | 5.55% <br> (11.87%) | 1.00% <br> (5.06%) | 1.42% <br> (5.72%) | 0.58% <br> (3.13%) |
### Datasets
__CALLHOME__ NIST SRE 2000 (LDC2001S97): The most widely used diarization dataset.
__CHAES-eval__ CALLHOME American English Subset (CHAES) (LDC97S42): An English corpus for speaker diarization, split into train/valid/eval sets.
__CH-109__ (LDC97S42): The sessions in CHAES that contain exactly 2 speakers. Usually tested with the number of speakers provided.
__RT03(SW)__ (LDC2007S10): The Switchboard portion of the RT03 dataset.
### Reference
[1] [PLDA + AHC, Callhome Diarization Xvector Model](https://david-ryan-snyder.github.io/2018/05/04/model_callhome_diarization_v2.html)
[2] Tae Jin Park et al., [Auto-Tuning Spectral Clustering for Speaker Diarization Using Normalized Maximum Eigengap](https://github.com/tango4j/Auto-Tuning-Spectral-Clustering/blob/master/2019_SPL_ATNMESC_tjp.pdf), IEEE Signal Processing Letters, 2019
## Getting Started
### TL;DR: One-click demo script
* [**virtualenv**](https://docs.python-guide.org/dev/virtualenvs/) should be installed on your machine.
* _run_demo_clustering.sh_ installs a virtualenv and runs a spectral clustering example.
* This script runs two utterances from the [CALLHOME](https://catalog.ldc.upenn.edu/LDC2001S97) dataset with precomputed segment files and affinity matrices in the ./sample_CH_xvector folder.
```bash
source run_demo_clustering.sh
```
### Prerequisites
* This repo is based on Python 3.7.
* The main required Python 3 libraries are:
```
joblib==0.14.0
numpy==1.17.4
scikit-learn==0.22
scipy==1.3.3
kaldi_io==0.9.1
```
* [Kaldi](https://kaldi-asr.org/doc/about.html) is required to reproduce the numbers in the paper. Go to [Kaldi install](http://jrmeyer.github.io/asr/2016/01/26/Installing-Kaldi.html) to install Kaldi software.
* [Kaldi](https://kaldi-asr.org/doc/about.html) should be installed in your home folder `~/kaldi` to be successfully loaded.
* You can still run the clustering algorithm without [Kaldi](http://jrmeyer.github.io/asr/2016/01/26/Installing-Kaldi.html) by saving your affinity matrices as .npy files (a sketch of this route follows this list).
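For the Kaldi-free route mentioned above, the snippet below shows one way to turn per-utterance speaker embeddings into cosine-similarity affinity matrices stored as .npy files, plus a scores.txt list in the format used by --distance_score_file (see the Arguments section). The embedding shapes and file names here are illustrative assumptions:

```python
import os
import numpy as np

# Hypothetical per-utterance x-vector matrices of shape (num_segments, embedding_dim).
embeddings = {"iaaa": np.random.randn(76, 512), "iafq": np.random.randn(62, 512)}

with open("scores.txt", "w") as f:
    for utt_id, X in embeddings.items():
        X = X / np.linalg.norm(X, axis=1, keepdims=True)  # L2-normalize each segment embedding
        affinity = X @ X.T                                 # N-by-N cosine similarity matrix
        npy_path = os.path.abspath(f"{utt_id}.npy")
        np.save(npy_path, affinity)
        f.write(f"{utt_id} {npy_path}\n")                  # <utt_id> <absolute path to .npy>
```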
### Installing
First, install [**virtualenv**](https://docs.python-guide.org/dev/virtualenvs/) on your machine with the following command:
```
sudo pip3 install virtualenv
```
Once virtualenv is installed, run the `install_venv.sh` script to create a virtual environment.
```
source install_venv.sh
```
This command creates a folder named `env_nmesc`.
### Usage Example
You need to prepare the following:
1. **Segmentation files** in Kaldi-style format, `<segment_id> <utt_id> <start_time> <end_time>` (a minimal generator sketch is given after this list).
ex) segments
```
iaaa-00000-00327-00000000-00000150 iaaa 0 1.5
iaaa-00000-00327-00000075-00000225 iaaa 0.75 2.25
iaaa-00000-00327-00000150-00000300 iaaa 1.5 3
...
iafq-00000-00272-00000000-00000150 iafq 0 1.5
iafq-00000-00272-00000075-00000225 iafq 0.75 2.25
iafq-00000-00272-00000150-00000272 iafq 1.5 2.72
```
2. **Affinity matrix files** in Kaldi scp/ark format: Each affinity matrix should be an N-by-N square matrix.
3. **Speaker embedding files**: If you don't have affinity matrices, you can compute cosine-similarity .ark files using _./sc_utils/score_embedding.sh_. See the run_demo_clustering.sh file for how to calculate the cosine-similarity files. (You can choose scp/ark or npy.)
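As referenced in item 1 above, the sketch below writes uniform sliding-window segments (1.5 s windows with a 0.75 s shift, matching the example segments) in the `<segment_id> <utt_id> <start_time> <end_time>` layout. The segment-id encoding and window parameters are illustrative assumptions, not the repository's exact convention:

```python
def write_segments(utt_id, duration, path, win=1.5, shift=0.75):
    """Write Kaldi-style 'segments' lines: <segment_id> <utt_id> <start_time> <end_time>."""
    with open(path, "w") as f:
        start = 0.0
        while start < duration:
            end = min(start + win, duration)
            # Illustrative segment id; any id that is unique per line works.
            seg_id = f"{utt_id}-{int(round(start * 100)):08d}-{int(round(end * 100)):08d}"
            f.write(f"{seg_id} {utt_id} {start:g} {end:g}\n")
            if end >= duration:
                break
            start += shift

# Writes the three 'iaaa' example windows shown above (0-1.5, 0.75-2.25, 1.5-3).
write_segments("iaaa", 3.0, "segments")
```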
#### Running the python code with arguments:
```bash
python spectral_opt.py --distance_score_file $DISTANCE_SCORE_FILE \
--threshold $threshold \
--score-metric $score_metric \
--max_speaker $max_speaker \
--spt_est_thres $spt_est_thres \
--segment_file_input_path $SEGMENT_FILE_INPUT_PATH \
--spk_labels_out_path $SPK_LABELS_OUT_PATH \
--reco2num_spk $reco2num_spk
```
#### Arguments:
* **distance_score_file**: A list of affinity matrix files.
```
# If you want to use kaldi .ark score file as an affinity matrix
DISTANCE_SCORE_FILE=$PWD/sample_CH_xvector/cos_scores/scores.scp
# If you want to use .npy numpy file as an affinity matrix
DISTANCE_SCORE_FILE=$PWD/sample_CH_xvector/cos_scores/scores.txt
```
Two options are available:
(1) scores.scp: a Kaldi-style scp file that contains the absolute path to each .ark file and its byte offset, as space-separated \<utt_id\> and \<path\> pairs.
ex) scores.scp
```
iaaa /path/sample_CH_xvector/cos_scores/scores.1.ark:5
iafq /path/sample_CH_xvector/cos_scores/scores.1.ark:23129
...
```
(2) scores.txt: a list of \<utt_id\> and the absolute path to the corresponding .npy file.
ex) scores.txt
```
iaaa /path/sample_CH_xvector/cos_scores/iaaa.npy
iafq /path/sample_CH_xvector/cos_scores/iafq.npy
...
```
* **score-metric**: Use 'cos' for affinity matrices based on cosine similarity.
ex)
```bash
score_metric='cos'
```
* **max_speaker**: If you do not provide the oracle number of speakers (reco2num_spk), the estimated number of speakers is capped at _max_speaker_. Default is 8.
```bash
max_speaker=8
```
* **threshold**: Manually set a threshold that is applied to all utterances. This should be set in conjunction with **spt_est_thres**.
ex)
```bash
threshold=0.05
```
* **spt_est_thres**: Determines how the threshold is obtained. Three options are available:
```bash
# You can specify a threshold.
spt_est_thres='None'
threshold=0.05
# Or you can use NMESC in the paper to estimate the threshold.
spt_est_thres='NMESC'
threshold='None'
# Or you can specify different threshold for each utterance.
spt_est_thres="thres_utts.txt"
threshold='None'
```
thres_utts.txt has the following format: `<utt_id> <threshold>`
ex) thres_utts.txt
```
iaaa 0.105
iafq 0.215
...
```
* **segment_file_input_path**: The "segments" file in Kaldi format. This file is also necessary for creating the RTTM file and calculating DER.
```bash
segment_file_input_path=$PWD/sample_CH_xvector/xvector_embeddings/segments
```
ex) segments
```
iaaa-00000-00327-00000000-00000150 iaaa 0 1.5
iaaa-00000-00327-00000075-00000225 iaaa 0.75 2.25
iaaa-00000-00327-00000150-00000300 iaaa 1.5 3
...
iafq-00000-00272-00000000-00000150 iafq 0 1.5
iafq-00000-00272-00000075-00000225 iafq 0.75 2.25
iafq-00000-00272-00000150-00000272 iafq 1.5 2.72
```
* **reco2num_spk**: A file listing the oracle number of speakers for each utterance. Default is 'None'.
```bash
reco2num_spk='None'
reco2num_spk='oracle_num_of_spk.txt'
```
The text file must contain `<utt_id>` and `<oracle_number_of_speakers>` pairs:
ex) oracle_num_of_spk.txt
```
iaaa 2
iafq 2
iabe 4
iadf 6
...
```
### Cosine similarity calculator script
#### Running the cosine similarity calculation script:
```
data_dir=$PWD/sample_CH_xvector
pushd $PWD/sc_utils
text_yellow_info "Starting Script: affinity_score.py"
./score_embedding.sh --cmd "run.pl --mem 5G" \
--score-metric $score_metric \
$data_dir/xvector_embeddings \
$data_dir/cos_scores
popd
```
* **cmd**: Follows the Kaldi-style job submission argument (e.g., "run.pl --mem 5G").
* **score-metric**: Should be 'cos' for cosine similarity.
```
score_metric='cos'
```
* **\<src-embedding-dir\>**: Source embedding directory where the xvector.scp and xvector.ark files are located.
* **\<out-dir\>**: Output directory. The script creates a scores.1.ark file and a scores.scp file (see the snippet below for a quick way to inspect them).
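To sanity-check the generated scores, you can read the affinity matrices back in Python with the kaldi_io package listed in the prerequisites. The scp path below is an assumption based on the demo folder layout:

```python
import kaldi_io  # kaldi_io==0.9.1, listed in the prerequisites

# Iterate over the affinity matrices referenced by the generated scores.scp.
for utt_id, affinity in kaldi_io.read_mat_scp("sample_CH_xvector/cos_scores/scores.scp"):
    print(utt_id, affinity.shape)  # each matrix should be square, e.g. (76, 76) for iaaa
```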
### Expected output result of one-click script
```bash
$ source run_demo_clustering.sh
=== [INFO] The python_envfolder exists: /.../Auto-Tuning-Spectral-Clustering/env_nmesc
=== [INFO] Cosine similariy scores exist: /.../Auto-Tuning-Spectral-Clustering/sample_CH_xvector/cos_scores
=== [INFO] Running Spectral Clustering with .npy input...
=== [INFO] .scp file and .ark files were provided
Scanning eig_ratio of length [19] mat size [76] ...
1 score_metric: cos affinity matrix pruning - threshold: 0.105 key: iaaa Est # spk: 2 Max # spk: 8 MAT size : (76, 76)
Scanning eig_ratio of length [15] mat size [62] ...
2 score_metric: cos affinity matrix pruning - threshold: 0.194 key: iafq Est # spk: 2 Max # spk: 8 MAT size : (62, 62)
Method: Spectral Clustering has been finished
=== [INFO] Computing RTTM
=== [INFO] RTTM calculation was successful.
=== [INFO] NMESC auto-tuning | Total Err. (DER) -[ 0.32 % ] Speaker Err. [ 0.32 % ]
=== [INFO] .scp file and .ark files were provided
1 score_metric: cos affinity matrix pruning - threshold: 0.050 key: iaaa Est # spk: 2 Max # spk: 8 MAT size : (76, 76)
2 score_metric: cos affinity matrix pruning - threshold: 0.050 key: iafq Est # spk: 5 Max # spk: 8 MAT size : (62, 62)
Method: Spectral Clustering has been finished
=== [INFO] Computing RTTM
=== [INFO] RTTM calculation was successful.
=== [INFO] Threshold 0.05 | Total Err. (DER) -[ 20.57 % ] Speaker Err. [ 20.57 % ]
Loading reco2num_spk file: reco2num_spk
=== [INFO] .scp file and .ark files were provided
1 score_metric: cos Rank based pruning - RP threshold: 0.0500 key: iaaa Given Number of Speakers (reco2num_spk): 2 MAT size : (76, 76)
2 score_metric: cos Rank based pruning - RP threshold: 0.0500 key: iafq Given Number of Speakers (reco2num_spk): 2 MAT size : (62, 62)
Method: Spectral Clustering has been finished
=== [INFO] Computing RTTM
=== [INFO] RTTM calculation was successful.
=== [INFO] Known Num. Spk. | Total Err. (DER) -[ 0.15 % ] Speaker Err. [ 0.15 % ]
```
## Authors
* Tae Jin Park: [email protected], [email protected]
* Kyu J. Han
* Manoj Kumar
* Shrikanth Narayanan