---
title: "Analysis of the pretrained powerset speaker diarization model"
about:
template: marquee
links:
- icon: github
text: Github
      href: https://github.com/FrenchKrab/IS2024-powerset-calibration
- icon: book
text: Google Scholar
href: https://scholar.google.com/citations?user=7gJ465gAAAAJ
---
# Raw results table
We could not include the raw results table in the paper, so we show it here, along with some additional metrics (Expected Calibration Error computed with different binning schemes and bin counts). The first group of rows covers the full datasets, the second the eleven DIHARD 3 domains. It is pretty clear that the binning used to compute the ECE does not have a large impact on the metric.
::: {.callout-warning title="About the reported DERs"}
Do note that the DER given here is the **local** diarization error rate. It **cannot be compared to DERs usually reported in the literature**! Since the powerset speaker diarization model works on local windows of a few seconds (5 seconds in our case), we compute and sum the DER components on each of these windows. There is no clustering involved here (or in any DER we provide), so it cannot be interpreted as the final pipeline DER.
:::
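To make the window-level accounting concrete, here is a minimal sketch of the frame-level DER components behind such a local DER. This is illustrative only, not our exact evaluation code: `local_der_components` is a hypothetical helper, and it assumes binary `(frames, speakers)` activity matrices whose speakers are already aligned within the window.

```python
import numpy as np

def local_der_components(reference, hypothesis):
    """Frame-level DER components for one local window.

    reference, hypothesis: binary (frames, speakers) activity matrices,
    assumed already aligned within the window (no clustering involved).
    Returns (false_alarm, missed, confusion, total_speech) frame counts.
    """
    ref_count = reference.sum(axis=1)   # active reference speakers per frame
    hyp_count = hypothesis.sum(axis=1)  # active hypothesis speakers per frame
    false_alarm = np.maximum(hyp_count - ref_count, 0).sum()
    missed = np.maximum(ref_count - hyp_count, 0).sum()
    # matched speech: frames where the same speaker is active in both
    correct = np.minimum(reference, hypothesis).sum(axis=1)
    confusion = (np.minimum(ref_count, hyp_count) - correct).sum()
    return false_alarm, missed, confusion, ref_count.sum()
```

The local DER is then `(false_alarm + missed + confusion) / total_speech`, with the components summed over all windows before dividing.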
| Dataset | DER (%) | Accuracy (%) | ECE uniform 10 bins (%) | ECE uniform 20 bins (%) | ECE adaptive 10 bins (%) | ECE adaptive 20 bins (%) |
|:--------------------|----------:|---------------:|--------------------------:|--------------------------:|---------------------------:|---------------------------:|
| AISHELL | 11.86 | 89.10 | 0.39 | 0.48 | 0.50 | 0.50 |
| AMI-SDM | 19.49 | 82.79 | 3.98 | 3.98 | 3.98 | 3.98 |
| AMI | 17.50 | 84.55 | 3.53 | 3.53 | 3.53 | 3.53 |
| AVA-AVD | 34.85 | 81.87 | 4.30 | 4.31 | 4.30 | 4.30 |
| AliMeeting | 19.59 | 79.46 | 3.04 | 3.04 | 3.04 | 3.04 |
| CALLHOME | 22.49 | 77.07 | 2.57 | 2.57 | 2.57 | 2.57 |
| MSDWILD | 20.03 | 80.52 | 2.89 | 2.89 | 2.89 | 2.89 |
| RAMC | 10.69 | 91.12 | 1.67 | 1.67 | 1.67 | 1.67 |
| REPERE | 7.67 | 92.48 | 1.83 | 1.83 | 1.83 | 1.83 |
| VoxConverse | 9.94 | 91.05 | 0.70 | 0.70 | 0.69 | 0.69 |
|                     |           |                |                           |                           |                            |                            |
| audiobooks | 12.22 | 90.44 | 3.22 | 3.26 | 3.28 | 3.28 |
| broadcast interview | 16.77 | 86.90 | 6.50 | 6.50 | 6.44 | 6.44 |
| clinical | 32.15 | 79.48 | 3.94 | 3.98 | 3.93 | 3.94 |
| court | 16.46 | 86.00 | 8.19 | 8.19 | 8.17 | 8.17 |
| cts | 16.47 | 83.68 | 1.37 | 1.38 | 1.37 | 1.37 |
| maptask | 28.15 | 81.25 | 8.20 | 8.20 | 7.97 | 8.08 |
| meeting | 39.70 | 64.26 | 16.63 | 16.63 | 16.63 | 16.63 |
| restaurant | 45.82 | 54.11 | 14.31 | 14.31 | 14.31 | 14.31 |
| socio field | 21.74 | 82.45 | 2.65 | 2.65 | 2.65 | 2.65 |
| socio lab | 22.06 | 82.60 | 4.36 | 4.36 | 4.24 | 4.33 |
| webvideo | 40.01 | 69.75 | 10.52 | 10.52 | 10.52 | 10.52 |
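For reference, here is a minimal sketch of how the two ECE variants in the table can be computed (the function name and details are illustrative, not our exact evaluation code): uniform binning uses equal-width bins on [0, 1], while adaptive binning places bin edges at confidence quantiles so that each bin holds roughly the same number of samples.

```python
import numpy as np

def expected_calibration_error(confidence, correct, n_bins=10, binning="uniform"):
    """ECE: bin-weighted gap between mean confidence and accuracy.

    confidence: (N,) top predicted probability per frame.
    correct: (N,) 0/1, whether the top prediction matches the target.
    binning="uniform" uses equal-width bins on [0, 1];
    binning="adaptive" uses equal-mass (quantile) bins instead.
    """
    confidence = np.asarray(confidence, dtype=float)
    correct = np.asarray(correct, dtype=float)
    if binning == "uniform":
        edges = np.linspace(0.0, 1.0, n_bins + 1)
    else:
        edges = np.quantile(confidence, np.linspace(0.0, 1.0, n_bins + 1))
    # assign each sample to a bin (inner edges only, so indices fall in 0..n_bins-1)
    idx = np.clip(np.digitize(confidence, edges[1:-1]), 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            gap = abs(confidence[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap
    return ece
```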
# DER / ECE scatter plot
The paper contains two scatter plots for DER / ECE. Here we group all datasets and domains into a single scatter plot. Feel free to zoom in and to filter the in-domain/out-of-domain datasets.
<!-- 09c_view_calibration_eval.ipynb -->
```{=html}
<iframe width=800px height=600 src="site_media/calibration/ece_der_scatter.html" scrolling="no"></iframe>
```
# Reliability diagrams
Here are reliability diagrams for all 11 DIHARD 3 domains. The paper only shows uniform binning, but we also provide diagrams for adaptive binning.
We put the figures under foldable sections since they take up a lot of vertical space.
## Uniform binning with 10 bins
<!-- 09c_view_calibration_eval.ipynb with BINNING_METHOD='uniform' -->
::: {.callout-note appearance="detail" collapse=true title="Using uniform binning with 10 bins"}
![](site_media/calibration/reliability_uniform10bins.png)
:::
<!-- 09c_view_calibration_eval.ipynb with BINNING_METHOD='adaptive' -->
## Adaptive binning with 10 bins
Note that the X axis is far from linear: since most predictions are confident, the higher bins contain very similar confidence values.
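A toy demonstration of this effect, using a made-up confidence distribution concentrated near 1.0 (not the model's actual predictions):

```python
import numpy as np

# When most confidences sit near 1.0, equal-mass (quantile) bin edges
# crowd toward 1.0 as well, which is why the adaptive reliability
# diagram's x-axis looks so non-linear.
rng = np.random.default_rng(0)
confidence = np.clip(1.0 - rng.exponential(scale=0.05, size=10_000), 0.0, 1.0)
edges = np.quantile(confidence, np.linspace(0.0, 1.0, 11))
widths = np.diff(edges)
# the top bins end up far narrower than the bottom one
```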
::: {.callout-note appearance="detail" collapse=true title="Using adaptive binning with 10 bins"}
![](site_media/calibration/reliability_adaptive10bins.png)
:::
# Analysis of low-confidence regions
We sample low-confidence regions of the data (left column) and random regions (right column), and compare the composition of the data as well as the model's performance. As before, we provide figures for all DIHARD domains instead of a select few.
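One simple way to rank fixed-size windows by confidence is sketched below; the helper name is hypothetical, and the exact selection criterion used in our experiments may differ.

```python
import numpy as np

def lowest_confidence_windows(confidence, window_frames, k):
    """Split a per-frame confidence track into fixed-size windows and
    return the indices of the k windows with the lowest mean confidence."""
    confidence = np.asarray(confidence, dtype=float)
    n = len(confidence) // window_frames
    windows = confidence[: n * window_frames].reshape(n, window_frames)
    return np.argsort(windows.mean(axis=1))[:k]
```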
## Data composition
<!-- 21_selected_al_analysis.ipynb -->
::: {.callout-note appearance="detail" collapse=true title="Data composition of low-confidence regions"}
![](site_media/calibration/data_composition.png)
:::
## Model performance (DER)
<!-- 21_selected_al_analysis_der.ipynb -->
::: {.callout-note appearance="detail" collapse=true title="DER on low-confidence regions"}
![](site_media/calibration/der_analysis.png)
:::
# Reproducibility
Pretrained model checkpoint downloads:
- [Github](https://github.com/FrenchKrab/IS2024-powerset-calibration/tree/master/data/calibration/pretrained@epoch109.ckpt)
- [HuggingFace (mirror)](https://huggingface.co/aplaquet/IS2024-powerset-calibration/blob/main/pretrained%40epoch109.ckpt)
Composition of the training dataset:
- [pyannote.database protocol specifications](https://github.com/FrenchKrab/IS2024-powerset-calibration/tree/master/data/calibration/database.yml)
Parquet inference files, containing model probabilities and targets for all of the datasets:
- [.parquet inference files](https://huggingface.co/aplaquet/IS2024-powerset-calibration/tree/main/model_inference)