Model information
This package contains the final model trained in "CTC-Segmentation of Large Corpora for German End-to-End Speech Recognition".
1460 h of German speech datasets were used to train this model:
- Tuda-DE train set
- CTC segmented Librivox
- CTC segmented Spoken Wikipedia Corpus
- Mozilla Commonvoice
The Tuda-DE test set was used as the test set. Decoding results:
german.transformer.v1/exp/train_tuda_commonvoice_libriswc_pytorch_train/decode_test/result.txt
| SPKR | # Snt # Wrd | Corr Sub Del Ins Err S.Err |
| Sum/Avg | 4100 228916 | 92.9 4.5 2.6 2.0 9.0 73.4 |
german.transformer.v1/exp/train_tuda_commonvoice_libriswc_pytorch_train/decode_test/result.wrd.txt
| SPKR | # Snt # Wrd | Corr Sub Del Ins Err S.Err |
| Sum/Avg | 4100 69600 | 89.0 9.5 1.5 1.8 12.8 71.7 |
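The Err column in these sclite-style rows is the sum of substitutions, deletions and insertions. The following is a minimal Python sketch that parses the two Sum/Avg rows above and recomputes that sum. The helper name `parse_sum_avg` and the field names are illustrative and not part of ESPnet or sclite. It also assumes that result.txt is scored per character and result.wrd.txt per word, which the token counts (228916 vs. 69600) suggest but the package does not state here.

```python
def parse_sum_avg(line):
    """Parse one sclite 'Sum/Avg' row into a dict of named numeric fields."""
    # Replace the pipe separators with spaces, then drop the 'Sum/Avg' label.
    fields = line.replace("|", " ").split()
    numbers = [float(x) for x in fields[1:]]
    # Hypothetical field names, chosen to mirror the table header.
    names = ["snt", "wrd", "corr", "sub", "del", "ins", "err", "s_err"]
    return dict(zip(names, numbers))


# Rows copied verbatim from the result files listed above.
char_row = parse_sum_avg("| Sum/Avg | 4100 228916 | 92.9 4.5 2.6 2.0 9.0 73.4 |")
word_row = parse_sum_avg("| Sum/Avg | 4100 69600 | 89.0 9.5 1.5 1.8 12.8 71.7 |")

for label, row in (("result.txt", char_row), ("result.wrd.txt", word_row)):
    # Sub + Del + Ins, rounded to the one decimal place used in the table.
    recomputed = round(row["sub"] + row["del"] + row["ins"], 1)
    print(label, "reported Err:", row["err"], "Sub+Del+Ins:", recomputed)
```

The recomputed sum can differ from the reported Err by 0.1, because the table shows percentages that were each rounded separately.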
More information about the data, Transformer model configuration and CTC segmentation can be found in the paper.
One remark: the tokens in the dictionary were derived from the Tuda-DE cleaned version of the train set. Some of those tokens may be unusual for the German language; see, e.g., data/lang_char/input.txt. So this model is not perfect, but it already yields acceptable results on the Tuda-DE task.
Requirements
This model was trained with ESPnet API version 1 and successfully tested with ESPnet version 0.9.4. Use ESPnet's utils/recog_wav.sh for speech recognition.
The format of the input wav file shall be:
Channels : 1
Sample Rate : 16000
Precision : 16-bit
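
Before decoding, it may help to verify that a wav file matches the format above. The following is a minimal sketch using Python's standard `wave` module. The function name `check_wav_format` and the file name in the usage comment are illustrative and not part of ESPnet. It assumes an uncompressed PCM wav file.

```python
import wave


def check_wav_format(path):
    """Return a list of mismatches against the required input format."""
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
        sample_rate = wav.getframerate()
        bits = wav.getsampwidth() * 8  # sample width is reported in bytes

    problems = []
    if channels != 1:
        problems.append(f"expected 1 channel, found {channels}")
    if sample_rate != 16000:
        problems.append(f"expected 16000 Hz sample rate, found {sample_rate}")
    if bits != 16:
        problems.append(f"expected 16-bit precision, found {bits}-bit")
    return problems


# Usage (hypothetical file name):
#   problems = check_wav_format("example.wav")
#   print("OK" if not problems else "; ".join(problems))
```

An empty list means the file has one channel, a 16000 Hz sample rate and 16-bit samples. Files that do not match need to be converted with an audio tool before being passed to utils/recog_wav.sh.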