German ESPnet 1 Transformer ASR model

@lumaku lumaku released this 24 Aug 13:52
· 4 commits to master since this release

Model information

This package contains the final model trained in "CTC-Segmentation of Large Corpora for German End-to-End Speech Recognition".

This model was trained on 1460 h of German speech from the following datasets:

  • Tuda-DE train set

  • CTC segmented Librivox

  • CTC segmented Spoken Wikipedia Corpus

  • Mozilla Common Voice

The Tuda-DE test set was used as the test set. Decoding results:

german.transformer.v1/exp/train_tuda_commonvoice_libriswc_pytorch_train/decode_test/result.txt
    | SPKR     | # Snt   # Wrd | Corr    Sub    Del    Ins    Err  S.Err |
    | Sum/Avg  | 4100   228916 | 92.9    4.5    2.6    2.0    9.0   73.4 |
german.transformer.v1/exp/train_tuda_commonvoice_libriswc_pytorch_train/decode_test/result.wrd.txt
    | SPKR     | # Snt   # Wrd | Corr    Sub    Del    Ins    Err  S.Err |
    | Sum/Avg  | 4100    69600 | 89.0    9.5    1.5    1.8   12.8   71.7 |
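
The Err column in these score tables is the sum of substitutions, deletions and insertions. The following minimal Python sketch re-derives it from the two Sum/Avg rows above. The helper name `parse_sum_avg` and the column keys are illustrative assumptions, not part of ESPnet, and the reading of result.txt as token-level and result.wrd.txt as word-level scores is the usual ESPnet convention rather than something stated in this release.

```python
def parse_sum_avg(line):
    """Parse a 'Sum/Avg' row of the score table into a dict of named columns."""
    # Drop the table borders and split on whitespace.
    fields = line.replace("|", " ").split()
    # fields[0] is the 'Sum/Avg' label; the remaining fields are numbers.
    names = ["snt", "wrd", "corr", "sub", "del", "ins", "err", "s_err"]
    values = [float(x) for x in fields[1:]]
    return dict(zip(names, values))


# Rows copied from result.txt and result.wrd.txt above.
token_row = parse_sum_avg(
    "| Sum/Avg  | 4100   228916 | 92.9    4.5    2.6    2.0    9.0   73.4 |"
)
word_row = parse_sum_avg(
    "| Sum/Avg  | 4100    69600 | 89.0    9.5    1.5    1.8   12.8   71.7 |"
)

for name, row in [("result.txt", token_row), ("result.wrd.txt", word_row)]:
    # Error rate = substitutions + deletions + insertions (all in percent).
    recomputed = row["sub"] + row["del"] + row["ins"]
    print(f"{name}: reported Err = {row['err']}, Sub + Del + Ins = {recomputed:.1f}")
```

For result.txt the recomputed sum is 9.1 against a reported 9.0; a difference of 0.1 is consistent with each column having been rounded to one decimal place independently.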

More information about the data, Transformer model configuration and CTC segmentation can be found in the paper.

One remark: the tokens in the dictionary were derived from the Tuda-DE cleaned version of the train set. Some of these tokens may be unusual for the German language; see, e.g., data/lang_char/input.txt. The model is therefore not perfect, but it already yields acceptable results on the Tuda-DE task.

Requirements

This model was trained with ESPnet API version 1 and successfully tested with ESPnet version 0.9.4. Use ESPnet's utils/recog_wav.sh for speech recognition.
The input wav file must have the following format:

Channels       : 1
Sample Rate    : 16000
Precision      : 16-bit
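
Before decoding, it can help to verify that a recording matches the format listed above. The following is a minimal sketch using only Python's standard `wave` module. The function name `check_wav_format` is a hypothetical helper, not part of ESPnet, and the check assumes an uncompressed PCM wav file.

```python
import wave


def check_wav_format(path):
    """Return a list of mismatches against the required input format.

    An empty list means the file is mono, 16 kHz, 16-bit.
    """
    problems = []
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
        sample_rate = wav.getframerate()
        sample_width = wav.getsampwidth()  # bytes per sample

    if channels != 1:
        problems.append(f"channels: {channels} (expected 1)")
    if sample_rate != 16000:
        problems.append(f"sample rate: {sample_rate} (expected 16000)")
    if sample_width != 2:  # 2 bytes per sample = 16-bit precision
        problems.append(f"precision: {8 * sample_width}-bit (expected 16-bit)")
    return problems
```

Calling `check_wav_format("example.wav")`, where `example.wav` stands in for your own recording, returns an empty list when the file can be passed to the recognition script as is. Otherwise, convert the file with an audio tool of your choice first.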