Alignments on Librivox and Spoken Wikipedia Corpus (SWC) with CTC segmentation:
Dataset | Length | Speakers | Utterances |
---|---|---|---|
SWC | 210h | 363 | 78214 |
Librivox | 804h | 251 | 368532 |
This repository contains pre-processed text and alignments. Both corpora are combined into one recipe; the source audio file and corpus of each utterance can be identified from its file name and utterance ID. The audio files can be downloaded separately:
- SWC: German Spoken Wikipedia Corpus
- Librivox: The audio files can be retrieved via the IDs in the metadata file `books-German.json`. Each `id` can be looked up automatically using the LibriVox API, e.g. https://librivox.org/api/feed/audiobooks/?id=82&format=json , and the audio is then downloaded from the returned URL. As downloading the files separately takes time, an MP3 bundle is also available at the MMK website.
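
The lookup step can be sketched roughly as follows. This is a minimal illustration, not part of the repository: the function names `build_api_url` and `fetch_book_metadata` are hypothetical, and the layout of the JSON response is not assumed here, so the download URL still has to be read from the returned data by hand.

```python
import json
import urllib.request

# Endpoint taken from the example URL above.
API_BASE = "https://librivox.org/api/feed/audiobooks/"


def build_api_url(book_id: int) -> str:
    """Return the LibriVox API URL for one book ID (hypothetical helper)."""
    return f"{API_BASE}?id={book_id}&format=json"


def fetch_book_metadata(book_id: int) -> dict:
    """Query the API and return the parsed JSON response.

    Requires network access. The structure of the response is not
    assumed; inspect it to find the download URL.
    """
    with urllib.request.urlopen(build_api_url(book_id)) as response:
        return json.load(response)
```

For example, `build_api_url(82)` yields the URL shown above, and `fetch_book_metadata(82)` returns the corresponding metadata as a dictionary.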
For Librivox, the naming scheme is `librivox_{book_id}_{chapter}_{utterance_id}`. The separate file `librivox_utt2spk` contains speaker information.
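
As an illustration of the naming scheme, the following sketch splits a Librivox utterance name into its parts. It assumes that `book_id` and `chapter` contain no underscores; the helper `parse_librivox_name` and the example name are hypothetical and not taken from the corpus.

```python
from typing import NamedTuple


class LibrivoxUtterance(NamedTuple):
    book_id: str
    chapter: str
    utterance_id: str


def parse_librivox_name(name: str) -> LibrivoxUtterance:
    """Split 'librivox_{book_id}_{chapter}_{utterance_id}' into its fields.

    Assumption: book_id and chapter contain no underscores, so only the
    first three underscores act as separators.
    """
    parts = name.split("_", 3)
    if len(parts) != 4 or parts[0] != "librivox":
        raise ValueError(f"not a Librivox utterance name: {name!r}")
    return LibrivoxUtterance(*parts[1:])


# Made-up example name, for illustration only.
utt = parse_librivox_name("librivox_82_03_0007")
print(utt.book_id, utt.chapter, utt.utterance_id)
```

The parsed `book_id` is the value that would be passed to the LibriVox API to locate the source audio.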
A pretrained ASR model (Transformer) is in the Releases section of this repository.
A more detailed description can be found in the CTC segmentation paper (on Springer Link, on arXiv).
The repository on GitHub has limited download capacity due to its Git LFS data quota. A GitLab mirror of this repository is available at https://gitlab.com/Lumaku/german-corpus-aligned