HTS-style context labels for the JSUT corpus, usable with speech synthesis systems such as HTS, Merlin, and nnmnkwii. Phonetic and prosodic information is based on manual annotation. Time information is automatically estimated using Julius. Currently, this repository provides labels for the BASIC5000 subset. The pronounced texts and their kana readings are listed in ./text_kana. Input sequences for end-to-end speech synthesis are provided in ./e2e_symbol.
The context labels do not have exactly the same format as those created using OpenJTalk. The following are NOT supported in the labels:
- Unvoiced vowels, which are generally annotated as A, E, I, O, and U.
- Word information (part-of-speech, conjugation type, and inflected form)
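As a rough illustration of how such labels might be read, the sketch below parses one line under the assumption that the files follow the usual HTS layout: a start time and an end time in 100 ns units, followed by a full-context string whose phoneme field has the form `p1^p2-p3+p4=p5`. The helper name `parse_label_line` and the sample line are hypothetical and are not taken from this repository.

```python
import re

# Assumption: times are integers in 100 ns units, as in standard HTS labels.
HTS_UNITS_PER_SECOND = 10_000_000

# Assumption: the phoneme field looks like "p1^p2-p3+p4=p5";
# the current phoneme is the p3 part between "-" and "+".
CURRENT_PHONEME_RE = re.compile(r"\^[^-]+-([^+]+)\+")


def parse_label_line(line):
    """Split one label line into (start_sec, end_sec, phoneme, context)."""
    start, end, context = line.strip().split(maxsplit=2)
    match = CURRENT_PHONEME_RE.search(context)
    if match is None:
        raise ValueError(f"no phoneme field found in: {context!r}")
    return (
        int(start) / HTS_UNITS_PER_SECOND,
        int(end) / HTS_UNITS_PER_SECOND,
        match.group(1),
        context,
    )


# Illustrative line only; real labels carry many more context fields.
sample = "0 3125000 xx^xx-sil+k=o/A:xx+xx+xx\n"
start_sec, end_sec, phoneme, _ = parse_label_line(sample)
print(start_sec, end_sec, phoneme)
```

Because unvoiced vowels are not annotated, a parser along these lines should not expect the uppercase A, E, I, O, U phoneme symbols to appear in these labels.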
The label data is licensed under CC-BY-SA 4.0, etc. See the LICENSE.txt file for details.
- Tomoki Koriyama (Main contributor) (@hyama5)
- Shinnosuke Takamichi
This work was supported by the following grants:
- KAKENHI Grant Number 17K12711
- The GAP foundation program of the University of Tokyo
- JSUT (Japanese speech corpus of Saruwatari-lab., University of Tokyo)
- r9y9/jsut-lab ... provides labels generated automatically using OpenJTalk.
- HMM/DNN-based Speech Synthesis System (HTS) ... provides the label format in its demo scripts.
- Ryosuke Sonobe, Shinnosuke Takamichi, and Hiroshi Saruwatari, "JSUT corpus: free large-scale Japanese speech corpus for end-to-end speech synthesis," arXiv preprint, 1711.00354, Sep. 2017.
- Shinnosuke Takamichi, Ryosuke Sonobe, Kentaro Mitsui, Yuki Saito, Tomoki Koriyama, Naoko Tanji, Hiroshi Saruwatari, "JSUT and JVS: free Japanese voice corpora for accelerating speech synthesis research," Acoustical Science and Technology, Vol.xxx, No.xxx, pp.xxx-xxx, 2020.