Tacotron 2

An RNN-based TTS model introduced by Shen et al. (2017). Tacotron 2 predicts mel spectrogram frames from the characters of the input text. Frames are predicted autoregressively: the prediction of each mel frame depends on the previous one. The model can therefore only predict one frame at a time, which makes inference slow.
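The frame-by-frame dependency can be illustrated with a toy loop. This is only a sketch of the autoregressive pattern, not the actual Tacotron 2 decoder: `toy_decoder_step`, the random weight matrices, the fixed `context` vector and the fixed frame count are all hypothetical stand-ins chosen for illustration.

```python
import numpy as np

def toy_decoder_step(prev_frame, context, W_prev, W_ctx):
    # Hypothetical stand-in for one decoder step: the next frame is a
    # function of the previous frame and a text-derived context vector.
    return np.tanh(W_prev @ prev_frame + W_ctx @ context)

def autoregressive_decode(context, n_frames, n_mels=80, seed=0):
    rng = np.random.default_rng(seed)
    # Random weights, for illustration only (a real model learns these).
    W_prev = rng.normal(scale=0.1, size=(n_mels, n_mels))
    W_ctx = rng.normal(scale=0.1, size=(n_mels, context.shape[0]))

    frame = np.zeros(n_mels)  # all-zero initial frame
    frames = []
    for _ in range(n_frames):
        # Each iteration needs the frame produced by the previous one,
        # so the loop cannot be parallelised over time.
        frame = toy_decoder_step(frame, context, W_prev, W_ctx)
        frames.append(frame)
    return np.stack(frames)

mel = autoregressive_decode(context=np.ones(16), n_frames=5)
print(mel.shape)  # (5, 80)
```

The number of frames is fixed here for simplicity. The point of the sketch is only that frame `t` cannot be computed before frame `t - 1` exists, which is why inference time grows with the length of the output.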

In its day the model was a very strong benchmark. Even after it was surpassed it remained useful, since a trained model provides a mapping between input characters (or phonemes) and mel frames.
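One way such a mapping can be read off, assuming it is available as a matrix of attention weights with one row per mel frame and one column per input character, is to assign each frame to its most-attended character and count frames per character. The function name and the small alignment matrix below are made up for illustration.

```python
import numpy as np

def durations_from_alignment(alignment):
    """Count how many mel frames attend most strongly to each character.

    alignment: array of shape (n_frames, n_chars) with attention weights.
    Returns an integer array of shape (n_chars,).
    """
    n_chars = alignment.shape[1]
    best_char = alignment.argmax(axis=1)  # most-attended character per frame
    return np.bincount(best_char, minlength=n_chars)

# Hypothetical alignment: 6 mel frames over 3 input characters.
alignment = np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.0],
    [0.1, 0.7, 0.2],
    [0.0, 0.2, 0.8],
    [0.0, 0.1, 0.9],
    [0.0, 0.0, 1.0],
])
durations = durations_from_alignment(alignment)
print(durations)  # [2 1 3]
```

The resulting counts sum to the number of frames, so they can be treated as per-character durations.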

Architecture

The model is composed of an RNN encoder and an RNN decoder, with attention between them. Tacotron 2 uses a slightly adjusted attention mechanism that makes the attention weights more continuous throughout the output sequence.
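A minimal sketch of one attention step, assuming the adjustment in question is location-sensitive attention: additive attention whose energies also see convolutional features of the previous step's attention weights. All shapes, parameter names and the random initialisation are hypothetical, and the weights of a single previous step are used here for simplicity.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def location_sensitive_attention(query, memory, prev_weights, params):
    """One attention step.

    query:        (d_q,)   current decoder state
    memory:       (T, d_m) encoder outputs, one row per input position
    prev_weights: (T,)     attention weights from the previous decoder step
    """
    # Location features: 1-D convolutions over the previous attention weights.
    loc = np.stack(
        [np.convolve(prev_weights, k, mode="same") for k in params["kernels"]],
        axis=1,
    )  # (T, n_filters)

    # Additive energies with an extra location term.
    hidden = np.tanh(
        query @ params["W_q"] + memory @ params["W_m"] + loc @ params["W_l"]
    )  # (T, d_a)
    energies = hidden @ params["v"]  # (T,)
    return softmax(energies)

rng = np.random.default_rng(0)
T, d_q, d_m, d_a, n_filters, kernel_size = 6, 8, 8, 4, 2, 3
params = {
    "W_q": rng.normal(size=(d_q, d_a)),
    "W_m": rng.normal(size=(d_m, d_a)),
    "W_l": rng.normal(size=(n_filters, d_a)),
    "v": rng.normal(size=d_a),
    "kernels": rng.normal(size=(n_filters, kernel_size)),
}

prev_weights = np.zeros(T)
prev_weights[0] = 1.0  # previous step attended to the first input position
weights = location_sensitive_attention(
    rng.normal(size=d_q), rng.normal(size=(T, d_m)), prev_weights, params
)
print(weights.shape)  # (6,)
```

Because the energies depend on where the previous step attended, a trained model can learn to prefer positions near the previous focus, which is one way to keep the alignment from jumping around.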
