FastSpeech 2

FastSpeech 2 is the second generation of the TTS (text-to-speech) model FastSpeech, introduced by Ren et al. (2020). The second-generation improvements are:

The model predicts not only length (duration), but also pitch and energy.
No teacher-student training. Mel-spectrograms are learned directly, and phoneme duration is extracted using ...

Architecture

The model architecture is similar to that of the previous generation: two stacks of transformer layers, separated by a 'Variance Adapter', which consists of:

Duration Predictor: predicts the duration of each phoneme (in mel frames); the hidden representation of each phoneme is then repeated according to that duration.
Pitch Predictor: predicts pitch and adds the result back to the hidden representation of each mel frame.
Energy Predictor: does the same for energy.
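
The variance adapter steps listed above can be illustrated with a minimal sketch in plain Python. It is not the reference implementation: the function names `length_regulate` and `add_variance` are hypothetical, and the durations and per-frame embeddings are simply given as inputs here, whereas in the real model they come from learned predictors.

```python
from typing import List, Sequence

Vector = List[float]


def length_regulate(hidden: Sequence[Vector], durations: Sequence[int]) -> List[Vector]:
    """Repeat each phoneme's hidden vector once per mel frame of its duration."""
    if len(hidden) != len(durations):
        raise ValueError("need exactly one duration per phoneme")
    frames: List[Vector] = []
    for vec, d in zip(hidden, durations):
        # A duration of 0 drops the phoneme from the frame sequence.
        frames.extend(list(vec) for _ in range(d))
    return frames


def add_variance(frames: Sequence[Vector], embeddings: Sequence[Vector]) -> List[Vector]:
    """Add a per-frame embedding (e.g. for pitch or energy) to each frame."""
    if len(frames) != len(embeddings):
        raise ValueError("need exactly one embedding per frame")
    return [[h + e for h, e in zip(f, emb)] for f, emb in zip(frames, embeddings)]


# Three phonemes with 2-dimensional hidden vectors (made-up values).
hidden = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
durations = [2, 0, 3]  # assumed to come from a duration predictor

frames = length_regulate(hidden, durations)          # 2 + 0 + 3 = 5 frames
pitch_embeddings = [[0.5, 0.5] for _ in frames]      # assumed pitch embeddings
out = add_variance(frames, pitch_embeddings)
print(len(out), out[0])
```

The output sequence has one vector per mel frame, so the second transformer stack operates at frame resolution rather than phoneme resolution.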

Duration predictor

In FastSpeech 2, the duration predictor is trained on ground-truth durations extracted using the Montreal Forced Aligner (MFA) (TODO: how does this work?).
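
As a rough illustration of how aligner output could become duration targets, the sketch below converts phoneme intervals in seconds into mel-frame counts. It does not call MFA or reproduce its API: the function name `intervals_to_durations`, the `(start, end)` interval format, the sample rate of 22050 Hz, and the hop length of 256 samples are all assumptions chosen for the example.

```python
from typing import List, Sequence, Tuple


def intervals_to_durations(
    intervals: Sequence[Tuple[float, float]],
    sample_rate: int = 22050,   # assumed audio sample rate (Hz)
    hop_length: int = 256,      # assumed hop size between mel frames (samples)
) -> List[int]:
    """Convert (start, end) phoneme intervals in seconds to mel-frame counts.

    Boundaries are rounded to frame indices first, so that for contiguous
    intervals the durations sum to the frame index of the last boundary.
    """
    durations = []
    for start, end in intervals:
        start_frame = round(start * sample_rate / hop_length)
        end_frame = round(end * sample_rate / hop_length)
        durations.append(end_frame - start_frame)
    return durations


# Hypothetical alignment of three phonemes (times in seconds).
alignment = [(0.0, 0.1), (0.1, 0.25), (0.25, 0.3)]
print(intervals_to_durations(alignment))
```

Rounding the boundaries rather than each interval's length avoids accumulating rounding error across a long utterance, which keeps the total number of frames consistent with the length of the mel-spectrogram.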
