FastSpeech 2 is the second generation of the text-to-speech (TTS) model FastSpeech, introduced by Ren et al. (2020). The improvements over the first generation are:
- The model predicts not only duration, but also pitch and energy.
- No teacher-student training. Mel-spectrograms are learned directly from the ground truth, and phoneme durations are extracted using ...
The model architecture is similar to that of the previous generation: two stacks of transformer layers, separated by a 'Variance Adapter', which consists of:
- Duration Predictor: predicts the duration of each phoneme (in mel frames); the hidden representation of each phoneme is then repeated that many times.
- Pitch Predictor: predicts pitch and adds the result back to the hidden representation of each mel frame.
- Energy Predictor: the same as the Pitch Predictor, but for energy.
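
The data flow through the Variance Adapter can be illustrated with a toy sketch. This is not the reference implementation: the function names, the list-based representation, and `toy_embed` are assumptions made for illustration, and the durations and pitch values are given by hand instead of being produced by trained predictors.

```python
def length_regulate(hidden, durations):
    """Repeat each phoneme's hidden vector durations[i] times (one copy per mel frame)."""
    if len(hidden) != len(durations):
        raise ValueError("one duration per phoneme is required")
    frames = []
    for vec, d in zip(hidden, durations):
        frames.extend(list(vec) for _ in range(d))
    return frames


def add_variance(frames, values, embed):
    """Add an embedding of a per-frame scalar (pitch or energy) to each frame."""
    if len(frames) != len(values):
        raise ValueError("one value per frame is required")
    return [
        [h + e for h, e in zip(frame, embed(v))]
        for frame, v in zip(frames, values)
    ]


def toy_embed(value):
    """Hypothetical stand-in for a learned embedding: scalar -> 2-dim vector."""
    return [value, -value]


# Three phonemes with 2-dimensional hidden states.
hidden = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
durations = [2, 1, 3]  # in mel frames; hand-picked, not predicted

frames = length_regulate(hidden, durations)  # 2 + 1 + 3 = 6 frames
pitch = [0.5] * len(frames)                  # hand-picked per-frame pitch
out = add_variance(frames, pitch, toy_embed)
```

The energy step would reuse `add_variance` with per-frame energy values and its own embedding.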
In FastSpeech 2, the Duration Predictor is trained on ground-truth durations extracted with the Montreal Forced Aligner (MFA) (TODO: how does this work?).
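
As a rough sketch of how aligner output could become duration targets, the snippet below converts phoneme intervals into mel-frame counts. It assumes the alignment has already been parsed into `(phoneme, start, end)` tuples in seconds; the function name, the sample rate of 22050 Hz, and the hop size of 256 samples are illustrative assumptions, not values taken from this note.

```python
def intervals_to_durations(intervals, sample_rate=22050, hop_length=256):
    """Convert (phoneme, start_sec, end_sec) intervals to durations in mel frames.

    Boundaries are rounded to the nearest frame index and durations are the
    differences between consecutive boundaries, so for contiguous intervals
    the durations sum to the rounded frame index of the last boundary.
    """
    frames_per_second = sample_rate / hop_length
    durations = []
    for _phoneme, start, end in intervals:
        start_frame = round(start * frames_per_second)
        end_frame = round(end * frames_per_second)
        durations.append(end_frame - start_frame)
    return durations


# Hypothetical alignment of three phonemes (times in seconds).
alignment = [("HH", 0.00, 0.10), ("AH", 0.10, 0.25), ("L", 0.25, 0.30)]
targets = intervals_to_durations(alignment)
```

Rounding the boundaries, not the individual interval lengths, keeps rounding errors from accumulating across a long utterance.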