FastSpeech is a Transformer-based text-to-speech model with faster inference than its autoregressive competition. It was introduced by Ren et al. (2019).
The model is quite old, but I wanted to read the paper to answer two questions that came up while reading FastSpeech 2:
- why predict phoneme duration as part of the model (and in the middle of it)?
- why use 1D convolutions in Transformer layers instead of the typical FFN?
The premise of the model is to avoid slow autoregressive mel spectrogram prediction, making the prediction parallel and much quicker.
The whole architecture is split into two parts:
- phoneme processing
- mel frame generation
Both parts are built from several Transformer layers. A 'Length Regulator' sits between them and maps the phoneme-level hidden states onto mel-frame-level positions.
With parallel prediction, the model needs to resolve the length mismatch between the phoneme and mel-frame sequences up front (each phoneme spans several frames). Whereas an autoregressive model can keep predicting mel frames until a stop token says it is done, FastSpeech needs to know how many frames to generate beforehand. To do that, the authors insert a 'Length Regulator' that is trained to predict how many mel frames each phoneme will take. With this information, each phoneme's hidden state can be duplicated to account for its duration. From that point onwards, the number of mel frames is fixed and the length mismatch is resolved.
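To make the duplication step concrete, here is a minimal sketch of the expansion done by the Length Regulator, assuming PyTorch and a single unbatched utterance; the function name and shapes are illustrative, not taken from the paper's code.

```python
import torch

def length_regulate(phoneme_hidden: torch.Tensor, durations: torch.Tensor) -> torch.Tensor:
    """Expand phoneme hidden states to mel-frame resolution.

    phoneme_hidden: (num_phonemes, hidden_dim)
    durations:      (num_phonemes,) integer number of mel frames per phoneme
    returns:        (sum(durations), hidden_dim)
    """
    # Each phoneme's hidden state is repeated `duration` times,
    # so the output length equals the target number of mel frames.
    return torch.repeat_interleave(phoneme_hidden, durations, dim=0)

# Example: 3 phonemes with durations 2, 3, 1 -> 6 mel-frame positions
hidden = torch.randn(3, 256)
durs = torch.tensor([2, 3, 1])
print(length_regulate(hidden, durs).shape)  # torch.Size([6, 256])
```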
The duration predictor inside the Length Regulator is trained against an autoregressive Transformer TTS encoder-decoder teacher: the target phoneme-to-duration mapping is extracted from the teacher's encoder-decoder attention alignments, so it is a teacher-student setup.
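As a rough illustration of how durations can be read off the teacher's attention: each mel frame is assigned to the phoneme it attends to most, and the duration of a phoneme is the number of frames assigned to it. This sketch assumes the attention weights of a single, well-aligned head are already given (the paper additionally selects the most diagonal head); variable names are mine.

```python
import torch

def durations_from_attention(attn: torch.Tensor) -> torch.Tensor:
    """attn: (num_mel_frames, num_phonemes) attention weights of one head.

    Returns per-phoneme durations as the number of mel frames whose
    strongest attention falls on that phoneme.
    """
    num_phonemes = attn.size(1)
    assigned = attn.argmax(dim=1)                        # (num_mel_frames,)
    durations = torch.bincount(assigned, minlength=num_phonemes)
    return durations                                     # (num_phonemes,)
```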
The Transformer layers are the same as in the original model except for the FFN after the self-attention layer, which is replaced by:
- 1D convolution (kernel size 3)
- ReLU
- 1D convolution (kernel size 3)
The authors justify this by arguing that adjacent hidden states are more closely related in phoneme and mel-spectrogram sequences than in text, i.e. speech is more locally dependent. I find this rather vague, but their ablation study showed that the 1D convolutions produce results human raters prefer.
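For reference, a convolutional feed-forward sub-layer of this shape might look like the following sketch in PyTorch; the dimensions are illustrative, and the residual connection plus layer norm follow the standard Transformer pattern rather than the paper's exact code.

```python
import torch
from torch import nn

class ConvFeedForward(nn.Module):
    """Two 1D convolutions with ReLU, replacing the position-wise FFN."""

    def __init__(self, hidden_dim: int = 256, filter_dim: int = 1024, kernel_size: int = 3):
        super().__init__()
        # Conv1d expects (batch, channels, time), so we transpose around it.
        self.conv1 = nn.Conv1d(hidden_dim, filter_dim, kernel_size, padding=kernel_size // 2)
        self.conv2 = nn.Conv1d(filter_dim, hidden_dim, kernel_size, padding=kernel_size // 2)
        self.relu = nn.ReLU()
        self.norm = nn.LayerNorm(hidden_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, hidden_dim)
        residual = x
        y = self.conv1(x.transpose(1, 2))
        y = self.relu(y)
        y = self.conv2(y).transpose(1, 2)
        return self.norm(y + residual)  # residual connection + layer norm
```

Compared with the position-wise FFN, the kernel-size-3 convolutions let each position mix information from its immediate neighbours, which is the local dependency the authors appeal to.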