tags
ml
text_to_speech

FastSpeech

FastSpeech is a Transformer-based text-to-speech model with faster inference than its autoregressive competition. FastSpeech was introduced by Ren et al. (2019).

The model is quite old, but I wanted to read the paper to answer two questions that came up while reading FastSpeech 2:

  • why predict phoneme length as part of the model (and in the middle of it)?
  • why use 1D convolutions in the Transformer layers instead of the typical FFN?

The premise of the model is to avoid slow autoregressive mel spectrogram prediction, making the prediction parallel and much quicker.

Architecture

The whole architecture is split into two parts:

  • phoneme processing
  • mel frame generation

Both parts consist of several Transformer layers. A 'Length Regulator' separates them: it maps the $N$-long sequence of phoneme hidden states to the $M$-long sequence of what will become mel frames. After this mapping, new positional embeddings are added.
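
To make the layout concrete, here is a minimal runnable sketch I put together with stock PyTorch Transformer encoder layers; the paper uses its own 'FFT blocks' and a small convolutional duration predictor instead, and all sizes and names below are my own placeholders.

```python
import math
import torch
import torch.nn as nn

def sinusoidal_pos(length: int, dim: int) -> torch.Tensor:
    """Standard sinusoidal positional embeddings, shape (length, dim)."""
    pos = torch.arange(length).unsqueeze(1).float()
    div = torch.exp(torch.arange(0, dim, 2).float() * (-math.log(10000.0) / dim))
    pe = torch.zeros(length, dim)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe

class TinyFastSpeech(nn.Module):
    """Sketch: phoneme Transformer stack -> Length Regulator -> mel Transformer stack."""

    def __init__(self, n_phonemes=80, d_model=128, n_mels=80):
        super().__init__()
        self.emb = nn.Embedding(n_phonemes, d_model)
        self.phoneme_stack = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=2, batch_first=True), num_layers=2)
        self.mel_stack = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=2, batch_first=True), num_layers=2)
        self.duration_predictor = nn.Linear(d_model, 1)  # stand-in for the paper's conv predictor
        self.to_mel = nn.Linear(d_model, n_mels)

    def forward(self, phonemes):                      # phonemes: (1, N) int64
        d = self.emb.embedding_dim
        h = self.emb(phonemes) + sinusoidal_pos(phonemes.size(1), d)
        h = self.phoneme_stack(h)                     # (1, N, d) phoneme hidden states
        # Length Regulator: expand each phoneme state by its (predicted) duration
        dur = self.duration_predictor(h).squeeze(-1).exp().round().clamp(min=1).long()
        h = torch.repeat_interleave(h[0], dur[0], dim=0).unsqueeze(0)  # (1, M, d)
        h = h + sinusoidal_pos(h.size(1), d)          # fresh positional embeddings for the M frames
        h = self.mel_stack(h)
        return self.to_mel(h), dur                    # (1, M, n_mels) mel frames, durations

model = TinyFastSpeech()
mel, durations = model(torch.randint(0, 80, (1, 12)))
print(mel.shape, durations)
```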

Predicting phoneme length

With parallel prediction, the model needs to resolve the length mismatch between the phoneme sequence and the mel-frame sequence. Whereas an autoregressive model can simply keep predicting mel frames until it decides to stop (e.g. via a stop token), FastSpeech needs to know how many frames to generate beforehand. To do that, the authors insert a 'Length Regulator' that is trained to predict how many mel frames each phoneme will take. With this information, each phoneme's hidden state can be duplicated according to its duration. From that point onwards, the number of mel frames is fixed and the length mismatch is resolved.
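
The duplication itself is just a repeat operation; a small sketch with made-up durations (in training the durations come from the teacher, at inference from the duration predictor):

```python
import torch

hidden = torch.randn(5, 384)               # N = 5 phoneme hidden states, d = 384
durations = torch.tensor([3, 7, 2, 5, 4])  # mel frames per phoneme
expanded = torch.repeat_interleave(hidden, durations, dim=0)
print(expanded.shape)                      # torch.Size([21, 384]), M = durations.sum()
```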

The Length Regulator's duration predictor is trained against an autoregressive TTS Transformer encoder-decoder model, i.e. in a teacher-student setup: the ground-truth phoneme-to-duration mapping is extracted from the teacher's encoder-decoder attention coefficients.
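
Roughly, the extraction boils down to assigning each mel frame to the phoneme it attends to most and counting; a sketch where `attn` is a placeholder for the teacher's encoder-decoder attention (from the head the paper picks as the most diagonal one), shape (M mel frames, N phonemes):

```python
import torch

M, N = 240, 37
attn = torch.softmax(torch.randn(M, N), dim=-1)       # placeholder for real teacher attention

assignment = attn.argmax(dim=-1)                       # phoneme index each mel frame attends to most
durations = torch.bincount(assignment, minlength=N)    # duration of phoneme i = #frames assigned to it
assert durations.sum().item() == M                     # durations always add up to the mel length
```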

Transformer layers

The Transformer layers are the same as in the original model, except that the FFN after the self-attention layer is replaced by:

  • 1D convolution (kernel size 3)
  • ReLU
  • 1D convolution (kernel size 3)

The authors justified this by arguing that adjacent hidden states of phonemes and mel frames are more locally correlated than tokens of text. I find this rather vague, but their ablation study showed that the 1D convolutions produce results listeners prefer.
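
For reference, the replacement block is easy to write down; a sketch assuming hidden sizes roughly in line with the paper's configuration (treat the exact numbers as placeholders):

```python
import torch
import torch.nn as nn

class ConvFFN(nn.Module):
    """Two kernel-size-3 1D convolutions with a ReLU in between, applied along the sequence."""

    def __init__(self, d_model=384, d_hidden=1536, kernel_size=3):
        super().__init__()
        self.conv1 = nn.Conv1d(d_model, d_hidden, kernel_size, padding=kernel_size // 2)
        self.conv2 = nn.Conv1d(d_hidden, d_model, kernel_size, padding=kernel_size // 2)
        self.relu = nn.ReLU()

    def forward(self, x):                  # x: (batch, seq_len, d_model)
        y = x.transpose(1, 2)              # Conv1d expects (batch, channels, seq_len)
        y = self.conv2(self.relu(self.conv1(y)))
        return y.transpose(1, 2)           # back to (batch, seq_len, d_model)

out = ConvFFN()(torch.randn(2, 50, 384))
print(out.shape)                           # torch.Size([2, 50, 384])
```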