EfficientSpeech is a small text-to-speech model for edge and mobile devices introduced by Atienza (2023).
EfficientSpeech is a system composed of:
- a g2p phoneme generator
- the model itself
- a HiFi-GAN vocoder
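At inference time the three stages simply chain together. A minimal sketch of that pipeline, where the stage callables are hypothetical stand-ins rather than the project's actual API:

```python
import torch

# Hypothetical glue code: `text_to_phonemes`, `acoustic_model`, and `vocoder`
# stand in for the g2p front end, the EfficientSpeech model, and HiFi-GAN.
def synthesize(text: str, text_to_phonemes, acoustic_model, vocoder) -> torch.Tensor:
    phoneme_ids = text_to_phonemes(text)   # (1, num_phonemes) LongTensor
    mel = acoustic_model(phoneme_ids)      # (1, n_mels, num_frames) spectrogram
    wav = vocoder(mel)                     # (1, 1, num_samples) waveform
    return wav.squeeze()
```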
I've identified a slight error in the figure in the paper. According to the figure, the "upsampling" operation applied to the output of the first feature encoder block also takes in the upsampled output of the second feature encoder block. I've looked into the source code and this is clearly not the case -- the features are 'fused' independently. So it's as the paper describes in the paragraphs below the figure.
The model is divided into several parts:
- feature encoder -- processes and contextualizes phonemes
- acoustic features predictor -- predicts pitch, energy and duration
- feature fuser and upsampler -- concatenates acoustic features w/ contextualized phonemes and repeats phonemes according to their duration
- mel spectrogram predictor -- predicts mel spectrograms from fused mel-frame features
The feature encoder is composed of two blocks, each consisting of (see the sketch after this list):
- Depthwise-separable convolution
- Self-attention layer
- 1D convolution
- plus some Layer Norms mixed in
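A rough PyTorch sketch of one such block -- kernel sizes, head count, and norm placement are my assumptions, not EfficientSpeech's exact hyperparameters:

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """Illustrative feature-encoder block: depthwise-separable convolution,
    self-attention, and a 1D convolution, with layer norms mixed in."""

    def __init__(self, dim: int, stride: int = 1, heads: int = 2):
        super().__init__()
        # Depthwise-separable conv: per-channel conv followed by a 1x1 mix.
        # With stride=2 this is where the sequence length gets halved.
        self.depthwise = nn.Conv1d(dim, dim, kernel_size=3, stride=stride,
                                   padding=1, groups=dim)
        self.pointwise = nn.Conv1d(dim, dim, kernel_size=1)
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.conv = nn.Conv1d(dim, dim, kernel_size=3, padding=1)
        self.norm3 = nn.LayerNorm(dim)

    def forward(self, x):                      # x: (batch, seq, dim)
        x = self.pointwise(self.depthwise(x.transpose(1, 2))).transpose(1, 2)
        x = self.norm1(x)
        x = self.norm2(x + self.attn(x, x, x, need_weights=False)[0])
        x = self.norm3(x + self.conv(x.transpose(1, 2)).transpose(1, 2))
        return x
```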
The feature encoder follows a U-network style architecture. The first block leaves the sequence length alone, but the second block uses a convolution with stride 2, which halves the sequence length. The output of each block is further processed by:
- a linear layer
- a transposed convolution for the second block's features (restoring the full sequence length), identity for the first block's
These are then concatenated and run through a linear layer to produce contextualized representations of phonemes.
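Putting the two paths together, here's a sketch of the U-network fusion, reusing the `EncoderBlock` from above (it assumes an even phoneme count so the transposed convolution restores the exact original length; dimensions are illustrative):

```python
import torch
import torch.nn as nn

class FeatureEncoder(nn.Module):
    """Sketch of the U-network style feature encoder: two blocks in sequence,
    each output projected, the downsampled one upsampled back, then fused."""

    def __init__(self, dim: int):
        super().__init__()
        self.block1 = EncoderBlock(dim, stride=1)   # keeps sequence length
        self.block2 = EncoderBlock(dim, stride=2)   # halves sequence length
        self.proj1 = nn.Linear(dim, dim)
        self.proj2 = nn.Linear(dim, dim)
        # Transposed conv restores the halved sequence to full length.
        self.upsample = nn.ConvTranspose1d(dim, dim, kernel_size=2, stride=2)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, x):                           # x: (batch, seq, dim)
        h1 = self.block1(x)                         # (batch, seq, dim)
        h2 = self.block2(h1)                        # (batch, seq // 2, dim)
        u1 = self.proj1(h1)                         # each path handled independently
        u2 = self.upsample(self.proj2(h2).transpose(1, 2)).transpose(1, 2)
        return self.fuse(torch.cat([u1, u2], dim=-1))   # (batch, seq, dim)
```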
EfficientSpeech follows FastSpeech 2 and also predicts acoustic features -- pitch, energy, and duration -- per phoneme. Each predictor is trained against ground-truth targets, and all three predictions are embedded.
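A FastSpeech 2 style variance predictor is a small convolutional stack with a scalar head. Here's a sketch, plus the usual trick of bucketizing a continuous prediction into a learned embedding table (layer sizes and bin edges are illustrative assumptions, not the paper's values):

```python
import torch
import torch.nn as nn

class VariancePredictor(nn.Module):
    """Predicts one scalar per phoneme (pitch, energy, or log-duration)."""

    def __init__(self, dim: int, kernel_size: int = 3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(dim, dim, kernel_size, padding=kernel_size // 2),
            nn.ReLU(),
            nn.Conv1d(dim, dim, kernel_size, padding=kernel_size // 2),
            nn.ReLU(),
        )
        self.out = nn.Linear(dim, 1)

    def forward(self, x):                          # x: (batch, seq, dim)
        h = self.net(x.transpose(1, 2)).transpose(1, 2)
        return self.out(h).squeeze(-1)             # (batch, seq)

# Embedding a continuous prediction (here pitch) by bucketizing it into
# a learned table; bin edges are made up for illustration.
bins = torch.linspace(0.0, 8.0, steps=255)
pitch_embedding = nn.Embedding(256, 128)
pitch = torch.rand(2, 17) * 8                      # (batch, seq) predicted pitch
emb = pitch_embedding(torch.bucketize(pitch, bins))  # (batch, seq, 128)
```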
The feature fuser concatenates the embedded acoustic features with the contextualized phoneme features. The fused phoneme features are then turned into mel-frame features by repeating each one according to its predicted duration.
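The repetition step is FastSpeech's "length regulator", and in PyTorch it's essentially a one-liner via `torch.repeat_interleave`. A batch-size-1 sketch (batching would additionally need padding to the longest expanded sequence):

```python
import torch

def length_regulate(features: torch.Tensor, durations: torch.Tensor) -> torch.Tensor:
    """Expand per-phoneme features into per-mel-frame features by repeating
    each phoneme vector durations[i] times."""
    # features: (seq, dim); durations: (seq,) integer frame counts
    return torch.repeat_interleave(features, durations, dim=0)

feats = torch.randn(4, 8)                  # 4 phonemes, 8-dim fused features
durs = torch.tensor([2, 3, 1, 4])          # predicted frame counts per phoneme
frames = length_regulate(feats, durs)      # (10, 8) mel-frame features
```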
To predict the final spectrogram, a combination of depth-wise separable convolutions, linear layers, tanh activations, and layer norms is used.
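A sketch of such a decoder -- depth, widths, and kernel sizes are illustrative, not the paper's exact settings:

```python
import torch
import torch.nn as nn

def ds_conv(dim: int, kernel: int = 5) -> nn.Sequential:
    """Depthwise-separable 1D convolution: per-channel conv then a 1x1 mix."""
    return nn.Sequential(
        nn.Conv1d(dim, dim, kernel, padding=kernel // 2, groups=dim),
        nn.Conv1d(dim, dim, 1),
    )

class MelDecoder(nn.Module):
    """Sketch of the spectrogram predictor: depthwise-separable convolutions
    with tanh activations and layer norms, then a linear projection down to
    n_mels channels."""

    def __init__(self, dim: int = 128, n_mels: int = 80, layers: int = 2):
        super().__init__()
        self.blocks = nn.ModuleList(ds_conv(dim) for _ in range(layers))
        self.norms = nn.ModuleList(nn.LayerNorm(dim) for _ in range(layers))
        self.proj = nn.Linear(dim, n_mels)

    def forward(self, x):                      # x: (batch, frames, dim)
        for conv, norm in zip(self.blocks, self.norms):
            x = norm(torch.tanh(conv(x.transpose(1, 2)).transpose(1, 2)))
        return self.proj(x)                    # (batch, frames, n_mels)
```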