EfficientSpeech is a small text-to-speech model for edge and mobile devices introduced by Atienza (2023).
EfficientSpeech is a system composed of:
- a g2p phoneme generator
- the model itself
- a HiFi-GAN vocoder
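At inference time the three stages simply chain together. A minimal sketch of that pipeline, where the stage callables are hypothetical stand-ins rather than the project's actual API:

```python
import torch

# Hypothetical glue code: `text_to_phonemes`, `acoustic_model`, and `vocoder`
# stand in for the g2p front end, the EfficientSpeech model, and HiFi-GAN.
def synthesize(text: str, text_to_phonemes, acoustic_model, vocoder) -> torch.Tensor:
    phoneme_ids = text_to_phonemes(text)   # (1, num_phonemes) LongTensor
    mel = acoustic_model(phoneme_ids)      # (1, n_mels, num_frames) spectrogram
    wav = vocoder(mel)                     # (1, 1, num_samples) waveform
    return wav.squeeze()
```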
I've identified a slight error in the figure in the paper. According to the figure, the "upsampling" operation applied to the output of the first feature encoder block also takes in the upsampled output of the second feature encoder block. I've looked into the source code and this is clearly not the case -- the features are 'fused' independently. So it's as the paper describes in the paragraphs below the figure.
The model is divided into several parts:
- feature encoder -- processes and contextualizes phonemes
- acoustic features predictor -- predicts pitch, energy and duration
- feature fuser and upsampler -- concatenates acoustic features w/ contextualized phonemes and repeats phonemes according to their duration
- mel spectrogram predictor -- predicts mel spectrograms from fused mel-frame features
The feature encoder is composed of two blocks, each consisting of (see the sketch after this list):
- Depthwise-separable convolution
- Self-attention layer
- 1D convolution
- plus some Layer Norms mixed in
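A rough PyTorch sketch of one such block -- kernel sizes, head count, and norm placement are my assumptions, not EfficientSpeech's exact hyperparameters:

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """Illustrative feature-encoder block: depthwise-separable convolution,
    self-attention, and a 1D convolution, with layer norms mixed in."""

    def __init__(self, dim: int, stride: int = 1, heads: int = 2):
        super().__init__()
        # Depthwise-separable conv: per-channel conv followed by a 1x1 mix.
        # With stride=2 this is where the sequence length gets halved.
        self.depthwise = nn.Conv1d(dim, dim, kernel_size=3, stride=stride,
                                   padding=1, groups=dim)
        self.pointwise = nn.Conv1d(dim, dim, kernel_size=1)
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.conv = nn.Conv1d(dim, dim, kernel_size=3, padding=1)
        self.norm3 = nn.LayerNorm(dim)

    def forward(self, x):                      # x: (batch, seq, dim)
        x = self.pointwise(self.depthwise(x.transpose(1, 2))).transpose(1, 2)
        x = self.norm1(x)
        x = self.norm2(x + self.attn(x, x, x, need_weights=False)[0])
        x = self.norm3(x + self.conv(x.transpose(1, 2)).transpose(1, 2))
        return x
```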
The feature encoder follows a U-network style architecture. The first block leaves the sequence length alone, but the second block uses a convolution with stride 2, which halves the sequence length. The output of each block is further processed by:
- a linear layer
- a transposed convolution for the second block's features (restoring the full sequence length), identity for the first block's
These are then concatenated and run through a linear layer to produce contextualized representations of phonemes.
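Putting the two paths together, here's a sketch of the U-network fusion, reusing the `EncoderBlock` from above (it assumes an even phoneme count so the transposed convolution restores the exact original length; dimensions are illustrative):

```python
import torch
import torch.nn as nn

class FeatureEncoder(nn.Module):
    """Sketch of the U-network style feature encoder: two blocks in sequence,
    each output projected, the downsampled one upsampled back, then fused."""

    def __init__(self, dim: int):
        super().__init__()
        self.block1 = EncoderBlock(dim, stride=1)   # keeps sequence length
        self.block2 = EncoderBlock(dim, stride=2)   # halves sequence length
        self.proj1 = nn.Linear(dim, dim)
        self.proj2 = nn.Linear(dim, dim)
        # Transposed conv restores the halved sequence to full length.
        self.upsample = nn.ConvTranspose1d(dim, dim, kernel_size=2, stride=2)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, x):                           # x: (batch, seq, dim)
        h1 = self.block1(x)                         # (batch, seq, dim)
        h2 = self.block2(h1)                        # (batch, seq // 2, dim)
        u1 = self.proj1(h1)                         # each path handled independently
        u2 = self.upsample(self.proj2(h2).transpose(1, 2)).transpose(1, 2)
        return self.fuse(torch.cat([u1, u2], dim=-1))   # (batch, seq, dim)
```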
EfficientSpeech follows FastSpeech 2 and also predicts acoustic features -- pitch, energy, and duration -- per phoneme. Each predictor is trained against ground-truth targets, and all three predictions are embedded.
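A FastSpeech 2 style variance predictor is a small convolutional stack with a scalar head. Here's a sketch, plus the usual trick of bucketizing a continuous prediction into a learned embedding table (layer sizes and bin edges are illustrative assumptions, not the paper's values):

```python
import torch
import torch.nn as nn

class VariancePredictor(nn.Module):
    """Predicts one scalar per phoneme (pitch, energy, or log-duration)."""

    def __init__(self, dim: int, kernel_size: int = 3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(dim, dim, kernel_size, padding=kernel_size // 2),
            nn.ReLU(),
            nn.Conv1d(dim, dim, kernel_size, padding=kernel_size // 2),
            nn.ReLU(),
        )
        self.out = nn.Linear(dim, 1)

    def forward(self, x):                          # x: (batch, seq, dim)
        h = self.net(x.transpose(1, 2)).transpose(1, 2)
        return self.out(h).squeeze(-1)             # (batch, seq)

# Embedding a continuous prediction (here pitch) by bucketizing it into
# a learned table; bin edges are made up for illustration.
bins = torch.linspace(0.0, 8.0, steps=255)
pitch_embedding = nn.Embedding(256, 128)
pitch = torch.rand(2, 17) * 8                      # (batch, seq) predicted pitch
emb = pitch_embedding(torch.bucketize(pitch, bins))  # (batch, seq, 128)
```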
The feature fuser concatenates the embedded acoustic features with the contextualized phoneme features. The fused phoneme features are then turned into mel-frame features by repeating each one according to its predicted duration.
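The repetition step is FastSpeech's "length regulator", and in PyTorch it's essentially a one-liner via `torch.repeat_interleave`. A batch-size-1 sketch (batching would additionally need padding to the longest expanded sequence):

```python
import torch

def length_regulate(features: torch.Tensor, durations: torch.Tensor) -> torch.Tensor:
    """Expand per-phoneme features into per-mel-frame features by repeating
    each phoneme vector durations[i] times."""
    # features: (seq, dim); durations: (seq,) integer frame counts
    return torch.repeat_interleave(features, durations, dim=0)

feats = torch.randn(4, 8)                  # 4 phonemes, 8-dim fused features
durs = torch.tensor([2, 3, 1, 4])          # predicted frame counts per phoneme
frames = length_regulate(feats, durs)      # (10, 8) mel-frame features
```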
To predict the final spectrogram, a combination of depth-wise separable convolutions, linear layers, tanh activations, and layer norms is used.
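A sketch of such a decoder -- depth, widths, and kernel sizes are illustrative, not the paper's exact settings:

```python
import torch
import torch.nn as nn

def ds_conv(dim: int, kernel: int = 5) -> nn.Sequential:
    """Depthwise-separable 1D convolution: per-channel conv then a 1x1 mix."""
    return nn.Sequential(
        nn.Conv1d(dim, dim, kernel, padding=kernel // 2, groups=dim),
        nn.Conv1d(dim, dim, 1),
    )

class MelDecoder(nn.Module):
    """Sketch of the spectrogram predictor: depthwise-separable convolutions
    with tanh activations and layer norms, then a linear projection down to
    n_mels channels."""

    def __init__(self, dim: int = 128, n_mels: int = 80, layers: int = 2):
        super().__init__()
        self.blocks = nn.ModuleList(ds_conv(dim) for _ in range(layers))
        self.norms = nn.ModuleList(nn.LayerNorm(dim) for _ in range(layers))
        self.proj = nn.Linear(dim, n_mels)

    def forward(self, x):                      # x: (batch, frames, dim)
        for conv, norm in zip(self.blocks, self.norms):
            x = norm(torch.tanh(conv(x.transpose(1, 2)).transpose(1, 2)))
        return self.proj(x)                    # (batch, frames, n_mels)
```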