---
tags:
- ml
- text_to_speech
---

# EfficientSpeech

EfficientSpeech is a small text-to-speech model for edge and mobile devices introduced by Atienza (2023).

EfficientSpeech is a system composed of the parts shown in the figure below.

*Figure: Architecture of the model*

I've identified a slight error in the figure in the paper. According to the figure, the "upsampling" operation applied to the output of the first feature encoder block also takes in the upsampled output of the second feature encoder block. I've looked into the source code and this is clearly not the case: the features are 'fused' independently. So it's as the paper describes in the paragraphs below the figure.

The model is divided into several parts:

  1. feature encoder -- processes and contextualizes phonemes
  2. acoustic features predictor -- predicts pitch, energy and duration
  3. feature fuser and upsampler -- concatenates acoustic features with contextualized phonemes and repeats phonemes according to their duration
  4. mel spectrogram predictor -- predicts mel spectrograms from fused mel-frame features
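
To make the data flow concrete, here's a minimal PyTorch sketch of how these four parts fit together. All module names, signatures and shapes below are my own illustration, not the actual EfficientSpeech source.

```python
import torch
from torch import nn

def synthesize(phonemes: torch.Tensor,
               encoder: nn.Module, acoustic: nn.Module,
               fuser_upsampler: nn.Module, mel_decoder: nn.Module) -> torch.Tensor:
    # 1. feature encoder: phoneme ids -> contextualized phoneme features
    #    (batch, num_phonemes) -> (batch, num_phonemes, dim)
    feats = encoder(phonemes)
    # 2. acoustic feature predictor: per-phoneme pitch / energy / duration
    pitch, energy, duration = acoustic(feats)
    # 3. fuse acoustic embeddings with phoneme features, then repeat each
    #    phoneme feature `duration` times -> mel-frame resolution
    frames = fuser_upsampler(feats, pitch, energy, duration)
    # 4. mel spectrogram predictor:
    #    (batch, num_frames, dim) -> (batch, num_frames, n_mels)
    return mel_decoder(frames)
```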

## Feature encoder

The feature encoder follows a U-Network-style architecture and is composed of two blocks. The first block leaves the sequence dimension alone, while the second block uses a convolution with stride 2, which halves the sequence length. The output of each block is further processed by:

  - a linear layer
  - a transposed convolution for the second block's features (identity for the first block's)

These are then concatenated and run through a linear layer to produce contextualized representations of phonemes.
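
A minimal PyTorch sketch of this shape bookkeeping, assuming even sequence lengths. The internals of the two blocks are stand-ins (the paper mixes attention and convolutions), and the layer sizes are made up; what the sketch shows is the stride-2 downsampling and the two independent per-block paths, matching the source-code behaviour described earlier.

```python
import torch
from torch import nn

class FeatureEncoderFusion(nn.Module):
    """Sketch of the two-block, U-Network-style feature encoder.
    Block internals and dimensions are illustrative assumptions."""
    def __init__(self, dim: int = 128):
        super().__init__()
        # block 1 keeps the sequence length (stride 1)
        self.block1 = nn.Conv1d(dim, dim, kernel_size=3, stride=1, padding=1)
        # block 2 halves the sequence length (stride 2)
        self.block2 = nn.Conv1d(dim, dim, kernel_size=3, stride=2, padding=1)
        # independent per-block paths: linear layer, then upsampling
        self.linear1 = nn.Linear(dim, dim)  # block 1: upsampling is identity
        self.linear2 = nn.Linear(dim, dim)  # block 2: transposed conv, x2
        self.upsample2 = nn.ConvTranspose1d(dim, dim, kernel_size=4,
                                            stride=2, padding=1)
        self.merge = nn.Linear(2 * dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, dim); assumes an even seq so the shapes line up
        h1 = self.block1(x.transpose(1, 2)).transpose(1, 2)   # (batch, seq, dim)
        h2 = self.block2(h1.transpose(1, 2)).transpose(1, 2)  # (batch, seq/2, dim)
        f1 = self.linear1(h1)                                 # identity upsample
        f2 = self.upsample2(self.linear2(h2).transpose(1, 2)).transpose(1, 2)
        # f1 and f2 are produced independently (no cross-feeding), matching
        # the source code rather than the paper's figure
        return self.merge(torch.cat([f1, f2], dim=-1))        # (batch, seq, dim)
```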

## Acoustic feature predictor

EfficientSpeech follows FastSpeech 2 and also predicts acoustic features -- pitch, energy and duration -- per phoneme. Each predictor is trained with its own loss, and all three predictions are embedded.
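
A sketch of what those three heads plus embeddings could look like. The head design and the way each scalar prediction is embedded are my assumptions, not the source (FastSpeech 2, for comparison, buckets pitch/energy into bins before embedding).

```python
import torch
from torch import nn

class AcousticPredictor(nn.Module):
    """Sketch of the three predictor heads; design is an assumption."""
    def __init__(self, dim: int = 128):
        super().__init__()
        # one small regression head per acoustic feature
        self.pitch_head = nn.Linear(dim, 1)
        self.energy_head = nn.Linear(dim, 1)
        self.duration_head = nn.Linear(dim, 1)
        # lift each scalar prediction back to the model dimension
        self.pitch_emb = nn.Linear(1, dim)
        self.energy_emb = nn.Linear(1, dim)
        self.duration_emb = nn.Linear(1, dim)

    def forward(self, h: torch.Tensor):
        # h: (batch, seq, dim) contextualized phoneme features
        pitch = self.pitch_head(h)                    # (batch, seq, 1)
        energy = self.energy_head(h)
        duration = torch.relu(self.duration_head(h))  # durations are >= 0
        # all three predictions are embedded for the fuser downstream
        acoustic = torch.cat([self.pitch_emb(pitch),
                              self.energy_emb(energy),
                              self.duration_emb(duration)], dim=-1)
        # at inference, durations are rounded to integer frame counts
        frames = duration.squeeze(-1).round().long()  # (batch, seq)
        return acoustic, frames
```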

## Feature fuser and upsampler

The feature fuser concatenates the embedded acoustic features with the contextualized phoneme features. The phoneme features are then turned into mel-frame features by repeating each one according to its predicted duration.
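
The repetition step (the "length regulator" in FastSpeech terminology) maps naturally onto `torch.repeat_interleave`. The helper below is a single-sequence sketch with made-up dimensions; batch handling is omitted because sequences in a batch upsample to different lengths and would need padding.

```python
import torch

def fuse_and_upsample(phoneme_feats: torch.Tensor,
                      acoustic_emb: torch.Tensor,
                      durations: torch.Tensor) -> torch.Tensor:
    """phoneme_feats: (seq, dim), acoustic_emb: (seq, dim_a),
    durations: (seq,) integer frame counts per phoneme."""
    # fuse by concatenation: (seq, dim + dim_a)
    fused = torch.cat([phoneme_feats, acoustic_emb], dim=-1)
    # repeat each phoneme's fused feature for its predicted duration:
    # (sum(durations), dim + dim_a)
    return torch.repeat_interleave(fused, durations, dim=0)

# usage: 3 phonemes with durations 2, 1, 3 -> 6 mel frames
feats = torch.randn(3, 8)
acoustic = torch.randn(3, 4)
frames = fuse_and_upsample(feats, acoustic, torch.tensor([2, 1, 3]))
print(frames.shape)  # torch.Size([6, 12])
```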

## Mel spectrogram predictor

To predict the final spectrogram, a combination of depth-wise separable convolutions, linear layers, tanh activations and layer norms is used.
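
One way those ingredients could be stacked, as a sketch; the layer count, ordering and residual connections here are my assumptions, not the paper's exact decoder.

```python
import torch
from torch import nn

class DepthwiseSeparableConv(nn.Module):
    """Depth-wise conv followed by a point-wise (1x1) conv."""
    def __init__(self, channels: int, kernel_size: int = 5):
        super().__init__()
        self.depthwise = nn.Conv1d(channels, channels, kernel_size,
                                   padding=kernel_size // 2, groups=channels)
        self.pointwise = nn.Conv1d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, frames)
        return self.pointwise(self.depthwise(x))

class MelDecoder(nn.Module):
    """Sketch of the mel spectrogram predictor; depth/order are assumptions."""
    def __init__(self, dim: int = 128, n_mels: int = 80, n_layers: int = 2):
        super().__init__()
        self.convs = nn.ModuleList(DepthwiseSeparableConv(dim)
                                   for _ in range(n_layers))
        self.norms = nn.ModuleList(nn.LayerNorm(dim) for _ in range(n_layers))
        self.out = nn.Linear(dim, n_mels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: fused mel-frame features, (batch, frames, dim)
        for conv, norm in zip(self.convs, self.norms):
            h = conv(x.transpose(1, 2)).transpose(1, 2)
            x = norm(torch.tanh(h)) + x  # tanh + layer norm, residual add
        return self.out(x)               # (batch, frames, n_mels)
```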