tags: ml, text_to_speech

Beginner's guide to Text-to-Speech

TODOs:

  • flow-based models, e.g.
    • VITS (2021)
    • Flowtron (2020)
    • Glow-TTS (2020)
  • StyleTTS 2 (2023)
  • Deep Voice 3 (2017)
  • TransformerTTS
  • Mixer-TTS (2021)
  • NaturalSpeech (2022)

In text-to-speech (TTS) tasks, the model is asked to predict a waveform from an input text. In practice, this "model" is usually a whole system of models containing:

  • grapheme-to-phoneme transcriber -- converts spelling (letters) to pronunciation symbols

  • mel-spectrogram predictor -- predicts mel-spectrograms from pronunciation symbols; this is the main part of a TTS system

  • vocoder -- converts mel-spectrograms to waveforms
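The three stages compose into one function from text to waveform. The sketch below only shows the data flow: every function name, the toy "models" inside them, and the constants `N_MELS` and `HOP_LENGTH` are hypothetical placeholders, not part of any real TTS library.

```python
from typing import Callable, List

Phonemes = List[str]
MelSpectrogram = List[List[float]]  # frames x mel bins
Waveform = List[float]

N_MELS = 80       # assumed number of mel bins per frame
HOP_LENGTH = 256  # assumed number of audio samples per mel frame


def toy_g2p(text: str) -> Phonemes:
    """Placeholder G2P: treats every letter as one 'phoneme'."""
    return [ch for ch in text.lower() if ch.isalpha()]


def toy_mel_predictor(phonemes: Phonemes) -> MelSpectrogram:
    """Placeholder acoustic model: emits two all-zero frames per phoneme."""
    return [[0.0] * N_MELS for _ in range(2 * len(phonemes))]


def toy_vocoder(mel: MelSpectrogram) -> Waveform:
    """Placeholder vocoder: emits HOP_LENGTH silent samples per mel frame."""
    return [0.0] * (len(mel) * HOP_LENGTH)


def synthesize(
    text: str,
    g2p: Callable[[str], Phonemes] = toy_g2p,
    mel_predictor: Callable[[Phonemes], MelSpectrogram] = toy_mel_predictor,
    vocoder: Callable[[MelSpectrogram], Waveform] = toy_vocoder,
) -> Waveform:
    """Run the stages in order: text -> phonemes -> mel-spectrogram -> waveform."""
    phonemes = g2p(text)
    mel = mel_predictor(phonemes)
    return vocoder(mel)


waveform = synthesize("the lazy dog")
print(len(waveform))
```

Because each stage is passed in as a plain callable, any one of them can be swapped out (for example, a different vocoder) without touching the other two.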

Types of mel-spectrogram generators

There are several types of models for mel-spectrogram generation.

Autoregressive

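Predicting the next mel frame based on the previously generated frames.

  • Classical approach to speech synthesis which, initially, gave the best results
  • typically RNNs with attention or causal CNN architectures
  • the big drawback is slow inference, because frames have to be generated one after another; for this reason these models have largely fallen out of use
  • examples: Tacotron 2; WaveNet is autoregressive in the same sense, but it generates waveform samples rather than mel frames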
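The sequential dependency behind the slow inference can be sketched as a loop in which each step consumes the previous step's output. This is a toy illustration, not how any of the models above is implemented: `toy_next_frame` is a hypothetical stand-in for a real decoder step, and the all-zero start frame is an assumption of the sketch.

```python
import random
from typing import List

N_MELS = 4  # tiny number of mel bins, for illustration only


def toy_next_frame(prev_frame: List[float], rng: random.Random) -> List[float]:
    """Hypothetical decoder step: the new frame depends on the previous one."""
    return [0.9 * x + rng.uniform(-0.1, 0.1) for x in prev_frame]


def generate_autoregressive(n_frames: int, seed: int = 0) -> List[List[float]]:
    """Generate frames one at a time; step t cannot start before step t-1 ends."""
    rng = random.Random(seed)
    frame = [0.0] * N_MELS  # assumed all-zero start frame
    frames = []
    for _ in range(n_frames):
        frame = toy_next_frame(frame, rng)  # needs the previous output
        frames.append(frame)
    return frames


mel = generate_autoregressive(n_frames=5)
print(len(mel), len(mel[0]))
```

The loop body cannot be parallelised across time steps, so generation time grows linearly with the number of frames.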

Parallel

Predicting all frames at the same time.
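A minimal sketch of what "at the same time" means: every frame is computed from the input alone, so no frame has to wait for another frame's output. The `toy_frame` function and the `encoding` vector are hypothetical and carry no acoustic meaning.

```python
from typing import List


def toy_frame(encoding: List[float], t: int) -> List[float]:
    """Hypothetical per-frame function: uses only the input and the frame index t."""
    return [e * (t + 1) for e in encoding]


def generate_parallel(encoding: List[float], n_frames: int) -> List[List[float]]:
    """Compute all frames independently of each other.

    No frame reads another frame's output, so the frames could be evaluated
    in any order, or all at once on parallel hardware.
    """
    return [toy_frame(encoding, t) for t in range(n_frames)]


encoding = [0.1, 0.2, 0.3, 0.4]
mel = generate_parallel(encoding, n_frames=3)
print(len(mel), len(mel[0]))
```

Here the Python list comprehension still runs sequentially; the point is only that nothing in the computation forces it to.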

Types of vocoders

Vocoders need to predict very high-resolution output: one second of speech typically consists of between 16k and 22k scalars (audio samples). This is therefore the realm of generative models.
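To get a feel for the sizes involved, the arithmetic can be written out. The sketch assumes that "22k" refers to the common 22,050 Hz sample rate and that the mel-spectrogram uses a hop length of 256 samples per frame; both values are assumptions, and exact frame counts also depend on padding.

```python
SAMPLE_RATES_HZ = (16_000, 22_050)  # assumed readings of "16k" and "22k"
HOP_LENGTH = 256                    # assumed audio samples per mel frame


def samples_for(duration_s: float, sample_rate_hz: int) -> int:
    """Number of scalar samples the vocoder must output for the given duration."""
    return round(duration_s * sample_rate_hz)


def mel_frames_for(n_samples: int, hop_length: int = HOP_LENGTH) -> int:
    """Approximate number of mel frames, one per hop, rounded up."""
    return -(-n_samples // hop_length)  # ceiling division


for rate in SAMPLE_RATES_HZ:
    n_samples = samples_for(5.0, rate)
    print(rate, n_samples, mel_frames_for(n_samples))
```

Under these assumptions the vocoder has to produce a few hundred output samples for every input mel frame.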

GANs

Flow-based

Glossary

  • phoneme: smallest speech sound of a particular language (pronunciation)
  • grapheme: smallest unit of writing system (letters)

grapheme vs phoneme

graphemes:

  • the quick brown
  • fox jumps over
  • the lazy dog

phonemes:

  • DH AH0 K W IH1 K B R AW1 N
  • F AA1 K S JH AH1 M P S OW1 V ER0
  • DH AH0 L EY1 Z IY0 D AO1 G

  • prosody: properties of larger phonetic segments such as rhythm, stress and intonation
  • vocoder: turns mel-spectrograms into waveforms
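The grapheme and phoneme lines above can be reproduced with a dictionary lookup, which is the simplest form of grapheme-to-phoneme conversion. This is a sketch: `TOY_LEXICON` contains only the words from the example, and the per-word split of each phoneme string is inferred from the lines above rather than stated in them.

```python
from typing import Dict, List

# Toy lexicon built from the example sentence; word boundaries are inferred.
TOY_LEXICON: Dict[str, List[str]] = {
    "the": ["DH", "AH0"],
    "quick": ["K", "W", "IH1", "K"],
    "brown": ["B", "R", "AW1", "N"],
    "fox": ["F", "AA1", "K", "S"],
    "jumps": ["JH", "AH1", "M", "P", "S"],
    "over": ["OW1", "V", "ER0"],
    "lazy": ["L", "EY1", "Z", "IY0"],
    "dog": ["D", "AO1", "G"],
}


def graphemes_to_phonemes(text: str) -> List[str]:
    """Look up each whitespace-separated word and concatenate its phonemes."""
    phonemes: List[str] = []
    for word in text.lower().split():
        if word not in TOY_LEXICON:
            raise KeyError(f"word not in toy lexicon: {word!r}")
        phonemes.extend(TOY_LEXICON[word])
    return phonemes


for line in ("the quick brown", "fox jumps over", "the lazy dog"):
    print(" ".join(graphemes_to_phonemes(line)))
```

A pure lookup fails on any word outside the lexicon, which is why practical G2P components also need a fallback for unseen words.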

Evaluation metrics

Mean Opinion Score (MOS)

Mean Opinion Score (MOS) is the mean of the opinion scores given by annotators. Annotators score each recording on a categorical scale from 1 to 5, with 5 being 'excellent' and 1 being 'bad'. So that the MOS of a model's generated audio can be put in context, authors should also report the MOS of the ground-truth recordings (typically $\sim$4.5).
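The computation itself is just an average, sketched below. The ratings are invented purely for illustration and do not come from any real listening test; the function name is hypothetical.

```python
import statistics
from typing import Sequence

VALID_SCORES = (1, 2, 3, 4, 5)  # 1 = 'bad', 5 = 'excellent'


def mean_opinion_score(ratings: Sequence[int]) -> float:
    """Average the annotators' 1-5 ratings for one system."""
    if not ratings:
        raise ValueError("at least one rating is required")
    if any(r not in VALID_SCORES for r in ratings):
        raise ValueError("ratings must be integers from 1 to 5")
    return statistics.fmean(ratings)


# Made-up ratings, for illustration only.
model_ratings = [4, 4, 5, 3, 4, 4, 5, 4]
ground_truth_ratings = [5, 4, 5, 5, 4, 5, 4, 4]

model_mos = mean_opinion_score(model_ratings)
ground_truth_mos = mean_opinion_score(ground_truth_ratings)
print(model_mos, ground_truth_mos, ground_truth_mos - model_mos)
```

Reporting the gap to the ground-truth MOS alongside the model's MOS makes results from different listening tests easier to compare.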

Tools

TTS models