CookiePPP/papers

papers


The researchers train a BERT model on Phoneme+[SEP]+Text inputs with the standard masked language modelling objective. They use both word-position and token-position embeddings, along with the usual 2 BERT segment embeddings for segments A and B. The position embeddings are sinusoidal, and a linear projection is used so the model can separate each type of embedding in the latent space. The PnG BERT model is trained on a plain-text dataset taken from Wikipedia. Their proprietary system is used to convert text into phonemes. Since there is no ground truth phoneme data, only predictions taken from their g2p model, this technique might work poorly with accented speakers or multispeaker datasets if no fine-tuning is used. They find that applying the [MASK] token to entire words instead of individual tokens improves performance in downstream tasks.
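To make the whole-word masking idea concrete, here is a minimal sketch. The function name, the `word_ids` layout and the 15% default are my own assumptions for illustration, not details taken from the paper; the only point is that the mask decision is made once per word and then applied to every token of that word.

```python
import random

def whole_word_mask(tokens, word_ids, mask_prob=0.15, mask_token="[MASK]", seed=None):
    """Mask every token of a randomly chosen subset of words.

    tokens   : list of token strings (phonemes or graphemes)
    word_ids : list of ints, same length, giving the word each token belongs to
    """
    rng = random.Random(seed)
    # Decide once per word, not once per token.
    masked_words = {w for w in sorted(set(word_ids)) if rng.random() < mask_prob}
    return [mask_token if w in masked_words else t for t, w in zip(tokens, word_ids)]

# Hypothetical phoneme sequence for two words.
tokens = ["HH", "AH", "L", "OW", "W", "ER", "L", "D"]
word_ids = [0, 0, 0, 0, 1, 1, 1, 1]
print(whole_word_mask(tokens, word_ids, mask_prob=0.5, seed=0))
```

If the phoneme and text segments share the same word ids, the same call would hide both views of a word at once, which is presumably what stops the model from just copying across the [SEP].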

When using the PnG BERT model for TTS, they only use the latents from the phoneme side of the model's output. Their TTS model is trained on 240 hours of data from 31 speakers.

Results: The baseline is slightly worse than ground truth. The PnG-BERT augmented solution is on par with or better than ground truth. Reviewers found that the PnG BERT solution had better prosody and naturalness.

UPDATE:

I created https://github.com/CookiePPP/pngnw_bert where I experimented with my own modified version of PnG BERT and found:

  • PnG BERT uses significantly more compute than normal BERT. The input length is around 6x longer when using chars+phones instead of just wordpieces.
  • Fine-tuned PnG BERT performs about on par with fine-tuned normal BERT + DeepMoji + G2p. It's important to include a language model for emotive TTS, but there seems to be nothing about PnG BERT that makes it especially better than other language models.

Following the success of large language models, the researchers experiment with large speech models. They use Facebook's "EnCodec" VQVAE model to convert 24 kHz audio into discrete audio tokens, then train a BERT-large style model with a causal language modelling (CLM) task on the audio tokens. The EnCodec model can output {1,2,4,8} tokens per timestep depending on the amount of compression being targeted. To improve performance, the CLM only predicts the first token for each timestep autoregressively, then a (non-autoregressive) BERT-large predicts the remaining 7 audio tokens in parallel. The authors use a lot of weight/layer sharing among the 7 parallel BERT-large models, presumably to speed up training or reduce parameter count since the task of each model is almost identical.

To train this architecture, they use the Libri-light 60,000 hour dataset (where an ASR model transcribes each file). They use 10-20 second slices to train the autoregressive model. 3 seconds of ground truth audio tokens are given at the start of each sample (so the model can learn the audio conditions and speaker identity for the sample). They also provide the phoneme transcript at the start of each sample. Different position embeddings are used for the phoneme transcript to allow the model to align the transcript and audio tokens properly.

Results: They find that VALL-E has better WER and speaker similarity than YourTTS (zero-shot model based on VITS). The MOS values suggest VALL-E has very good audio quality and naturalness.

image

image

Thoughts: The WER should be ignored in this paper since the model is trained on ASR transcriptions. If the dataset transcripts were created using the same model that was used to evaluate the WER, then the reported WER may be misleadingly good.

They also evaluated speaker similarity using 5s and 10s ground truth tokens for input and found that speaker similarity improves as more data is used (unsurprisingly).

image

They also mention the model being able to recreate reverberation when it's given in ground truth samples. That's very interesting given models like Tacotron2 and FastSpeech2 struggle with reverberation.


The researchers found that the text-to-speech model FastSpeech2 does not produce sharp/clear spectrograms, especially with challenging speakers or large multi-speaker datasets. To fix this problem, they train a diffusion model to learn the offset between FastSpeech2's output and the ground truth data, effectively using the diffusion model as a postnet. They find that it's very effective: as few as 4 sampling steps are enough to improve the MOS from 3.3 to 4.1. The authors spend a long time talking about how their method is faster than other diffusion TTS models, however they seem to completely misunderstand or mis-explain WHY it's faster. The model is faster because it uses ground truth pitch during training and only outputs 1 pitch value during inference. Because of this, their model can only generate 1 version of each audio file. I threw away FastSpeech2 and trained a normal diffusion model with ground truth pitch as aux input, and it also produces samples in less than 10 steps, while enjoying the significantly simpler architecture design.


The researchers find that Tacotron2 often learns non-monotonic alignments. In order to fix this, they calculate the mean text position of each mel frame's alignment vector and penalize negative differences between the positions of neighbouring frames. Basically, they add a new loss that will penalize the model if the position in the text goes backwards by any amount between mel frames. image They find that this reduces the number of skipped or repeated words and gives a minor increase in MOS.
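A rough sketch of that loss as I understand it from the description above. The names, the `margin` parameter and the averaging are my own choices, so treat this as an illustration and not the paper's exact formula. `attention[t][n]` is assumed to be the weight mel frame `t` places on text token `n`.

```python
def monotonic_alignment_loss(attention, margin=0.0):
    """attention[t][n]: weight that mel frame t places on text token n (rows sum to 1)."""
    # Expected (mean) text position for every mel frame.
    centers = [sum(n * w for n, w in enumerate(row)) for row in attention]
    # Penalise only the steps where the position fails to advance by `margin`.
    penalties = [max(0.0, prev - cur + margin) for prev, cur in zip(centers, centers[1:])]
    return sum(penalties) / max(len(penalties), 1)

# A clean diagonal alignment costs nothing; jumping back to the start is penalised.
print(monotonic_alignment_loss([[1, 0, 0], [0, 1, 0], [0, 0, 1]]))
print(monotonic_alignment_loss([[0, 0, 1], [1, 0, 0]]))
```

With `margin=0.0` only backwards movement is punished; a small positive margin would also push the alignment to keep moving forward.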

I like this paper. It's an extremely simple technique that just works and doesn't seem to have any downsides (at least with phoneme input text).


The researchers find that using separate Alignment, Text-to-Spectrogram and Vocoder models may reduce the quality of text-to-speech samples. (FastSpeech2 has actually tried this before with "FastSpeech2s", however they reported worse scores in their paper.)

For their experiment, they attach FastSpeech2 to HiFiGAN and replace FastSpeech2's hard alignments with a Gaussian upsampling aligner. They use MAS to compute their alignments and remove the mel-spectrogram loss from FastSpeech2 so there are no spectrograms used in this pipeline. LJSpeech with default model/data parameters is used for training. They find that despite achieving worse MCD compared to the normal FastSpeech2+HiFiGAN pipeline, they have better F0, MOS and CER. image The difference is significant, however I can't say how much of the difference comes from the alignment change and how much comes from training end-to-end without using spectrograms. Vocoders are expensive to train so I don't see this architecture becoming common in research anytime soon, but it's still interesting to see and suggests that end-to-end training may be a way to improve audio quality in the future.
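For anyone unfamiliar with Gaussian upsampling, here is a minimal sketch of the general technique (not this paper's implementation). Each token gets a centre in the middle of its duration span, and every output frame takes a softmax-like weighted mix of the tokens based on distance to those centres. The fixed `sigma` is an assumption; real models often predict a per-token width.

```python
import math

def gaussian_upsample_weights(durations, sigma=1.0):
    """Return weights[t][n]: how much output frame t draws from token n."""
    ends, total = [], 0.0
    for d in durations:
        total += d
        ends.append(total)
    # Centre of each token's span on the output time axis.
    centers = [e - d / 2.0 for e, d in zip(ends, durations)]
    weights = []
    for t in range(int(round(total))):
        # Frame t is treated as covering [t, t+1), so its midpoint is t + 0.5.
        scores = [math.exp(-((t + 0.5 - c) ** 2) / (2.0 * sigma ** 2)) for c in centers]
        norm = sum(scores)
        weights.append([s / norm for s in scores])
    return weights

for row in gaussian_upsample_weights([2, 3]):
    print([round(w, 3) for w in row])
```

Unlike hard repeat-based upsampling, frames near a token boundary get a blend of both neighbours, and the whole thing stays differentiable with respect to the durations.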


The researchers propose a Text-to-Speech architecture that can copy a reference audio file's prosody/emotion while being given a new piece of text.

At first glance, this architecture appears to be a parallel version of Global-Style-Tokens image

They use Tacotron2's text encoder, and Tacotron2's LSA Attention + LSTM Stack for Alignment during training. The style encoder is just 4 ResBlocks followed by a timewise average over each channel. Pitch is extracted/used like normal. The decoder is 7 ResBlocks using AdaIN normalization. They use 4 ResBlocks for a spectrogram discriminator. They use 3 BiLSTMs with AdaIN for the duration predictor (wow, that's a weird/interesting design decision). And they predict the NRG+F0 using GT GSTs and the text.
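Since AdaIN shows up all over this architecture, here is a minimal single-channel sketch of the general operation. In a real model `gamma` and `beta` would come from a learned projection of the style vector; here they are just passed in as plain numbers, and some implementations use `1 + gamma` as the scale instead.

```python
import math

def adain_1d(channel, gamma, beta, eps=1e-5):
    """Instance-normalise one channel over time, then apply style-derived scale and shift."""
    mean = sum(channel) / len(channel)
    var = sum((x - mean) ** 2 for x in channel) / len(channel)
    std = math.sqrt(var + eps)
    return [gamma * (x - mean) / std + beta for x in channel]

# The output statistics are set by the style (gamma, beta), not by the input.
print(adain_1d([1.0, 2.0, 3.0, 4.0], gamma=2.0, beta=5.0))
```

The instance norm step wipes the input's own per-channel mean and variance, and the style vector then writes new ones.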

For some reason they train this model in 2 stages: first they train the Decoder with GT F0 + GT GST, then they freeze most of the model and train the GST, Dur, NRG and F0 predictors.

They train on LibriTTS 250 hour dataset with 1151 speakers.

image image

They show very good MOS values for Naturalness and Similarity. I'm definitely skeptical of their conclusions / results. They claim their Style Encoder was able to extract correct emotion from other speakers when the model was trained on a single speaker dataset, yet I don't see anything in their paper that would explain how this is possible.

image

They perform lots of ablations and show that enforcing hard monotonicity is required for parallel architectures to align well (disappointing but expected). They also show an extremely large drop in quality when the discriminator is removed, which makes me more interested in their discriminator design. I've tried multiple spectrogram discriminators and while having one is better than none, I've found that there's a lot of room for failure and improvement in discriminator design (e.g: including text encoder information, using 2d convs). They also show that their use of Instance Norm is essential for their architecture, however Adaptive Instance Norm is not specifically required.

TODO: Check out their github repo and clean this section up. The paper is very dense with information and there's too much to understand with a quick skim.


image

The researchers attempt to solve the issue of parallel models requiring external alignment by using the durations from a duration predictor for training the model. The image above says everything you need to know.


The researchers experiment with various methods of improving TTS quality, reducing model size and increasing throughput.

They note that hard alignments may reduce naturalness since in actual speech phonemes blend together and don't have well defined boundaries. To fix this they add a word encoder and predict word-level durations instead of phoneme-level, then they train an attention module to expand the word-level alignments to phoneme-level. image

I absolutely love this idea. You get the robustness of hard alignments and the naturalness of soft alignments at the same time, and it uses almost no additional compute. This idea could also be extended to other prosody based features, or added as an additional step for cascading inference like a better FastSpeech2.

They also experiment with using an unconditional VAE to compress the spectrogram, then a conditional NF to infer the VAE latent. I'm not sure why this method has become common but VITS found success with it so I guess it has some merit. They also have a NF Postnet to produce the final spectrogram, which is typically done because VAEs trained with MSE produce blurry outputs. The postnet significantly improves audio quality, while the VAE+NF latent modelling significantly improves prosody.


These are the first papers to apply DDPMs to the text-to-spectrogram task.

Grad-TTS uses a UNet architecture while Diff-TTS uses Gated WaveNet blocks.

image

image

Both papers report SOTA results, with Diff-TTS getting BETTER THAN GT MOS values. Crazy stuff.


image

The researchers experiment with a word-level VAE for encoding prosody of the text. They find that phoneme-level prosody is hard to predict from text, while utterance level prosody doesn't contain enough information to improve the decoder's results. To convert the VAE frame-level latents to word-level, they select the middle mel-frame in each word instead of using an RNN or other seq2vec technique. They use a pretrained BERT model to predict the VAE's latents for inference and achieve extremely good MOS results.
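The middle-frame selection is simple enough to sketch. This is my own illustration, assuming you already have one latent per mel frame plus the number of frames each word spans (the names are hypothetical, and every word is assumed to cover at least one frame).

```python
def word_level_latents(frame_latents, word_frame_counts):
    """Pick the latent of the middle mel frame inside each word's span."""
    if sum(word_frame_counts) != len(frame_latents):
        raise ValueError("word spans must cover every frame exactly once")
    picked, start = [], 0
    for count in word_frame_counts:
        picked.append(frame_latents[start + count // 2])
        start += count
    return picked

# 10 frames split over 3 words of 3, 5 and 2 frames.
print(word_level_latents(list(range(10)), [3, 5, 2]))  # [1, 5, 9]
```

No extra parameters and no recurrence, which is presumably the appeal over an RNN-style seq2vec pooling step.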

image


In Guided-TTS2 the researchers experiment with training a large diffusion model on the unlabelled Libri-light 60,000 hour dataset. They train the DDPM with speaker embeddings taken from a pretrained speaker encoder. The speaker conditioning is dropped for some of the training samples so they can use Classifier-Free-Guidance during inference to more closely match the encoded speaker embedding. No text is used to train the DDPM.

image


Inference

They use a pretrained phoneme classifier to guide the DDPM towards the text for inference.

To make outputs closer to the reference speaker, they use Classifier-Free-Guidance. They notice that CFG reduces the text accuracy but increases the speaker similarity and settle on CFG scale of 1.0 for their evaluations.
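For reference, a minimal sketch of the classifier-free guidance blend in one common formulation. Whether the paper's "scale" follows exactly this convention is an assumption on my part. `cond` and `uncond` stand for the model's predictions with and without the speaker embedding.

```python
def classifier_free_guidance(cond, uncond, scale):
    """Push the conditional prediction further away from the unconditional one."""
    return [c + scale * (c - u) for c, u in zip(cond, uncond)]

# Under this convention scale=0 is plain conditional sampling,
# and larger scales exaggerate whatever the conditioning changed.
print(classifier_free_guidance([1.0, 2.0], [0.0, 4.0], scale=0.0))  # [1.0, 2.0]
print(classifier_free_guidance([1.0, 2.0], [0.0, 4.0], scale=1.0))  # [2.0, 0.0]
```

This also gives some intuition for the trade-off they report: the extra push only follows the speaker direction, so nothing in it protects the text content.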

image

image

Unless otherwise specified, they also fine-tune the DDPM model on the reference files. This gives a very large improvement in speaker similarity without affecting text accuracy at lower CFGs. (Yellow vs Blue line in the above images)


image

Their results are very impressive, outperforming YourTTS, StyleSpeech and matching or exceeding GT in MOS. They achieve slightly below GT speaker similarity, but still significantly better than the competition. Their CER results might be misleading though since they use a pretrained ASR model to guide their DDPM.

https://ksw0306.github.io/guided-tts2-demo/

Listening to their demo samples, I notice each sample sounds very clear and well paced, however emotion seems to be completely missing. It makes sense given their method, but I am curious if semantic information could be added to their method while still being able to train on Librilight. (maybe Whisper(?) transcribed audio, but with a 'AI Transcript' and 'Human Transcript' label added?)


image

Also super interesting, they find that resetting the optimizer improves fine-tuning results significantly.


This paper continues the research of PnG BERT in Prosody/Pronunciation prediction using Large Language Models.

They experiment with training a BERT model with only phonemes. They use BPE to create a large vocabulary and train with both phonemes and sup-phonemes (BPE-merged groups of phonemes) as can be seen in the image below.

image
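A toy sketch of BPE applied to phoneme sequences, to show where the larger merged units come from. This is generic greedy BPE written for illustration; the `_` joiner and function name are mine, and the paper's actual tokenizer settings are not reproduced here.

```python
from collections import Counter

def learn_phoneme_bpe(sequences, num_merges):
    """Greedy BPE: repeatedly fuse the most frequent adjacent pair of units."""
    seqs = [list(s) for s in sequences]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for s in seqs:
            pairs.update(zip(s, s[1:]))
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]
        merges.append((a, b))
        merged = a + "_" + b
        new_seqs = []
        for s in seqs:
            out, i = [], 0
            while i < len(s):
                # Replace each non-overlapping occurrence of the pair.
                if i + 1 < len(s) and s[i] == a and s[i + 1] == b:
                    out.append(merged)
                    i += 2
                else:
                    out.append(s[i])
                    i += 1
            new_seqs.append(out)
        seqs = new_seqs
    return merges, seqs

merges, seqs = learn_phoneme_bpe([["HH", "AH", "L", "OW"], ["HH", "AH", "T"]], num_merges=1)
print(merges, seqs)
```

The model then sees both the original phonemes and these merged units, so the vocabulary is much larger than a plain phoneme set.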


image

The results are impressive and match PnG BERT. Original text encoder shows 3.75 MOS, randn init MP-BERT shows 3.9 MOS, and pretrained MP-BERT shows 4.04 MOS.

The pretrained MP BERT text encoder achieves BETTER than GT MOS values. A second paper to go with PnG BERT that shows the importance of pretraining the text encoder if possible.


They show that

image

removing the text side of the model/input significantly improves performance without any decrease in MOS.


They show that

image

the sup-phoneme vocabulary is required for the LLM to learn the required information.

The results are great and show that their method is recommended if you have phonemes available for inference or a dataset with a common nationality/accent.


The TTS front-end normalizes the sentences and converts them into phoneme sequences.

Since they use a generic g2p algorithm for inference, this approach may not work as well with strong accented speakers or a very diverse multispeaker dataset.


In this paper the researchers show that Mixed-Phoneme BERT and Parallel Tacotron2 results are reproducible. The researchers also propose a new architecture with "Bidirectional Prior/Posterior" and "VAE with Memory" however in my opinion they do not show how these techniques compare to existing methods properly.


In this paper they merge:

  • Mixed-Phoneme BERT
  • Parallel Tacotron 2: A Non-Autoregressive Neural TTS Model with Differentiable Duration Modeling
  • VITS

and use a slightly customized NF prior.

image

image

They train their model on LJSpeech and get almost perfect MOS values. They do not however show GT MEL + HiFiGAN, so as far as I know, Grad-TTS/Glow-TTS is actually better than NaturalSpeech and they're just being held back by the Vocoder. Rather frustrating that they spend so much of the paper on their sampling method, but don't ever perform an apples-to-apples comparison to show if it's actually any better.


image

They show that each change has a minor positive impact on the MOS. Phoneme Pretraining and Differentiable Durator come from Mixed-Phoneme BERT and Parallel Tacotron 2 respectively.


image

They evaluate FastSpeech2 + HiFiGAN V1 and show how train/inference mismatch at each stage results in lower MOS scores. They use GT PT HiFiGAN so the Mel Decoder result should be ignored.

The Vocoder result is interesting. Lots of vocoders are better than HiFiGAN and HiFiGAN appears to be almost perfect... so maybe we don't have to waste lots of compute training E2E models in the future after all?

The Phoneme Encoder result is really nice to see. We've now got 9(?) examples of LLM based text encoding massively improving results, so it's clearly one of the next big things to hit open-source.


In this paper the researchers evaluate various sampling methods side-by-side

image

image

image

Read the paper if you want details, there's lots of stuff in this one and the results page doesn't really do it justice.

This paper's architecture could be used to also evaluate the newer VAE and DDPM designs.


The researchers note that Common Voice is full of noise, reverb, incorrect transcripts and thousands of challenging accents/prosodies.

Typically, a TTS model would perform poorly when using the Common Voice dataset. The researchers experiment with using a Deep Learning MOS predictor to filter the dataset and find a subset that is suitable for training TTS models.

image


Results

image

They find that training on a Common Voice subset with pred MOS >= 4.0 gives them better quality AND speaker similarity than the TTS model trained on LibriTTS, and much better results than training on unfiltered Common Voice.
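The filtering step itself is tiny. A sketch, where `predict_mos` is a hypothetical callable standing in for whatever pretrained MOS predictor you have (it is not a real library function):

```python
def filter_by_predicted_mos(clips, predict_mos, threshold=4.0):
    """Keep clips whose predicted MOS is at or above the threshold."""
    return [clip for clip in clips if predict_mos(clip) >= threshold]

# Stand-in scores instead of a real predictor.
fake_scores = {"a.wav": 4.2, "b.wav": 3.1, "c.wav": 4.0}
print(filter_by_predicted_mos(list(fake_scores), fake_scores.get))  # ['a.wav', 'c.wav']
```

All of the real work is in the predictor; the threshold just trades dataset size for cleanliness.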

I do think the MOS scores are quite low in all results, but this may be due to their 16 kHz vocoder or having a slightly different MOS evaluation system.


In this paper the researchers train an ALBERT model on phonemes with an MLM task, plus a P2G aux task.

image

They use a similar dataset to previous papers in this area, however for SOME REASON, they don't include MOS values from in domain samples.

image

They also ONLY FINE TUNE ONE OF THE MODELS. They fine tune PL-BERT (theirs) but leave MP-BERT completely frozen. Ridiculous paper.

image


image

They show that PL-BERT > PL-BERT-without-P2G > BERT > PL-BERT-without-MLM > Nothing. It has been shown in other papers before but it's nice to have another confirmation.

This paper leaves me with more questions than answers (in a bad way). Someone will need to evaluate

  • ALBERT against BERT
  • fine-tuning against frozen
  • P2G + P-MLM vs G+P-MLM

separately in order to identify if the method outlined in this paper is actually an improvement or not.


TODO


TODO


TODO


TODO


image

In this paper the researchers show a new self-supervised technique to pretrain models for the ASR task. They train an encoder to convert raw waveforms into Q (a codebook like VQVAE), and they train a (randomly masked input) Transformer with a contrastive loss objective to output a pred Q that is close to the GT Q and distant from randomly selected Q's from other frames.
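A minimal sketch of a contrastive objective of this kind for a single masked frame: cosine similarity between the Transformer output and each candidate Q, divided by a temperature, then cross-entropy with the true Q as the target. The function names and the temperature default are illustrative assumptions, and the extra terms the paper adds (such as codebook diversity) are left out.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def contrastive_loss(context, target, distractors, temperature=0.1):
    """-log softmax of the true target's similarity among target + distractors."""
    logits = [cosine(context, q) / temperature for q in [target] + distractors]
    peak = max(logits)  # subtract the max for a numerically stable log-sum-exp
    log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_denominator - logits[0]

# Low loss when the prediction points at the true Q, high when it points at a distractor.
print(contrastive_loss([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0], [-1.0, 0.0]]))
print(contrastive_loss([1.0, 0.0], [-1.0, 0.0], [[1.0, 0.0]]))
```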

Conclusion:

Our model achieves results which achieve a new state of the art on the full Librispeech benchmark for noisy speech. On the clean 100 hour Librispeech setup, wav2vec 2.0 outperforms the previous best result while using 100 times less labeled data. The approach is also effective when large amounts of labeled data are available.


In this paper ...

They use w2v-BERT to convert audio into semantic (text) tokens. They use SoundStream to convert audio into and out of a compressed form, similar to the EnCodec VQVAE used by VALL-E.

They use extensive pre-training with each model to significantly improve results.


OFFICIAL CODE HERE

The researchers train common TTS models on the YouTube and Podcasts subsets of the GigaSpeech dataset. They also propose their own architecture, MQTTS (multi-codebook vector quantized TTS), which uses Transformer-TTS but with a codebook instead of spectrograms and a modified alignment/attention setup. The multi-codebook is created using HiFi-GAN as a VQVAE-style autoencoder. The codebook is similar to SoundStream and EnCodec where multiple codebooks learn progressively more detailed information, and the AR model learns to predict the first codebook index before a conditional parallel model predicts the remaining ones in a code-AR order instead of time-AR. They remove lower cross-attention layers from Transformer-TTS since they believe/find that only the higher level representations are processed enough to align cleanly/properly. They use the ALiBi position encoding scheme to allow their TTS model to extrapolate to text/audio lengths unseen during training. A simple trick they use to improve inference quality is appending 3 seconds of silence to the start of the time-AR inference model.
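For context on ALiBi: instead of position embeddings, a distance-proportional penalty is added to the attention logits, with a different slope per head. A minimal sketch follows. The geometric slope schedule is the one from the ALiBi paper for power-of-two head counts; the symmetric `|i - j|` distance is an assumption on my part (the original is defined for causal attention), and how MQTTS applies it exactly is not shown here.

```python
def alibi_slopes(num_heads):
    """Geometric head slopes 2^(-8/H), 2^(-16/H), ... (H assumed to be a power of two)."""
    return [2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)]

def alibi_bias(num_heads, query_len, key_len):
    """bias[h][i][j] = -slope_h * |i - j|, added to attention logits before softmax."""
    return [[[-slope * abs(i - j) for j in range(key_len)]
             for i in range(query_len)]
            for slope in alibi_slopes(num_heads)]

print(alibi_slopes(8))
print(alibi_bias(2, 3, 3)[0])
```

Because the bias depends only on relative distance, nothing in it is tied to the sequence lengths seen in training, which is where the extrapolation comes from.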


image

They show that using multiple smaller codebooks performs better than 1 extremely large codebook. (MOS-Q = Audio Quality, MOS-N = Naturalness)


They use a large suite of models/techniques to evaluate their results so I will outline them here. P-FID is a FID metric using a pretrained emotion classifier which roughly correlates with Prosody Distance between GT and PRED samples. SSS is speaker similarity measured as Cosine Similarity with a pretrained speaker encoder. MOS-N is a Mean Opinion Score where the raters are specifically asked to rate the naturalness of the samples.

image

They find that

  • VITS and MQTTS perform well on the YouTube+Podcasts
  • Transformer-TTS performs poorly, even with refined attention or duration predictor
  • VITS gets better Audio Quality
  • MQTTS gets significantly better Prosody/Emotion
  • MQTTS gains a lot of audio quality by code-AR predicting the extra 3 codebooks.
  • MQTTS gains a lot from the monotonic alignment constraint (as previously shown in Regotron)

They also spend a while discussing how a duration predictor results in abrupt transitions with AR models; however, as I've tested before, all you need is a couple of high-kernel-size convs on the cond and that problem fixes itself.


The researchers propose a shared TTS/VC model. They split speech into 3 components.

  • Speaker(and recording conditions)
  • Content(text)
  • Prosody(how the speaker says the text)

image

For "prosody", they use f0 taken from gt audio. Since f0 is speaker dependent, they train a speaker-conditioned f0 predictor to replace the gt f0 during inference.

Their main enhancement is training the TTS text encoder and VC content encoder to use a shared codebook and close output representations. By doing this, the voice conversion model is forced to encode textual information into the content encoder and all the global information is rerouted to the shared speaker encoder.
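To illustrate what sharing a codebook buys you, here is a minimal nearest-neighbour quantizer. This is a generic VQ lookup and not the paper's code; the 2-d vectors and names are made up. If the text encoder output and the VC content encoder output both get snapped to the same codebook, they are forced into the same discrete space.

```python
def quantize(vector, codebook):
    """Return (index, entry) of the codebook entry nearest to `vector` (squared L2)."""
    def sq_dist(entry):
        return sum((v - e) ** 2 for v, e in zip(vector, entry))
    index = min(range(len(codebook)), key=lambda k: sq_dist(codebook[k]))
    return index, codebook[index]

shared_codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]
from_text = quantize([0.9, 1.2], shared_codebook)    # hypothetical TTS text-encoder output
from_audio = quantize([1.1, 0.8], shared_codebook)   # hypothetical VC content-encoder output
print(from_text, from_audio)  # both land on entry 1
```

Anything the content encoder wants to pass along that has no counterpart on the text side (speaker, room, mic) has nowhere to go in this bottleneck, which matches the rerouting effect described above.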


image

They find a significant improvement in zero-shot TTS performance from this change.

image

and find that their VC fails without the TTS encoder constraint.


This is the first relatively large conversational speech dataset I've seen.

image

It has 20 hours of audio with emotion labels included.
