[TTS]VITS #1699

yt605155624 · 2022-04-14T06:13:54Z

Paper:

https://arxiv.org/abs/2106.06103

paper reading:

Implementations:

Official: https://github.com/jaywalnut310/vits
ESPnet: https://github.com/espnet/espnet/blob/master/egs2/csmsc/tts1/conf/tuning/train_vits.yaml
https://github.com/espnet/espnet/blob/master/egs2/csmsc/tts1/conf/tuning/train_full_band_vits.yaml
Coqui TTS: https://github.com/coqui-ai/TTS/blob/main/TTS/tts/models/vits.py
Coqui TTS: https://github.com/Edresson/YourTTS/ (a VITS-based zero-shot multi-speaker, multilingual model)

References:

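Challenges of two-stage speech synthesis

With an acoustic model + vocoder pipeline, the vocoder may need to be fine-tuned on GTA (ground-truth-aligned) mels. If the vocoder is a GAN vocoder whose input carries no noise, such as HiFiGAN or MB MelGAN, audio synthesized without fine-tuning may have a noticeable metallic sound. This makes the training process complicated:
a. Train the acoustic model (not counting MFA training)
b. Train the vocoder
c. Fine-tune the vocoder on GTA mels generated by the trained acoustic model
Using intermediate features such as mel spectrograms limits further improvement in synthesis quality; generating the waveform in a single step may work better
Long inference time and complex deployment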

FastSpeech2s

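Challenges of text2wav
a. Waveforms contain phase information, so text2wav faces a larger gap between input and output than text2mel
b. Because of GPU memory limits, each training input can only be a clip of audio paired with part of the text. This weakens the relationships the model can learn across the input text and makes the text embedding harder to learn
Solutions:
a. A wav decoder trained adversarially. Its structure is similar to WaveNet, and the discriminator is the one from PWGAN
b. A mel-spectrogram decoder that assists learning of the text feature representation
Results: in the original paper, the MOS of FastSpeech2s is lower than that of FastSpeech2 + PWGAN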

VITS

Posterior encoder
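Non-causal WaveNet residual blocks
Prior encoder (self.text_encoder and self.flow)
Consists of a text encoder and a normalizing flow that increases the diversity of the prior distribution. The normalizing flow module contains several WaveNet residual blocks
Decoder
Same structure as the HiFi-GAN V1 generator
Discriminator
Same structure as the multi-period discriminator in HiFi-GAN
Stochastic duration predictor
Monotonic Alignment Search (MAS), similar to Glow-TTS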
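
The Monotonic Alignment Search named in the list above can be illustrated with a small dynamic program. This is a minimal NumPy sketch of the general idea, not the code of any of the implementations linked above. It assumes a hypothetical input `log_p` of shape [T_text, T_mel] that holds the log-likelihood of each frame under each text token, and it assumes every token must cover at least one frame.

```python
import numpy as np


def monotonic_alignment_search(log_p):
    """Find the best monotonic, non-skipping alignment.

    log_p: array [T_text, T_mel]; log_p[i, j] scores frame j under token i.
    Returns a 0/1 array of the same shape with exactly one 1 per frame.
    """
    n_text, n_mel = log_p.shape
    if n_text > n_mel:
        raise ValueError("need at least one frame per text token")

    # q[i, j]: best total score of aligning frames 0..j to tokens 0..i.
    q = np.full((n_text, n_mel), -np.inf)
    q[0, 0] = log_p[0, 0]
    for j in range(1, n_mel):
        for i in range(min(j + 1, n_text)):
            stay = q[i, j - 1]                              # same token as previous frame
            move = q[i - 1, j - 1] if i > 0 else -np.inf    # advance to the next token
            q[i, j] = max(stay, move) + log_p[i, j]

    # Backtrack from the last token at the last frame.
    path = np.zeros((n_text, n_mel), dtype=np.int64)
    i = n_text - 1
    for j in range(n_mel - 1, -1, -1):
        path[i, j] = 1
        if j > 0 and i > 0 and (i == j or q[i - 1, j - 1] > q[i, j - 1]):
            i -= 1
    return path


# Toy example: 3 tokens, 4 frames; token 0 fits the first two frames best.
probs = np.array([
    [0.9, 0.8, 0.1, 0.1],
    [0.1, 0.2, 0.9, 0.1],
    [0.1, 0.1, 0.1, 0.9],
])
path = monotonic_alignment_search(np.log(probs))
durations = path.sum(axis=1)  # frames assigned to each token
print(durations)  # [2 1 1]
```

The per-token durations obtained this way are the kind of target a duration predictor is trained on. The nested Python loops make this sketch slow on real utterances, which is why compiled versions are used in practice.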

yt605155624 · 2022-06-06T06:57:42Z

training iters:

about csmsc vits train espnet/espnet#3737

add blank:

Why putting a blank token between any two input tokens can improve pronunciation? jaywalnut310/glow-tts#43
Mispronunciation and what is the purpose of the add_blank config ? jaywalnut310/vits#20
Result getting worse when i use ground truth duration. jaywalnut310/vits#9
vits - where to add blank intersperse between phonemes like in the official implementation espnet/espnet#4235
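Conclusion from @dtx525942103: adding a blank after each phone works worse than adding a blank after each character; see https://github.com/lutianxiong/vits_chinese

^ is the placeholder for a zero initial (a syllable with no initial consonant)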
see [TTS]add blank between characters for vits #2040
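
The two blank placements compared above can be sketched as follows. `intersperse` follows the interleaving pattern commonly used for add_blank, as far as I understand it. `blank_after_characters` is a hypothetical helper written for this illustration, and the `<blank>` token name and the phone spellings are assumptions, not the symbols of any particular frontend.

```python
def intersperse(tokens, blank):
    """Put a blank before, between, and after every token (per-phone blanks)."""
    result = [blank] * (len(tokens) * 2 + 1)
    result[1::2] = tokens
    return result


def blank_after_characters(phones_per_char, blank):
    """Append one blank after each character's group of phones (hypothetical helper)."""
    result = []
    for phones in phones_per_char:
        result.extend(phones)
        result.append(blank)
    return result


# Illustrative phones for two characters; "^" stands in for a zero initial.
phones_per_char = [["^", "ai4"], ["n", "i3"]]
flat_phones = [p for phones in phones_per_char for p in phones]

per_phone = intersperse(flat_phones, "<blank>")
per_char = blank_after_characters(phones_per_char, "<blank>")
print(per_phone)
print(per_char)
```

Per-phone blanks roughly double the input length, while per-character blanks add only one token per character, so the two variants give the aligner quite different sequences to work with.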

yt605155624 · 2022-06-28T09:28:10Z

The Cython version of monotonic_align is better than the 'EXPERIMENTAL' numba version

The Cython version is recommended: espnet/espnet#4475

see

[TTS]install CPython version monotonic_align before training #2087

The purple curve, new_align, shows training with the Cython monotonic_align

LifeIsStrange · 2022-07-08T13:57:33Z

So what is the current status of VITS support?
It is the #1 state of the art model on LJspeech (well actually second but the first one has no open source implementation)
I'm not seeing it documented on https://paddlespeech.readthedocs.io/en/latest/tts/models_introduction.html

yt605155624 · 2022-07-15T07:17:13Z

We are training VITS on CSMSC (a Mandarin dataset), and a released model is now available; see csmsc/vits. We mainly focus on Mandarin datasets, and training this model is time-consuming, so we have no plan to train on an English dataset for the time being. I haven't added VITS to models_introduction yet; thanks for the suggestion.

Can you please tell me which model is the best on LJSpeech? I don't think VITS is good enough because it is hard to get it to converge. I need to train on 4 GPUs for two weeks and still may not get good results, whereas FastSpeech2 only takes 2 GPUs for two days, which leaves me enough time to try different strategies.

Why don't you try FastSpeech2 + HiFiGAN of PaddleSpeech for LJSpeech?

jucaowei · 2023-01-05T07:32:37Z

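Hello, thank you very much for your work on VITS. I am training a mixed Chinese-English VITS based on your work. The convergence curves are very close to the published ones, but at 80k iterations the quality is still very poor. I have a question: the mel loss curve and the KL divergence appear to converge quite early (around 50k iters), so is there a big difference in quality between the 200k-iter model and the 300k-iter checkpoint?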

yt605155624 · 2023-01-11T06:11:33Z

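@jucaowei Hello, please retrain with the latest code. The earlier results were not very good, and we recently made a major fix.

The convergence curves after the fix are much better than the ones published above (especially generator_dur_loss). We will publish the latest curves once training finishes, but judging from the results partway through training, it is already much better than before the fix.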

Chopin68 · 2023-01-11T11:00:11Z

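@jucaowei Hello, for your mixed Chinese-English VITS training, do you use Zhuyin, IPA, or some other scheme? Could you share it for reference? Thank you very much.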

yt605155624 · 2023-01-16T08:35:32Z

jucaowei · 2023-01-28T09:44:05Z

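Sorry, I could not get training to work with the PaddleSpeech framework, so I switched to the original VITS code and did not follow the discussion here. I use the same phoneme strategy as PaddleSpeech: English phonemes come from the g2p_en library and Chinese phonemes from pypinyin. For a single language I previously tested both phonemes and pinyin letters, and both can be trained. I am not sure about the mixed-language case: after mixing, Chinese and English are both represented by letter symbols, which may cause a noticeable Chinese accent in the English speech (my personal guess).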
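
A mixed Chinese-English frontend of the kind described here first has to decide which G2P handles which part of the text. The following is a minimal sketch of that routing step only, under stated assumptions: `split_by_language` is a hypothetical helper, it treats only the basic CJK Unified Ideographs block as Chinese, and it drops digits and punctuation. It is not the PaddleSpeech frontend or the code used in this comment. The per-language G2P calls (pypinyin for "zh" segments, g2p_en for "en" segments) are left out so that the sketch has no third-party dependencies.

```python
import re

# Group 1: a run of Chinese characters; group 2: a run of English words.
# Assumption: the basic CJK block U+4E00..U+9FFF is enough for this example.
_SEGMENT_RE = re.compile(r"([\u4e00-\u9fff]+)|([A-Za-z][A-Za-z' ]*)")


def split_by_language(text):
    """Split mixed text into (language, segment) pairs in reading order."""
    segments = []
    for match in _SEGMENT_RE.finditer(text):
        zh, en = match.groups()
        if zh:
            segments.append(("zh", zh))
        else:
            segments.append(("en", en.strip()))
    return segments


segments = split_by_language("我喜欢 PaddleSpeech 和 VITS")
print(segments)
```

Each "zh" segment would then go to the Chinese G2P and each "en" segment to the English one, with the resulting phone sequences concatenated in order.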

jucaowei · 2023-01-28T09:55:20Z

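Thank you for your reply. I have given up on PaddleSpeech for now and am modifying the original VITS code instead, with many settings taken from PaddleSpeech. As with FastSpeech2, I train on four datasets. On four 3090 GPUs the model basically converges by 90k iterations (about one night), and the following 1M iterations bring almost no change. In the end it achieves basic disentanglement of speaker timbre and content across Chinese and English, but the quality is still not particularly good (Chinese speakers sound a bit stiff speaking English, and even their Chinese has a slight English accent). I feel that Chinese and English data from a single speaker is really needed.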

Chopin68 · 2023-01-28T10:28:42Z

Thank you for the detailed answer! For multilingual training, have you tried the clean in https://github.com/CjangCjengh/vits? It might be worth comparing the results.

jucaowei · 2023-01-29T07:43:14Z

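Hello, do you mean the per-language cleaners in that repository? The four datasets I use all seem to be clean data and need no normalization or cleaning. At inference time I likewise use Baidu's mix_front for text normalization.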

yt605155624 added the feature request label Apr 14, 2022

yt605155624 self-assigned this Apr 14, 2022

yt605155624 added this to the r1.0.0 milestone Apr 14, 2022

yt605155624 added this to PaddleSpeech Apr 14, 2022

yt605155624 mentioned this issue Apr 22, 2022

[tts] will vits support #1377

Closed

zh794390558 moved this to Todo in PaddleSpeech Apr 23, 2022

zh794390558 moved this from Todo to In Progress in PaddleSpeech Apr 23, 2022

zh794390558 moved this from In Progress to Todo in PaddleSpeech Apr 23, 2022

zh794390558 modified the milestones: r1.0.0, r1.1.0 May 5, 2022

zh794390558 moved this from Todo to In Progress in PaddleSpeech May 5, 2022

yt605155624 mentioned this issue May 25, 2022

[TTS]add vits network scripts, test=tts #1855

Merged

yt605155624 changed the title ~~VITS~~ 【TTS】VITS Jul 15, 2022

yt605155624 mentioned this issue Jul 15, 2022

[TTS]update vits ckpt #2159

Merged

lym0302 closed this as completed in #2159 Jul 18, 2022

Repository owner moved this from In Progress to Done in PaddleSpeech Jul 18, 2022

yt605155624 changed the title ~~【TTS】VITS~~ [TTS]VITS Dec 28, 2022

yt605155624 mentioned this issue Dec 28, 2022

[TTS] JETS -> E2E FastSpeech2 + HiFiGAN #2773

Closed

Ray961123 mentioned this issue Mar 25, 2024

Are there any plans to launch vits in Chinese TTS? #3720

Open
