[TTS]VITS #1699
training iters / add blank: [training-curve images]
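For context, "add blank" here refers to interspersing a blank token between phoneme IDs, as in the original VITS recipe. A minimal sketch, with `intersperse` mirroring the helper found in the upstream VITS repo:

```python
def intersperse(sequence, blank_id=0):
    """Insert a blank token between every pair of phoneme IDs
    (and at both ends), as done in the original VITS text frontend."""
    result = [blank_id] * (len(sequence) * 2 + 1)
    result[1::2] = sequence
    return result

# e.g. intersperse([5, 7, 9]) -> [0, 5, 0, 7, 0, 9, 0]
```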
The Cython version of monotonic_align is better than the 'EXPERIMENTAL' numba version; the Cython version is recommended, see espnet/espnet#4475 (the purple curve).
So what is the current status of VITS support?
We are trying to train VITS on CSMSC (a Mandarin dataset), and there is a released model now, see csmsc/vits. We mainly focus on Mandarin datasets, and training this model is time-consuming, so we have no plan to train on an English dataset for the time being; I haven't added VITS to models_introduction yet. Thanks for your suggestion. Could you tell me which model is best on LJSpeech? I don't think VITS is good enough: it is hard to converge, and it would cost me 4 GPUs for two weeks of training with possibly no good result, whereas FastSpeech2 costs only 2 GPUs for two days, leaving enough time to try different strategies. Why don't you try FastSpeech2 + HiFiGAN from PaddleSpeech for LJSpeech?
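If you want to try that route, here is a minimal sketch of driving FastSpeech2 + HiFiGAN through PaddleSpeech's Python CLI API. The model tags `fastspeech2_ljspeech` and `hifigan_ljspeech` are assumptions based on PaddleSpeech's usual naming scheme; check `paddlespeech tts --help` for the exact supported names.

```python
# Hedged sketch: synthesize English speech with FastSpeech2 + HiFiGAN
# via PaddleSpeech's CLI executor.
from paddlespeech.cli.tts.infer import TTSExecutor

tts = TTSExecutor()
tts(
    text="The quick brown fox jumps over the lazy dog.",
    am="fastspeech2_ljspeech",   # acoustic model (assumed tag)
    voc="hifigan_ljspeech",      # vocoder (assumed tag)
    lang="en",
    output="output.wav",
)
```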
Hello, thank you very much for your work on VITS. I am currently training a Chinese-English mixed VITS based on your work; the convergence curves are close to the published ones, but at 80k iterations the results are still poor. One thing puzzles me: the mel loss and KL divergence seem to converge quite early (around 50k iters), so is there actually a big quality difference between the 200k-iter model and the 300k-iter checkpoint?
@jucaowei Hi, please retrain based on the latest code. The previous results were not good; we made major fixes recently.
The convergence curves after the fix are much better than those published above (especially generator_dur_loss). We will publish the latest curves once our training finishes, but judging from mid-training results, it is already much better than before the fix.
@jucaowei Hi, for training the Chinese-English mixed VITS, do you use zhuyin (bopomofo), IPA, or some other method? Could you share it for reference? Many thanks.
Sorry, since I couldn't get it to train with the PaddleSpeech framework, I later switched to the original VITS code and stopped following the discussion here. I use the same phoneme strategy as PaddleSpeech: English phonemes come from the g2p_en library, and Chinese phonemes from pypinyin. For a single language, I have verified that both phonemes and pinyin letters can be trained successfully; after mixing languages I'm not sure. Once mixed, both Chinese and English are represented as letter symbols, which may cause a noticeable "Chinese speaker speaking English" accent (my guess).
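A minimal sketch of such a mixed-language frontend, assuming the g2p_en and pypinyin packages; the language-splitting regex and output token scheme are illustrative assumptions, not the exact recipe used above:

```python
import re

from g2p_en import G2p                   # English grapheme-to-phoneme
from pypinyin import Style, lazy_pinyin  # Chinese pinyin with tone digits

g2p = G2p()

def mixed_phonemes(text):
    """Split text into Chinese/English runs and phonemize each run.
    The splitting regex and output format are illustrative assumptions."""
    phones = []
    for chunk in re.findall(r"[\u4e00-\u9fff]+|[A-Za-z'\s]+", text):
        if re.match(r"[\u4e00-\u9fff]", chunk):
            # Chinese: pinyin syllables with tone numbers, e.g. 'ni3'
            phones += lazy_pinyin(chunk, style=Style.TONE3)
        else:
            # English: ARPAbet phonemes from g2p_en, e.g. 'W', 'ER1'
            phones += [p for p in g2p(chunk) if p.strip()]
    return phones

print(mixed_phonemes("你好 world"))  # ['ni3', 'hao3', 'W', 'ER1', 'L', 'D']
```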
Thank you for your thoughtful reply! For multilingual training, have you tried the cleaners from https://github.com/CjangCjengh/vits? It might be worth trying them and comparing the results~
Hello, do you mean the per-language cleaners in that repo? The four datasets I use all seem to be clean data that need no normalization or cleaning, and at inference time I likewise use Baidu's mix_front(end) for text normalization.
Paper:
Paper reading:
Reproduction:
https://github.com/espnet/espnet/blob/master/egs2/csmsc/tts1/conf/tuning/train_full_band_vits.yaml
References:
Challenges of 2-stage speech synthesis
a. Train the acoustic model (leaving MFA training aside)
b. Train the vocoder
c. Finetune the vocoder on GTA (ground-truth-aligned) mels generated by the trained acoustic model (see the sketch after this list)
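A minimal sketch of step c in Paddle, with placeholder modules standing in for the trained acoustic model and vocoder; all names, shapes, and the teacher-forcing call are illustrative assumptions, not PaddleSpeech's actual API:

```python
import paddle

# Placeholders: in practice these are the trained acoustic model
# (e.g. FastSpeech2) and the neural vocoder (e.g. HiFi-GAN / PWGAN).
acoustic_model = paddle.nn.Linear(80, 80)  # stands in for AM(text, durations)
vocoder = paddle.nn.Conv1D(80, 1, 1)       # stands in for the vocoder
optimizer = paddle.optimizer.Adam(parameters=vocoder.parameters())
l1 = paddle.nn.L1Loss()

def gta_finetune_step(mel_gt, wav_gt):
    # 1) The frozen acoustic model re-synthesizes the mel with teacher
    #    forcing ("ground-truth aligned"), so the vocoder sees mels that
    #    carry the acoustic model's inference-time artifacts.
    with paddle.no_grad():
        gta_mel = acoustic_model(mel_gt)             # [B, T, 80], illustrative
    # 2) Finetune the vocoder on (GTA mel, real wav) pairs.
    wav_hat = vocoder(gta_mel.transpose([0, 2, 1]))  # [B, 1, T], illustrative
    loss = l1(wav_hat, wav_gt)
    loss.backward()
    optimizer.step()
    optimizer.clear_grad()
    return float(loss)
```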
FastSpeech2s
Challenges:
a. wav contains phase information, so text2wav faces a larger input-output gap than text2mel
b. Due to GPU memory limits, the input can only be a clip of audio and the corresponding partial text, which harms the modeling of relationships across the full text and makes text embeddings harder to learn
Solutions:
a. wav decoder with adversarial training; the generator structure is WaveNet-like, and the discriminator is the one from PWGAN
b. a mel-spectrogram decoder assists the learning of text feature representations
VITS
Posterior encoder: non-causal WaveNet residual blocks
Prior encoder: a text encoder plus a normalizing flow that increases the flexibility of the prior distribution; the flow consists of several WaveNet residual blocks
Decoder: same structure as the HiFi-GAN V1 generator
Discriminator: same structure as the multi-period discriminator in HiFi-GAN
Alignment: Monotonic Alignment Search (MAS), similar to Glow-TTS (a sketch follows below)
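A minimal NumPy sketch of MAS, assuming the usual Glow-TTS formulation: given a matrix of log-likelihoods between text tokens and mel frames, dynamic programming finds the most likely monotonic, non-skipping alignment (all names here are illustrative):

```python
import numpy as np

def monotonic_alignment_search(log_p):
    """log_p: [T_text, T_mel] log-likelihood of text token i generating
    mel frame j. Returns a 0/1 alignment path of the same shape."""
    T_text, T_mel = log_p.shape
    Q = np.full((T_text, T_mel), -np.inf)   # best cumulative log-likelihood
    Q[0, 0] = log_p[0, 0]
    for j in range(1, T_mel):
        for i in range(min(j + 1, T_text)):  # path cannot skip tokens: i <= j
            stay = Q[i, j - 1]
            advance = Q[i - 1, j - 1] if i > 0 else -np.inf
            Q[i, j] = max(stay, advance) + log_p[i, j]
    # Backtrack from the last token/frame to recover the path.
    path = np.zeros_like(log_p, dtype=np.int64)
    i = T_text - 1
    for j in range(T_mel - 1, -1, -1):
        path[i, j] = 1
        if i > 0 and (i == j or Q[i - 1, j - 1] > Q[i, j - 1]):
            i -= 1
    return path
```

In Glow-TTS/VITS this search runs without gradients; each token's duration is then read off as the number of frames it claims along the path.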