Higher than GT on UTMOS? #1
Comments
Thank you for your interest and observation! We've open-sourced the codec's checkpoint, making it easy for you to replicate our experiments using VALL-E (https://github.com/lifeiteng/vall-e). I've also just uploaded my VALL-E results for you to listen to (https://drive.google.com/file/d/1irlGr-5fpnPwIzHMkMTGbU5T3OpiPsIS/view?usp=sharing). I look forward to your thoughts and further discussion.
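As a side note on the title of this issue, below is a minimal sketch of how one might compare UTMOS scores for ground-truth audio and codec-reconstructed audio. It is not from this repo: it assumes the public tarepan/SpeechMOS torch.hub wrapper as a stand-in UTMOS predictor, and the file paths are hypothetical.

```python
# Sketch: compare UTMOS of a ground-truth recording vs. its codec reconstruction.
# Assumptions: the tarepan/SpeechMOS torch.hub wrapper is an acceptable proxy for
# the UTMOS evaluation; "gt.wav" / "codec_recon.wav" are hypothetical paths.
import torch
import torchaudio

# UTMOS22 "strong" predictor via torch.hub.
predictor = torch.hub.load("tarepan/SpeechMOS:v1.2.0", "utmos22_strong", trust_repo=True)

def utmos(path: str) -> float:
    wav, sr = torchaudio.load(path)       # (channels, samples)
    wav = wav.mean(dim=0, keepdim=True)   # downmix to mono, shape (1, samples)
    with torch.no_grad():
        return predictor(wav, sr).item()  # predicted MOS for the single utterance

gt_score = utmos("gt.wav")                # ground-truth recording
rec_score = utmos("codec_recon.wav")      # codec-reconstructed version
print(f"GT UTMOS: {gt_score:.2f}, reconstruction UTMOS: {rec_score:.2f}")
```

A reconstruction scoring above GT is plausible here, since UTMOS is a learned naturalness predictor rather than a fidelity measure.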
It sounds really fantastic. As I understand it, this can also be used with StyleTTS2? Do you have an example of how it could be applied?
Thank you for your question! StyleTTS2 is trained end-to-end, so it might be challenging to apply our approach directly. For non-autoregressive (NAR) TTS models like NS2, our method might be more applicable, but I'm not sure if it will work. It would be interesting to explore whether unifying semantic and acoustic representations could further improve NAR audio generation models.
So as I understand it, the biggest problem in StyleTTS2 is the vocoder? But maybe it could be replaced with a codec-based one?
You're right.
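For readers following the exchange above, here is a minimal structural sketch of what "replacing the vocoder with a codec-based one" means in an NAR TTS pipeline. The classes are hypothetical stand-ins, not this repo's or StyleTTS2's API; they only contrast the two decode paths.

```python
# Sketch under assumptions: TinyVocoder / TinyCodecDecoder are toy stand-ins,
# not real models. Path A renders audio from mels via a vocoder; Path B renders
# audio from discrete codec tokens via a codec decoder instead.
import torch
import torch.nn as nn

class TinyVocoder(nn.Module):
    """Stand-in for a GAN vocoder: mel (B, n_mels, T) -> waveform (B, T*hop)."""
    def __init__(self, n_mels: int = 80, hop: int = 256):
        super().__init__()
        self.proj = nn.Linear(n_mels, hop)
    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        return self.proj(mel.transpose(1, 2)).flatten(1)

class TinyCodecDecoder(nn.Module):
    """Stand-in for a neural codec decoder: token ids (B, T) -> waveform (B, T*hop)."""
    def __init__(self, codebook_size: int = 1024, hop: int = 320):
        super().__init__()
        self.emb = nn.Embedding(codebook_size, hop)
    def forward(self, codes: torch.Tensor) -> torch.Tensor:
        return self.emb(codes).flatten(1)

# Path A: the acoustic model predicts mel-spectrograms; a vocoder renders audio.
mel = torch.randn(1, 80, 50)
wav_a = TinyVocoder()(mel)

# Path B: the acoustic model predicts codec tokens; the codec decoder replaces
# the vocoder as the waveform renderer.
codes = torch.randint(0, 1024, (1, 50))
wav_b = TinyCodecDecoder()(codes)
print(wav_a.shape, wav_b.shape)
```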
Surprising: is the ground truth really worse than the generated audio?