
About the multi_speaker implementation #37

Open
LeoniusChen opened this issue Mar 3, 2021 · 4 comments

Comments

@LeoniusChen

Hi, I read about your multi_speaker implementation of Tacotron2. It means different speakers correspond to different text inputs, and you did not use a speaker embedding. Am I right? If so, the speaker information is entangled in the text input, which seems unnecessary.

@begeekmyfriend
Owner

I just expanded the symbol table, and each symbol offset represents one speaker as an implicit embedding.
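As a rough illustration of this idea (the names and dimensions below are assumptions, not the repository's actual identifiers): the embedding table is enlarged to cover every (symbol, speaker) pair, and each speaker's text uses its own block of symbol ids, so the speaker identity rides along with the text input itself.

```python
# Minimal sketch of the "expanded symbol table" trick, assuming a base symbol
# set of NUM_SYMBOLS entries and NUM_SPEAKERS speakers (both hypothetical here).
import torch
import torch.nn as nn

NUM_SYMBOLS = 150      # size of the base symbol set (assumption)
NUM_SPEAKERS = 4       # number of speakers in the corpus (assumption)
EMBED_DIM = 512

# One embedding table covering every (symbol, speaker) pair.
symbol_embedding = nn.Embedding(NUM_SYMBOLS * NUM_SPEAKERS, EMBED_DIM)

def shift_sequence(symbol_ids, speaker_id):
    """Offset each symbol id into the block reserved for this speaker,
    so the speaker identity is carried implicitly by the text input."""
    return [sid + speaker_id * NUM_SYMBOLS for sid in symbol_ids]

# The same text maps to different embedding rows for different speakers.
base_ids = [12, 7, 33]                        # pretend output of a text-to-sequence step
ids_spk0 = torch.tensor(shift_sequence(base_ids, 0))
ids_spk2 = torch.tensor(shift_sequence(base_ids, 2))
emb0 = symbol_embedding(ids_spk0)             # rows 12, 7, 33
emb2 = symbol_embedding(ids_spk2)             # rows 312, 307, 333
```

One consequence of this layout is that symbol embeddings are not shared across speakers, so each speaker's rows are trained only on that speaker's data.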

@LeoniusChen
Author

> I just expanded the symbol table, and each symbol offset represents one speaker as an implicit embedding.

Thanks for your reply! I understand what you have done. I think this implementation may introduce unnecessary trouble if I want to preserve the prosody of a reference utterance (from speaker A) while keeping the timbre of speaker B. Do you know of other implementations of multi_speaker Tacotron?

@begeekmyfriend
Owner

Well, this project does not implement a prosody memory for speakers. In other words, the prosody of each speaker is independent of the others. If you want to refer to the prosody of another speaker, an extra explicit prosody embedding is needed. Unfortunately, as far as I know, current deep learning implementations do not handle this issue perfectly. Global style tokens (aka GST) for Tacotron are one such approach. There is a good PyTorch project, though it is based on Tacotron 1. I do not know if this project suits you.
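For reference, a minimal, simplified sketch of the GST idea mentioned above: a reference encoder squeezes a mel spectrogram into one vector, and attention over a bank of learned "style tokens" yields a prosody embedding that can condition the encoder outputs. The module names and dimensions below are my own assumptions, not code from the PyTorch project referred to here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleGST(nn.Module):
    def __init__(self, n_mels=80, ref_dim=128, num_tokens=10, token_dim=256):
        super().__init__()
        # Reference encoder: a GRU over mel frames -> fixed-size reference vector.
        self.ref_rnn = nn.GRU(n_mels, ref_dim, batch_first=True)
        # Bank of learnable style tokens.
        self.tokens = nn.Parameter(torch.randn(num_tokens, token_dim) * 0.3)
        # Project the reference vector into the token space to form the query.
        self.query_proj = nn.Linear(ref_dim, token_dim)

    def forward(self, mel):                      # mel: (batch, frames, n_mels)
        _, ref = self.ref_rnn(mel)                # ref: (1, batch, ref_dim)
        query = self.query_proj(ref.squeeze(0))   # (batch, token_dim)
        scores = query @ self.tokens.t()          # (batch, num_tokens)
        weights = F.softmax(scores, dim=-1)
        style = weights @ self.tokens             # (batch, token_dim)
        return style  # broadcast-add or concatenate to the text encoder outputs

# Usage sketch: extract a prosody embedding from reference utterances.
gst = SimpleGST()
mel_ref = torch.randn(2, 400, 80)                # reference mels from speaker A
style_emb = gst(mel_ref)                         # (2, 256) prosody embedding
```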

@LeoniusChen
Author

Thanks for your kind help! I've read this GST project before; it only uses a single-speaker dataset. My issue is that I need a multi_speaker Tacotron where the speaker embedding is explicitly given 🤣 Anyway, I'll try to implement it.
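A rough sketch of what an explicit speaker embedding could look like, assuming the usual recipe of a learned lookup table whose vector is broadcast and concatenated to every encoder time step before attention and decoding (all names and sizes below are illustrative):

```python
import torch
import torch.nn as nn

NUM_SPEAKERS = 4       # assumption
SPEAKER_DIM = 64       # assumption
ENCODER_DIM = 512      # assumption

speaker_embedding = nn.Embedding(NUM_SPEAKERS, SPEAKER_DIM)

def add_speaker(encoder_outputs, speaker_ids):
    """encoder_outputs: (batch, time, ENCODER_DIM); speaker_ids: (batch,)"""
    spk = speaker_embedding(speaker_ids)                       # (batch, SPEAKER_DIM)
    spk = spk.unsqueeze(1).expand(-1, encoder_outputs.size(1), -1)
    return torch.cat([encoder_outputs, spk], dim=-1)           # (batch, time, ENCODER_DIM + SPEAKER_DIM)

# Example: condition on speaker B's identity; the prosody could come from a
# GST-style reference embedding as sketched earlier in the thread.
enc = torch.randn(2, 100, ENCODER_DIM)
ids = torch.tensor([1, 3])
conditioned = add_speaker(enc, ids)   # fed to the attention/decoder
```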
