[Help]: more duration predictor details & '50' meaning of target len in text2semantic of MaskGCT #346

Open
dobby-seo opened this issue Nov 13, 2024 · 0 comments

dobby-seo commented Nov 13, 2024

Problem Overview

I want to understand whether the duration predictor is used to produce durations as training data. I am also curious about the constant 50 in the rule-based duration calculation in text2semantic.

Steps Taken

1. duration predictor
"In this paper, we also train a flow matching [45] based duration prediction model to predict the total duration conditioned on the text and prompt speech duration, leveraging in-context learning. More details can be found in Appendix A.5."

Is the duration predictor used to generate durations as training input for the text-to-semantic model? Or is an approximate duration simply supplied for evaluation?

2. '50' meaning of target len in text2semantic of MaskGCT
The code below is a snippet of text2semantic in MaskGCT.

@torch.no_grad()
def text2semantic(
    self,
    prompt_speech,
    prompt_text,
    prompt_language,
    target_text,
    target_language,
    target_len=None,
    n_timesteps=50,
    cfg=2.5,
    rescale_cfg=0.75,
):
    prompt_phone_id = g2p_(prompt_text, prompt_language)[1]
    target_phone_id = g2p_(target_text, target_language)[1]

    if target_len is None:
        target_len = int(
            (len(prompt_speech) * len(target_phone_id) / len(prompt_phone_id))
            / 16000
            * 50
        )
    else:
        target_len = int(target_len * 50)

I tried to find this value in the paper but could not.
What does the constant 50 represent? My guess is that it is the minimum number of frames needed to utter one phoneme. Please let me know what this number means 🥲
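
To make the units in the rule-based branch easier to discuss, the same arithmetic can be written with named constants. This is only a sketch under assumptions: treating 16000 as the sample rate of prompt_speech (samples per second) and 50 as a frames-per-second multiplier is one possible reading, not something confirmed by the paper excerpt quoted above. The function name estimate_target_len and its parameter names are hypothetical and do not appear in the repository.

```python
def estimate_target_len(
    num_prompt_samples,
    num_prompt_phones,
    num_target_phones,
    sample_rate=16000,       # assumption: prompt_speech is 16 kHz audio
    frames_per_second=50,    # assumption: 50 is a frame rate, not a per-phoneme count
):
    """Mirror the `target_len is None` branch of the snippet with named constants."""
    if num_prompt_phones <= 0:
        raise ValueError("prompt must contain at least one phone")

    # Scale the prompt length (in samples) by the target/prompt phone ratio.
    scaled_samples = num_prompt_samples * num_target_phones / num_prompt_phones

    # Under the sample-rate assumption, this is the estimated target duration in seconds.
    target_seconds = scaled_samples / sample_rate

    # Under the frame-rate assumption, this is the number of frames to generate.
    return int(target_seconds * frames_per_second)


# Made-up example: a 3-second prompt (48000 samples) with 30 phones,
# and a target text with 60 phones.
print(estimate_target_len(48000, 30, 60))
```

Under this reading, the explicit branch `int(target_len * 50)` would treat a user-supplied target_len as a duration in seconds. If 50 were instead a per-phoneme frame count, the division by 16000 would be harder to interpret, which is why the intended meaning is worth confirming.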
