Problem Overview
I want to understand whether the duration predictor is used to produce durations for the training data. I'm also curious about the constant '50' in the rule-based duration calculation in text2semantic.
Steps Taken
1. Duration predictor
"In this paper, we also train a flow matching [45] based duration prediction model to predict the total duration conditioned on the text and prompt speech duration, leveraging in-context learning. More details can be found in Appendix A.5."
Is the duration predictor used to generate durations as training input for the text-to-semantic model? Or is an approximate duration simply supplied at evaluation time?
2. Meaning of '50' in the target length (target len) in text2semantic of MaskGCT
The code below is a snippet of text2semantic in MaskGCT.
I tried to find this value in the paper, but couldn't.
I'm curious what the constant '50' represents. My guess is that it is the minimum number of frames needed to utter one phoneme. Please let me know what this number means 🥲
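To make the second question concrete, here is a minimal sketch of the general shape of a rule-based target-length estimate. It is not the MaskGCT snippet: the function name `estimate_target_len`, its parameters, and the byte-ratio heuristic are all assumptions made for illustration. The only thing taken from the question is the constant 50. In this sketch it multiplies an estimated duration in seconds, so under this reading it would behave like a frames-per-second rate rather than a per-phoneme minimum. Whether that matches the real code is exactly what I'm asking.

```python
def estimate_target_len(prompt_seconds, prompt_text, target_text, scale=50):
    """Hypothetical rule-based target length, in frames.

    Assumption: the target duration scales with the ratio of target text
    length to prompt text length, measured in UTF-8 bytes. `scale` stands
    in for the constant 50 from the question; its meaning is unconfirmed.
    """
    prompt_bytes = len(prompt_text.encode("utf-8"))
    target_bytes = len(target_text.encode("utf-8"))
    if prompt_bytes == 0:
        raise ValueError("prompt_text must not be empty")

    # Estimated duration of the target speech, in seconds.
    target_seconds = prompt_seconds * target_bytes / prompt_bytes

    # Convert the estimated duration to a frame count.
    return int(target_seconds * scale)


if __name__ == "__main__":
    # A 2-second prompt whose text is half as long as the target text.
    print(estimate_target_len(2.0, "abcd", "abcdefgh"))
```

If the constant were instead a minimum number of frames per phoneme, I would expect it to multiply a phoneme count rather than a duration in seconds.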