I have done my due diligence in trying to find the answer myself.
Topic
The paper
Question
I have a puzzling question: why is Mimi so good at reconstructing long audio? I tried using Mimi to reconstruct a speech clip longer than a minute, and the result was surprisingly good. Yet according to the paper, the attention span used during training is limited to just 10 seconds, and I haven't seen any code that applies a segmentation strategy to long audio.
It seems to me that relying solely on the RoPE positional encoding isn't enough to allow a model trained on 10-second audio to easily generalize to reconstructing audio longer than a minute.
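To make the RoPE property I'm referring to concrete, here is a minimal NumPy sketch (my own illustration, not Moshi's actual code). It shows that RoPE attention scores depend only on the relative offset between query and key positions, so a bounded attention window sees the same positional geometry whether it sits at second 3 or second 60 of the clip, which may partly explain the extrapolation:

```python
import numpy as np

def apply_rope(x, pos, base=10000.0):
    """Rotate each even/odd feature pair of x by pos * theta_i (standard RoPE)."""
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)  # one frequency per pair
    ang = pos * theta
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(0)
q = rng.standard_normal(64)
k = rng.standard_normal(64)

# Attention score for (query at pos 5, key at pos 2) ...
s_near = apply_rope(q, 5) @ apply_rope(k, 2)
# ... equals the score far beyond any training length, at (6005, 6002):
s_far = apply_rope(q, 6005) @ apply_rope(k, 6002)
assert np.allclose(s_near, s_far)
```

So within a limited attention window the model is position-invariant; my question is whether that alone accounts for the quality on minute-long audio.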
How do you achieve such good performance on long speech audio reconstruction?