Why does Mimi have such a strong ability to reconstruct long audio? #172

Open
1 task done
RobinWitch opened this issue Dec 13, 2024 · 1 comment
Labels
question Further information is requested

Comments

@RobinWitch
Due diligence

  • I have done my due diligence in trying to find the answer myself.

Topic

The paper

Question

I have a puzzling observation: why does Mimi reconstruct long audio so well? I used Mimi to reconstruct a speech clip longer than a minute, and the result was surprisingly good. According to the paper, the attention length used in the network during training is limited to just 10 seconds, and I haven't found any code that applies a segmentation strategy to long audio.

It seems to me that relying solely on the RoPE positional encoding isn't enough to allow a model trained on 10-second audio to easily generalize to reconstructing audio longer than a minute.
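
To make the concern concrete: RoPE attention scores depend only on the offset between query and key positions. If attention were also restricted to a fixed causal window, the offsets seen on a long clip would never exceed those seen on a 10-second clip. Whether Mimi actually works this way is not confirmed in this thread; the sketch below only illustrates the hypothesis. The 25 Hz frame rate and the helper names are assumptions for illustration, not taken from Mimi's code.

```python
import numpy as np


def causal_window_mask(num_frames: int, context: int) -> np.ndarray:
    """Boolean mask: query i may attend to key j iff 0 <= i - j < context."""
    idx = np.arange(num_frames)
    offset = idx[:, None] - idx[None, :]  # query index minus key index
    return (offset >= 0) & (offset < context)


def max_relative_offset(num_frames: int, context: int) -> int:
    """Largest query-key distance that is ever attended to under the mask."""
    idx = np.arange(num_frames)
    offset = idx[:, None] - idx[None, :]
    return int(offset[causal_window_mask(num_frames, context)].max())


FRAME_RATE_HZ = 25                    # assumed frame rate, illustration only
CONTEXT_FRAMES = 10 * FRAME_RATE_HZ   # the 10-second limit mentioned above

# Largest attended offset for a 10-second clip versus a 70-second clip.
short_clip = max_relative_offset(10 * FRAME_RATE_HZ, CONTEXT_FRAMES)
long_clip = max_relative_offset(70 * FRAME_RATE_HZ, CONTEXT_FRAMES)
print(short_clip, long_clip)
```

Under these assumptions both values are equal, so a long clip would not expose the model to relative positions it had not seen during training.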

How do you achieve such good performance in long speech audio reconstruction?

@RobinWitch RobinWitch added the question Further information is requested label Dec 13, 2024
@yukiarimo
+1

I tried randomly removing 1/3 of the audio tokens, and it still detokenizes the audio well!
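
The comment does not say how the tokens were removed. One possible reading is that a random third of the frames was dropped along the time axis before decoding. A minimal sketch of that reading follows; the `[num_codebooks, num_frames]` layout, the placeholder code values, and the `drop_random_frames` helper are all assumptions for illustration, not part of Mimi's API.

```python
import numpy as np


def drop_random_frames(codes: np.ndarray, drop_fraction: float = 1 / 3, seed: int = 0):
    """Drop a random fraction of frames along the last (time) axis.

    Returns the shortened code array and the sorted indices of kept frames.
    """
    rng = np.random.default_rng(seed)
    num_frames = codes.shape[-1]
    num_keep = num_frames - int(round(num_frames * drop_fraction))
    # Sample without replacement, then sort so the kept frames stay in order.
    keep = np.sort(rng.choice(num_frames, size=num_keep, replace=False))
    return codes[..., keep], keep


# Placeholder token grid (hypothetical): 8 codebooks x 750 frames.
codes = np.arange(8 * 750).reshape(8, 750)
shortened, kept = drop_random_frames(codes)
print(codes.shape, shortened.shape)
```

Under this reading the decoded audio would presumably be about one third shorter than the original. Other readings, such as replacing tokens with random values or dropping whole codebooks, would need a different helper.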
