I have done my due diligence in trying to find the answer myself.
Topic
The paper
Question
I have a puzzling question: why is Mimi so good at reconstructing long audio? I tried using Mimi to reconstruct a speech clip longer than a minute, and the result was surprisingly good. Yet according to the paper, the attention span used during training is limited to just 10 seconds, and I haven't seen any code that applies a segmentation strategy to long audio.
It seems to me that relying solely on the RoPE positional encoding isn't enough to allow a model trained on 10-second audio to easily generalize to reconstructing audio longer than a minute.
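To make the RoPE property I'm referring to concrete, here is a minimal NumPy sketch (my own illustration, not Moshi's actual code). It shows that RoPE attention scores depend only on the relative offset between query and key positions, so a bounded attention window sees the same positional geometry whether it sits at second 3 or second 60 of the clip, which may partly explain the extrapolation:

```python
import numpy as np

def apply_rope(x, pos, base=10000.0):
    """Rotate each even/odd feature pair of x by pos * theta_i (standard RoPE)."""
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)  # one frequency per pair
    ang = pos * theta
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(0)
q = rng.standard_normal(64)
k = rng.standard_normal(64)

# Attention score for (query at pos 5, key at pos 2) ...
s_near = apply_rope(q, 5) @ apply_rope(k, 2)
# ... equals the score far beyond any training length, at (6005, 6002):
s_far = apply_rope(q, 6005) @ apply_rope(k, 6002)
assert np.allclose(s_near, s_far)
```

So within a limited attention window the model is position-invariant; my question is whether that alone accounts for the quality on minute-long audio.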
How do you achieve such good performance on long speech audio reconstruction?