Audio length 1s #87

9B8DY6 · 2022-11-30T12:24:44Z

Is it okay to extract audio features from clips that are only 1~3 s long? The fbank shape is (139, 128), which means n_frames = 139.

YuanGongND · 2022-11-30T19:18:06Z

Yes, the AST model certainly supports 1 s audio: our SpeechCommands recipe runs on 1 s clips and achieves state-of-the-art performance.

If you use our training pipeline, you just need to change

ast/egs/speechcommands/run_sc.sh

Line 34 in 97e57e7

audio_length=128

to your desired n_frames (139 in your case). For optimal performance, you might want to set

ast/egs/speechcommands/run_sc.sh

Line 21 in 97e57e7

audiosetpretrain=False

to True and do other tuning. Please check the README file for details.
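As a rough check on what n_frames to put in audio_length, the frame count can be estimated from the clip duration. The sketch below is not from the AST repo; it assumes Kaldi-style framing (25 ms window, 10 ms shift, no edge padding) and 16 kHz audio, and the helper name `kaldi_num_frames` is hypothetical.

```python
def kaldi_num_frames(num_samples, sample_rate=16000,
                     frame_length_ms=25.0, frame_shift_ms=10.0):
    """Estimate the number of fbank frames for a clip.

    Assumes Kaldi-style framing without edge padding: a frame is
    emitted only where a full window fits inside the signal.
    """
    win = int(sample_rate * frame_length_ms / 1000)   # samples per window
    hop = int(sample_rate * frame_shift_ms / 1000)    # samples per shift
    if num_samples < win:
        return 0
    return 1 + (num_samples - win) // hop


if __name__ == "__main__":
    for seconds in (1, 3, 10):
        print(seconds, "s ->", kaldi_num_frames(seconds * 16000), "frames")
```

Under these assumptions a 1 s clip gives 98 frames, which is why the SpeechCommands recipe can use audio_length=128 with some padding, and 139 frames corresponds to roughly 1.4 s of audio.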

-Yuan

9B8DY6 · 2022-12-01T03:32:42Z

@YuanGongND If the audio is much shorter than 10 s, e.g. 1~3 s, do I have to pretrain AST from scratch? I just want to use the pretrained AST model to extract audio tokens.

YuanGongND · 2022-12-01T03:57:02Z

In my experience, AudioSet pretraining does not hurt performance in almost all cases, so you can certainly try setting audiosetpretrain=True and imagenetpretrain=True, as we did for the ESC-50 recipe. You can use AudioSet pretraining whether your target audio length is shorter or longer than 10 s; we adapt the positional embedding to fit the length internally in the model:

ast/src/models/ast_models.py

Lines 143 to 147 in 5f50e00

    
    if t_dim < 101:
        new_pos_embed = new_pos_embed[:, :, :, 50 - int(t_dim/2): 50 - int(t_dim/2) + t_dim]
    # otherwise interpolate
    else:
        new_pos_embed = torch.nn.functional.interpolate(new_pos_embed, size=(12, t_dim), mode='bilinear')

A fine-tuning stage is crucial for AST to achieve the optimal solution.
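To see where the 101, 50, and 12 in the snippet above come from, here is a small worked sketch. It assumes 16x16 patches with stride 10 in both frequency and time and a 128-bin, 1024-frame pretraining input; the helper names `patch_grid` and `time_crop_window` are hypothetical and only mirror the index arithmetic, not the model code itself.

```python
def patch_grid(n_frames, n_mels=128, patch=16, fstride=10, tstride=10):
    """Number of patches along frequency and time for a strided 16x16 split."""
    f_dim = (n_mels - patch) // fstride + 1
    t_dim = (n_frames - patch) // tstride + 1
    return f_dim, t_dim


def time_crop_window(t_dim, pretrained_t=101):
    """Centered slice [start, end) of the pretrained time positions to keep."""
    center = pretrained_t // 2          # 50 for a 101-step pretrained grid
    start = center - t_dim // 2
    return start, start + t_dim


if __name__ == "__main__":
    print(patch_grid(1024))        # pretraining input: 1024 frames
    print(patch_grid(139))         # the 139-frame input from this thread
    print(time_crop_window(13))    # which pretrained positions are kept
```

With these assumptions, a 1024-frame input yields a 12 x 101 grid, while a 139-frame input yields 12 x 13, so the time axis of the positional embedding is cut to the centered positions 44 to 56 rather than interpolated.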

If you want to freeze the AST model and extract features, it might be better to pad your input to 10 s. I would suggest using this inference script: https://colab.research.google.com/github/YuanGongND/ast/blob/master/colab/AST_Inference_Demo.ipynb. Instead of taking the last-layer output (prediction logits), take the penultimate-layer output as the feature. It should be a relatively easy modification, and you don't need to worry about your input length: the script loads the audio and pads it to 10 s.
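The padding step described above can be sketched as follows. This is an illustration only, not the demo's code: it assumes zero-padding at the end of the time axis and a 1024-frame target for 10 s, and `pad_or_crop` is a hypothetical helper; check the notebook for the exact padding and normalization it applies.

```python
import numpy as np


def pad_or_crop(fbank, target_length=1024):
    """Zero-pad (or truncate) a (n_frames, n_mels) fbank along the time axis."""
    n_frames, n_mels = fbank.shape
    if n_frames < target_length:
        # Assumption: pad with zeros after the last real frame.
        pad = np.zeros((target_length - n_frames, n_mels), dtype=fbank.dtype)
        return np.concatenate([fbank, pad], axis=0)
    # Longer inputs are cut to the target length.
    return fbank[:target_length]


if __name__ == "__main__":
    short = np.random.randn(139, 128).astype(np.float32)  # stand-in features
    print(pad_or_crop(short).shape)
```

The padded array then has the same shape as the 10 s inputs the pretrained model saw, so no change to the positional embedding is needed when the model is frozen.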

-Yuan

9B8DY6 · 2022-12-01T04:57:48Z

How about your pretrained model? Does it also work well on short audio?

YuanGongND · 2022-12-01T05:11:35Z

By audiosetpretrain=True, I meant our pretrained model.

For end-to-end fine-tuning, it works well for shorter audio; please see our ESC-50 recipe.

For frozen feature extraction, I think padding the input to a longer length is the best choice; please check the Colab script.

YuanGongND added the question (Further information is requested) label Nov 30, 2022

9B8DY6 closed this as completed Dec 1, 2022
