Audio length 1s #87

Closed
9B8DY6 opened this issue Nov 30, 2022 · 5 comments

@9B8DY6

9B8DY6 commented Nov 30, 2022

Is it okay to extract audio features from clips that are only 1~3s long? The fbank shape is (139, 128), which means n_frames = 139.

@YuanGongND YuanGongND added the question Further information is requested label Nov 30, 2022
@YuanGongND
Owner

YuanGongND commented Nov 30, 2022

Yes, the AST model certainly supports 1s audio: our SpeechCommands recipe runs on 1s clips and achieves state-of-the-art performance.

If you use our training pipeline, you just need to change

audio_length=128

to your desired n_frames (139 in your case). For optimal performance, you might also want to change

audiosetpretrain=False

to True and do other tuning; please check the README file for details.

-Yuan
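
For readers wondering what changing audio_length does to the model input, here is a minimal sketch of the patch-grid arithmetic. It assumes AST's default patch embedding (16x16 patches, stride 10 in both frequency and time, no padding); the helper name `ast_patch_grid` is made up for illustration and is not part of the repository.

```python
def ast_patch_grid(n_frames, n_mels=128, patch=16, fstride=10, tstride=10):
    """Return (f_dim, t_dim), the number of patches along frequency and time.

    Assumes an unpadded convolutional patch embedding with the given
    patch size and strides (the AST defaults, taken here as an assumption).
    """
    f_dim = (n_mels - patch) // fstride + 1
    t_dim = (n_frames - patch) // tstride + 1
    return f_dim, t_dim


# 128 frames (SpeechCommands recipe), 139 frames (this issue), 1024 frames (~10s)
for n_frames in (128, 139, 1024):
    f_dim, t_dim = ast_patch_grid(n_frames)
    print(f"n_frames={n_frames}: grid {f_dim}x{t_dim} = {f_dim * t_dim} patches")
```

Under these assumptions, a 139-frame input yields a 12x13 grid, while a 1024-frame input yields 12x101.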

@9B8DY6
Author

9B8DY6 commented Dec 1, 2022

@YuanGongND If the audio is much shorter than 10s, e.g., 1s~3s, do I have to pretrain AST from scratch? I just want to use the pretrained AST model to extract audio tokens.

@YuanGongND
Owner

YuanGongND commented Dec 1, 2022

In my experience, AudioSet pretraining does not hurt performance in almost any case, so you can certainly try setting audiosetpretrain=True and imagenetpretrain=True, as we did for the ESC-50 recipe. You can use AudioSet pretraining whether your target audio is shorter or longer than 10s; the model adapts the positional embedding to the input length internally:

# cut the time axis (from the middle) for shorter inputs
if t_dim < 101:
    new_pos_embed = new_pos_embed[:, :, :, 50 - int(t_dim/2): 50 - int(t_dim/2) + t_dim]
# otherwise interpolate
else:
    new_pos_embed = torch.nn.functional.interpolate(new_pos_embed, size=(12, t_dim), mode='bilinear')

A fine-tuning stage is crucial for AST to achieve the optimal solution.
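
To make the snippet above easier to try in isolation, here is a self-contained sketch of the same cut-or-interpolate logic. The function name `adapt_time_axis`, the random tensor, and the embedding size of 768 are assumptions for illustration; the constants 101 and 12 come from the snippet itself.

```python
import torch


def adapt_time_axis(pos_embed, t_dim, pretrained_t_dim=101):
    """Adapt a (1, embed_dim, f_dim, pretrained_t_dim) positional embedding
    to t_dim time patches: center-crop if shorter, otherwise interpolate."""
    if t_dim < pretrained_t_dim:
        center = pretrained_t_dim // 2      # 50 for a 101-patch time axis
        start = center - t_dim // 2
        return pos_embed[:, :, :, start:start + t_dim]
    f_dim = pos_embed.shape[2]              # frequency axis is left unchanged
    return torch.nn.functional.interpolate(
        pos_embed, size=(f_dim, t_dim), mode='bilinear')


# Stand-in for the pretrained embedding (random values, illustrative only).
pos = torch.randn(1, 768, 12, 101)
short = adapt_time_axis(pos, 13)    # e.g. a 139-frame input
long = adapt_time_axis(pos, 150)    # an input longer than 10s
print(short.shape, long.shape)
```

The short case keeps the 13 middle columns of the pretrained embedding untouched, while the long case resamples all of them.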

If you want to freeze the AST model and extract features, it might be better to pad your input to 10s. I would suggest using this inference script: https://colab.research.google.com/github/YuanGongND/ast/blob/master/colab/AST_Inference_Demo.ipynb, but instead of taking the last-layer output (prediction logits), take the penultimate-layer output as the feature. It should be a relatively easy modification, and you don't need to worry about your input length: the script loads the audio and pads it to 10s.

-Yuan
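
As a rough illustration of the padding step mentioned above, the sketch below zero-pads (or trims) an fbank along the time axis. It assumes a target of 1024 frames for roughly 10s of audio and zero padding at the end; the helper name `pad_or_trim` is hypothetical, and the Colab script may handle this differently.

```python
import torch


def pad_or_trim(fbank, target_length=1024):
    """Pad with zeros at the end of the time axis, or trim, so that an
    (n_frames, n_mels) fbank has exactly target_length frames."""
    p = target_length - fbank.shape[0]
    if p > 0:
        # Pad order is (last dim left, last dim right, time top, time bottom).
        fbank = torch.nn.functional.pad(fbank, (0, 0, 0, p))
    elif p < 0:
        fbank = fbank[:target_length, :]
    return fbank


fbank = torch.randn(139, 128)   # stand-in for the (139, 128) fbank in this issue
padded = pad_or_trim(fbank)
print(padded.shape)
```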

@9B8DY6
Author

9B8DY6 commented Dec 1, 2022

How about your pretrained model? Does it also work well on short audio?

@YuanGongND
Owner

By audiosetpretrain=True, I meant our pretrained model.

For end-to-end fine-tuning, it works well on shorter audio; please see our ESC-50 recipe.

For freezing the model and extracting features, I think padding the input to a longer length is the best choice; please check the Colab script.

@9B8DY6 9B8DY6 closed this as completed Dec 1, 2022