Audio length 1s #87
Comments
Yes, the AST model supports 1s audio: our SpeechCommands recipe runs on 1s clips and achieves state-of-the-art performance. If you use our training pipeline, you just need to change the value set at ast/egs/speechcommands/run_sc.sh, line 34 (commit 97e57e7), to your n_frames, in your case 139. For optimal performance, you might also want to set the option at ast/egs/speechcommands/run_sc.sh, line 21 (commit 97e57e7), to True and do other tuning; please check the readme file for details.
-Yuan
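For readers wondering where an n_frames value such as 139 comes from, here is a minimal sketch of the frame-count arithmetic. It assumes Kaldi-style fbank framing with a 25 ms window, a 10 ms shift, 16 kHz audio, and no edge padding; these settings and the helper name `fbank_num_frames` are assumptions for illustration and are not stated in this thread.

```python
def fbank_num_frames(num_samples, sample_rate=16000,
                     frame_length_ms=25.0, frame_shift_ms=10.0):
    """Estimate the number of fbank frames for a waveform.

    Assumes Kaldi-style framing without edge padding (hypothetical
    defaults: 25 ms window, 10 ms shift, 16 kHz sample rate).
    """
    window = int(sample_rate * frame_length_ms / 1000)  # samples per frame
    shift = int(sample_rate * frame_shift_ms / 1000)    # samples per hop
    if num_samples < window:
        return 0  # too short to fill even one frame
    return 1 + (num_samples - window) // shift


if __name__ == "__main__":
    for seconds in (1.0, 3.0, 10.0):
        n = fbank_num_frames(int(seconds * 16000))
        print(f"{seconds:>4}s -> {n} frames")
```

Under these assumptions, the safest approach is still to compute the fbank for a sample clip and read the first dimension of its shape, since the actual count depends on the feature extractor settings.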
@YuanGongND If the audio length is much shorter than 10s, e.g. 1s~3s, do I have to pretrain AST from scratch? I just want to use the pretrained AST model to extract audio tokens.
In my experience, AudioSet pretraining does not hurt performance in almost all cases, so you can certainly try enabling it (see Lines 143 to 147 in 5f50e00).
If you want to freeze the AST model and extract features, it might be better to pad your input to 10s. I would suggest using this inference script: https://colab.research.google.com/github/YuanGongND/ast/blob/master/colab/AST_Inference_Demo.ipynb, but instead of taking the last layer output (the prediction logits), take the penultimate layer output as the feature. It should be a relatively easy modification, and you don't need to worry about your input length: the script loads the audio and pads it to 10s. -Yuan
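The padding step described above can be sketched as follows. This is only an illustration, not the Colab script itself: it uses NumPy, zero-pads along the time axis, and assumes a target length of 1024 frames for a 10s input. The function name `pad_or_truncate` and the value 1024 are assumptions and should be checked against the script you actually use.

```python
import numpy as np


def pad_or_truncate(fbank, target_length=1024):
    """Zero-pad (or cut) a (n_frames, n_mels) fbank along the time axis.

    target_length=1024 is an assumed frame count for a 10s input.
    """
    n_frames = fbank.shape[0]
    if n_frames < target_length:
        # Append zero frames at the end; leave the mel axis untouched.
        pad_rows = target_length - n_frames
        return np.pad(fbank, ((0, pad_rows), (0, 0)), mode="constant")
    # Longer inputs are cut down to the target length.
    return fbank[:target_length]


if __name__ == "__main__":
    short_clip = np.ones((139, 128), dtype=np.float32)  # shape from this thread
    padded = pad_or_truncate(short_clip)
    print(padded.shape)
```

With a (139, 128) input like the one in this issue, the first 139 rows keep the original features and the remaining rows are zeros.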
How about your pretrained model? Does it also work well on short audio?
For end-to-end fine-tuning, it works well on shorter audio; please see our ESC-50 recipe. For freezing the model and extracting features, I think padding the input to a longer length is the best choice; please check the Colab script.
Is it okay to extract audio features from clips whose length is 1~3s? The fbank shape is (139, 128), which means that n_frames = 139.