Tag audio at a higher resolution #3
Hi, you are right, the audio tagging performance deteriorates a lot if you try to label very short audio snippets. I would say this is to some extent natural, as fine-grained labeling of short audio is difficult. However, MobileNet also downsamples the input strongly (x32 for our models) since this saves a lot of computation. For instance, if you try to label a one-second audio snippet, the output of the conv. part before adaptive pooling will be of shape bs x channels x 4 x 4. With such small feature map sizes, padding seems to be responsible for the decreasing performance. If you just repeat the 1-second snippet 10 times, you get reasonable predictions again (see the sketch below). I will run experiments with less down-sampling and with fully-convolutional classification heads. This should make it easier to get predictions for shorter audio snippets. If it works out, I'll add the new pre-trained models to the repository.
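For illustration, a minimal sketch of that tiling workaround, assuming a 32 kHz input sample rate (the model and preprocessing themselves would come from this repo's inference code):

```python
import torch

# Stand-in for 1 s of audio at an assumed 32 kHz sample rate
snippet = torch.randn(1, 32000)

# Tile it ~10 times so it roughly matches the ~10 s AudioSet training clips;
# the conv feature map before adaptive pooling then grows from ~4x4 to ~4x32,
# so zero-padding no longer dominates the pooled representation.
tiled = snippet.repeat(1, 10)
print(tiled.shape)  # torch.Size([1, 320000])

# logits = model(mel_spectrogram(tiled))  # then run the repo's usual inference on `tiled`
```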
Ok thank you for your answer, this makes sense.
Great, thank you!
Hi @fschmid56, have you had time to give it a try by any chance? If not, I can give it a try, but I won't be as fast as you. Edit:
To be clear, in my case it's not so much about tagging short ~1 sec files, but rather about running sound event detection on a long file at a high precision (say 1 s instead of 10 s). So I think that just retraining with fully-convolutional classification heads (without any down-sampling) would already be very helpful!
Yes, I already gave it a shot. Using the fully convolutional head and less down-sampling didn't work so well out of the box. I probably need to tweak the learning rate and other hyperparameters a bit. So far, I needed the limited available compute for other things. I started new experiments today, switching only to the fully convolutional mode first and then reducing the down-sampling slowly in the following experiments.
Let me understand that in more detail. If the network is given a 10 sec. audio file, the feature maps before adaptive average pooling will be of size t=32 and f=4. Using a fully convolutional head, you will therefore get an output of size (c=527, f=4, t=32). If you feed a longer audio sequence, 't' will scale up accordingly. Is it just about the convolutional head, or do you need 'f' and 't' in higher resolution? This would mean I have to reduce the strides (down-sampling) in the network. Currently the input spectrogram is down-sampled by a factor of 32 (5 layers with a stride of 2). Do you need that down-sampling factor reduced as well?
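As a back-of-the-envelope check of those shapes, here is a small sketch; the sample rate, hop size, and mel-bin count are assumptions chosen only to reproduce the numbers quoted in this thread (10 s -> 4x32, 1 s -> 4x4):

```python
import math

def conv_output_grid(duration_s, sr=32000, hop=320, n_mels=128, total_stride=32):
    """Rough (f, t) size of the feature map before adaptive pooling.

    sr, hop and n_mels are assumed values; total_stride=32 corresponds to the
    5 stride-2 layers mentioned above.
    """
    frames = int(duration_s * sr / hop)                      # spectrogram time frames
    return n_mels // total_stride, math.ceil(frames / total_stride)

print(conv_output_grid(10))  # (4, 32) -> fully convolutional head output (527, 4, 32)
print(conv_output_grid(1))   # (4, 4)  -> the tiny grid that makes short snippets hard
```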
Hi Florian, thank you for giving it a shot!
It is just about the convolutional head in my use case. I understand that padding might create issues at the beginning/end of a file (especially a short file), but it shouldn't be a big deal in my case since I deal with long recordings. I've read your paper in detail, it's great work! I should have more time next week and I hope I can dig deeper into your code. PS: do you still have the weights of the fully-conv head models that you trained stored somewhere? I would be able to test them immediately. Otherwise don't worry, I will train some models next week.
Hi Alexandre, thanks for the nice feedback! I had some time and free resources today, so I started the experiments, and I guess they will work out well this time. You can follow them if you like. I'll upload the weights as soon as they are finished.
I've added two models to the GitHub releases, "mn10_as_fc_mAP_465.pt" and "mn10_as_fc_s2221_mAP_466.pt". You should be able to run inference on them like this:
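(The original inference snippet is not preserved in this thread, so the following is only a rough sketch; the constructor name and its arguments are assumptions, and the repo's inference.py is the authoritative reference for the actual entry point and preprocessing.)

```python
import torch
from models.MobileNetV3 import get_model  # assumed entry point in this repo

# Build the fully convolutional variant; the argument name is a placeholder.
model = get_model(head_type="fully_convolutional")  # assumption
state = torch.load("mn10_as_fc_mAP_465.pt", map_location="cpu")
model.load_state_dict(state)
model.eval()

# Stand-in log-mel input (batch, 1, n_mels, time); use the repo's preprocessing in practice.
mel = torch.randn(1, 1, 128, 1000)
with torch.no_grad():
    logits = model(mel)

# Note: for "mn10_as_fc_s2221_mAP_466.pt" the modified strides must also be passed
# when constructing the model (see the pitfall mentioned below).
```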
I will add more models next week. Also, I will attach the correct config for loading a pre-trained model to its model name. This is currently a bit of a pitfall, e.g., if you forget to specify the strides argument when you try to load a model trained with modified strides.
FYI: I've taken the fully-conv models and removed the Adaptive Avg Pooling along the time dimension. Unfortunately, they don't give good results at segmenting the file precisely (see the attached tags.mp4). I guess it's probably due to the large receptive field of MobileNetV3. Do you happen to know its value by any chance?
Yes, I guess this is because of the huge receptive field. For the standard 'mn10' model the receptive field spans ~26k pixels. Even if the effective receptive field is much smaller, it still spans multiple seconds of audio. I strongly assume that this is why the model detects speech and siren almost everywhere. Have you tried the model with reduced strides? Do the detected events have a shorter span over time? What would be a desired receptive field size for you? I could imagine training a model with reduced depth/kernel sizes.
Yes, both give similar results.
Ok, it makes sense then. It would be nice to have a function to compute this effective receptive field given the architecture (see the sketch below).
Thank you for proposing; something around 0.5 s or 1 s would be great! Let me know if I can help. In the meantime, out of curiosity, I'm going to try the less ideal approach you suggested earlier: taking 1 sec chunks and repeating them 10 times (not sure what the right amount is; ideally the receptive field) before feeding them to some models with the "mlp" head.
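Regarding the helper wished for above, here is a generic sketch that computes the theoretical receptive field of a plain stack of conv layers from their kernel sizes, strides, and dilations; the example layer list is made up and not the actual mn10 configuration:

```python
def receptive_field(layers):
    """layers: iterable of (kernel_size, stride, dilation) along one axis.

    Returns the theoretical receptive field in input frames; the effective
    receptive field is usually considerably smaller.
    """
    rf, jump = 1, 1
    for kernel, stride, dilation in layers:
        rf += (kernel - 1) * dilation * jump  # extra inputs this layer sees
        jump *= stride                        # distance between adjacent outputs
    return rf

# Made-up example: five stride-2 blocks plus six stride-1 blocks, all with kernel 3
example = [(3, 2, 1)] * 5 + [(3, 1, 1)] * 6
print(receptive_field(example))  # receptive field in spectrogram frames
```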
The easiest solution I could think of is to set some kernel sizes to 1 in the function _mobilenet_v3_conf in MobileNetV3.py and retrain on AudioSet. The next simplest thing is to remove entire blocks from the config.
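A hedged illustration of those two edits, using torchvision's InvertedResidualConfig, which this repo's MobileNetV3.py appears to follow; the concrete entries below are made up, so the real _mobilenet_v3_conf values will differ:

```python
from functools import partial
from torchvision.models.mobilenetv3 import InvertedResidualConfig

# Same kind of helper that _mobilenet_v3_conf builds internally
bneck_conf = partial(InvertedResidualConfig, width_mult=1.0)

inverted_residual_setting = [
    # in_ch, kernel, exp_ch, out_ch, use_se, activation, stride, dilation
    bneck_conf(16, 3, 16, 16, False, "RE", 1, 1),
    bneck_conf(16, 1, 64, 24, False, "RE", 2, 1),  # option 1: kernel reduced from 3 to 1
    # option 2: simply delete some of the later blocks from this list
]
# Either change shrinks the receptive field but requires retraining on AudioSet.
```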
Makes sense, thank you for the suggestions! I'm also busy with other projects at the moment but will give it a try when I find time - I'll keep you posted if I get any good results.
Hey there @adbrebs @fschmid56, I'd like to comment on this issue of imprecise time-stamps.
That is to be expected for these types of models, since they are not trained to provide precise time-stamps due to their design (not necessarily due to the receptive field). However, for audio classification this 2d pooling operation is less reasonable, since the model is trained to correlate, say, low-frequency information from frame 0 with high-frequency information from the frame at 10 s.
The overall 2d pooling works as long as your training and testing durations are somewhat similar, and it might be better for achieving a "higher mAP". To this end, I'd like to advocate our previous work, pseudo strong labels (PSL), since we encountered the exact same problem. The code for this plot is:

```bash
wget https://raw.githubusercontent.com/RicherMans/PSL/main/src/models.py
wget http://storage.googleapis.com/us_audioset/youtube_corpus/v1/csv/class_labels_indices.csv
```

```python
import seaborn as sns
import matplotlib.pyplot as plt
import models
import numpy as np
import pandas as pd
import librosa
import torch

# Map AudioSet class indices to human-readable display names
maps = pd.read_csv('class_labels_indices.csv', sep=',').set_index('index')['display_name'].to_dict()

# Load the audio track of the sample attached above (16 kHz mono)
data, sr = librosa.load('./217709574-c427fb6b-110b-4947-9a22-961292f632c7.mp4', sr=16000)

# Pre-trained MobileNetV2 with decision-level (per-frame) outputs from the PSL repo
mdl = models.MobileNetV2_DM()
mdl_state = torch.hub.load_state_dict_from_url('https://zenodo.org/record/6003838/files/mobilenetv2_mAP40_53.pt?download=1')
mdl.load_state_dict(mdl_state)
mdl.eval()

with torch.no_grad():
    # y: clip-level predictions, y_time: frame-level predictions over time
    y, y_time = mdl(torch.as_tensor(data).unsqueeze(0))

y_time = y_time.squeeze(0)
# Top-3 classes per time step
idxs = y_time.topk(3).indices.numpy()
scores = y_time.topk(3).values.numpy()
# One frame-level prediction every 0.32 s
time_arr = np.arange(0, data.shape[-1] / sr, 0.32)

res = []
for i in range(len(idxs)):
    names = [maps[f] for f in idxs[i]]
    for j in range(len(names)):
        res.append({'score': scores[i][j], 'name': names[j], 'time': time_arr[i]})

r = pd.DataFrame(res)
r['name'] = r['name'].astype('category')

plt.figure(figsize=(14, 8))
sns.lineplot(data=r, x='time', y='score', hue='name')
plt.show()
```

Hope that I can help!
Hey @RicherMans, thanks for the additional input on this matter!
I do understand that it is problematic to mix low- and high-frequency information in general. Even applying the same conv. kernels to high- and low-freq. regions is not well justified in my opinion, as objects in images are position-invariant, while this might not hold for patterns along the frequency dimension. What is not so obvious to me right now is why it is especially problematic if you average that over time, and how this problem causes smeared-out probabilities. If I train models with global channel pooling and have a limited receptive field, I should still get valid time information if I don't do the pooling over time at inference time, no? I will definitely look deeper into your paper and the code as soon as I have time!
Have you tried using the KD approach from this repo together with decision-level pooling?
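For context, a generic sketch of what decision-level (temporal) pooling means here: frame-wise class probabilities are aggregated into a clip-level prediction, e.g. with mean or linear-softmax pooling (an illustration with random data, not this repo's code):

```python
import torch

# Frame-level class probabilities: (time_steps, n_classes), random stand-in data
y_time = torch.rand(31, 527)

# Decision-level pooling: aggregate per-frame decisions into one clip-level score per class.
clip_mean = y_time.mean(dim=0)                                   # mean pooling
clip_linsoft = (y_time * y_time).sum(dim=0) / y_time.sum(dim=0)  # linear-softmax pooling

print(clip_mean.shape, clip_linsoft.shape)  # both: torch.Size([527])
```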
To be honest, I also thought like that before, but during my research for PSL I initially trained models with global average pooling, as you did, and obtained very wrong results for sub-10 s resolutions (like high probabilities for, say, cat meowing, even though there is only water in an audio clip).
As far as I understand it, since you pool your features in time-frequency, the resulting embedding lives in a joint time-frequency space, not in an independent time/frequency space, which means you can't simply expect during inference that this space can be disentangled into a time/frequency space (like a spectrogram). From my point of view, your embeddings are likely to be somewhat superior to time/frequency-independent ones, since they contain more information.
I surely tried that, but not with your provided code nor the pretrained ensemble weights. Thanks again!
Okay, thanks for the input, I'll definitely have a closer look at this in the near future. In general, I would like to experiment more with audio-specific architectural components as I still find it a bit frustrating that vision architectures work so well out of the box without significant adaptation to audio. Decision-level pooling is now definitely on my list.
This is already the third request regarding this. I'll put it on top of my list for after the 20th of Feb. (current EUSIPCO deadline).
Agreed, it's a bit of a problem that many architectures from vision do not directly work.
Thanks, and good luck with that conference! Maybe we will see each other at ICASSP in June :)
I've uploaded the file fname_to_index, which contains a dict converting the file IDs to the indices in the predictions file. I tried to make it compatible with the IDs provided in the official csv files.
For sure! :-)
Hi @RicherMans, thank you for taking the time to write and share some insights! In my use case, I would need a resolution of around 0.5-1 s, so a receptive field of ~1 sec max. I think my best bet is to slightly change @fschmid56's MobileNet architecture to reduce the receptive field (and remove the global avg pooling) and retrain it with @fschmid56's Transformer KD. Unfortunately, I have a hard time downloading the data (the PaSST scripts fail). By any chance, does one of you have it stored somewhere?
Hi @adbrebs, as far as I know, we are not allowed to distribute AudioSet because of possible copyright issues (this is why AudioSet is available as a set of URLs to download it yourself). I can only tell you that we got AudioSet by using the instructions in the PANNs repo. I hope this somehow helps. Best, |
Ok thank you @fschmid56. I've managed to get it.
Hey @adbrebs, the goal of that work (SAT) is to further improve "high resolution" performance while also being capable of tracking long-range events. As a side feature of SAT, the models can track events somewhat effectively, down to a very small delay of 160 ms. I again used your provided sample from above and ran it; the resulting scores are shown in the attached figures.
Thank you for your great work and for sharing it!
Do you have any recommendations for using your models to label audio at a higher resolution, say 1 sec or lower? Or even at mel-frame level?
I've tried applying your models on short windows, but below 5 seconds the results deteriorate a lot (for 1 sec it seems to fail completely). I guess it's because the training AudioSet samples are ~10 seconds long.
I've also tried to modify the model to obtain frame-level predictions, but it seems that they all use the "mlp" head, and getting rid of the adaptive pooling would require a full retrain?
Thank you in advance!