Tag audio at a higher resolution #3
Hi, you are right, the audio tagging performance deteriorates a lot if you try to label very short audio snippets. I would say this is to some extent natural, as fine-grained labeling of short audio is difficult. However, MobileNet also downsamples the input strongly (x32 for our models) since this saves a lot of computation. For instance, if you try to label a one-second audio snippet, the output of the conv. part before adaptive pooling will be of shape bs x channels x 4 x 4. With such small feature map sizes, padding seems to be responsible for the decreasing performance. If you just repeat the 1-second snippet 10 times, you get reasonable predictions again (see the sketch below). I will run experiments with less down-sampling and with fully-convolutional classification heads. This should make it easier to get predictions for shorter audio snippets. If it works out, I'll add the new pre-trained models to the repository.
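For illustration, a minimal sketch of that tiling workaround, assuming a 32 kHz input sample rate (the model and preprocessing themselves would come from this repo's inference code):

```python
import torch

# Stand-in for 1 s of audio at an assumed 32 kHz sample rate
snippet = torch.randn(1, 32000)

# Tile it ~10 times so it roughly matches the ~10 s AudioSet training clips;
# the conv feature map before adaptive pooling then grows from ~4x4 to ~4x32,
# so zero-padding no longer dominates the pooled representation.
tiled = snippet.repeat(1, 10)
print(tiled.shape)  # torch.Size([1, 320000])

# logits = model(mel_spectrogram(tiled))  # then run the repo's usual inference on `tiled`
```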
Ok thank you for your answer, this makes sense.
Great, thank you!
Hi @fschmid56, have you had time to give it a try by any chance? If not, I can give it a try, but I won't be as fast as you. Edit:
To be clear, in my case it's not so much about tagging short ~1 sec files, but rather about running sound event detection on a long file at a high precision (say 1 s instead of 10 s). So I think that just retraining with fully-convolutional classification heads (without any down-sampling) would already be very helpful!
Yes, I already gave it a shot. Using the fully convolutional head and less down-sampling didn't work so well out of the box. I probably need to tweak the learning rate and other hyperparameters a bit. So far, I needed the limited available compute for other things. I started new experiments today, switching only to the fully convolutional mode first and then reducing the down-sampling slowly in the following experiments.
Let me understand that in more detail. If the network is given a 10 sec. audio file, the feature maps before adaptive average pooling will be of size t=32 and f=4. Using a fully convolutional head, you will therefore get an output of size (c=527, f=4, t=32). If you feed a longer audio sequence, 't' will scale up accordingly. Is it just about the convolutional head, or do you need 'f' and 't' in higher resolution? This would mean I have to reduce the strides (down-sampling) in the network. Currently the input spectrogram is down-sampled by a factor of 32 (5 layers with a stride of 2). Do you need that down-sampling factor reduced as well?
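As a back-of-the-envelope check of those shapes, here is a small sketch; the sample rate, hop size, and mel-bin count are assumptions chosen only to reproduce the numbers quoted in this thread (10 s -> 4x32, 1 s -> 4x4):

```python
import math

def conv_output_grid(duration_s, sr=32000, hop=320, n_mels=128, total_stride=32):
    """Rough (f, t) size of the feature map before adaptive pooling.

    sr, hop and n_mels are assumed values; total_stride=32 corresponds to the
    5 stride-2 layers mentioned above.
    """
    frames = int(duration_s * sr / hop)                      # spectrogram time frames
    return n_mels // total_stride, math.ceil(frames / total_stride)

print(conv_output_grid(10))  # (4, 32) -> fully convolutional head output (527, 4, 32)
print(conv_output_grid(1))   # (4, 4)  -> the tiny grid that makes short snippets hard
```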
Hi Florian, thank you for giving it a shot!
It is just about the convolutional head in my use case. I understand that padding might create issues at the beginning/end of a file (especially a short file), but it shouldn't be a big deal in my case since I deal with long recordings. I've read your paper in detail, it's great work! I should have more time next week and I hope I can dig deeper into your code. PS: do you still have the weights of the fully-conv head models that you trained stored somewhere? I would be able to test them immediately. Otherwise don't worry, I will train some models next week.
Hi Alexandre, thanks for the nice feedback! I had some time and free resources today, so I started the experiments, and I guess they will work out well this time. You can follow them if you like. I'll upload the weights as soon as they are finished.
I've added two models to the GitHub releases, "mn10_as_fc_mAP_465.pt" and "mn10_as_fc_s2221_mAP_466.pt". You should be able to run inference on them like this:
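(The original inference snippet is not preserved in this thread, so the following is only a rough sketch; the constructor name and its arguments are assumptions, and the repo's inference.py is the authoritative reference for the actual entry point and preprocessing.)

```python
import torch
from models.MobileNetV3 import get_model  # assumed entry point in this repo

# Build the fully convolutional variant; the argument name is a placeholder.
model = get_model(head_type="fully_convolutional")  # assumption
state = torch.load("mn10_as_fc_mAP_465.pt", map_location="cpu")
model.load_state_dict(state)
model.eval()

# Stand-in log-mel input (batch, 1, n_mels, time); use the repo's preprocessing in practice.
mel = torch.randn(1, 1, 128, 1000)
with torch.no_grad():
    logits = model(mel)

# Note: for "mn10_as_fc_s2221_mAP_466.pt" the modified strides must also be passed
# when constructing the model (see the pitfall mentioned below).
```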
I will add more models next week. Also, I will attach the correct config for loading a pre-trained model to its model name. This is currently a bit of a pitfall, e.g., if you forget to specify the strides argument when you try to load a model trained with modified strides.
FYI: I've taken the fully-conv models and removed the Adaptive Avg Pooling along the time dimension. Unfortunately, they don't give good results at segmenting the file precisely (see the attached tags.mp4). I guess it's probably due to the large receptive field of MobileNetV3. Do you happen to know its value by any chance?
Yes, I guess this is because of the huge receptive field. For the standard 'mn10' model the receptive field spans ~26k pixels. Even if the effective receptive field is much smaller, it still spans multiple seconds of audio. I strongly assume that this is why the model detects speech and siren almost everywhere. Have you tried the model with reduced strides? Do the detected events have a shorter span over time? What would be a desired receptive field size for you? I could imagine training a model with reduced depth/kernel sizes.
Yes, both give similar results.
Ok, it makes sense then. It would be nice to have a function to compute this effective receptive field given the architecture (see the sketch below).
Thank you for proposing; something around 0.5 s or 1 s would be great! Let me know if I can help. In the meantime, out of curiosity, I'm going to try the less ideal approach you suggested earlier: taking 1 sec chunks and repeating them 10 times (not sure what the right amount is; ideally the receptive field) before feeding them to some models with the "mlp" head.
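Regarding the helper wished for above, here is a generic sketch that computes the theoretical receptive field of a plain stack of conv layers from their kernel sizes, strides, and dilations; the example layer list is made up and not the actual mn10 configuration:

```python
def receptive_field(layers):
    """layers: iterable of (kernel_size, stride, dilation) along one axis.

    Returns the theoretical receptive field in input frames; the effective
    receptive field is usually considerably smaller.
    """
    rf, jump = 1, 1
    for kernel, stride, dilation in layers:
        rf += (kernel - 1) * dilation * jump  # extra inputs this layer sees
        jump *= stride                        # distance between adjacent outputs
    return rf

# Made-up example: five stride-2 blocks plus six stride-1 blocks, all with kernel 3
example = [(3, 2, 1)] * 5 + [(3, 1, 1)] * 6
print(receptive_field(example))  # receptive field in spectrogram frames
```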
The easiest solution I could think of is to set some kernel sizes to 1 in the function _mobilenet_v3_conf in MobileNetV3.py and retrain on AudioSet. The next simplest thing is to remove entire blocks from the config.
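A hedged illustration of those two edits, using torchvision's InvertedResidualConfig, which this repo's MobileNetV3.py appears to follow; the concrete entries below are made up, so the real _mobilenet_v3_conf values will differ:

```python
from functools import partial
from torchvision.models.mobilenetv3 import InvertedResidualConfig

# Same kind of helper that _mobilenet_v3_conf builds internally
bneck_conf = partial(InvertedResidualConfig, width_mult=1.0)

inverted_residual_setting = [
    # in_ch, kernel, exp_ch, out_ch, use_se, activation, stride, dilation
    bneck_conf(16, 3, 16, 16, False, "RE", 1, 1),
    bneck_conf(16, 1, 64, 24, False, "RE", 2, 1),  # option 1: kernel reduced from 3 to 1
    # option 2: simply delete some of the later blocks from this list
]
# Either change shrinks the receptive field but requires retraining on AudioSet.
```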
Makes sense, thank you for the suggestions! I'm also busy with other projects at the moment but will give it a try when I find time - I'll keep you posted if I get any good results.
Hey there @adbrebs @fschmid56, I'd like to comment on this issue of imprecise time-stamps.
That is to be expected for these types of models, since they are not trained to provide precise time-stamps due to their design (not necessarily due to the receptive field). However, for audio classification this 2d pooling operation is less reasonable, since the model is trained to correlate, say, low-frequency information from frame 0 with high-frequency information from the frame at 10 s.
The overall 2d pooling works as long as your training and testing durations are somewhat similar, and it might be better for achieving a "higher mAP". To this end, I'd like to advocate our previous work, pseudo strong labels (PSL), since we encountered the exact same problem. The code for this plot is:

```bash
wget https://raw.githubusercontent.com/RicherMans/PSL/main/src/models.py
wget http://storage.googleapis.com/us_audioset/youtube_corpus/v1/csv/class_labels_indices.csv
```

```python
import seaborn as sns
import matplotlib.pyplot as plt
import models
import numpy as np
import pandas as pd
import librosa
import torch

# Map AudioSet class indices to human-readable display names
maps = pd.read_csv('class_labels_indices.csv', sep=',').set_index('index')['display_name'].to_dict()

# Load the audio track of the sample attached above (16 kHz mono)
data, sr = librosa.load('./217709574-c427fb6b-110b-4947-9a22-961292f632c7.mp4', sr=16000)

# Pre-trained MobileNetV2 with decision-level (per-frame) outputs from the PSL repo
mdl = models.MobileNetV2_DM()
mdl_state = torch.hub.load_state_dict_from_url('https://zenodo.org/record/6003838/files/mobilenetv2_mAP40_53.pt?download=1')
mdl.load_state_dict(mdl_state)
mdl.eval()

with torch.no_grad():
    # y: clip-level predictions, y_time: frame-level predictions over time
    y, y_time = mdl(torch.as_tensor(data).unsqueeze(0))

y_time = y_time.squeeze(0)
# Top-3 classes per time step
idxs = y_time.topk(3).indices.numpy()
scores = y_time.topk(3).values.numpy()
# One frame-level prediction every 0.32 s
time_arr = np.arange(0, data.shape[-1] / sr, 0.32)

res = []
for i in range(len(idxs)):
    names = [maps[f] for f in idxs[i]]
    for j in range(len(names)):
        res.append({'score': scores[i][j], 'name': names[j], 'time': time_arr[i]})

r = pd.DataFrame(res)
r['name'] = r['name'].astype('category')

plt.figure(figsize=(14, 8))
sns.lineplot(data=r, x='time', y='score', hue='name')
plt.show()
```

Hope that I can help!
Hey @RicherMans, thanks for the additional input on this matter!
I do understand that it is problematic to mix low- and high-frequency information in general. Even applying the same conv. kernels to high- and low-freq. regions is not well justified in my opinion, as objects in images are position-invariant, while this might not hold for patterns along the frequency dimension. What is not so obvious to me right now is why it is especially problematic if you average that over time, and how this problem causes smeared-out probabilities. If I train models with global channel pooling and have a limited receptive field, I should still get valid time information if I don't do the pooling over time at inference time, no? I will definitely look deeper into your paper and the code as soon as I have time!
Have you tried using the KD approach from this repo together with decision-level pooling?
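For context, a generic sketch of what decision-level (temporal) pooling means here: frame-wise class probabilities are aggregated into a clip-level prediction, e.g. with mean or linear-softmax pooling (an illustration with random data, not this repo's code):

```python
import torch

# Frame-level class probabilities: (time_steps, n_classes), random stand-in data
y_time = torch.rand(31, 527)

# Decision-level pooling: aggregate per-frame decisions into one clip-level score per class.
clip_mean = y_time.mean(dim=0)                                   # mean pooling
clip_linsoft = (y_time * y_time).sum(dim=0) / y_time.sum(dim=0)  # linear-softmax pooling

print(clip_mean.shape, clip_linsoft.shape)  # both: torch.Size([527])
```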
To be honest, I also thought like that before, but during my research for PSL I initially trained models with global average pooling, as you did, and obtained very wrong results for sub-10 s resolutions (like high probabilities for, say, cat meowing, even though there is only water in an audio clip).
As far as I understand it, since you pool your features in time-frequency, the resulting embedding lives in a joint time-frequency space, not in an independent time/frequency space, which means you can't simply expect during inference that this space can be disentangled into a time/frequency space (like a spectrogram). From my point of view, your embeddings are likely to be somewhat superior to time/frequency-independent ones, since they contain more information.
I surely tried that, but not with your provided code nor the pretrained ensemble weights. Thanks again!
Okay, thanks for the input, I'll definitely have a closer look at this in the near future. In general, I would like to experiment more with audio-specific architectural components as I still find it a bit frustrating that vision architectures work so well out of the box without significant adaptation to audio. Decision-level pooling is now definitely on my list.
This is already the third request regarding this. I'll put it on top of my list for after the 20th of Feb. (current EUSIPCO deadline).
Agreed, it's a bit of a problem that many architectures from vision do not directly work.
Thanks, and good luck with that conference! Maybe we will see each other at ICASSP in June :)
I've uploaded the file fname_to_index, which contains a dict converting the file IDs to the indices in the predictions file. I tried to make it compatible with the IDs provided in the official csv files.
For sure! :-)
Hi @RicherMans, thank you for taking the time to write and share some insights! In my use case, I would need a resolution of around 0.5-1 s, so a receptive field of ~1 sec max. I think my best bet is to slightly change @fschmid56's MobileNet architecture to reduce the receptive field (and remove the global avg pooling) and retrain it with @fschmid56's Transformer KD. Unfortunately, I have a hard time downloading the data (the PaSST scripts fail). By any chance, does one of you have it stored somewhere?
Hi @adbrebs, as far as I know, we are not allowed to distribute AudioSet because of possible copyright issues (this is why AudioSet is available as a set of URLs to download it yourself). I can only tell you that we got AudioSet by using the instructions in the PANNs repo. I hope this somehow helps. Best, |
Ok thank you @fschmid56. I've managed to get it.
Hey @adbrebs, the goal of that work (SAT) is to further improve "high resolution" performance while also being capable of tracking long-range events. As a side feature of SAT, the models can track events somewhat effectively, down to a very small delay of 160 ms. I again used your provided sample from above and ran it; the resulting scores are shown in the attached figures.
Thank you for your great work and for sharing it!
Do you have any recommendations for using your models to label audio at a higher resolution, say 1 sec or lower? Or even at mel-frame level?
I've tried applying your models on short windows, but below 5 seconds the results deteriorate a lot (for 1 sec it seems to fail completely). I guess it's because the training AudioSet samples are ~10 seconds long.
I've also tried to modify the model to obtain frame-level predictions, but it seems that they all use the "mlp" head, and getting rid of the adaptive pooling would require a full retrain?
Thank you in advance!