Streaming Audio Transformers for Online Audio Tagging

Source for the Interspeech 2024 paper Streaming Audio Transformers for Online Audio Tagging.

Highlights:

  • Transformers capable of online audio tagging. The model processes at most 2 s of audio at a time and incorporates past predictions. Best used when deployed on stationary devices (cameras, speakers, electronic household items).
  • Unlike most research, SAT is aimed at deploying an audio tagger directly, not at using it as a feature extractor for some other downstream task.
  • Performance: 45.1 mAP with the best model at a 2 s delay, and 43.3 mAP with the ViT-Tiny-sized SAT-T.
  • The memory and computational footprint are manageable. Our SAT-T can be easily deployed on mobile devices, with 20 MB of parameters and 9 MB of (float32) peak RAM. For 1 s inference, only 4.3 MB of RAM are required.
  • Partially solves common problems with transformer-based audio taggers, such as "my transformer's performance is very bad on clips shorter than 10 s" and "pad the input to 10 s and pay the computational overhead price". SAT helps with problems like 1, 2, 3, 4 and 5.
  • SAT can track long-term events effectively, whereas standard audio taggers generally have a high score-variance between successive chunks.
| Delay | Model  | Streamable? | #Tokens | PeakMem | GFlops | mAP  |
|-------|--------|-------------|---------|---------|--------|------|
| 2 s   | ViT-T  |             | 48      | 7.6 M   | 0.5    | 39.1 |
| 2 s   | ViT-S  |             | 48      | 15 M    | 2.1    | 40.9 |
| 2 s   | ViT-B  |             | 48      | 30 M    | 8.2    | 41.5 |
| 2 s   | AST    |             | 1212    | 2.2 G   | 202    | 39.7 |
| 2 s   | BEATs  |             | 96      | 83 M    | 17.8   | 38.7 |
| 2 s   | HTS-AT |             | 1024    | 171 M   | 42     | 5.2  |
| 2 s   | SAT-T  | Y           | 48/48   | 9 M     | 0.5    | 43.3 |
| 2 s   | SAT-S  | Y           |         | 18 M    | 2.1    | 43.4 |
| 2 s   | SAT-B  | Y           |         | 36 M    | 8.2    | 45.1 |
| 1 s   | ViT-T  |             | 24      | 3.8 M   | 0.3    | 33.0 |
| 1 s   | ViT-S  |             | 24      | 7.5 M   | 1.1    | 34.9 |
| 1 s   | ViT-B  |             | 24      | 14 M    | 4.1    | 34.2 |
| 1 s   | AST    |             | 1212    | 2.2 G   | 202    | 36.6 |
| 1 s   | BEATs  |             | 48      | 83 M    | 17.8   | 35.2 |
| 1 s   | HTS-AT |             | 128     | 171 M   | 42     | 2.4  |
| 1 s   | SAT-T  | Y           | 24/24   | 4.3 M   | 0.3    | 40.1 |
| 1 s   | SAT-S  | Y           |         | 9 M     | 1.1    | 40.2 |
| 1 s   | SAT-B  | Y           |         | 16 M    | 4.1    | 41.4 |
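
To make the streaming idea from the highlights concrete, here is a minimal sketch of chunked online tagging. It assumes a hypothetical model whose forward pass takes a waveform chunk plus a cache of past tokens and returns clip-level scores and an updated cache; the actual model interface in this repository may differ.

```python
import torch
import torchaudio

SAMPLE_RATE = 16000
CHUNK_SECONDS = 2.0  # SAT processes at most 2 s at a time

def stream_tag(model, wav_path, chunk_seconds=CHUNK_SECONDS, topk=3):
    # Load and resample to mono at the assumed model sample rate.
    wave, sr = torchaudio.load(wav_path)
    wave = torchaudio.functional.resample(wave, sr, SAMPLE_RATE).mean(dim=0)
    chunk_len = int(chunk_seconds * SAMPLE_RATE)
    cache = None  # past-token cache carried across chunks (hypothetical)
    for start in range(0, wave.numel(), chunk_len):
        chunk = wave[start:start + chunk_len].unsqueeze(0)
        with torch.no_grad():
            scores, cache = model(chunk, cache)  # hypothetical signature
        values, indices = scores.squeeze(0).topk(topk)
        t0, t1 = start / SAMPLE_RATE, (start + chunk.shape[-1]) / SAMPLE_RATE
        print(f"[{t0:.1f}s]-[{t1:.1f}s] top-{topk} class ids: {indices.tolist()}")
```

The key point is that only a fixed-size cache of past tokens is carried between chunks (48/48 and 24/24 in the table above), so per-chunk compute and memory stay constant regardless of how long the stream runs.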

Preparation

git clone https://github.com/RicherMans/SAT
cd SAT
pip3 install -r requirements.txt

Inference

We provide a simple script to run inference for all proposed models (ViT-T/S/B and SAT-T/S/B). The checkpoints are hosted on Zenodo.

Running inference for the two water samples seen in the paper:

python3 inference.py samples/jkLRith2wcc.wav # Default is the SAT-T 2s model

Outputs:

#===== samples/jkLRith2wcc.wav =====
#[0.0s]-[2.0s] Topk-1 Stream                         0.9582
#[0.0s]-[2.0s] Topk-2 Trickle, dribble               0.1661
#[0.0s]-[2.0s] Topk-3 Boat, Water vehicle            0.1254
#
#[2.0s]-[4.0s] Topk-1 Stream                         0.8668
#[2.0s]-[4.0s] Topk-2 Ocean                          0.3082
#[2.0s]-[4.0s] Topk-3 Waves, surf                    0.2931
#
#[4.0s]-[6.0s] Topk-1 Stream                         0.8928
#[4.0s]-[6.0s] Topk-2 Trickle, dribble               0.1404
#[4.0s]-[6.0s] Topk-3 Boat, Water vehicle            0.1306
#
#[6.0s]-[8.0s] Topk-1 Stream                         0.8503
#[6.0s]-[8.0s] Topk-2 Trickle, dribble               0.1666
#[6.0s]-[8.0s] Topk-3 Raindrop                       0.1620
#
#[8.0s]-[10.0s] Topk-1 Stream                         0.8594
#[8.0s]-[10.0s] Topk-2 Raindrop                       0.4584
#[8.0s]-[10.0s] Topk-3 Rain                           0.1884
python3 inference.py samples/mg4kDY_hy6o.wav # Default is the SAT-T 2s model

Outputs:

#===== samples/mg4kDY_hy6o.wav =====
#[0.0s]-[2.0s] Topk-1 Stream                         0.9555
#[0.0s]-[2.0s] Topk-2 Trickle, dribble               0.5271
#[0.0s]-[2.0s] Topk-3 Rain                           0.3057
#
#[2.0s]-[4.0s] Topk-1 Stream                         0.8074
#[2.0s]-[4.0s] Topk-2 Trickle, dribble               0.6095
#[2.0s]-[4.0s] Topk-3 Water                          0.3593
#
#[4.0s]-[6.0s] Topk-1 Stream                         0.8058
#[4.0s]-[6.0s] Topk-2 Water                          0.3823
#[4.0s]-[6.0s] Topk-3 Waterfall                      0.3219
#
#[6.0s]-[8.0s] Topk-1 Stream                         0.8496
#[6.0s]-[8.0s] Topk-2 Water                          0.4074
#[6.0s]-[8.0s] Topk-3 Trickle, dribble               0.3548
#
#[8.0s]-[10.0s] Topk-1 Stream                         0.8025
#[8.0s]-[10.0s] Topk-2 Water                          0.4172
#[8.0s]-[10.0s] Topk-3 Trickle, dribble               0.3403

As we can see, SAT provides high-confidence scores over a prolonged timeframe.

As a counter-example, if one uses our ViT-B, we get:

python3 inference.py samples/mg4kDY_hy6o.wav -m audiotransformer_base 

# Prints:
#===== samples/mg4kDY_hy6o.wav =====
#[0.0s]-[2.0s] Topk-1 Trickle, dribble               0.6101
#[0.0s]-[2.0s] Topk-2 Stream                         0.4968
#[0.0s]-[2.0s] Topk-3 Water                          0.2289
#
#[2.0s]-[4.0s] Topk-1 Stream                         0.3074
#[2.0s]-[4.0s] Topk-2 Water                          0.2828
#[2.0s]-[4.0s] Topk-3 Trickle, dribble               0.2814
#
#[4.0s]-[6.0s] Topk-1 Stream                         0.7029
#[4.0s]-[6.0s] Topk-2 Trickle, dribble               0.2179
#[4.0s]-[6.0s] Topk-3 Waterfall                      0.1804
#
#[6.0s]-[8.0s] Topk-1 Stream                         0.5569
#[6.0s]-[8.0s] Topk-2 Waterfall                      0.1705
#[6.0s]-[8.0s] Topk-3 Trickle, dribble               0.1406
#
#[8.0s]-[10.0s] Topk-1 Stream                         0.2891
#[8.0s]-[10.0s] Topk-2 Waterfall                      0.1813
#[8.0s]-[10.0s] Topk-3 Water                          0.0930
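
To quantify the stability difference between the two printouts, one can compute the spread of the top-class ("Stream") score across chunks. This snippet simply reuses the numbers printed above and is not part of the repository:

```python
import statistics

# Per-chunk "Stream" scores copied from the two printouts above (2 s chunks).
sat_t_stream = [0.9555, 0.8074, 0.8058, 0.8496, 0.8025]  # SAT-T (default model)
vit_b_stream = [0.4968, 0.3074, 0.7029, 0.5569, 0.2891]  # audiotransformer_base

for name, scores in (("SAT-T", sat_t_stream), ("ViT-B", vit_b_stream)):
    print(f"{name}: mean={statistics.mean(scores):.3f}, "
          f"stdev={statistics.stdev(scores):.3f}")
```

For these five chunks, the SAT-T scores have a standard deviation of roughly 0.07, versus roughly 0.17 for ViT-B.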

One can switch between models via -m (audiotransformer_tiny, SAT_T_1s, audiotransformer_small, SAT_S_2s, SAT_S_1s, audiotransformer_base, SAT_B_2s, SAT_B_1s):

python3 inference.py -m SAT_T_1s -c 1.0 samples/jkLRith2wcc.wav


# Prints:
#===== samples/jkLRith2wcc.wav =====
#[0.0s]-[1.0s] Topk-1 Silence                        0.2903
#[0.0s]-[1.0s] Topk-2 Vehicle                        0.2587
#[0.0s]-[1.0s] Topk-3 Boat, Water vehicle            0.0793

#[1.0s]-[2.0s] Topk-1 Stream                         0.8000
#[1.0s]-[2.0s] Topk-2 Raindrop                       0.1655
#[1.0s]-[2.0s] Topk-3 Rain                           0.1522

#[2.0s]-[3.0s] Topk-1 Stream                         0.8642
#[2.0s]-[3.0s] Topk-2 Raindrop                       0.1193
#[2.0s]-[3.0s] Topk-3 Trickle, dribble               0.1065

#[3.0s]-[4.0s] Topk-1 Stream                         0.7319
#[3.0s]-[4.0s] Topk-2 Raindrop                       0.1958
#[3.0s]-[4.0s] Topk-3 Rain                           0.1204

Very short-delay inference

The chunk size (delay) can be controlled via -c duration. For example, in the extreme case where our ViT-B model (47.40 mAP) processes only a single patch (160 ms) at a time for the above sample, we get:

python3 inference.py samples/mg4kDY_hy6o.wav -m audiotransformer_base -c 0.16

#===== samples/mg4kDY_hy6o.wav ===== 
#[0.0s]-[0.16s] Topk-1 Music                          0.2091                  
#[0.0s]-[0.16s] Topk-2 Whoosh, swoosh, swish          0.1027
#[0.0s]-[0.16s] Topk-3 Silence                        0.0909 
#                                                                                                
#[0.16s]-[0.32s] Topk-1 Static                         0.1030
#[0.16s]-[0.32s] Topk-2 Silence                        0.1015
#[0.16s]-[0.32s] Topk-3 Single-lens reflex camera      0.0815
#
#[0.32s]-[0.48s] Topk-1 Static                         0.0839
#[0.32s]-[0.48s] Topk-2 Clatter                        0.0661
#[0.32s]-[0.48s] Topk-3 Cacophony                      0.0509
#
#[0.48s]-[0.64s] Topk-1 Electric toothbrush            0.0995
#[0.48s]-[0.64s] Topk-2 Clatter                        0.0834
#[0.48s]-[0.64s] Topk-3 Static                         0.0701
#
#[0.64s]-[0.8s] Topk-1 Electric toothbrush            0.1715
#[0.64s]-[0.8s] Topk-2 Static                         0.0629
#[0.64s]-[0.8s] Topk-3 Idling                         0.0511
#
#[0.8s]-[0.96s] Topk-1 Static                         0.1471
#[0.8s]-[0.96s] Topk-2 Cacophony                      0.1242
#[0.8s]-[0.96s] Topk-3 Electric toothbrush            0.0929

The above shows that our best model (like all other state-of-the-art models) is unable to provide reasonable results at such short delays and cannot differentiate between water and white noise.

However, if we use SAT_T_1s:

python3 inference.py samples/mg4kDY_hy6o.wav -m SAT_T_1s -c 0.16

#===== samples/mg4kDY_hy6o.wav =====
#[0.0s]-[0.16s] Topk-1 Whoosh, swoosh, swish          0.6725
#[0.0s]-[0.16s] Topk-2 Sound effect                   0.1116
#[0.0s]-[0.16s] Topk-3 Music                          0.1057
#
#[0.16s]-[0.32s] Topk-1 Silence                        0.3615
#[0.16s]-[0.32s] Topk-2 Whoosh, swoosh, swish          0.1162
#[0.16s]-[0.32s] Topk-3 Static                         0.0956
#
#[0.32s]-[0.48s] Topk-1 Rain on surface                0.5616
#[0.32s]-[0.48s] Topk-2 Rain                           0.3506
#[0.32s]-[0.48s] Topk-3 Raindrop                       0.3257
#
#[0.48s]-[0.64s] Topk-1 Rain on surface                0.4327
#[0.48s]-[0.64s] Topk-2 Rain                           0.2457
#[0.48s]-[0.64s] Topk-3 Silence                        0.2116
#
#[0.64s]-[0.8s] Topk-1 Rain on surface                0.6230
#[0.64s]-[0.8s] Topk-2 Rain                           0.4475
#[0.64s]-[0.8s] Topk-3 Raindrop                       0.3959
#
#[0.8s]-[0.96s] Topk-1 Rain on surface                0.3707
#[0.8s]-[0.96s] Topk-2 Rain                           0.2532
#[0.8s]-[0.96s] Topk-3 Raindrop                       0.2089

While the results are not perfect, SAT is clearly far superior for short-delay inference.

Dataset acquisition

We provide simple preprocessing scripts in datasets/ for AudioSet.

To download the (balanced) AudioSet data you need GNU parallel and yt-dlp. Downloading the balanced AudioSet subset is quick (~30 minutes):

cd datasets/audioset/
./1_download_audioset.sh
# After downloading the dataset, dump the .wav files to .h5
./2_prepare_data.sh

Additionally, the 1_download_audioset.sh script downloads the PSL labels used in this work for the balanced and full training subsets.
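
For reference, the .wav-to-.h5 dump performed by 2_prepare_data.sh can be approximated with a few lines of h5py; the output file name and per-file dataset layout below are assumptions for illustration, not necessarily the exact format the training scripts expect.

```python
import glob
import h5py
import soundfile as sf

# Sketch: pack downloaded .wav files into one HDF5 file, keyed by file name.
# The exact layout produced by 2_prepare_data.sh may differ.
with h5py.File("balanced_train.h5", "w") as store:
    for wav_path in sorted(glob.glob("balanced/*.wav")):
        audio, sr = sf.read(wav_path, dtype="int16")
        key = wav_path.split("/")[-1]
        store.create_dataset(key, data=audio, compression="gzip")
```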

Training

MAE

After preparing the data, pretrain using MAE:

python3 1_run_mae.py config/mae/mae_tiny.yaml # Or the other configs
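
For context, MAE pretraining masks out a large fraction of the input patches and trains the model to reconstruct them. Below is a minimal, illustrative sketch of the random patch-masking step (not the code used by 1_run_mae.py):

```python
import torch

def random_patch_mask(patches: torch.Tensor, mask_ratio: float = 0.75):
    """Keep a random subset of patches, as in MAE-style pretraining.

    patches: (batch, num_patches, dim). Returns the kept patches and a boolean
    mask (True = masked). Illustrative sketch only.
    """
    b, n, d = patches.shape
    num_keep = int(n * (1 - mask_ratio))
    noise = torch.rand(b, n)                       # random score per patch
    keep_idx = noise.argsort(dim=1)[:, :num_keep]  # lowest-noise patches are kept
    kept = torch.gather(patches, 1, keep_idx.unsqueeze(-1).expand(-1, -1, d))
    mask = torch.ones(b, n, dtype=torch.bool)
    mask.scatter_(1, keep_idx, False)              # False = visible, True = masked
    return kept, mask
```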

SAT

Training a SAT is also simple:

python3 2_train_sat.py config/sat/balanced_sat_2_2s.yaml

After training the model, you can use it for inference:

python3 inference.py -m $PATH_TO_YOUR_CHECKPOINT samples/jkLRith2wcc.wav
