This repository is an RNN implementation using TensorFlow to classify audio clips of different lengths. The input of the neural network is not the raw sound but its MFCC features (20 features).
As shown in the following figure, each audio file is first transformed into MFCC features and then divided into sub-samples of 2 seconds. The result of the preprocessing is a list of fixed-length sequences with 20 features each (here, the file produces 3 sequences).
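The splitting step can be sketched as follows. This is a minimal NumPy illustration, not the repository's code: the function name `split_into_sequences`, the random stand-in for the MFCC matrix, and the value of 100 frames per 2-second sub-sample are all assumptions (the real frame count depends on the sample rate and hop length used for the MFCC extraction).

```python
import numpy as np

def split_into_sequences(mfcc, frames_per_seq):
    """Split a (n_frames, n_features) MFCC matrix into consecutive chunks.

    The last chunk may be shorter than frames_per_seq.
    """
    return [mfcc[start:start + frames_per_seq]
            for start in range(0, len(mfcc), frames_per_seq)]

# Hypothetical stand-in for the MFCC matrix of one file: 250 frames, 20 features.
mfcc = np.random.default_rng(0).normal(size=(250, 20))

# Assumption: 100 frames correspond to one 2-second sub-sample.
sequences = split_into_sequences(mfcc, frames_per_seq=100)
print([s.shape for s in sequences])  # [(100, 20), (100, 20), (50, 20)]
```

With these numbers the file yields 3 sequences, the last of which is shorter than the others.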
If necessary, the sequences are padded with zeros so that the input of the neural network has a fixed size. The network is still able to retrieve the effective length of each sequence and skip the zeros, which makes it more efficient.
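One common way to do this is to mark a frame as "used" when at least one of its features is non-zero and to count the used frames per sequence. The sketch below shows the idea in NumPy; the helper names `pad_sequences` and `effective_lengths` are hypothetical, and the repository's TensorFlow code may do this differently.

```python
import numpy as np

def pad_sequences(sequences, max_len):
    """Stack variable-length (n_frames, n_features) arrays into one zero-padded batch."""
    n_features = sequences[0].shape[1]
    batch = np.zeros((len(sequences), max_len, n_features), dtype=np.float32)
    for i, seq in enumerate(sequences):
        batch[i, :len(seq)] = seq
    return batch

def effective_lengths(batch):
    """Count the frames per sequence that have at least one non-zero feature."""
    used = np.abs(batch).max(axis=2) > 0
    return used.sum(axis=1)

# Hypothetical data: three sequences of 20 features, the last one shorter.
rng = np.random.default_rng(0)
sequences = [rng.normal(size=(n, 20)) for n in (100, 100, 50)]

batch = pad_sequences(sequences, max_len=100)
lengths = effective_lengths(batch)
print(batch.shape, lengths.tolist())  # (3, 100, 20) [100, 100, 50]
```

Note that a real frame whose 20 features are all exactly zero would be counted as padding. In TensorFlow 1.x, lengths like these can be passed as the `sequence_length` argument of `tf.nn.dynamic_rnn` so that the RNN stops at the end of each sequence; whether this repository does exactly that is an assumption.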
Since one file can be composed of several sequences, the predictions for the sequences belonging to one file are averaged so that one label is given per file.
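As a rough illustration of this averaging, the sketch below takes per-sequence class probabilities, averages them per file, and picks the most likely class. Averaging probabilities (rather than logits or votes) is an assumption, and the function name `average_per_file`, the file names, and the numbers are made up for the example.

```python
import numpy as np

def average_per_file(probs, file_ids):
    """Average per-sequence class probabilities over each file and pick one label."""
    file_ids = np.asarray(file_ids)
    labels = {}
    for fid in np.unique(file_ids):
        mean_probs = probs[file_ids == fid].mean(axis=0)
        labels[str(fid)] = int(mean_probs.argmax())
    return labels

# Hypothetical predictions: 4 sequences, 2 classes, coming from 2 files.
probs = np.array([[0.6, 0.4],
                  [0.2, 0.8],
                  [0.3, 0.7],
                  [0.9, 0.1]])
file_ids = ["a.wav", "a.wav", "a.wav", "b.wav"]

labels = average_per_file(probs, file_ids)
print(labels)  # {'a.wav': 1, 'b.wav': 0}
```

Here the first sequence of `a.wav` favours class 0 on its own, but the average over the three sequences gives class 1 for the file.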
I used this network to classify sounds for my first Kaggle competition, but I still need to dig into the data to improve the results.
- This repository and this notebook helped me understand MFCC feature extraction.
- This post explains how to take the variable length of the sequences into account.