
Audio Dataset

Overview

Deep Audio Dataset (DAD) is a library that helps load audio-related data for deep learning experiments. DAD currently supports sequence-to-sequence (audio file pairs), regression-target, and multilabel classification experiments. Traditional train-test splitting as well as k-fold cross validation on metadata are also supported.

Requirements

  • Python 3
  • TensorFlow 2.2+
  • NumPy

General Notes

Directory Structure

DAD expects a specific directory structure to be available for all types of experiments:

root
 |---- in
 |      |- <wav file 1>
 |      |- <wav file 2>
 |      ...
 |      |- <wav file n>
 |
 |---- <index file>
 |---- <metadata file>

Note that the index file, metadata file, and the individual wav files can have any name. However, the input directory must be named in.

Index File

The index file is how DAD interacts with your data. Index files are essentially headerless CSV files with the following structure:

<input file name>,<target 1>,...,<target n>

Each type of experiment has a specific requirement for what the targets in an index file should look like. However, every experiment type currently expects the input file name to be the first column.

Metadata File

The metadata file is a JSON file that contains the associated metadata entries for each of the input files. The top level structure is an object with a key for each of the input files. The values for those keys are another object with metadata fields and their values. Here is an example metadata file:

{
    "audio0.wav": {
        "artist": "Artist 0",
        "genre": "GenreA"
    },
    "audio1.wav": {
        "artist": "Artist 1",
        "genre": "GenreA"
    }
}

There is no restriction on what kind of metadata fields are used or what the values are.

Data Consistency

The base implementation of DAD reviews the input audio files and ensures they are consistent across multiple axes, including sample rate, bit rate, number of channels, and length. If inconsistent data is found, a ValueError is raised and the user is notified.
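
As an illustration, a minimal check along these lines could be written with Python's built-in wave module (this is a sketch, not DAD's actual implementation; check_consistency is a hypothetical helper):

import wave
from pathlib import Path

def check_consistency(in_dir: str) -> None:
    """Raise ValueError if the wav files in in_dir disagree on format."""
    reference = None
    for path in sorted(Path(in_dir).glob("*.wav")):
        with wave.open(str(path), "rb") as wav:
            # (sample rate, sample width in bytes, channels, frame count)
            params = (wav.getframerate(), wav.getsampwidth(),
                      wav.getnchannels(), wav.getnframes())
        if reference is None:
            reference = params
        elif params != reference:
            raise ValueError(f"{path.name} is inconsistent: {params} != {reference}")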

Process

DAD currently stores examples in monolithic tfrecord files (protocol buffer files) and streams data from them. By monolithic, we mean that the entire underlying dataset is stored inside a single tfrecord file. Once this file is generated it will not need to be generated again. Note that generating the tfrecord file can take a significant amount of time.

Currently, by default, the tfrecord file is generated inside the dataset object's constructor. This can be bypassed; however, the file will then be generated on the next call to train_test_split or kfold_on_metadata.

A specific naming scheme is used: {index_file}.tfrecord. So, given an index file called index.txt, the tfrecord file would be called index.txt.tfrecord. If that file already exists, DAD assumes it is valid and will use it.

Warning: this can cause issues if you have a stale tfrecord file in the directory. DAD currently has no way of detecting this type of issue.

With TensorFlow, DAD uses a StaticHashTable structure to filter records for splits and k-folds. DAD generates a list of record indices (which align with the row numbers in the index file) that should be included in a particular set (train or test). These indices are loaded into a table, which is then used to filter the records as they are read from storage.
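
The idea can be sketched in a few lines of TensorFlow (illustrative only; the row numbers and the record stream here are stand-ins):

import tensorflow as tf

train_rows = [0, 2, 3]  # hypothetical row numbers from the index file

table = tf.lookup.StaticHashTable(
    tf.lookup.KeyValueTensorInitializer(
        keys=tf.constant(train_rows, dtype=tf.int64),
        values=tf.ones(len(train_rows), dtype=tf.int64),
    ),
    default_value=0,  # rows missing from the table are filtered out
)

records = tf.data.Dataset.range(5)  # stand-in for the tfrecord stream
train = (
    records.enumerate()                                 # attach row numbers
    .filter(lambda i, x: tf.equal(table.lookup(i), 1))  # keep listed rows
    .map(lambda i, x: x)                                # drop the row numbers
)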

Example Usage

Sequence-to-Sequence

Uses the AudioDataset class. Index file format:

<input wav file 1>,<target wav file 1>
<input wav file 2>,<target wav file 2>
...
<input wav file n>,<target wav file n>

Example index file:

input_file_0.wav,output_file_A.wav
input_file_1.wav,output_file_B.wav
...

Note: The target wav files must be located in a directory called out, which sits alongside the in directory.
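
For sequence-to-sequence experiments the directory structure therefore gains a second audio directory:

root
 |---- in
 |      |- <input wav files>
 |
 |---- out
 |      |- <target wav files>
 |
 |---- <index file>
 |---- <metadata file>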

Regression

Uses the RegressionAudioDataset class. Index file format:

<input wav file 1>,<target value 1.1>,<target value 1.2>,...,<target value 1.m>
<input wav file 2>,<target value 2.1>,<target value 2.2>,...,<target value 2.m>
...
<input wav file n>,<target value n.1>,<target value n.2>,...,<target value n.m>

Example index file:

input_file_0.wav,0.123,0.456
input_file_1.wav,0.789,0.012
...

Note: There can be any number of target values per example, but the number of values must be consistent across all examples.
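
Construction works the same way as for the other dataset classes; this sketch assumes RegressionAudioDataset shares the constructor signature shown for AudioDataset in the Train-Test Split section below:

from deep_audio_dataset import RegressionAudioDataset

# Assumes the same constructor arguments as AudioDataset.
dataset = RegressionAudioDataset("data_directory", "index.txt")
train, test = dataset.train_test_split(test_size=0.2)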

Multilabel Classification

Uses the MultiLabelClassificationAudioDataset class. Index file format:

<input wav file 1>,<label bit mask 1>
<input wav file 2>,<label bit mask 2>
...
<input wav file n>,<label bit mask n>

Example index file:

input_file_0.wav,001001
input_file_1.wav,001000
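
One way to read such a bit mask, assuming the i-th character flags the i-th label (the label names here are hypothetical):

labels = ["rock", "jazz", "pop", "blues", "folk", "metal"]  # hypothetical label order

mask = "001001"
multi_hot = [int(bit) for bit in mask]                         # [0, 0, 1, 0, 0, 1]
active = [name for name, bit in zip(labels, multi_hot) if bit]
print(active)                                                  # ['pop', 'metal']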

Train-Test Split

Train-test splits are generated by calling the appropriate method on the dataset object:

from deep_audio_dataset import AudioDataset

base_directory = "data_directory"
index_file_name = "index.txt"

dataset = AudioDataset(
    base_directory,
    index_file_name,
)

train, test = dataset.train_test_split(test_size=0.2)

The returned objects are zipped datasets that can be passed directly to models for fitting and validation:

model.fit(
    train.batch(10),
    batch_size=10,
    epochs=10,
    validation_data=test.batch(10)
)

K-Fold on Metadata

Performing a k-fold cross validation on metadata is fairly straightforward. First, instantiate the dataset object, making sure to include the name of the metadata file:

from deep_audio_dataset import AudioDataset

base_directory = "data_directory"
index_file_name = "index.txt"
metadata_file_name = "metadata.txt"

dataset = AudioDataset(
    base_directory,
    index_file_name,
    metadata_file=metadata_file_name
)

You can then call the kfold_on_metadata method and pass it the name of the metadata field you want to generate folds on. The method returns an iterator that can be consumed in any of the usual ways:

for train, test, metavalue in dataset.kfold_on_metadata("artist"):
    ...

The triplets returned include a train set, a test set, and a metadata value. The metadata value is the holdout value for the current iteration: none of the examples in the train set have the holdout value in their metadata, while all of the examples in the test set do. Be warned, no attempt is made to balance the folds, so some iterations can have very unbalanced train/test sets.
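
A typical fold loop might look like this (build_model is a hypothetical factory that returns a fresh, compiled model for each fold):

results = {}
for train, test, artist in dataset.kfold_on_metadata("artist"):
    model = build_model()  # fresh model so folds do not leak into each other
    model.fit(train.batch(10), epochs=10)
    results[artist] = model.evaluate(test.batch(10))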

Process of Extension

DAD is meant to be extended with new types of experiments/datasets for TensorFlow. Currently, the easiest extensions to support are datasets that take audio data as input. To support a new kind of dataset you must extend the BaseAudioDataset abstract class. This will require implementing the following abstract methods:

  • LoadOutputFeature(output_index: List[str]) -> tf.train.Feature: This method takes in a list of target information and returns a TensorFlow Feature that represents a single target example. The strings are the target values from the index file row you are generating a feature for, already split on commas (i.e. target_string.split(",")).
  • OutputFeatureType() -> Any: This method must return the feature type that LoadOutputFeature will return. This is used to populate a schema. An example of a feature type is tf.io.FixedLenFeature.

Optionally, you can override the analyze_index_outputs method as well. This method is passed a list of lists of strings, where each inner list contains a row's target values split on commas. Overriding it can help in deciding what OutputFeatureType should return and provides a chance to validate the target data for consistency.

All three of the currently implemented dataset classes follow this pattern and they make decent examples.
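
For instance, a minimal subclass for a single float target per example might look like the following sketch (the class name and target format are hypothetical, and the import path for BaseAudioDataset is assumed; the method signatures follow the list above):

from typing import Any, List

import tensorflow as tf

from deep_audio_dataset import BaseAudioDataset  # assumed import path


class ScalarTargetAudioDataset(BaseAudioDataset):
    """Hypothetical dataset whose target is a single float per example."""

    def LoadOutputFeature(self, output_index: List[str]) -> tf.train.Feature:
        # output_index holds the comma-split target columns for one row,
        # e.g. ["0.123"] for the index line "audio0.wav,0.123".
        return tf.train.Feature(
            float_list=tf.train.FloatList(value=[float(output_index[0])])
        )

    def OutputFeatureType(self) -> Any:
        # Schema entry matching the feature produced above.
        return tf.io.FixedLenFeature([1], tf.float32)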

DevOps

The repository on GitHub currently has a test suite and a linter action that run on each push to a PR as well as each push to main. The test suite is managed with the pytest library. For linting, DAD currently uses ruff. ruff is still in early development as of this writing, but it was found to be far more performant than flake8. All of the settings for ruff are managed in the pyproject.toml file. No autoformatters are currently in use on this project.

A wheel building action is executed on every push to main (this also triggers on PR merges). The built wheel is published as an artifact and available for download from the successful run.

Dependencies are currently managed through a requirements.txt file. This should be moved to the pyproject.toml file at some point in the future.
