2.2. Cleaning datasets

Jim Schwoebel edited this page Aug 3, 2020 · 10 revisions

This part of Allie's skills relates to data cleaning.

Data cleaning is the process of removing unwanted artifacts from a dataset, such as noise in audio files. Cleaner data has a higher signal-to-noise ratio, which can make the resulting models more robust.

How to use cleaning scripts

To clean an entire folder of a certain file type (e.g. audio files of .WAV format), you can run:

cd ~ 
cd allie/cleaning/audio_cleaning
python3 clean.py /Users/jimschwoebel/allie/load_dir

The code above will clean all the audio files in the folder path using the default_cleaner specified in the settings.json file (e.g. 'clean_mono16hz').
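
As a rough illustration of what cleaning a folder involves, the sketch below walks a directory and applies one cleaner function to every file with a matching extension. The names clean_folder, cleaner, and extensions are hypothetical and chosen for this example; this is not Allie's actual implementation.

```python
import os


def clean_folder(folderpath, cleaner, extensions=(".wav",)):
    """Apply `cleaner` to every file in `folderpath` with a matching extension.

    `cleaner` is any callable that takes a file path (a hypothetical stand-in
    for a cleaning routine such as clean_mono16hz). Returns the names of the
    files that were passed to the cleaner.
    """
    cleaned = []
    for filename in sorted(os.listdir(folderpath)):
        path = os.path.join(folderpath, filename)
        # Match extensions case-insensitively so .WAV and .wav both qualify.
        if os.path.isfile(path) and filename.lower().endswith(tuple(extensions)):
            cleaner(path)
            cleaned.append(filename)
    return cleaned
```

For example, clean_folder(load_dir, print) would simply print the path of each .wav file without modifying anything, which is a cheap way to check which files a real cleaner would touch.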

Note that you can extend this to any of the cleaning types. The table below shows how to call each cleaner. You must be in the proper folder (e.g. ./allie/cleaning/audio_cleaning for audio files, ./allie/cleaning/image_cleaning for image files, etc.) for the scripts to work properly.

Data type | Supported formats | Call to clean a folder | Current directory must be
--- | --- | --- | ---
audio files | .MP3 / .WAV | python3 clean.py [folderpath] | ./allie/cleaning/audio_cleaning
text files | .TXT | python3 clean.py [folderpath] | ./allie/cleaning/text_cleaning
image files | .PNG | python3 clean.py [folderpath] | ./allie/cleaning/image_cleaning
video files | .MP4 | python3 clean.py [folderpath] | ./allie/cleaning/video_cleaning
csv files | .CSV | python3 clean.py [folderpath] | ./allie/cleaning/csv_cleaning
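
Because each script must be run from its own directory, it can help to build the command and working directory together. The sketch below mirrors the table above; build_clean_command, CLEANING_DIRS, and the allie_root default are hypothetical names for this example, not part of Allie.

```python
import os

# Sub-directory of ./allie/cleaning for each data type, as listed in the table.
CLEANING_DIRS = {
    "audio": "audio_cleaning",
    "text": "text_cleaning",
    "image": "image_cleaning",
    "video": "video_cleaning",
    "csv": "csv_cleaning",
}


def build_clean_command(data_type, folderpath, allie_root="allie"):
    """Return (command, working_directory) for cleaning `folderpath`.

    `allie_root` is assumed to point at the local Allie checkout.
    """
    if data_type not in CLEANING_DIRS:
        raise ValueError(f"unsupported data type: {data_type!r}")
    workdir = os.path.join(allie_root, "cleaning", CLEANING_DIRS[data_type])
    return ["python3", "clean.py", folderpath], workdir
```

The returned pair can be passed to the standard library as subprocess.run(command, cwd=workdir, check=True), which runs the script from the required directory without changing the caller's own working directory.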

Implemented

Implemented for all file types

  • delete_duplicates - deletes duplicate files in the directory
  • delete_json - deletes all .JSON files in the directory (this is to clean the featurizations)

Implemented for audio files

  • clean_getfirst3secs - gets the first 3 seconds of the audio file
  • clean_keyword - keeps only keywords that are spoken based on a transcript (from the default_audio_transcriber)
  • clean_mono16hz - converts all audio to mono 16000 Hz for analysis (helps prepare for many preprocessing techniques)
  • clean_mp3towav - converts all mp3 files to wav files
  • clean_multispeaker - deletes audio files from a dataset that have been identified as having multiple speakers from a deep learning model
  • clean_normalizevolume - normalizes the volume of all audio files using peak normalization methods from ffmpeg-normalize
  • clean_opus - converts an audio file to the .OPUS audio file format and then back to .WAV (a lossy conversion), which focuses more on voice signals than on noise signals.
  • clean_random20secsplice - takes a random splice (duration specified in the script) from the audio file.
  • clean_removenoise - removes noise from the audio file using the SoX program and noise floors.
  • clean_removesilence - removes silence from an audio file using voice activity detectors.
  • clean_utterances - converts all audio files into unique utterances (1 .WAV file --> many .WAV file utterances) for further analysis.

Implemented for CSV files

  • clean_csv - uses datacleaner, a standard spreadsheet cleaning script that imputes missing values and prepares CSV spreadsheets for machine learning.
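
To make one of the cleaners above concrete, here is a minimal sketch of duplicate removal in the spirit of delete_duplicates. It assumes that "duplicate" means byte-identical content and compares SHA-256 digests; Allie's own script may use a different comparison, and file_digest is a helper introduced only for this example.

```python
import hashlib
import os


def file_digest(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    hasher = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def delete_duplicates(folderpath):
    """Delete files whose content matches an earlier file in the folder.

    Files are visited in sorted name order, so the first name alphabetically
    is the copy that is kept. Returns the names of the deleted files.
    """
    seen = {}
    removed = []
    for filename in sorted(os.listdir(folderpath)):
        path = os.path.join(folderpath, filename)
        if not os.path.isfile(path):
            continue
        digest = file_digest(path)
        if digest in seen:
            os.remove(path)
            removed.append(filename)
        else:
            seen[digest] = filename
    return removed
```

Hashing content rather than comparing file names catches copies that were renamed, at the cost of reading every file once. Since the function deletes files, it is worth trying it on a copy of the dataset first.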