2.2. Cleaning datasets

This part of Allie's skills relates to data cleaning.

Data cleaning is the process of removing unwanted or inconsistent content from a dataset, such as removing noise from audio files. Cleaner data has a higher signal-to-noise ratio for modeling, which can make the resulting models more robust.

How to use cleaning scripts

To clean an entire folder of a certain file type (e.g. audio files of .WAV format), you can run:

cd ~ 
cd allie/cleaning/audio_cleaning
python3 clean.py /Users/jimschwoebel/allie/load_dir

The code above cleans all the audio files in the given folder path using the default_cleaner specified in the settings.json file (e.g. 'clean_mono16hz').

Note that you can extend this to any of the cleaning types. The table below shows how to call each cleaner. You must be in the proper folder (e.g. ./allie/cleaning/audio_cleaning for audio files, ./allie/cleaning/image_cleaning for image files, etc.) for the scripts to work properly.

Data type	Supported formats	Call to clean a folder	Current directory must be
audio files	.MP3 / .WAV	`python3 clean.py [folderpath]`	./allie/cleaning/audio_cleaning
text files	.TXT	`python3 clean.py [folderpath]`	./allie/cleaning/text_cleaning
image files	.PNG	`python3 clean.py [folderpath]`	./allie/cleaning/image_cleaning
video files	.MP4	`python3 clean.py [folderpath]`	./allie/cleaning/video_cleaning
csv files	.CSV	`python3 clean.py [folderpath]`	./allie/cleaning/csv_cleaning
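
As a rough illustration, the table can be expressed as a small lookup that builds the command and the working directory for each data type. The names `CLEANING_DIRS`, `build_clean_command`, `run_cleaner`, and the `allie_root` parameter are hypothetical and not part of Allie; only the script name and the directories come from the table above.

```python
import os
import subprocess

# Directory (relative to the Allie root) that holds clean.py for each data type,
# copied from the table above.
CLEANING_DIRS = {
    "audio": "cleaning/audio_cleaning",
    "text": "cleaning/text_cleaning",
    "image": "cleaning/image_cleaning",
    "video": "cleaning/video_cleaning",
    "csv": "cleaning/csv_cleaning",
}


def build_clean_command(data_type, folderpath, allie_root="allie"):
    """Return (command, working_directory) for cleaning one folder.

    Hypothetical helper for illustration; it is not part of Allie.
    """
    if data_type not in CLEANING_DIRS:
        raise ValueError(f"unsupported data type: {data_type}")
    workdir = os.path.join(allie_root, CLEANING_DIRS[data_type])
    command = ["python3", "clean.py", folderpath]
    return command, workdir


def run_cleaner(data_type, folderpath, allie_root="allie"):
    """Run clean.py from the directory the table requires."""
    command, workdir = build_clean_command(data_type, folderpath, allie_root)
    # cwd= plays the role of the "Current directory must be" column.
    return subprocess.run(command, cwd=workdir, check=True)
```

Passing `cwd=` to `subprocess.run` avoids changing the caller's own working directory, which is convenient when cleaning several data types in one session.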

Implemented

Implemented for all file types

delete_duplicates - deletes duplicate files in the directory
delete_json - deletes all .JSON files in the directory (this is to clean the featurizations)
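
The two cleaners above could be sketched as follows. This is a minimal illustration that assumes duplicates are files with byte-identical contents; Allie's own scripts may detect duplicates differently. The function names `file_digest`, `delete_duplicate_files`, and `delete_json_files` are hypothetical.

```python
import hashlib
import os


def file_digest(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file's contents."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def delete_duplicate_files(directory):
    """Delete files whose contents duplicate an earlier file; return removed names."""
    seen = set()
    removed = []
    # Sort so that the file kept from each duplicate group is deterministic.
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue
        digest = file_digest(path)
        if digest in seen:
            os.remove(path)
            removed.append(name)
        else:
            seen.add(digest)
    return removed


def delete_json_files(directory):
    """Delete every .json file in the directory; return removed names."""
    removed = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if os.path.isfile(path) and name.lower().endswith(".json"):
            os.remove(path)
            removed.append(name)
    return removed
```

Both functions delete files in place, so it is worth running them on a copy of the dataset first.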

Audio

clean_getfirst3secs - gets the first 3 seconds of the audio file
clean_keyword - keeps only keywords that are spoken based on a transcript (from the default_audio_transcriber)
clean_mono16hz - converts all audio to mono 16000 Hz for analysis (helps prepare for many preprocessing techniques)
clean_mp3towav - converts all mp3 files to wav files
clean_multispeaker - deletes audio files from a dataset that have been identified as having multiple speakers from a deep learning model
clean_normalizevolume - normalizes the volume of all audio files using peak normalization methods from ffmpeg-normalize
clean_opus - converts an audio file to the .OPUS audio format and then back to .WAV (a lossy conversion), which tends to favor voice signals over noise signals.
clean_random20secsplice - takes a random splice (duration specified in the script) from the audio file.
clean_removenoise - removes noise from the audio file using the SoX program and noise floors.
clean_removesilence - removes silence from an audio file using voice activity detectors.
clean_utterances - converts all audio files into unique utterances (1 .WAV file --> many .WAV file utterances) for further analysis.
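
To give a feel for what an audio cleaner does, here is a minimal sketch of the idea behind clean_getfirst3secs, using only Python's standard `wave` module. It assumes uncompressed PCM .WAV input and is not Allie's implementation; the function name `keep_first_seconds` is hypothetical.

```python
import wave


def keep_first_seconds(src_path, dst_path, seconds=3.0):
    """Write the first `seconds` of a PCM .wav file to dst_path; return frames written."""
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        # Files shorter than the requested duration are copied in full.
        n_frames = min(src.getnframes(), int(seconds * src.getframerate()))
        frames = src.readframes(n_frames)
    with wave.open(dst_path, "wb") as dst:
        # Reuse channel count, sample width, and sample rate from the source;
        # the frame count in the header is corrected when the file is closed.
        dst.setparams(params)
        dst.writeframes(frames)
    return n_frames
```

Writing to a separate `dst_path` keeps the original file intact, whereas a cleaner that runs over a whole folder would typically replace the original.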

Text

clean_summary - extracts a 100-word summary of a long piece of text and deletes the original work (using TextRank summarization)
clean_textacy - removes punctuation and performs a variety of other operations to clean a text (uses Textacy)
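
As a simplified illustration of punctuation removal, the sketch below uses only the Python standard library rather than Textacy, so it shows the general idea and not what clean_textacy actually runs. The function name `strip_punctuation` is hypothetical.

```python
import string


def strip_punctuation(text):
    """Remove ASCII punctuation and collapse runs of whitespace."""
    without_punct = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(without_punct.split())
```

Note that this only handles ASCII punctuation; curly quotes and other Unicode punctuation would pass through unchanged.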

Image

clean_extractfaces - extracts faces from an image
clean_greyscale - makes all images greyscale
clean_jpg2png - converts images from jpg to png to standardize image formats
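
For intuition, greyscale conversion can be sketched in pure Python as a weighted sum of the red, green, and blue channels. This assumes the common ITU-R BT.601 luma weights and an image already loaded as rows of (R, G, B) tuples; it is not how clean_greyscale is implemented, and the function name `to_greyscale` is hypothetical.

```python
def to_greyscale(pixels):
    """Convert rows of (R, G, B) tuples to rows of 0-255 luma values.

    Uses the ITU-R BT.601 weights (0.299, 0.587, 0.114), an assumption
    made for this sketch.
    """
    return [
        [round(0.299 * r + 0.587 * g + 0.114 * b) for r, g, b in row]
        for row in pixels
    ]
```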

Video

clean_alignfaces - extracts faces from video frames and keeps the video for an added label
clean_videostabilize - stabilizes a video using vidgear (note this is a WIP)

CSV

clean_csv - uses datacleaner, a standard spreadsheet-cleaning script that imputes missing values and prepares CSV spreadsheets for machine learning
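
To make "imputes missing values" concrete, here is a minimal standard-library sketch that fills empty cells in numeric columns with the column median. Median imputation is one common strategy and is used here as an assumption; this is not datacleaner's code, and the function name `impute_numeric_medians` is hypothetical.

```python
import csv
import io
import statistics


def _is_missing(value):
    """Treat absent or blank cells as missing."""
    return value is None or value.strip() == ""


def impute_numeric_medians(csv_text):
    """Fill missing cells in fully numeric columns with that column's median."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    if not rows:
        return csv_text
    fieldnames = reader.fieldnames
    for column in fieldnames:
        present = [row[column] for row in rows if not _is_missing(row[column])]
        try:
            values = [float(value) for value in present]
        except ValueError:
            continue  # non-numeric column: left untouched in this sketch
        if not values:
            continue
        median = statistics.median(values)
        for row in rows:
            if _is_missing(row[column]):
                row[column] = str(median)
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=fieldnames, lineterminator="\n", extrasaction="ignore"
    )
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()
```

Non-numeric columns are skipped here; a fuller cleaner would also need a strategy for categorical values.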
