2.2. Cleaning datasets
This part of Allie's skills relates to data cleaning.
Data cleaning is the process of removing unwanted or inconsistent content from a dataset, such as noise in audio files. Cleaned data has a higher signal-to-noise ratio, which makes the models trained on it more robust.
To clean an entire folder of a certain file type (e.g. audio files of .WAV format), you can run:
cd ~
cd allie/cleaning/audio_cleaning
python3 clean.py /Users/jimschwoebel/allie/load_dir
The code above will clean all the audio files in the folder path using the default_cleaner specified in the settings.json file (e.g. 'clean_mono16hz').
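To check which cleaner will run before processing a folder, the setting can be read directly. The following is a minimal sketch, not Allie's own code: the helper name `read_default_cleaners` is hypothetical, and the key name `default_audio_cleaners` is an assumption about how settings.json stores the default cleaner, so adjust it to match your file.

```python
import json


def read_default_cleaners(settings_path, key="default_audio_cleaners"):
    """Return the configured cleaner names as a list of strings.

    `key` is an assumed settings.json key; change it if your file differs.
    """
    with open(settings_path, "r", encoding="utf-8") as handle:
        settings = json.load(handle)
    value = settings.get(key, [])
    # Accept either a single cleaner name or a list of names.
    if isinstance(value, str):
        return [value]
    return list(value)
```

For example, `read_default_cleaners("settings.json")` would return `['clean_mono16hz']` if the file contains that entry under the assumed key.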
Note that you can extend this to any of the cleaning types. The table below summarizes how to call each cleaner. You must be in the proper folder (e.g. ./allie/cleaning/audio_cleaning for audio files, ./allie/cleaning/image_cleaning for image files, etc.) for the scripts to work properly.
Data type | Supported formats | Call to clean a folder | Current directory must be |
---|---|---|---|
audio files | .MP3 / .WAV | python3 clean.py [folderpath] | ./allie/cleaning/audio_cleaning |
text files | .TXT | python3 clean.py [folderpath] | ./allie/cleaning/text_cleaning |
image files | .PNG | python3 clean.py [folderpath] | ./allie/cleaning/image_cleaning |
video files | .MP4 | python3 clean.py [folderpath] | ./allie/cleaning/video_cleaning |
csv files | .CSV | python3 clean.py [folderpath] | ./allie/cleaning/csv_cleaning |
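Because each call only works from the matching directory, it can help to wrap the table in a small helper that sets the working directory for you. This is a hedged sketch rather than part of Allie: the names `CLEANING_DIRS`, `data_type_for`, `build_clean_call`, and `clean_folder` are hypothetical, and `allie_root` is assumed to be the path to your local clone.

```python
import subprocess
from pathlib import Path

# Mirrors the table: data type -> (supported extensions, cleaning directory).
CLEANING_DIRS = {
    "audio": ((".mp3", ".wav"), "cleaning/audio_cleaning"),
    "text": ((".txt",), "cleaning/text_cleaning"),
    "image": ((".png",), "cleaning/image_cleaning"),
    "video": ((".mp4",), "cleaning/video_cleaning"),
    "csv": ((".csv",), "cleaning/csv_cleaning"),
}


def data_type_for(filename):
    """Map a file name to its data type via the extension, or None if unsupported."""
    extension = Path(filename).suffix.lower()
    for data_type, (extensions, _) in CLEANING_DIRS.items():
        if extension in extensions:
            return data_type
    return None


def build_clean_call(data_type, folderpath, allie_root):
    """Return (command, working_directory) for the given data type."""
    _, subdir = CLEANING_DIRS[data_type]
    command = ["python3", "clean.py", str(folderpath)]
    return command, Path(allie_root) / subdir


def clean_folder(data_type, folderpath, allie_root):
    """Run clean.py from the directory the table requires."""
    command, workdir = build_clean_call(data_type, folderpath, allie_root)
    return subprocess.run(command, cwd=workdir, check=True)
```

Passing `cwd=` to `subprocess.run` replaces the manual `cd` step, so the caller's own working directory is left unchanged.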
- delete_duplicates - deletes duplicate files in the directory
- delete_json - deletes all .JSON files in the directory (this is to clean the featurizations)
- clean_getfirst3secs - gets the first 3 seconds of the audio file
- clean_keyword - keeps only keywords that are spoken based on a transcript (from the default_audio_transcriber)
- clean_mono16hz - converts all audio to mono 16000 Hz for analysis (helps prepare for many preprocessing techniques)
- clean_mp3towav - converts all mp3 files to wav files
- clean_multispeaker - deletes audio files from a dataset that have been identified as having multiple speakers from a deep learning model
- clean_normalizevolume - normalizes the volume of all audio files using peak normalization methods from ffmpeg-normalize
- clean_opus - converts an audio file to the .OPUS audio file format and then back to .WAV (a lossy conversion), which emphasizes voice signals over noise signals.
- clean_random20secsplice - takes a random splice (duration specified in the script) from the audio file.
- clean_removenoise - removes noise from the audio file using the SoX program and noise floors.
- clean_removesilence - removes silence from an audio file using voice activity detectors.
- clean_utterances - converts all audio files into unique utterances (1 .WAV file --> many .WAV file utterances) for further analysis.
- clean_summary - extracts a 100-word summary of a long piece of text and deletes the original work (using TextRank summarization)
- clean_textacy - removes punctuation and a variety of other operations to clean a text (uses Textacy)
- clean_extractfaces - extracts faces from an image
- clean_greyscale - converts all images to greyscale
- clean_jpg2png - converts images from .JPG to .PNG to standardize image formats
- clean_alignfaces - takes out faces from a video frame and keeps the video for an added label
- clean_videostabilize - stabilizes a video frame using vidgear (note this is a WIP)
- clean_csv - uses datacleaner, a standard Excel sheet cleaning script that imputes missing values and prepares CSV spreadsheets for machine learning
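To make two of the simpler cleaners in the list concrete, here is a rough sketch of what delete_duplicates and clean_getfirst3secs might do, written with only the Python standard library. It is not Allie's implementation: the function names are hypothetical, duplicates are assumed to mean byte-identical files (compared by SHA-256 hash), and the trimming function assumes uncompressed PCM .WAV input.

```python
import hashlib
import wave
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Hash a file in chunks so large media files are not loaded whole."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def delete_duplicates(directory):
    """Delete byte-identical files, keeping the first by sorted name.

    Returns the names of the files that were removed.
    """
    seen = set()
    removed = []
    for path in sorted(Path(directory).iterdir()):
        if not path.is_file():
            continue
        key = file_digest(path)
        if key in seen:
            path.unlink()
            removed.append(path.name)
        else:
            seen.add(key)
    return removed


def keep_first_seconds(src, dst, seconds=3.0):
    """Write the first `seconds` of a PCM .WAV file to `dst`."""
    with wave.open(str(src), "rb") as reader:
        params = reader.getparams()
        frames = reader.readframes(int(params.framerate * seconds))
    with wave.open(str(dst), "wb") as writer:
        # The frame count in the header is corrected when the file is closed.
        writer.setparams(params)
        writer.writeframes(frames)
```

Hashing file contents rather than comparing names catches copies that were renamed, at the cost of reading every file once.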