voiceAI

The third capstone project of the Artificial Intelligence Nanodegree, focusing on spectrograms, voice user interfaces, recurrent neural networks, and speech recognition!

Overview

In this notebook, we build a deep neural network that functions as part of an end-to-end automatic speech recognition (ASR) pipeline:

  • STEP 1: PRE-PROCESSING: Converts raw audio to one of two feature representations that are commonly used for ASR.
  • STEP 2: ACOUSTIC MODEL: Accepts the transformed audio features as input and returns a probability distribution over all potential transcriptions, from which it picks its best guess (we try a variety of models!).
  • STEP 3: PREDICTION: Takes the output from the acoustic model and returns a predicted transcription for validation.
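As a rough illustration of Step 3, the sketch below turns per-time-step character probabilities into text with greedy (best-path) decoding. It assumes a CTC-style output in which index 0 is a blank symbol, which this README does not state. The alphabet and the `greedy_decode` helper are hypothetical and are not taken from the notebook.

```python
# Hypothetical alphabet: index 0 is an assumed CTC-style "blank" symbol.
ALPHABET = ["<blank>", " ", "a", "b", "c"]


def greedy_decode(probs, alphabet=ALPHABET, blank=0):
    """Pick the most likely symbol per time step, merge repeats, drop blanks."""
    best = [max(range(len(step)), key=step.__getitem__) for step in probs]
    chars = []
    prev = None
    for idx in best:
        # Emit a character only when it differs from the previous step
        # and is not the blank symbol.
        if idx != prev and idx != blank:
            chars.append(alphabet[idx])
        prev = idx
    return "".join(chars)


# Toy acoustic-model output: 6 time steps x 5 symbols (made-up numbers).
toy_probs = [
    [0.1, 0.0, 0.1, 0.0, 0.8],  # c
    [0.1, 0.0, 0.1, 0.0, 0.8],  # c again (merged with the previous step)
    [0.9, 0.0, 0.1, 0.0, 0.0],  # blank
    [0.1, 0.0, 0.8, 0.1, 0.0],  # a
    [0.1, 0.0, 0.1, 0.8, 0.0],  # b
    [0.9, 0.0, 0.0, 0.1, 0.0],  # blank
]
print(greedy_decode(toy_probs))  # cab
```

Greedy decoding is the simplest option; it ignores any language model and only looks at the single best symbol at each step.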

Models

We will use the LibriSpeech dataset to train and evaluate our models, namely:

Model 0: RNN

Model 1: RNN + TimeDistributed Dense

Model 2: CNN + RNN + TimeDistributed Dense

Model 3: Deeper RNN + TimeDistributed Dense

Model 4: Bidirectional RNN + TimeDistributed Dense

Model 5: Deep Bidirectional RNN + TimeDistributed

Model 6: Deep Bidirectional RNN + TimeDistributed with Dropout

Model 7: CNN + RNN + TimeDistributed with Dropout

Never ran. Thanks Amazon. xD

Setup

Run on Amazon Elastic Compute Cloud (EC2), using the Deep Learning AMI with CUDA support on a p2.xlarge GPU instance.

First, prep the instance with TensorFlow (https://www.tensorflow.org/) and friends, plus the audio processing library libav:

sudo python3 -m pip install tensorflow-gpu==1.1 udacity-pa tqdm
sudo apt-get install libav-tools
sudo python3 -m pip install python_speech_features librosa soundfile
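After installing, a quick check along these lines can confirm that the Python packages are visible to the interpreter. This is only a sketch: the `missing_packages` helper is hypothetical, and the module names are the usual import names for the packages installed above.

```python
import importlib.util

# Import names for the packages installed above
# (tensorflow-gpu is imported as "tensorflow").
REQUIRED = ["tensorflow", "python_speech_features", "librosa", "soundfile", "tqdm"]


def missing_packages(names):
    """Return the module names that cannot be found, without importing them."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages(REQUIRED)
print("Missing:", ", ".join(missing) if missing else "none")
```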

Obtain the appropriate subsets of the LibriSpeech dataset, and convert all flac files to wav format.

wget http://www.openslr.org/resources/12/dev-clean.tar.gz
tar -xzvf dev-clean.tar.gz
wget http://www.openslr.org/resources/12/test-clean.tar.gz
tar -xzvf test-clean.tar.gz
mv flac_to_wav.sh LibriSpeech
cd LibriSpeech
./flac_to_wav.sh
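For readers who prefer Python, the conversion step might be sketched as below. This is not the contents of flac_to_wav.sh; it assumes the libav `avconv` binary is on the PATH and that a plain `-i input output` call is sufficient. The helper names are hypothetical, and the script only prints the commands it would run.

```python
import os


def wav_path_for(flac_path):
    """Map 'some/dir/file.flac' to 'some/dir/file.wav'."""
    root, _ext = os.path.splitext(flac_path)
    return root + ".wav"


def conversion_commands(root_dir, converter="avconv"):
    """Build one converter command per .flac file found under root_dir."""
    commands = []
    for dirpath, _dirs, files in os.walk(root_dir):
        for name in sorted(files):
            if name.lower().endswith(".flac"):
                src = os.path.join(dirpath, name)
                # -y overwrites an existing output file without prompting.
                commands.append([converter, "-y", "-i", src, wav_path_for(src)])
    return commands


# Dry run: print the commands instead of executing them.
for cmd in conversion_commands("LibriSpeech"):
    print(" ".join(cmd))
```

To actually convert the files, each command list could be passed to `subprocess.run(cmd, check=True)`.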

Create JSON files corresponding to the train and validation datasets.

cd ..
python create_desc_json.py LibriSpeech/dev-clean/ train_corpus.json
python create_desc_json.py LibriSpeech/test-clean/ valid_corpus.json
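The generated files can then be read back in Python. The sketch below assumes each line of the corpus file is a standalone JSON object with `key`, `duration`, and `text` fields, as in the ba-dls-deepspeech script. That layout is not stated in this README, so check your own output first. The `load_corpus` helper, the sample records, and the 10-second cutoff are made up for illustration.

```python
import io
import json


def load_corpus(lines, max_duration=10.0):
    """Parse one JSON object per line, keeping clips up to max_duration seconds."""
    entries = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        if float(record["duration"]) <= max_duration:
            entries.append(record)
    return entries


# Stand-in for an open corpus file, with invented records.
sample = io.StringIO(
    '{"key": "LibriSpeech/dev-clean/example-0001.wav", "duration": 3.2, "text": "hello world"}\n'
    '{"key": "LibriSpeech/dev-clean/example-0002.wav", "duration": 14.7, "text": "a much longer utterance"}\n'
)
entries = load_corpus(sample)
print(len(entries), entries[0]["text"])  # 1 hello world
```

With a real file, pass the open handle instead: `with open("train_corpus.json") as f: entries = load_corpus(f)`.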

(Optional) Setup local environment

conda create --name voiceAI
source activate voiceAI
pip install -r requirements.txt
pip install tensorflow-gpu==1.1.0

Start Jupyter, and connect via your IPv4 address:

jupyter notebook --ip=0.0.0.0 --no-browser

Results

| Model | Description | Lowest Validation Loss |
| --- | --- | --- |
| Model 0 | RNN | 752.6974 |
| Model 1 | RNN + TimeDistributed | 137.8584 |
| Model 2 | CNN + RNN + TimeDistributed | 80.1717 |
| Model 3 | Deep RNN + TimeDistributed | 97.7769 |
| Model 4 | Bidirectional RNN + TimeDistributed | 98.5222 |
| Model 5 | Deep Bidirectional RNN + TimeDistributed | N/A |
| Model 6 | Deep Bidirectional RNN + TimeDistributed with Dropout | N/A |
| Model 7 | CNN + RNN + TimeDistributed with Dropout | N/A |
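To compare the completed runs at a glance, the reported losses can be ranked with a few lines of Python. The numbers are copied from the results above; models without a recorded loss are left out.

```python
# Lowest validation loss per model, as reported above (N/A rows omitted).
results = {
    "Model 0": 752.6974,
    "Model 1": 137.8584,
    "Model 2": 80.1717,
    "Model 3": 97.7769,
    "Model 4": 98.5222,
}

# Rank models from lowest to highest validation loss.
ranking = sorted(results, key=results.get)
best = ranking[0]
print(best, results[best])  # Model 2 80.1717
```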

Thanks

Udacity borrowed the create_desc_json.py and flac_to_wav.sh files from the ba-dls-deepspeech repository, along with some functions used to generate spectrograms.