Skip to content

0.4. Data modeling

Jim Schwoebel edited this page Aug 17, 2018 · 31 revisions

This section overviews all the scripts in the Chapter 4: Data modeling folder.

Definitions

Term Definition
features descriptive numerical representations to describe an object.
machine learning the process of teaching a machine something that is useful.
classification model If the goal is to separate out into classes (e.g. male or female), then this is known as a classification problem.
regression model if the end goal is to measure some correlation with a variable and the output is more a numerical range (e.g. often between 0 and 1), then this is more of a regression problem.
deep learning models models that are trained using a neural network.
unsupervised learning if machines do not require labels (e.g. just need features), this is known as a unsupervised learning problem.
supervised learning if machines require labels (e.g. male or female as separate feature arrays), this is known as a supervised learning problem.
training set Machines are fed training data in the form of feature arrays and compress patterns in these feature arrays into models through algorithms.
testing set data that is left out during training so that the accuracy can be calculated using cross-validation techniques.
validation set data that is left out during training to tune hyperparameters (often used in deep learning modeling.
label a tag of an featurized audio sample (e.g. male or female) to aid in supervised learning.
cross-validation how the performance of ML models are assessed (in terms of accuracy).

Obtaining training data

make_playlist.py (from CLI)

cd ~
cd voicebook/chapter_4_modeling/youtube_scrape
python3 make_playlist.py 
what is the name of this playlist?
what is the playlist id or URL?
… [‘n’ to stop making playlist]

download_playlist.py (from CLI)

python3 download_playlist.py 
what is the name of the playlist to download?
… downloads playlist to /playlist folder 

Labeling training data

label_samples.py (from CLI)

labeling YouTube data in spreadsheets

label_samples.py (from CLI)

python3 label_samples.py
what is the master label (e.g. stressed)? 
stressed
sample number: 0
what is the URL of the video? 
https://www.youtube.com/watch?v=47HLiAxHgdo
how long is the audio sample in seconds? (e.g. 20) 
20
what are the stop and start times of the video (e.g. 0:13-0:33)
0:05-0:25            
is this person stressed? 1 for yes, 0 for no
1
is this person a child (c, <13) or adolescent (d, 13-18) or adult  (a, >18 <70) or elderly (e, >70)? 
a
is this person male (m) or female (f)? 
m
does this person have an American (a) or foreign (f) accent? 
a
what is the audio quality? (1 - poor, 2 - moderate, 3 - good quality, 4 - high quality)3
is the environment indoors (i) or outdoors (o)?i
sample number: 1
what is the URL of the video?
...After entering [‘’] here, it ends the script and outputs excel sheet below.

downloading YouTube data from spreadsheets

y_scrape.py (from CLI)

Run script in terminal...
python3 y_scrape.py
Get file name to parse 
what is the file name? 
Stressed_1.xlsx

All the files are then downloaded (Pafy module) and converted to .wav format with FFmpeg ...

Classification models

building optimized classification models

train_audioclassify.py (from CLI)

cd ~
cd voicebook/chapter_4_modeling
python3 train_audioclassify.py 
# insert number of classes and class names 
how many classes are you training?2
what is the folder name for class 1?schizophrenia
what is the folder name for class 2?controls
# now all the classes will featurize 
SCHIZOPHRENIA - featurizing snipped38_start_2_end_22.wav
making 0.wav
[-4.51487917e+02  1.32250653e+02 -6.48964827e+02 -2.16927909e+02...
  9.57062705e-04  4.54699943e-02 -5.85259705e-02  5.74577384e-02]
...
Decision tree accuracy (+/-) 0.20779263167344933
0.5733333333333334
Gaussian NB accuracy (+/-) 0.1305543735171076
0.7866666666666667
SKlearn classifier accuracy (+/-) 0.039999999999999994
0.48
Adaboost classifier accuracy (+/-) 0.22666666666666668
0.6366666666666667
Gradient boosting accuracy (+/-) 0.1319090595827292
0.6599999999999999
Logistic regression accuracy (+/-) 0.07557189365836424
0.7366666666666667
Hard voting accuracy (+/-) 0.2341889076033373
0.6766666666666666
K Nearest Neighbors accuracy (+/-) 0.12666666666666668
0.5633333333333332
Random forest accuracy (+/-) 0.2758824226207808
0.7333333333333333
svm accuracy (+/-) 0.13556466271775172
0.7533333333333333

most accurate classifier is Gaussian NB with audio features (mfcc coefficients).
saving classifier to disk. 

Summarizing session…
GaussianNB(priors=None)

['gaussian-nb', 0.7866666666666667, 0.1305543735171076]

loading classification models

load_audioclassify.py (from CLI)

python3 load_audioclassify.py

This results in an output:

{"filename": "348.wav", "filetype": "audio file", "class": ["controls"], "model": ["schizophrenia_controls_sc_audio.pickle"], "model accuracies": [0.7866666666666667], "model deviations": [0.1305543735171076], "model types": ["gaussian-nb"], "features": [[-322.9664360980726, 59.53868288968913, -462.5294083924505, -166.3993076206564, 131.38738649438437, 52.44671783868567, -33.74398658437562, 227.8102207133376, 9.52738149362727, 28.505927165579884, -90.65927286414657, 71.52976680142815, 9.73530102063688, 25.62432182324615, -66.02663398503707, 73.87513246074612, -1.596002360610912, 22.81632350096357, -87.30807566263049, 41.72876898633217, 0.8865486997595385, 17.735652130525168, -65.99456073539176, 52.43567091641821, -14.286216477070838, 14.128449781073533, -59.836804831757654, 18.175026917411316, -9.131276510645463, 13.701302570519355, -57.44541029310883, 25.74622598177111, -4.545971824836885, 10.899138142787697, -42.116927063121395, 29.536967420470695, -3.4558647963609186, 10.31513522815575, -36.17230935229129, 26.551369428146693, -3.6667095757279236, 10.079488079876286, -33.78123311320836, 26.14112294381864, 5.366060779304841, 8.570956061981061, -19.248854886451802, 38.20513572569962, -5.458667628428172, 7.490745204714798, -31.338790159786562, 12.539046082339311, 0.024288590342538358, 10.584946850085212, -34.52340818393254, 38.15078289969128, -0.156898762979172, 11.158828455811786, -34.10403400345244, 30.973152153233336, 0.020648845552068328, 5.827064754672902, -22.052042500906857, 16.81872640844321, -0.06170338085832314, 5.229174923928, -14.518978383592026, 14.845857302315114, -0.04962607796690964, 4.5211806494022735, -14.998074177634704, 12.378100326632655, 0.07415513595268168, 3.724070455888158, -9.939566189661432, 10.85577098792062, -0.017072005372266726, 2.7463908847692204, -6.600475000502117, 6.524786791283427, -0.02310274039018664, 2.7092557498939636, -7.467322311111723, 7.481090337383571, 0.04464197716713606, 2.198722832501255, -6.88438775831641, 7.844106037059699, 0.045382707259550105, 2.0580935158253872, -6.638462605186588, 5.991186816663746, -0.013702557713332408, 1.9496130791163644, -6.458246324901151, 5.7716202748695, -0.007340250450717803, 1.6409103586116958, -5.380714141939734, 5.539025057788075, 0.011411587050311969, 1.3949062816882583, -4.390308824019425, 4.13132941219398, -291.2947346432915, 49.04737058565422, -381.6816501283554, -222.9638855557117, 158.3460978309033, 23.15415034729552, 99.62697329203677, 189.70121020164896, 7.287058326977949, 30.77474443760493, -38.71222832828984, 56.208286170618955, -1.0950341842073796, 21.3498811006992, -34.41685805740065, 31.926254848624147, -9.172025861857653, 11.511454213511039, -29.874153138705573, 8.203596981294625, 2.6663941698626865, 6.753684660513026, -5.9061357505887955, 19.305474480034082, -14.088225581455214, 17.47630600064678, -49.8886801840349, 8.935818425975743, -13.521963272886959, 8.25999525518404, -24.851695100203774, -0.11752456737790722, -12.762992506945213, 8.598616338770906, -29.72115313687536, 0.05275012294025435, -4.531403069755177, 11.8713757531457, -24.376936764599744, 12.207624665298002, -2.6914750628989266, 14.673164819510685, -22.308447521294887, 17.767626038347583, 11.80700932417913, 10.516802160193405, -13.092759032892214, 24.963056992755536, -10.390953114902164, 6.1887066403103965, -20.39253124562046, 2.7941268719402848, -4.41480192601625, 7.0550461587501, -15.045545852884578, 5.8468320221431656, 0.22555437964894862, 4.881477566532211, -4.990490269946867, 8.519079155558249, 3.4745028409138827, 2.7045163211953187, 0.5391937155699558, 8.988399905874912, 0.45051536549204274, 4.824805683998831, -4.424922867740668, 9.67554223394205, -0.8502687288362012, 3.1351941328536777, -4.844124443962841, 4.754766492721427, 0.870140923131266, 1.1137966493666094, -1.4131441258277446, 2.418345086057676, 2.4254793474500635, 1.2058772715931956, 0.5825294849801214, 4.536777131050609, 0.10251353649615984, 1.51146113365032, -1.4592806547585204, 3.291502702928505, -1.075428938064348, 1.0559521971759946, -2.4408814841825865, 1.12308565480587, -0.3002420005778045, 2.4751693616737347, -3.6333810904861688, 3.34737386167248, 0.17805269515377548, 3.7250267108754236, -5.189309157660288, 5.579262003298437, 0.24091712079378458, 2.451817967640338, -5.215650064568107, 2.3865116769275567, 0.003640041240486553, 1.4235044885102617, -2.379919268715038, 1.5581599658532437]], "count": 0, "errorcount": 0}

Regression models

training regression models

train_audioregression.py (from CLI)

cd ~
cd voicebook/chapter_4_modeling/
python3 train_audioregression.py
what is the name of the file in /data directory you would like to analyze? 
africanamerican_controls.json
RESULTS: 
+-------------------------------------------+-----------+----------------------+
|                model type                 | R^2 score | Mean Absolute Errors |
+-------------------------------------------+-----------+----------------------+
|             linear regression             |  -1.672   |        0.656         |
+-------------------------------------------+-----------+----------------------+
|             ridge regression              |   0.047   |        0.367         |
+-------------------------------------------+-----------+----------------------+
|                   LASSO                   |   0.426   |        0.273         |
+-------------------------------------------+-----------+----------------------+
|                elastic net                |   0.483   |        0.255         |
+-------------------------------------------+-----------+----------------------+
|       Least angle regression (LARS)       |   0.065   |        0.478         |
+-------------------------------------------+-----------+----------------------+
|                LARS lasso                 |  -0.025   |        0.502         |
+-------------------------------------------+-----------+----------------------+
|     orthogonal matching pursuit (OMP)     |  -0.032   |         0.39         |
+-------------------------------------------+-----------+----------------------+
|            logistic regression            |  -0.019   |        0.253         |
+-------------------------------------------+-----------+----------------------+
|     stochastic gradient descent (SGD)     |  -0.153   |         0.41         |
+-------------------------------------------+-----------+----------------------+
|                perceptron                 |  -7.297   |        1.131         |
+-------------------------------------------+-----------+----------------------+
|        passive-agressive algorithm        |   0.316   |        0.329         |
+-------------------------------------------+-----------+----------------------+
|                  RANSAC                   |   0.316   |        0.329         |
+-------------------------------------------+-----------+----------------------+
|                 Theil-Sen                 |  -1.672   |        0.674         |
+-------------------------------------------+-----------+----------------------+
|             huber regression              |  -0.582   |         0.49         |
+-------------------------------------------+-----------+----------------------+
|      polynomial (linear regression)       |  -0.582   |         0.49         |
+-------------------------------------------+-----------+----------------------+
logistic regression has the lowest mean absolute error (0.25252525252525254)
saving file to disk (africanamerican_controls_regression.pickle)...

loading regression models

load_audioregression.py

python3 load_audioregression.py 
1.0
controls

Deep learning models

keras_mlp.py

from keras.models import Sequential
from keras.layers import Dense, Activation

model = Sequential()
model.add(Dense(32, activation='relu', input_dim=100))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['accuracy'])

# Generate dummy data
import numpy as np
data = np.random.random((1000, 100))
labels = np.random.randint(2, size=(1000, 1))

# Train the model, iterating on the data in batches of 32 samples
model.fit(data, labels, epochs=10, batch_size=32)

from CLI

train_audiokeras.py

cd ~
cd voicebook/chapter_4_modeling 
python3 train_audiokeras.py
folder name 1 
africanamerican     
folder name 2 
controls
...
[[1.]]
Epoch 1/20
149/149 [==============================] - 0s 2ms/step - loss: 3.8728 - acc: 0.3423
Epoch 2/20
149/149 [==============================] - 0s 29us/step - loss: 0.3178 - acc: 0.3624
Epoch 3/20
149/149 [==============================] - 0s 26us/step - loss: -0.0068 - acc: 0.4228

...
 final acc: 50.34% 
...
 Saved africanamerican_controls_dl_audio.json model to disk
summarizing data...
testing loaded model

'Loaded model from disk'

[[1.]]

AutoML approaches

using TPOT for classification

train_audioTPOT.py

cd ~
cd voicebook/chapter_4_modeling/
python3 train_audioTPOT.py
classification (c) or regression (r) problem?
c
what is the name of class 1? 
africanamerican
what is the name of class 2? 
controls
Generation 1 - Current best internal CV score: 0.9056433904259992               
Generation 2 - Current best internal CV score: 0.9100878348704435               
Generation 3 - Current best internal CV score: 0.9100878348704435               
Generation 4 - Current best internal CV score: 0.9100878348704435               
Generation 5 - Current best internal CV score: 0.9191787439613526               
                                                                                
Best pipeline: LogisticRegression(LogisticRegression(MinMaxScaler(StandardScaler(input_matrix)), C=1.0, dual=False, penalty=l1), C=5.0, dual=True, penalty=l2)
saving classifier to disk

Loading TPOT classification models: load_audioTPOT.py

Jims-MBP:~ jimschwoebel$ cd voicebook/chapter_4_modeling
Jims-MBP:chapter_4_modeling jimschwoebel$ python3 load_audiotpot.py
making 0.wav
making 1.wav
making 2.wav
...
making 36.wav
making 37.wav
making 38.wav
controls

Using TPOT for regression

cd ~
cd voicebook/chapter_4_modeling/
python3 train_audioTPOT.py

classification (c) or regression (r) problem?
r
what is the name of class 1? 
africanamerican
what is the name of class 2? 
Controls
Generation 1 - Current best internal CV score: -0.06707070707070706             
Generation 2 - Current best internal CV score: -0.06707070707070706             
Generation 3 - Current best internal CV score: -0.06707070707070706             
Generation 4 - Current best internal CV score: -0.062207740346188735            
Generation 5 - Current best internal CV score: -0.062207740346188735            
                                                                                
Best pipeline: KNeighborsRegressor(input_matrix, n_neighbors=4, p=1, weights=distance)
saving classifier to disk

Loading TPOT regression models: load_audioTPOT.py

Jims-MBP:~ jimschwoebel$ cd voicebook/chapter_4_modeling
Jims-MBP:chapter_4_modeling jimschwoebel$ python3 load_audiotpot.py
making 0.wav
making 1.wav
making 2.wav
...
making 36.wav
making 37.wav
making 38.wav
controls
controls

Resources

Datasets

Data labeling

Featurization

Classification models

Regression models

Deep learning

AutoML