---
output:
  html_document: default
  pdf_document: default
---
# Introduction {#intro}
In the era of large-scale data collection, we aim to make meaningful interpretations of data.
There are two broad ways to interpret data meaningfully:

1. Mechanistic or mathematical modelling based
2. Descriptive or data-driven

Here we discuss the latter approach, using machine learning (ML) methods.
## What is machine learning?
We use computers, or more precisely algorithms, to detect patterns and learn concepts from data without being explicitly programmed.
For example:

1. Google ranking web pages
2. Facebook or Gmail classifying spam
3. Biological research projects, such as using ML approaches to interpret the effects of mutations in non-coding regions

We are given a set of

1. Predictors,
2. Features, or
3. Inputs

that we call 'explanatory variables', and we ask different statistical methods, such as

1. Linear regression
2. Logistic regression
3. Neural networks

to formulate a hypothesis, i.e. to

1. Describe associations
2. Search for patterns
3. Make predictions

for the outcome variables.
A bit of background: ML grew out of AI and neural networks.
## Aspects of ML
There are two main branches of ML:
1. Unsupervised learning
2. Supervised learning
**Unsupervised learning**: we ask an algorithm to find patterns or structure in the data without any specific outcome variables, e.g. clustering. We have little or no idea what the results should look like.
**Supervised learning**: we provide both input and outcome variables and ask the algorithm to formulate a hypothesis that closely captures the relationship between them.
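As a minimal sketch of the difference (an illustrative example using base R functions on the built-in iris data, not part of the course pipeline), clustering the flower measurements with `kmeans` is unsupervised, while fitting a linear model to predict one measurement from the others is supervised:
```{r unsupervised vs supervised sketch}
data(iris)
set.seed(1)

## unsupervised: find 3 clusters in the measurements, ignoring the Species labels
cluster.fit <- kmeans(iris[, 1:4], centers = 3)
table(cluster.fit$cluster, iris$Species) #compare the clusters with the known species

## supervised: predict Petal.Width from the other measurements using known outcomes
lm.fit <- lm(Petal.Width ~ Sepal.Length + Sepal.Width + Petal.Length, data = iris)
summary(lm.fit)
```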
## What actually happens under the hood
The algorithm is fitted on a subset of observations called the training data and evaluated on a different subset called the test data.
The error between the predicted outcome and the actual data is evaluated as the test error. The aim is to tune the parameters of the hypothesis so that this error is small.
Models that successfully capture these desired outcomes are further evaluated for **bias** and **variance** (underfitting and overfitting, respectively).
All the above concepts will be discussed in detail in the following lectures.
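As a rough sketch of this workflow (an illustrative example using base R and the built-in mtcars data, chosen only for demonstration):
```{r train test error sketch}
set.seed(1)
data(mtcars)

## randomly assign roughly 70% of the rows to training and the rest to test
train.idx <- sample(nrow(mtcars), size = round(0.7 * nrow(mtcars)))
train.data <- mtcars[train.idx, ]
test.data <- mtcars[-train.idx, ]

## fit the hypothesis (here a linear model for fuel efficiency) on the training data
fit <- lm(mpg ~ wt + hp, data = train.data)

## evaluate the test error as the root-mean-square error on the held-out data
predictions <- predict(fit, newdata = test.data)
sqrt(mean((predictions - test.data$mpg)^2))
```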
## Introduction to CARET
The **caret** package (short for **C**lassification **A**nd **RE**gression **T**raining) contains functions to streamline the model training process for classification and regression tasks.
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
```{r loading packages}
library(caret)
```
### Preprocessing with the Iris dataset
From the iris manual page:
The famous (Fisher’s or Anderson’s) Iris data set, first presented by Fisher in 1936 (http://archive.ics.uci.edu/ml/datasets/Iris), gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica. One class is linearly separable from the other two; the latter are not linearly separable from each other. The data base contains the following attributes: 1). sepal length in cm 2). sepal width in cm 3). petal length in cm 4). petal width in cm 5). classes: - Iris Setosa - Iris Versicolour - Iris Virginica
```{r load iris,warning=FALSE}
library(datasets)
data(iris) ##loads the dataset, which can be accessed under the variable name iris
?iris ##opens the documentation for the dataset
summary(iris) ##summarises each variable in the dataset
str(iris) ##presents the structure of the iris dataframe
```
First, we split into training and test datasets, using the proportions 70% training and 30% test. The function createDataPartition ensures that the proportion of each class is the same in training and test.
```{r split into training and test}
set.seed(23)
trainTestPartition<-createDataPartition(y=iris$Species, #the class label, caret ensures an even split of classes
p=0.7, #proportion of samples assigned to train
list=FALSE)
str(trainTestPartition)
training <- iris[ trainTestPartition,] #take the corresponding rows for training
testing <- iris[-trainTestPartition,] #take the corresponding rows for testing by removing training rows
summary(training)
nrow(training)
summary(testing)
nrow(testing)
```
We usually want to apply some preprocessing to our datasets to bring different predictors in line and make sure we are not introducing any extra bias. In caret, we can apply preprocessing methods separately, together in the preProcess function, or within the model training itself.
#### Applying preprocessing functions separately
```{r separate preprocessing}
training.separate = training
testing.separate = testing
```
*Near-Zero Variance*
The function nearZeroVar identifies predictors that have a single unique value (zero-variance predictors). It also diagnoses predictors having both of the following characteristics:
- very few unique values relative to the number of samples
- the ratio of the frequency of the most common value to the frequency of the 2nd most common value is large.
Such zero and near zero-variance predictors have a deleterious impact on modelling and may lead to unstable fits.
```{r nzv}
nzv(training.separate)
```
In this case, we have no near zero variance predictors but that will not always be the case.
*Highly Correlated*
Some datasets can have many highly correlated variables. caret has a function findCorrelation to identify highly correlated variables so they can be removed. It considers the absolute values of pair-wise correlations. If two variables are highly correlated, it looks at the mean absolute correlation of each variable and flags the variable with the largest mean absolute correlation for removal. This method is also used when you specify 'corr' in the preProcess function below.
For datasets with many highly correlated variables, an alternative to removing correlated predictors is to transform the entire data set into a lower-dimensional space, using a technique such as principal component analysis (PCA); a sketch of this follows the next chunk.
```{r high correlation}
calculateCor <- cor(training.separate[1:4]) #calculate correlation matrix on the predictors
summary(calculateCor[upper.tri(calculateCor)])
highlyCor <- findCorrelation(calculateCor) #pick highly correlated ones
colnames(training.separate)[highlyCor]
corrplot::corrplot(calculateCor,diag=FALSE)
training.separate.cor=training.separate[,-highlyCor] #remove highly correlated predictors from training
testing.separate.cor=testing.separate[,-highlyCor] #remove highly correlated predictors from test
```
Here, we have one highly correlated variable, Petal.Length, which is removed from the training and test sets.
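As mentioned above, PCA is an alternative to dropping correlated predictors. A minimal sketch using caret's preProcess (not part of the original pipeline; by default components are retained until 95% of the variance is captured):
```{r pca preprocessing sketch}
## estimate a PCA transformation from the training data (the Species factor is ignored)
pcaTransform <- preProcess(training.separate, method = c("center", "scale", "pca"))
pcaTransform

## apply the same transformation to both training and test data
training.separate.pca <- predict(pcaTransform, training.separate)
testing.separate.pca <- predict(pcaTransform, testing.separate)
head(training.separate.pca)
```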
*Skewness*
caret provides various methods for transforming skewed variables to normality, including the Box-Cox (Box and Cox 1964) and Yeo-Johnson (Yeo and Johnson 2000) transformations. Here we try using the Box-Cox method.
```{r boxcox}
#perform boxcox scaling on each predictor
#note: the transformation parameters are estimated from the full iris data here and applied to the correlation-filtered training and test sets
training.separate.boxcox=training.separate
training.separate.boxcox$Sepal.Length=predict(BoxCoxTrans(iris$Sepal.Length),
                                              training.separate.cor$Sepal.Length)
training.separate.boxcox$Sepal.Width=predict(BoxCoxTrans(iris$Sepal.Width),
                                             training.separate.cor$Sepal.Width)
training.separate.boxcox$Petal.Width=predict(BoxCoxTrans(iris$Petal.Width),
                                             training.separate.cor$Petal.Width)
testing.separate.boxcox=testing.separate
testing.separate.boxcox$Sepal.Length=predict(BoxCoxTrans(iris$Sepal.Length),
                                             testing.separate.cor$Sepal.Length)
testing.separate.boxcox$Sepal.Width=predict(BoxCoxTrans(iris$Sepal.Width),
                                            testing.separate.cor$Sepal.Width)
testing.separate.boxcox$Petal.Width=predict(BoxCoxTrans(iris$Petal.Width),
                                            testing.separate.cor$Petal.Width)
summary(training.separate.boxcox)
summary(testing.separate.boxcox)
```
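The Yeo-Johnson transformation mentioned above can be applied in much the same way. A minimal sketch using caret's preProcess (an illustration, not part of the original pipeline):
```{r yeojohnson sketch}
## estimate Yeo-Johnson transformations from the training data (factor columns are ignored)
yjTransform <- preProcess(training.separate, method = "YeoJohnson")
yjTransform

## apply the same transformation to both training and test data
training.separate.yj <- predict(yjTransform, training.separate)
testing.separate.yj <- predict(yjTransform, testing.separate)
summary(training.separate.yj)
```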
In this situation it is also important to centre and scale each predictor. A predictor variable is centred by subtracting the mean of the predictor from each value. To scale a predictor variable, each value is divided by its standard deviation. After centring and scaling, the predictor variable has a mean of 0 and a standard deviation of 1.
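For example, a minimal sketch of what centring and scaling does to a single predictor (base R only, using the training data defined above):
```{r manual centre scale sketch}
x <- training$Sepal.Length

## centre: subtract the mean; scale: divide by the standard deviation
x.scaled <- (x - mean(x)) / sd(x)

round(mean(x.scaled), 10) #approximately 0
sd(x.scaled) #exactly 1
```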
#### Using preProcess function
Instead of using separate functions, we can add all the preprocessing into one function call to preProcess.
```{r preprocess function}
#The options for preprocessing are "BoxCox", "YeoJohnson", "expoTrans", "center", "scale", "range", "knnImpute", "bagImpute", "medianImpute", "pca", "ica", "spatialSign", "corr", "zv", "nzv", and "conditionalX"
calculatePreProcess <- preProcess(training,
method = c("center", "scale","corr","nzv","BoxCox")) #perform preprocessing
calculatePreProcess
training.preprocess <- predict(calculatePreProcess, training) #apply preprocessing to training data
summary(training.preprocess)
#Petal.Length is removed
testing.preprocess <- predict(calculatePreProcess, testing) #apply same preprocessing to testing data
summary(testing.preprocess)
dtreeIris.preprocess <- train(
Species ~ .,
data = training.preprocess,
method = "rpart" #this is a decision tree but we will get to more information about that later
)
dtreeIris.preprocess
```
### Training different types of models
One of the primary tools in the package is the *train* function, which can be used to evaluate, using resampling, the effect of model tuning parameters on performance, choose the 'optimal' model across these parameters and estimate model performance from a training set.
caret enables the easy use of many different types of models, a few of which we will cover in this course. The full list is available at https://topepo.github.io/caret/available-models.html
We can change the model we use by changing the 'method' parameter in the train function. For example:
```{r change method}
#decision tree
dtreeIris <- train(
Species ~ .,
data = training.preprocess, ##make sure you use the preprocessed version
method = "rpart" #specifies decision tree
)
#support vector machine
svmIris <- train(
Species ~ .,
data = training.preprocess, ##make sure you use the preprocessed version
method = "svmLinear" #specifies support vector machine with linear kernel
)
#random forest
randomForestIris <- train(
Species ~ .,
data = training.preprocess, ##make sure you use the preprocessed version
method = "rf" ##specifies random forest
)
```
#### Adding preprocessing within training
We can combine the preprocessing step with training the model, using the *preProc* parameter in caret's train function.
```{r preprocessing in training}
dtreeIris <- train(
Species ~ ., ## this means the model should classify Species using the other features
data = training, ## specifies training data (without preprocessing)
method = "rpart", ## uses decision tree
preProc = c("center", "scale","nzv","corr","BoxCox") ##this performs the preprocessing within model training
)
dtreeIris
```
### Cross-validation
As we talked about in the last session, cross-validation is important to ensure the robustness of our models. We can specify how we want to perform cross-validation to caret.
```{r cross validation}
train_ctrl = trainControl(method='cv',
number=10) #10-fold cross-validation
dtreeIris.10fold <- train(
Species ~ .,
data = training,
method = "rpart",
preProc = c("center", "scale","nzv","corr","BoxCox"),
trControl = train_ctrl #train decision tree with 10-fold cross-validation
)
dtreeIris.10fold
```
You may notice that every time you run the last chunk you get slightly different answers. To make our analysis reproducible, we need to set some seeds. Rather than setting a single seed, we need to set quite a few as caret uses them in different places.
```{r set seed for cross validation}
set.seed(42)
seeds = vector(mode='list',length=11) #this is #folds+1 so 10+1
for (i in 1:10) seeds[[i]] = sample.int(1000,10)
seeds[[11]] = sample.int(1000,1)
train_ctrl_seed = trainControl(method='cv',
number=10,
seeds=seeds) #use our seeds in the cross-validation
dtreeIris.10fold.seed <- train(
Species ~ .,
data = training,
method = "rpart",
preProc = c("center", "scale","nzv","corr","BoxCox"),
trControl = train_ctrl_seed
)
dtreeIris.10fold.seed
```
If you try running this chunk multiple times, you will see the same answer each time.
If you wanted to use repeated cross-validation instead of cross-validation, you can use:
```{r repeated cross validation}
set.seed(42)
seeds = vector(mode='list',length=101) #you need length #folds*#repeats + 1 so 10*10 + 1 here
for (i in 1:100) seeds[[i]] = sample.int(1000,10)
seeds[[101]] = sample.int(1000,1)
train_ctrl_seed_repeated = trainControl(method='repeatedcv',
number=10, #number of folds
repeats=10, #number of times to repeat cross-validation
seeds=seeds)
dtreeIris.10fold.seed.repeated <- train(
Species ~ .,
data = training,
method = "rpart",
preProc = c("center", "scale","nzv","corr","BoxCox"),
trControl = train_ctrl_seed_repeated
)
dtreeIris.10fold.seed.repeated
```
### Optimising hyperparameters
For different models, we need to optimise different hyperparameters. To specify the different values we wish to consider, we use the tuneGrid or tuneLength parameters. In the decision tree example, we can optimise the cp value. Instead of looking at only 3 values, we may want to look at 10:
```{r tune length and tune grid}
dtreeIris.hyperparam <- train(
Species ~ .,
data = training,
method = "rpart",
preProc = c("center", "scale","nzv","corr","BoxCox"),
trControl = train_ctrl_seed_repeated,
tuneLength = 10 #pick number of different hyperparam values to try
)
dtreeIris.hyperparam
```
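Alternatively, tuneGrid lets us specify the exact hyperparameter values to try. A minimal sketch for the decision tree (the cp values below are arbitrary choices for illustration):
```{r tune grid sketch}
## explicitly specify the complexity parameter (cp) values to evaluate
cpGrid <- expand.grid(cp = c(0.001, 0.01, 0.05, 0.1))

dtreeIris.tunegrid <- train(
  Species ~ .,
  data = training,
  method = "rpart",
  preProc = c("center", "scale","nzv","corr","BoxCox"),
  trControl = train_ctrl_seed_repeated,
  tuneGrid = cpGrid #evaluate exactly these cp values
)
dtreeIris.tunegrid
```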
We will see more examples of these parameters as we explore different types of models.
### Using dummy variables with the Sacramento dataset
If you have categorical predictors instead of continuous numeric variables, you may need to convert your categorical variable to a series of dummy variables. We will show this method on the Sacramento dataset.
From the documentation:
This data frame contains house and sale price data for 932 homes in Sacramento CA. The original data were obtained from the website for the SpatialKey software. From their website: "The Sacramento real estate transactions file is a list of 985 real estate transactions in the Sacramento area reported over a five-day period, as reported by the Sacramento Bee." Google was used to fill in missing/incorrect data.
```{r load Sacramento}
data("Sacramento") ##loads the dataset, which can be accessed under the variable name Sacramento
?Sacramento
str(Sacramento)
```
```{r Sacramento dummies}
dummies = dummyVars(price ~ ., data = Sacramento) #convert the categorical variables to dummies
Sacramento.dummies = data.frame(predict(dummies, newdata = Sacramento))
Sacramento.dummies$price=Sacramento$price
```
Once we have created the dummy variables, we can split the data into training and test sets and train a model just as with the iris data.
```{r Sacramento split training test}
set.seed(23)
trainTestPartition.Sacramento<-createDataPartition(y=Sacramento.dummies$price, #the outcome variable; for a numeric outcome caret balances the split across quantile groups
p=0.7, #proportion of samples assigned to train
list=FALSE)
training.Sacramento <- Sacramento.dummies[ trainTestPartition.Sacramento,]
testing.Sacramento <- Sacramento.dummies[-trainTestPartition.Sacramento,]
```
```{r Sacramento linear model dummies}
lmSacramento <- train(
price ~ .,
data = training.Sacramento,
method = "lm",
preProc = c("center", "scale","nzv","corr","BoxCox")
)
lmSacramento
```
We can also train without using dummy variables and compare the two models; a sketch of one way to compare them follows the chunk below.
```{r Sacramento linear model non-dummies}
training.Sacramento.nondummy <- Sacramento[ trainTestPartition.Sacramento,]
testing.Sacramento.nondummy <- Sacramento[-trainTestPartition.Sacramento,]
lmSacramento.nondummy <- train(
price ~ .,
data = training.Sacramento.nondummy,
method = "lm",
preProc = c("center", "scale","nzv","corr","BoxCox")
)
lmSacramento.nondummy
```
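One way to compare the two fits is shown below as a minimal sketch (using caret's getTrainPerf and postResample helpers; test-set prediction is only shown for the dummy-variable model, because factor levels present only in the test set, e.g. rare zip codes, would make predict fail for the non-dummy model):
```{r compare Sacramento models}
## resampling-based performance estimates (RMSE, Rsquared, MAE) for each model
getTrainPerf(lmSacramento)
getTrainPerf(lmSacramento.nondummy)

## test-set performance for the dummy-variable model
pred.dummy <- predict(lmSacramento, newdata = testing.Sacramento)
postResample(pred.dummy, testing.Sacramento$price)
```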