forked from cap76/intro-machine-learning-2019B
-
Notifications
You must be signed in to change notification settings - Fork 19
/
10-ann.Rmd
361 lines (242 loc) · 15.8 KB
/
10-ann.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
# Artificial neural networks {#ann}
<!-- Sudhakaran -->
## Neural Networks
**What are artificial neural networks (ANNs)?**
*ANN* is actually an old idea but it came back into vogue recently and it is the state of the art technique for machine learning. The goal of ANN algorithms is to mimmick the functions of a neuron (Figure \@ref(fig:neuronalComputation)) and neuronal networks.
```{r neuronalComputation, echo = F, fig.cap = 'Neuronal computation', fig.align = 'center', fig.show='hold', out.width = '65%'}
knitr::include_graphics(c("images/neuronal_computation.png"))
```
Computational representation of a neuron (Figure \@ref(fig:perceptron)) aims to mimmick the biological input-and-activation architecture of a neuron (Figure \@ref(fig:neuronalComputation)). A single unit of a computational neuron is also called a **perceptron or ptrons**. Ptrons have a nonlinear activation function (e.g a logistic function) which determines their output value based upon the values of their inputs.
```{r perceptron, echo = F, fig.cap = 'Perceptron', fig.align = 'center', fig.show='hold', out.width = '65%'}
knitr::include_graphics(c("images/Perceptron.png"))
```
**Architecture of ANNs**
ANNs are built from ptrons. Ptrons have one or more inputs, an activation function and an output (Figure Perceptron). An ANN model is built up by combining ptrons in structured layers. Ptrons in a given layer are independent of each other, but each of them connect to all the ptrons in the next layer (Figure Neural Network Modeling).
The input layer contains a ptron for each input variable. One or more hidden layers contain a user defined number of ptrons. Each ptron in the first hidden layer receives an input from the each ptron in the input layer. If there is a second hidden layer, each ptron in this layer receives an input from each ptron in the first hidden layer, and so on with additional layers. The output layer contains a ptron for each response variable (usually one, sometimes more in multivariate response situations). Each output ptron receives one input from each ptron in the final hidden layer
*Important*: The connections between ptrons are weighted. The magnitude of the weight controls the strength of the influence of that input on the receiving ptron. The sign of the weight controls whether the influence is stimulating or inhibiting the signal to the next layer.
The weights are somewhat analogous to the parameters of a linear model. There is also a bias adjustment that represents the base value of a ptron and is analogous to the intercept in a linear model. If the inputs are near zero, the bias ensures that the output is near average.
Due to the network-like nature of the ANN a complex, non-linear relationship exist between the predictors and response.
*Acknowledgement: aspects of the above discussion are from*: https://rpubs.com/julianhatwell/annr
**Forward propagation**
Figure \@ref(fig:neuralNetworkModeling) represents a simple ANN, where we have an input later (layer 1) with three ptrons and a base unit, one hidden layer (layer 2) again with three prtons and a base unit, and an output layer (layer 3) where the $h_{\theta}$(x) is computed.
This method of computing $h_{\theta}$(x) is called *Forward Propagation*.
```{r neuralNetworkModeling, echo = F, fig.cap = 'Neural Network Modeling', fig.align = 'center', fig.show='hold', out.width = '85%'}
knitr::include_graphics(c("images/NN3.png"))
```
*where*
$a_i^{(j)}$= activation of i in layer j\
$\theta^i$ = matrix of weights controlling function mapping from layer j to layer j+1
$$\begin{align}
a_1^{(2)} &= g (\theta_{10}^{(1)}x_0 + \theta_{11}^{(1)}x_1 + \theta_{12}^{(1)}x_2+ \theta_{13}^{(1)}x_3)\\
a_2^{(2)} &= g (\theta_{20}^{(1)}x_0 + \theta_{21}^{(1)}x_1 + \theta_{22}^{(1)}x_2+ \theta_{23}^{(1)}x_3)\\
a_3^{(2)} &= g (\theta_{30}^{(1)}x_0 + \theta_{31}^{(1)}x_1 + \theta_{32}^{(1)}x_2+ \theta_{33}^{(1)}x_3)\\
h_{\theta}(x) &= a_1^{(3)}=g (\theta_{10}^{(2)}a_0^{(2)} + \theta_{11}^{(2)}a_1^{(2)} + \theta_{12}^{(1)}a_2^{(2)} + \theta_{13}^{(2)}a_3^{(2)})\\
\end{align}$$
Vectorized notations of inputs and activations.
$$\begin{align}
x &= \begin{bmatrix}x_o \\
x_1\\
x_2\\
x_3
\end{bmatrix}\\
z^{(2)} &= \begin{bmatrix} z_1^{(2)}\\
z_2^{(2)}\\
z_3^{(2)}
\end{bmatrix}
\end{align}$$
Vectorized representation of activation of hidden layer and activation layer.
$$\begin{align}
z^{(2)} &= \Theta^{(1)} a^{(1)}\\
a^{(2)} &= g(z^{(2)})
\end{align}$$
$$\begin{align}
z^{(3)} &= \Theta^{(2)} a^{(2)}\\
h_\Theta(x) &= a^{(3)} = g(z^{(3)})
\end{align}$$
**Why ANN?**
Consider the supervised learning problems below (Figure \@ref(fig:supervisedLearningProbs)). The first two are straight forward cases.
```{r supervisedLearningProbs, echo = F, fig.cap = 'Why ANN', fig.align = 'center', fig.show='hold', out.width = '85%'}
knitr::include_graphics(c("images/NN4.PNG"))
```
Whereas for the third we could probably apply a logistic regression with a lot of nonlinear features like this
$$ Y_i = g(\theta_0 + \theta_1 x_1 + \theta_2 x_2+ \theta_3 x_1x_2+ \theta_4 x_i^2x_2 + \theta_5 x_i^3x_2+ \theta_6 x_i^3x_2^2 \dots)$$
i.e. if we include enough polynomials we could arrive at an hypothesis that will separate the two classes.
This could perfectly work well if we just have two features, such as x~1~ and x~2~ but for almost all machine learning problems we usually have more than two features. Importantly if the number of features increase the number of quadratic terms increase as a function of $n^2/2$; where n is the number of features.
This would result in overfitting if the number of features increase.
Because ANN has felixibility to derive complex features from each layer of ptrons, it can be applied to any complex functional relationship and more importantly unlike generalized linear models (GLMs) it is not necessary to prespecify the type of relationship between covariates and response variables as for instance as linear combination. This makes ANN a valuable statistical tool.
Observed data are used to train the neural network and the neural network learns an approximation of the relationship by iteratively adapting its parameters.
Figure \@ref(fig:simpleLogicalANDANN) shows an example of a simple logical AND ptron architecture.
```{r simpleLogicalANDANN, echo = F, fig.cap = 'Simple Logical AND ANN', fig.align = 'center', fig.show='hold', out.width = '85%'}
knitr::include_graphics(c("images/NN5.png"))
```
**Cost function and back propagation**
Cost function of linear models
$$
\begin{equation*}
CF(\theta_{0},\theta_{1}) = 1/2n\sum_{i=1}^{n} (h_{\theta}(x^i) - y^i)^2
\end{equation*}
$$
Matrix solution to minimise the cost function in linear models
$$ \hat{\boldsymbol{\beta}} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{Y} $$
Cost funciton in ANN (it is a pretty scarry equation)
$$
\begin{equation*}
CF(\Theta) = -1/m \Bigg[\sum_{i=1}^{m}\sum_{k=1}^{K} y_k^{(i)}log(h_\Theta(x^{(i)}))_k+(1-y_k^{(i)})log(1-(h_\Theta(x^{(i)}))_k)\Bigg]
\end{equation*}
$$
$$
\begin{equation*}
+\lambda/2m\sum_{l=1}^{L-1}\sum_{i=1}^{s_l}\sum_{j=1}^{S_{l+1}}(\Theta_{ji}^{(l)})^2
\end{equation*}
$$
We have to minimise this ANN cost function and it is done by back propogation.
It is termed back propogation because of the fact that we compute the error from the outer most layer and go backwards.
*Acknowledgement: The above example and discussion is from Prof. Andrew Ng's coursera session on ANN*
*Example model*
This example uses the Boston data from the MASS package which contains a number of predictors of median property values in suburbs of Boston, MA, USA. The code used is based on Alice, (2015)
The Boston dataset is a collection of data about housing values in the suburbs of Boston. Our goal is to predict the median value of owner-occupied homes (medv) using all the other continuous variables available.
```{r}
library(neuralnet)
library(nnet)
library(NeuralNetTools)
library(MASS)
library(ISLR)
library(caTools) # sample.split
library(boot) # cv.glm
library(faraway) # compact lm summary "sumary" function
library(caret) # useful tools for machine learning
library(corrplot)
```
```{r}
set.seed(500)
library(MASS)
data <- Boston
```
Checking whether there are any missing data.
```{r}
apply(data,2,function(x) sum(is.na(x)))
```
We randomly splitt the data into a train and a test set and then we fit a linear regression model and test it on the test set.
```{r}
index <- sample(1:nrow(data),round(0.75*nrow(data)))
train <- data[index,]
test <- data[-index,]
lm.fit <- glm(medv~., data=train)
summary(lm.fit)
pr.lm <- predict(lm.fit,test)
MSE.lm <- sum((pr.lm - test$medv)^2)/nrow(test)
```
The sample(x,size) function simply outputs a vector of the specified size of randomly selected samples from the vector x. By default the sampling is without replacement: index is essentially a random vector of indeces.
Since we are dealing with a regression problem, we are going to use the mean squared error (MSE) as a measure of how much our predictions are far away from the real data.
Before fitting a neural network, we need to be prepare them to train and tune.
```{r}
maxs <- apply(data, 2, max)
mins <- apply(data, 2, min)
```
It is important to normalize the data before training a neural network. Avoiding normalization may lead to useless results or to a very difficult training process (most of the times the algorithm will not converge before the number of maximum iterations are allowed). There are different methods to choose to scale the data (z-normalization, min-max scale, etc…).
Here we have chosen to use the min-max method and scale the data in the interval [0,1]. Usually scaling in the intervals [0,1] or [-1,1] tends to give better results.
We therefore scale and split the data before moving on:
```{r}
scaled <- as.data.frame(scale(data, center = mins, scale = maxs - mins))
train_ <- scaled[index,]
test_ <- scaled[-index,]
```
Scale returns a matrix that needs to be coerced into a data.frame.
```{r}
library(neuralnet)
n <- names(train_)
f <- as.formula(paste("medv ~", paste(n[!n %in% "medv"], collapse = " + ")))
nn <- neuralnet(f,data=train_,hidden=c(5,3),linear.output=T)
```
There are no fixed rules as to how many layers and neurons to use although there are several more or less accepted rules of thumb. Usually, one hidden layer is enough for a vast numbers of applications. As far as the number of neurons is concerned, it should be between the input layer size and the output layer size, usually 2/3 of the input size.
Since this is a toy example, we are going to use 2 hidden layers with this configuration: 13:5:3:1. The input layer has 13 inputs, the two hidden layers have 5 and 3 neurons and the output layer has a single output since we are doing regression.
The formula y~. is not accepted in the neuralnet() function. You need to first write the formula and then pass it as an argument in the fitting function.
The hidden argument accepts a vector with the number of neurons for each hidden layer, while the argument linear.output is used to specify whether we want to do regression linear.output=TRUE or classification linear.output=FALSE
This is the graphical representation of the model with the weights on each connection:
```{r out.width='80%', fig.asp=0.744, fig.align='center', echo=T}
plot(nn)
```
The black lines show the connections between each layer and the weights on each connection while the blue lines show the bias term added in each step. The bias can be thought as the intercept of a linear model.
The net is essentially a black box so we cannot say that much about the fitting, the weights and the model. Suffice to say that the training algorithm has converged and therefore the model is ready to be used.
Now we can try to predict the values for the test set and calculate the MSE. Remember that the net will output a normalized prediction, so we need to scale it back in order to make a meaningful comparison (or just a simple prediction).
```{r}
pr.nn <- compute(nn,test_[,1:13])
pr.nn_ <- pr.nn$net.result*(max(data$medv)-min(data$medv))+min(data$medv)
test.r <- (test_$medv)*(max(data$medv)-min(data$medv))+min(data$medv)
MSE.nn <- sum((test.r - pr.nn_)^2)/nrow(test_)
```
we then compare the two MSEs
```{r}
print(paste(MSE.lm,MSE.nn))
```
```{r}
par(mfrow=c(1,2))
plot(test$medv,pr.nn_,col='red',main='Real vs predicted NN',pch=18,cex=0.7)
abline(0,1,lwd=2)
legend('bottomright',legend='NN',pch=18,col='red', bty='n')
plot(test$medv,pr.lm,col='blue',main='Real vs predicted lm',pch=18, cex=0.7)
abline(0,1,lwd=2)
legend('bottomright',legend='LM',pch=18,col='blue', bty='n', cex=.95)
```
The net is doing a better work than the linear model at predicting medv. Once again, be cautious because this result depends on the train-test split performed above. Below, after the visual plot, we are going to perform a fast cross validation in order to be more confident about the results.
A first visual approach to the performance of the network and the linear model on the test set is plotted.
By visually inspecting the plot we can see that the predictions made by the neural network are (in general) more concetrated around the line (a perfect alignment with the line would indicate a MSE of 0 and thus an ideal perfect prediction) than those made by the linear model.
A perhaps more useful visual comparison is plotted.
```{r}
plot(test$medv,pr.nn_,col='red',main='Real vs predicted NN',pch=18,cex=0.7)
points(test$medv,pr.lm,col='blue',pch=18,cex=0.7)
abline(0,1,lwd=2)
legend('bottomright',legend=c('NN','LM'),pch=18,col=c('red','blue'))
```
*Cross validation* is another very important step of building predictive models. While there are different kind of cross validation methods, the basic idea is repeating the following process a number of time:
*Train-test split*
1. Do the train-test split\
2. Fit the model to the train set\
3. Test the model on the test set\
4. Calculate the prediction error\
5. Repeat the process K times\
Then by calculating the average error we can get a grasp of how the model is doing.
We are going to implement a fast cross validation using a for loop for the neural network and the cv.glm() function in the boot package for the linear model.
Here is the 10 fold cross validated MSE for the linear model
```{r}
library(boot)
set.seed(200)
lm.fit <- glm(medv~.,data=data)
cv.glm(data,lm.fit,K=10)$delta[1]
```
```{r}
set.seed(450)
cv.error <- NULL
k <- 10
```
Note that we are splitting the data in this way: 90% train set and 10% test set in a random way for 10 times. We are also initializing a progress bar using the plyr library.
```{r}
library(plyr)
pbar <- create_progress_bar('text')
pbar$init(k)
for(i in 1:k){
index <- sample(1:nrow(data),round(0.9*nrow(data)))
train.cv <- scaled[index,]
test.cv <- scaled[-index,]
nn <- neuralnet(f,data=train.cv,hidden=c(5,2),linear.output=T)
pr.nn <- compute(nn,test.cv[,1:13])
pr.nn <- pr.nn$net.result*(max(data$medv)-min(data$medv))+min(data$medv)
test.cv.r <- (test.cv$medv)*(max(data$medv)-min(data$medv))+min(data$medv)
cv.error[i] <- sum((test.cv.r - pr.nn)^2)/nrow(test.cv)
pbar$step()
}
```
We calculate the average MSE and plot the results as a boxplot.
```{r}
mean(cv.error)
```
```{r}
boxplot(cv.error,xlab='MSE CV',col='cyan',
border='blue',names='CV error (MSE)',
main='CV error (MSE) for NN',horizontal=TRUE)
```
The average MSE for the neural network (10.33) is lower than the one of the linear model although there seems to be a certain degree of variation in the MSEs of the cross validation. This may depend on the splitting of the data or the random initialization of the weights in the net. By running the simulation different times with different seeds you can get a more precise point estimate for the average MSE.
*Acknowledgement: the above example is from* https://www.r-bloggers.com/fitting-a-neural-network-in-r-neuralnet-package/
## Exercises
Using a ANN, take a number and calculate its square root.
Solutions to exercises can be found in appendix \@ref(solutions-ann).