---
title: "Predict The Onset of Diabetes Based on Diagnostic Measures"
author: "Hao He, Bryant Leal, Qinwen Zhou, Hyeon Gu Kim, Sunny Vidhani, Jazline Keli"
date: "7/27/2020"
output:
  md_document:
    variant: markdown_github
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
# Read Data and Data Cleaning
```{r, echo = FALSE, results = 'hide'}
library(class)
library(boot)
library(glmnet)
library(caret)
```
First we read in the data, in which the zero values have already been converted to NA values. Then we omit the observations that contain NA values.
```{r, echo = T, results = 'hide'}
rm(list=ls())
#setwd('/Users/z/Documents/MSBA/Summer 2020/Predictive Modeling/Project')
diabetes=read.csv("pima_diabetes_NA.csv")
diabetes$class <- as.factor(diabetes$class)
diabetes <- na.omit(diabetes)
attach(diabetes)
```
Next we check whether we have a class imbalance issue. The true class (patients with diabetes) makes up 33.16% of the observations, while the false class makes up 66.84%. This imbalance in the dataset might impact our analysis.
```{r}
D = sum(diabetes$class == 1)
TrueD = D / nrow(diabetes)
TrueD
1-TrueD
```
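As a point of reference (a quick sketch added here, not part of the original analysis), a naive classifier that always predicts the majority, non-diabetic class would already reach the majority-class proportion as its accuracy, so the model accuracies reported later should be read against this baseline.
```{r}
# Accuracy of a trivial classifier that always predicts the majority class;
# later model accuracies can be compared against this baseline.
max(TrueD, 1 - TrueD)
```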
We plotted box plots to get a sense of which variables might be key in distinguishing patients with diabetes from patients without. In the plots, plas, mass, and age appear to do a better job of differentiating the two classes.
```{r fig1}
par(mfrow=c(2,4))
plot(preg ~ class, data=diabetes, col=c(grey(.2),2:6), cex.lab=1)
plot(plas ~ class, data=diabetes, col=c(grey(.2),2:6), cex.lab=1)
plot(pres ~ class, data=diabetes, col=c(grey(.2),2:6), cex.lab=1)
plot(skin ~ class, data=diabetes, col=c(grey(.2),2:6), cex.lab=1)
plot(test ~ class, data=diabetes, col=c(grey(.2),2:6), cex.lab=1)
plot(mass ~ class, data=diabetes, col=c(grey(.2),2:6), cex.lab=1)
plot(pedi ~ class, data=diabetes, col=c(grey(.2),2:6), cex.lab=1)
plot(age ~ class, data=diabetes, col=c(grey(.2),2:6), cex.lab=1)
```
# Model Selection
## KNN
We perform a 5-fold CV KNN model here, testing K values from 1 to 100. The optimal K is 17, which gives us an accuracy of 77.44%.
```{r, echo=FALSE}
X <- scale(diabetes[,-c(9)]) #scale our variables and assign it to X
label <- diabetes['class']   #assign our label
n = dim(diabetes)[1]
kcv = 5
n0 = round(n/kcv, 0)
kk <- 1:100
out_accuracy = matrix(0,                  #matrix filled with zeroes
                      nrow = kcv,         #number of rows
                      ncol = length(kk))  #number of columns
used = NULL
set = 1:n
for(j in 1:kcv){
  set.seed(1)
  if(n0 < length(set)){
    val = sample(set, size = n0)
  }
  if(n0 >= length(set)){
    val = set
  }
  train_x = X[-val,]
  test_x = X[val,]
  train_label = label[-val,]
  test_label = label[val,]
  for(i in kk){
    near = knn(train_x, test_x, cl = train_label, k = i)
    tab <- table(near, test_label)
    accuracy <- function(x){sum(diag(x)/(sum(rowSums(x)))) * 100}
    out_accuracy[j,i] = accuracy(tab)
  }
  used = union(used, val)
  set = (1:n)[-used]
}
mAccuracy = apply(out_accuracy, 2, mean)
```
```{r fig2}
par(mfrow=c(1,1))
plot(kk, mAccuracy,
     xlab="k",
     ylab="CV Accuracy",
     col=4,       #Color of line
     lwd=2,       #Line width
     type="l",    #Type of graph = line
     cex.lab=1.2, #Size of labs
     main=paste("kfold(",kcv,")")) #Title of the graph
#Find the index of the maximum value of mAccuracy
best = which.max(mAccuracy)
text(kk[best], mAccuracy[best],  #Coordinates
     paste("k=",kk[best]),       #The actual text
     col=2,                      #Color of the text
     cex=0.8)                    #Size of the text
text(kk[best]+5, mAccuracy[best]-3, paste("Best Accuracy:", mAccuracy[best]), col=2, cex=0.8)
maxAccuracy = max(mAccuracy)
```
## Tree Model
We tried a Simple Tree model and a Random Forest in this part.
### Simple Tree
We grew the Simple Tree model on a single train/test split, following the grow-and-prune process. Because of the randomness of the train/test split, **the accuracy rate of our Simple Tree model might fluctuate**.
```{r}
library(tree)
set.seed(1)
pima = read.csv("pima_diabetes.csv", header=T)
#Data cleaning
pima$class = as.factor(pima$class)
pima[,2:8][pima[,2:8] == 0] = NA
pima = na.omit(pima)
#set up train and test data
n = dim(pima)[1]
train = sample(1:n, size = 130, replace = FALSE) #Indices of train data
pima_tr = pima[train, ]  #the train data
pima_te = pima[-train, ] #the test data
#First, grow a big tree
big_tree = tree(class ~ ., data = pima_tr)
plot(big_tree)
text(big_tree)
#make predictions on test data
pima_pred = predict(big_tree,
                    newdata = pima_te,
                    type = "class")
#compare predictions to actual diabetes status
conf_mtx = table(pima_pred, pima_te$class) #diagonal are correct predictions, off-diagonal are incorrect
conf_mtx
accur_big = sum(diag(conf_mtx)) / dim(pima_te)[1]
cat("we have a big tree with accuracy rate of", accur_big, '\n')
#Second, we prune back our big tree to reach the optimal size (cp)
#Do a 10-fold CV on our big tree
tree_cv = cv.tree(big_tree,
                  FUN = prune.misclass,
                  K = 10)
#find the size with the minimum misclassification rate
plot(tree_cv, type = 'p')
best_size = tree_cv$size[which.min(tree_cv$dev)]
#prune back the big tree to the optimal size
pruned_tree = prune.misclass(big_tree, best = best_size)
plot(pruned_tree)
text(pruned_tree, pretty = T)
#look at predictions
pruned_tree_pred = predict(pruned_tree,
                           newdata = pima_te,
                           type = "class")
conf_mtx = table(pruned_tree_pred, pima_te$class)
conf_mtx #diagonal are correct predictions, off-diagonal are incorrect
accur_pruned = sum(diag(conf_mtx)) / dim(pima_te)[1]
cat("we have a pruned tree with accuracy rate of", accur_pruned,
    "with size of", best_size, '\n')
```
### Random Forest
We then built the Random Forest model based on a 10-fold CV, where we tested p (the number of trees) at 100 and 500 and the m value (candidate variables per split) from 1 to 8 (all predictors). We ended up with an optimal RF with **m = 4** and **p = 100**, which yields an accuracy rate of **77.55%**.
```{r}
set.seed(1)
library(randomForest)
pima = read.csv("pima_diabetes_NA.csv", header=T)
#Data cleaning
pima$class = as.factor(pima$class)
pima[,2:8][pima[,2:8] == 0] = NA
pima = na.omit(pima)
#total num of obs in pima
n = dim(pima)[1]
#Define the number of folds, aka num of groups that the data is divided into
kcv = 10
#Size of the fold (which is the number of elements in the test matrix)
n0 = round(n/kcv, 0)
#The model setups matrix
p = ncol(pima) - 1 #Number of covariates (-1 because one column is the response)
mtryv = c(1:p)     #Number of candidate variables for each split, the m val
ntreev = c(100,500) #Number of trees, the B val
parmrf = expand.grid(mtryv, ntreev) #Expanding grid of different models
colnames(parmrf) = c('mtry','ntree')
nset = nrow(parmrf) #Number of models
#print(parmrf)
#error matrix
out_ERR = matrix(0,            #matrix filled with zeroes
                 nrow = kcv,   #number of rows
                 ncol = nset)  #number of columns/models
#Vector of indices that have already been used inside the for
used = NULL
#The set of indices not used (will be updated removing the used)
set = 1:n
for(j in 1:kcv){
  if(n0 < length(set)) { #If the set of 'not used' is > than the size of the fold
    val = sample(set, size = n0) #then sample indices from the set
    #recall k-fold CV RANDOMLY selects obs from the data to fill in each fold
  }
  if(n0 >= length(set)) { #If the set of 'not used' is <= than the size of the fold
    val = set #then use all of the remaining indices as the sample
  }
  #Create the train and test matrices
  train_i = pima[-val,] #Every observation except the ones whose indices were sampled
  test_i = pima[val,]   #The observations whose indices were sampled
  for(i in 1:nset){
    #The current model
    temprf = randomForest(class ~ .,
                          data = train_i,
                          mtry = parmrf[i,1],
                          ntree = parmrf[i,2],
                          maxnodes = 15)
    pred = predict(temprf,
                   newdata = test_i,
                   type = 'class')
    error_rate = sum(test_i$class != pred) / nrow(test_i)
    #Store the current ERR
    out_ERR[j,i] = error_rate
  }
  #The union of the indices used currently and previously, or, append val to used
  used = union(used, val)
  #The set of indices not used is updated
  set = (1:n)[-used]
  #Printing on the console the information that you want
  #Useful to keep track of the progress of your loop
  cat(j, "folds out of", kcv, '\n')
}
#Calculate the mean error rate for each model
mERR = apply(out_ERR, 2, mean)
#plot m vs error rate for p = 100 and p = 500
plot(x = 1:8,
     y = mERR[1:8],
     xlab="m value",
     ylab="out-of-sample error rate",
     col=4,       #Color of line
     lwd=2,       #Line width
     type="l",    #Type of graph = line
     cex.lab=1.2, #Size of labs
     main=paste("10-fold CV")) #Title of the graph
lines(x = 1:8,
      y = mERR[9:16],
      col = 5,
      lwd = 2)
legend("topright",
       legend=c("p = 100", "p = 500"),
       col=c(4, 5),
       lty=c(1, 1),
       cex=0.8)
#pick the best model overall
best = which.min(mERR)
bestrf = randomForest(class ~ .,
                      data = pima,
                      mtry = parmrf[best,1],
                      ntree = parmrf[best,2],
                      maxnodes = 15)
#calculate accuracy rate when predicting on the whole pima dataset
bestrfpred = predict(bestrf, type="class")
conf_mtx = table(bestrfpred, pima$class)
#conf_mtx #diagonal are correct predictions, off-diagonal are incorrect
accur_rf = sum(diag(conf_mtx)) / dim(pima)[1]
cat("we have a random forest model with accuracy rate of", accur_rf,
    "with m =", parmrf[best, 1],
    "p = ", parmrf[best, 2], '\n')
## Logistic Regression
Finally, we performed a Logistic Regression with 10-fold cross-validation, using a custom misclassification cost function. Since the cost function measures error, the corresponding **raw** CV accuracy estimate is **78.06%** and the **bias-adjusted** estimate is **77.60%**.
```{r}
# first we scale the data that will be used later for variable selection
XXdia <- scale(diabetes[,-9])
DIAdata <- data.frame(class,XXdia)
#define our own cost function here
mycost <- function(r, pi){ #r = observed responses, pi = fitted responses
  c1 = (r==1)&(pi<0.5)  #logical vector - true if actual 1 but predict 0
  c0 = (r==0)&(pi>=0.5) #logical vector - true if actual 0 but predict 1
  return(mean(c1+c0))
}
diabetes.glm <- glm(class ~ ., family = 'binomial', data = DIAdata)
set.seed(1)
cv.error = cv.glm(DIAdata,diabetes.glm, cost = mycost, K=10)$delta
cv.error
```
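Because `mycost` returns the misclassification rate, `cv.error` holds error estimates; a minimal sketch of converting them to the accuracy figures quoted above:
```{r}
# cv.error contains the raw and bias-adjusted CV misclassification estimates;
# subtracting them from 1 gives the corresponding accuracy estimates.
1 - cv.error
```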
# Variable Selection
As previously noted with the box plots, some variables appear to have stronger predictive ability than others. Here, we perform out-of-sample variable selection with the Logistic Regression algorithm to find the more important predictors.
```{r}
#split the dataset into training and test data for all the variable selection methods for better comparison
n = dim(diabetes)[1]
set.seed(1)
tr = sample(1:n,
            size = 274, # we take 70% of our dataset as training data
            replace = FALSE)
```
## Out-of-Sample Stepwise Method
We tried all three kinds of stepwise selection: forward, backward, and both. All three return the same model, **class ~ plas + mass + age**, with an accuracy of **76.27%**.
### Forward Selection
```{r}
test_x = DIAdata[-tr,][,-1]
test_y = DIAdata[-tr,][,1]
#Two models initially
null = glm(class~1, data=DIAdata[tr,], family = 'binomial' )
full = glm(class~., data=DIAdata[tr,], family = 'binomial')
regForward = step(null,
                  scope=formula(full),
                  direction="forward",
                  k=log(length(tr)))
forward_prob <- predict(regForward, test_x, type = "response")
forward_pred <- ifelse(forward_prob > 0.5, 1, 0)
forward_acc = mean(forward_pred==test_y)
forward_acc
```
### Backward Selection
```{r}
regBack = step(full,
               direction="backward",
               k=log(length(tr)))
back_prob <- predict(regBack, test_x, type = "response")
back_pred <- ifelse(back_prob > 0.5, 1, 0)
back_acc = mean(back_pred==test_y)
back_acc
```
### Both Selection
```{r}
regHybrid = step(null,
                 scope=formula(full),
                 direction="both",
                 k=log(length(tr)))
hybrid_prob <- predict(regHybrid, test_x, type = "response")
hybrid_pred <- ifelse(hybrid_prob > 0.5, 1, 0)
hybrid_acc = mean(hybrid_pred==test_y)
hybrid_acc
```
## Out-of-Sample Shrinkage Method
Lasso returns an accuracy of **77.11%** with the model **class ~ plas + mass + age**. Looking at the coefficients, plas has the strongest predictive power, mass comes second, and age third. This confirms our impression from the box plots, and the predictors chosen by Lasso are also consistent with those from the stepwise methods.
Ridge, on the other hand, performs worse, with an accuracy of only **72.88%**.
### Lasso
```{r}
train_x = as.matrix(DIAdata[tr,-1])
train_y = DIAdata[tr,1]
test_x = as.matrix(DIAdata[-tr,-1])
test_y = DIAdata[-tr,1]
Lasso.Fit = glmnet(train_x,train_y,family = 'binomial',alpha=1)
plot(Lasso.Fit)
#Cross Validation
set.seed(1)
CV.L = cv.glmnet(train_x, train_y, family = 'binomial',type.measure="class", alpha=1)
coef(CV.L,s="lambda.1se")
####Plots#####
LamL = CV.L$lambda.1se
plot(log(CV.L$lambda), CV.L$cvm,
     main="LASSO CV (k=10)", xlab="log(lambda)",
     ylab="CV misclassification error", col=4, type="b", cex.lab=1.2)
abline(v=log(LamL),lty=2,col=2,lwd=2)
####Predict on test set####
lasso_predict = predict(CV.L,newx = test_x,s="lambda.1se",type='class')
lasso_conf_matrix <- table(lasso_predict,test_y)
lasso_acc = mean(lasso_predict==test_y)
lasso_acc
```
### Ridge
```{r}
Ridge.Fit = glmnet(train_x,train_y,family = 'binomial',alpha=0)
par(mfrow=c(1,1))
plot(Ridge.Fit)
#Cross Validation
set.seed(1)
CV.R = cv.glmnet(train_x, train_y,family = 'binomial',type.measure="class", alpha=0)
coef(CV.R,s="lambda.1se") # check coefficients of the model with optimal lambda
#this lambda does not minimize the CV error; it is the largest lambda whose error is within one standard error of the minimum
#Make the plot of how CV chooses the optimal lambda
LamR = CV.R$lambda.1se
plot(log(CV.R$lambda), CV.R$cvm,
     main="Ridge CV (k=10)",
     xlab="log(lambda)",
     ylab="CV misclassification error",
     col=4,       #Color of points
     cex.lab=1.2) #Size of labs
abline(v=log(LamR), lty=2, col=2, lwd=2) #selected lambda
#predict on the test data
ridg_predict = predict(CV.R, newx = test_x,s="lambda.1se",type='class')
ridg_acc = mean(ridg_predict==test_y)
ridg_conf_matrix <- table(ridg_predict,test_y)
ridg_acc
```
# Model Comparison
Most of our models yield satisfactory accuracy above 75%. We specifically compare the Logistic Regression model with all variables against the Lasso model with the three selected variables to evaluate whether the variable selection process is effective. Both models have similar accuracy, but our team decided to go with the Logistic Regression with all variables.
The main reason is that our model is trying to solve the problem of predicting diabetes. It is crucial to correctly identify patients with diabetes so they can receive timely treatment. Therefore, the sensitivity score of the model is an important criterion to consider besides the accuracy score.
Below we fit the out-of-sample Logistic Regression so it can be compared fairly with the Lasso model, and we generate the confusion matrix for both models.
```{r}
#
train = DIAdata[tr,]
test_x = DIAdata[-tr,][,-1]
test_y = DIAdata[-tr,][,1]
diabetes.glm <- glm(class ~ ., family = 'binomial', data = train)
glm_predict_prob <-predict(diabetes.glm, newdata = test_x, family = 'binomial',type = 'response')
glm_predict <- ifelse(glm_predict_prob > 0.5, 1, 0)
glm_predict_accuracy <- mean(glm_predict == test_y)
glm_conf_matrix <- table(glm_predict,test_y)
glm_conf_matrix
```
```{r}
#confusion matrix for lasso
lasso_conf_matrix
```
*Logistic Regression with all Variables*

- Sensitivity Score: **0.6154**
- Specificity Score: **0.8734**

*Lasso with class ~ plas + mass + age*

- Sensitivity Score: **0.4103**
- Specificity Score: **0.9494**
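These scores can be recovered from the two confusion matrices above. A minimal sketch, assuming class `1` is the positive (diabetic) class and that, as in the tables above, rows are predictions and columns are the true labels:
```{r}
sens_spec <- function(conf_mtx){
  #sensitivity = true positives / actual positives; specificity = true negatives / actual negatives
  c(sensitivity = conf_mtx["1","1"] / sum(conf_mtx[,"1"]),
    specificity = conf_mtx["0","0"] / sum(conf_mtx[,"0"]))
}
sens_spec(glm_conf_matrix)
sens_spec(lasso_conf_matrix)
```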
As we can see, the Logistic Regression with all variables achieves a large gain in sensitivity with only a small sacrifice in specificity. This might be due to information contained in the variables that Lasso did not select.
Moreover, since our dataset only has 8 predictors, we are not working in a high-dimensional space that would require variable selection to reduce dimensionality.