---
title: "General Classification"
author: "Delermando Branquinho Filho"
date: "January 02, 2019"
output:
  md_document:
    variant: markdown_github
---
```{r setup, include=FALSE,echo=FALSE,results=F,message=F}
knitr::opts_chunk$set(echo = FALSE)
```
## The Problem
1 - Build a classification problem, using the columns x, y and z, to classify the label column.
a) Segregate a test and a training frame.
b) Use a GLM or Logistic Regression model and show the results.
c) Use another method of your choice to handle the problem.
d) Compare and comment on the results of the models used in b) and c).
```{r init,echo=FALSE,results=F,message=F,warning=F}
library(PerformanceAnalytics,quietly = T,verbose = F,warn.conflicts = F)
library(caret,quietly = T,verbose = F,warn.conflicts = F)
control <- trainControl(method="repeatedcv", number=10, repeats=3)
seed <- 7
metric <- "Accuracy"
dataset <- read.csv(file = "data/df_points.txt",header = T,stringsAsFactors = F,sep = "\t")
# Delete the first column: it does not belong to the dataset and is not
# mentioned in the problem statement above
dataset <- dataset[,-1]
```
### Step 0: Data Exploration
In this section, we'll see what the data looks like.
#### Looking at the spread of the data
The graphic below shows the data in 3D.
3D scatter plots normally do not reveal much more than the numbers themselves, but if you look in detail, you'll see a slight separation between red and blue, which are the class labels.
The data are visibly mixed, which suggests that a regression will struggle, but let's get the summary statistics before drawing any conclusion.
```{r scatter3D,echo=FALSE,results=T,message=F,warning=F}
library("plot3D",quietly = T,verbose = F,warn.conflicts = F)
scatter3D(dataset$x, dataset$y, dataset$z, colvar = dataset$label,
          pch = 19, cex = 0.5, bty = "g", ticktype = "detailed", main = "disease")
```
#### Looking for correlation between variables
We check for correlation between all variables. The charts below show these relationships.
```{r chart_correlation,echo=FALSE,results=T,message=F,warning=F}
chart.Correlation(dataset)
```
As shown, there is no meaningful relationship between any pair of variables: all correlation values fall below even the threshold for a weak correlation. The diagonal also shows the distribution of each variable, and x, y and z all look approximately uniform.
## Step 1 - Answering the first question
In this step we use a logistic regression model to try to classify the dataset on the *label* variable.
As the problem asks, we start with a simple approach: split the dataset into 70% for training and 30% for testing and fit a model with the glm algorithm.
```{r glm_traditional,echo=FALSE,results=T,message=F,warning=F}
set.seed(seed)
split=0.70
trainIndex <- createDataPartition(dataset$label, p=split, list=FALSE)
data_train <- dataset[ trainIndex,]
data_test <- dataset[-trainIndex,]
# train a glm model (family defaults to gaussian here; predicted values are
# rounded to the nearest label to obtain a class)
model <- glm(label ~ ., data=data_train)
# make predictions on the held-out test set
predictions <- as.factor(round(predict(model, data_test)))
# results are summarized with a confusion matrix in the next chunk
```
#### summarize results
```{r conf_matrix,echo=FALSE,results=T,message=F,warning=F}
confusionMatrix(predictions, as.factor(data_test$label))
```
The accuracy is $0.559$, which is not good: it is only slightly better than a coin toss. The 95% confidence interval for the accuracy is $(0.541, 0.5769)$.
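As a side note, the interval reported by `confusionMatrix` can be reproduced from the raw counts; the sketch below (not evaluated) assumes the `predictions` and `data_test` objects from the chunk above and an exact binomial interval.
```{r accuracy_ci_sketch, echo=TRUE, eval=FALSE}
# Sketch: reproduce the reported accuracy and its 95% confidence interval
# from the number of correct predictions on the test set.
correct <- sum(as.character(predictions) == as.character(data_test$label))
n <- nrow(data_test)
correct / n                       # point estimate of the accuracy
binom.test(correct, n)$conf.int   # exact binomial 95% confidence interval
```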
#### Looking for Multicollinearity
In statistics, multicollinearity (also collinearity) is a phenomenon in which one predictor variable in a multiple regression model can be linearly predicted from the others with a substantial degree of accuracy. In this situation the coefficient estimates of the multiple regression may change erratically in response to small changes in the model or the data. Multicollinearity does not reduce the predictive power or reliability of the model as a whole, at least within the sample data set; it only affects calculations regarding individual predictors. That is, a multivariate regression model with collinear predictors can indicate how well the entire bundle of predictors predicts the outcome variable, but it may not give valid results about any individual predictor, or about which predictors are redundant with respect to others.
```{r Multicollinearity,echo=FALSE,results=T,message=F,warning=F}
library("car")
vif(model)
```
All values are close to one, which is what we expect: there is no sign of multicollinearity between the predictors.
#### Residual Analysis
The difference between the observed value of the dependent variable ($y$) and the predicted value ($\hat{y}$) is called the residual ($e$). Each data point has one residual.
Residual = Observed value - Predicted value
$e = y - \hat{y}$
For a least-squares fit with an intercept, both the sum and the mean of the residuals are equal to zero. That is, $\sum e = 0$ and $\bar{e} = 0$.
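As a quick numerical check of this property, the following sketch (not evaluated) assumes the `model` object fitted above:
```{r residual_check_sketch, echo=TRUE, eval=FALSE}
# With an intercept in the model, the response residuals of the fit above
# should sum (and average) to zero up to numerical precision.
res <- residuals(model, type = "response")
c(sum = sum(res), mean = mean(res))
```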
```{r Residuos,echo=FALSE,results=T,message=F,warning=F}
par(mfrow=c(2,2))
plot(model)
```
A residual plot is a graph that shows the residuals on the vertical axis and the independent variable on the horizontal axis. If the points in a residual plot are randomly dispersed around the horizontal axis, a linear regression model is appropriate for the data; otherwise, a non-linear model is more appropriate.
## Step 2
We now take more control over the computational nuances of the training function: we use 10-fold cross-validation, repeated 3 complete times, so every model is evaluated on 30 resamples. This is more robust than the single train/test split we have tried so far.
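For reference, this is the resampling scheme already defined in the setup chunk; it is repeated below (not evaluated) only to make the configuration explicit.
```{r cv_control_sketch, echo=TRUE, eval=FALSE}
# 10-fold cross-validation, repeated 3 times: every model is scored on 30 resamples
control <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
```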
```{r transformation_1,echo=FALSE,results=F,message=F,warning=F}
# transformation
# Convert the label variable to a factor: it was loaded as numeric, and most
# classification algorithms expect a factor outcome.
# After the conversion it has two levels.
dataset$label <- as.factor(dataset$label)
```
*We will use 10 other algorithms besides logistic regression.*
Instead of splitting the dataset into training and test sets, we'll use cross-validation.
**Cross-validation**, sometimes called rotation estimation, or out-of-sample testing is any of various similar model validation techniques for assessing how the results of a statistical analysis will generalize to an independent data set. It is mainly used in settings where the goal is prediction, and one wants to estimate how accurately a predictive model will perform in practice.
### Other attempts with other algorithms
We'll try Logistic Regression again using this technique (k-fold cross-validation) to see whether it yields better results.
**Logistic regression** is a classification algorithm traditionally limited to two-class problems. In this case we have exactly two classes (labels); we'll also use Linear Discriminant Analysis as another linear classification technique to compare against.
**GLMNET**
Extremely efficient procedures for fitting the entire lasso or elastic-net
regularization path for linear regression, logistic and multinomial regression
models, Poisson regression and the Cox model.
**SVM Radial**
Support vector machines are a well-known and very strong classification technique. Rather than fitting a probabilistic model, they construct hyperplanes (in the simplest case, lines) that separate the data in some feature space into different regions.
**kNN** - k-nearest neighbors algorithm
In pattern recognition, the k-nearest neighbors algorithm (k-NN) is a non-parametric method used for classification and regression.
**Naive Bayes**
The Naive Bayes Classifier technique is based on the so-called Bayesian theorem and is particularly suited when the dimensionality of the inputs is high.
**Decision Trees** are commonly used in data mining with the objective of creating a model that predicts the value of a target (or dependent variable) based on the values of several input (or independent variables).
**CART**
The CART (Classification & Regression Trees) methodology was introduced in 1984 by Leo Breiman, Jerome Friedman, Richard Olshen and Charles Stone as an umbrella term for decision trees used for both classification and regression.
**C5.0**
C5.0 is an algorithm developed by Ross Quinlan for generating decision trees.
**Bagged CART**
A bagging (bootstrap aggregation) ensemble of CART trees for predictive modeling.
**Random Forest**
Random forests or random decision forests are an ensemble learning method for classification, regression and other tasks that operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or the mean prediction (regression) of the individual trees.
**Stochastic Gradient Boosting** (Generalized Boosted Modeling)
Gradient boosting is a machine learning technique for regression and classification problems, which produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees.
```{r algorithms_10,echo=FALSE,results=F,message=F,warning=F}
library(caret,quietly = T,verbose = F,warn.conflicts = F)
set.seed(seed)
fit.glm <- train(label ~., data=dataset, method="glm", metric=metric, trControl=control)
set.seed(seed)
fit.lda <- train(label ~., data=dataset, method="lda",
metric=metric, preProc=c("center", "scale"), trControl=control)
set.seed(seed)
fit.glmnet <- train(label ~., data=dataset, method="glmnet", metric=metric, preProc=c("center", "scale"), trControl=control)
set.seed(seed)
fit.svmRadial <- train(label ~., data=dataset, method="svmRadial", metric=metric, preProc=c("center", "scale"), trControl=control, fit=FALSE)
set.seed(seed)
fit.knn <- train(label ~., data=dataset, method="knn", metric=metric, preProc=c("center", "scale"), trControl=control)
set.seed(seed)
fit.nb <- train(label ~., data=dataset, method="nb", metric=metric, trControl=control)
set.seed(seed)
fit.cart <- train(label ~., data=dataset, method="rpart", metric=metric, trControl=control)
set.seed(seed)
fit.c50 <- train(label ~., data=dataset, method="C5.0", metric=metric, trControl=control)
set.seed(seed)
fit.treebag <- train(label ~., data=dataset, method="treebag", metric=metric, trControl=control)
set.seed(seed)
fit.rf <- train(label ~., data=dataset, method="rf", metric=metric, trControl=control)
set.seed(seed)
fit.gbm <- train(label ~., data=dataset, method="gbm", metric=metric, trControl=control, verbose=FALSE)
results <- resamples(list(lda=fit.lda, logistic=fit.glm, glmnet=fit.glmnet,
svm=fit.svmRadial, knn=fit.knn, nb=fit.nb, cart=fit.cart, c50=fit.c50,
bagging=fit.treebag, rf=fit.rf, gbm=fit.gbm))
```
### Final considerations
**Accuracy**
In classification, accuracy is the proportion of observations that the model classifies correctly, and it is commonly expressed as a percentage.
**KAPPA**
Cohen's kappa coefficient (κ) is a statistic which measures inter-rater agreement for qualitative (categorical) items. It is generally thought to be a more robust measure than simple percent agreement calculation, as κ takes into account the possibility of the agreement occurring by chance.
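As a small illustration of how $\kappa$ is computed, the sketch below uses a purely hypothetical 2x2 confusion matrix, only to show the arithmetic.
```{r kappa_sketch, echo=TRUE, eval=FALSE}
# kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and
# p_e the agreement expected by chance from the marginal totals.
tab <- matrix(c(40, 10, 15, 35), nrow = 2)  # hypothetical counts: rows = predicted, cols = observed
n   <- sum(tab)
p_o <- sum(diag(tab)) / n
p_e <- sum(rowSums(tab) * colSums(tab)) / n^2
(p_o - p_e) / (1 - p_e)
```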
### Table comparison
```{r table_compare,echo=FALSE,results=T,message=F,warning=F}
summary(results)
```
We can see in the table above that the highest accuracy, 79.87%, was achieved by **CART**, followed by **RF** with 79.60% and **SVM** with 78.87%.
The best *Kappa* value was obtained by **RF** with 0.574, followed by **SVM** with 0.555 and **CART** with 0.542.
Since we use cross-validation, we should compare the models on the average over the 30 resamples. By that criterion **RF was better**, both in accuracy and in Kappa, with 77.85% and 0.561 respectively.
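A minimal sketch (not evaluated) of how those averages can be extracted from the `results` object created above:
```{r mean_metrics_sketch, echo=TRUE, eval=FALSE}
# Columns of results$values are named "model~metric"; average each model's
# Accuracy and Kappa over the 30 resamples and sort from best to worst.
vals <- results$values
sort(colMeans(vals[, grep("~Accuracy$", names(vals))]), decreasing = TRUE)
sort(colMeans(vals[, grep("~Kappa$", names(vals))]), decreasing = TRUE)
```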
### Boxplot comparison
In descriptive statistics, a box plot or boxplot is a method for graphically depicting groups of numerical data through their quartiles. Box plots may also have lines extending vertically from the boxes (whiskers) indicating variability outside the upper and lower quartiles, hence the terms box-and-whisker plot and box-and-whisker diagram. Outliers may be plotted as individual points.
```{r boxplot_compare,echo=FALSE,results=T,message=F,warning=F}
bwplot(results)
```
The boxplot chart corroborates the values reported in the table above for the best-performing models.
### Dot-plot comparison
```{r dotplot_compare,echo=FALSE,results=T,message=F,warning=F}
dotplot(results)
```
The dot-plot comparison agrees with the boxplot, also showing how far the minimum and maximum values deviate from the average.
## Conclusion
The Random Forest (RF) algorithm for regression and classification has gained considerable popularity since its introduction in 2001. It has grown into a standard classification approach competing with logistic regression in many innovation-friendly scientific fields.
In this context, we present a small benchmarking experiment based on a single dataset, comparing the prediction performance of the original version of RF with default parameters and Logistic Regression as binary classification tools. The design of our benchmark experiment is inspired by clinical trial methodology, thus avoiding common pitfalls and major sources of bias.
**Random Forest** (RF) performed better than **Logistic Regression** (LR) according to the accuracy measured on the dataset. The mean difference between **RF** and **LR** was $0.22$ in accuracy and $0.443$ in Kappa, both measures thus suggesting a clearly better performance of **RF**. As a side result of our benchmarking experiment, we observed that the results were noticeably dependent on the evaluation protocol (cross-validation versus a single train/test split), which emphasizes the importance of stating this choice clearly. We also stress that neutral studies similar to ours, based on a larger number of datasets for training and testing and carefully designed, will be necessary in the future to evaluate further variants, implementations or parameters of random forests that may yield improved accuracy compared to the original version with default values.
*Source*
Some explanatory texts, such as the descriptions of CART and RF among others, were taken from the Internet (Wikipedia).