---
title: "README"
author: "Qian"
date: "2018/9/8"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## Background and Introduction of the Project
Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement, a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it. The goal of this project is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants who were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information is available from the website here: http://groupware.les.inf.puc-rio.br/har (see the section on the Weight Lifting Exercise Dataset).
## Data Preprocessing
The goal is to predict the classe variable from the available predictors.
In the csv files, each row represents one barbell lift performed by a participant, and each column represents an attribute of that lift.
### Load Libraries
To analyze the data, I load the following libraries.
```{r, warning=FALSE, message=FALSE}
library(caret)
library(ggplot2)      # used for the classe distribution plot (also attached by caret)
library(rattle)
library(rpart)
library(randomForest)
library(dplyr)
library(corrplot)
```
### Import Data
The training and testing data sets are downloaded from the website and read in below.
```{r, warning=FALSE, message=FALSE}
training <- read.csv("pml-training.csv", na.strings = c("NA", "","#DIV/0!"))
testing <- read.csv("pml-testing.csv", na.strings = c("NA", "","#DIV/0!"))
```
### Data Cleaning
First, we check the number of observations and variables in each data set.
```{r}
nrow(training); ncol(training)
nrow(testing); ncol(testing)
```
The training data set has a total of 19622 obs. and 160 variables.
The testing data set has a total of 20 obs. and 160 variables.
```{r}
colnames(training[,1:7])
training <- training[,-c(1:7)]
testing <- testing[,-c(1:7)]
```
Inspection of the column names shows that the first 7 columns (row index, user name, timestamps, and window indicators) are not informative for predicting the classe variable.
Hence, we adjust both data frames to exclude the first 7 columns.
```{r, results='hide'}
sapply(training, function(x) sum(is.na(x)))
sapply(testing, function(x) sum(is.na(x)))
training <- training[, colSums(is.na(training)) == 0]
testing <- testing[, colSums(is.na(testing)) == 0]
```
Inspection also shows that some variables are NA in nearly all observations; hence, columns containing NA values are removed from both data frames.
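As a quick sanity check (a small addition to the original analysis), we can confirm that the cleaned training and testing sets keep the same predictor columns; only the last column should differ (classe in training, problem_id in testing).
```{r}
## Number of columns left after cleaning, and a check that the predictor
## columns (everything except the last column) match between the two sets.
ncol(training); ncol(testing)
identical(colnames(training)[-ncol(training)], colnames(testing)[-ncol(testing)])
```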
## Exploratory Analysis
### Correlation Between Numeric Variables
```{r fig.cap= "The correlation between numeric variables", fig.width= 15, fig.height=10,fig.align='center'}
numeric.var <- sapply(training, is.numeric)
corr.matrix <- cor(training[,numeric.var])
corrplot(corr.matrix, type = "upper", order = "hclust",tl.col = "black", tl.srt = 45)
```
From the plot we can see that some variables are highly correlated with other variables of similar name (e.g., "accel_belt_z" and "accel_belt_y").
To confirm this, we list the row names and column names of the highly correlated variable pairs (correlation > 0.9 or < -0.9).
```{r}
## Find index of positively correlated variables.
indPositive <- which( corr.matrix >0.9 & corr.matrix< 1, arr.ind = TRUE)
## Find index of negatively correlated variables.
indNegative <- which( corr.matrix < -0.9, arr.ind = TRUE)
## Positively correlated variables
cbind(RowName = rownames(corr.matrix)[indPositive[, 1]], ColName = colnames(corr.matrix)[indPositive[, 2]])
## Negatively correlated variables
cbind(RowName = rownames(corr.matrix)[indNegative[, 1]], ColName = colnames(corr.matrix)[indNegative[, 2]])
```
### Distribution of Classe variable
```{r fig.align='center'}
ggplot(training, aes(x=classe)) + ggtitle("Classe") + xlab("Classe") + geom_bar(aes(y = 100*(..count..)/sum(..count..)), width = 0.5) + ylab("Percentage") + coord_flip() + theme_minimal()
```
We can see that the classe variable is reasonably balanced across the five classes.
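For exact numbers, a quick tabulation (a small addition, not shown in the original figure) gives the class proportions in percent.
```{r}
## Percentage of observations in each classe level.
round(100 * prop.table(table(training$classe)), 1)
```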
### Data Splitting
To estimate the in-sample and out-of-sample error rates, I set the seed to 1026 and split the training data (training) into two sets: a training set (train, 70%) and a testing set (test, 30%).
```{r}
set.seed(1026)
inTrain <- createDataPartition(training$classe, p = 0.7, list = FALSE)
train <- training[inTrain,]
test <- training[-inTrain,]
dim(train);dim(test)
```
The train set has 13737 observations and the test set has 5885 observations.
## Predicting
### Cross Validation
To reduce the computation time of model training, k is set to 3 for k-fold cross-validation.
```{r}
control <- trainControl(method = "cv", number = 3)
```
### Tree
To predict a factor variable, classification trees and random forests are two common techniques.
Because many of the variables are highly correlated, it is reasonable to expect that a single classification tree will do poorly unless the data set is processed further (e.g., by removing highly correlated variables; a sketch of this follows the tree results below). To check this, we train a model with the default settings and method rpart.
```{r fig.align='center'}
tree <- train(classe~., method = "rpart", data = train, trControl = control)
print(tree)
fancyRpartPlot(tree$finalModel)
```
As expected, we can see that the accuracy is low.
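One way to do the further processing mentioned above would be to drop highly correlated predictors before refitting the tree. This was not part of the original analysis; the sketch below (not evaluated) uses caret's findCorrelation() on the correlation matrix computed earlier and assumes a cutoff of 0.9.
```{r, eval=FALSE}
## Names of numeric predictors with pairwise correlation above the 0.9 cutoff.
highCorr <- findCorrelation(corr.matrix, cutoff = 0.9, names = TRUE)
highCorr
## Reduced training set without those predictors (classe is retained because
## it is not numeric and therefore not part of corr.matrix).
trainReduced <- train[, !(colnames(train) %in% highCorr)]
## Refit the classification tree on the reduced predictor set.
treeReduced <- train(classe ~ ., method = "rpart", data = trainReduced, trControl = control)
print(treeReduced)
```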
### Random Forest
Since the performance of the classification tree is poor, a random forest is considered instead.
First, we train the model using the random forest algorithm.
```{r}
## Note: trControl is an argument for caret::train(); randomForest() ignores it,
## so it is omitted here. randomForest() estimates error internally via OOB samples.
rf <- randomForest(classe ~ ., data = train)
print(rf)
```
Looking at the confusion matrix and the out-of-bag (OOB) error rate of the model, we can see that the random forest performs exceptionally well.
Now we validate the result on the held-out test set to make sure the model is not overfitting.
```{r}
pre <- predict(rf, test)
confusionMatrix(pre, test$classe)
```
From the printout we see that the accuracy of the prediction is 99.41%, which is very high compared to the classification tree.
The estimated out-of-sample error rate is thus 1 - 0.9941 = 0.59%.
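The same number can also be extracted programmatically from the confusionMatrix object (a small convenience addition):
```{r}
## Estimated out-of-sample error rate = 1 - accuracy on the held-out test set.
oos.error <- 1 - confusionMatrix(pre, test$classe)$overall["Accuracy"]
round(100 * as.numeric(oos.error), 2)
```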
## Conclusion
In conclusion, the two algorithms implemented in this project differ considerably in performance. As analyzed in the Exploratory Analysis section, many variables are highly correlated with each other. While a single decision tree splits on individual variables at each node, a random forest grows multiple trees on bootstrap samples and considers only a random subset of predictors at each split. The correlated variables therefore have little impact on the random forest, because only a subset of them is available as candidates at any given split.
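To see which predictors the fitted forest actually relies on, the variable importance can be inspected (an optional addition to the original write-up):
```{r fig.align='center'}
## Mean decrease in Gini for the 20 most important predictors in the fitted forest.
varImpPlot(rf, n.var = 20, main = "Random forest variable importance")
```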