<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8" />
<title>Practical Machine Learning Course Submission</title>
</head>
<body>
The processing steps performed in the uploaded R script are summarised below. The code itself is also reasonably well commented.
<h2>1) Loading data</h2>
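The write-up does not elaborate on this step. As a minimal sketch, assuming the standard course CSV files (the file names and object names here are assumptions, not taken from the script itself):
<pre><code>
# Load the training data and the final (graded) test data.
# File names and paths are assumed; adjust as needed.
training_raw <- read.csv("pml-training.csv")
final_test   <- read.csv("pml-testing.csv")
</code></pre>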
<h2>2) Pre-Processing</h2>
Only data columns (attributes) that provide some differentiation in the final test data set are kept.
This could be considered 'cheating', since the final test data is examined; however, it is reasonable to assume that in most applications requiring a prediction it would be known in advance which attributes will be gathered.
All of the attributes in the final test set are converted to factors, and those with fewer than two levels (i.e. no differentiation between any of the samples) are dropped from the data.
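A sketch of this filtering step, continuing from the loading sketch above (object names are illustrative; 'classe' is the outcome column in the training data):
<pre><code>
# View every test-set column as a factor and count its distinct values;
# columns with fewer than two levels (including all-NA columns) carry no
# information about the test samples and are dropped.
final_factors <- lapply(final_test, factor)
n_levels      <- sapply(final_factors, nlevels)
keep          <- names(final_test)[n_levels >= 2]

# Keep the same columns in the training data, plus the outcome.
predictors    <- intersect(keep, names(training_raw))
training_kept <- training_raw[, c(predictors, "classe")]
training_kept$classe <- factor(training_kept$classe)  # caret expects a factor outcome
</code></pre>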
<h2>3) Data Splitting</h2>
The training data is split into three sets: 70% of the observations are used for model training, 20% for cross-validation and any required model tuning, and 10% for making a final estimate of the model's accuracy.
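A sketch of this 70/20/10 split using caret's createDataPartition (the seed and object names are assumptions):
<pre><code>
library(caret)
set.seed(1234)  # assumed seed, for reproducibility

# First carve off the 70% training set...
in_train  <- createDataPartition(training_kept$classe, p = 0.70, list = FALSE)
train_set <- training_kept[in_train, ]
rest      <- training_kept[-in_train, ]

# ...then split the remaining 30% into 20% validation and 10% test,
# i.e. two thirds and one third of what is left.
in_valid  <- createDataPartition(rest$classe, p = 2/3, list = FALSE)
valid_set <- rest[in_valid, ]
test_set  <- rest[-in_valid, ]
</code></pre>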
<h2>4) Model Training</h2>
The model is then trained using the random forest method (method='rf') from the caret package. As an initial attempt, all of the default settings are used, and all attributes (i.e. all data columns other than the class) are included as predictors.
This training step takes rather a long time to run.
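With caret this step reduces to a single call, sketched below (the object names carried over from the sketches above are assumptions):
<pre><code>
library(caret)

# Random forest with caret's defaults; with the full predictor set this
# is the slow step referred to above.
model <- train(classe ~ ., data = train_set, method = "rf")
</code></pre>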
<h2>5) Checking Accuracy</h2>
The accuracy of the model was then checked against the training and cross-validation sets. Both gave a very low misclassification error. The confusion matrices are reproduced below for reference:
<pre>
Confusion Matrix and Statistics (Training Set)

          Reference
Prediction    A    B    C    D    E
         A 3906    0    0    0    0
         B    0 2658    0    0    0
         C    0    0 2395    0    0
         D    0    0    0 2251    0
         E    0    0    0    0 2525

Overall Statistics

               Accuracy : 1
                 95% CI : (0.9997, 1)
    No Information Rate : 0.2844
    P-Value [Acc > NIR] : < 2.2e-16
                  Kappa : 1
 Mcnemar's Test P-Value : NA
</pre>
<pre>
Confusion Matrix and Statistics (Validation Set)

          Reference
Prediction    A    B    C    D    E
         A 1116    0    0    0    0
         B    0  760    0    0    0
         C    0    0  685    0    0
         D    0    0    0  644    0
         E    0    0    0    0  722

Overall Statistics

               Accuracy : 1
                 95% CI : (0.9991, 1)
    No Information Rate : 0.2842
    P-Value [Acc > NIR] : < 2.2e-16
                  Kappa : 1
 Mcnemar's Test P-Value : NA
</pre>
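The matrices above can be reproduced with caret's confusionMatrix, sketched here with the assumed object names from the earlier sketches:
<pre><code>
library(caret)

# Compare predicted classes against the true classes for each set.
confusionMatrix(predict(model, train_set), train_set$classe)  # training set
confusionMatrix(predict(model, valid_set), valid_set$classe)  # validation set
</code></pre>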
Given how low the misclassification error was, it was decided that no further model tuning was required; all that remained was to estimate the out-of-sample error.
This is done by predicting the class of the held-out test instances. The confusion matrix for the test set is given below; the point estimate of the out-of-sample accuracy is 1.
<pre>
Confusion Matrix and Statistics (Test Set)

          Reference
Prediction   A   B   C   D   E
         A 558   0   0   0   0
         B   0 379   0   0   0
         C   0   0 342   0   0
         D   0   0   0 321   0
         E   0   0   0   0 360

Overall Statistics

               Accuracy : 1
                 95% CI : (0.9981, 1)
    No Information Rate : 0.2847
    P-Value [Acc > NIR] : < 2.2e-16
                  Kappa : 1
 Mcnemar's Test P-Value : NA
</pre>
Because no modifications were made to the base random forest method, it is likely that a slightly better model could be obtained by re-running the training on all of the data; however, it would then not be possible to quote an expected out-of-sample accuracy.
<h2>6) Make Predictions</h2>
Finally, predictions are made for the actual test set (for which the true classification is not known), as sketched below.
Rather strangely, this predicted that all 20 of the test cases were in class 'A', but this was the answer submitted.
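The prediction itself is a one-liner; a sketch using the assumed object names from above:
<pre><code>
# Predict the 20 graded test cases with the trained model.
predict(model, final_test)
</code></pre>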
</body>
</html>