<!DOCTYPE html>
<html>
<head>
<title>Categorisation of inertial activity data - datawerk</title>
<meta charset="utf-8" />
<link href="https://buhrmann.github.io/theme/css/bootstrap-custom.css" rel="stylesheet"/>
<link href="https://buhrmann.github.io/theme/css/pygments.css" rel="stylesheet"/>
<link href="https://buhrmann.github.io/theme/css/style.css" rel="stylesheet" />
<link href="//maxcdn.bootstrapcdn.com/font-awesome/4.2.0/css/font-awesome.min.css" rel="stylesheet">
<link rel="shortcut icon" type="image/png" href="https://buhrmann.github.io/theme/css/logo.png">
<meta name="viewport" content="width=device-width, initial-scale=1, maximum-scale=1">
<meta name="author" contents="Thomas Buhrmann"/>
<meta name="keywords" contents="datawerk, R,report,classification,svm,random forest,lda,"/>
<script>
(function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
(i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
})(window,document,'script','//www.google-analytics.com/analytics.js','ga');
ga('create', 'UA-56071357-1', 'auto');
ga('send', 'pageview');
</script> </head>
<body>
<div class="wrap">
<div class="container-fluid">
<div class="header">
<div class="container">
<nav class="navbar navbar-default navbar-fixed-top" role="navigation">
<div class="navbar-header">
<button type="button" class="navbar-toggle collapsed" data-toggle="collapse" data-target=".navbar-collapse">
<span class="sr-only">Toggle navigation</span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
</button>
<a class="navbar-brand" href="https://buhrmann.github.io">
<!-- <span class="fa fa-pie-chart navbar-logo"></span> datawerk -->
<span class="navbar-logo"><img src="https://buhrmann.github.io/theme/css/logo.png" style=""></img></span>
</a>
</div>
<div class="navbar-collapse collapse">
<ul class="nav navbar-nav">
<!--<li><a href="https://buhrmann.github.io/archives.html">Archives</a></li>-->
<li><a href="https://buhrmann.github.io/posts.html">Blog</a></li>
<li><a href="https://buhrmann.github.io/pages/cv.html">Interactive CV</a></li>
<li class="dropdown">
<a href="#" class="dropdown-toggle" data-toggle="dropdown">Data Reports<span class="caret"></span></a>
<ul class="dropdown-menu" role="menu">
<!--<li class="divider"></li>
<li class="dropdown-header">Data Science Reports</li>-->
<li >
<a href="https://buhrmann.github.io/p2p-loans.html">Interest rates on <span class="caps">P2P</span> loans</a>
</li>
<li >
<a href="https://buhrmann.github.io/activity-data.html">Categorisation of inertial activity data</a>
</li>
<li >
<a href="https://buhrmann.github.io/titanic-survival.html">Titanic survival prediction</a>
</li>
</ul>
</li>
<li class="dropdown">
<a href="#" class="dropdown-toggle" data-toggle="dropdown">Data Apps<span class="caret"></span></a>
<ul class="dropdown-menu" role="menu">
<!--<li class="divider"></li>
<li class="dropdown-header">Data Science Reports</li>-->
<li >
<a href="https://buhrmann.github.io/elegans.html">C. elegans connectome explorer</a>
</li>
<li >
<a href="https://buhrmann.github.io/dash+.html">Dash+ visualization of running data</a>
</li>
</ul>
</li>
</ul>
</div>
</nav>
</div>
</div><!-- header -->
</div><!-- container-fluid -->
<div class="container main-content">
<div class="row row-centered">
<div class="col-centered col-max col-min col-sm-12 col-md-10 col-lg-10 main-content">
<section id="content" class="article content">
<header>
<span class="entry-title-info">Nov 06 · <a href="https://buhrmann.github.io/category/reports.html">Reports</a></span>
<h2 class="entry-title entry-title-tight">Categorisation of inertial activity data</h2>
</header>
<div class="entry-content">
<p>The ubiquity of mobile phones equipped with a wide range of sensors presents interesting opportunities for data mining applications. In this report we aim to find out whether data from accelerometers and gyroscopes can be used to identify physical activities performed by subjects wearing mobile phones on their wrist.</p>
<p><img src="/images/activitycat/muybridge.jpg" alt="Human activity" width="1000"/></p>
<h3>Methods</h3>
<p>The data used in this analysis is based on the “Human activity recognition using smartphones” data set available from the <span class="caps">UCI</span> Machine Learning Repository [1]. A preprocessed version was downloaded from the Data Analysis online course [2]. The set contains data derived from 3-axial linear acceleration and 3-axial angular velocity sampled at 50Hz from a Samsung Galaxy S <span class="caps">II</span>. These signals were preprocessed using various filters and other methods to reduce noise and to separate low- and high-frequency components. From this data a set of 17 individual signals was extracted, e.g. by separating accelerations due to gravity from those due to body motion, separating the acceleration magnitude into its individual axis-aligned components, and so on. The final feature variables were calculated from both the time and frequency domain of these signals. They cover too large a range to describe entirely here, but examples include variables related to the spread and centre of each signal, its entropy, and its skewness and kurtosis in frequency space, among many others.</p>
<p>All data was recorded while subjects (age 19-48) performed one of six activities and labelled accordingly: lying, sitting, standing, walking, walking down stairs and walking up stairs.</p>
<p>The problem to be solved in our analysis is the prediction of the activity class from sensor data. Since we are only interested in prediction, and not in producing an accurate or easily comprehensible model of the relation between activity and sensor data, we have chosen to investigate the performance of the following three classifiers only: random forest (<span class="caps">RF</span>), support vector machine (<span class="caps">SVM</span>) and linear discriminant analysis (<span class="caps">LDA</span>). A short description of each algorithm is given in the next sections.</p>
<p>In order to assess and compare the performance of these classifiers we separated the data into a training and a test set. The latter consisted of data for subjects 27 to 30 and the former of the remainder.</p>
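<p>A minimal sketch of this split, assuming the preprocessed data comes as a single data frame with <code>subject</code> and <code>activity</code> columns (as in the downloaded .rda file), might look as follows:</p>
<pre><code># Hold out subjects 27-30 so that no subject appears in both sets.
load("samsungData.rda")   # assumed to contain a data frame 'samsungData'
samsungData$activity &lt;- factor(samsungData$activity)

test.idx &lt;- samsungData$subject %in% 27:30
testSet  &lt;- samsungData[test.idx, ]
trainSet &lt;- samsungData[!test.idx, ]
</code></pre>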
<h4>Random forests</h4>
<p>Random forests are a recursive partitioning method [3]. In the case of classification, the algorithm creates a set of decision trees calculated on random subsets of the data, using at each split of a decision tree a random subset of predictors. The final prediction is made on the basis of a majority vote across all trees. Random forests were chosen for this analysis in part because of their accuracy and their applicability to large data sets without the need for feature selection.</p>
<p>Because the trees in random forests are already built from random subsamples of the data, they do not require cross-validation to estimate accuracy, and the <span class="caps">OOB</span> (out-of-bag) error calculated internally is generally considered a good estimate of prediction error. They also do not require the tuning of many hyper-parameters. The algorithm is not sensitive, for example, to the number of trees fitted, as long as that number is greater than a few hundred. However, some have reported variation in performance depending on the proportion of variables tested at each split. We therefore tuned this parameter using a monotonic error reduction criterion which searches for performance improvements to both sides of the default value (the square root of the number of variables, approx. 23 in this case). Using the best identified value we then trained a final random forest for prediction.</p>
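<p>This kind of search is what <code>tuneRF</code> in the randomForest package provides; a sketch, continuing the split above (the feature columns are assumed to be everything except <code>subject</code> and <code>activity</code>):</p>
<pre><code>library(randomForest)

x &lt;- trainSet[, !(names(trainSet) %in% c("subject", "activity"))]
y &lt;- trainSet$activity

# Search to both sides of mtry = sqrt(p) while the OOB error keeps improving.
tuned     &lt;- tuneRF(x, y, ntreeTry = 100, stepFactor = 1.5, improve = 0.01)
best.mtry &lt;- tuned[which.min(tuned[, "OOBError"]), "mtry"]

# Final forest using the best mtry found.
rf.fit &lt;- randomForest(x, y, mtry = best.mtry, ntree = 500, importance = TRUE)
</code></pre>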
<p>Random forests can conveniently provide a measure of each predictor’s importance. This is achieved by comparing the performance of the trees before and after shuffling the values of the variable in question, thereby removing its relation with the outcome variable.</p>
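<p>Continuing the sketch, these importances can be read off the fitted forest directly:</p>
<pre><code># Permutation importance (mean decrease in accuracy) and Gini-based importance.
imp &lt;- importance(rf.fit)
head(imp[order(imp[, "MeanDecreaseAccuracy"], decreasing = TRUE), ])
varImpPlot(rf.fit)
</code></pre>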
<h4>Support vector machines</h4>
<p>Support vector machines (SVMs) classify data by separating it into classes such that the distance between the decision boundary and the closest data points is maximised (i.e. by finding maximum-margin hyperplanes) [4]. The algorithm is based on a mathematical trick that allows the use of simple linear boundaries in a high-dimensional non-linear feature space, without requiring explicit computation of this complex transformation of the data. The mapping into feature space is defined by kernel functions, which can be selected based on the classification problem. The data is then modeled using a weighted combination of the closest points in transformed space (the support vectors).</p>
<p>Here we use the <span class="caps">SVM</span> classifier provided in the e1071 package for R [5]. For multiclass problems this algorithm performs a one-against-one voting scheme. We chose the default optimization method “C-classification”, where the hyper-parameter C scales the misclassification cost, such that the higher the value the more complex the model (i.e. the lower the bias and the higher the variance). We also chose the radial basis kernel, which is commonly considered a good first choice. The cost parameter C, along with γ, which defines the size of the kernel (the spatial extent of the influence of a training example), was tuned using grid search [6] with 10-fold cross-validation (using the tuning function provided in the e1071 package).</p>
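<p>A sketch of this grid search using e1071’s <code>tune.svm</code> (10-fold cross-validation is its default), run on a 20% subsample as described in the Results section below:</p>
<pre><code>library(e1071)

# Tune on a random 20% subsample of the training set to save time.
sub &lt;- trainSet[sample(nrow(trainSet), round(0.2 * nrow(trainSet))), ]
tuned.svm &lt;- tune.svm(activity ~ . - subject, data = sub,
                      kernel = "radial",
                      gamma = 10^(-6:-1), cost = 10^(0:2))

# Final SVM trained on the full training set with the best parameters.
svm.fit &lt;- svm(activity ~ . - subject, data = trainSet, kernel = "radial",
               gamma = tuned.svm$best.parameters$gamma,
               cost  = tuned.svm$best.parameters$cost)
</code></pre>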
<h4>Linear discriminant analysis</h4>
<p>Linear discriminant analysis (<span class="caps">LDA</span>) is similar to <span class="caps">SVM</span> in that it also tries to transform the problem such that classes separated by non-linear decision boundaries become linearly separable [4]. Instead of using kernels and support vectors, however, it identifies a linear transformation of the predictor variables (a “discriminant function”) that allows for more accurate classification than individual predictors. Identification of the transformation is based on the maximisation of the ratio of between-class variance to within-class variance. The transformation thereby maximises the separation between classes.</p>
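<p>A minimal sketch using the <code>lda</code> function from R’s <span class="caps">MASS</span> package (column names as assumed above):</p>
<pre><code>library(MASS)

# Note: with highly collinear predictors lda() warns about collinearity;
# see the correlation filtering described in the Results section.
lda.fit  &lt;- lda(activity ~ . - subject, data = trainSet)
lda.pred &lt;- predict(lda.fit, newdata = testSet)  # $class, $posterior, $x (discriminants)
</code></pre>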
<h4>Combination of classifiers</h4>
<p>We evaluate the performance of each classifier using its error rate (the proportion of misclassified data) or equivalently its accuracy (proportion of correctly classified data). We then combine all three methods using a simple majority vote on the prediction set.</p>
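<p>A simple majority vote over three factor-valued predictions might be sketched as follows (ties are broken in favour of the first level here; this detail is not specified in the text):</p>
<pre><code># Combine three predictions (factors with identical levels) by majority vote.
majority.vote &lt;- function(p1, p2, p3) {
  votes &lt;- cbind(as.character(p1), as.character(p2), as.character(p3))
  factor(apply(votes, 1, function(v) names(which.max(table(v)))),
         levels = levels(p1))
}

# e.g. combined &lt;- majority.vote(rf.pred, svm.pred, lda.pred$class)
</code></pre>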
<h3>Results</h3>
<p>The data set contains 7352 observations of 561 features (in addition to a subject index and the activity performed). Of the 21 subjects included in the data, the last four were used only for evaluating the final performance of the algorithms (test set, 1485 observations) and the rest for training (5867 observations). The same sets were used for all classifiers unless stated otherwise. Data was reasonably evenly distributed across activities (number of data points in each class: lying=1407, sitting=1286, standing=1374, walking=1226, walking down stairs=986, walking up stairs=1073). Since the classifiers used here do not make strong assumptions about the distribution of the data (they are relatively robust), no detailed investigation of the statistical properties of individual features was performed. In particular, the methods employed did not require transformations of individual features (e.g. to improve the normality of their distribution). However, as can be expected from the fact that all features derive from the same few sensor signals, the data exhibits high collinearity. While this would have led to problems with confounders in e.g. a regression model, it was not generally an issue for the methods employed here. It was addressed explicitly for the <span class="caps">LDA</span> however (see below).</p>
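<p>Both the class balance and the degree of collinearity are quick to check; a sketch (feature columns assumed as before):</p>
<pre><code># Class balance in the training set.
table(trainSet$activity)

# Share of feature pairs with absolute correlation above 0.9.
feat &lt;- trainSet[, !(names(trainSet) %in% c("subject", "activity"))]
corr &lt;- abs(cor(feat))
mean(corr[upper.tri(corr)] &gt; 0.9)
</code></pre>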
<p>We first report results from individual classifiers and then their combination.</p>
<h4>Random Forest</h4>
<p>We tuned the proportion of variables considered at each split using 100 trees for each evaluation. The best value found was 20. A final random forest was then trained using the optimal value and 500 trees. The error rate remained low (&lt; 5%) and stable after about 250 trees had been added. Analysis of variable importances, considering both the mean decrease in accuracy and the Gini index, shows that the most significant variables are related to the acceleration due to gravity along the X and Y axes, as well as the mean angle with respect to gravity in the same directions (with corresponding measures from the time domain).</p>
<p>Figure 1 shows the data, color-coded by activity, in the two most important dimensions thus identified. We can see that several activities are already well separated in these two dimensions, but others (standing, walking and walking down stairs) largely overlap.</p>
<figure>
<img src="/images/activitycat/centers.png" alt="RF centers"/>
<figcaption class="capCenter">Figure 1: Scatter plot of data in the two most important dimensions according to the random forest. Bigger disks indicate the class centers (for each class the data point that has most nearest neighbours of the same class).</figcaption>
</figure>
<p>The error rate of the fitted <span class="caps">RF</span> is 1.6% on the training set and 4.6% on the test set (accuracy of 0.954). The confusion matrix of the predicted activities (Table 1) shows that misclassification is almost exclusively due to an inability to distinguish sitting from standing. For example, while precision is greater than 0.977 for all other activities, it is 0.912 and 0.876 for sitting and standing respectively. Apparently the activities showing large overlap in the two most important dimensions (see Figure 1) can easily be separated by taking other variables into account, while for sitting and standing this is not the case.</p>
<figure>
<div class="figCenter">
<TABLE class="table">
<TR>
<TH> </TH><TH> lying </TH><TH> sitting </TH><TH> standing </TH><TH> walking </TH><TH> walk down </TH><TH> walk up </TH><TH> precision </TH>
</TR>
<TR>
<TD align="right"> lying </TD> <TD align="right"> 293 </TD><TD align="right"> </TD> <TD align="right"> </TD> <TD align="right"> </TD> <TD align="right"> </TD><TD align="right"> </TD> <TD align="right"> 1.0 </TD>
</TR>
<TR>
<TD align="right"> sitting </TD> <TD align="right"> </TD><TD align="right"> 227 </TD> <TD align="right"> 22 </TD>
<TD align="right"> </TD> <TD align="right"> </TD><TD align="right"> </TD> <TD align="right"> 0.9116 </TD>
</TR>
<TR>
<TD align="right"> standing </TD> <TD align="right"> </TD><TD align="right"> 37 </TD> <TD align="right"> 261 </TD>
<TD align="right"> </TD> <TD align="right"> </TD><TD align="right"> </TD> <TD align="right"> 0.8758 </TD>
</TR>
<TR>
<TD align="right"> walking </TD> <TD align="right"> </TD><TD align="right"> </TD> <TD align="right"> </TD>
<TD align="right"> 228 </TD> <TD align="right"> 2 </TD> <TD align="right"> 1 </TD> <TD align="right"> 0.9870 </TD>
</TR>
<TR>
<TD align="right"> walk down </TD> <TD align="right"> </TD><TD align="right"> </TD> <TD align="right"> </TD>
<TD align="right"> </TD> <TD align="right"> 194 </TD> <TD align="right"> 1 </TD> <TD align="right"> 0.9949 </TD>
</TR>
<TR>
<TD align="right"> walk up </TD> <TD align="right"> </TD><TD align="right"> </TD> <TD align="right"> </TD>
<TD align="right"> 1 </TD> <TD align="right"> 4 </TD> <TD align="right"> 214 </TD> <TD align="right"> 0.9772 </TD>
</TR>
<TR>
<TD align="right"> sensitivity </TD> <TD align="right"> 1.0 </TD><TD align="right"> 0.8598 </TD> <TD align="right"> 0.9223 </TD> <TD align="right"> 0.9956 </TD> <TD align="right"> 0.97 </TD> <TD align="right"> 0.9907 </TD> <TD align="right"> accuracy=0.954 </TD>
</TR>
</TABLE>
</div>
<figcaption class="capCenter">Table 1: Confusion matrix of random forest predictions. Rows correspond to predicted, and columns to reference (real observed) activities. Zero counts are omitted for clarity and misclassifications appear in off-diagonal entries (precision = positive predictive value, sensitivity = true positive rate).</figcaption>
</figure>
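<p>The per-class measures reported in these tables follow directly from the confusion matrix; a sketch of their computation for the random forest (continuing the earlier sketches):</p>
<pre><code># Rows = predicted, columns = actual, as in the tables above.
rf.pred &lt;- predict(rf.fit, newdata = testSet[, names(x)])
conf    &lt;- table(predicted = rf.pred, actual = testSet$activity)

precision   &lt;- diag(conf) / rowSums(conf)  # positive predictive value
sensitivity &lt;- diag(conf) / colSums(conf)  # true positive rate
accuracy    &lt;- sum(diag(conf)) / sum(conf)
</code></pre>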
<p>The accuracy of the random forest can also be appreciated by comparing the actual activities in the test set with those predicted by the model. Figure 2 below shows, for the two most important variables, conditional density plots of both the actual and the predicted activities. In each panel the density plot shows the frequency of each activity as a function of the given variable. Clearly, at least in the two chosen dimensions, the model’s predictions match the actual distribution of activities very closely.</p>
<figure>
<img src="/images/activitycat/density.png" alt="RF CDF"/>
<figcaption class="capCenter">Figure 2: Conditional density plots for actual and predicted activities using the two most important variables of the data set.</figcaption>
</figure>
<h4>Support Vector Machine</h4>
<p>Tuning of <span class="caps">SVM</span> hyper-parameters using the training set resulted in optimal values of the cost C = 100 and kernel size γ = 0.001 (search was performed in intervals γ ∈ [1e-6, 0.1] and C ∈ [1,100]). To reduce computation time, the search was performed on a fraction (20%) of data randomly sampled from the training set. Using these optimal values a final <span class="caps">SVM</span> was trained on the whole set.</p>
<p>The resulting <span class="caps">SVM</span> uses 22.6% of the data points as support vectors (1326 out of 5867). Since this number depends on the tuned parameter C, which was found using cross-validation, we assume that we have not overfit the model. This is supported by the model’s high accuracy of 0.989 on the training set when averaged over a 10-fold cross validation. On the test set its accuracy is 0.96, i.e. slightly better than the random forest. </p>
<p>The confusion matrix of predictions is shown in Table 2. As we can see, the <span class="caps">SVM</span> exhibits perfect classification for all activities other than sitting and standing, where its performance is similar to the random forest.</p>
<figure>
<div class="figCenter">
<TABLE class="table">
<TR>
<TH> </TH><TH> lying </TH><TH> sitting </TH><TH> standing </TH><TH> walking </TH><TH> walk down </TH><TH> walk up </TH><TH> precision </TH>
</TR>
<TR>
<TD align="right"> lying </TD> <TD align="right"> 293 </TD><TD align="right"> </TD> <TD align="right"> </TD> <TD align="right"> </TD> <TD align="right"> </TD><TD align="right"> </TD> <TD align="right"> 1.0 </TD>
</TR>
<TR>
<TD align="right"> sitting </TD> <TD align="right"> </TD><TD align="right"> 232 </TD> <TD align="right"> 27 </TD>
<TD align="right"> </TD> <TD align="right"> </TD><TD align="right"> </TD> <TD align="right"> 0.8958 </TD>
</TR>
<TR>
<TD align="right"> standing </TD> <TD align="right"> </TD><TD align="right"> 32 </TD> <TD align="right"> 256 </TD>
<TD align="right"> </TD> <TD align="right"> </TD><TD align="right"> </TD> <TD align="right"> 0.8889 </TD>
</TR>
<TR>
<TD align="right"> walking </TD> <TD align="right"> </TD><TD align="right"> </TD> <TD align="right"> </TD>
<TD align="right"> 229 </TD> <TD align="right"> </TD> <TD align="right"> </TD> <TD align="right"> 1.0 </TD>
</TR>
<TR>
<TD align="right"> walk down </TD> <TD align="right"> </TD><TD align="right"> </TD> <TD align="right"> </TD>
<TD align="right"> </TD> <TD align="right"> 200 </TD> <TD align="right"> </TD> <TD align="right"> 1.0 </TD>
</TR>
<TR>
<TD align="right"> walk up </TD> <TD align="right"> </TD><TD align="right"> </TD> <TD align="right"> </TD>
<TD align="right"> </TD> <TD align="right"> </TD> <TD align="right"> 216 </TD> <TD align="right"> 1.0 </TD>
</TR>
<TR>
<TD align="right"> sensitivity </TD> <TD align="right"> 1.0 </TD><TD align="right"> 0.8788 </TD> <TD align="right"> 0.9046 </TD> <TD align="right"> 1.0 </TD> <TD align="right"> 1.0 </TD> <TD align="right"> 1.0 </TD> <TD align="right"> accuracy=0.96 </TD>
</TR>
</TABLE>
</div>
<figcaption class="capCenter">Table 2: Confusion matrix of <span class="caps">SVM</span> predictions. See Table 1 for further details.</figcaption>
</figure>
<h4>Linear Discriminant Analysis</h4>
<p><span class="caps">LDA</span> can be sensitive or even fail when the data exhibits a high degree of collinearity. Since our sensor data essentially consists of different transformations of the same few signals we can expect that this is indeed the case in our data set. We therefore performed two <span class="caps">LDA</span> classifications. For the first model (<span class="caps">LDA1</span>) the complete training set was used. For the second model (<span class="caps">LDA2</span>) we removed those variables that exhibited pair-wise correlations greater than R=0.9 (removing one from each pair) using the findCorrelation function in R’s caret package. A total of 346 variables were thus removed, leaving 215 less correlated predictors. Using these two training sets, <span class="caps">LDA</span> models were trained with 10-fold cross validation to assess whether we would expect a difference in their accuracy. The <span class="caps">LDA2</span> model, trained on relatively uncorrelated data, showed an error rate of 3.5%, and <span class="caps">LDA1</span> a rate of 5.2%. Based on these results we have to conclude that <span class="caps">LDA2</span> should be used for our final predictions.</p>
<p>Table 3 shows the confusion matrix for the <span class="caps">LDA2</span> model when predicting on the test set.</p>
<figure>
<div class="figCenter" >
<TABLE class="table">
<TR>
<TH> </TH><TH> lying </TH><TH> sitting </TH><TH> standing </TH><TH> walking </TH><TH> walk down </TH><TH> walk up </TH><TH> precision </TH>
</TR>
<TR>
<TD align="right"> lying </TD> <TD align="right"> 293 </TD><TD align="right"> </TD> <TD align="right"> </TD> <TD align="right"> </TD> <TD align="right"> </TD><TD align="right"> </TD> <TD align="right"> 1.0 </TD>
</TR>
<TR>
<TD align="right"> sitting </TD> <TD align="right"> </TD><TD align="right"> 223 </TD> <TD align="right"> 24 </TD>
<TD align="right"> </TD> <TD align="right"> </TD><TD align="right"> </TD> <TD align="right"> 0.9028 </TD>
</TR>
<TR>
<TD align="right"> standing </TD> <TD align="right"> </TD><TD align="right"> 41 </TD> <TD align="right"> 259 </TD>
<TD align="right"> </TD> <TD align="right"> </TD><TD align="right"> </TD> <TD align="right"> 0.8633 </TD>
</TR>
<TR>
<TD align="right"> walking </TD> <TD align="right"> </TD><TD align="right"> </TD> <TD align="right"> </TD>
<TD align="right"> 226 </TD> <TD align="right"> 3 </TD> <TD align="right"> 2 </TD> <TD align="right"> 0.9784 </TD>
</TR>
<TR>
<TD align="right"> walk down </TD> <TD align="right"> </TD><TD align="right"> </TD> <TD align="right"> </TD>
<TD align="right"> </TD> <TD align="right"> 196 </TD> <TD align="right"> </TD> <TD align="right"> 1.0 </TD>
</TR>
<TR>
<TD align="right"> walk up </TD> <TD align="right"> </TD><TD align="right"> </TD> <TD align="right"> </TD>
<TD align="right"> 3 </TD> <TD align="right"> 1 </TD> <TD align="right"> 214 </TD> <TD align="right"> 0.9817 </TD>
</TR>
<TR>
<TD align="right"> sensitivity </TD> <TD align="right"> 1.0 </TD><TD align="right"> 0.8447 </TD> <TD align="right"> 0.9152 </TD> <TD align="right"> 0.9869 </TD> <TD align="right"> 0.98 </TD> <TD align="right"> 0.9907 </TD> <TD align="right"> accuracy=0.95 </TD>
</TR>
</TABLE>
</div>
<figcaption class="capCenter">Table 3: Confusion matrix of <span class="caps">LDA2</span> predictions. See Table 1 for further details.</figcaption>
</figure>
<p>We can observe the same pattern of misclassification as in the other two models. Interestingly, when we use <span class="caps">LDA1</span> for prediction, accuracy increases to 0.9785 (an error rate of 2.15%). Nevertheless, since <span class="caps">LDA2</span> performed better in cross-validation on the training set, we assume that this increase is due to chance and does not reflect a truly better model.</p>
<p>To visually demonstrate the reason for the model’s misclassification we can plot the test data in the first two dimensions of the trained linear discriminant, color-coded by true activity (Figure 3).</p>
<figure>
<img src="/images/activitycat/lda.png" alt="RF CDF"/>
<figcaption class="capCenter">Figure 3: Test data scattered in the first two discriminant dimensions.</figcaption>
</figure>
<p>Again, we find that two clusters of activities are similar: sitting and standing on the one hand, and the different walking activities on the other. But at least the two clusters (as well as the data points for the lying activity) are well separated, in contrast to the “raw” dimensions shown in Figure 1.</p>
<h4>Comparison of classifiers</h4>
<p>Comparing the three classifiers in terms of their sensitivity (recall), i.e. the proportion of correct predictions for each class, we have already seen that all three models perform very similarly, with the <span class="caps">SVM</span> having a slight advantage. We can speculate that this is due to the non-linear (radial basis) decision boundaries of this classifier, which stand in contrast to the linear methods employed in the other two models.</p>
<p>Based on the previous results we expect not to gain much predictive power from the combination of individual models using a simple majority vote. All models exhibit the same problem of misclassification of sitting and standing activities, and therefore do not complement each other. This is confirmed by a combined accuracy of 0.958 when predictions are made based on a majority vote of the three models, which sits exactly between the lower scoring <span class="caps">RF</span> and <span class="caps">LDA</span> on the one hand, and the slightly higher scoring <span class="caps">SVM</span> on the other.</p>
<p>What explains the consistent misclassification of sitting and standing across all three models? Intuitively it is clear that since in both “activities” subjects remain more or less motionless, inertial data will not provide much differentiating information. This is reflected in the data. To illustrate this we trained another random forest on a new subset of the training data which a) included only sitting and standing activities, and b) included only predictors with pair-wise correlations less than R=0.9 (same procedure as for the <span class="caps">LDA</span> model). This data set therefore consisted of a binary outcome and 2113 observations (1022 and 1091 in each level). The importances of the resulting random forest show that the most significant split is achieved on the mean angle of gravity with respect to the Y axis (θy), followed by the energy measure of acceleration due to gravity in the Y dimension in the time domain (gey) or, according to the mean decrease in Gini index, the entropy measure of the same variable. In the left panel of Figure 4 we plot the data along these two axes (θy vs. gey) and color the data according to activity.</p>
<figure>
<img src="/images/activitycat/intertial.jpg" alt="Inertial data"/>
<figcaption class="capCenter">Figure 1: Overlap of data from sitting and standing activities underlying the failure to perfectly separate these two classes. Left panel: scatterplot of the two most important variables for distinguishing sitting and standing activities (according to a random forest fitted to data for these two activities only). θy is the mean angle of gravity with respect to the y-axis, and gey is the entropy of acceleration due to gravity in the y-dimension (see main text for further details). Only part of the range for θy is shown to highlight the region of overlap. Right panel: the same overlap is more clearly seen in the histogram of the θy variable only. Even though the means of θy for sitting and standing are different (p-value in t-test < 2.2e-16), their distributions overlap significantly.</figcaption>
</figure>
<p>We can see that while the data falls into two identifiable regions, these are not perfectly separable but rather show significant overlap. This can be seen even more clearly in the right panel of Figure 4, where we superimpose histograms of θy separated by activity. The distributions of sitting and standing in this variable are clearly different statistically, but also overlap significantly. Their difference is confirmed by a t-test of their means (-0.01 and 0.21 for sitting and standing respectively, p-value &lt; 2.2e-16). Nevertheless, the overlap means that no classifier should be able to distinguish these two activities perfectly, at least not based on this single variable. Adding further variables might help in separating the two distributions. But as the three trained models seem to indicate, the data set does not appear to contain the kind of variables that allow for perfect discrimination of sitting and standing.</p>
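<p>For reference, a sketch of this comparison (<code>theta.y</code> is a hypothetical name for the mean-angle-to-gravity feature, and the activity labels are assumed):</p>
<pre><code># Compare the means of theta.y between sitting and standing.
sitstand &lt;- droplevels(subset(trainSet, activity %in% c("sitting", "standing")))
t.test(theta.y ~ activity, data = sitstand)
</code></pre>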
<h3>Conclusions</h3>
<p>We have used three different types of classifiers to predict a subject’s physical activity from inertial data captured using the accelerometer and gyroscope embedded in mobile phones worn at the wrist. All classifiers performed well overall (accuracy > 0.95), but failed equally to distinguish some cases of sitting and standing. We observe, however, that the non-linear <span class="caps">SVM</span> seems to have a slight advantage over the two linear models. This suggests that perhaps a non-linear variant of the <span class="caps">LDA</span> algorithm (namely quadratic discriminant analysis, or <span class="caps">QDA</span>), and equally a random forest using decision trees with non-linear boundaries, would have been more appropriate for this data set. Further work would also be needed to determine whether the radial kernel used in the <span class="caps">SVM</span> model is in fact the optimal kernel for this data set.</p>
<p>We have shown that the data used in this analysis does not seem to contain individual variables that can separate sitting and standing activities perfectly. The failure of all three classifiers also suggests that the two activities cannot be resolved in higher dimensions. This is corroborated by the fact that the classifiers take rather different approaches, e.g. parametric (<span class="caps">LDA</span>) versus non-parametric (<span class="caps">RF</span>), or linear (decision trees) versus non-linear decision boundaries (<span class="caps">SVM</span>). Of course, the failure to distinguish sitting and standing using inertial data only is not surprising, as both activities imply near stationarity of the sensors. However, we can hypothesise that other transformations of the data, not provided in this set, could be helpful. For example, accelerations in the vertical direction due to body motion should show non-linear step changes at the moment of sitting down, which would not occur if a person continued standing. Adding indicators of such step changes to the data set could potentially lead to better separability of these activities.</p>
<p>We have not performed an analysis of variation between subjects here. It is possible that the behaviour of some subjects differs significantly from that of others, and that information is lost in the process of “averaging” across subjects. Future work should also address this question.</p>
<h3>References</h3>
<p><ol class="bib">
<li><span class="caps">UCI</span> Data set: <a href="http://archive.ics.uci.edu/ml/datasets/Human+Activity+Recognition+Using+Smartphones">Human Activity Recognition Using Smartphones</a></li>
<li><a href="https://spark-public.s3.amazonaws.com/dataanalysis/samsungData.rda">Preprocessed data set</a> on Amazon S3 storage.</li>
<li>Breiman, L. (2001), Random Forests, Machine Learning 45(1), 5-32.</li>
<li>Bishop, <span class="caps">C. M.</span> (2006). Pattern recognition and machine learning (Vol. 4, No. 4). New York: Springer.</li>
<li><a href="http://cran.r-project.org/web/packages/e1071/index.html"><span class="caps">SVM</span> package ‘e1071’</a></li>
<li>Bergstra, J. and Bengio, Y. (2012). Random Search for Hyper-Parameter Optimization. J. Machine Learning Research 13: 281–305.</li>
</ol></p>
</div><!-- /.entry-content -->
<footer class="post-info">
Published on <span class="published">November 06, 2014</span><br>
Written by <span class="author">Thomas Buhrmann</span><br>
Posted in <span class="label label-default"><a href="https://buhrmann.github.io/category/reports.html">Reports</a></span>
~ Tagged
<span class="label label-default"><a href="https://buhrmann.github.io/tag/r.html">R</a></span>
<span class="label label-default"><a href="https://buhrmann.github.io/tag/report.html">report</a></span>
<span class="label label-default"><a href="https://buhrmann.github.io/tag/classification.html">classification</a></span>
<span class="label label-default"><a href="https://buhrmann.github.io/tag/svm.html">svm</a></span>
<span class="label label-default"><a href="https://buhrmann.github.io/tag/random-forest.html">random forest</a></span>
<span class="label label-default"><a href="https://buhrmann.github.io/tag/lda.html">lda</a></span>
</footer><!-- /.post-info -->
</section>
<div class="blogItem">
<h2>Comments</h2>
<div id="disqus_thread"></div>
<script type="text/javascript">
var disqus_shortname = 'datawerk';
var disqus_title = 'Categorisation of inertial activity data';
var disqus_identifier = "activity-data.html";
(function() {
var dsq = document.createElement('script');
dsq.type = 'text/javascript';
dsq.async = true;
//dsq.src = 'http://' + disqus_shortname + '.disqus.com/embed.js';
dsq.src = '//' + disqus_shortname + '.disqus.com/embed.js';
(document.getElementsByTagName('head')[0] ||
document.getElementsByTagName('body')[0]).appendChild(dsq);
})();
</script>
<noscript>
Please enable JavaScript to view the
<a href="http://disqus.com/?ref_noscript=datawerk">
comments powered by Disqus.
</a>
</noscript>
</div>
</div>
</div><!-- row-->
</div><!-- container -->
<!-- <div class="push"></div> -->
</div> <!-- wrap -->
<div class="container-fluid aw-footer">
<div class="row-centered">
<div class="col-sm-3 col-sm-offset-1">
<h4>Author</h4>
<ul class="list-unstyled my-list-style">
<li><a href="http://www.ias-research.net/people/thomas-buhrmann/">Academic Home</a></li>
<li><a href="http://github.com/synergenz">Github</a></li>
<li><a href="http://www.linkedin.com/in/thomasbuhrmann">LinkedIn</a></li>
<li><a href="https://secure.flickr.com/photos/syngnz/">Flickr</a></li>
</ul>
</div>
<div class="col-sm-3">
<h4>Categories</h4>
<ul class="list-unstyled my-list-style">
<li><a href="https://buhrmann.github.io/category/academia.html">Academia (4)</a></li>
<li><a href="https://buhrmann.github.io/category/data-apps.html">Data Apps (2)</a></li>
<li><a href="https://buhrmann.github.io/category/data-posts.html">Data Posts (9)</a></li>
<li><a href="https://buhrmann.github.io/category/reports.html">Reports (3)</a></li>
</ul>
</div>
<div class="col-sm-3">
<h4>Tags</h4>
<ul class="tagcloud">
<li class="tag-4"><a href="https://buhrmann.github.io/tag/shiny.html">shiny</a></li>
<li class="tag-4"><a href="https://buhrmann.github.io/tag/networks.html">networks</a></li>
<li class="tag-3"><a href="https://buhrmann.github.io/tag/sql.html">sql</a></li>
<li class="tag-3"><a href="https://buhrmann.github.io/tag/hadoop.html">hadoop</a></li>
<li class="tag-4"><a href="https://buhrmann.github.io/tag/mongodb.html">mongodb</a></li>
<li class="tag-1"><a href="https://buhrmann.github.io/tag/visualization.html">visualization</a></li>
<li class="tag-2"><a href="https://buhrmann.github.io/tag/smcs.html">smcs</a></li>
<li class="tag-3"><a href="https://buhrmann.github.io/tag/sklearn.html">sklearn</a></li>
<li class="tag-3"><a href="https://buhrmann.github.io/tag/tf-idf.html">tf-idf</a></li>
<li class="tag-1"><a href="https://buhrmann.github.io/tag/r.html">R</a></li>
<li class="tag-4"><a href="https://buhrmann.github.io/tag/sna.html">sna</a></li>
<li class="tag-2"><a href="https://buhrmann.github.io/tag/nosql.html">nosql</a></li>
<li class="tag-4"><a href="https://buhrmann.github.io/tag/svm.html">svm</a></li>
<li class="tag-4"><a href="https://buhrmann.github.io/tag/java.html">java</a></li>
<li class="tag-4"><a href="https://buhrmann.github.io/tag/hive.html">hive</a></li>
<li class="tag-4"><a href="https://buhrmann.github.io/tag/scraping.html">scraping</a></li>
<li class="tag-4"><a href="https://buhrmann.github.io/tag/lda.html">lda</a></li>
<li class="tag-2"><a href="https://buhrmann.github.io/tag/kaggle.html">kaggle</a></li>
<li class="tag-4"><a href="https://buhrmann.github.io/tag/exploratory.html">exploratory</a></li>
<li class="tag-4"><a href="https://buhrmann.github.io/tag/titanic.html">titanic</a></li>
<li class="tag-2"><a href="https://buhrmann.github.io/tag/classification.html">classification</a></li>
<li class="tag-1"><a href="https://buhrmann.github.io/tag/python.html">python</a></li>
<li class="tag-4"><a href="https://buhrmann.github.io/tag/random-forest.html">random forest</a></li>
<li class="tag-4"><a href="https://buhrmann.github.io/tag/text.html">text</a></li>
<li class="tag-4"><a href="https://buhrmann.github.io/tag/big-data.html">big data</a></li>
<li class="tag-2"><a href="https://buhrmann.github.io/tag/report.html">report</a></li>
<li class="tag-4"><a href="https://buhrmann.github.io/tag/regression.html">regression</a></li>
<li class="tag-2"><a href="https://buhrmann.github.io/tag/graph.html">graph</a></li>
<li class="tag-2"><a href="https://buhrmann.github.io/tag/d3.html">d3</a></li>
<li class="tag-3"><a href="https://buhrmann.github.io/tag/neo4j.html">neo4j</a></li>
<li class="tag-4"><a href="https://buhrmann.github.io/tag/flume.html">flume</a></li>
</ul>
</div>
</div>
</div>
<!-- JavaScript -->
<script src="https://code.jquery.com/jquery-2.1.1.min.js"></script>
<script src="//maxcdn.bootstrapcdn.com/bootstrap/3.2.0/js/bootstrap.min.js"></script>
<script type="text/javascript">
jQuery(document).ready(function($)
{
$("div.collapseheader").click(function () {
$header = $(this).children("span").first();
$codearea = $(this).children(".input_area");
$codearea.slideToggle(500, function () {
$header.text(function () {
return $codearea.is(":visible") ? "Collapse Code" : "Expand Code";
});
});
});
// $(window).resize(function(){
// var footerHeight = $('.aw-footer').outerHeight();
// var stickFooterPush = $('.push').height(footerHeight);
// $('.wrap').css({'marginBottom':'-' + footerHeight + 'px'});
// });
// $(window).resize();
// $(window).bind("load resize", function() {
// var footerHeight = 0,
// footerTop = 0,
// $footer = $(".aw-footer");
// positionFooter();
// function positionFooter() {
// footerHeight = $footer.height();
// footerTop = ($(window).scrollTop()+$(window).height()-footerHeight)+"px";
// console.log(footerHeight, footerTop);
// console.log($(document.body).height()+footerHeight, $(window).height());
// if ( ($(document.body).height()+footerHeight) < $(window).height()) {
// $footer.css({ position: "absolute" }).css({ top: footerTop });
// console.log("Positioning absolute");
// }
// else {
// $footer.css({ position: "static" });
// console.log("Positioning static");
// }
// }
// $(window).scroll(positionFooter).resize(positionFooter);
// });
});
</script>
</body>
</html>