\chapter{Human action classification}
\label{chap/act}
\section{Introduction}
\label{sec/act/intro}
% Done
Recognising human activities from videos is regarded as a challenging problem in contemporary computer vision research. It has attracted much attention recently due to the popularity of web-scale video databases, such as YouTube. Video-based human action analysis encompasses multi-class detection, classification and pose estimation problems over time-varying (spatiotemporal) data. Action classification has been widely studied for different applications, including human-computer interaction, digital entertainment, visual surveillance and automatic video indexing. Despite the popularity of such topics in computer vision research, several issues remain unresolved:
% Done
\paragraph{Run-time performance in realistic recognition systems} Run-time efficiency is of vital importance for real-world object recognition systems. While real-time action classification is required by many time-critical applications, existing algorithms seldom take computational complexity into full consideration. Current state-of-the-art solutions focus on the final classification accuracy in testing, and report promising results on standard human action datasets \cite{Kim2007, Lin2009, Liu2008, Willems2009}. However, such solutions are not as efficient when applied to realistic applications, because they employ sophisticated algorithms or complex features to maximise recognition accuracy at the expense of increased computation time. To summarise, there exists a trade-off between classification accuracy and run-time performance; designing an algorithm that balances both for video-based human action analysis is a challenging engineering problem.
% Done
\paragraph{Responsive action recognition} For many computer vision applications, such as human-computer interfaces and video surveillance, it is necessary to classify an action within a short response time (classification latency), in order to provide prompt feedback on the observed actions. The recognition algorithms for such tasks are often repurposed from existing classification algorithms for 2D images, which were adopted and proved effective in early video-based action recognition research \cite{Schuldt2004, Dollar2005}. Traditional action recognition systems consider the testing video as a single 3D object: typically, a class label is assigned only after sufficient features have been extracted from the entire testing video. Although minimising classification latency is necessary for time-critical tasks, there have been few attempts to improve the design of existing action recognition systems in this respect. Schindler and van Gool \cite{Schindler2008} have shown that, by using appropriate spatiotemporal features, human actions can be recognised from very short video sequences (1--7 frames) with accuracy comparable to state-of-the-art approaches.
% done
\paragraph{Spatial and temporal structures}
Human activities can be considered as sequences of time-varying poses; the spatiotemporal structure of an action is therefore a discriminative cue for action recognition.
While structural information is a useful cue for the task, it is often overlooked by traditional action recognition approaches.
Many existing methods are derived from the standard bag-of-words model in image recognition \cite{Sivic2005, Fei-Fei2005}, which has proved effective for action recognition owing to its rich discriminative power of local appearance information and its inherent benefits to cope with scaling, translation and cluttered backgrounds.
A bag-of-words model, however, neglects the spatial and temporal structure among features.
When a testing video sequence is represented as an unordered set of visual codewords, all structural information of features is discarded.
\subsection{Contributions}
% Done
Addressing the aforementioned challenges, this chapter presents a novel method for real-time human action classification. The major objective of this work is to recognise human actions in monocular videos within a short response time, while attaining state-of-the-art accuracy. To this end, the proposed method fully utilises both appearance and structure in short, overlapping video snippets, as in Schindler and van Gool \cite{Schindler2008}. The main contributions of the proposed method are threefold:
% Done
\paragraph{Learning visual codebooks via semantic texton forest}
Inspired by the work of Shotton \etal \cite{Shotton2008}, the proposed system extends semantic texton forests (STF) from 2D image segmentation to human action classification. A STF is a variant of the randomised decision forest \cite{Ho1995, Amit1997, Breiman2001} that encodes textons as multiple visual codewords. It learns a powerful codebook without the costly k-means clustering used in traditional action recognition algorithms, and operates directly on spatiotemporal video patches without computing expensive local descriptors or filter-bank responses, such as 3D-SIFT descriptors \cite{Scovanner2007}, non-negative matrix factorisation \cite{Wong2007}, histogram of oriented gradients (HOG) \cite{Schuldt2004, Laptev2008} and histogram of optical flow (HOF) descriptors \cite{Riemenschneider2009}.
In addition to faster training and testing, Moosmann \etal \cite{Moosmann2007} also observed that decision forest codebooks are more robust to background clutter than traditional flat codebooks, such as those constructed via the k-means algorithm, and hence can attain better recognition accuracy.
%Firstly, inspired by the work of Shotton \etal \cite{Shotton2008}, we extend the use of semantic texton forests from 2D image segmentation to spatiotemporal analysis. STFs are ensembles of random decision trees that translate interest points into visual codewords. In our method, STF performs directly on video pixels without computing expensive local descriptors and efficiently generates codewords. As well as being faster than a traditional flat codebook such as k-means clustering, STF achieves high effectiveness comparable to that of existing approaches. As far as we are aware, this is the first method that uses STF in action recognition studies.
% Done
\paragraph{Combining structural and appearance information} This chapter presents the hierarchical spatiotemporal relationship match (HSRM), building on the work of Ryoo and Aggarwal \cite{Ryoo2009}. HSRM describes both the local appearance and the structural information of visual codewords, such that the subsequent action classifier can recognise short video snippets and thereby minimise classification delay. Feature histograms are compared using histogram intersection \cite{Ryoo2009}. Since plain histogram intersection is sensitive to noise and quantisation errors, a pyramid match kernel \cite{Grauman2005} is introduced to alleviate this problem. Taking advantage of the inherent hierarchical structure of semantic texton forests, HSRM takes the spatiotemporal relationships among individual local feature points into account, instead of representing a video snippet as an unordered set of codewords. Despite using only simple features and classifiers (video patches and decision forests), the proposed HSRM achieves promising speed and accuracy by utilising both the appearance and the structure of human actions.
% Done
\paragraph{Fast classification using a k-means forest} The proposed framework employs a k-means forest to classify snippets from testing videos. A k-means forest is an ensemble of k-means decision trees, originally introduced by Muja and Lowe \cite{Muja2009} for approximate nearest neighbour search. In this work, it is repurposed as an efficient k-nearest neighbour classifier. Instead of a Euclidean kernel, HSRM is employed as the matching kernel for learning the k-means forest classifier. The recognition accuracy is then further refined by adaptively combining the k-means forest with the traditional bag-of-semantic-textons method, using an adaptive weighting scheme \cite{Shotton2008}. As a result, the combined classification algorithm attains accuracy comparable to that of existing state-of-the-art approaches, whilst simultaneously achieving real-time performance.
%Done
\paragraph{VFAST spatiotemporal interest point detector} The 3D extension of the FAST interest point detector, namely Video-FAST (VFAST), is used to detect spatiotemporal interest points in the training and testing videos. A performance evaluation of 3D interest point detectors was presented in chapter \ref{chap/eval}, where VFAST was shown to produce a denser feature set than traditional approaches. Nowak \etal \cite{Nowak2006} suggest that a densely sampled visual codebook improves recognition performance.
VFAST features are also faster to compute than the traditional 3D interest points of Laptev \cite{Laptev2005} and Dollar \etal \cite{Dollar2005}, making them suitable for real-time applications.
Given a denser set of interest points, action labels can be classified from shorter sequences without sacrificing much recognition accuracy, hence the classification response time can be minimised.
In addition, scale-space analysis can be applied to VFAST in order to detect interest points across different scales.
% Lastly, several techniques are employed to improve the recognition speed and accuracy. A novel spatiotemporal interest point detector, called VFAST, is designed based on the FAST 2D corners{Rosten2006}. The recognition accuracy is improved by adaptively combining HSRM and the bag of semantic texton (BOST) method \cite{Shotton2008}.
% The rest of the chapter is structured as follows: In section \ref{sec/act/relatedwork}, related work are reviewed and, in section \ref{sec/act/overview}--\ref{sec/act/combine}, the proposed methods are detailed. Evaluation results are reported and discussed in Section \ref{sec/act/experiments} and the conclusion is drawn in Section \ref{sec/act/discussion}.
\section{Related work}
\label{sec/act/relatedwork}
% Done
Machine recognition of human activities has been a long-lasting topic in computer vision research since the seventies, \eg moving light displays \cite{Johansson1973}. Early approaches to human activity analysis were reviewed by C\'edras and Shah \cite{Cedras1995}, Moeslund and Granum \cite{Moeslund2001} and Turaga \etal \cite{Turaga2008}. Contemporary action recognition algorithms were reviewed by Poppe \cite{Poppe2010} and Weinland \etal \cite{Weinland2011}.
State-of-the-art action recognition methods have shown the effectiveness of local appearance-based features; the bag-of-words model is particularly popular in the literature \cite{Schuldt2004, Dollar2005, Riemenschneider2009, Niebles2008, Wong2007}.
Figure \ref{fig/act/bow} summarises a typical bag-of-words object recognition pipeline. Feature descriptors are first extracted at the detected interest points. The feature vectors are then quantised into visual codewords by a learned codebook, and each object is represented by its codeword frequencies, \ie a histogram of visual codewords. Finally, a classifier is trained to assign class labels to testing objects.
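The pipeline can be summarised by the following Python sketch. The \texttt{detector}, \texttt{codebook} and \texttt{classifier} objects are hypothetical placeholders for the concrete components discussed in the remainder of this chapter.
\begin{verbatim}
import numpy as np

def bag_of_words_histogram(video, detector, codebook):
    """Quantise the local features of one video into a codeword histogram."""
    points = detector.detect(video)                      # interest point detection
    patches = [video.patch_at(p) for p in points]        # local feature extraction
    codes = [codebook.quantise(d) for d in patches]      # vector quantisation
    hist = np.bincount(codes, minlength=codebook.size)   # codeword frequencies
    return hist / max(hist.sum(), 1)                     # normalised histogram

# A classifier (e.g. SVM or k-NN) is then trained on such histograms and
# assigns a class label to each unseen video.
\end{verbatim}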
\begin{figure}[t]
\centering
\includegraphics[width=1\linewidth]{fig/act/bow.pdf}
\caption{\textbf{Bag-of-words model.} A standard overview of bag-of-words model for object classification.}
\label{fig/act/bow}
\end{figure}
% Done
For action classification, a large codebook, \eg up to thousands of codewords, is favourable to ensure good recognition accuracy. However, the computational load and memory requirements also increase with codebook size. With respect to learning algorithms, feature quantisation by a large flat codebook, such as k-means, is computationally heavy. Furthermore, an oversized codebook is vulnerable to high quantisation errors and over-fitting. As a result, alternative approaches to codebook learning have been investigated to increase the efficiency and robustness of action classification. Moosmann \etal \cite{Moosmann2007} first suggested learning a fast visual codebook as a randomised clustering forest. Since then, tree-based codebooks have gradually been adopted in various object recognition tasks, including scene recognition \cite{Bosch2007} and semantic segmentation \cite{Shotton2008}, owing to their superior efficiency and generalisation power. There are two advantages to using a tree-based codebook for object recognition. Firstly, decision trees can be highly parallelised for efficient computation, which is essential for time-critical applications. Secondly, given a feature vector, a decision tree produces multiple visual codewords at different levels, increasing the robustness of visual codeword matching.
% Done
As a variant of tree-based codebooks, Oshin \etal \cite{Oshin2009} recognised human actions by describing the spatial distribution of interest points with random ferns. Lin \etal \cite{Lin2009} used a prototype tree to encode holistic motion-shape descriptors. Mikolajczyk and Uemura \cite{Mikolajczyk2008} improved codebook testing time by learning clustering trees from the results of k-means clustering. Hierarchical codebooks enable fast vector quantisation during testing; however, the expensive features and classifiers still make the overall process costly \cite{Lin2009, Mikolajczyk2008}.
% Done
Codewords are orderless in the bag-of-words model; a typical bag-of-words algorithm encodes only the appearance information of local features.
% Standard bag of words models contain only local appearance information.
Whilst structural context is useful in describing some action classes, it is often omitted in current action recognition methods. Some studies have implied that structural or holistic features can be as effective as local features.
For instance, Gorelick \etal \cite{Gorelick2007} considered human actions as 3D space-time shapes, which are built from foreground silhouettes. Fathi and Mori \cite{Fathi2008} recognised actions using global mid-level features built from optical flow in segmented 3D patches. Kim \etal \cite{Kim2007} applied canonical correlation analysis to describe human actions in videos. The major limitation of these holistic approaches is that segmentation or action detection is required before the extraction of holistic features.
% Done
Some recent studies attempted to augment action recognition algorithms by considering structural information in addition to local appearance. Savarese \etal \cite{Savarese2008} proposed correlograms to describe the co-occurrences of visual codewords in a space-time neighbourhood. Spatial correlograms were also used in the work by Liu and Shah \cite{Liu2008} to encode structural information of visual word clusters. Wong \etal \cite{Wong2007} presented the pLSA-ISM model, which is an extension of the pLSA topic model \cite{Fergus2005} with additional spatial information from implicit shape model (ISM). Tran \etal \cite{Tran2008} extracted motion templates and performed action classification using metric learning. Zhang \etal \cite{Zhang2008} proposed motion context, which is a spatiotemporal holistic feature based on the shape-context descriptor.
% Done
Nevertheless, the structural relationships among local features are not fully utilised in the above techniques. Some approaches encode local features with respect to a canonical reference frame \cite{Wong2007, Tran2008, Zhang2008}, \eg a rectangular region-of-interest. In order to detect actions, they require segmentation or additional preprocessing, which is often computationally expensive. Addressing this issue, Ryoo and Aggarwal \cite{Ryoo2009} proposed the spatiotemporal relationship match (SRM), which encodes the spatiotemporal relationships between visual codewords using a set of pairwise rules. Kovashka and Grauman \cite{Kovashka2010} exploited structural information by learning an optimal neighbourhood measure for interest points. Despite high recognition accuracy, speed and quantisation errors remain challenging issues for these approaches because flat k-means codebooks are used.
% Done
New machine learning techniques have been introduced to improve the accuracy of different action classification tasks.
For instance, the pyramid match kernel (PMK) \cite{Grauman2005} has been used in object detection and image retrieval. PMK constructs multi-resolution, hierarchical histograms in which codewords are assigned to bins according to their spatial locations. Similar points that do not match at a fine resolution still have a chance to match at a coarser resolution, which reduces quantisation errors and enhances robustness in the spatial domain. PMK has recently been applied to action classification: Liu and Shah \cite{Liu2008} matched interest points at multiple resolutions using PMK and reported improved results over the flat histogram intersection kernel.
However, PMK only performs hierarchical matching in the spatial domain; the semantic similarities among codewords are not modelled.
% Done
As discussed in chapter \ref{chap/eval}, feature detection and representation also play a crucial role in the performance of action classification systems. Several interest point detectors are specifically designed for spatiotemporal data. For example, Laptev \cite{Laptev2005} proposed the spatiotemporal interest point (STIP) detector, an extension of 2D Harris corner detection. Dollar \etal \cite{Dollar2005} then increased the density of detected points by applying Gaussian filters in space and Gabor filters in time.
Wong and Cipolla \cite{Wong2007a} extracted discriminative regions using non-negative matrix factorisation, decomposing videos into linear combinations of shared space-time components.
Based on salient region detection \cite{Mikolajczyk2004}, Oikonomopoulos \etal \cite{Oikonomopoulos2005} computed spatiotemporal saliency by measuring the entropy within a spherical neighbourhood. Messing \etal \cite{Messing2009} used trajectories of tracked keypoints to recognise actions from videos.
With respect to feature descriptors, HOG and HOF are popular techniques in earlier action classification methods \cite{Dollar2005, Niebles2008, Schuldt2004}. Scovanner \etal \cite{Scovanner2007} proposed a 3D version of Lowe's seminal SIFT descriptor \cite{Lowe2004}. Willems \etal \cite{Willems2009} presented an extended SURF descriptor in videos for action recognition.
% Extra feature descriptor?
To summarise, feature extraction is often one of the most computationally expensive processes in action classification systems. Efficient features, or descriptor-less approaches, have therefore been investigated to improve the run-time speed of action classification.
% done
In addition to classification accuracy, run-time performance is another essential quality for realistic human action recognition systems. Various techniques have been developed to speed up feature extraction and classification.
Instead of HOG and HOF, Yeffet and Wolf \cite{Yeffet2009} computed dense local trinary patterns, an efficient feature derived from the local binary patterns originally designed for texture analysis and face recognition \cite{Ahonen2006}.
Gilbert \etal \cite{Gilbert2009} proposed a fast multi-action recognition algorithm by finding patterns of dense 2D Harris corners with a data-mining algorithm.
Patron-Perez and Reid \cite{Patron2007} designed a probabilistic classifier that recognised actions continuously with a sliding window.
Bregonzio \etal \cite{Bregonzio2009} considered actions as clouds of interest points; efficient classification was performed by analysing histograms of point clusters.
%Regarding the classification step in existing systems, standard classifiers are often used.
Standard classifiers are often used in existing action classification systems.
K-nearest neighbour classifier, support vector machines and boosting were among the most widely adopted algorithms. New models have been proposed to accelerate the classification process, such as k-NN approximation using decision trees \cite{Lin2009}, randomised ferns \cite{Oshin2009} and Markov chains \cite{Messing2009}.
Although real-time performance has been reported by these methods, their computational bottlenecks in feature extraction and vector quantisation have not been addressed.
\section{Overview of the proposed method}
\label{sec/act/overview}
%done
The proposed action classification system is outlined in figure \ref{fig/act/flow}. A testing video is first divided into many video snippets, which are overlapping short video sequences. In order to minimise the classification delay, actions are classified sequentially from the snippets instead of the whole testing video.
Spatiotemporal features are detected using the VFAST interest point, which has been evaluated in section \ref{sec/eval/vfast}.
Spatiotemporal textons, \ie cuboids of voxels, are extracted from the detected VFAST interest points. They are quantised into visual codewords by a trained semantic texton forest. The learning algorithm of semantic texton forest is described in section \ref{sec/act/stf}.
The set of visual codewords is classified by the proposed HSRM algorithm and by the traditional bag-of-semantic-textons (BOST) algorithm. These two algorithms are run in parallel, and their classification results are combined using a late fusion scheme. In the HSRM approach, visual codewords are assigned to 3D histograms according to a set of association rules, capturing both the appearance and the structural information of the video snippet. Actions are then classified efficiently using the k-means forest classifier of section \ref{sec/act/kmeansforest}, with a pyramid match kernel \cite{Grauman2005} as the similarity metric.
In the BOST algorithm, a random forest classifier is employed to classify 1-D histograms of visual codewords.
The final classification result is combined from the outputs of HSRM and BOST algorithms, using an adaptive late fusion scheme.
% The feature extraction phase of the proposed framework is flexible.
% Depending on the target application, simple spatiotemporal video patches can be replaced by more discriminative local features, such as HOG or tracked trajectories.
\begin{figure}[!ht]
\centering
\includegraphics[width=1.0\linewidth]{fig/act/act_intro.pdf}
\caption{\textbf{Overview of the proposed approach.} VFAST interest points extracted from the video snippets are encoded into visual codewords by the spatiotemporal semantic texton forest. Two different algorithms, namely the proposed HSRM method and the traditional bag-of-words model, are employed to classify the codewords. Their results are combined using an adaptive late fusion scheme.}
\label{fig/act/flow}
\end{figure}
\section{VFAST interest point detector}
\label{sec/act/fastest}
% DONE
Video FAST (VFAST) interest points are obtained by extending the FAST corner detector \cite{Rosten2006} into the spatiotemporal domain. The detector considers voxels on three orthogonal Bresenham circles $\bcir$ with radii $r_{XY}$, $r_{XT}$ and $r_{YT}$ on the $XY$, $XT$ and $YT$ planes respectively. Implementation details of the VFAST interest point detector were given in section \ref{sec/eval/vfast}.
% Figure \ref{fig/act/fastest} illustrates how interest points are detected using the 42-pixel V-FAST interest point detector with $r_{xy},r_{xt},r_{yt} = 3$.
Figure \ref{fig/act/fastest} illustrates how the cuboids are extracted from the input videos.
An input video is considered as a spatiotemporal volume of voxels; the VFAST algorithm therefore operates along the $X$, $Y$ and $T$ axes. Salient motions are detected by VFAST using three axis-aligned circles of radius three voxels, \ie $r_{XY}, r_{YT}, r_{XT} = 3$, according to algorithm \ref{algo/eval/vfast}. Spatiotemporal textons are then obtained by extracting voxel cuboids from the input video, centred at the VFAST interest points. In this work, a video cuboid has dimensions of $9 \times 9$ voxels in space and $5$ voxels in time.
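As an illustration of the texton extraction step, the following Python sketch cuts fixed-size cuboids around a list of detected interest points. The $(T, Y, X)$ array layout is an assumption of the sketch; the $9 \times 9 \times 5$ dimensions follow the setting above.
\begin{verbatim}
import numpy as np

def extract_cuboids(video, points, sx=9, sy=9, st=5):
    """Cut sx-by-sy-by-st voxel cuboids centred at each VFAST interest point.

    video  : ndarray of shape (T, Y, X) containing grey-level voxels
    points : iterable of (x, y, t) interest point coordinates
    """
    hx, hy, ht = sx // 2, sy // 2, st // 2
    cuboids = []
    for x, y, t in points:
        # discard points too close to the video boundary
        if (hx <= x < video.shape[2] - hx and
                hy <= y < video.shape[1] - hy and
                ht <= t < video.shape[0] - ht):
            cuboids.append(video[t - ht:t + ht + 1,
                                 y - hy:y + hy + 1,
                                 x - hx:x + hx + 1].astype(np.float32))
    return cuboids
\end{verbatim}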
% Figure \ref{fig/act/fastest} illustrates how the cuboids $\vpatch$ are extracted using the VFAST interest point detector.
% already discussed in ``eval'' chapter
% For a spatiotemporal volume $\vpatch$, FAST saliency is detected on a plane if there exist $n$ contiguous pixels on the circle which are all brighter than the intensity of the reference voxel $\vpatch(x,y,t)$ plus a predefined threshold $\delta^{FAST} \in \{ t_{XY}, t_{XT}, t_{YT} \}$, or all darker than $\vpatch(x,y,t)$ minus $\delta^{FAST}$. An interest point is detected when the reference pixel shows both spatial ($XY$ plane) and temporal ($XT$ plane or $YT$-plane) saliency.
\iffalse
The steps of computing VFAST interest points are described in algorithm \ref{algo/act/vfast}.
%DONE
\begin{algorithm}
\caption{\textbf{VFAST spatiotemporal interest point detector.}}
\label{algo/act/vfast}
\begin{algorithmic}
\REQUIRE $[x,y,t] = $ voxel coordinates
\REQUIRE $r_{XY},r_{XT},r_{YT} = $ Radii of Bresenham circles
\REQUIRE $n = $ Contiguous threshold
\REQUIRE $\delta^{FAST} = $ Intensity threshold
\STATE $\bcir_{XY}(r_{XY}),\bcir_{XT}(r_{XT}),\bcir_{YT}(r_{YT})$ = Pixels of Bresenham circles
\STATE $FAST_{XY} = ( \bcir_{XY} > \vpatch(x,y,t)+\delta^{FAST}_{XY}\quad or\quad \bcir_{XY} < \vpatch(x,y,t)-\delta^{FAST}_{XY} )$
\STATE $FAST_{XT} = ( \bcir_{XT} > \vpatch(x,y,t)+\delta^{FAST}_{XT}\quad or\quad \bcir_{XT} < \vpatch(x,y,t)-\delta^{FAST}_{XT} )$
\STATE $FAST_{YT} = ( \bcir_{YT} > \vpatch(x,y,t)+\delta^{FAST}_{YT}\quad or\quad \bcir_{YT} < \vpatch(x,y,t)-\delta^{FAST}_{YT} )$
\IF{(${ContiguousPixel}(FAST_{XY}) \ge n$) \AND ((${ContiguousPixel}(FAST_{XT}) \ge n$) \OR (${ContiguousPixel}(FAST_{YT}) \ge n$))}
\RETURN \texttt{TRUE}
\ELSE
\RETURN \texttt{FALSE}
\ENDIF
\end{algorithmic}
\end{algorithm}
\fi
% DONE
\iffalse
\begin{figure}[ht]
\centering
\includegraphics[width=1\linewidth]{fig/act/fig2_new.pdf}
\caption{\textbf{Feature detection and extraction. } Spatiotemporal features are detected from the input video by the VFAST detector. Video cuboids are extracted around these interest points for training and testing.}
\end{figure}
\fi
\begin{figure}[ht]
\centering
\begin{subfigure}[t]{0.48\linewidth} \centering
\includegraphics[height=0.66\linewidth]{./fig/act/act_fast_2.pdf}
\subcaption{Input video as a spatiotemporal volume}
\end{subfigure}
\begin{subfigure}[t]{0.48\linewidth} \centering
\includegraphics[height=0.66\linewidth]{./fig/act/act_fast_1.pdf}
\subcaption{Voxel patches are extracted from the VFAST interest points}
\end{subfigure}
\caption{\textbf{Feature detection and extraction. } (Left) An input video is regarded as a 3D spatiotemporal volume. (Right) Video cuboids are extracted from the VFAST interest points for both training and testing.}
\label{fig/act/fastest}
\end{figure}
%\section{Random Forest}
%blah blah blah
\section{Spatiotemporal semantic texton forest}
\label{sec/act/stf}
%done
A semantic texton forest (STF) is an ensemble of randomised decision trees which textonise input video patches into semantic textons. Beyond classification, random forests have also been applied to vector quantisation by learning multiple tree-based codebooks \cite{Moosmann2007}. Shotton \etal \cite{Shotton2008} applied this technique to image segmentation by quantising texture patches with a STF. In this chapter, a STF is used to convert video patches into visual codewords.
It is an efficient codebook: feature vectors traverse each tree using simple voxel-pair tests, without computing filter responses or feature descriptors. The STF codebook also demonstrates higher robustness and discriminative power than k-means because it produces multiple codewords for each feature vector.
% Done
The training algorithm of a STF is similar to that of a supervised classification forest \cite{Breiman2001}. Spatiotemporal patches $\tdata$ are extracted at the VFAST interest points detected in the training videos.
At each new node, a pool of candidate split functions is generated randomly; a split function is then selected from the pool by maximising the information gain $\mathcal{Q}(\cdot)$ on the current training data, such that
\begin{equation}
\begin{aligned}
\entropy(\tdata_{\curnode}) & = -\sum_{c \in \nclass} \frac{|\tdata_{\curnode c}|}{|\tdata_{\curnode}|}\log_{2}\left(\frac{|\tdata_{\curnode c}|}{|\tdata_{\curnode}|}\right),\\[3mm]
\mathcal{Q}(\tdata_{\curnode}) & = \entropy(\tdata_{\curnode}) - \frac{|\tdata_{l}|}{|\tdata_{\curnode}|}\entropy(\tdata_{l}) - \frac{|\tdata_{r}|}{|\tdata_{\curnode}|}\entropy(\tdata_{r}),
\end{aligned}
\label{eqn/act/ig}
\end{equation}
where $\entropy(\tdata_{\curnode})$ is the information entropy of the training data $\tdata_{\curnode}$ at the current node $\curnode$, $\tdata_{\curnode c}$ denotes the subset of $\tdata_{\curnode}$ that belongs to class $c \in \nclass$, and $\tdata_{l}$ and $\tdata_{r}$ are the left and right subsets of $\tdata_{\curnode}$.
A split function is assigned to each new node by maximising equation \ref{eqn/act/ig}, so that the training video patches are clustered recursively down the decision tree. Training stops when the new tree nodes reach a certain depth. Algorithm \ref{algo/act/grow} describes the recursive procedure for learning a randomised decision tree in a STF.
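A minimal Python sketch of this split selection criterion is given below; class labels are assumed to be integer-encoded in a NumPy array, and a candidate split is represented by a boolean mask over the current training data.
\begin{verbatim}
import numpy as np

def entropy(labels):
    """Shannon entropy E(D) of a NumPy array of integer class labels."""
    if len(labels) == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def information_gain(labels, left_mask):
    """Gain Q(D) of splitting the labels into left/right subsets (equation ig)."""
    left, right = labels[left_mask], labels[~left_mask]
    n = len(labels)
    return (entropy(labels)
            - len(left) / n * entropy(left)
            - len(right) / n * entropy(right))
\end{verbatim}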
\begin{algorithm}[ht]
\caption{\textbf{Semantic texton tree node training.}}
\label{algo/act/grow}
\algsetup{indent=2em}
\begin{algorithmic}
\STATE \hspace{-2em} \textbf{function split}($\tdata_{\curnode}$, $N$):
\STATE Generate a pool of candidate split functions in $\Theta = \{ \theta_{1},\dots,\theta_{|\Theta|}\}$
\FOR {$i = 1$ to $|\Theta|$}
\STATE Split training data $\tdata_{\curnode}$ using $\theta_i$
\STATE $\tdata_l = $ data points in $\tdata_{\curnode}$ for which $\theta_i$ returns $0$
\STATE $\tdata_r = $ data points in $\tdata_{\curnode}$ for which $\theta_i$ returns $1$
\STATE $\mathcal{Q}_i = \entropy(\tdata_{\curnode}) - \frac{|\tdata_{l}|}{|\tdata_{\curnode}|}\entropy(\tdata_{l}) - \frac{|\tdata_{r}|}{|\tdata_{\curnode}|}\entropy(\tdata_{r})$ \quad (equation \ref{eqn/act/ig})
\ENDFOR
\STATE Find the candidate split $\theta_j$ that maximises the information gain $\mathcal{Q}_j$
\STATE Assign $\theta_j$ to the current node
\IF {(Node has not reached maximum depth)}
\STATE Create left node $N_l$
\STATE Create right node $N_r$
\STATE \textbf{split}($\tdata_{l}$,$N_l$)
\STATE \textbf{split}($\tdata_{r}$,$N_r$)
\ENDIF
\STATE \hspace{-2em} \textbf{end function}
\end{algorithmic} \vspace{3mm}
\begin{algorithmic}
\STATE \hspace{-2em} \textbf{function grow}():
\STATE $\tdata_{root}$ = $\tdata$
\STATE Add root node $N_{0}$ to tree
\STATE \textbf{split}($\tdata_{root}$, $N_{0}$)
\STATE \hspace{-2em} \textbf{end function}
\end{algorithmic}
\end{algorithm}
%Done
In the proposed action classification system, a split function $\theta(\cdot)$ is defined as the weighted difference between two voxel values in an input video patch $\vpatch$:
\begin{equation}
\theta(p) =
\left\{
\begin{array}{lc}
1 & \mbox{ when } w_1 \vpatch(x_1,y_1,t_1) - w_2 \vpatch(x_2,y_2,t_2) > threshold \\
0 & \mbox{ otherwise }
\end{array}
\right.,
\label{eqn/act/stftest}
\end{equation}
where $\vpatch(x_1, y_1, t_1)$ and $\vpatch(x_2, y_2, t_2)$ are two randomly selected voxels from the input video patch, and $w_1$ and $w_2$ are random weightings for $\vpatch(x_1,y_1,t_1)$ and $\vpatch(x_2,y_2,t_2)$ respectively.
% Done
During test time, a STF acts on small spatiotemporal cuboid patches, which are extracted around the detected VFAST interest points. These video cuboids are then flattened to feature vectors, which pass down all $M$ trees in the STF.
The forest recursively clusters these video cuboids according to the chosen split functions.
% until reaching a certain depth in the randomised decision tree.
The STF codebook has a size of $\codebooksize=\sum_{m=1}^{M} \codebooksize_m$, where $\codebooksize_m$ denotes the number of leaf nodes, \ie codewords, in the $m$-th tree.
Consequently, in each decision tree, the set of video cuboids is roughly clustered into $\codebooksize_m$ groups based on appearance; the average of the cuboids within a node is called a semantic texton \cite{Shotton2008}.
Figure \ref{fig/act/stf} demonstrates how input spatiotemporal patches are clustered into two codewords through a tree node.
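To make the codeword generation explicit, the sketch below passes a flattened video cuboid down a single semantic texton tree using the voxel-pair test of equation \ref{eqn/act/stftest}. The node representation is a deliberately simplified assumption.
\begin{verbatim}
class STNode:
    """A node of one semantic texton tree (simplified representation)."""
    def __init__(self, idx1=0, idx2=0, w1=1.0, w2=1.0,
                 threshold=0.0, left=None, right=None, leaf_id=None):
        self.idx1, self.idx2 = idx1, idx2   # positions of two voxels in the flattened patch
        self.w1, self.w2 = w1, w2           # random weights of the split function
        self.threshold = threshold
        self.left, self.right = left, right
        self.leaf_id = leaf_id              # codeword index if this node is a leaf

def codeword(tree, patch):
    """Return the leaf (codeword) index assigned to a flattened cuboid."""
    node = tree
    while node.leaf_id is None:
        # weighted voxel-pair test of equation (stftest)
        if node.w1 * patch[node.idx1] - node.w2 * patch[node.idx2] > node.threshold:
            node = node.right               # split function returns 1
        else:
            node = node.left                # split function returns 0
    return node.leaf_id

# The full forest maps one patch to M codewords, one per tree:
#   codes = [codeword(tree, patch) for tree in forest]
\end{verbatim}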
\begin{figure}[ht]
\centering
\includegraphics[width=0.8\linewidth]{fig/act/stf_new.pdf}
\caption{\textbf{Visual codeword generation by a semantic texton forest.} In the spatiotemporal semantic texton forest, each terminal node represents a visual codeword. A split function is presented in each non-terminal node. The split function divides incoming data into two groups and passes them to the child nodes.}
\label{fig/act/stf}
\end{figure}
\begin{table}
\begin{center}
\begin{tabular}{|c|c|c|c|c|}
\hline
\textbf{ Algorithm} & \textbf{ Complexity} & \parbox{3cm}{\centering\textbf{Relative Run Time}} & \textbf{Speedup} & \textbf{ Hierarchical} \\
\hline
k-means & $O(K)$ & $559.86$ & $1\mathrm{x}$ & no \\
Hierarchical k-means & $O(b\log_{b}(K))$ & $12.87$ & $43.50\mathrm{x}$ & yes \\
{STF} & \textbf{\color{blue}{ $ O(\log_{2}(K)) $ }} & \textbf{\color{blue}{$1$}} & \textbf{\color{blue}$559.86\mathrm{x}$} & \textbf{\color{blue}{yes}}\\
\hline
\end{tabular}
\end{center}
\caption{\textbf{Run-time performance of different vector quantisation techniques.} Time complexity, relative run time and speedup of flat k-means, hierarchical k-means and semantic texton forest codebooks.}
\label{tab/act/codebook}
\end{table}
Table \ref{tab/act/codebook} summarises an empirical comparison between the STF and k-means codebooks, showing that STF performs much faster than k-means. Relative run time is measured by quantising $1$ million $405$-dimensional feature vectors with a codebook of size $K = 1905$ ($127$ codewords per tree).
\section{Hierarchical spatiotemporal relationship match}
\label{sec/act/HSRM}
% Done
In this work, hierarchical spatiotemporal relationship match (HSRM) is designed to describe both local-appearance and structural information efficiently in a 3D histogram.
% Figure \ref{fig/act/hsrm} visualises the procedures of the spatiotemporal relationship histogram for hierarchical matching:
As described in section \ref{sec/act/stf}, the STF quantises local spatiotemporal volumes into visual codewords.
For each tree in the STF codebook, a 3D histogram is constructed by analysing all possible codeword pairs for their spatiotemporal co-occurrences. The formation of the 3D HSRM histograms is described in section \ref{sec/act/hsrmhistogram}. In section \ref{sec/act/hsrmmatch}, a novel k-means forest algorithm is presented to classify the 3D HSRM histograms, with a pyramid match kernel used as the distance metric in the k-means forest.
% introduction to SRM
The proposed HSRM algorithm is a hierarchical extension of the original spatiotemporal relationship match (SRM) by Ryoo and Aggarwal \cite{Ryoo2009}. In SRM, feature quantisation is performed by a flat k-means codebook instead of a semantic texton forest. Four association rules are defined to describe the spatiotemporal structure of visual codewords, and codeword histograms are constructed by testing every possible pair of visual codewords against the association rules. Testing codeword histograms are classified by nearest-neighbour matching, with histogram intersection as the similarity metric.
Whereas the original SRM algorithm relies on a flat k-means codebook, HSRM leverages the hierarchical properties of multiple tree-based codebooks and performs robust matching using pyramid match kernels. The tree structure offers a time-efficient way to perform semantic codeword matching between pairs of histograms \cite{Grauman2005}.
\subsection{Building spatiotemporal relationship histograms}
\label{sec/act/hsrmhistogram}
% done
Short video snippets are the basic input units of the proposed system, which are densely sampled from an input video.
A set of VFAST interest points $\vfastset = \{\onevfast_{i}\}$ are localised from the snippets. Visual codewords are assigned to the interest points by the STF.
The $i$-th encoded interest point is described as
\begin{equation}
\onevfast_{i} = \{x_i, y_i, t_i, \vcode_{m,i}\}_{m=1,\dots,M},
\end{equation}
where $x_i, y_i, t_i$ represent the $XYT$-location of the feature and $\vcode_{m,i}$ is the corresponding visual codeword, \ie the leaf node assigned to $\onevfast_{i}$ by the $m$-th tree.
In the proposed system, a set of pairwise spatiotemporal associations is designed to capture the structural relations among interest points. By analysing all possible pairs $\onevfast_i$ and $\onevfast_j$ in $\vfastset$, the space-time structure of the codewords is described by the following seven association rules $\mathcal{R} = \{ R_1,\dots,R_7\}$:
\begin{figure}[ht]
\centering
\includegraphics[width=1\linewidth]{fig/act/act_hsrm_histogram.pdf}
\caption{\textbf{Codeword histograms of Hierarchical spatiotemporal relationship match (HSRM).} Spatiotemporal textons are converted to visual codewords by the STF. For each decision tree in the STF, a 3D histogram is constructed to capture spatiotemporal relationships between codewords.}
\label{fig/act/hsrm}
\end{figure}
% done
\begin{equation}
\centering
\begin{array}{lr}
R_{1}~{(overlap)}: & |t_i - t_j| < \mathrm{T_{o}}; \\
R_{2}~{(before)}: & \mathrm{T_{o}} < t_j - t_i < \mathrm{T_{b}}; \\
R_{3}~{(after)}: & \mathrm{T_{o}} < t_i - t_j < \mathrm{T_{a}}; \\
R_{4}~{(near XY)}: & (|x_i - x_j| < \mathrm{T_{n}}) \wedge (|y_i - y_j| < \mathrm{T_{n}}); \\
R_{5}~{(near X)}: & (|x_i - x_j| < \mathrm{T_{n}}) \wedge \sim(R_{4}); \\
R_{6}~{(near Y)}: & (|y_i - y_j| < \mathrm{T_{n}}) \wedge \sim(R_{4}); \\
R_{7}~{(far)}: & (|x_i - x_j| < \mathrm{T_{f}}) \wedge (|y_i - y_j| < \mathrm{T_{f}}) \wedge \sim(R_{4} \vee R_{5} \vee R_{6}).
\end{array}
\label{eqn/act/rules}
\end{equation}
% done
Figure \ref{fig/act/hsrm} illustrates the construction and matching of the spatiotemporal relationship histograms using the proposed HSRM.
A set of 3D histograms $\{ \histogram_1(\vfastset),\dots,\histogram_{M}(\vfastset)\}$ is constructed by encoding all possible pairs of feature points in $\vfastset$. The bin $\mathrm{h}_{m}(i,j,k)$ of the histogram $\histogram_{m}(\vfastset)$ for the $m$-th tree counts the number of codeword pairs $(\vcode_{m,i},\vcode_{m,j})$ matched by the association rule $R_k$. The total number of bins of $\histogram_m(\vfastset)$, \ie the histogram size, is therefore $\codebooksize_m \times \codebooksize_m \times |\mathcal{R}|$. Since the spatiotemporal histograms are sparse, operations on them can be greatly accelerated using sparse matrix routines.
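A Python sketch of the histogram construction for a single tree is given below, assuming each interest point is represented by its $(x, y, t)$ coordinates and the codeword index assigned by that tree; the rule thresholds are illustrative values only.
\begin{verbatim}
import numpy as np
from itertools import permutations

def satisfied_rules(pi, pj, To=5, Tb=30, Ta=30, Tn=20, Tf=80):
    """Indices (0-based) of the rules R1..R7 satisfied by the ordered pair (pi, pj)."""
    (xi, yi, ti), (xj, yj, tj) = pi, pj
    dx, dy = abs(xi - xj), abs(yi - yj)
    rules = []
    if abs(ti - tj) < To:
        rules.append(0)                                  # R1: overlap
    if To < tj - ti < Tb:
        rules.append(1)                                  # R2: before
    if To < ti - tj < Ta:
        rules.append(2)                                  # R3: after
    near_xy = dx < Tn and dy < Tn
    if near_xy:
        rules.append(3)                                  # R4: near XY
    if dx < Tn and not near_xy:
        rules.append(4)                                  # R5: near X
    if dy < Tn and not near_xy:
        rules.append(5)                                  # R6: near Y
    if dx < Tf and dy < Tf and not (dx < Tn or dy < Tn):
        rules.append(6)                                  # R7: far
    return rules

def hsrm_histogram(points, codes, K, n_rules=7):
    """3D histogram h(i, j, k) of one tree from points [(x, y, t), ...]
    and their codeword indices, where K is the number of leaves."""
    h = np.zeros((K, K, n_rules), dtype=np.float32)
    for a, b in permutations(range(len(points)), 2):     # all ordered pairs
        for k in satisfied_rules(points[a], points[b]):
            h[codes[a], codes[b], k] += 1
    return h
\end{verbatim}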
% done
\subsection{Pyramid match kernel for HSRM}
\label{sec/act/hsrmmatch}
% done
Quantisation is a major issue for bag-of-words models in object recognition tasks in general. It is often difficult to determine an optimal codebook size, which depends on both the training and testing data.
If a codebook is too small, useful information is discarded through quantisation errors, reducing its discriminative power.
Conversely, an oversized codebook is prone to noise and over-fitting: it produces sparse, flat visual codeword histograms, which undermine the final classification accuracy.
% done
In a STF, video patches are quantised hierarchically based on their semantic and appearance information. The following assumptions describe the properties of visual codewords generated by a tree-based codebook:
\begin{itemize}
\item A pair of codewords has higher similarity when they share a longer tree traversal path.
\item Each tree in the STF can be considered as a multi-resolution codebook. Quantisation resolution is adjusted by the depth of tree traversal.
\item Visual codewords of a parent tree-node share some semantic similarities with the codewords at the child nodes. Hence, codewords that share their traversal paths are semantically related.
\end{itemize}
% done
In figure \ref{fig/act/hsrm2}, a HSRM histogram is converted to different resolution levels, by recursively merging adjacent bins. As a result, pairs of spatiotemporal histograms can be compared using the pyramid match kernel (PMK) \cite{Grauman2005}. In section \ref{sec/act/kmeansforest}, the PMK algorithm is used as a similarity kernel in the proposed k-means forest classifier.
%It measures the similarity of two set of features in a multi-resolution histogram space.
The similarity between two sets of video cuboids $\vfastset$ and $\vfastsetx$ is measured by PMK from a multi-resolution histogram space for each tree.
A kernel function $\kernel(\vfastset,\vfastsetx) \rightarrow \mathbb{R}$ is defined as the similarity metric between two sets of input features over the input space $\codebooksize \times \codebooksize$, \ie all pairwise combinations of codewords within a decision tree.
% done
PMK matches the input histograms at each resolution. At a specific resolution $\resolute$, $\mathrm{h}^\resolute_{m}(i,j,k)$ and $\mathrm{\hat{h}}^\resolute_{m}(i,j,k)$ denote corresponding histogram bins of the two sets of codewords, $\vfastset$ and $\vfastsetx$; they are matched by histogram intersection:
\begin{equation}
\label{eqn/act/hik}
\mathrm{I}_{m}^\resolute(\vfastset,\vfastsetx) = \displaystyle\sum_{i=1}^{\codebooksize_m}\sum_{j=1}^{\codebooksize_m}\sum_{k=1}^{|\mathcal{R}|} \left( \min(\mathrm{h}^{\resolute}_m(i,j,k), \mathrm{\hat{h}}^{\resolute}_m(i,j,k)) \right).
\end{equation}
New quantisation levels in the histogram pyramid are formed by increasing bin size. In the proposed method, adjacent bins that share the same parent node in the tree are conveniently merged in equation \ref{eqn/act/merge}, creating a new quantisation level $\mathrm{h}^{\resolute+1}_m(i,j,k)$ and $\mathrm{\hat{h}}^{\resolute+1}_m(i,j,k)$:
\begin{equation}
\label{eqn/act/merge}
\mathrm{h}^{\resolute+1}_m(i,j,k) = \displaystyle\sum_{u=1}^{2}\sum_{v=1}^{2} \left( \mathrm{h}^{\resolute}_m(2(i-1)+u,2(j-1)+v,k) \right).
\end{equation}
The match kernel $\kernel_m$ for HSRM at the $m$-th tree is then defined in equation \ref{eqn/act/pmk} as the weighted sum of differences between successive histogram intersections:
\begin{equation}
% \label{eqn/act/newcount}
% D^{\resolute}_m(\vfastset,V) &= & I^\resolute_m(U,V) - I^{\resolute-1}_m(U,V) \\
\label{eqn/act/pmk}
\kernel_m(\vfastset,\vfastsetx) = \displaystyle\sum_{\resolute = 1}^{Q}\frac{1}{4^{\resolute-1}} \left( \mathrm{I}_{m}^{\resolute+1}(\vfastset,\vfastsetx) - \mathrm{I}_{m}^{\resolute}(\vfastset,\vfastsetx) \right).
\end{equation}
Matches in finer bins contribute more to the similarity than matches at coarser levels, by a factor of $\frac{1}{4^{\resolute-1}}$. A small bin implies higher similarity among the features within it; matches at high resolution (small bin size) are therefore assigned a higher weighting.
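The matching procedure of equations \ref{eqn/act/hik}--\ref{eqn/act/pmk} can be summarised by the following Python sketch, which assumes that the codeword dimension of each histogram has a power-of-two size so that sibling bins can be merged pairwise.
\begin{verbatim}
import numpy as np

def intersection(h1, h2):
    """Histogram intersection I(U, V) of two 3D HSRM histograms (equation hik)."""
    return float(np.minimum(h1, h2).sum())

def merge(h):
    """Merge sibling codeword bins (equation merge): halve the first two axes."""
    K, R = h.shape[0], h.shape[2]
    return h.reshape(K // 2, 2, K // 2, 2, R).sum(axis=(1, 3))

def pyramid_match(h1, h2, levels):
    """Weighted sum of the increase in matches over successive resolutions (equation pmk)."""
    score, prev = 0.0, intersection(h1, h2)
    for q in range(1, levels + 1):
        h1, h2 = merge(h1), merge(h2)            # histograms at resolution q + 1
        curr = intersection(h1, h2)
        score += (curr - prev) / 4.0 ** (q - 1)  # new matches, weighted by resolution
        prev = curr
    return score
\end{verbatim}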
\begin{figure}[ht]
\centering
\includegraphics[width=1\linewidth]{fig/act/act_hsrm_match.pdf}
\caption{\textbf{Matching of HSRM histograms.} Codewords of adjacent tree nodes are semantically similar; a pyramid match kernel is therefore used as the similarity metric when learning the k-means forest classifier. Low-resolution HSRM histograms are created by merging adjacent bins.}
\label{fig/act/hsrm2}
\end{figure}
\subsection{Kernel k-means forest classifier}
\label{sec/act/kmeansforest}
%done
The HSRM algorithm is applied to learn a k-means forest classifier from the video snippets. Given a set of training video data $\vfastset_i$, $M$ independent clustering trees are trained by recursively performing the k-means algorithm, using HSRM as the similarity metric.
For the $m$-th tree in the STF, the hierarchical k-means algorithm splits the current training data into $N$ clusters, $\mathbf{S} = \{\mathbf{S}_1,\dots,\mathbf{S}_N\}$, so as to maximise the intra-cluster similarity:
\begin{equation}
\displaystyle\argmax_{\mathbf{S}} \displaystyle\sum_{i = 1}^{N}\sum_{\vfastset \in \mathbf{S}_{i}} \kernel_m(\vfastset, \mu_{m,i}),
\label{eqn/act/kmeans}
\end{equation}
where $\mu_{m,i}$ is the centroid of $\mathbf{S}_i$. In the testing stage, HSRM is evaluated between a query video snippet $\vfastsetx$ and all centroids $\mu_{m,i}$ at the same level. The query video snippet proceeds to the node with the highest similarity, \ie the highest HSRM value. Matching of spatiotemporal relationship histograms is performed recursively until a terminal node is reached. Classification is performed by averaging the posterior distributions $P_{HSRM}(c|\hat{\mu}_{m})$ of the assigned leaf nodes $\{ \hat{\mu}_m \}$ over the $M$ trees in the k-means forest:
\begin{equation}
P_{HSRM}(c|\vfastsetx) =
\frac{1}{M}\displaystyle\sum_{m=1}^{M}P_{HSRM}(c|\hat{\mu}_{m}).
\label{eqn/act/kmeansforest}
\end{equation}
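The testing-time descent through the k-means forest can be sketched as follows. The node objects (child clusters, centroid histograms and leaf posteriors) are a hypothetical simplification of the trained model, and \texttt{pyramid\_match} stands for the HSRM kernel $\kernel_m$ of equation \ref{eqn/act/pmk}.
\begin{verbatim}
import numpy as np

def classify_snippet(forest, hists, pyramid_match, levels=3):
    """Average the leaf posteriors reached in each kernel k-means tree.

    forest : list of M root nodes; node.centroids and node.children are parallel
             lists, and node.posterior holds the class distribution at a leaf
    hists  : M HSRM histograms of the query snippet, one per STF tree
    """
    posteriors = []
    for root, h in zip(forest, hists):
        node = root
        while node.children:                    # descend until a terminal node
            # follow the child whose centroid histogram is most similar under HSRM
            scores = [pyramid_match(h, c, levels) for c in node.centroids]
            node = node.children[int(np.argmax(scores))]
        posteriors.append(node.posterior)
    return np.mean(posteriors, axis=0)          # equation (kmeansforest)
\end{verbatim}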
% \section{Combined classification with bag-of-semantic-textons}
% \label{sec/act/combine}
% done
\section{Bag of semantic textons}
The bag-of-semantic-textons (BOST) method is derived from the standard bag-of-words model, as used for image segmentation in \cite{Shotton2008}. For each testing video snippet, a histogram $\bost$ is computed from the semantic textons at every node of the STF; the size of $\bost$ therefore equals the total number of non-root tree nodes, \ie $2\codebooksize - 2M$.
%Since the dimension of $\bost$ $\codebooksize$ is relatively low (\cf the HSRM histogram has $\codebooksize_m \times \codebooksize_m \times |\mathrm{R}|$ dimension),
A standard random classification forest \cite{Breiman2001} is used as the classifier for BOST; random forests have proved powerful in image-based classification.
The random forest trained on the BOST histograms classifies a query video $\vfastset$ by the mean class distributions of the assigned leaf nodes $\{\hat{l}_1,\dots,\hat{l}_{M}\}$:
\begin{equation}
\mathrm{P}_{B}(c|\vfastset) = \frac{1}{M}\displaystyle\sum_{m=1}^{M}P_{BOST}(c|\hat{l}_m).
\label{eqn/act/bost}
\end{equation}
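The BOST histogram can be accumulated during the same tree traversal that generates the codewords. The sketch below reuses the simplified node representation of the earlier STF sketch and assumes a precomputed index for every non-root node.
\begin{verbatim}
import numpy as np

def bost_histogram(forest, patches, node_index, n_nodes):
    """Bag-of-semantic-textons histogram over all non-root nodes of the STF.

    forest     : list of semantic texton trees (root STNode objects)
    patches    : flattened video cuboids of one snippet
    node_index : mapping from a non-root node to its histogram position
    n_nodes    : total number of non-root nodes in the forest
    """
    hist = np.zeros(n_nodes, dtype=np.float32)
    for patch in patches:
        for tree in forest:
            node = tree
            while node.leaf_id is None:
                test = node.w1 * patch[node.idx1] - node.w2 * patch[node.idx2]
                node = node.right if test > node.threshold else node.left
                hist[node_index[node]] += 1     # count every visited non-root node
    return hist

# A random forest classifier trained on such histograms then yields P_B(c|V)
# by averaging the class distributions of its assigned leaf nodes.
\end{verbatim}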
%done
\section{Late fusion classification}
Final action recognition is performed separately by the proposed kernel k-means forest classifier and by BOST. While HSRM proves effective in most cases owing to its use of both local and structural information, BOST works well on some particular action classes owing to its powerful random forest classifier. By integrating the classification results of both methods, the average performance is significantly improved.
The final class label is assigned to the class $c$ that obtains the highest posterior probability under a late fusion scheme:
\begin{equation}
\displaystyle\argmax_{c}
\mathrm{P}(c|\vfastsetx) =
\displaystyle\argmax_{c}
\left( \alpha_c \mathrm{P}_{HSRM}(c|\vfastsetx) + (1 - \alpha_c) \mathrm{P}_{B}(c|\vfastsetx) \right).
\label{eqn/act/combine}
\end{equation}
The parameter $\alpha_c$ controls the relative weights of $\mathrm{P}_{HSRM}(c|\vfastsetx)$ and $\mathrm{P}_{B}(c|\vfastsetx)$ in the late fusion scheme. It is set to maximise the true positive rate (sensitivity) of class $c \in \nclass$ over the given training dataset $\tdata$, using gradient descent or a line search, as in equation \ref{eqn/act/alpha}:
\begin{equation}
\begin{aligned}
\alpha_c & = \displaystyle\argmax_{\alpha}\mbox{Sensitivity}(\tdata,\alpha) \\[3mm]
& = \displaystyle\argmax_{\alpha}\displaystyle\frac{\mbox{True Positive}}{\mbox{True Positive} + \mbox{False Negative}} \\[3mm]
&= \displaystyle\argmax_{\alpha}\displaystyle\frac{\left|\left\{\vfastset_d \in \tdata : \displaystyle\argmax_{c'}\mathrm{P}(c'|\vfastset_d) = c,\; \mbox{Ground Truth}(\vfastset_d) = c\right\}\right|}{\left|\left\{\vfastset_d \in \tdata : \mbox{Ground Truth}(\vfastset_d) = c\right\}\right|}.
\end{aligned}
\label{eqn/act/alpha}
\end{equation}
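Since the sensitivity in equation \ref{eqn/act/alpha} is piecewise constant in $\alpha$, a simple line search over a grid of candidate values suffices in practice. The sketch below assumes that the per-class posteriors of both methods have already been computed for every training snippet.
\begin{verbatim}
import numpy as np

def fit_alpha(p_hsrm, p_bost, labels, c, grid=np.linspace(0.0, 1.0, 101)):
    """Choose alpha_c maximising the sensitivity of class c on the training set.

    p_hsrm, p_bost : arrays of shape (n_snippets, n_classes), per-method posteriors
    labels         : NumPy array with the ground-truth class index of each snippet
    """
    in_class = labels == c                          # snippets whose ground truth is c
    best_alpha, best_sens = 0.5, -1.0
    for alpha in grid:
        fused = alpha * p_hsrm + (1.0 - alpha) * p_bost
        predicted = fused.argmax(axis=1)
        tp = np.sum((predicted == c) & in_class)    # true positives for class c
        sens = tp / max(in_class.sum(), 1)          # TP / (TP + FN)
        if sens > best_sens:
            best_alpha, best_sens = alpha, sens
    return best_alpha
\end{verbatim}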
\section{Experiments}
\label{sec/act/experiments}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Example frames of KTH and UT-interaction
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{figure}[th]
\centering
\includegraphics[width=1\linewidth]{fig/act/frames.png}
\caption{\textbf{Evaluation datasets.} Example frames of the KTH (top row) and the UT-interaction (bottom row) dataset.}
\label{fig/act/frames}
\end{figure}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Experimental setup}
% done
The proposed action classification system was evaluated on two public benchmarks, namely the KTH dataset \cite{Schuldt2004} and the UT-interaction dataset \cite{Ryoo2010}.
Other existing approaches were compared with the proposed approach in terms of recognition accuracy.
Run-time performance of the proposed action classification system was also reported. A prototype system was implemented in C++. Experiments were performed on an \emph{Intel Core}\texttrademark \; i7 $2.8$GHz PC, where the system demonstrated real-time (about $15$ frames per second) and continuous performance.
The prototype system was also deployed on an \emph{Intel Core 2 Duo} $2.4$GHz laptop, where it ran at approximately $10$ frames per second on the KTH dataset.
No multi-core optimisation was enabled in the prototype system. It is expected that the run-time performance can be further improved if the program code is optimised for multi-core processing units.
%The proposed method was tested on the well-known KTH dataset \cite{Schuldt2004} and the UT-interaction data \cite{Ryoo2010}, a more challenging set \cite{Ryoo2009}. Other published state-of-the-art approaches are compared with the proposed method in terms of recognition accuracy. Computational time of our method is also reported. The experimental prototype was implemented in C++ on a standard PC. The system runs in real-time and makes decisions in a continuous manner (see the supplementary video submitted).
\subsection{KTH}
% done
% dataset introduction KTH
The KTH dataset is an established benchmark for action recognition research.
It contains sequences of six action classes (boxing, hand-clapping, hand-waving, jogging, running and walking), captured in four different environments.
The videos are taken with changing subjects, camera poses, scales and appearances, as illustrated in figure \ref{fig/act/frames}.
The video sequences in the KTH dataset have an average length of $19$ seconds, at a frame rate of $30$ frames per second.
To recognise actions within a short response time, video snippets of $1$ second ($30$ frames) in length were extracted on-the-fly from the training and testing sequences.
Training subsequences (video snippets) were sampled every $5$ frames from the training videos and collected to build the classifiers.
Testing snippets were densely sampled from testing videos in a similar manner.
Leave-one-out cross validation was used in the experiments.
Most existing work evaluated action classification at the sequence level, where one class label was assigned to the entire testing video instead of individual frames or short subsequences.
To put the proposed method in context, two different accuracy measures are defined:
\begin{enumerate}
\item ``Snippet'' (subsequence) level accuracy: measured directly on the individual subsequences;
\item Sequence level accuracy: the class label of a sequence is determined by majority voting over the classification labels of its subsequences, as sketched below.
\end{enumerate}
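A sequence-level decision is obtained from the snippet-level outputs by a simple majority vote:
\begin{verbatim}
from collections import Counter

def sequence_label(snippet_labels):
    """Sequence-level class decision by majority voting over snippet labels."""
    return Counter(snippet_labels).most_common(1)[0][0]
\end{verbatim}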
% done
Table \ref{tab/act/exparam} summarises the parameters used in the experimental setup. They were set to balance accuracy, speed and generalisation; experimental results showed that small adjustments to the training parameters brought only marginal changes in overall accuracy.
\begin{table}
\centering
\begin{tabular}{|c|c|c|}
\hline
\textbf{Parameter} & \textbf{Description} & \textbf{Value} \\
\hline
\multicolumn{3}{|c|}{\textbf{VFAST spatiotemporal interest points}}\\
\hline
$r_{XY}, r_{XT}, r_{YT}$ & radii of the Bresenham circles in VFAST & $3$ \\
$\delta^{FAST}_{XY}$ & threshold of VFAST on the $XY$ plane& $35$\\
$\delta^{FAST}_{XT}$ & threshold of VFAST on the $XT$ plane& $50$\\
$\delta^{FAST}_{YT}$ & threshold of VFAST on the $YT$ plane& $50$\\
Dimensions$(\vpatch)$ & dimensions of video cuboid (x,y,t)& $(9,9,5)$ \\
\hline
\multicolumn{3}{|c|}{\textbf{Spatiotemporal semantic forest codebook}}\\
\hline
$stf_{dmax}$ & maximum depth of random forest codebook & $8$ \\
$stf_{bootstrap}$ & bootstrap factor of random forest codebook & $0.7$ \\
$stf_M$ & number of trees in the random forest codebook & $15$ \\
\hline
\multicolumn{3}{|c|}{\textbf{Classification}} \\
\hline
$km_{dmax}$ & maximum depth of K-means forest & $3$ \\
$km_k$ & branching factor $k$ of K-means forest & $10$ \\
$rf_{dmax}$ & maximum depth of the BOST forest & $8$ \\
$rf_{bootstrap}$ & bootstrap factor of the BOST forest& $0.7$ \\
$rf_M$ & number of trees in the BOST forest& $15$ \\
\hline
\end{tabular}
\caption{\textbf{Training parameters.}}
\label{tab/act/exparam}
\end{table}
% done
Table \ref{tab/act/compare} presents a detailed comparison between the proposed method and other state-of-the-art approaches. The combined HSRM-BOST model demonstrates promising results, even though short video snippets are used to recognise the actions.
The confusion matrices in figure \ref{fig/act/confusion} show that HSRM and BOST complement each other to attain improved accuracy.
Compared with the original spatiotemporal relationship match method of Ryoo and Aggarwal \cite{Ryoo2009}, quantisation effects are alleviated by the tree-based STF codebook and the pyramid match kernel.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Confusion matrices
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{figure}[ht]
\centering
\begin{subfigure}[t]{0.32\linewidth} \centering
\includegraphics[width=1\linewidth]{fig/act/bost.png}
\subcaption{BOST only}
\end{subfigure}
\begin{subfigure}[t]{0.32\linewidth} \centering
\includegraphics[width=1\linewidth]{fig/act/bost.png}
\subcaption{HSRM only}
\end{subfigure}
\begin{subfigure}[t]{0.32\linewidth} \centering
\includegraphics[width=1\linewidth]{fig/act/bost.png}
\subcaption{HSRM + BOST}
\end{subfigure}
\caption{\textbf{Confusion matrices of the KTH dataset.} From left to right: BOST, HSRM, and combined classification using late fusion.}
\label{fig/act/confusion}
\end{figure}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% accuracy table: KTH
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\input{include/act_acc}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% done
Table \ref{tab/act/speed} shows the experimental results on run-time performance. Unlike other sequence-level recognition approaches, a more realistic metric is used to measure algorithm efficiency: all stages of the recognition pipeline, including feature detection, feature extraction and classification, were timed. The average processing speed in frames per second is defined in equation \ref{eqn/act/fps}:
\begin{equation}
\mbox{FPS} = \frac{\mbox{Total number of subsequences}\times\mbox{Average number of frames}}{\mbox{Total recognition time}}.
\label{eqn/act/fps}
\end{equation}
The proposed action recognition system ran at $10$ to $20$ frames per second.
The introduction of STF greatly improved the run-time performance of feature extraction and codeword generation by avoiding expensive feature computations, outperforming k-means codebooks (see also Table \ref{tab/act/codebook}). Decision tree-based classifiers, namely the random forest and the kernel k-means forest, proved to be efficient solutions for matching and recognising multi-dimensional codeword histograms compared with traditional techniques such as nearest neighbour and SVM.
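As an illustration, the quantity in equation \ref{eqn/act/fps} could be measured as sketched below: the full pipeline is timed over all subsequences and the total number of processed frames is divided by the total recognition time. The function and variable names are placeholders, not part of the actual implementation.
\begin{verbatim}
import time

def average_fps(subsequences, recognise):
    # subsequences: list of video snippets, each a list of frames.
    # recognise: the full pipeline (detection, extraction, classification).
    total_frames = sum(len(snippet) for snippet in subsequences)
    start = time.perf_counter()
    for snippet in subsequences:
        recognise(snippet)   # every stage is included in the timing
    total_time = time.perf_counter() - start
    return total_frames / total_time
\end{verbatim}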
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Speed table
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% DONE
\begin{table}
\centering
\begin{tabular}{|c|c|c|}
\hline
\backslashbox{\textbf{Process}}{\textbf{Dataset}} & \textbf{KTH} & \textbf{UT-interaction}\\
\hline
\multicolumn{3}{|c|}{\textbf{Processing time per frame (milliseconds)}}\\
\hline
VFAST detection & 15.13 & 28.49 \\
STF \& BOST & 16.86 & 38.75 \\
HSRM & 5.15 & 28.49 \\
Random forest & 0.87 & 1.63 \\
K-means forest & 14.90 & 2.34 \\
\hline
Total time required & 52.92 & 99.71 \\
\hline
\multicolumn{3}{|c|}{\textbf{Average frame-per-second (FPS)}}\\
\hline
Average FPS & \textbf{18.89} & \textbf{10.03} \\
\hline
\end{tabular}
\caption{\textbf{Run-time performance of the proposed algorithm.} Per-frame processing time of each stage in milliseconds and the resulting average recognition speed in frames per second (FPS).}
\label{tab/act/speed}
\end{table}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{UT-interaction dataset}
% dataset introduction UT
The UT-interaction dataset contains six classes of realistic human-to-human interactions, including shaking hands, pointing, hugging, pushing, kicking and punching, as visualised in figure \ref{fig/act/frames}.
Challenging factors of this dataset include moving backgrounds, cluttered scenes, camera jitter and zoom, and clothing changes.
In the experiments, the segmented UT-interaction dataset was used for evaluating the recognition accuracy and speed of the proposed action classification system.
As shown in table \ref{tab/act/utcompare}, the proposed method achieves better accuracy in recognising realistic human-human interactions than the method of Ryoo and Aggarwal \cite{Ryoo2009}.
For complex human interactions, HSRM, which uses both local-appearance and structural cues, appeared more stable than BOST, which relies on local appearance alone.
Nevertheless, combining the two still improves the overall recognition accuracy.
From table \ref{tab/act/speed}, the proposed system ran in real time, at more than $10$ frames per second.
The drop in run-time speed for the UT-interaction experiment is mainly due to the extra interest points detected on other moving objects in the background.
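The exact late-fusion rule is not spelled out in this section; one plausible realisation, assumed here for illustration, combines the normalised per-class scores of HSRM and BOST by a weighted sum before taking the arg-max. The mixing weight is a hypothetical parameter.
\begin{verbatim}
import numpy as np

def late_fusion(scores_hsrm, scores_bost, weight=0.5):
    # scores_hsrm, scores_bost: per-class scores from the two classifiers.
    p1 = np.asarray(scores_hsrm, dtype=float)
    p2 = np.asarray(scores_bost, dtype=float)
    p1 /= p1.sum()                         # normalise each score vector
    p2 /= p2.sum()
    combined = weight * p1 + (1.0 - weight) * p2
    return int(np.argmax(combined))        # index of the predicted class
\end{verbatim}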
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% accuracy table: UT-interaction
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{table}
\centering
\begin{tabular}{|c|c|c|c|c|}
\hline
\backslashbox{\textbf{Action}}{\textbf{Method}} & \textbf{HSRM-BOST} & \textbf{HSRM} & \textbf{BOST} & \textbf{SRM}\cite{Ryoo2009} \\
\hline
Shake & \textbf{\color{blue}100.0} & 90.0 & 80.0 & 75.0 \\
Hug & \textbf{\color{blue}65.0} & 50.0 & 50.0 & 87.5 \\
Point & \textbf{\color{blue}100.0} & 85.0 & 100.0 & 62.5 \\
Punch & \textbf{\color{blue}85.0} & 65.0 & 65.0 & 50.0 \\
Kick & \textbf{\color{blue}75.0} & 70.0 & 25.0 & 75.0 \\
Push & \textbf{\color{blue}75.0} & 40.0 & 35.0 & 75.0 \\
\hline
\textbf{ Overall } & \textbf{\color{blue}83.33} & \textbf{66.67} & \textbf{59.16} & \textbf{70.8} \\
\hline
\end{tabular}
\caption{\textbf{Classification result of the UT-interaction dataset.} Sequential classification accuracy on the UT-interaction dataset; leave-one-out cross-validation was used in the experiment.}
\label{tab/act/utcompare}
\end{table}
\section{Summary}
\label{sec/act/discussion}
%advantages: fast, fully utilises structural information
%limitation: still have some lookahead, not so powerful features, localisation not optimised
In this chapter, the topic of video-based action recognition is posed as a 3D object classification task.
Addressing the efficiency issues of existing techniques, a new real-time action recognition algorithm is proposed.
Real-time performance is achieved by semantic texton forests, which operate directly on video pixels to generate visual codewords extremely quickly. The hierarchical spatiotemporal relationship match (HSRM) is also introduced to capture both the spatiotemporal structure and the local appearance of video snippets.
Furthermore, the random forest and kernel k-means forest classifiers further accelerate recognition.
This work takes short frame sequences, namely video snippets, as its basic input; this snippet-based classification facilitates a responsive system.
The proposed system demonstrated accuracy comparable to state-of-the-art approaches in the experimental results.
% Move to conclusions
% Future challenges include tackling more complex realistic human actions and partial occlusions, as well as performing continuous action detection in real-time.
However, there are several limitations to the proposed approach. Firstly, its feature detection and feature extraction are not scale invariant: the VFAST interest point detector operates on raw voxel values rather than on a scale space, and fixed-size cuboids are extracted from the input videos. Secondly, while the proposed approach performs well in controlled environments, \eg the KTH dataset, it is vulnerable to appearance changes, such as clothing or moving background objects. Finally, the relationships among codewords are described by a fixed set of association rules in equation \ref{eqn/act/rules}; these manually designed association rules are limited to a specific set of action data, \eg the KTH dataset or the UT-interaction dataset.
% The action classification algorithm presented in this chapter is not completely compatible in chapter \ref{chap/body}.