-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathMETEOR.Rmd
568 lines (385 loc) · 33.1 KB
/
METEOR.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
---
title: "METEOR - Dynamic Bayesian Outlier Detection"
author: "Jorge Serras"
date: "March 2019"
output:
html_document: default
bibliography: bibliography.bib
---
## README:
> It is advice the use of RStudio IDE and R version of atleast 3.5.2, due to the installation process.
The script will automatically install all required packages if not already.
The relative positioning of all the files and folders should not be tempered with, even output files. Copies should thus be done.
For more information or any doubt concerning the given resources, please consider gathering additional insight or contact through https://jorgeserras.com/
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
if(!require('rJava')) install.packages("rJava") # If a java update was performed or some other issue, reinstall rJava
if(!require('pacman'))install.packages('pacman') # Automatically installs and loads the required packages
pacman::p_load(jmotif, plyr, dplyr, qdapTools, data.table, uroot, stsm, mclust, stats, dmm, DiagrammeR, ggplot2, reshape2, knitr)
library("rJava") # Problems with Java? try: https://github.com/s-u/rJava/issues/163 Furthermore the Java version of your system must be the same as used by rJava
.jinit()
.jaddClassPath('METEOR.jar')
library("xlsx")
library("imputeTS")
library("DiagrammeR")
library(knitr)
#setwd(dirname(rstudioapi::getActiveDocumentContext()$path))
source(file.path('auxiliar', 'formatting.R'))
source(file.path('auxiliar', 'auxiliar.R'))
source(file.path('DBN', 'DBN_scoring.R'))
source(file.path('discretization', 'sax.R'))
source(file.path('score_analysis', 'score_analysis.R'))
as.numeric.factor <- function(x) {as.numeric(levels(x))[x]}
my.file.rename <- function(from, to) {
todir <- dirname(to)
if (!isTRUE(file.info(todir)$isdir)) dir.create(todir, recursive=TRUE)
file.rename(from = from, to = to)
}
read_path <- paste(getwd(),file.path('/resources'), sep = "") # Input files go to this folder
write_path <- paste(getwd(),file.path('/resources', 'output_files'), sep = "") # Output files go to this folder
DBN_write_path <- paste(getwd(),file.path('/resources/output_files', 'DBN'), sep = "") # Output files go to this folder
```
```{r echo=FALSE, include=FALSE}
```
***
## About
The current study has the aim of providing further research and resources into the detection and modeling of anomalous behavior in multivariate temporal data. A complete system is built to handle multivariate time series (MTS) from pre-processing to score-analysis known as MultivariatE Time sEries OutlieR (METEOR) [@webapp].
All data including time stamps are first pre-processed and discretized to ensure the correct employment of the proposed methods. The developed system is based on Dynamic Bayesian Networks (DBN). METEOR is made available for anomaly detection and followed step-be-step. A web application and tutorial are additionally made accessible [@webapp] for any endeavor together with supplementary information.
Furthermore, generated output files are available in folder **/resources/output_files** where several subsets and transformations of the input dataset are present. Additionally, there is a folder **/DBN** which has the output files concerning outlier detection using the DBN (METEOR) approach.
###Abstract
> Outliers can be defined as observations which are suspected of not have been generated by data's underlying processes. Many applications require a way of identifying interesting or unusual patterns in multivariate time series (MTS), however, most outlier detection methods focus solely on univariate series, providing analysts with strenuous solutions. We propose a complete outlier detection system covering problems since pre-processing that adopts a dynamic Bayesian network modeling algorithm. The latter encodes optimal inter and intra time-slice connectivity of transition networks capable of capturing conditional dependencies in MTS datasets. A sliding window mechanism is employed to gradually score each MTS transition given the model. Simultaneously, complete MTS are evaluated. Two score-analysis strategies are studied to assure an automatic boundary classifying anomalous data. The proposed approach is first validated through simulated data, demonstrating the system's performance. Comparison with an assembled probabilistic suffix tree method is available displaying the leverage of the proposed multivariate approach. Further experimentation is made on real data, by uncovering anomalies in distinct scenarios such as electrocardiogram series, mortality rate data and written pen digits. The developed system proved beneficial in capturing unusual data resulting from temporal contexts, being suitable for any MTS scenario. A widely accessible web application employing the complete system is made available jointly with a tutorial.
***
## Formatting
To begin, the input data is loaded. If the dataset is not in the correct format, please check the examples and tools made available to ensure a correct employment. The continuous MTS input dataset should be in the **/resources** folder. The code below must be changed accordingly and the number of variables **n_variables** specified. If data is already discretized, skip the SAX Discretization phase.
```{r loading}
####
original_continuous_data <- as.data.frame(read.csv(file = paste(read_path, "/continuous_example_data.csv", sep = ""), header=TRUE))
n_variables <- 2 # For the SAX plot function, 1 means the subjects are univariate
```
***
## SAX Discretization
In the first phase, a method known as Symbolic Aggregate approXimation (SAX) [@lin2003symbolic] is applied to discretize time series (TS) into strings. It is capable of transforming an univariate TS of length n after normalization into a string of arbitrary length w. The latter length can be w << n if Piecewise Aggregate Approximation (PAA) is applied. This ensures dimensionality reduction of the normalized TS when excessive measurements are present. An alphabet size must be chosen to determine the number of symbols for the final discretized sequences. Such assumes that normalized TS possess Gaussian distribution probabilities with unitary standard deviation and null mean. For additional information, please see [@lin2003symbolic].
SAX does not allow missing values, imputation of the missing values should be considered as future work. The SAX discretization is the same available in the webapp: https://jorgeserras.shinyapps.io/outlierdetection/
Parameters can be **changed** in the code below, allowing the adjusting of the alphabet size, PAA reduction size and an auxiliar function for the plot. Feel free to experiment until results are satisfactory.
**ATTENTION:** It is worth noting that since each TS is normalized, the offset is lost in the process. Such means that it could happen that subjects with extreme values (for example) are brought to zero and be later classified as normal. Visual inspection or the addition of a step before SAX is advised to prevent such scenarios.
```{r SAX}
### Discretization:
panel_data <- as.data.frame(parseToPanel(original_continuous_data))[,1:(n_variables+2)]
### SAX parameters:
alphabet_size <- 3
paa_size <- as.integer(max(panel_data$timestamp)) #paa_size = ncol() - 1 to prevent dimensionality reduction, -1 because of the "subject_id" column. This is for the univariate case, if data is multivariate, paa_size must be equal to the number of time slices to avoid performing PAA reduction
### SAX parameters: alphabet_size and PAA_size which gives the new reduced numer of time slices for each TS
discrete_outputs <- discretize(panel_data = panel_data, alphabet_size = alphabet_size, paa_size = paa_size)
### Retrieve solely the discretized dataset:
discrete_data <- discrete_outputs[[1]] # The variable names are different but no worries, they are easily recovered later if necessary
normalized_data <- discrete_outputs[[2]]
timestamp <- discrete_outputs[[3]]
subject_id_data <- data.frame(subject_id = discrete_data$subject_id)
### Save in file:
write.csv(discrete_data, file = paste(write_path, "/discrete_data.csv", sep = ""), row.names = FALSE, quote = TRUE)
```
The results of the discretization can be analyzed in the figure below, which displays the normalized data before preforming PAA reduction (if considered). It shows the mean and standard deviation of each time slice of the dataset. Every TS displayed is normalized with mean = 0 and std = 1. If the number of variables is higher than 1, several curves occur, one for each variable of the MTS. The breakpoints are also visible (pink), delimiting the regions of each symbol of the alphabet chosen. An alphabet size of 3, for example, displays 2 breakpoints separating the entire y domain in 3 equiprobable regions according to a Normal distribution N(0,1).
```{r SAX_plot, echo=FALSE}
mean_std_plot(normalized_data, timestamp, n_variables, alphabet_size)
```
Results of SAX pre-processing including PAA reduction (if considered). Parameters can be **changed below** to specify which subjects to observe.
```{r SAX_discrete_table, results = 'asis'}
kable(data.frame(discrete_data[1:7,]),caption="SAX discretized subjects")
```
```{r, include=FALSE}
rm(panel_data)
rm(alphabet_size)
rm(timestamp)
rm(n_variables)
rm(paa_size)
```
***
**ATTENTION:** If PAA reduction was performed, the transition plots displaying outliers as red **are not correct**, since these display the input data before the reduction. Nevertheless, all computations and conclusions are correct. Users should also remember that they performed PAA, and a transition *t* classified as anomalous does not correspond to the same transition of the input data (in the case of PAA).
## METEOR application (DBN approach)
Many applications require a way of identifying interesting or unusual patterns in multivariate time series (MTS), however, most outlier detection methods focus solely on univariate series, providing analysts with strenuous solutions. We propose a complete outlier detection system known as METEOR [@webapp] covering problems since pre-processing that adopts a dynamic Bayesian network (DBN) modeling algorithm [@monteiro2015polynomial]. The latter encodes optimal inter and intra time-slice connectivity of transition networks capable of capturing conditional dependencies in MTS datasets. A sliding window mechanism is employed to gradually score each MTS transition given the model. Simultaneously, complete MTS are evaluated. The developed system proved beneficial in capturing unusual data resulting from temporal contexts, being suitable for any MTS scenario. A widely accessible web application [@webapp] employing the complete system is made available jointly with a tutorial.
Outliers are considered to be observations suspicious of not have being generated by the data's main underlying processes. Such instances generally do not fit the trained model. The goal is to score portions as well as entire MTS. For each transition of a subject's MTS, a window composed by its observed attributes is built, dependent of the order of the trained DBN. A transition score represents the log-likelihood (LL) of the observed window computed using the network's conditional probabilities. It is worth noting that a transition score does not represent the outlierness of a time slice, but rather its context. A score of an entire TS (subject) is computed using the average of all the transition scores from that subject. The most negative a score is, the more anomalous it is.
By having parameters markov_lag = 1 and previous_parents = 1, for example, each measurement at time t is associated to its immediate precedent (value at slice t-1).
Issues may occur in this phase due to rJava package. This happens due to the still in progress development of the package (bugs and features are still being coded) and Java versions in question. Most of the times, running the script repeatedly fixes the issues (it simply does, don't ask me). Try to rerun until it works. If problems continue to occur, consider reinstalling rJava and updating Java (unnistalling previous version) or using the online version of the Web application https://jorgeserras.shinyapps.io/outlierdetection/ instead of running through code. The Dynamic Bayesian Network Outlier Detection system for multivariate time series is used. Check the main webpages: https://jorgeserras.shinyapps.io/outlierdetection/ and https://jorgeserras.com/MTS_OutlierDetection/ for additional information. Don't forget there is a tutorial video available!
**NOTE:** If the SAX discretization phase was skipped, due to data already being discrete, please insert the discrete MTS dataset in format .csv in the folder "/resources/output_files" and change the name below from "/discrete_data.csv" to the name of your file. There is an example file in the folder named "/discrete_example_data.csv".
```{r DBN}
### Both transitions and complete subjects are scored
### Name of the file where the discrete data is stored:
data_file_name <- "/discrete_data.csv"
DBN_output <- DBN_scoring(data_file_name, write_path, markov_lag = 1, previous_parents = 1, checkbox_stationary = TRUE, score_function = "ll")
DBN_transition_scores <- DBN_output[[1]]
DBN_subject_scores <- DBN_output[[2]]
# Join all the DBN scores (transitions and subjects) in one file/data.frame
DBN_scores <- DBN_transition_scores[,-1]
DBN_scores <- cbind(data.frame(subject_id=subject_id_data),DBN_scores,data.frame(subject_score=DBN_subject_scores$score))
# Restore the true subject_id values that are lost in the process
DBN_transition_scores$subject_id <- NULL
DBN_subject_scores$subject_id <- NULL
DBN_transition_scores <- cbind(data.frame(subject_id = subject_id_data, DBN_transition_scores))
DBN_subject_scores <- cbind(data.frame(subject_id = subject_id_data, DBN_subject_scores))
```
```{r, include=FALSE}
rm(L)
```
```{r DBN_output_score_results, echo = FALSE, include = FALSE}
write.csv(DBN_transition_scores, file = paste(DBN_write_path, "/DBN_scores_transitions.csv", sep = ""), row.names = FALSE)
write.csv(DBN_subject_scores, file = paste(DBN_write_path, "/DBN_scores_subjects.csv", sep = ""), row.names = FALSE)
#Delete files if they exist
if (file.exists("scores_transition_output.csv"))
file.remove("scores_transition_output.csv")
if (file.exists("subject_scores_output.csv"))
file.remove("subject_scores_output.csv")
write.csv(DBN_scores, file = paste(DBN_write_path, "/DBN_scores_combined.csv", sep = ""), row.names = FALSE)
DBN_sorted_scores <- data.frame(DBN_scores[order(DBN_scores$subject_score),])
write.csv(DBN_sorted_scores, file = paste(DBN_write_path, "/DBN_scores_combined_sort.csv", sep = ""), row.names = FALSE)
```
Visualization of the DBN modeled, this can be stationary or non-stationary. Nodes blue outlined concern attributes at the present time slice (t) being scored. All other nodes concern attributes from previous slices (t-1,...) according to the structure of the transition networks. A node Vi[t] concerns the value of the i-th variable from the dataset at time slice t.
By default, the DBNs are modeled using a LogLikelihood function, a parameter can be changed to used the MDL instead. Such allows a more constraint training, which usually reduces the number of connections in the transition networks.
```{r DBN_figure}
grViz("dbn_structure_dot") # Plot the DBN using file 'dbn_structure_dot'
```
```{r DBN_structure, echo = FALSE, include = FALSE}
my.file.rename(from = paste(getwd(),"/dbn_structure_dot", sep = ""),
to = paste(DBN_write_path,"/dbn_structure_dot",sep = "")) # Move file's location to the DBN output folder
my.file.rename(from = paste(getwd(),"/dbn_structure_output", sep = ""),
to = paste(DBN_write_path,"/dbn_structure_output",sep = "")) # Move file's location to the DBN output folder
```
The scoring results for transitions and entire subjects are observable below, being the subjects sorted by their subject scores (the lower the value, the more anomalous). The subjects to be displayed can be specified in the function's argument. The results are written in files **DBN_scores_combined.csv** and **DBN_scores_combined_sort.csv**, available in the folder '/resources/output_files/DBN' together with all other DBN outputs.
```{r DBN_all_scores}
kable(data.frame(DBN_sorted_scores[1:7,]),caption="DBN transition and subject scores")
```
***
## SCORE-ANALYSIS
A score-analysis phase is preformed to ensure the disclosure of the boundary between both classes "Anomalous" and "Normal". Two strategies are tested, Tukey's strategy [@tukey1977exploratory] and Gaussian Mixture Models (GMM) [@mclachlan2018finite] using 2 classes. The latter is preformed using package "mclust" [@fraley2017package]. More information is also available in the webpages of the developed web application. Both score-analysis approaches are applied to subject scores and to transition scores from the DBN approach.
Tukey's strategy uses the Interquartile Range (IQR), which is the difference between the third and first quartiles of the score's distribution. This measures the values' dispersion. The computed boundary comes thus:
$$ Threshold = Q1 - (1.5 \cdot IQR), \quad IQR = Q1 - Q3,$$ meaning that scores below the threshold are classified as anomalous. Tukey's strategy assumes normal scores form a symmetric cluster, meaning outliers form a negative skew in the distribution.
In the GMM score-analysis, disjoint score distributions are tackled. Scores are assumed to be generated from a finite mixture of Gaussian distributions with unknown parameters. Score distributions are modeled as mixtures of two Gaussian curves. Labeling each score becomes a classification problem among two classes C1 and C2, representing abnormality and normality respectively. The problem is defined as uncovering the value of
$$P(C_i|s) = \displaystyle \frac{P(s|C_i) \cdot P(C_i)}{P(s)}, \quad i \in 1,2$$ for each score value s, which can be obtained by Bayes' Rule. The threshold is defined as the boundary that better separates both curves, which describes the point of maximum uncertainty. Such means that scores falling below the threshold have an higher likelihood of belonging to the "abnormal" class C2, being classified as outliers.
***
###METEOR (DBN)
Perform both score-analysis strategies (Tukey and GMM) for both subject and transition scores from the modeled DBNs.
```{r DBN_score_analysis, echo=FALSE}
####### TRANSITIONS
transitions_score_array <- to_score_array(DBN_transition_scores) # Subjects
### GMM strategy:
GMM_transitions_DBN <- GMM_score_analysis(transitions_score_array)
### Tukey strategy:
Tukey_transitions_DBN <- unname(Tukey_score_analysis(transitions_score_array)) #unname required to retrieve solely the value
####### SUBJECTS
subjects_score_array <- to_score_array(DBN_subject_scores) # Subjects
### GMM strategy:
GMM_subjects_DBN <- GMM_score_analysis(subjects_score_array)
### Tukey strategy:
Tukey_subjects_DBN <- unname(Tukey_score_analysis(subjects_score_array)) #unname required to retrieve solely the value
```
```{r DBN_score_analysis_aux, echo=FALSE, include=FALSE}
transitions_score_array_copy <- transitions_score_array # Useful for later (histogram)
subjects_score_array_copy <- subjects_score_array # useful for later (histogram)
```
Both tresholds from the score-analysis methods are now used to classify each score.
####Transitions
First, each transition score is evaluated with both score-analysis strategies using the corresponding thresholds. These are highlighted in red and observed in the next plots.
Recall that data marked as anomalous refers to the whole transition and not solely the values of that specific slice. An entire anomalous transition is red, which ends in the last time slice generated by that transition. Each anomalous transition contains *lag+1* time slices due to the size of the window. In the present case, since the Markov lag is 1, an anomalous transition is represented by a red point at *t* (first instant of the transition) and a red line connecting this first point *t* and *t+1* which is the last slice of the transition. This last point is not marked as red, since it would indicate an outlier for the next transition (beginning in *t+1*) instead of the one which generated it.
```{r DBN_final_classification_transition, echo=FALSE}
### Final Classification
### Associate the subject_ids lost in the process (Transitions):
subject_id <- data.frame(DBN_transition_scores$subject_id)
colnames(subject_id) <- c("subject_id")
transitions_score_array <- cbind(subject_id, transitions_score_array)
rm(subject_id)
### DBN Transition results
results.DBN.transition.GMM <- classify_transitions(DBN_transition_scores,GMM_transitions_DBN)
results.DBN.transition.Tukey <- classify_transitions(DBN_transition_scores,Tukey_transitions_DBN)
```
***
For the results, please check the folder **/resources/output_files/DBN** to retrieve the outputs of each phase. It is **adviced to open the .csv as text files**, since some editors may change the appearance of the content.
**ATTENTION:** If PAA reduction was performed, the transition plots displaying outliers as red **are not correct**, since these display the input data before the reduction. Nevertheless, all computations and conclusions are correct. Users should also remember that they performed PAA, and a transition *t* classified as anomalous does not correspond to the same transition of the input data (in the case of PAA).
***
#####Tukey
Transitions classified as anomalous using Tukey's strategy on data modeled by the DBN.
**Tukey's threshold: **`r Tukey_transitions_DBN`
```{r DBN_final_classification_transition_plot_Tukey, echo=FALSE}
#### Transitions
#plot_anomalous_transitions(original_continuous_data, results.DBN.transition.Tukey, Markov_lag = 1 ,variable_name = "X1")
```
Results of the final DBN transition classification. Parameters can be **changed below** to specify which subjects to observe.
```{r DBN_final_classification_transition_table_Tukey, results = 'asis'}
kable(data.frame(results.DBN.transition.Tukey[1:7,]),caption="DBN final results using Tukey's threshold")
```
Score distribution and the final classification of each transition according to Tukey's strategy.
```{r DBN_Histogram_transition_Tukey, echo=FALSE}
if(sum(results.DBN.transition.Tukey[,-1])==0){ # No outliers
print("No anomalies detected.")
}else{
### HISTOGRAMS - Graphical representation of each score-analysis
### Assumes there are scores of both classes: anomalous and normal
bins <- seq(min(transitions_score_array_copy ), max(transitions_score_array_copy), length.out = 50 + 1)
red_array <- array("red3",dim = c(1,min(which(bins > Tukey_transitions_DBN))-2)) # -2 because of 0 and yellow
green_array <- array("springgreen3",dim = c(1,length(bins) - min(which(bins > Tukey_transitions_DBN))))
colours <- cbind(red_array,"orange",green_array)
hist(transitions_score_array_copy ,
main="Histogram for DBN transition scores",
xlab="Scores",
border="black",
col = colours,
las=1,
breaks=bins) # breaks are the number of breakpoints which determine the bins
legend("topright", c("Normal", "Outlier"), col=c("springgreen3", "red3"), lwd=10)
aux <- paste("Tukey's Threshold: ", round(Tukey_transitions_DBN, digits = 2))
mtext(aux, side=3)
abline(v=Tukey_transitions_DBN,col="red", lwd=2) # Computed Threshold
}
```
***
#####GMM
Transitions classified as anomalous using the GMM strategy on data modeled by the DBN.
**GMM threshold: ** `r GMM_transitions_DBN`
```{r DBN_final_classification_transition_plot_GMM, echo=FALSE}
#### Transitions
#plot_anomalous_transitions(original_continuous_data, results.DBN.transition.GMM, Markov_lag = 1 ,variable_name = "X1")
```
Results of the final DBN transition classification. Parameters can be **changed below** to specify which subjects to observe.
```{r DBN_final_classification_transition_table, results = 'asis'}
kable(data.frame(results.DBN.transition.GMM[1:7,]),caption="DBN final results using GMM's threshold")
```
Score distribution and the final classification of each transition according to the GMM strategy.
```{r DBN_Histogram_transition, echo=FALSE}
### HISTOGRAMS - Graphical representation of each score-analysis
### Assumes there are scores of both classes: anomalous and normal
bins <- seq(min(transitions_score_array_copy ), max(transitions_score_array_copy), length.out = 50 + 1)
red_array <- array("red3",dim = c(1,min(which(bins > GMM_transitions_DBN))-2)) # -2 because of 0 and yellow
green_array <- array("springgreen3",dim = c(1,length(bins) - min(which(bins > GMM_transitions_DBN))))
colours <- cbind(red_array,"orange",green_array)
hist(transitions_score_array_copy ,
main="Histogram for DBN transition scores",
xlab="Scores",
border="black",
col = colours,
las=1,
breaks=bins) # breaks are the number of breakpoints which determine the bins
legend("topright", c("Normal", "Outlier"), col=c("springgreen3", "red3"), lwd=10)
aux <- paste("GMM Threshold: ", round(GMM_transitions_DBN, digits = 2))
mtext(aux, side=3)
abline(v=GMM_transitions_DBN,col="red", lwd=2) # Computed Threshold
```
Subjects possessing transitions classified as anomalous by METEOR with GMM's score-analysis:
```{r DBN_classification_transitions_GMM_outliers, echo=FALSE}
print(as.vector(unlist(retrieve_subjects_with_anomalous_transitions(results.DBN.transition.GMM))))
```
####Subjects
Each subject score is evaluated with both score-analysis strategies using the corresponding thresholds.
```{r DBN_final_classification_subject, echo=FALSE}
### Associate the subject_ids lost in the process (Subjects):
subject_id <- data.frame(DBN_subject_scores$subject_id)
colnames(subject_id) <- c("subject_id")
subjects_score_array <- cbind(subject_id, subjects_score_array)
rm(subject_id)
### DBN Subject results
results.DBN.subject.GMM <- classify_subjects(subjects_score_array,GMM_subjects_DBN)
results.DBN.subject.Tukey <- classify_subjects(subjects_score_array,Tukey_subjects_DBN)
```
The user can change the next code to specify which subjects to plot and which variable, to compare a single variable TS between subjects. This allows a comparison with the next figures which only plot anomalous subjects. This way, different subjects can be analised as required.
```{r DBN_final_plot_desired_subjects, echo=FALSE}
### Plot some desired subjects
plot_subjects(original_continuous_data, c("4","18"), variable_name = "X1") # Plotting the X1 variable of subjects 4 and 18
```
Subjects classified as anomalous for both strategies are plotted next, for data modeled by METEOR.
***
#####Tukey
Plot of the anomalous results using Tukey's score-analysis, some plots can be altered by changing the input of the corresponding functions. The plots only display a single variable, otherwise it would be very confusing to distinguish subjects.
**Tukey's threhsold: **`r Tukey_subjects_DBN`
```{r DBN_final_classification_subject_plot_Tukey, echo=FALSE}
##### Show original values of subjects classified as outliers (red)
#### Subjects
### Plot the subjects classified as anomalous
if(sum(results.DBN.subject.Tukey$outlier)==0){print("No outliers with Tukey's strategy")
} else {plot_anomalous_subjects(original_continuous_data, results.DBN.subject.Tukey, variable_name = "X1")} # Plot of the X1 TS from the subjects classified as anomalous using Tukey's score-analysis strategy
```
Subjects classified as anomalous by METEOR with Tukey's score-analysis:
```{r DBN_classification_subject_Tukey_outliers, echo=FALSE}
print(as.vector(unlist(retrieve_anomalous_subject_ids(results.DBN.subject.Tukey))))
```
Results of the final DBN subject classification. Parameters can be **changed below** to specify which subjects to observe.
```{r DBN_final_classification_subject_table_Tukey, results = 'asis'}
kable(data.frame(results.DBN.subject.Tukey[1:7,]),caption="DBN final results using Tukey's threhold")
```
Score distribution and the final classification of each subject according to Tukey's strategy.
```{r DBN_Histogram_subject_Tukey, echo=FALSE}
if(sum(results.DBN.subject.Tukey[,-1])==0){ # No outliers
print("No anomalies detected.")
}else{
### HISTOGRAMS - Graphical representation of each score-analysis
### Assumes there are scores of both classes: anomalous and normal
bins <- seq(min(subjects_score_array_copy ), max(subjects_score_array_copy), length.out = 50 + 1)
red_array <- array("red3",dim = c(1,min(which(bins > Tukey_subjects_DBN))-2)) # -2 because of 0 and yellow
green_array <- array("springgreen3",dim = c(1,length(bins) - min(which(bins > Tukey_subjects_DBN))))
colours <- cbind(red_array,"orange",green_array)
hist(subjects_score_array_copy ,
main="Histogram for DBN subject scores",
xlab="Scores",
border="black",
col = colours,
las=1,
breaks=bins) # breaks are the number of breakpoints which determine the bins
legend("topright", c("Normal", "Outlier"), col=c("springgreen3", "red3"), lwd=10)
aux <- paste("Tukey's Threshold: ", round(Tukey_subjects_DBN, digits = 2))
mtext(aux, side=3)
abline(v=Tukey_subjects_DBN,col="red", lwd=2) # Computed Threshold
}
```
***
#####GMM
Plot of the anomalous results using GMM score-analysis, some plots can be altered by changing the input of the corresponding functions.
**GMM threhsold: **`r GMM_subjects_DBN`
```{r DBN_final_classification_subject_plot_GMM, echo=FALSE}
##### Show original values of subjects classified as outliers (red)
#### Subjects
### Plot the subjects classified as anomalous
if(sum(results.DBN.subject.GMM$outlier)==0){print("No outliers with GMM strategy")
} else {plot_anomalous_subjects(original_continuous_data, results.DBN.subject.GMM, variable_name = "X1")} # DBN GMM
```
Subjects classified as anomalous by the DBN approach with the GMM score-analysis:
```{r DBN_classification_subject_GMM_outliers, echo=FALSE}
print(as.vector(unlist(retrieve_anomalous_subject_ids(results.DBN.subject.GMM))))
```
Results of the final DBN subject classification. Parameters can be **changed below** to specify which subjects to observe.
```{r DBN_final_classification_subject_table, results = 'asis'}
kable(data.frame(results.DBN.subject.GMM[1:7,]),caption="DBN final results using GMM's threhold")
```
Score distribution and the final classification of each subject according to the GMM strategy.
```{r DBN_Histogram_subject, echo=FALSE}
### HISTOGRAMS - Graphical representation of each score-analysis
### Assumes there are scores of both classes: anomalous and normal
bins <- seq(min(subjects_score_array_copy ), max(subjects_score_array_copy), length.out = 50 + 1)
red_array <- array("red3",dim = c(1,min(which(bins > GMM_subjects_DBN))-2)) # -2 because of 0 and yellow
green_array <- array("springgreen3",dim = c(1,length(bins) - min(which(bins > GMM_subjects_DBN))))
colours <- cbind(red_array,"orange",green_array)
hist(subjects_score_array_copy ,
main="Histogram for DBN subject scores",
xlab="Scores",
border="black",
col = colours,
las=1,
breaks=bins) # breaks are the number of breakpoints which determine the bins
legend("topright", c("Normal", "Outlier"), col=c("springgreen3", "red3"), lwd=10)
aux <- paste("GMM Threshold: ", round(GMM_subjects_DBN, digits = 2))
mtext(aux, side=3)
abline(v=GMM_subjects_DBN,col="red", lwd=2) # Computed Threshold
```
```{r DBN_classification_results_output, echo=FALSE, include=FALSE}
write.csv(results.DBN.transition.Tukey, file = paste(DBN_write_path, "/DBN_results_transitions_Tukey.csv", sep = ""), row.names = FALSE)
write.csv(results.DBN.transition.GMM, file = paste(DBN_write_path, "/DBN_results_transitions_GMM.csv", sep = ""), row.names = FALSE)
write.csv(data.frame(results.DBN.subject.Tukey), file = paste(DBN_write_path, "/DBN_results_subjects_Tukey.csv", sep = ""), row.names = FALSE)
write.csv(data.frame(results.DBN.subject.GMM), file = paste(DBN_write_path, "/DBN_results_subjects_GMM.csv", sep = ""), row.names = FALSE)
```
***
Subjects which are classfied as anomalous and at the same time possess anomalous transitions using GMM's score-analysis are presented below.
```{r DBN_results_intersection_GMM, echo=FALSE}
a1<-as.vector(unlist(retrieve_subjects_with_anomalous_transitions(results.DBN.transition.GMM)))
a2<-as.vector(unlist(retrieve_anomalous_subject_ids(results.DBN.subject.GMM)))
print(intersect(a1,a2))
```
# Conclusions and Future Work
Uncovering anomalies among MTS measurements is far from an easy task, not only due to the high instability/variability of values from each specific application, but also due to external/unknown factors which could have caused the anomalies. Most detections are usually caused by abrupt changes or abnormal values in multiple consecutive samples. It's worth noting that outlier subjects which also possess anomalous transitions are more likely to be interesting.
It's worth noting that there is room for improvement and future research by enhancing the existing methods and extending the present conclusions for other datasets mainly missing data.
***
# References