METEOR.Rmd


---
title: "METEOR - Dynamic Bayesian Outlier Detection"
author: "Jorge Serras"
date: "March 2019"
output:
  html_document: default
bibliography: bibliography.bib

---


## README:
> It is advice the use of RStudio IDE and R version of atleast 3.5.2, due to the installation process.
The script will automatically install all required packages if not already.
The relative positioning of all the files and folders should not be tempered with, even output files. Copies should thus be done.
For more information or any doubt concerning the given resources, please consider gathering additional insight or contact through https://jorgeserras.com/


```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
if(!require('rJava')) install.packages("rJava") # If a java update was performed or some other issue, reinstall rJava

if(!require('pacman'))install.packages('pacman') # Automatically installs and loads the required packages
pacman::p_load(jmotif, plyr, dplyr, qdapTools, data.table, uroot, stsm, mclust, stats, dmm, DiagrammeR, ggplot2, reshape2, knitr) 
library("rJava") # Problems with Java? try: https://github.com/s-u/rJava/issues/163  Furthermore the Java version of your system must be the same as used by rJava
.jinit()
.jaddClassPath('METEOR.jar')
library("xlsx")
library("imputeTS")
library("DiagrammeR")
library(knitr)
#setwd(dirname(rstudioapi::getActiveDocumentContext()$path))
source(file.path('auxiliar', 'formatting.R'))
source(file.path('auxiliar', 'auxiliar.R'))
source(file.path('DBN', 'DBN_scoring.R'))
source(file.path('discretization', 'sax.R'))
source(file.path('score_analysis', 'score_analysis.R'))

as.numeric.factor <- function(x) {as.numeric(levels(x))[x]}

my.file.rename <- function(from, to) {
    todir <- dirname(to)
    if (!isTRUE(file.info(todir)$isdir)) dir.create(todir, recursive=TRUE)
    file.rename(from = from,  to = to)
}

read_path <- paste(getwd(),file.path('/resources'), sep = "") # Input files go to this folder
write_path <- paste(getwd(),file.path('/resources', 'output_files'), sep = "") # Output files go to this folder
DBN_write_path <- paste(getwd(),file.path('/resources/output_files', 'DBN'), sep = "") # Output files go to this folder
```

```{r echo=FALSE, include=FALSE}

```

***

## About

The current study has the aim of providing further research and resources into the detection and modeling of anomalous behavior in multivariate temporal data. A complete system is built to handle multivariate time series (MTS) from pre-processing to score-analysis known as MultivariatE Time sEries OutlieR (METEOR) [@webapp].

All data including time stamps are first pre-processed and discretized to ensure the correct employment of the proposed methods. The developed system is based on Dynamic Bayesian Networks (DBN). METEOR is made available for anomaly detection and followed step-be-step. A web application and tutorial are additionally made accessible [@webapp] for any endeavor together with supplementary information.

Furthermore, generated output files are available in folder **/resources/output_files** where several subsets and transformations of the input dataset are present. Additionally, there is a folder **/DBN** which has the output files concerning outlier detection using the DBN (METEOR) approach. 

###Abstract
> Outliers can be defined as observations which are suspected of not have been generated by data's underlying processes. Many applications require a way of identifying interesting or unusual patterns in multivariate time series (MTS), however, most outlier detection methods focus solely on univariate series, providing analysts with strenuous solutions. We propose a complete outlier detection system covering problems since pre-processing that adopts a dynamic Bayesian network modeling algorithm. The latter encodes optimal inter and intra time-slice connectivity of transition networks capable of capturing conditional dependencies in MTS datasets. A sliding window mechanism is employed to gradually score each MTS transition given the model. Simultaneously, complete MTS are evaluated. Two score-analysis strategies are studied to assure an automatic boundary classifying anomalous data. The proposed approach is first validated through simulated data, demonstrating the system's performance. Comparison with an assembled probabilistic suffix tree method is available displaying the leverage of the proposed multivariate approach. Further experimentation is made on real data, by uncovering anomalies in distinct scenarios such as electrocardiogram series, mortality rate data and written pen digits. The developed system proved beneficial in capturing unusual data resulting from temporal contexts, being suitable for any MTS scenario. A widely accessible web application employing the complete system is made available jointly with a tutorial.


***

## Formatting

To begin, the input data is loaded. If the dataset is not in the correct format, please check the examples and tools made available to ensure a correct employment. The continuous MTS input dataset should be in the **/resources** folder. The code below must be changed accordingly and the number of variables **n_variables** specified. If data is already discretized, skip the SAX Discretization phase.

```{r loading}
####
original_continuous_data <- as.data.frame(read.csv(file = paste(read_path, "/continuous_example_data.csv", sep = ""), header=TRUE))
n_variables <- 2 # For the SAX plot function, 1 means the subjects are univariate
```


***

## SAX Discretization

In the first phase, a method known as Symbolic Aggregate approXimation (SAX) [@lin2003symbolic] is applied to discretize time series (TS) into strings. It is capable of transforming an univariate TS of length n after normalization into a string of arbitrary length w. The latter length can be w << n if Piecewise Aggregate Approximation (PAA) is applied. This ensures dimensionality reduction of the normalized TS when excessive measurements are present. An alphabet size must be chosen to determine the number of symbols for the final discretized sequences. Such assumes that normalized TS possess Gaussian distribution probabilities with unitary standard deviation and null mean. For additional information, please see [@lin2003symbolic].

SAX does not allow missing values, imputation of the missing values should be considered as future work. The SAX discretization is the same available in the webapp: https://jorgeserras.shinyapps.io/outlierdetection/

Parameters can be **changed** in the code below, allowing the adjusting of the alphabet size, PAA reduction size and an auxiliar function for the plot. Feel free to experiment until results are satisfactory. 

**ATTENTION:** It is worth noting that since each TS is normalized, the offset is lost in the process. Such means that it could happen that subjects with extreme values (for example) are brought to zero and be later classified as normal. Visual inspection or the addition of a step before SAX is advised to prevent such scenarios.


```{r SAX}

### Discretization:
panel_data <- as.data.frame(parseToPanel(original_continuous_data))[,1:(n_variables+2)]

### SAX parameters:
alphabet_size <- 3
paa_size <- as.integer(max(panel_data$timestamp)) #paa_size = ncol() - 1 to prevent dimensionality reduction, -1 because of the "subject_id" column. This is for the univariate case, if data is multivariate, paa_size must be equal to the number of time slices to avoid performing PAA reduction

### SAX parameters: alphabet_size and PAA_size which gives the new reduced numer of time slices for each TS
discrete_outputs <- discretize(panel_data = panel_data, alphabet_size = alphabet_size, paa_size = paa_size)
### Retrieve solely the discretized dataset:
discrete_data <- discrete_outputs[[1]] # The variable names are different but no worries, they are easily recovered later if necessary
normalized_data <- discrete_outputs[[2]]
timestamp <- discrete_outputs[[3]]
subject_id_data <- data.frame(subject_id = discrete_data$subject_id)
### Save in file:
write.csv(discrete_data, file = paste(write_path, "/discrete_data.csv", sep = ""), row.names = FALSE, quote = TRUE)


```

The results of the discretization can be analyzed in the figure below, which displays the normalized data before preforming PAA reduction (if considered). It shows the mean and standard deviation of each time slice of the dataset. Every TS displayed is normalized with mean = 0 and std = 1. If the number of variables is higher than 1, several curves occur, one for each variable of the MTS. The breakpoints are also visible (pink), delimiting the regions of each symbol of the alphabet chosen. An alphabet size of 3, for example, displays 2 breakpoints separating the entire y domain in 3 equiprobable regions according to a Normal distribution N(0,1).

```{r SAX_plot, echo=FALSE}
mean_std_plot(normalized_data, timestamp, n_variables, alphabet_size)
```

Results of SAX pre-processing including PAA reduction (if considered). Parameters can be **changed below** to specify which subjects to observe.

```{r SAX_discrete_table, results = 'asis'}
kable(data.frame(discrete_data[1:7,]),caption="SAX discretized subjects")
```


```{r, include=FALSE}
rm(panel_data)
rm(alphabet_size)
rm(timestamp)
rm(n_variables)
rm(paa_size)
```

***

**ATTENTION:** If PAA reduction was performed, the transition plots displaying outliers as red **are not correct**, since these display the input data before the reduction. Nevertheless, all computations and conclusions are correct. Users should also remember that they performed PAA, and a transition *t* classified as anomalous does not correspond to the same transition of the input data (in the case of PAA).


## METEOR application (DBN approach)

Many applications require a way of identifying interesting or unusual patterns in multivariate time series (MTS), however, most outlier detection methods focus solely on univariate series, providing analysts with strenuous solutions. We propose a complete outlier detection system known as METEOR [@webapp] covering problems since pre-processing that adopts a dynamic Bayesian network (DBN) modeling algorithm [@monteiro2015polynomial]. The latter encodes optimal inter and intra time-slice connectivity of transition networks capable of capturing conditional dependencies in MTS datasets. A sliding window mechanism is employed to gradually score each MTS transition given the model. Simultaneously, complete MTS are evaluated. The developed system proved beneficial in capturing unusual data resulting from temporal contexts, being suitable for any MTS scenario. A widely accessible web application [@webapp] employing the complete system is made available jointly with a tutorial.

Outliers are considered to be observations suspicious of not have being generated by the data's main underlying processes. Such instances generally do not fit the trained model. The goal is to score portions as well as entire MTS. For each transition of a subject's MTS, a window composed by its observed attributes is built, dependent of the order of the trained DBN. A transition score represents the log-likelihood (LL) of the observed window computed using the network's conditional probabilities. It is worth noting that a transition score does not represent the outlierness of a time slice, but rather its context. A score of an entire TS (subject) is computed using the average of all the transition scores from that subject. The most negative a score is, the more anomalous it is.

By having parameters markov_lag = 1 and previous_parents = 1, for example, each measurement at time t is associated to its immediate precedent (value at slice t-1).

Issues may occur in this phase due to rJava package. This happens due to the still in progress development of the package (bugs and features are still being coded) and Java versions in question. Most of the times, running the script repeatedly fixes the issues (it simply does, don't ask me). Try to rerun until it works. If problems continue to occur, consider reinstalling rJava and updating Java (unnistalling previous version) or using the online version of the Web application https://jorgeserras.shinyapps.io/outlierdetection/ instead of running through code. The Dynamic Bayesian Network Outlier Detection system for multivariate time series is used. Check the main webpages: https://jorgeserras.shinyapps.io/outlierdetection/ and https://jorgeserras.com/MTS_OutlierDetection/ for additional information. Don't forget there is a tutorial video available!

**NOTE:** If the SAX discretization phase was skipped, due to data already being discrete, please insert the discrete MTS dataset in format .csv in the folder "/resources/output_files" and change the name below from "/discrete_data.csv" to the name of your file. There is an example file in the folder named "/discrete_example_data.csv".


```{r DBN}

### Both transitions and complete subjects are scored
### Name of the file where the discrete data is stored:
data_file_name <- "/discrete_data.csv"

DBN_output <- DBN_scoring(data_file_name, write_path, markov_lag = 1, previous_parents = 1, checkbox_stationary = TRUE, score_function = "ll")
DBN_transition_scores <- DBN_output[[1]]
DBN_subject_scores <- DBN_output[[2]]

# Join all the DBN scores (transitions and subjects) in one file/data.frame
DBN_scores <- DBN_transition_scores[,-1]
DBN_scores <- cbind(data.frame(subject_id=subject_id_data),DBN_scores,data.frame(subject_score=DBN_subject_scores$score))

# Restore the true subject_id values that are lost in the process
DBN_transition_scores$subject_id <- NULL
DBN_subject_scores$subject_id <- NULL
DBN_transition_scores <- cbind(data.frame(subject_id = subject_id_data, DBN_transition_scores))
DBN_subject_scores <- cbind(data.frame(subject_id = subject_id_data, DBN_subject_scores))

```


```{r, include=FALSE}
rm(L)
```


```{r DBN_output_score_results, echo = FALSE, include = FALSE}
write.csv(DBN_transition_scores, file = paste(DBN_write_path, "/DBN_scores_transitions.csv", sep = ""), row.names = FALSE)
write.csv(DBN_subject_scores, file = paste(DBN_write_path, "/DBN_scores_subjects.csv", sep = ""), row.names = FALSE)
#Delete files if they exist
if (file.exists("scores_transition_output.csv")) 
  file.remove("scores_transition_output.csv")
if (file.exists("subject_scores_output.csv")) 
  file.remove("subject_scores_output.csv")
write.csv(DBN_scores, file = paste(DBN_write_path, "/DBN_scores_combined.csv", sep = ""), row.names = FALSE)
DBN_sorted_scores <- data.frame(DBN_scores[order(DBN_scores$subject_score),])
write.csv(DBN_sorted_scores, file = paste(DBN_write_path, "/DBN_scores_combined_sort.csv", sep = ""), row.names = FALSE)
```

Visualization of the DBN modeled, this can be stationary or non-stationary. Nodes blue outlined concern attributes at the present time slice (t) being scored. All other nodes concern attributes from previous slices (t-1,...) according to the structure of the transition networks. A node Vi[t] concerns the value of the i-th variable from the dataset at time slice t.

By default, the DBNs are modeled using a LogLikelihood function, a parameter can be changed to used the MDL instead. Such allows a more constraint training, which usually reduces the number of connections in the transition networks.

```{r DBN_figure}
grViz("dbn_structure_dot") # Plot the DBN using file 'dbn_structure_dot'
```

```{r DBN_structure, echo = FALSE, include = FALSE}
my.file.rename(from = paste(getwd(),"/dbn_structure_dot", sep = ""),
               to = paste(DBN_write_path,"/dbn_structure_dot",sep = "")) # Move file's location to the DBN output folder
my.file.rename(from = paste(getwd(),"/dbn_structure_output", sep = ""),
               to = paste(DBN_write_path,"/dbn_structure_output",sep = "")) # Move file's location to the DBN output folder
```

The scoring results for transitions and entire subjects are observable below, being the subjects sorted by their subject scores (the lower the value, the more anomalous). The subjects to be displayed can be specified in the function's argument. The results are written in files **DBN_scores_combined.csv** and **DBN_scores_combined_sort.csv**, available in the folder '/resources/output_files/DBN' together with all other DBN outputs.

```{r DBN_all_scores}
kable(data.frame(DBN_sorted_scores[1:7,]),caption="DBN transition and subject scores")
```

***

## SCORE-ANALYSIS

A score-analysis phase is preformed to ensure the disclosure of the boundary between both classes "Anomalous" and "Normal". Two strategies are tested, Tukey's strategy [@tukey1977exploratory] and Gaussian Mixture Models (GMM) [@mclachlan2018finite] using 2 classes. The latter is preformed using package "mclust" [@fraley2017package]. More information is also available in the webpages of the developed web application. Both score-analysis approaches are applied to subject scores and to transition scores from the DBN approach.

Tukey's strategy uses the Interquartile Range (IQR), which is the difference between the third and first quartiles of the score's distribution. This measures the values' dispersion. The computed boundary comes thus:

$$ Threshold = Q1 - (1.5 \cdot IQR),  \quad IQR = Q1 - Q3,$$ meaning that scores below the threshold are classified as anomalous. Tukey's strategy assumes normal scores form a symmetric cluster, meaning outliers form a negative skew in the distribution.

In the GMM score-analysis, disjoint score distributions are tackled. Scores are assumed to be generated from a finite mixture of Gaussian distributions with unknown parameters. Score distributions are modeled as mixtures of two Gaussian curves. Labeling each score becomes a classification problem among two classes C1 and C2, representing abnormality and normality respectively. The problem is defined as uncovering the value of 

$$P(C_i|s) = \displaystyle \frac{P(s|C_i) \cdot P(C_i)}{P(s)}, \quad i \in 1,2$$ for each score value s, which can be obtained by Bayes' Rule. The threshold is defined as the boundary that better separates both curves, which describes the point of maximum uncertainty. Such means that scores falling below the threshold have an higher likelihood of belonging to the "abnormal" class C2, being classified as outliers.

***

###METEOR (DBN)

Perform both score-analysis strategies (Tukey and GMM) for both subject and transition scores from the modeled DBNs.

```{r DBN_score_analysis, echo=FALSE}

####### TRANSITIONS
transitions_score_array <- to_score_array(DBN_transition_scores) # Subjects
### GMM strategy:
GMM_transitions_DBN <- GMM_score_analysis(transitions_score_array)
### Tukey strategy:
Tukey_transitions_DBN <- unname(Tukey_score_analysis(transitions_score_array)) #unname required to retrieve solely the value

####### SUBJECTS
subjects_score_array <- to_score_array(DBN_subject_scores) # Subjects
### GMM strategy:
GMM_subjects_DBN <- GMM_score_analysis(subjects_score_array)
### Tukey strategy:
Tukey_subjects_DBN <- unname(Tukey_score_analysis(subjects_score_array)) #unname required to retrieve solely the value

```

```{r DBN_score_analysis_aux, echo=FALSE, include=FALSE}
transitions_score_array_copy <- transitions_score_array # Useful for later (histogram)
subjects_score_array_copy <- subjects_score_array # useful for later (histogram)
```

Both tresholds from the score-analysis methods are now used to classify each score.

####Transitions

First, each transition score is evaluated with both score-analysis strategies using the corresponding thresholds. These are highlighted in red and observed in the next plots.

Recall that data marked as anomalous refers to the whole transition and not solely the values of that specific slice. An entire anomalous transition is red, which ends in the last time slice generated by that transition. Each anomalous transition contains *lag+1* time slices due to the size of the window. In the present case, since the Markov lag is 1, an anomalous transition is represented by a red point at *t* (first instant of the transition) and a red line connecting this first point *t* and *t+1* which is the last slice of the transition. This last point is not marked as red, since it would indicate an outlier for the next transition (beginning in *t+1*) instead of the one which generated it.

```{r DBN_final_classification_transition, echo=FALSE}
### Final Classification
### Associate the subject_ids lost in the process (Transitions):
subject_id <- data.frame(DBN_transition_scores$subject_id)
colnames(subject_id) <- c("subject_id")
transitions_score_array <- cbind(subject_id, transitions_score_array)
rm(subject_id)
### DBN Transition results
results.DBN.transition.GMM <- classify_transitions(DBN_transition_scores,GMM_transitions_DBN)
results.DBN.transition.Tukey <- classify_transitions(DBN_transition_scores,Tukey_transitions_DBN)
```

***

For the results, please check the folder **/resources/output_files/DBN** to retrieve the outputs of each phase. It is **adviced to open the .csv as text files**, since some editors may change the appearance of the content.

**ATTENTION:** If PAA reduction was performed, the transition plots displaying outliers as red **are not correct**, since these display the input data before the reduction. Nevertheless, all computations and conclusions are correct. Users should also remember that they performed PAA, and a transition *t* classified as anomalous does not correspond to the same transition of the input data (in the case of PAA).

***

#####Tukey

Transitions classified as anomalous using Tukey's strategy on data modeled by the DBN.

**Tukey's threshold: **`r Tukey_transitions_DBN`

```{r DBN_final_classification_transition_plot_Tukey, echo=FALSE}
#### Transitions
#plot_anomalous_transitions(original_continuous_data, results.DBN.transition.Tukey, Markov_lag = 1 ,variable_name = "X1")
```

Results of the final DBN transition classification. Parameters can be **changed below** to specify which subjects to observe.

```{r DBN_final_classification_transition_table_Tukey, results = 'asis'}
kable(data.frame(results.DBN.transition.Tukey[1:7,]),caption="DBN final results using Tukey's threshold")
```

Score distribution and the final classification of each transition according to Tukey's strategy.

```{r DBN_Histogram_transition_Tukey, echo=FALSE}

if(sum(results.DBN.transition.Tukey[,-1])==0){ # No outliers 
  print("No anomalies detected.")
}else{
  ### HISTOGRAMS - Graphical representation of each score-analysis
  ### Assumes there are scores of both classes: anomalous and normal
  bins <- seq(min(transitions_score_array_copy ), max(transitions_score_array_copy), length.out = 50 + 1)
  red_array <- array("red3",dim = c(1,min(which(bins > Tukey_transitions_DBN))-2)) # -2 because of 0 and yellow
  green_array <- array("springgreen3",dim = c(1,length(bins) - min(which(bins > Tukey_transitions_DBN))))
  
  colours <- cbind(red_array,"orange",green_array)
  
  hist(transitions_score_array_copy , 
       main="Histogram for DBN transition scores", 
       xlab="Scores", 
       border="black", 
       col = colours,
       las=1, 
       breaks=bins) # breaks are the number of breakpoints which determine the bins
  
  legend("topright", c("Normal", "Outlier"), col=c("springgreen3", "red3"), lwd=10)
  aux <- paste("Tukey's Threshold: ", round(Tukey_transitions_DBN, digits = 2))
  mtext(aux, side=3)
  abline(v=Tukey_transitions_DBN,col="red", lwd=2) # Computed Threshold
}

```

***

#####GMM

Transitions classified as anomalous using the GMM strategy on data modeled by the DBN.

**GMM threshold: ** `r GMM_transitions_DBN`

```{r DBN_final_classification_transition_plot_GMM, echo=FALSE}
#### Transitions
#plot_anomalous_transitions(original_continuous_data, results.DBN.transition.GMM, Markov_lag = 1 ,variable_name = "X1")
```

Results of the final DBN transition classification. Parameters can be **changed below** to specify which subjects to observe.

```{r DBN_final_classification_transition_table, results = 'asis'}
kable(data.frame(results.DBN.transition.GMM[1:7,]),caption="DBN final results using GMM's threshold")
```

Score distribution and the final classification of each transition according to the GMM strategy.

```{r DBN_Histogram_transition, echo=FALSE}
### HISTOGRAMS - Graphical representation of each score-analysis
### Assumes there are scores of both classes: anomalous and normal
bins <- seq(min(transitions_score_array_copy ), max(transitions_score_array_copy), length.out = 50 + 1)
red_array <- array("red3",dim = c(1,min(which(bins > GMM_transitions_DBN))-2)) # -2 because of 0 and yellow
green_array <- array("springgreen3",dim = c(1,length(bins) - min(which(bins > GMM_transitions_DBN))))

colours <- cbind(red_array,"orange",green_array)

hist(transitions_score_array_copy , 
     main="Histogram for DBN transition scores", 
     xlab="Scores", 
     border="black", 
     col = colours,
     las=1, 
     breaks=bins) # breaks are the number of breakpoints which determine the bins

legend("topright", c("Normal", "Outlier"), col=c("springgreen3", "red3"), lwd=10)
aux <- paste("GMM Threshold: ", round(GMM_transitions_DBN, digits = 2))
mtext(aux, side=3)
abline(v=GMM_transitions_DBN,col="red", lwd=2) # Computed Threshold

```

Subjects possessing transitions classified as anomalous by METEOR with GMM's score-analysis:

```{r DBN_classification_transitions_GMM_outliers, echo=FALSE}
print(as.vector(unlist(retrieve_subjects_with_anomalous_transitions(results.DBN.transition.GMM))))

```


####Subjects

Each subject score is evaluated with both score-analysis strategies using the corresponding thresholds.

```{r DBN_final_classification_subject, echo=FALSE}

### Associate the subject_ids lost in the process (Subjects):
subject_id <- data.frame(DBN_subject_scores$subject_id)
colnames(subject_id) <- c("subject_id")
subjects_score_array <- cbind(subject_id, subjects_score_array)
rm(subject_id)
### DBN Subject results
results.DBN.subject.GMM <- classify_subjects(subjects_score_array,GMM_subjects_DBN)
results.DBN.subject.Tukey <- classify_subjects(subjects_score_array,Tukey_subjects_DBN)

```

The user can change the next code to specify which subjects to plot and which variable, to compare a single variable TS between subjects. This allows a comparison with the next figures which only plot anomalous subjects. This way, different subjects can be analised as required.

```{r DBN_final_plot_desired_subjects, echo=FALSE}
### Plot some desired subjects
plot_subjects(original_continuous_data, c("4","18"), variable_name = "X1") # Plotting the X1 variable of subjects 4 and 18
```

Subjects classified as anomalous for both strategies are plotted next, for data modeled by METEOR.

***

#####Tukey

Plot of the anomalous results using Tukey's score-analysis, some plots can be altered by changing the input of the corresponding functions. The plots only display a single variable, otherwise it would be very confusing to distinguish subjects.

**Tukey's threhsold: **`r Tukey_subjects_DBN`

```{r DBN_final_classification_subject_plot_Tukey, echo=FALSE}

##### Show original values of subjects classified as outliers (red)
#### Subjects
### Plot the subjects classified as anomalous
if(sum(results.DBN.subject.Tukey$outlier)==0){print("No outliers with Tukey's strategy")
} else {plot_anomalous_subjects(original_continuous_data, results.DBN.subject.Tukey, variable_name = "X1")} # Plot of the X1 TS from the subjects classified as anomalous using Tukey's score-analysis strategy

```

Subjects classified as anomalous by METEOR with Tukey's score-analysis:

```{r DBN_classification_subject_Tukey_outliers, echo=FALSE}
print(as.vector(unlist(retrieve_anomalous_subject_ids(results.DBN.subject.Tukey))))

```

Results of the final DBN subject classification. Parameters can be **changed below** to specify which subjects to observe.

```{r DBN_final_classification_subject_table_Tukey, results = 'asis'}
kable(data.frame(results.DBN.subject.Tukey[1:7,]),caption="DBN final results using Tukey's threhold")
```

Score distribution and the final classification of each subject according to Tukey's strategy.

```{r DBN_Histogram_subject_Tukey, echo=FALSE}

if(sum(results.DBN.subject.Tukey[,-1])==0){ # No outliers 
  print("No anomalies detected.")
}else{
  ### HISTOGRAMS - Graphical representation of each score-analysis
  ### Assumes there are scores of both classes: anomalous and normal
   bins <- seq(min(subjects_score_array_copy ), max(subjects_score_array_copy), length.out = 50 + 1)
   red_array <- array("red3",dim = c(1,min(which(bins > Tukey_subjects_DBN))-2)) # -2 because of 0 and yellow
   green_array <- array("springgreen3",dim = c(1,length(bins) - min(which(bins > Tukey_subjects_DBN))))
   
   colours <- cbind(red_array,"orange",green_array)
   
   hist(subjects_score_array_copy , 
        main="Histogram for DBN subject scores", 
        xlab="Scores", 
        border="black", 
        col = colours,
        las=1, 
        breaks=bins) # breaks are the number of breakpoints which determine the bins
   
   legend("topright", c("Normal", "Outlier"), col=c("springgreen3", "red3"), lwd=10)
   aux <- paste("Tukey's Threshold: ", round(Tukey_subjects_DBN, digits = 2))
   mtext(aux, side=3)
   abline(v=Tukey_subjects_DBN,col="red", lwd=2) # Computed Threshold
}

```
***

#####GMM

Plot of the anomalous results using GMM score-analysis, some plots can be altered by changing the input of the corresponding functions.

**GMM threhsold: **`r GMM_subjects_DBN`

```{r DBN_final_classification_subject_plot_GMM, echo=FALSE}

##### Show original values of subjects classified as outliers (red)
#### Subjects
### Plot the subjects classified as anomalous
if(sum(results.DBN.subject.GMM$outlier)==0){print("No outliers with GMM strategy")
} else {plot_anomalous_subjects(original_continuous_data, results.DBN.subject.GMM, variable_name = "X1")} # DBN GMM

```

Subjects classified as anomalous by the DBN approach with the GMM score-analysis:

```{r DBN_classification_subject_GMM_outliers, echo=FALSE}
print(as.vector(unlist(retrieve_anomalous_subject_ids(results.DBN.subject.GMM))))
```

Results of the final DBN subject classification. Parameters can be **changed below** to specify which subjects to observe.

```{r DBN_final_classification_subject_table, results = 'asis'}
kable(data.frame(results.DBN.subject.GMM[1:7,]),caption="DBN final results using GMM's threhold")
```

Score distribution and the final classification of each subject according to the GMM strategy.

```{r DBN_Histogram_subject, echo=FALSE}
### HISTOGRAMS - Graphical representation of each score-analysis
### Assumes there are scores of both classes: anomalous and normal
 bins <- seq(min(subjects_score_array_copy ), max(subjects_score_array_copy), length.out = 50 + 1)
 red_array <- array("red3",dim = c(1,min(which(bins > GMM_subjects_DBN))-2)) # -2 because of 0 and yellow
 green_array <- array("springgreen3",dim = c(1,length(bins) - min(which(bins > GMM_subjects_DBN))))
 
 colours <- cbind(red_array,"orange",green_array)
 
 hist(subjects_score_array_copy , 
      main="Histogram for DBN subject scores", 
      xlab="Scores", 
      border="black", 
      col = colours,
      las=1, 
      breaks=bins) # breaks are the number of breakpoints which determine the bins
 
 legend("topright", c("Normal", "Outlier"), col=c("springgreen3", "red3"), lwd=10)
 aux <- paste("GMM Threshold: ", round(GMM_subjects_DBN, digits = 2))
 mtext(aux, side=3)
 abline(v=GMM_subjects_DBN,col="red", lwd=2) # Computed Threshold

```


```{r DBN_classification_results_output, echo=FALSE, include=FALSE}
write.csv(results.DBN.transition.Tukey, file = paste(DBN_write_path, "/DBN_results_transitions_Tukey.csv", sep = ""), row.names = FALSE)
write.csv(results.DBN.transition.GMM, file = paste(DBN_write_path, "/DBN_results_transitions_GMM.csv", sep = ""), row.names = FALSE)
write.csv(data.frame(results.DBN.subject.Tukey), file = paste(DBN_write_path, "/DBN_results_subjects_Tukey.csv", sep = ""), row.names = FALSE)
write.csv(data.frame(results.DBN.subject.GMM), file = paste(DBN_write_path, "/DBN_results_subjects_GMM.csv", sep = ""), row.names = FALSE)
```

***

Subjects which are classfied as anomalous and at the same time possess anomalous transitions using GMM's score-analysis are presented below.

```{r DBN_results_intersection_GMM, echo=FALSE}
a1<-as.vector(unlist(retrieve_subjects_with_anomalous_transitions(results.DBN.transition.GMM)))
a2<-as.vector(unlist(retrieve_anomalous_subject_ids(results.DBN.subject.GMM)))
print(intersect(a1,a2))
```



# Conclusions and Future Work

Uncovering anomalies among MTS measurements is far from an easy task, not only due to the high instability/variability of values from each specific application, but also due to external/unknown factors which could have caused the anomalies. Most detections are usually caused by abrupt changes or abnormal values in multiple consecutive samples. It's worth noting that outlier subjects which also possess anomalous transitions are more likely to be interesting.

It's worth noting that there is room for improvement and future research by enhancing the existing methods and extending the present conclusions for other datasets mainly missing data.


***

# References