-
Notifications
You must be signed in to change notification settings - Fork 0
/
README.qmd
126 lines (95 loc) · 6.73 KB
/
README.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
---
format:
gfm:
html-math-method: webtex
jupyter: python3
---
# Summary
Anomaly detection is a unique machine learning approach where a data point is either considered typical or unusual (outlier). Potential applications include identifying nefarious program behavior like ransomware or identifying stolen debit cards. Interestingly, only typical data is needed to train the model. This typical data defines a distribution that the model learns. Once the distribution is learned, a new data point can be labeled typical (meaning it is inside the boundary of the training distribution) or an outlier (meaning it is outside the boundary).
This repo compares eight anomaly detection models. Is one model better than all the others?
Models:
* One class SVM with linear kernel
* One class SVM with polynomial kernel
* One class SVM with radial basis function kernel
* One class Approximate SVM
* One class SVM with sigmoid kernel
* Local Outlier Factor using minikowski distance
* Local Outlier Factor using euclidean distance
* Local Outlier Factor using cosine similarity
The main outcome of this simulation is Local Outlier Factor using cosine similarity is the best model overall for these data. Horizontal scaling is necessary for production. If this is not an option, the SVM model with radial basis function as the kernel is the next best option.
# Data Overview
In this simulation, data comes from four two-dimensional Gaussian random variables. The center of each random variable is at a corner of the unit square. Data in the first and third quartile of the xy plane are consider typical, and data from the second and fourth quartile are consider outliers.
Looking at each variable separately, the distributions are identical between both cases.
```{python}
#| echo: false
import pandas as pd
import numpy as np
from plotnine import ggplot, aes
from plotnine import geom_boxplot, geom_point, geom_col
from plotnine import labs, scale_x_continuous, scale_y_continuous, coord_flip, facet_wrap
from mizani.formatters import percent_format
folder = 'S:\\Python\\projects\\anomaly_detection\\data\\'
exampleDF = pd.read_csv(folder + 'exampleDataDF.csv')
temp = exampleDF.melt(id_vars=['LABEL'], value_vars=['X1', 'X2'])
graph = (
ggplot(temp, aes(x = 'variable', y = 'value', fill = 'factor(LABEL)'))
+ geom_boxplot()
+ labs(x = "Variable", y = "Value", fill = "Label")
+ coord_flip()
)
graph
```
It is only when both variables are considered together that an obvious separation exists.
```{python}
#| echo: false
graph = (
ggplot(exampleDF, aes(x = 'X1', y = 'X2', color = 'factor(LABEL)'))
+ geom_point(alpha = .20, shape = '.')
+ labs(x = "X1", y = "X2", color = "LABEL")
)
graph
```
To vary the challenge, zero to ten extra variables are added. By design, these variables have no separation between the two cases. The idea is to introduce similarity. When zero extra variables are added, each model should be able to identify outliers. With 10 extra variables, outliers are identical to typical data for ten out of twelve variables making outliers harder to find.
Each setting of the simulation is repeated 10 times, and performance metrics are averaged to reduce variability.
# Results
Measured by recall (proportion of outliers found), the SVM model with the radial basis function kernel and Local Outlier Factor with cosine similarity model have nearly identical performance. Interestingly, the linear approximation SVM has very different performance then the real SVM with rbf.
```{python}
#| echo: false
resultDF = pd.read_csv(folder + 'results.csv')
resultDF = resultDF.groupby(['n_col_extra', 'model'], as_index=False)[['percision', 'recall']].mean()
temp = resultDF.melt(id_vars=['n_col_extra', 'model'], value_vars=['percision', 'recall'])
graph = (
ggplot(resultDF, aes(x = 'n_col_extra', y = 'recall'))
+ geom_col(alpha = .40)
+ scale_x_continuous(breaks = np.arange(0, 11, 2))
+ scale_y_continuous(labels = percent_format())
+ labs( x = "Number of Extra Variables", y = "Recall")
+ coord_flip()
+ facet_wrap("~model")
)
graph
```
Measured by precision (proportion of predicted outliers that are actually outliers), Local Outlier Factor with euclidean or minikowski distance are best. The cosine similarity model has nearly the same precision.
```{python}
#| echo: false
graph = (
ggplot(resultDF, aes(x = 'n_col_extra', y = 'percision'))
+ geom_col(alpha = .40)
+ scale_x_continuous(breaks = np.arange(0, 11, 2))
+ scale_y_continuous(labels = percent_format())
+ labs( x = "Number of Extra Variables", y = "Percision")
+ coord_flip()
+ facet_wrap("~model")
)
graph
```
Taking both metrics together, the Local Outlier Factor with cosine similarity model is best at least for these data.
# Only One Class
In the real word, anomaly detection models can be trained without a single example of an outlier. This is a double-edged sword. Given 1,000 outliers, there could be 1,000 distinct definitions of why the data points are outliers. Learning these 1,000 definitions using just one example each is impossible. In addition, new definitions may present themselves after model training. The ability to train a model without providing a single example of an outlier is a major advantage.
Because there are no real examples of outliers, it is impossible to know if the trained model is any good at identifying outliers. One possible sanity check is to randomly shuffle a few variables from observed data and use them as synthetic outliers. Can the model identify these data points?
Even if performance is great for the synthetic data, it does not prove real world outliers will be identified. There are two options. Either accept this blind spot or find examples of real world outliers prior to model deployment. There is no work-around for quality data.
# Scalability
Local outlier factor is variation of k-nearest neighbors. Therefore the model essentially does nothing during the .fit call and only learns at prediction time. Every call to .predict is associated with learning. Prediction calculations are not going to be fast.
Luckily, computation is done on a per-row basis of prediction data. This means it can be spread across CPU threads and even multiple CPUs. Serverless horizontal scaling with the cloud is necessary for enterprise scale.
For SVMs, prediction calculations do not require learning. Model training can be done outside of production. However, model training involves a single threaded algorithm with quadratic in sample size complexity. Thus training time grows very quickly with respect to sample size.
The SGDOneClassSVM model in scikit-learn approximates the SVM model with the rbf kernel. This model has linear complexity instead of quadratic. Unfortunately, the approximation did not work well for these data.