[ML][DOCS] Add documentation for detector rules and filters (#32013)
dimitris-athanasiou committed Jul 25, 2018
1 parent 41b12e2 commit 4cdef4e
Showing 16 changed files with 648 additions and 12 deletions.
10 changes: 10 additions & 0 deletions x-pack/docs/build.gradle
@@ -310,6 +310,16 @@ setups['farequote_datafeed'] = setups['farequote_job'] + '''
"job_id":"farequote",
"indexes":"farequote"
}
'''
setups['ml_filter_safe_domains'] = '''
- do:
xpack.ml.put_filter:
filter_id: "safe_domains"
body: >
{
"description": "A list of safe domains",
"items": ["*.google.com", "wikipedia.org"]
}
'''
setups['server_metrics_index'] = '''
- do:
9 changes: 9 additions & 0 deletions x-pack/docs/en/ml/api-quickref.asciidoc
@@ -47,6 +47,15 @@ The main {ml} resources can be accessed with a variety of endpoints:
* {ref}/ml-delete-calendar-job.html[DELETE /calendars/<calendar_id+++>+++/jobs/<job_id+++>+++]: Disassociate a job from a calendar
* {ref}/ml-delete-calendar.html[DELETE /calendars/<calendar_id+++>+++]: Delete a calendar

[float]
[[ml-api-filters]]
=== /filters/

* {ref}/ml-put-filter.html[PUT /filters/<filter_id+++>+++]: Create a filter
* {ref}/ml-update-filter.html[POST /filters/<filter_id+++>+++/_update]: Update a filter
* {ref}/ml-get-filter.html[GET /filters/<filter_id+++>+++]: List filters
* {ref}/ml-delete-filter.html[DELETE /filters/<filter_id+++>+++]: Delete a filter

[float]
[[ml-api-datafeeds]]
=== /datafeeds/
4 changes: 4 additions & 0 deletions x-pack/docs/en/ml/configuring.asciidoc
@@ -34,6 +34,7 @@ The scenarios in this section describe some best practices for generating useful
* <<ml-configuring-categories>>
* <<ml-configuring-pop>>
* <<ml-configuring-transform>>
* <<ml-configuring-detector-custom-rules>>

:edit_url: https://github.com/elastic/elasticsearch/edit/{branch}/x-pack/docs/en/ml/customurl.asciidoc
include::customurl.asciidoc[]
@@ -49,3 +50,6 @@ include::populations.asciidoc[]

:edit_url: https://github.com/elastic/elasticsearch/edit/{branch}/x-pack/docs/en/ml/transforms.asciidoc
include::transforms.asciidoc[]

:edit_url: https://github.com/elastic/elasticsearch/edit/{branch}/x-pack/docs/en/ml/detector-custom-rules.asciidoc
include::detector-custom-rules.asciidoc[]
230 changes: 230 additions & 0 deletions x-pack/docs/en/ml/detector-custom-rules.asciidoc
@@ -0,0 +1,230 @@
[role="xpack"]
[[ml-configuring-detector-custom-rules]]
=== Customizing detectors with rules and filters

<<ml-rules,Rules and filters>> enable you to change the behavior of anomaly
detectors based on domain-specific knowledge.

Rules describe _when_ a detector should take a certain _action_ instead
of following its default behavior. To specify the _when_, a rule uses
a `scope` and `conditions`. You can think of `scope` as the categorical
specification of a rule, while `conditions` are the numerical part.
A rule can have a scope, one or more conditions, or a combination of
scope and conditions.
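
As a quick sketch of a rule's anatomy (the field and filter names here are
hypothetical), a rule combining scope and conditions could look like this
within a detector configuration:

[source,js]
----------------------------------
"custom_rules": [{
  "actions": ["skip_result"],
  "scope": {
    "my_partition": {
      "filter_id": "my_filter",
      "filter_type": "include"
    }
  },
  "conditions": [
    {
      "applies_to": "actual",
      "operator": "lt",
      "value": 10.0
    }
  ]
}]
----------------------------------
// NOTCONSOLE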

Let us see how these can be configured through examples.

==== Specifying rule scope

Let us assume we are configuring a job in order to detect DNS data
exfiltration. Our data contain the fields `subdomain` and
`highest_registered_domain`. We can use a detector that looks like
`high_info_content(subdomain) over highest_registered_domain`.
If we run such a job, it is possible that we discover a lot of anomalies on
frequently used domains that we have reason to trust. As security analysts, we
are not interested in such anomalies. Ideally, we could instruct the detector to
skip results for domains that we consider safe. Using a rule with a scope allows
us to achieve this.

First, we need to create a list of our safe domains. Such lists are called
`filters` in {ml}. Filters can be shared across jobs.

We create our filter using the {ref}/ml-put-filter.html[put filter API]:

[source,js]
----------------------------------
PUT _xpack/ml/filters/safe_domains
{
"description": "Our list of safe domains",
"items": ["safe.com", "trusted.com"]
}
----------------------------------
// CONSOLE
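
If we want to double-check the filter, we can retrieve it with the
{ref}/ml-get-filter.html[get filters API]:

[source,js]
----------------------------------
GET _xpack/ml/filters/safe_domains
----------------------------------
// CONSOLE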

Now, we can create our job specifying a scope that uses the filter for the
`highest_registered_domain` field:

[source,js]
----------------------------------
PUT _xpack/ml/anomaly_detectors/dns_exfiltration_with_rule
{
"analysis_config" : {
"bucket_span":"5m",
"detectors" :[{
"function":"high_info_content",
"field_name": "subdomain",
"over_field_name": "highest_registered_domain",
"custom_rules": [{
"actions": ["skip_result"],
"scope": {
"highest_registered_domain": {
"filter_id": "safe_domains",
"filter_type": "include"
}
}
}]
}]
},
"data_description" : {
"time_field":"timestamp"
}
}
----------------------------------
// CONSOLE

As time advances and we see more data and more results, we might encounter new
domains that we want to add to the filter. We can do that with the
{ref}/ml-update-filter.html[update filter API]:

[source,js]
----------------------------------
POST _xpack/ml/filters/safe_domains/_update
{
"add_items": ["another-safe.com"]
}
----------------------------------
// CONSOLE
// TEST[setup:ml_filter_safe_domains]
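
The update filter API also accepts a `remove_items` field, so if we later
decide a domain is not safe after all, we can take it back out:

[source,js]
----------------------------------
POST _xpack/ml/filters/safe_domains/_update
{
  "remove_items": ["another-safe.com"]
}
----------------------------------
// CONSOLE
// TEST[setup:ml_filter_safe_domains]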

Note that we can provide a scope for any of the partition/over/by fields.
In the following example, we scope multiple fields:

[source,js]
----------------------------------
PUT _xpack/ml/anomaly_detectors/scoping_multiple_fields
{
"analysis_config" : {
"bucket_span":"5m",
"detectors" :[{
"function":"count",
"partition_field_name": "my_partition",
"over_field_name": "my_over",
"by_field_name": "my_by",
"custom_rules": [{
"actions": ["skip_result"],
"scope": {
"my_partition": {
"filter_id": "filter_1"
},
"my_over": {
"filter_id": "filter_2"
},
"my_by": {
"filter_id": "filter_3"
}
}
}]
}]
},
"data_description" : {
"time_field":"timestamp"
}
}
----------------------------------
// CONSOLE

Such a detector will skip results when the values of all three scoped fields
are included in the referenced filters.
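
So far we have used `filter_type: include`, which applies the rule when the
field value is in the filter. Setting `filter_type` to `exclude` inverts this:
the rule applies when the value is _not_ in the filter. As a sketch (the
filter name is hypothetical), a detector fragment that skips results for all
domains except those on a watch list could use:

[source,js]
----------------------------------
"custom_rules": [{
  "actions": ["skip_result"],
  "scope": {
    "highest_registered_domain": {
      "filter_id": "watched_domains",
      "filter_type": "exclude"
    }
  }
}]
----------------------------------
// NOTCONSOLE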

==== Specifying rule conditions

Imagine a detector that looks for anomalies in CPU utilization.
Given a machine that is idle for long enough, a small movement in CPU
utilization could result in anomalous results where the `actual` value is
quite small, for example, 0.02. Given our knowledge of how CPU utilization
behaves, we might decide that anomalies with such small actual values are not
interesting to investigate.

Let us now configure a job with a rule that skips results where CPU
utilization is less than 0.20:

[source,js]
----------------------------------
PUT _xpack/ml/anomaly_detectors/cpu_with_rule
{
"analysis_config" : {
"bucket_span":"5m",
"detectors" :[{
"function":"high_mean",
"field_name": "cpu_utilization",
"custom_rules": [{
"actions": ["skip_result"],
"conditions": [
{
"applies_to": "actual",
"operator": "lt",
"value": 0.20
}
]
}]
}]
},
"data_description" : {
"time_field":"timestamp"
}
}
----------------------------------
// CONSOLE

When there are multiple conditions, they are combined with a logical `and`.
This is useful when we want the rule to apply to a range: we simply create
a rule with two conditions, one for each end of the desired range.

Here is an example where a count detector will skip results when the count
is greater than 30 and less than 50:

[source,js]
----------------------------------
PUT _xpack/ml/anomaly_detectors/rule_with_range
{
"analysis_config" : {
"bucket_span":"5m",
"detectors" :[{
"function":"count",
"custom_rules": [{
"actions": ["skip_result"],
"conditions": [
{
"applies_to": "actual",
"operator": "gt",
"value": 30
},
{
"applies_to": "actual",
"operator": "lt",
"value": 50
}
]
}]
}]
},
"data_description" : {
"time_field":"timestamp"
}
}
----------------------------------
// CONSOLE

==== Rules in the life-cycle of a job

Rules only affect results created after the rules were applied.
Let us imagine that we have configured a job and it has been running
for some time. After observing its results, we decide that we can employ
rules to get rid of some uninteresting results. We can use the update job API
to do so, as the sketch below shows. However, the rules we add will only be in
effect for results created from that moment onwards. Past results remain
unaffected.
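
A minimal sketch of such an update with the
{ref}/ml-update-job.html[update job API], assuming we want to add the earlier
CPU rule to the first detector (`detector_index` 0) of the `cpu_with_rule`
job:

[source,js]
----------------------------------
POST _xpack/ml/anomaly_detectors/cpu_with_rule/_update
{
  "detectors": [{
    "detector_index": 0,
    "custom_rules": [{
      "actions": ["skip_result"],
      "conditions": [
        {
          "applies_to": "actual",
          "operator": "lt",
          "value": 0.20
        }
      ]
    }]
  }]
}
----------------------------------
// NOTCONSOLE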

==== Using rules vs. filtering data

It might appear that using rules is just another way of filtering the data
that feeds into a job. For example, a rule that skips results when the
partition field value is in a filter sounds equivalent to having a query
that filters out such documents. But there is a fundamental difference:
when the data is filtered before reaching a job, it is as if it never
existed for the job. With rules, the data still reaches the job and affects
its behavior (depending on the rule actions).

For example, a rule with the `skip_result` action means all data will still
be modeled. On the other hand, a rule with the `skip_model_update` action
means results will still be created even though the model is not updated by
data matched by the rule.
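
For instance, a sketch of a rule fragment that keeps reporting results but
shields the model from unusually large values (the threshold here is
hypothetical):

[source,js]
----------------------------------
"custom_rules": [{
  "actions": ["skip_model_update"],
  "conditions": [
    {
      "applies_to": "actual",
      "operator": "gt",
      "value": 100.0
    }
  ]
}]
----------------------------------
// NOTCONSOLE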
2 changes: 2 additions & 0 deletions x-pack/docs/en/ml/functions/geo.asciidoc
@@ -8,6 +8,8 @@ input data.
The {xpackml} features include the following geographic function: `lat_long`.

NOTE: You cannot create forecasts for jobs that contain geographic functions.
You also cannot add rules with conditions to detectors that use geographic
functions.

[float]
[[ml-lat-long]]
4 changes: 3 additions & 1 deletion x-pack/docs/en/ml/functions/metric.asciidoc
@@ -15,6 +15,9 @@ The {xpackml} features include the following metric functions:
* <<ml-metric-metric,`metric`>>
* xref:ml-metric-varp[`varp`, `high_varp`, `low_varp`]

NOTE: You cannot add rules with conditions to detectors that use the `metric`
function.

[float]
[[ml-metric-min]]
==== Min
@@ -221,7 +224,6 @@ mean `responsetime` for each application over time. It detects when the mean
The `metric` function combines `min`, `max`, and `mean` functions. You can use
it as a shorthand for a combined analysis. If you do not specify a function in
a detector, this is the default function.
//TBD: Is that default behavior still true?

High- and low-sided functions are not applicable. You cannot use this function
when a `summary_count_field_name` is specified.
2 changes: 2 additions & 0 deletions x-pack/docs/en/ml/functions/rare.asciidoc
@@ -15,6 +15,8 @@ number of times (frequency) rare values occur.
`exclude_frequent`.
* You cannot create forecasts for jobs that contain `rare` or `freq_rare`
functions.
* You cannot add rules with conditions to detectors that use `rare` or
`freq_rare` functions.
* Shorter bucket spans (less than 1 hour, for example) are recommended when
looking for rare events. The functions model whether something happens in a
bucket at least once. With longer bucket spans, it is more likely that
3 changes: 3 additions & 0 deletions x-pack/docs/en/rest-api/defs.asciidoc
@@ -8,6 +8,7 @@ job configuration options.
* <<ml-calendar-resource,Calendars>>
* <<ml-datafeed-resource,{dfeeds-cap}>>
* <<ml-datafeed-counts,{dfeed-cap} counts>>
* <<ml-filter-resource,Filters>>
* <<ml-job-resource,Jobs>>
* <<ml-jobstats,Job statistics>>
* <<ml-snapshot-resource,Model snapshots>>
@@ -19,6 +20,8 @@ include::ml/calendarresource.asciidoc[]
[role="xpack"]
include::ml/datafeedresource.asciidoc[]
[role="xpack"]
include::ml/filterresource.asciidoc[]
[role="xpack"]
include::ml/jobresource.asciidoc[]
[role="xpack"]
include::ml/jobcounts.asciidoc[]
12 changes: 12 additions & 0 deletions x-pack/docs/en/rest-api/ml-api.asciidoc
@@ -15,6 +15,14 @@ machine learning APIs and in advanced job configuration options in Kibana.
* <<ml-post-calendar-event,Add scheduled events to calendar>>, <<ml-delete-calendar-event,Delete scheduled events from calendar>>
* <<ml-get-calendar,Get calendars>>, <<ml-get-calendar-event,Get scheduled events>>

[float]
[[ml-api-filter-endpoint]]
=== Filters

* <<ml-put-filter,Create filter>>, <<ml-delete-filter,Delete filter>>
* <<ml-update-filter,Update filter>>
* <<ml-get-filter,Get filters>>

[float]
[[ml-api-datafeed-endpoint]]
=== {dfeeds-cap}
@@ -69,11 +77,13 @@ include::ml/close-job.asciidoc[]
//CREATE
include::ml/put-calendar.asciidoc[]
include::ml/put-datafeed.asciidoc[]
include::ml/put-filter.asciidoc[]
include::ml/put-job.asciidoc[]
//DELETE
include::ml/delete-calendar.asciidoc[]
include::ml/delete-datafeed.asciidoc[]
include::ml/delete-calendar-event.asciidoc[]
include::ml/delete-filter.asciidoc[]
include::ml/delete-job.asciidoc[]
include::ml/delete-calendar-job.asciidoc[]
include::ml/delete-snapshot.asciidoc[]
@@ -93,6 +103,7 @@ include::ml/get-job.asciidoc[]
include::ml/get-job-stats.asciidoc[]
include::ml/get-snapshot.asciidoc[]
include::ml/get-calendar-event.asciidoc[]
include::ml/get-filter.asciidoc[]
include::ml/get-record.asciidoc[]
//OPEN
include::ml/open-job.asciidoc[]
@@ -107,6 +118,7 @@ include::ml/start-datafeed.asciidoc[]
include::ml/stop-datafeed.asciidoc[]
//UPDATE
include::ml/update-datafeed.asciidoc[]
include::ml/update-filter.asciidoc[]
include::ml/update-job.asciidoc[]
include::ml/update-snapshot.asciidoc[]
//VALIDATE