[ML][DOCS] Add documentation for detector rules and filters (#32013)
dimitris-athanasiou committed Jul 25, 2018
1 parent 41b12e2 commit 4cdef4e
Showing 16 changed files with 648 additions and 12 deletions.
10 changes: 10 additions & 0 deletions x-pack/docs/build.gradle
@@ -310,6 +310,16 @@ setups['farequote_datafeed'] = setups['farequote_job'] + '''
"job_id":"farequote",
"indexes":"farequote"
}
'''
setups['ml_filter_safe_domains'] = '''
- do:
xpack.ml.put_filter:
filter_id: "safe_domains"
body: >
{
"description": "A list of safe domains",
"items": ["*.google.com", "wikipedia.org"]
}
'''
setups['server_metrics_index'] = '''
- do:
9 changes: 9 additions & 0 deletions x-pack/docs/en/ml/api-quickref.asciidoc
@@ -47,6 +47,15 @@ The main {ml} resources can be accessed with a variety of endpoints:
* {ref}/ml-delete-calendar-job.html[DELETE /calendars/<calendar_id+++>+++/jobs/<job_id+++>+++]: Disassociate a job from a calendar
* {ref}/ml-delete-calendar.html[DELETE /calendars/<calendar_id+++>+++]: Delete a calendar

[float]
[[ml-api-filters]]
=== /filters/

* {ref}/ml-put-filter.html[PUT /filters/<filter_id+++>+++]: Create a filter
* {ref}/ml-update-filter.html[POST /filters/<filter_id+++>+++/_update]: Update a filter
* {ref}/ml-get-filter.html[GET /filters/<filter_id+++>+++]: List filters
* {ref}/ml-delete-filter.html[DELETE /filters/<filter_id+++>+++]: Delete a filter

[float]
[[ml-api-datafeeds]]
=== /datafeeds/
4 changes: 4 additions & 0 deletions x-pack/docs/en/ml/configuring.asciidoc
@@ -34,6 +34,7 @@ The scenarios in this section describe some best practices for generating useful
* <<ml-configuring-categories>>
* <<ml-configuring-pop>>
* <<ml-configuring-transform>>
* <<ml-configuring-detector-custom-rules>>

:edit_url: https://github.com/elastic/elasticsearch/edit/{branch}/x-pack/docs/en/ml/customurl.asciidoc
include::customurl.asciidoc[]
@@ -49,3 +50,6 @@ include::populations.asciidoc[]

:edit_url: https://github.com/elastic/elasticsearch/edit/{branch}/x-pack/docs/en/ml/transforms.asciidoc
include::transforms.asciidoc[]

:edit_url: https://github.com/elastic/elasticsearch/edit/{branch}/x-pack/docs/en/ml/detector-custom-rules.asciidoc
include::detector-custom-rules.asciidoc[]
230 changes: 230 additions & 0 deletions x-pack/docs/en/ml/detector-custom-rules.asciidoc
@@ -0,0 +1,230 @@
[role="xpack"]
[[ml-configuring-detector-custom-rules]]
=== Customizing detectors with rules and filters

<<ml-rules,Rules and filters>> enable you to change the behavior of anomaly
detectors based on domain-specific knowledge.

Rules describe _when_ a detector should take a certain _action_ instead
of following its default behavior. To specify the _when_, a rule uses
a `scope` and `conditions`. You can think of `scope` as the categorical
specification of a rule, while `conditions` are the numerical part.
A rule can have a scope, one or more conditions, or a combination of
scope and conditions.
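
As a quick sketch of a rule's anatomy (the field and filter names here are
hypothetical), a rule combining scope and conditions could look like this
within a detector configuration:

[source,js]
----------------------------------
"custom_rules": [{
  "actions": ["skip_result"],
  "scope": {
    "my_partition": {
      "filter_id": "my_filter",
      "filter_type": "include"
    }
  },
  "conditions": [
    {
      "applies_to": "actual",
      "operator": "lt",
      "value": 10.0
    }
  ]
}]
----------------------------------
// NOTCONSOLE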

Let us see how these can be configured through examples.

==== Specifying rule scope

Let us assume we are configuring a job in order to detect DNS data
exfiltration. Our data contain the fields `subdomain` and
`highest_registered_domain`. We can use a detector that looks like
`high_info_content(subdomain) over highest_registered_domain`.
If we run such a job, it is possible that we discover a lot of anomalies on
frequently used domains that we have reason to trust. As security analysts, we
are not interested in such anomalies. Ideally, we could instruct the detector to
skip results for domains that we consider safe. Using a rule with a scope allows
us to achieve this.

First, we need to create a list of our safe domains. Such lists are called
`filters` in {ml}. Filters can be shared across jobs.

We create our filter using the {ref}/ml-put-filter.html[put filter API]:

[source,js]
----------------------------------
PUT _xpack/ml/filters/safe_domains
{
"description": "Our list of safe domains",
"items": ["safe.com", "trusted.com"]
}
----------------------------------
// CONSOLE
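
If we want to double-check the filter, we can retrieve it with the
{ref}/ml-get-filter.html[get filters API]:

[source,js]
----------------------------------
GET _xpack/ml/filters/safe_domains
----------------------------------
// CONSOLE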

Now, we can create our job specifying a scope that uses the filter for the
`highest_registered_domain` field:

[source,js]
----------------------------------
PUT _xpack/ml/anomaly_detectors/dns_exfiltration_with_rule
{
"analysis_config" : {
"bucket_span":"5m",
"detectors" :[{
"function":"high_info_content",
"field_name": "subdomain",
"over_field_name": "highest_registered_domain",
"custom_rules": [{
"actions": ["skip_result"],
"scope": {
"highest_registered_domain": {
"filter_id": "safe_domains",
"filter_type": "include"
}
}
}]
}]
},
"data_description" : {
"time_field":"timestamp"
}
}
----------------------------------
// CONSOLE

As time advances and we see more data and more results, we might encounter new
domains that we want to add to the filter. We can do that with the
{ref}/ml-update-filter.html[update filter API]:

[source,js]
----------------------------------
POST _xpack/ml/filters/safe_domains/_update
{
"add_items": ["another-safe.com"]
}
----------------------------------
// CONSOLE
// TEST[setup:ml_filter_safe_domains]
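
The update filter API also accepts a `remove_items` field, so if we later
decide a domain is not safe after all, we can take it back out:

[source,js]
----------------------------------
POST _xpack/ml/filters/safe_domains/_update
{
  "remove_items": ["another-safe.com"]
}
----------------------------------
// CONSOLE
// TEST[setup:ml_filter_safe_domains]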

Note that we can provide a scope for any of the partition/over/by fields.
In the following example, we scope multiple fields:

[source,js]
----------------------------------
PUT _xpack/ml/anomaly_detectors/scoping_multiple_fields
{
"analysis_config" : {
"bucket_span":"5m",
"detectors" :[{
"function":"count",
"partition_field_name": "my_partition",
"over_field_name": "my_over",
"by_field_name": "my_by",
"custom_rules": [{
"actions": ["skip_result"],
"scope": {
"my_partition": {
"filter_id": "filter_1"
},
"my_over": {
"filter_id": "filter_2"
},
"my_by": {
"filter_id": "filter_3"
}
}
}]
}]
},
"data_description" : {
"time_field":"timestamp"
}
}
----------------------------------
// CONSOLE

Such a detector will skip results when the values of all three scoped fields
are included in the referenced filters.
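
So far we have used `filter_type: include`, which applies the rule when the
field value is in the filter. Setting `filter_type` to `exclude` inverts this:
the rule applies when the value is _not_ in the filter. As a sketch (the
filter name is hypothetical), a detector fragment that skips results for all
domains except those on a watch list could use:

[source,js]
----------------------------------
"custom_rules": [{
  "actions": ["skip_result"],
  "scope": {
    "highest_registered_domain": {
      "filter_id": "watched_domains",
      "filter_type": "exclude"
    }
  }
}]
----------------------------------
// NOTCONSOLE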

==== Specifying rule conditions

Imagine a detector that looks for anomalies in CPU utilization.
Given a machine that is idle for long enough, a small movement in CPU
utilization could result in anomalous results where the `actual` value is
quite small, for example, 0.02. Given our knowledge of how CPU utilization
behaves, we might decide that anomalies with such small actual values are not
interesting to investigate.

Let us now configure a job with a rule that skips results where CPU
utilization is less than 0.20:

[source,js]
----------------------------------
PUT _xpack/ml/anomaly_detectors/cpu_with_rule
{
"analysis_config" : {
"bucket_span":"5m",
"detectors" :[{
"function":"high_mean",
"field_name": "cpu_utilization",
"custom_rules": [{
"actions": ["skip_result"],
"conditions": [
{
"applies_to": "actual",
"operator": "lt",
"value": 0.20
}
]
}]
}]
},
"data_description" : {
"time_field":"timestamp"
}
}
----------------------------------
// CONSOLE

When there are multiple conditions, they are combined with a logical `and`.
This is useful when we want the rule to apply to a range: we simply create
a rule with two conditions, one for each end of the desired range.

Here is an example where a count detector will skip results when the count
is greater than 30 and less than 50:

[source,js]
----------------------------------
PUT _xpack/ml/anomaly_detectors/rule_with_range
{
"analysis_config" : {
"bucket_span":"5m",
"detectors" :[{
"function":"count",
"custom_rules": [{
"actions": ["skip_result"],
"conditions": [
{
"applies_to": "actual",
"operator": "gt",
"value": 30
},
{
"applies_to": "actual",
"operator": "lt",
"value": 50
}
]
}]
}]
},
"data_description" : {
"time_field":"timestamp"
}
}
----------------------------------
// CONSOLE

==== Rules in the life-cycle of a job

Rules only affect results created after the rules were applied.
Let us imagine that we have configured a job and it has been running
for some time. After observing its results, we decide that we can employ
rules to get rid of some uninteresting results. We can use the update job API
to do so, as the sketch below shows. However, the rules we add will only be in
effect for results created from that moment onwards. Past results remain
unaffected.
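
A minimal sketch of such an update with the
{ref}/ml-update-job.html[update job API], assuming we want to add the earlier
CPU rule to the first detector (`detector_index` 0) of the `cpu_with_rule`
job:

[source,js]
----------------------------------
POST _xpack/ml/anomaly_detectors/cpu_with_rule/_update
{
  "detectors": [{
    "detector_index": 0,
    "custom_rules": [{
      "actions": ["skip_result"],
      "conditions": [
        {
          "applies_to": "actual",
          "operator": "lt",
          "value": 0.20
        }
      ]
    }]
  }]
}
----------------------------------
// NOTCONSOLE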

==== Using rules vs. filtering data

It might appear that using rules is just another way of filtering the data
that feeds into a job. For example, a rule that skips results when the
partition field value is in a filter sounds equivalent to having a query
that filters out such documents. But there is a fundamental difference:
when the data is filtered before reaching a job, it is as if it never
existed for the job. With rules, the data still reaches the job and affects
its behavior (depending on the rule actions).

For example, a rule with the `skip_result` action means all data will still
be modeled. On the other hand, a rule with the `skip_model_update` action
means results will still be created even though the model is not updated by
data matched by the rule.
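
For instance, a sketch of a rule fragment that keeps reporting results but
shields the model from unusually large values (the threshold here is
hypothetical):

[source,js]
----------------------------------
"custom_rules": [{
  "actions": ["skip_model_update"],
  "conditions": [
    {
      "applies_to": "actual",
      "operator": "gt",
      "value": 100.0
    }
  ]
}]
----------------------------------
// NOTCONSOLE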
2 changes: 2 additions & 0 deletions x-pack/docs/en/ml/functions/geo.asciidoc
@@ -8,6 +8,8 @@ input data.
The {xpackml} features include the following geographic function: `lat_long`.

NOTE: You cannot create forecasts for jobs that contain geographic functions.
You also cannot add rules with conditions to detectors that use geographic
functions.

[float]
[[ml-lat-long]]
4 changes: 3 additions & 1 deletion x-pack/docs/en/ml/functions/metric.asciidoc
@@ -15,6 +15,9 @@ The {xpackml} features include the following metric functions:
* <<ml-metric-metric,`metric`>>
* xref:ml-metric-varp[`varp`, `high_varp`, `low_varp`]

NOTE: You cannot add rules with conditions to detectors that use the `metric`
function.

[float]
[[ml-metric-min]]
==== Min
@@ -221,7 +224,6 @@ mean `responsetime` for each application over time. It detects when the mean
The `metric` function combines `min`, `max`, and `mean` functions. You can use
it as a shorthand for a combined analysis. If you do not specify a function in
a detector, this is the default function.
//TBD: Is that default behavior still true?

High- and low-sided functions are not applicable. You cannot use this function
when a `summary_count_field_name` is specified.
2 changes: 2 additions & 0 deletions x-pack/docs/en/ml/functions/rare.asciidoc
@@ -15,6 +15,8 @@ number of times (frequency) rare values occur.
`exclude_frequent`.
* You cannot create forecasts for jobs that contain `rare` or `freq_rare`
functions.
* You cannot add rules with conditions to detectors that use `rare` or
`freq_rare` functions.
* Shorter bucket spans (less than 1 hour, for example) are recommended when
looking for rare events. The functions model whether something happens in a
bucket at least once. With longer bucket spans, it is more likely that
3 changes: 3 additions & 0 deletions x-pack/docs/en/rest-api/defs.asciidoc
@@ -8,6 +8,7 @@ job configuration options.
* <<ml-calendar-resource,Calendars>>
* <<ml-datafeed-resource,{dfeeds-cap}>>
* <<ml-datafeed-counts,{dfeed-cap} counts>>
* <<ml-filter-resource,Filters>>
* <<ml-job-resource,Jobs>>
* <<ml-jobstats,Job statistics>>
* <<ml-snapshot-resource,Model snapshots>>
@@ -19,6 +20,8 @@ include::ml/calendarresource.asciidoc[]
[role="xpack"]
include::ml/datafeedresource.asciidoc[]
[role="xpack"]
include::ml/filterresource.asciidoc[]
[role="xpack"]
include::ml/jobresource.asciidoc[]
[role="xpack"]
include::ml/jobcounts.asciidoc[]
12 changes: 12 additions & 0 deletions x-pack/docs/en/rest-api/ml-api.asciidoc
@@ -15,6 +15,14 @@ machine learning APIs and in advanced job configuration options in Kibana.
* <<ml-post-calendar-event,Add scheduled events to calendar>>, <<ml-delete-calendar-event,Delete scheduled events from calendar>>
* <<ml-get-calendar,Get calendars>>, <<ml-get-calendar-event,Get scheduled events>>

[float]
[[ml-api-filter-endpoint]]
=== Filters

* <<ml-put-filter,Create filter>>, <<ml-delete-filter,Delete filter>>
* <<ml-update-filter,Update filter>>
* <<ml-get-filter,Get filters>>

[float]
[[ml-api-datafeed-endpoint]]
=== {dfeeds-cap}
@@ -69,11 +77,13 @@ include::ml/close-job.asciidoc[]
//CREATE
include::ml/put-calendar.asciidoc[]
include::ml/put-datafeed.asciidoc[]
include::ml/put-filter.asciidoc[]
include::ml/put-job.asciidoc[]
//DELETE
include::ml/delete-calendar.asciidoc[]
include::ml/delete-datafeed.asciidoc[]
include::ml/delete-calendar-event.asciidoc[]
include::ml/delete-filter.asciidoc[]
include::ml/delete-job.asciidoc[]
include::ml/delete-calendar-job.asciidoc[]
include::ml/delete-snapshot.asciidoc[]
@@ -93,6 +103,7 @@ include::ml/get-job.asciidoc[]
include::ml/get-job-stats.asciidoc[]
include::ml/get-snapshot.asciidoc[]
include::ml/get-calendar-event.asciidoc[]
include::ml/get-filter.asciidoc[]
include::ml/get-record.asciidoc[]
//OPEN
include::ml/open-job.asciidoc[]
@@ -107,6 +118,7 @@ include::ml/start-datafeed.asciidoc[]
include::ml/stop-datafeed.asciidoc[]
//UPDATE
include::ml/update-datafeed.asciidoc[]
include::ml/update-filter.asciidoc[]
include::ml/update-job.asciidoc[]
include::ml/update-snapshot.asciidoc[]
//VALIDATE