[ML] adding new KS test pipeline aggregation (#73334)

This adds a new pipeline aggregation for calculating a Kolmogorov–Smirnov test for a given sample and buckets path.

For now, the buckets path resolution needs to be `_count`, but this may be relaxed in the future.

It accepts a parameter `fractions` that indicates the distribution of documents from some other pre-calculated sample.

This particular version of the K-S test is two-sample, meaning it tests whether the `fractions` and the distribution of `_count` values in the `buckets_path` are drawn from the same distribution.

This, in combination with the hypothesis alternatives (`less`, `greater`, `two_sided`) and sampling logic (`upper_tail`, `lower_tail`, `uniform`), allows for flexibility when comparing two samples and determining the likelihood of them being from the same overall distribution.

Usage:
```
POST correlate_latency/_search?size=0&filter_path=aggregations
{
  "aggs": {
    "buckets": {
      "terms": { <1>
        "field": "version",
        "size": 2
      },
      "aggs": {
        "latency_ranges": {
          "range": { <2>
            "field": "latency",
            "ranges": [
              { "to": 0.0 },
              { "from": 0, "to": 105 },
              { "from": 105, "to": 225 },
              { "from": 225, "to": 445 },
              { "from": 445, "to": 665 },
              { "from": 665, "to": 885 },
              { "from": 885, "to": 1115 },
              { "from": 1115, "to": 1335 },
              { "from": 1335, "to": 1555 },
              { "from": 1555, "to": 1775 },
              { "from": 1775 }
            ]
          }
        },
        "ks_test": { <3>
          "bucket_count_ks_test": {
            "buckets_path": "latency_ranges>_count",
            "alternative": ["less", "greater", "two_sided"]
          }
        }
      }
    }
  }
}
```
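To make the comparison concrete, here is a minimal Python sketch of the statistic behind a two-sample K-S comparison on bucketed data. It builds cumulative distributions from the bucket `_count` values and from `fractions`, then takes the largest gaps between them. It is not the aggregation's implementation: the function name `ks_statistics` is hypothetical, `sampling_method` is ignored, no p-values are computed, and how the one-sided gaps map onto the `less` and `greater` alternatives is deliberately left unspecified.

```python
from itertools import accumulate


def ks_statistics(counts, fractions=None):
    """Return (d_plus, d_minus, d) for bucket counts versus a reference distribution.

    d_plus  : largest amount by which the observed CDF exceeds the reference CDF
    d_minus : largest amount by which the reference CDF exceeds the observed CDF
    d       : the larger of the two (the two-sided K-S statistic)

    Hypothetical helper for illustration only; it does not reproduce
    Elasticsearch's sampling or p-value logic.
    """
    total = sum(counts)
    if total == 0:
        raise ValueError("counts must contain at least one document")
    if fractions is None:
        # Assumption: mirror the documented default of a uniform distribution.
        fractions = [1.0 / len(counts)] * len(counts)
    if len(fractions) != len(counts):
        raise ValueError("fractions and counts must have the same length")

    fraction_total = sum(fractions)
    # Cumulative distributions evaluated at each bucket's upper edge.
    observed_cdf = [c / total for c in accumulate(counts)]
    reference_cdf = [f / fraction_total for f in accumulate(fractions)]

    diffs = [o - r for o, r in zip(observed_cdf, reference_cdf)]
    d_plus = max(0.0, max(diffs))
    d_minus = max(0.0, -min(diffs))
    return d_plus, d_minus, max(d_plus, d_minus)


# Bucket counts taken from the example response in the docs added by this commit.
version_1_counts = [0, 1, 9, 0, 0, 0, 10, 20, 20, 20, 20]
version_2_counts = [0, 19, 11, 20, 20, 20, 10, 0, 0, 0, 0]

if __name__ == "__main__":
    print("version 1.0:", ks_statistics(version_1_counts))
    print("version 2.0:", ks_statistics(version_2_counts))
```

In this sketch, version `1.0` has its mass shifted toward high latencies, so its observed CDF lags the uniform reference, while version `2.0` shows the opposite pattern. That asymmetry is what the one-sided alternatives are meant to distinguish.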
elastic · Jun 4, 2021 · 30cf4dc
1 parent c6c2f1b
commit 30cf4dc
Showing 25 changed files with 2,346 additions and 45 deletions.
diff --git a/docs/reference/aggregations/pipeline.asciidoc b/docs/reference/aggregations/pipeline.asciidoc
@@ -271,6 +271,10 @@ include::pipeline/avg-bucket-aggregation.asciidoc[]
 
 include::pipeline/bucket-script-aggregation.asciidoc[]
 
+include::pipeline/bucket-count-ks-test-aggregation.asciidoc[]
+
+include::pipeline/bucket-correlation-aggregation.asciidoc[]
+
 include::pipeline/bucket-selector-aggregation.asciidoc[]
 
 include::pipeline/bucket-sort-aggregation.asciidoc[]

diff --git a/docs/reference/aggregations/pipeline/bucket-correlation-aggregation.asciidoc b/docs/reference/aggregations/pipeline/bucket-correlation-aggregation.asciidoc
@@ -103,13 +103,13 @@ POST correlate_latency/_search?size=0&filter_path=aggregations
 {
   "aggs": {
     "buckets": {
-      "terms": {
+      "terms": { <1>
         "field": "version",
         "size": 2
       },
       "aggs": {
         "latency_ranges": {
-          "range": {
+          "range": { <2>
             "field": "latency",
             "ranges": [
               { "to": 0.0 },
@@ -126,7 +126,7 @@ POST correlate_latency/_search?size=0&filter_path=aggregations
             ]
           }
         },
-        "bucket_correlation": {
+        "bucket_correlation": { <3>
           "bucket_correlation": {
             "buckets_path": "latency_ranges>_count",
             "function": {

diff --git a/docs/reference/aggregations/pipeline/bucket-count-ks-test-aggregation.asciidoc b/docs/reference/aggregations/pipeline/bucket-count-ks-test-aggregation.asciidoc
@@ -0,0 +1,299 @@
+[role="xpack"]
+[testenv="basic"]
+[[search-aggregations-bucket-count-ks-test-aggregation]]
+=== Bucket count K-S test correlation aggregation
+++++
+<titleabbrev>Bucket count K-S test aggregation</titleabbrev>
+++++
+
+experimental::[]
+
+A sibling pipeline aggregation which executes a two-sample Kolmogorov–Smirnov test
+(referred to as a "K-S test" from now on) against a provided distribution and the
+distribution implied by the document counts in the configured sibling aggregation.
+Specifically, for some metric, assuming that the percentile intervals of the metric are
+known beforehand or have been computed by an aggregation, one would use a range
+aggregation for the sibling to compute the p-value of the distribution difference between
+the metric and the restriction of that metric to a subset of the documents. A natural use
+case is when the sibling aggregation is a range aggregation nested in a terms aggregation, in
+which case one compares the overall distribution of the metric to its restriction to each term.
+
+
+[[bucket-count-ks-test-agg-syntax]]
+==== Parameters
+
+`buckets_path`::
+(Required, string)
+Path to the buckets that contain one set of values to correlate. Must be a `_count` path.
+For syntax, see <<buckets-path-syntax>>.
+
+`alternative`::
+(Optional, list)
+A list of string values indicating which K-S test alternative to calculate.
+The valid values are: `greater`, `less`, and `two_sided`. This parameter is key for
+determining the K-S statistic used when calculating the K-S test. Default value is
+all possible alternative hypotheses.
+
+`fractions`::
+(Optional, list)
+A list of doubles indicating the distribution of the samples against which to compare the
+`buckets_path` results. In typical usage this is the overall proportion of documents in
+each bucket, which is compared with the actual document proportions in each bucket
+from the sibling aggregation counts. The default is to assume that overall documents
+are uniformly distributed on these buckets, which they would be if one used equal
+percentiles of a metric to define the bucket end points.
+
+`sampling_method`::
+(Optional, string)
+Indicates the sampling methodology when calculating the K-S test. Note that this is sampling
+of the returned values. It determines the cumulative distribution function (CDF) points
+used when comparing the two samples. Default is `upper_tail`, which emphasizes the upper
+end of the CDF points. Valid options are: `upper_tail`, `uniform`, and `lower_tail`.
+
+==== Syntax
+
+A `bucket_count_ks_test` aggregation looks like this in isolation:
+
+[source,js]
+--------------------------------------------------
+{
+  "bucket_count_ks_test": {
+    "buckets_path": "range_values>_count", <1>
+    "alternative": ["less", "greater", "two_sided"], <2>
+    "sampling_method": "upper_tail" <3>
+  }
+}
+--------------------------------------------------
+// NOTCONSOLE
+<1> The buckets containing the values to test against.
+<2> The alternatives to calculate.
+<3> The sampling method for the K-S statistic.
+
+
+[[bucket-count-ks-test-agg-example]]
+==== Example
+
+The following snippet runs the `bucket_count_ks_test` on the individual terms in the field `version` against a uniform distribution.
+The uniform distribution reflects the `latency` percentile buckets. Not shown is the pre-calculation of the `latency` indicator values,
+which was done utilizing the
+<<search-aggregations-metrics-percentile-aggregation,percentiles>> aggregation.
+
+This example uses only the deciles of `latency`.
+
+[source,console]
+-------------------------------------------------
+POST correlate_latency/_search?size=0&filter_path=aggregations
+{
+  "aggs": {
+    "buckets": {
+      "terms": { <1>
+        "field": "version",
+        "size": 2
+      },
+      "aggs": {
+        "latency_ranges": {
+          "range": { <2>
+            "field": "latency",
+            "ranges": [
+              { "to": 0 },
+              { "from": 0, "to": 105 },
+              { "from": 105, "to": 225 },
+              { "from": 225, "to": 445 },
+              { "from": 445, "to": 665 },
+              { "from": 665, "to": 885 },
+              { "from": 885, "to": 1115 },
+              { "from": 1115, "to": 1335 },
+              { "from": 1335, "to": 1555 },
+              { "from": 1555, "to": 1775 },
+              { "from": 1775 }
+            ]
+          }
+        },
+        "ks_test": { <3>
+          "bucket_count_ks_test": {
+            "buckets_path": "latency_ranges>_count",
+            "alternative": ["less", "greater", "two_sided"]
+          }
+        }
+      }
+    }
+  }
+}
+-------------------------------------------------
+// TEST[setup:correlate_latency]
+
+<1> The term buckets containing a range aggregation and the bucket count K-S test aggregation. Both are utilized to compare
+    the latency distribution of each term value with the overall latency distribution.
+<2> The range aggregation on the latency field. The ranges were created referencing the percentiles of the latency field.
+<3> The bucket count K-S test aggregation that tests if the bucket counts come from the same distribution as `fractions`,
+    where `fractions` is a uniform distribution.
+
+And the following may be the response:
+
+[source,console-result]
+----
+{
+  "aggregations" : {
+    "buckets" : {
+      "doc_count_error_upper_bound" : 0,
+      "sum_other_doc_count" : 0,
+      "buckets" : [
+        {
+          "key" : "1.0",
+          "doc_count" : 100,
+          "latency_ranges" : {
+            "buckets" : [
+              {
+                "key" : "*-0.0",
+                "to" : 0.0,
+                "doc_count" : 0
+              },
+              {
+                "key" : "0.0-105.0",
+                "from" : 0.0,
+                "to" : 105.0,
+                "doc_count" : 1
+              },
+              {
+                "key" : "105.0-225.0",
+                "from" : 105.0,
+                "to" : 225.0,
+                "doc_count" : 9
+              },
+              {
+                "key" : "225.0-445.0",
+                "from" : 225.0,
+                "to" : 445.0,
+                "doc_count" : 0
+              },
+              {
+                "key" : "445.0-665.0",
+                "from" : 445.0,
+                "to" : 665.0,
+                "doc_count" : 0
+              },
+              {
+                "key" : "665.0-885.0",
+                "from" : 665.0,
+                "to" : 885.0,
+                "doc_count" : 0
+              },
+              {
+                "key" : "885.0-1115.0",
+                "from" : 885.0,
+                "to" : 1115.0,
+                "doc_count" : 10
+              },
+              {
+                "key" : "1115.0-1335.0",
+                "from" : 1115.0,
+                "to" : 1335.0,
+                "doc_count" : 20
+              },
+              {
+                "key" : "1335.0-1555.0",
+                "from" : 1335.0,
+                "to" : 1555.0,
+                "doc_count" : 20
+              },
+              {
+                "key" : "1555.0-1775.0",
+                "from" : 1555.0,
+                "to" : 1775.0,
+                "doc_count" : 20
+              },
+              {
+                "key" : "1775.0-*",
+                "from" : 1775.0,
+                "doc_count" : 20
+              }
+            ]
+          },
+          "ks_test" : {
+            "less" : 2.248673241788478E-4,
+            "greater" : 1.0,
+            "two_sided" : 5.791639181800257E-4
+          }
+        },
+        {
+          "key" : "2.0",
+          "doc_count" : 100,
+          "latency_ranges" : {
+            "buckets" : [
+              {
+                "key" : "*-0.0",
+                "to" : 0.0,
+                "doc_count" : 0
+              },
+              {
+                "key" : "0.0-105.0",
+                "from" : 0.0,
+                "to" : 105.0,
+                "doc_count" : 19
+              },
+              {
+                "key" : "105.0-225.0",
+                "from" : 105.0,
+                "to" : 225.0,
+                "doc_count" : 11
+              },
+              {
+                "key" : "225.0-445.0",
+                "from" : 225.0,
+                "to" : 445.0,
+                "doc_count" : 20
+              },
+              {
+                "key" : "445.0-665.0",
+                "from" : 445.0,
+                "to" : 665.0,
+                "doc_count" : 20
+              },
+              {
+                "key" : "665.0-885.0",
+                "from" : 665.0,
+                "to" : 885.0,
+                "doc_count" : 20
+              },
+              {
+                "key" : "885.0-1115.0",
+                "from" : 885.0,
+                "to" : 1115.0,
+                "doc_count" : 10
+              },
+              {
+                "key" : "1115.0-1335.0",
+                "from" : 1115.0,
+                "to" : 1335.0,
+                "doc_count" : 0
+              },
+              {
+                "key" : "1335.0-1555.0",
+                "from" : 1335.0,
+                "to" : 1555.0,
+                "doc_count" : 0
+              },
+              {
+                "key" : "1555.0-1775.0",
+                "from" : 1555.0,
+                "to" : 1775.0,
+                "doc_count" : 0
+              },
+              {
+                "key" : "1775.0-*",
+                "from" : 1775.0,
+                "doc_count" : 0
+              }
+            ]
+          },
+          "ks_test" : {
+            "less" : 0.9642895789647244,
+            "greater" : 4.58718174664754E-9,
+            "two_sided" : 5.916656831139733E-9
+          }
+        }
+      ]
+    }
+  }
+}
+----
diff --git a/x-pack/plugin/ml/build.gradle b/x-pack/plugin/ml/build.gradle
@@ -67,6 +67,7 @@ dependencies {
   // ml deps
   api project(':libs:elasticsearch-grok')
   api "net.sf.supercsv:super-csv:${versions.supercsv}"
+  api "org.apache.commons:commons-math3:3.6.1"
   nativeBundle("org.elasticsearch.ml:ml-cpp:${project.version}@zip") {
     changing = true
   }

diff --git a/x-pack/plugin/ml/licenses/commons-math3-3.6.1.jar.sha1 b/x-pack/plugin/ml/licenses/commons-math3-3.6.1.jar.sha1
@@ -0,0 +1 @@
+e4ba98f1d4b3c80ec46392f25e094a6a2e58fcbf