[ML] adding new KS test pipeline aggregation #73334

Merged
merged 16 commits into from
Jun 4, 2021
@@ -9,11 +9,15 @@
experimental::[]

A sibling pipeline aggregation which executes a two sample Kolmogorov–Smirnov test
(referred to as a "K-S test" from now on) against a provided distribution and
the distribution of documents counts in the configured sibling aggregation.
(referred to as a "K-S test" from now on) against a provided distribution, and the
distribution implied by the document counts in the configured sibling aggregation.
Specifically, for some metric, assuming that the percentile intervals of the metric are
known beforehand or have been computed by an aggregation, then one would use a range
aggregation for the sibling to compute the p-value of the distribution difference between
the metric and the restriction of that metric to a subset of the documents. A natural use
case is when the sibling aggregation is a range aggregation nested in a terms aggregation, in
which case one compares the overall distribution of the metric to its restriction to each term.

This test is useful to determine if two samples (represented by `fractions` and `buckets_path`) are
drawn from the same distribution.
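
As a rough illustration of the comparison described above, the following Python sketch computes K-S statistics directly from per-bucket document counts and a reference list of fractions. It is not the aggregation's implementation: the function names `cdf_from_weights` and `ks_statistics` are hypothetical, the sketch returns raw statistics rather than the p-values the aggregation reports, and it ignores `sampling_method` entirely. It only shows how a uniform default for `fractions` can be compared against observed bucket proportions.

```python
from itertools import accumulate


def cdf_from_weights(weights):
    """Turn non-negative per-bucket weights into a cumulative distribution."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return [running / total for running in accumulate(weights)]


def ks_statistics(doc_counts, fractions=None):
    """Return (d_plus, d_minus, d) for bucket counts versus reference fractions.

    d_plus  = largest amount by which the reference CDF exceeds the observed CDF
    d_minus = largest amount by which the observed CDF exceeds the reference CDF
    d       = max(d_plus, d_minus), the two-sided statistic
    """
    if fractions is None:
        # Assumption mirroring the documented default: uniform over the buckets.
        fractions = [1.0] * len(doc_counts)
    if len(fractions) != len(doc_counts):
        raise ValueError("fractions and doc_counts must have the same length")
    observed = cdf_from_weights(doc_counts)
    reference = cdf_from_weights(fractions)
    diffs = [r - o for r, o in zip(reference, observed)]
    d_plus = max(0.0, max(diffs))
    d_minus = max(0.0, -min(diffs))
    return d_plus, d_minus, max(d_plus, d_minus)


# Hypothetical bucket counts skewed toward the upper buckets.
d_plus, d_minus, d = ks_statistics([10, 20, 30, 40])
print(d_plus, d_minus, d)
```

Which one-sided statistic corresponds to the `less` or `greater` alternative depends on the convention of the underlying test, so the sketch deliberately leaves that mapping unnamed.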

[[bucket-count-ks-test-agg-syntax]]
==== Parameters
@@ -28,13 +32,16 @@ For syntax, see <<buckets-path-syntax>>.
A list of string values indicating which K-S test alternative to calculate.
The valid values are: "greater", "less", "two_sided". This parameter is key for
determining the K-S statistic used when calculating the K-S test. Default value is
all possible alternative hypothesis.
all possible alternative hypotheses.

`fractions`::
(Optional, list)
A list of doubles indicating the distribution of the samples with which to compare to the
`buckets_path` results. The default is a uniform distribution of the same length as the
`buckets_path` buckets.
`buckets_path` results. In typical usage this is the overall proportion of documents in
each bucket, which is compared with the actual document proportions in each bucket
from the sibling aggregation counts. The default is to assume that overall documents
are uniformly distributed on these buckets, which they would be if one used equal
percentiles of a metric to define the bucket end points.

`sampling_method`::
(Optional, string)
@@ -71,7 +78,7 @@ The uniform distribution reflects the `latency` percentile buckets. Not shown is
which was done utilizing the
<<search-aggregations-metrics-percentile-aggregation,percentiles>> aggregation.

This example is only using the 10s percentiles.
This example is only using the deciles of `latency`.

[source,console]
-------------------------------------------------
@@ -205,7 +212,7 @@ And the following may be the response:
"ks_test" : {
"less" : 2.248673241788478E-4,
"greater" : 1.0,
"two_sided" : 2.248673241788478E-4
"two_sided" : 5.791639181800257E-4
}
},
{
@@ -282,7 +289,7 @@ And the following may be the response:
"ks_test" : {
"less" : 0.9642895789647244,
"greater" : 4.58718174664754E-9,
"two_sided" : 4.58718174664754E-9
"two_sided" : 5.916656831139733E-9
}
}
]
1 change: 1 addition & 0 deletions x-pack/plugin/ml/build.gradle
@@ -67,6 +67,7 @@ dependencies {
// ml deps
api project(':libs:elasticsearch-grok')
api "net.sf.supercsv:super-csv:${versions.supercsv}"
api "org.apache.commons:commons-math3:3.6.1"
nativeBundle("org.elasticsearch.ml:ml-cpp:${project.version}@zip") {
changing = true
}
1 change: 1 addition & 0 deletions x-pack/plugin/ml/licenses/commons-math3-3.6.1.jar.sha1
@@ -0,0 +1 @@
e4ba98f1d4b3c80ec46392f25e094a6a2e58fcbf