[ML] adding new KS test pipeline aggregation #73334
Original file line number | Diff line number | Diff line change | ||||
---|---|---|---|---|---|---|
@@ -0,0 +1,291 @@ | ||||||
[role="xpack"] | ||||||
[testenv="basic"] | ||||||
[[search-aggregations-bucket-count-ks-test-aggregation]] | ||||||
=== Bucket count K-S test correlation aggregation | ||||||
++++ | ||||||
<titleabbrev>Bucket count K-S test aggregation</titleabbrev> | ||||||
++++ | ||||||
|
||||||
experimental::[] | ||||||
|
||||||
A sibling pipeline aggregation which executes a two-sample Kolmogorov–Smirnov test | ||||||
(referred to as a "K-S test" from now on) against a provided distribution and | ||||||
the distribution of document counts in the configured sibling aggregation. | ||||||
|
||||||
This test is useful to determine if two samples (represented by `fractions` and `buckets_path`) are | ||||||
drawn from the same distribution. | ||||||
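As a rough illustration of what comparing two bucketed samples involves, the sketch below computes a two-sided K-S statistic (the largest gap between the two cumulative distributions) from a list of bucket counts and a list of fractions. It is a minimal sketch under stated assumptions, not the aggregation's implementation: the class and method names are hypothetical, it ignores `sampling_method` and the one-sided alternatives, and it does not produce the values the aggregation returns.

```java
import java.util.Arrays;

/** Illustrative only: hypothetical helper names, not the Elasticsearch implementation. */
final class KsStatisticSketch {

    /** Returns n equal fractions that sum to 1, i.e. a discrete uniform distribution. */
    static double[] uniformFractions(int n) {
        double[] fractions = new double[n];
        Arrays.fill(fractions, 1.0 / n);
        return fractions;
    }

    /**
     * Returns the largest absolute gap between the cumulative distribution of the
     * normalized bucket counts and the cumulative distribution of the fractions.
     */
    static double ksStatistic(long[] counts, double[] fractions) {
        if (counts.length != fractions.length) {
            throw new IllegalArgumentException("counts and fractions must have the same length");
        }
        long total = Arrays.stream(counts).sum();
        if (total == 0) {
            throw new IllegalArgumentException("counts must not all be zero");
        }
        double cdfCounts = 0.0;
        double cdfFractions = 0.0;
        double maxGap = 0.0;
        for (int i = 0; i < counts.length; i++) {
            // Advance both cumulative distributions by one bucket.
            cdfCounts += (double) counts[i] / total;
            cdfFractions += fractions[i];
            maxGap = Math.max(maxGap, Math.abs(cdfCounts - cdfFractions));
        }
        return maxGap;
    }
}
```

A statistic close to 0 means the bucket counts follow the supplied fractions closely. A value close to 1 means the counts are concentrated at one end of the buckets relative to the fractions.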
|
||||||
[[bucket-count-ks-test-agg-syntax]] | ||||||
==== Parameters | ||||||
|
||||||
`buckets_path`:: | ||||||
(Required, string) | ||||||
Path to the buckets that contain one set of values to correlate. Must be a `_count` path. | ||||||
For syntax, see <<buckets-path-syntax>>. | ||||||
|
||||||
`alternative`:: | ||||||
(Required, list) | ||||||
A list of string values indicating which K-S test alternative to calculate. | ||||||
The valid values are: "greater", "less", "two_sided". This parameter is key for | ||||||
determining the K-S statistic used when calculating the K-S test. | ||||||
|
||||||
`fractions`:: | ||||||
(Optional, list) | ||||||
A list of doubles indicating the distribution of the samples against which to compare the | ||||||
`buckets_path` results. The default is a uniform distribution of the same length as the | ||||||
`buckets_path` buckets. | ||||||
Review comment: We have rather specific requirements on what `fractions` mean. To produce a meaningful result from this aggregation, they should be related to some metric distribution which is then used to create the sibling aggregation. A natural choice is to use equal percentile range queries to construct the sibling aggregation, in which case the default is correct. I think it is worth capturing something along these lines.
Review comment: I'd propose something like:
|
||||||
|
||||||
`sampling_method`:: | ||||||
(Optional, string) | ||||||
Indicates the sampling methodology when calculating the K-S test. Note, this is sampling | ||||||
of the returned values. This determines the cumulative distribution function (CDF) points | ||||||
used when comparing the two samples. Default is `upper_tail`, which emphasizes the upper | ||||||
end of the CDF points. Valid options are: `upper_tail`, `uniform`, and `lower_tail`. | ||||||
|
||||||
==== Syntax | ||||||
|
||||||
A `bucket_count_ks_test` aggregation looks like this in isolation: | ||||||
|
||||||
[source,js] | ||||||
-------------------------------------------------- | ||||||
{ | ||||||
"bucket_count_ks_test": { | ||||||
"buckets_path": "range_values>_count", <1> | ||||||
"alternative": ["less", "greater", "two_sided"], <2> | ||||||
"sampling_method": "upper_tail" <3> | ||||||
} | ||||||
} | ||||||
-------------------------------------------------- | ||||||
// NOTCONSOLE | ||||||
<1> The buckets containing the values to test against. | ||||||
|
||||||
<2> The alternatives to calculate. | ||||||
<3> The sampling method for the K-S statistic. | ||||||
|
||||||
|
||||||
[[bucket-count-ks-test-agg-example]] | ||||||
==== Example | ||||||
|
||||||
The following snippet runs the `bucket_count_ks_test` on the individual terms in the field `version` against a uniform distribution. | ||||||
The uniform distribution reflects the `latency` percentile buckets. Not shown is the pre-calculation of the `latency` indicator values, | ||||||
which was done utilizing the | ||||||
<<search-aggregations-metrics-percentile-aggregation,percentiles>> aggregation. | ||||||
|
||||||
This example uses only every 10th percentile. | ||||||
|
||||||
|
||||||
[source,console] | ||||||
------------------------------------------------- | ||||||
POST correlate_latency/_search?size=0&filter_path=aggregations | ||||||
{ | ||||||
"aggs": { | ||||||
"buckets": { | ||||||
"terms": { <1> | ||||||
"field": "version", | ||||||
"size": 2 | ||||||
}, | ||||||
"aggs": { | ||||||
"latency_ranges": { | ||||||
"range": { <2> | ||||||
"field": "latency", | ||||||
"ranges": [ | ||||||
{ "to": 0.0 }, | ||||||
{ "from": 0, "to": 105 }, | ||||||
{ "from": 105, "to": 225 }, | ||||||
{ "from": 225, "to": 445 }, | ||||||
{ "from": 445, "to": 665 }, | ||||||
{ "from": 665, "to": 885 }, | ||||||
{ "from": 885, "to": 1115 }, | ||||||
{ "from": 1115, "to": 1335 }, | ||||||
{ "from": 1335, "to": 1555 }, | ||||||
{ "from": 1555, "to": 1775 }, | ||||||
{ "from": 1775 } | ||||||
] | ||||||
} | ||||||
}, | ||||||
"ks_test": { <3> | ||||||
"bucket_count_ks_test": { | ||||||
"buckets_path": "latency_ranges>_count", | ||||||
"alternative": ["less", "greater", "two_sided"] | ||||||
} | ||||||
} | ||||||
} | ||||||
} | ||||||
} | ||||||
} | ||||||
------------------------------------------------- | ||||||
// TEST[setup:correlate_latency] | ||||||
|
||||||
<1> The term buckets containing a range aggregation and the bucket count K-S test aggregation. Both are used to compare | ||||||
the latency distribution of each term value against the uniform distribution. | ||||||
<2> The range aggregation on the latency field. The ranges were created referencing the percentiles of the latency field. | ||||||
Review comment: It would be nice to have a full example somewhere in the docs showing how the ranges were found.
Review comment: Possibly, though I think that is a larger "how correlations + K-S tests work" type of doc.
Review comment: Actually this isn't so bad, as you explain that the latency ranges were calculated using a percentiles agg.
<3> The bucket count K-S test aggregation that determines if the count samples are from the same distribution as the uniform | ||||||
distribution. | ||||||
Review comment: Could this sentence be simplified to:
Review comment: Technically, no. I will think of a better way of phrasing it. The two-sample K-S test (which we are doing here) verifies that two samples come from the same distribution. In this particular case, we are testing whether one sample (a uniform fraction sample) and another (these `_count` values) both come from the same distribution.
|
||||||
And the following may be the response: | ||||||
|
||||||
[source,console-result] | ||||||
---- | ||||||
{ | ||||||
"aggregations" : { | ||||||
"buckets" : { | ||||||
"doc_count_error_upper_bound" : 0, | ||||||
"sum_other_doc_count" : 0, | ||||||
"buckets" : [ | ||||||
{ | ||||||
"key" : "1.0", | ||||||
"doc_count" : 100, | ||||||
"latency_ranges" : { | ||||||
"buckets" : [ | ||||||
{ | ||||||
"key" : "*-0.0", | ||||||
"to" : 0.0, | ||||||
"doc_count" : 0 | ||||||
}, | ||||||
{ | ||||||
"key" : "0.0-105.0", | ||||||
"from" : 0.0, | ||||||
"to" : 105.0, | ||||||
"doc_count" : 1 | ||||||
}, | ||||||
{ | ||||||
"key" : "105.0-225.0", | ||||||
"from" : 105.0, | ||||||
"to" : 225.0, | ||||||
"doc_count" : 9 | ||||||
}, | ||||||
{ | ||||||
"key" : "225.0-445.0", | ||||||
"from" : 225.0, | ||||||
"to" : 445.0, | ||||||
"doc_count" : 0 | ||||||
}, | ||||||
{ | ||||||
"key" : "445.0-665.0", | ||||||
"from" : 445.0, | ||||||
"to" : 665.0, | ||||||
"doc_count" : 0 | ||||||
}, | ||||||
{ | ||||||
"key" : "665.0-885.0", | ||||||
"from" : 665.0, | ||||||
"to" : 885.0, | ||||||
"doc_count" : 0 | ||||||
}, | ||||||
{ | ||||||
"key" : "885.0-1115.0", | ||||||
"from" : 885.0, | ||||||
"to" : 1115.0, | ||||||
"doc_count" : 10 | ||||||
}, | ||||||
{ | ||||||
"key" : "1115.0-1335.0", | ||||||
"from" : 1115.0, | ||||||
"to" : 1335.0, | ||||||
"doc_count" : 20 | ||||||
}, | ||||||
{ | ||||||
"key" : "1335.0-1555.0", | ||||||
"from" : 1335.0, | ||||||
"to" : 1555.0, | ||||||
"doc_count" : 20 | ||||||
}, | ||||||
{ | ||||||
"key" : "1555.0-1775.0", | ||||||
"from" : 1555.0, | ||||||
"to" : 1775.0, | ||||||
"doc_count" : 20 | ||||||
}, | ||||||
{ | ||||||
"key" : "1775.0-*", | ||||||
"from" : 1775.0, | ||||||
"doc_count" : 20 | ||||||
} | ||||||
] | ||||||
}, | ||||||
"ks_test" : { | ||||||
"less" : 2.248673241788478E-4, | ||||||
"greater" : 1.0, | ||||||
"two_sided" : 2.248673241788478E-4 | ||||||
} | ||||||
}, | ||||||
{ | ||||||
"key" : "2.0", | ||||||
"doc_count" : 100, | ||||||
"latency_ranges" : { | ||||||
"buckets" : [ | ||||||
{ | ||||||
"key" : "*-0.0", | ||||||
"to" : 0.0, | ||||||
"doc_count" : 0 | ||||||
}, | ||||||
{ | ||||||
"key" : "0.0-105.0", | ||||||
"from" : 0.0, | ||||||
"to" : 105.0, | ||||||
"doc_count" : 19 | ||||||
}, | ||||||
{ | ||||||
"key" : "105.0-225.0", | ||||||
"from" : 105.0, | ||||||
"to" : 225.0, | ||||||
"doc_count" : 11 | ||||||
}, | ||||||
{ | ||||||
"key" : "225.0-445.0", | ||||||
"from" : 225.0, | ||||||
"to" : 445.0, | ||||||
"doc_count" : 20 | ||||||
}, | ||||||
{ | ||||||
"key" : "445.0-665.0", | ||||||
"from" : 445.0, | ||||||
"to" : 665.0, | ||||||
"doc_count" : 20 | ||||||
}, | ||||||
{ | ||||||
"key" : "665.0-885.0", | ||||||
"from" : 665.0, | ||||||
"to" : 885.0, | ||||||
"doc_count" : 20 | ||||||
}, | ||||||
{ | ||||||
"key" : "885.0-1115.0", | ||||||
"from" : 885.0, | ||||||
"to" : 1115.0, | ||||||
"doc_count" : 10 | ||||||
}, | ||||||
{ | ||||||
"key" : "1115.0-1335.0", | ||||||
"from" : 1115.0, | ||||||
"to" : 1335.0, | ||||||
"doc_count" : 0 | ||||||
}, | ||||||
{ | ||||||
"key" : "1335.0-1555.0", | ||||||
"from" : 1335.0, | ||||||
"to" : 1555.0, | ||||||
"doc_count" : 0 | ||||||
}, | ||||||
{ | ||||||
"key" : "1555.0-1775.0", | ||||||
"from" : 1555.0, | ||||||
"to" : 1775.0, | ||||||
"doc_count" : 0 | ||||||
}, | ||||||
{ | ||||||
"key" : "1775.0-*", | ||||||
"from" : 1775.0, | ||||||
"doc_count" : 0 | ||||||
} | ||||||
] | ||||||
}, | ||||||
"ks_test" : { | ||||||
"less" : 0.9642895789647244, | ||||||
"greater" : 4.58718174664754E-9, | ||||||
"two_sided" : 4.58718174664754E-9 | ||||||
} | ||||||
} | ||||||
] | ||||||
} | ||||||
} | ||||||
} | ||||||
---- |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -97,29 +97,33 @@ public GapPolicy gapPolicy() { | |
@Override | ||
protected abstract PipelineAggregator createInternal(Map<String, Object> metadata); | ||
|
||
@Override | ||
protected void validate(ValidationContext context) { | ||
if (bucketsPaths.length != 1) { | ||
context.addBucketPathValidationError("must contain a single entry for aggregation [" + name + "]"); | ||
return; | ||
} | ||
protected void validateBucketPath(ValidationContext context, String bucketsPath) { | ||
// Need to find the first agg name in the buckets path to check its a | ||
// multi bucket agg: aggs are split with '>' and can optionally have a | ||
// metric name after them by using '.' so need to split on both to get | ||
// just the agg name | ||
final String firstAgg = bucketsPaths[0].split("[>\\.]")[0]; | ||
final String firstAgg = bucketsPath.split("[>\\.]")[0]; | ||
Optional<AggregationBuilder> aggBuilder = context.getSiblingAggregations().stream() | ||
.filter(builder -> builder.getName().equals(firstAgg)) | ||
.findAny(); | ||
.filter(builder -> builder.getName().equals(firstAgg)) | ||
.findAny(); | ||
if (aggBuilder.isEmpty()) { | ||
context.addBucketPathValidationError("aggregation does not exist for aggregation [" + name + "]: " + bucketsPaths[0]); | ||
context.addBucketPathValidationError("aggregation does not exist for aggregation [" + name + "]: " + bucketsPath); | ||
return; | ||
} | ||
if (aggBuilder.get().bucketCardinality() != AggregationBuilder.BucketCardinality.MANY) { | ||
context.addValidationError("The first aggregation in " + PipelineAggregator.Parser.BUCKETS_PATH.getPreferredName() | ||
+ " must be a multi-bucket aggregation for aggregation [" + name + "] found :" | ||
+ aggBuilder.get().getClass().getName() + " for buckets path: " + bucketsPaths[0]); | ||
+ " must be a multi-bucket aggregation for aggregation [" + name + "] found :" | ||
+ aggBuilder.get().getClass().getName() + " for buckets path: " + bucketsPath); | ||
} | ||
} | ||
|
||
@Override | ||
protected void validate(ValidationContext context) { | ||
if (bucketsPaths.length != 1) { | ||
context.addBucketPathValidationError("must contain a single entry for aggregation [" + name + "]"); | ||
return; | ||
} | ||
validateBucketPath(context, bucketsPaths[0]); | ||
} | ||
|
||
@Override | ||
|
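The code comment in `validateBucketPath` explains that the first aggregation name is found by splitting the buckets path on both `>` and `.`. The standalone sketch below shows that one step in isolation, using the same regular expression as the diff. The class and method names are hypothetical and exist only for illustration.

```java
/** Illustrative only: isolates the split used in the diff above; names are hypothetical. */
final class BucketsPathSketch {

    /**
     * Returns the leading aggregation name of a buckets path.
     * Aggregations are separated by '>' and a metric name may follow a '.',
     * so the path is split on either character and the first piece is kept.
     */
    static String firstAggName(String bucketsPath) {
        return bucketsPath.split("[>\\.]")[0];
    }
}
```

For the docs example in this pull request, `firstAggName("latency_ranges>_count")` yields `latency_ranges`, which is the sibling aggregation the validation then looks up and checks for multi-bucket cardinality.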