[ML] adding new KS test pipeline aggregation #73334
Conversation
Pinging @elastic/ml-core (Team:ML)
@elasticmachine update branch
I don't feel comfortable reviewing the math-heavy part in `BucketCountKSTestAggregator.java`. In case you don't have another reviewer, it would be good to go through this code together so I can understand it in more detail.
experimental::[]
A sibling pipeline aggregation which executes a two sample Kolmogorov–Smirnov test
(referred to as a "K-S test" from now own) against a provided distribution and
Suggested change:
- (referred to as a "K-S test" from now own) against a provided distribution and
+ (referred to as a "K-S test" from now on) against a provided distribution and
"range": { <2> | ||
"field": "latency", | ||
"ranges": [ | ||
{ "to": 0.0 }, |
Could `0.0` be replaced with `0` so that it is an integer, like numbers in other ranges?
<1> The term buckets containing a range aggregation and the bucket correlation aggregation. Both are utilized to calculate
the correlation of the term values with the latency.
<2> The range aggregation on the latency field. The ranges were created referencing the percentiles of the latency field.
<3> The bucket count K-S test aggregation that determines if the count samples are from the same distribution as the uniform
Could this sentence be simplified to: "The bucket count K-S test aggregation that determines if the count samples come from the uniform distribution"?
technically, no. I will think of a better way of phrasing it.
The two-sampled K-S test (which we are doing here) is verifying that two samples come from the same distribution. In this particular case, we are testing if one sample (a uniform fraction sample) and another (these _counts) both come from the same distribution.
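For intuition, the two-sample statistic being discussed here is just the maximum gap between the two empirical CDFs built from the bucket counts. A minimal sketch of that idea (names are illustrative, not the ones used in `BucketCountKSTestAggregator`):

```java
// Hypothetical sketch of the two-sample K-S statistic: the maximum absolute
// difference between two empirical CDFs derived from bucket counts.
public final class KsSketch {

    // xs and ys are per-bucket counts (or fractions) from the two samples.
    public static double twoSidedStatistic(double[] xs, double[] ys) {
        double sumX = 0, sumY = 0;
        for (double v : xs) sumX += v;
        for (double v : ys) sumY += v;
        double cx = 0, cy = 0, d = 0;
        for (int i = 0; i < xs.length; i++) {
            cx += xs[i] / sumX;  // empirical CDF of sample 1 at bucket i
            cy += ys[i] / sumY;  // empirical CDF of sample 2 at bucket i
            d = Math.max(d, Math.abs(cx - cy));
        }
        return d;
    }

    public static void main(String[] args) {
        double[] uniform = { 1, 1, 1, 1, 1 };
        double[] skewed = { 5, 1, 1, 1, 1 };
        // Identical samples give a statistic of 0.0; skew pushes it up.
        System.out.println(twoSidedStatistic(uniform, uniform));
        System.out.println(twoSidedStatistic(uniform, skewed));
    }
}
```

The `less` and `greater` alternatives would take the signed minimum/maximum of `cx - cy` instead of the absolute maximum.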
package org.elasticsearch.xpack.ml.aggs;

public final class DoubleArray {
Could you add a unit test for this class?
private DoubleArray() { }

public static double[] cumulativeSum(double[] xs) {
Could you add a method comment? Especially a sentence that states that `xs` is immutable.
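One possible shape for such a comment, sitting on a straightforward implementation (a sketch only; the actual code in the PR may differ):

```java
// Sketch of how DoubleArray.cumulativeSum could be documented and implemented.
public final class DoubleArray {

    private DoubleArray() { }

    /**
     * Returns a new array whose i-th element is the sum of {@code xs[0..i]}.
     * The input array {@code xs} is never modified.
     */
    public static double[] cumulativeSum(double[] xs) {
        double[] sums = new double[xs.length];
        double running = 0;
        for (int i = 0; i < xs.length; i++) {
            running += xs[i];
            sums[i] = running;
        }
        return sums;
    }
}
```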
    sublistedPath,
    BucketHelpers.GapPolicy.INSERT_ZEROS
);
if (bucketValue != null && Double.isNaN(bucketValue) == false) {
What happens if this `if` statement evaluates to `false` and, in consequence, `values.size()` will be less than the number of buckets?
If I supplied a list of `fractions` to the agg and the length of that list no longer matches the number of buckets, how would that be resolved? There should always be a bucket with a `_count` value; if not, is this exceptional and should the code throw?
100%, I am throwing an execution exception here now. If we want to support something more nuanced later, we can change it.
    throw invalidPathException(path);
}

private InvalidAggregationPathException invalidPathException(List<String> path) { |
This method is only used once. Would it make sense to inline it?
refactoring slightly as the inference agg result also uses this method (or one similar enough)
    new double[] { 4, 6, 2, 3, 3, 2, 1, 1, 1, 1 }
);

final MlBucketsHelper.DoubleBucketValues UPPER_TAILED_VALUES = new MlBucketsHelper.DoubleBucketValues(
Suggested change:
- final MlBucketsHelper.DoubleBucketValues UPPER_TAILED_VALUES = new MlBucketsHelper.DoubleBucketValues(
+ private static final MlBucketsHelper.DoubleBucketValues UPPER_TAILED_VALUES = new MlBucketsHelper.DoubleBucketValues(
    new long[] { 10, 10, 10, 40, 40, 40, 40, 40, 40, 40 },
    new double[] { 10, 10, 10, 40, 40, 40, 40, 40, 40, 40 }
);
final MlBucketsHelper.DoubleBucketValues UPPER_TAILED_VALUES_SPARSE = new MlBucketsHelper.DoubleBucketValues(
Suggested change:
- final MlBucketsHelper.DoubleBucketValues UPPER_TAILED_VALUES_SPARSE = new MlBucketsHelper.DoubleBucketValues(
+ private static final MlBucketsHelper.DoubleBucketValues UPPER_TAILED_VALUES_SPARSE = new MlBucketsHelper.DoubleBucketValues(
    allOf(hasKey(Alternative.GREATER.toString()), hasKey(Alternative.LESS.toString()), hasKey(Alternative.TWO_SIDED.toString()))
);
// Since these two distributions are the "same" (both uniform)
// Assume that the p-value is greater than 0.9
Suggested change:
- // Assume that the p-value is greater than 0.9
+ // assume that the p-value is greater than 0.9
I'm not familiar with the KS test and haven't dived into the maths but everything I understood LGTM 👍
docs/reference/aggregations/pipeline/bucket-count-ks-test-aggregation.asciidoc
<1> The term buckets containing a range aggregation and the bucket correlation aggregation. Both are utilized to calculate
the correlation of the term values with the latency.
<2> The range aggregation on the latency field. The ranges were created referencing the percentiles of the latency field.
It would be nice to have a full example somewhere in docs of showing how the ranges were found
Possibly, I think though that is a larger "how correlations + K-S tests work" type of docs.
Actually this isn't so bad as you explain that the latency ranges were calculated using a percentiles agg
private InvalidAggregationPathException invalidPathException(List<String> path) {
    return new InvalidAggregationPathException(
        "unknown property " + path + " for " + InferencePipelineAggregationBuilder.NAME + " aggregation [" + getName() + "]"
"unknown property " + path + " for " + InferencePipelineAggregationBuilder.NAME + " aggregation [" + getName() + "]" | |
"unknown property " + path + " for " + BucketCountKSTestAggregationBuilder.NAME + " aggregation [" + getName() + "]" |
LGTM
`sampling_method`::
(Optional, string)
Indicates the sampling methodology when calculating the K-S test. Note, this is sampling
of the returned values. This determines the cumulative distribution function (cdf) points
Suggested change:
- of the returned values. This determines the cumulative distribution function (cdf) points
+ of the returned values. This determines the cumulative distribution function (CDF) points

Capitalised `CDF` is used in the next sentence.
if (alternative == null) {
    this.alternative = EnumSet.allOf(Alternative.class);
} else {
    if (alternative.isEmpty()) {
The exact same check happens in `validate()` but it throws a validation error rather than `IllegalArgumentException`. Should the `gapPolicy` check below also be in `validate()` (not the null part, the bit about policy != INSERT_ZEROS)? A quick survey of other agg builder classes does not show a strong convention either way.
I am gonna remove the check from `validate()`. All the fields are `final` and I already validate in the ctor.
@elasticmachine update branch
LGTM
I had some minor suggestions on documentation. Other than that, I think we need a different distribution for the sided test statistics. Let me get back to you with a suggestion for that.
A list of string values indicating which K-S test alternative to calculate.
The valid values are: "greater", "less", "two_sided". This parameter is key for
determining the K-S statistic used when calculating the K-S test. Default value is
all possible alternative hypothesis.
Typo:
- all possible alternative hypothesis.
+ all possible alternative hypotheses.
(referred to as a "K-S test" from now on) against a provided distribution and
the distribution of documents counts in the configured sibling aggregation.
I think it is worth expanding this somewhat. For example, I think it is worth making clear that this is computing the K-S p-value between Y and E[Y | c] where c is the "event" selected by the sibling aggregation. The natural use case is when the sibling aggregation is a terms agg in which case it computes the K-S between Y and the restriction of Y to each term. This also means that one can't use any old values for fractions they have to match the actual proportion of the docs in each range bucket for Y.
I would propose something like the following:
A sibling pipeline aggregation which executes a two sample Kolmogorov–Smirnov test
(referred to as a "K-S test" from now on) against a provided distribution and the
distribution implied by the documents counts in the configured sibling aggregation.
Specifically, for some metric, assuming that the percentile intervals of the metric are
known beforehand or have been computed by an aggregation, then one would use range
aggregation for the sibling to compute the p-value of the distribution difference between
the metric and the restriction of that metric to a subset of the documents. A natural use
case is if the sibling aggregation is a range aggregation nested in a terms aggregation, in
which case one compares the overall distribution of the metric to its restriction to each term.
(I know you have some discussion of this later on, but I think this is crucial to understanding how this is meant to work for someone who knows what the K-S test actually does so worth including upfront.)
A list of doubles indicating the distribution of the samples with which to compare to the
`buckets_path` results. The default is a uniform distribution of the same length as the
`buckets_path` buckets.
We have rather specific requirements on what fractions mean. To produce a meaningful result from this aggregation they should be related to some metric distribution which is then used to create the sibling aggregation. A natural choice is to use equal percentile range queries to construct the sibling aggregation in which case the default is correct. I think it is worth capturing something along these lines.
I'd propose something like:
A list of doubles indicating the distribution of the samples with which to compare to the
`buckets_path` results. In typical usage this is the overall proportion of documents in
each bucket, which is compared with the actual document proportions in each bucket
from the sibling aggregation counts. The default is to assume that overall documents
are uniformly distributed on these buckets, which they would be if one used equal
percentiles of a metric to define the bucket end points.
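To make the relationship concrete, here is a hypothetical helper (not part of the PR) that derives `fractions` from the overall per-bucket document counts, so the values match the actual proportions rather than assuming a uniform split:

```java
// Illustrative only: compute `fractions` from overall bucket doc counts so
// they reflect the true proportion of documents in each range bucket.
public final class FractionsSketch {

    public static double[] fractions(long[] bucketDocCounts) {
        long total = 0;
        for (long c : bucketDocCounts) total += c;
        double[] fractions = new double[bucketDocCounts.length];
        for (int i = 0; i < bucketDocCounts.length; i++) {
            fractions[i] = (double) bucketDocCounts[i] / total;
        }
        return fractions;
    }
}
```

With equal-percentile bucket end points every count is (approximately) the same, so this reduces to the uniform default described above.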
which was done utilizing the
<<search-aggregations-metrics-percentile-aggregation,percentiles>> aggregation.

This example is only using the 10s percentiles.
Suggested change:
- This example is only using the 10s percentiles.
+ This example is only using the deciles of `latency`.
"greater" : 4.58718174664754E-9, | ||
"two_sided" : 4.58718174664754E-9 |
I would expect `greater` to be very nearly `two_sided` / 2 in this case.
Yeah, the `two_sided` isn't accurate. I think Hodges' approximation only works in the sided case. We will probably have to do something else.
// And c(α)=√(−ln(α * 0.5) * 0.5)
// Where Dₙ,ₘ is our statistic
// Then solve for α
// But I am not 100% sure which to choose in this case.
This distribution is for the usual K-S test statistic not the sided versions. For small p-values the sided versions are very nearly half the two-sided one. I'll dig around for a similar asymptotic approximation for these alternatives.
Ah, yeah, Hodges' approximation is for the sided statistic. So, we will have to have something else for `two_sided`.
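For reference, the standard large-sample approximation here comes from Smirnov's limiting distribution: keeping only its first term, the two-sided two-sample p-value is roughly 2·exp(−2·D²·nm/(n+m)), and for small p each one-sided value is roughly half of that. A sketch of that relationship (this is not the code in the PR, just the textbook asymptotic):

```java
// First-term asymptotic approximations to the two-sample K-S p-values.
// D is the statistic, n and m are the two sample sizes.
public final class KsPValueSketch {

    /** Two-sided p-value: first term of Smirnov's limiting series, clamped to 1. */
    public static double twoSidedPValue(double d, long n, long m) {
        double en = (double) n * m / (n + m);
        return Math.min(1.0, 2.0 * Math.exp(-2.0 * d * d * en));
    }

    /** One-sided p-value: exp(-2 d^2 nm/(n+m)), about half the two-sided value when small. */
    public static double oneSidedPValue(double d, long n, long m) {
        double en = (double) n * m / (n + m);
        return Math.min(1.0, Math.exp(-2.0 * d * d * en));
    }
}
```

This is consistent with the review comment above: for small p-values, `greater` (or `less`) should come out very close to `two_sided` / 2.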
The dictionary comes from Morfologik project. Morfologik uses data from
Polish ispell/myspell dictionary hosted at http://www.sjp.pl/slownik/en/ and
is licenced on the terms of (inter alia) LGPL and Creative Commons
ShareAlike. The part-of-speech tags were added in Morfologik project and
are not found in the data from sjp.pl. The tagset is similar to IPI PAN
tagset.
I don't think this file contains the correct license for commons-math3. If I download https://mirrors.gethosted.online/apache/commons/math/binaries/commons-math3-3.6.1-bin.tar.gz then the license file in it is different to this one.
Ah, let me look into this.
LGTM. Good stuff!
@elasticmachine update branch
run elasticsearch-ci/part-2
@elasticmachine update branch
This adds a new pipeline aggregation for calculating a Kolmogorov–Smirnov test for a given sample and buckets path. For now, the buckets path resolution needs to be `_count`. But, this may be relaxed in the future. It accepts a parameter `fractions` that indicates the distribution of documents from some other pre-calculated sample. This particular version of the K-S test is two-sample, meaning, it calculates if the `fractions` and the distribution of `_count` values in the buckets_path are taken from the same distribution. This in combination with the hypothesis alternatives (`less`, `greater`, `two_sided`) and sampling logic (`upper_tail`, `lower_tail`, `uniform`) allows for flexibility and usefulness when comparing two samples and determining the likelihood of them being from the same overall distribution.

Usage:

```
POST correlate_latency/_search?size=0&filter_path=aggregations
{
  "aggs": {
    "buckets": {
      "terms": { <1>
        "field": "version",
        "size": 2
      },
      "aggs": {
        "latency_ranges": {
          "range": { <2>
            "field": "latency",
            "ranges": [
              { "to": 0.0 },
              { "from": 0, "to": 105 },
              { "from": 105, "to": 225 },
              { "from": 225, "to": 445 },
              { "from": 445, "to": 665 },
              { "from": 665, "to": 885 },
              { "from": 885, "to": 1115 },
              { "from": 1115, "to": 1335 },
              { "from": 1335, "to": 1555 },
              { "from": 1555, "to": 1775 },
              { "from": 1775 }
            ]
          }
        },
        "ks_test": { <3>
          "bucket_count_ks_test": {
            "buckets_path": "latency_ranges>_count",
            "alternative": ["less", "greater", "two_sided"]
          }
        }
      }
    }
  }
}
```

Co-authored-by: Elastic Machine <[email protected]>
NOTE: one might notice that the `two_sided` alternative is always equal to either `less` or `greater`. This is because `two_sided` is simply the `max(less, greater)`.