Conversation

rishabhmaurya

Description of changes:
BucketSelectorExt is an extension of the BucketSelector pipeline aggregation. Some of the limitations of BucketSelector -

  1. Only one bucket selector can be applied to a parent multi-bucket aggregation. Since BucketSelector retains the selected buckets and discards the others, allowing multiple BucketSelectors on one aggregation isn't ideal behavior.
  2. Key filters are not supported. Script variables in BucketSelector pertain only to numeric values, and keys cannot be used as script variables when writing a selection expression.
  3. Doesn't work with composite aggregations: no pipeline aggregation works with composite aggregations, since most pipeline aggregations operate on the entirety of the results, whereas composite aggregation results can be paginated. Refer: Pipeline metrics aggregations do not recognize composite aggregations as multi-bucket elastic/elasticsearch#32692

With BucketSelectorExt we are trying to address the above limitations.

  • For 1, each BucketSelectorExt will have its own output section displaying the indices of the selected buckets from the parent multi-bucket aggregation instead of the actual buckets. Also, the parent aggregation will contain all buckets, and BucketSelectorExt will have no impact on its result.
  • For 2, there is an optional filter field. Here one can pass an include/exclude filter which works along the lines of the terms aggregation filtering supported by elasticsearch.
  • For 3, key filters support passing a filter for each source of the composite aggregation. For this, we have introduced a new key-value object where the key is the name of the source and the value is the key filter for the corresponding source in the composite aggregation. Refer to the examples below.

Parameters:
parent_bucket_path - this is used to navigate to the parent multi-bucket aggregation on which the selector has to be applied. It supports nested aggregations but must comply with the constraint below -
agg1>agg2>agg3 - where agg1 and agg2 are single-bucket aggs, whereas agg3, i.e. the last aggregation in the hierarchy, must be a multi-bucket aggregation to which the bucket selector applies (see the nested-path sketch after this parameter list).
buckets_path - same as the existing BucketSelector buckets_path

script - same as the existing BucketSelector script

filter - key filter condition. Keys are filtered first and then the bucket selector script is executed on the filtered keys.
It contains an include/exclude filter which works along the lines of the terms aggregation filtering supported by elasticsearch (an exclude sketch follows the usage examples below).

composite_agg_filter - key filter condition for composite aggregations. Refer to the example below for usage.
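
For illustration only (a sketch, not taken from this PR), a nested parent_bucket_path could look like the following, where agg1 is a hypothetical single-bucket filter aggregation and agg2 is the multi-bucket terms aggregation addressed as agg1>agg2 -

"aggs": {
   "agg1": {
      "filter": { "term": { "<fieldname>": "<value>" } },
      "aggs": {
         "agg2": {
            "terms": {"field": "<fieldname>"},
            "aggs": {
               "<metric_agg_name>": { "stats": { "field": "<fieldname>" } }
            }
         }
      }
   },
   "<bucket_selector_name>": {
      "bucket_selector_ext": {
         "buckets_path": {
           "metric_value": "<metric_agg_name>.<metric_name>"
         },
         "script": {
           "source": "params.metric_value >= 10.0"
         },
         "parent_bucket_path": "agg1>agg2"
      }
   }
}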

Some usage examples -

  • Simple case -
"aggs": {
   "<parent_agg_name>": {
      "<child_multi_bucket_agg_name>": { "terms": {"field": "<fieldname>"}}},
      "aggs": {
         "<metric_agg_name>": { "stats": { "field": "<fieldname>" } }
      }
   },
   "<bucket_selector_name>": {
      "bucket_selector_ext": {
         "buckets_path": {
           "metric_value": "<metric_agg_name>.<metric_name>"
         },
         "script": {
           "source": "params.metric_value >= 10.0"
         },
         "parent_bucket_path": "<parent_agg_name>"
      }
   }
}
  • Multiple bucket selectors -
"aggs": {
   "<parent_agg_name>": {
      "<child_multi_bucket_agg_name>": { "terms": {"field": "<fieldname>"}}},
      "aggs": {
         "<metric_agg_name>": { "stats": { "field": "<fieldname>" } }
      }
   },
   "<bucket_selector_name_1>": {
      "bucket_selector_ext": {
         "buckets_path": {
           "metric_value": "<metric_agg_name>.<metric_name>"
         },
         "script": {
           "source": "params.metric_value >= 10.0"
         },
         "parent_bucket_path": "<parent_agg_name>"
      }
   },
   "<bucket_selector_name_2>": {
      "bucket_selector_ext": {
         "buckets_path": {
           "metric_value": "<metric_agg_name>.<metric_name>"
         },
         "script": {
           "source": "params.metric_value >= 10.0"
         },
         "parent_bucket_path": "<parent_agg_name>"
      }
   }
}
  • Key filters -
"aggs": {
   "<parent_agg_name>": {
      "<child_multi_bucket_agg_name>": { "terms": {"field": "<fieldname>"}}},
      "aggs": {
         "<metric_agg_name>": { "stats": { "field": "<fieldname>" } }
      }
   },
   "<bucket_selector_name>": {
      "bucket_selector_ext": {
         "buckets_path": {
           "metric_value": "<metric_agg_name>.<metric_name>"
         },
         "script": {
           "source": "params.metric_value >= 10.0"
         },
         "parent_bucket_path": "<parent_agg_name>",
         "filter": {
           "include": ["key1", "key2"]
         }
      }
   }
}

For regex, refer to the Lucene regular expression syntax -

"aggs": {
   "<parent_agg_name>": {
      "<child_multi_bucket_agg_name>": { "terms": {"field": "<fieldname>"}}},
      "aggs": {
         "<metric_agg_name>": { "stats": { "field": "<fieldname>" } }
      }
   },
   "<bucket_selector_name_1>": {
      "bucket_selector_ext": {
         "buckets_path": {
           "metric_value": "<metric_agg_name>.<metric_name>"
         },
         "script": {
           "source": "params.metric_value >= 10.0"
         },
         "parent_bucket_path": "<parent_agg_name>",
         "filter": {
           "include": "key_prefix*"
         }
      }
   }
}
  • Composite aggregation -
"aggs": {
   "<parent_agg_name>": {
      "composite": {
         "sources": [
            {"<source_1>": { "terms": {"field": "<field_1>"}}},
            {"<source_2>": { "terms": {"field": "<field_2>" }}}  
         ]
      },      
      "aggs": {
         "<metric_agg_name>": { "stats": { "field": "<fieldname>" } }
      }
   },
   "<bucket_selector_name_1>": {
      "bucket_selector_ext": {
         "buckets_path": {
           "metric_value": "<metric_agg_name>.<metric_name>"
         },
         "script": {
           "source": "params.metric_value >= 10.0"
         },
         "parent_bucket_path": "<parent_agg_name>",
         "composite_agg_filter": {
            "<source_1>" : {
               "include": ["<include_key_1>"]
            },
            "<source_2>" : {
               "include": "@"
            }
         }
      }
   }
}
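
The key filter also accepts an exclude list (a sketch assuming the same semantics as the terms aggregation exclude filtering mentioned above; only the filter object changes relative to the key filter example) -

"filter": {
  "exclude": ["key1", "key2"]
}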

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
have the right to submit it under the open source license
indicated in the file; or

(b) The contribution is based upon previous work that, to the best
of my knowledge, is covered under an appropriate open source
license and I have the right under that license to submit that
work with modifications, whether created in whole or in part
by me, under the same open source license (unless I am
permitted to submit under a different license), as indicated
in the file; or

(c) The contribution was provided directly to me by some other
person who certified (a), (b) or (c) and I have not modified
it.

(d) I understand and agree that this project and the contribution
are public and that a record of the contribution (including all
personal information I submit with it, including my sign-off) is
maintained indefinitely and may be redistributed consistent with
this project or the open source license(s) involved.

Contributor

@qreshi left a comment

I left some comments but the main logic looks good to me. You can merge this in and address the comments in the follow-up PRs if you prefer so you're not blocked on sending those out.

Comment on lines 56 to 60
val mapSize: Int = sin.readVInt()
bucketsPathsMap = java.util.HashMap(mapSize)
for (i in 0 until mapSize) {
    bucketsPathsMap[sin.readString()] = sin.readString()
}

This can alternatively be replaced with bucketsPathsMap = sin.readMap() as Map<String, String>

Comment on lines 73 to 77
out.writeVInt(bucketsPathsMap.size)
for ((key, value) in bucketsPathsMap) {
    out.writeString(key)
    out.writeString(value)
}

Similarly, this can be replaced with out.writeMap(bucketsPathsMap as Map<String, String>)
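
For reference, a minimal sketch of the suggested readMap()/writeMap() round trip (the helper function names here are illustrative, not part of the PR):

import org.elasticsearch.common.io.stream.StreamInput
import org.elasticsearch.common.io.stream.StreamOutput

// Read bucketsPathsMap in one call; the unchecked cast mirrors the existing
// Map<String, String> type of bucketsPathsMap.
@Suppress("UNCHECKED_CAST")
fun readBucketsPathsMap(sin: StreamInput): Map<String, String> =
    sin.readMap() as Map<String, String>

// Write bucketsPathsMap in one call instead of a manual size + key/value loop.
fun writeBucketsPathsMap(out: StreamOutput, bucketsPathsMap: Map<String, String>) {
    out.writeMap(bucketsPathsMap)
}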


@Throws(IOException::class)
public override fun internalXContent(builder: XContentBuilder, params: Params): XContentBuilder {
    builder.field(PipelineAggregator.Parser.BUCKETS_PATH.preferredName, bucketsPathsMap as Map<String, Any>?)

NP: Builder calls can be chained to reduce text.

Ex.

builder.field()
    .field()
    .field()

private val PARENT_BUCKET_PATH = ParseField("parent_bucket_path")

@Throws(IOException::class)
fun parse(reducerName: String, parser: XContentParser): BucketSelectorExtAggregationBuilder {

To clean this up a bit and be more consistent with our other parse functions, I think this can be simplified to assume that parse is being called on the start_object of bucket_selector_ext. This way, we can fetch the field name and then the next token should be the contents of the field. Then within the single when, we can cover the different formats of the field being parsed.

Ex.

fun parse(reducerName: String, xcp: XContentParser): BucketSelectorExtAggregationBuilder {
    var bucketsPathsMap: MutableMap<String, String>? = null
    var gapPolicy: GapPolicy? = null
    var script: Script? = null
    var parentBucketPath: String? = null
    var filter: BucketSelectorExtFilter? = null

    ensureExpectedToken(Token.START_OBJECT, xcp.currentToken(), xcp)
    while (xcp.nextToken() != Token.END_OBJECT) {
        val fieldName = xcp.currentName()
        xcp.nextToken()

        when (fieldName) {
            PipelineAggregator.Parser.BUCKETS_PATH.preferredName -> {
                if (xcp.currentToken() == Token.START_OBJECT) {
                   ...
                } else if (xcp.currentToken() == Token.START_ARRAY) {
                   while (xcp.nextToken() != Token.END_ARRAY) {
                      ...
                   }
                } else {
                   ...
                }
            }
            PipelineAggregator.Parser.GAP_POLICY.preferredName -> { ... }
            Script.SCRIPT_PARSE_FIELD.preferredName -> { ... }
            PARENT_BUCKET_PATH.preferredName -> { ... }
            else -> { ... }
        }
    }

    ...
}

constructor(sin: StreamInput) : super(sin.readString(), null, null) {
    script = Script(sin)
    gapPolicy = GapPolicy.readFrom(sin)
    bucketsPathsMap = sin.readGenericValue() as Map<String, String>

This can be changed to sin.readMap() to be a little more explicit.

import java.util.function.Consumer
import java.util.function.Function

class BucketSelectorExtAggregatorTestsIT : AggregatorTestCase() {

NP: We can probably just call these BucketSelectorExtAggregatorTests (same for BucketSelectExtAggregationBuilderTestsIT) since they're more unit tests. The IT tests in this package typically extend ODFE/ESRestTestCase() and make the actual API calls on the test cluster.

@rishabhmaurya rishabhmaurya merged commit 3bc057d into opendistro-for-elasticsearch:doc-level-alerting-dev May 10, 2021