[Security Solution][Alerts] Explore using multi_terms or composite aggs for threshold rules #125703

marshallmain · 2022-02-15T17:54:56Z

The current threshold rule implementation uses multiple levels of nested terms aggregations, one level for each field, to allow threshold rules to bucket by multiple fields. While this works most of the time, in some cases the results can depend on the order in which the nested buckets are defined. If the cardinality of one level of terms aggregation is too high then some buckets will be excluded from the results and sub-buckets will be limited to the included results. In this case, swapping the order of the aggregations could return different results.
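
To make the order dependence concrete, here is a toy simulation in plain Python. It is not the rule executor's code: `nested_top_buckets`, the sample documents, and the `size=1` limit are assumptions chosen only to show how truncating each level can change the final buckets when the field order is swapped. Real terms aggregations truncate per shard and with much larger sizes.

```python
from collections import Counter

# Toy documents: (host, user) pairs with different frequencies.
DOCS = (
    [{"host": "a", "user": "x"}] * 4
    + [{"host": "b", "user": "y"}] * 3
    + [{"host": "c", "user": "y"}] * 2
)


def nested_top_buckets(docs, fields, size):
    """Mimic nested terms aggs: keep only the top `size` values at each level."""
    if not fields:
        return {(): len(docs)}
    field, rest = fields[0], fields[1:]
    counts = Counter(doc[field] for doc in docs)
    buckets = {}
    for value, _count in counts.most_common(size):
        subset = [doc for doc in docs if doc[field] == value]
        for key, n in nested_top_buckets(subset, rest, size).items():
            # Normalize the key so results from different field orders are comparable.
            buckets[tuple(sorted(((field, value),) + key))] = n
    return buckets


host_first = nested_top_buckets(DOCS, ["host", "user"], size=1)
user_first = nested_top_buckets(DOCS, ["user", "host"], size=1)
print(host_first)
print(user_first)
```

With host first, the only surviving bucket is (a, x); with user first, it is (b, y), even though the documents are identical.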

In 7.12, Elasticsearch gained the multi_terms aggregation, which allows aggregating by multiple fields in a single aggregation. This could replace N nested levels of terms aggs with a single multi_terms aggregation that is sorted directly by the final bucket sizes, removing the dependency on the order in which the field names are defined. The multi_terms aggregation docs do come with a warning about performance, so we should test the performance of any new implementation. It may end up being faster than the nested implementation anyway.
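
As a rough sketch of what the single-aggregation request could look like, the helper below builds a multi_terms aggregation body as a Python dict. The function name `multi_terms_agg`, the aggregation name `thresholdTerms`, and the default size are hypothetical; only the `multi_terms` options (`terms`, `min_doc_count`, `size`, `order`) come from the Elasticsearch aggregation itself.

```python
def multi_terms_agg(fields, threshold, size=10000):
    """Build a multi_terms aggregation that buckets by all fields at once."""
    if len(fields) < 2:
        # multi_terms is meant for two or more fields; a plain terms agg covers one.
        raise ValueError("multi_terms needs at least two fields")
    return {
        "thresholdTerms": {  # hypothetical aggregation name
            "multi_terms": {
                "terms": [{"field": field} for field in fields],
                "min_doc_count": threshold,   # drop buckets below the threshold
                "size": size,
                "order": {"_count": "desc"},  # largest combined buckets first
            }
        }
    }


agg = multi_terms_agg(["host.name", "user.name"], threshold=100)
print(agg)
```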

Alternatively, we could investigate composite aggregations to replace the N nested levels of aggs with a single aggregation. Composite aggregations don't allow sorting by bucket size, but are supposed to be faster than multi_terms and there appear to be workarounds with bucket selectors that at least allow filtering by bucket size.
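
A minimal sketch of the composite alternative, again as a Python dict builder. The names `composite_threshold_agg`, `thresholdTerms`, and `min_count` are assumptions for illustration. The bucket_selector stands in for the missing sort: it can only drop buckets below the threshold, not reorder them.

```python
def composite_threshold_agg(fields, threshold, page_size=1000, after=None):
    """Build a composite agg over `fields`, filtered by doc count via bucket_selector."""
    composite = {
        "size": page_size,
        # One terms source per field; the source name reuses the field name.
        "sources": [{field: {"terms": {"field": field}}} for field in fields],
    }
    if after is not None:
        composite["after"] = after  # resume from a previous page's after_key
    return {
        "thresholdTerms": {  # hypothetical aggregation name
            "composite": composite,
            "aggs": {
                "min_count": {
                    "bucket_selector": {
                        "buckets_path": {"docCount": "_count"},
                        "script": f"params.docCount >= {int(threshold)}",
                    }
                }
            },
        }
    }


body = composite_threshold_agg(["host.name", "user.name"], threshold=100)
print(body)
```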

elasticmachine · 2022-02-23T20:04:35Z

Pinging @elastic/security-detections-response (Team:Detections and Resp)

madirey · 2022-04-12T16:02:08Z

multi_terms is consistently and significantly slower than our nested terms aggs, by about 10x. Testing of composite is still in progress, but looks promising. However, the inability to sort by cardinality means that we may have to issue multiple queries (composite gives us a nice after_key-based paging option, similar to search_after) to ensure that the highest cardinality items are handled. This could negate the performance gains on large sets, but also opens up the possibility of alerting on more than 10k items.
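
The paging loop could be sketched roughly as follows. `iter_composite_buckets`, `fake_search`, and the canned `PAGES` are all hypothetical stand-ins so the example runs without a cluster; the only behavior assumed from Elasticsearch is that each composite response carries an `after_key` while more pages remain.

```python
def iter_composite_buckets(search, build_body, agg_name="thresholdTerms", max_pages=100):
    """Yield buckets page by page, following after_key until it disappears."""
    after = None
    for _ in range(max_pages):
        response = search(build_body(after))
        agg = response["aggregations"][agg_name]
        yield from agg["buckets"]
        after = agg.get("after_key")
        # Stop on a missing after_key, not on an empty page: a pipeline
        # filter may have emptied this page while later pages still match.
        if after is None:
            break


# Canned responses keyed by the "host" value of the `after` key.
PAGES = {
    None: {"buckets": [{"key": {"host": "a"}, "doc_count": 5}], "after_key": {"host": "a"}},
    "a": {"buckets": [], "after_key": {"host": "b"}},
    "b": {"buckets": [{"key": {"host": "c"}, "doc_count": 7}]},
}


def fake_search(body):
    after = body.get("after")
    page = PAGES[after["host"] if after else None]
    return {"aggregations": {"thresholdTerms": page}}


def build_body(after):
    return {"after": after} if after else {}


hosts = [b["key"]["host"] for b in iter_composite_buckets(fake_search, build_body)]
print(hosts)
```

The `max_pages` cap is a design choice to bound the number of queries per rule execution; the right value would depend on testing.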

madirey · 2022-05-06T13:17:21Z

Below I'll summarize some problems with the current threshold rule implementation, and how we can address them:

Problem: As outlined above, results depend on the order of terms provided, as each shard is limited in the number of buckets it can return.
Solution: composite allows each shard to return a fully resolved set of results, taking each term into account at each ES node. This eliminates the immediate problem and we should be able to get correct and complete results every time.

Caveats:

Determining which alerts are generated first, before hitting max_signals, is difficult. Results are returned by natural ordering of the values per key, NOT by the bucket count or by their timestamp. We could attempt to order by timestamp by using a date_histogram as one of the composite sources, but this gets really tricky. We may need to count results from multiple date buckets for each combination of terms to determine which alerts should be generated first. This means paging over ALL of the results. Therefore it's reasonable to assume that we won't feasibly have any control over which alerts are generated before max_signals is hit.
If a cardinality condition is specified, this check must be performed by a bucket_selector aggregation. This agg is a pipeline aggregation, which will be performed AFTER the composite agg has been resolved. This should work as a filter, but we'll be unable to sort by the cardinality. This won't have the same problems as we saw in [Security Solution][Detections][Threshold Rules] Filtering by Cardinality may miss alerts when bucket count is high #95258, as composite will return fully resolved buckets across all terms, BUT again we won't be able to ensure that higher cardinality alerts are generated before max_signals is hit. This could be solved by post-processing in Kibana, rather than letting ES perform the operation using a bucket_selector, but this is likely to result in a significant performance hit.
Another caveat with using the bucket_selector agg is that we may not receive a full result set for each "page." Since this agg is performed AFTER the composite page is computed, bucket_selector will throw out some of the results, leaving us with fewer items than we requested. This isn't really a problem, AFAICT... just something to be aware of. I suppose it could result in excessive paging if a high cardinality threshold is requested over low cardinality data.
The composite aggregation doesn't provide a min_doc_count option, which is something we rely on in the existing terms-based implementation to filter out buckets that don't meet the threshold. Thus we'll need to use either a bucket_selector pipeline aggregation to throw out those buckets, or perform this operation in Kibana.
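
For comparison, the post-processing alternative mentioned in the caveats could look something like this sketch. It assumes every bucket has already been paged in and that each carries a cardinality sub-aggregation named `cardinality_count` (a hypothetical name); `select_alert_buckets` is likewise illustrative, not existing Kibana code. It trades the extra paging cost for control over which buckets survive the max_signals cutoff.

```python
def select_alert_buckets(buckets, threshold, cardinality_min=None, max_signals=100):
    """Filter buckets client-side, then keep the highest-cardinality ones."""

    def cardinality(bucket):
        return bucket.get("cardinality_count", {}).get("value", 0)

    kept = [
        b for b in buckets
        if b["doc_count"] >= threshold  # replaces min_doc_count
        and (cardinality_min is None or cardinality(b) >= cardinality_min)
    ]
    # Highest cardinality first, doc count as the tie-breaker.
    kept.sort(key=lambda b: (cardinality(b), b["doc_count"]), reverse=True)
    return kept[:max_signals]


sample = [
    {"key": {"host": "a"}, "doc_count": 120, "cardinality_count": {"value": 3}},
    {"key": {"host": "b"}, "doc_count": 50, "cardinality_count": {"value": 9}},
    {"key": {"host": "c"}, "doc_count": 300, "cardinality_count": {"value": 7}},
    {"key": {"host": "d"}, "doc_count": 200, "cardinality_count": {"value": 1}},
]
selected = select_alert_buckets(sample, threshold=100, cardinality_min=2, max_signals=2)
print([b["key"]["host"] for b in selected])
```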

marshallmain added the Feature:Threshold Rule, Team:Detection Alerts, and 8.2 candidate labels Feb 15, 2022

MindyRS added the Team:Detections and Resp label Feb 23, 2022

marshallmain added the 8.3 candidate label and removed the 8.2 candidate label Feb 24, 2022

madirey mentioned this issue Feb 24, 2022: [Security Solution] Threshold rule performance fixes #113587 (Closed)

marshallmain added the technical debt label Mar 30, 2022

marshallmain assigned madirey Mar 31, 2022

madirey mentioned this issue May 20, 2022: [Security Solution] Use Composite Agg for Threshold Rules #131985 (Merged)

marshallmain added the 8.4 candidate label Jun 14, 2022

marshallmain removed the 8.3 candidate label Jun 23, 2022

madirey closed this as completed Jul 28, 2022
