[Security Solution][Alerts] Explore using multi_terms or composite aggs for threshold rules #125703

Closed
marshallmain opened this issue Feb 15, 2022 · 3 comments
Assignees
Labels
8.4 candidate Feature:Threshold Rule Security Solution Threshold rule type Team:Detection Alerts Security Detection Alerts Area Team Team:Detections and Resp Security Detection Response Team technical debt Improvement of the software architecture and operational architecture

Comments

@marshallmain
Contributor

The current threshold rule implementation uses multiple levels of nested terms aggregations, one level per field, to allow threshold rules to bucket by multiple fields. This works most of the time, but in some cases the results depend on the order in which the nested buckets are defined. If the cardinality of one level of terms aggregation is too high, some buckets are excluded from the results, and sub-buckets are computed only for the buckets that were included. In that case, swapping the order of the aggregations can return different results.
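
To make the order dependence concrete, here is a toy in-memory model (not the actual rule code, and not how Elasticsearch works internally across shards). It assumes each level keeps only its top `size` buckets by document count; `nested_terms` and the sample data are made up for illustration.

```python
from collections import Counter


def nested_terms(docs, fields, size, threshold):
    """Toy nested terms agg: each level keeps only its top `size` buckets."""

    def recurse(rows, remaining, prefix):
        field, rest = remaining[0], remaining[1:]
        counts = Counter(row[field] for row in rows)
        # Highest count first; ties broken by key so the result is deterministic.
        top = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:size]
        results = []
        for value, count in top:
            key = prefix + ((field, value),)
            if rest:
                subset = [r for r in rows if r[field] == value]
                results.extend(recurse(subset, rest, key))
            elif count >= threshold:
                # Leaf level: only buckets meeting the threshold become alerts.
                results.append((dict(key), count))
        return results

    return recurse(docs, list(fields), ())


docs = (
    [{"host": "h1", "user": "u1"}] * 3
    + [{"host": "h2", "user": "u1"}] * 2
    + [{"host": "h2", "user": "u2"}] * 2
)

# host first: h2 (4 docs) crowds out h1, and no h2 sub-bucket reaches 3 docs.
by_host_first = nested_terms(docs, ["host", "user"], size=1, threshold=3)
# user first: u1 (5 docs) is kept, and its h1 sub-bucket has 3 docs.
by_user_first = nested_terms(docs, ["user", "host"], size=1, threshold=3)
print(by_host_first)  # []
print(by_user_first)  # [({'user': 'u1', 'host': 'h1'}, 3)]
```

The same (host, user) pair with 3 documents is found or missed depending only on which field is aggregated first.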

In 7.12, Elasticsearch gained the multi_terms aggregation, which allows aggregating by multiple fields in a single aggregation. This could replace N nested levels of terms aggs with a single multi_terms aggregation that can be sorted by the final bucket sizes directly, removing the dependency on the order in which the field names are defined. The multi_terms aggregation docs do come with a warning about performance, so we should test the performance of any new implementation. It may end up being faster than the nested implementation anyway.
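
A rough sketch of what the single-aggregation request could look like, written as a Python dict builder. The function name `build_multi_terms_agg`, the aggregation name `thresholds`, and the example field names are placeholders, not anything from the existing rule code.

```python
def build_multi_terms_agg(fields, threshold, size):
    """Build one multi_terms agg in place of N nested terms aggs (sketch)."""
    if len(fields) < 2:
        # multi_terms expects at least two terms; one field is a plain terms agg.
        raise ValueError("multi_terms needs at least two fields")
    return {
        "thresholds": {  # hypothetical aggregation name
            "multi_terms": {
                "terms": [{"field": field} for field in fields],
                # Drop buckets below the rule threshold.
                "min_doc_count": threshold,
                # Sort by the final bucket size, independent of field order.
                "order": {"_count": "desc"},
                "size": size,
            }
        }
    }


aggs = build_multi_terms_agg(["host.name", "user.name"], threshold=100, size=10000)
```

Because the buckets are keyed on the full combination of values, the field order affects only how the key is laid out, not which buckets make it into the result.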

Alternatively, we could investigate composite aggregations to replace the N nested levels of aggs with a single aggregation. Composite aggregations don't allow sorting by bucket size, but they are supposed to be faster than multi_terms, and there appear to be workarounds using bucket selectors that at least allow filtering by bucket size.
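
For comparison, a sketch of the composite variant with a bucket_selector sub-aggregation standing in for the threshold filter. Again this is only a dict builder under assumed names (`build_composite_agg`, `thresholds`, `meets_threshold`); using the field name as the composite source name is a choice made here for readability.

```python
def build_composite_agg(fields, threshold, page_size, after_key=None):
    """Build a composite agg with a bucket_selector threshold filter (sketch)."""
    composite = {
        "size": page_size,
        # One terms source per field; the source name is arbitrary.
        "sources": [{field: {"terms": {"field": field}}} for field in fields],
    }
    if after_key is not None:
        # Resume from the after_key returned with the previous page.
        composite["after"] = after_key
    return {
        "thresholds": {  # hypothetical aggregation name
            "composite": composite,
            "aggs": {
                "meets_threshold": {  # hypothetical name
                    "bucket_selector": {
                        "buckets_path": {"doc_count": "_count"},
                        "script": f"params.doc_count >= {int(threshold)}",
                    }
                }
            },
        }
    }


first_page = build_composite_agg(["host.name", "user.name"], threshold=100, page_size=1000)
```

There is no `order` by count here: composite buckets come back in key order, so the selector can only filter, not rank.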

@marshallmain marshallmain added Feature:Threshold Rule Security Solution Threshold rule type Team:Detection Alerts Security Detection Alerts Area Team 8.2 candidate considered, but not committed, for 8.2 release labels Feb 15, 2022
@MindyRS MindyRS added the Team:Detections and Resp Security Detection Response Team label Feb 23, 2022
@elasticmachine
Contributor

Pinging @elastic/security-detections-response (Team:Detections and Resp)

@marshallmain marshallmain added 8.3 candidate and removed 8.2 candidate considered, but not committed, for 8.2 release labels Feb 24, 2022
@marshallmain marshallmain added the technical debt Improvement of the software architecture and operational architecture label Mar 30, 2022
@madirey
Contributor

madirey commented Apr 12, 2022

multi_terms is consistently and significantly slower than our nested terms aggs, by about 10x. Testing of composite is still in progress, but it looks promising. However, the inability to sort by cardinality means that we may have to issue multiple queries (composite gives us a nice search_after-style paging option via after_key) to ensure that the highest-cardinality items are handled. This could negate the performance gains on large sets, but it also opens up the possibility of alerting on more than 10k items.
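
The paging loop would look roughly like the sketch below. `fetch_page` is a hypothetical callable that runs the composite query with the given `after` value and returns the aggregation result; the fake pages exist only so the example runs without a cluster. The loop assumes that a missing `after_key` is the signal that no more pages remain, so it does not stop on a short or empty page.

```python
def iter_composite_buckets(fetch_page, max_pages=100):
    """Yield buckets from every composite page until after_key runs out."""
    after_key = None
    for _ in range(max_pages):  # hard cap so a bad response cannot loop forever
        page = fetch_page(after_key)
        yield from page["buckets"]
        after_key = page.get("after_key")
        if after_key is None:
            return


# Fake three-page result set; the middle page is empty on purpose.
PAGES = {
    None: {"buckets": [{"key": {"host": "a"}, "doc_count": 5}], "after_key": {"host": "a"}},
    "a": {"buckets": [], "after_key": {"host": "b"}},
    "b": {"buckets": [{"key": {"host": "c"}, "doc_count": 9}]},
}


def fake_fetch(after_key):
    return PAGES[after_key["host"] if after_key else None]


buckets = list(iter_composite_buckets(fake_fetch))
print([b["key"]["host"] for b in buckets])  # ['a', 'c']
```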

@madirey
Contributor

madirey commented May 6, 2022

Below I'll summarize some problems with the current threshold rule implementation, and how we can address them:

Problem: As outlined above, results depend on the order of terms provided, as each shard is limited in the number of buckets it can return.
Solution: composite allows each shard to return a fully resolved set of results, taking each term into account at each ES node. This eliminates the immediate problem and we should be able to get correct and complete results every time.

Caveats:

  1. Determining which alerts are generated first, before hitting max_signals, is difficult. Results are returned in the natural ordering of the values for each key, NOT by bucket count or by timestamp. We could attempt to order by timestamp by using a date_histogram as one of the composite sources, but this gets really tricky: we may need to count results from multiple date buckets for each combination of terms to determine which alerts should be generated first, which means paging over ALL of the results. It's therefore reasonable to assume that we won't realistically have any control over which alerts are generated before max_signals is hit.

  2. If a cardinality condition is specified, this check must be performed by a bucket_selector aggregation. This is a pipeline aggregation, so it runs AFTER the composite agg has been resolved. It should work as a filter, but we'll be unable to sort by the cardinality. This won't have the same problems as we saw in #95258 ("Filtering by Cardinality may miss alerts when bucket count is high"), as composite will return fully resolved buckets across all terms, BUT again we won't be able to ensure that higher-cardinality alerts are generated before max_signals is hit. This could be solved by post-processing in Kibana, rather than letting ES perform the operation using a bucket_selector, but this is likely to result in a significant performance hit.

  3. Another caveat with using the bucket_selector agg is that we may not receive a full result set for each "page." Since this agg is performed AFTER the composite page is computed, bucket_selector will throw out some of the results, leaving us with fewer items than we requested. This isn't really a problem, AFAICT... just something to be aware of. I suppose it could result in excessive paging if a high cardinality threshold is requested over low cardinality data.

  4. The composite aggregation doesn't provide a min_doc_count option, which is something we rely on in the existing terms-based implementation to filter out buckets that don't meet the threshold. Thus we'll need to use either a bucket_selector pipeline aggregation to throw out those buckets, or perform this operation in Kibana.
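
For the "perform this operation in Kibana" option, the post-processing amounts to something like the sketch below (in Python for brevity). It assumes all pages have already been collected and that each bucket carries a cardinality sub-aggregation under a hypothetical name, `cardinality_count`; `select_alert_buckets` is likewise made up for illustration.

```python
def select_alert_buckets(buckets, threshold, max_signals,
                         min_cardinality=None, cardinality_agg="cardinality_count"):
    """Filter buckets client-side, then rank by cardinality before truncating."""

    def cardinality(bucket):
        return bucket.get(cardinality_agg, {}).get("value") or 0

    kept = []
    for bucket in buckets:
        # Stand-in for min_doc_count, which composite does not offer.
        if bucket["doc_count"] < threshold:
            continue
        if min_cardinality is not None and cardinality(bucket) < min_cardinality:
            continue
        kept.append(bucket)
    # Highest cardinality first, so those alerts win when max_signals is hit.
    kept.sort(key=lambda b: (cardinality(b), b["doc_count"]), reverse=True)
    return kept[:max_signals]


sample = [
    {"key": {"host": "a"}, "doc_count": 120, "cardinality_count": {"value": 3}},
    {"key": {"host": "b"}, "doc_count": 50, "cardinality_count": {"value": 9}},
    {"key": {"host": "c"}, "doc_count": 300, "cardinality_count": {"value": 7}},
    {"key": {"host": "d"}, "doc_count": 200, "cardinality_count": {"value": 1}},
]
selected = select_alert_buckets(sample, threshold=100, max_signals=1, min_cardinality=2)
print([b["key"]["host"] for b in selected])  # ['c']
```

This gives control over which alerts survive the max_signals cut, at the cost of pulling every bucket back before deciding, which is the performance hit noted in caveat 2.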

4 participants