[SIEM] Agg refactoring suggestions #69172

polyfractal · 2020-06-15T16:24:03Z

Hiya SIEM team. Over at Elasticsearch we've been looking into a few performance-related items, and some of the aggs that the SIEM dashboard uses caught our eye.

Benchmarks?

Do we benchmark any of the dashboards? The Elasticsearch team uses Rally extensively; perhaps we could find a way to translate the dashboard requests into some kind of Rally track? It'd help both of us keep an eye on performance, make changes easier to reason about, and make collaboration easier since we'd have a shared dataset to look at.

Usage of `filter` aggs

There seems to be widespread use of filter aggs, which is non-ideal. Filter aggs are relatively expensive, especially when compared to filtering in the query component of a search request. Each individual filter agg needs to load the bitset of docs that contain that value and check it against each doc one by one (as opposed to query filters, which can use a leap-frog mechanism to minimize checks).

So the first thing would be trying to move filter aggs up into the query where possible, if they are being used to exclude documents.
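As a rough illustration of that rewrite, here is a minimal sketch in Python that operates on a request body represented as a dict. The helper name `hoist_filter_agg` and the field names in the usage note are made up for the example, and it deliberately handles only the simple case where the filter agg is the sole top-level agg (otherwise moving the clause into the query would also restrict sibling aggs).

```python
import copy


def hoist_filter_agg(body, agg_name):
    """Move a lone top-level filter agg into the query's bool.filter.

    Hypothetical helper: only valid when `agg_name` is the only top-level
    agg, because the hoisted clause narrows the docs seen by every agg.
    """
    aggs = body.get("aggs", {})
    if set(aggs) != {agg_name} or "filter" not in aggs[agg_name]:
        raise ValueError("expected exactly one top-level filter agg")

    result = copy.deepcopy(body)  # leave the caller's request untouched
    filter_agg = result.pop("aggs")[agg_name]
    original_query = result.get("query", {"match_all": {}})

    # The old query stays in `must`; the agg's clause becomes a query filter.
    result["query"] = {
        "bool": {"must": [original_query], "filter": [filter_agg["filter"]]}
    }
    # Sub-aggs that lived under the filter agg are promoted to the top level.
    if filter_agg.get("aggs"):
        result["aggs"] = filter_agg["aggs"]
    return result
```

For example, a request whose only agg is `{"filter": {"term": {"event.category": "authentication"}}, "aggs": {...}}` comes back with that `term` clause under `query.bool.filter` and the sub-aggs at the top level. The doc count that used to be read from the filter bucket is then available as the total hit count of the search.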

If they are being used for counts (like here), there are some options:

1. Try to rewrite some of those to operate as terms aggs. E.g. if multiple filters share the same field (event.module or something), a terms agg will give you doc counts for all the different event modules. Terms is pretty aggressively optimized because it is so widely used. It's hard to say for sure if it would help, but from some informal testing (see rally test at end) it tends to be noticeably faster.
2. For fields that are non-overlapping and sparse, a value_count agg can be useful. E.g. if only a subset of docs have a certain field and you want to know how many there are, a value_count on that field will return the count without having to bucket them. A relatively niche usage here, but handy if applicable.
3. Rewrite into an msearch and skip aggregating altogether. Each msearch clause will be a single search request filtering for specifically the criteria needed. With size: 0 you don't incur a fetch overhead, and with track_total_hits: true you can still get the total count.

3b. If you don't need exact counts, setting track_total_hits: false will enable the new block-max WAND optimization and return results very fast. You can also set it to a number to configure a threshold at which it stops counting, so you can say "> 100,000 results", etc.
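Options 2, 3 and 3b above can be sketched as plain request bodies. This is a minimal Python sketch, not the dashboard's actual code: the helper names, the index name and the filter clauses are placeholders chosen for illustration. The msearch payload is the newline-delimited header/body format that the `_msearch` endpoint expects.

```python
import json


def value_count_body(field):
    """Option 2: count values of a sparse field without bucketing.

    Note that value_count counts values, so a multi-valued field can
    yield more than one per document.
    """
    return {"size": 0, "aggs": {"field_count": {"value_count": {"field": field}}}}


def count_search(filter_clause, track_total_hits=True):
    """Option 3: one search per criterion, no aggs and no fetched hits.

    track_total_hits may be True (exact), False (no count), or an int
    threshold at which counting stops (option 3b).
    """
    return {
        "size": 0,
        "track_total_hits": track_total_hits,
        "query": {"bool": {"filter": [filter_clause]}},
    }


def msearch_payload(index, named_filters, track_total_hits=True):
    """Build the NDJSON body: a header line, then a search body, per filter."""
    lines = []
    for clause in named_filters.values():
        lines.append(json.dumps({"index": index}))
        lines.append(json.dumps(count_search(clause, track_total_hits)))
    return "\n".join(lines) + "\n"  # the payload must end with a newline


def read_counts(named_filters, msearch_response):
    """Pair each filter name with its hits.total (responses keep request order).

    In 7.x, hits.total is an object such as {"value": 42, "relation": "eq"};
    it is absent when track_total_hits is False, hence the .get().
    """
    responses = msearch_response["responses"]
    return {
        name: resp["hits"].get("total")
        for name, resp in zip(named_filters, responses)
    }
```

With a threshold such as `track_total_hits=100000`, a total whose `relation` is `"gte"` can be displayed as "> 100,000 results", while `"eq"` means the value is exact.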

I ran a simple test comparing msearch ("count"), filter, filters, term and value_count. As you can see, msearch is fastest by a large margin, followed by term and value_count. Filter/filters are generally slower.

`terms` instead of `filter` for partitioning

Related to 1) above, if there is a scenario where you wish to partition the same field into multiple buckets, a terms agg will be faster (and a simpler query) than a series of filter aggs. For example, this request uses two filter aggs to create "success" and "failure" buckets.

Instead, a single terms agg on the field will produce both buckets more cheaply. In addition, the child filter: event.outcome: success agg is unnecessary because, by the nature of the parent bucket, all docs in that bucket are already success/failure. You can just grab the count from the bucket's doc_count.

If there are unrelated values in the field and you only want "success"/"failure", you can use the include/exclude functionality of a terms agg to only include terms you care about.
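A minimal Python sketch of that replacement, assuming the field is event.outcome as in the example above. The helper names are made up for illustration; `include` with an array of exact values is the terms agg option referred to in the previous paragraph.

```python
def outcome_terms_agg(field="event.outcome", wanted=("success", "failure")):
    """One terms agg in place of one filter agg per outcome value.

    `include` keeps only the listed terms, so unrelated values in the
    field never produce buckets.
    """
    return {
        "terms": {
            "field": field,
            "include": list(wanted),
            "size": len(wanted),  # at most one bucket per wanted term
        }
    }


def counts_from_buckets(terms_result, wanted=("success", "failure")):
    """Read doc_count per term, defaulting to 0 when a term has no bucket."""
    counts = {value: 0 for value in wanted}
    for bucket in terms_result["buckets"]:
        counts[bucket["key"]] = bucket["doc_count"]
    return counts
```

Sub-aggs that previously sat under each filter agg can be attached once under the terms agg, and the per-outcome count is simply each bucket's doc_count. Terms with no matching docs return no bucket, which is why the reader fills in zeros.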

AutoDateHistogram min_interval

There's some optimization work done in ES (coming in 7.8/7.9) which will improve auto-date-histo speed noticeably. But in the meantime, specifying a min_interval will help prevent extra work. E.g. auto-date-histo will start with second-level intervals and round up from there. If querying a 12h time range, it almost never makes sense to look at second-level intervals, so that part of the rounding is wasted effort.

This does remove some of the convenience of the "fire and forget" aspect of auto-date-histo, but it can translate into notable performance improvements. I'm not sure what the best option is here, but if there's a way to intelligently set min_interval it'd probably help.
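One possible way to set it is to derive the floor from the length of the queried time range. The sketch below is in Python and the thresholds are an arbitrary, untested heuristic chosen only for illustration; the helper names and the default bucket count are likewise assumptions. In the request body the option is spelled `minimum_interval` and takes a calendar unit such as `second`, `minute`, `hour`, `day` or `month`.

```python
from datetime import timedelta


def pick_minimum_interval(time_range):
    """Pick a floor for auto_date_histogram rounding from the range length.

    Hypothetical thresholds: tune them against real dashboards.
    """
    if time_range <= timedelta(minutes=15):
        return "second"
    if time_range <= timedelta(days=1):
        return "minute"  # e.g. a 12h range skips second-level rounding
    if time_range <= timedelta(days=30):
        return "hour"
    if time_range <= timedelta(days=730):
        return "day"
    return "month"


def auto_date_histogram_agg(time_range, field="@timestamp", buckets=36):
    """Build the agg body with the derived minimum_interval."""
    return {
        "auto_date_histogram": {
            "field": field,
            "buckets": buckets,
            "minimum_interval": pick_minimum_interval(time_range),
        }
    }
```

For the 12h case mentioned above, `auto_date_histogram_agg(timedelta(hours=12))` sets `minimum_interval` to `minute`, so the agg never starts from second-level buckets.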

Closing

Sorry for the long ticket! I decided to file this as a ticket instead of email/slack/google doc/etc because it seemed easier to work through on github. Feel free to ping me if you have questions, happy to help out! It's hard to say for sure if any of these suggestions will actually help (although the msearch case is very compelling due to how it works), which is why I led with the question about benchmarks. Setting those up might be a good first step so we can quantitatively tweak the queries/aggs.


elasticmachine · 2020-06-18T07:01:51Z

Pinging @elastic/siem (Team:SIEM)

monfera added the Team:SIEM label Jun 18, 2020

MindyRS added the Team: SecuritySolution label Oct 27, 2020

This was referenced Nov 17, 2020

Add usage collection for savedObject tagging #83160

Merged

Reasons for not using saved objects for storing kibana data #80912

Open
