[SIEM] Agg refactoring suggestions #69172

Open
polyfractal opened this issue Jun 15, 2020 · 1 comment
Labels
Team: SecuritySolution Security Solutions Team working on SIEM, Endpoint, Timeline, Resolver, etc. Team:SIEM

Comments

@polyfractal
Contributor

Hiya SIEM team. Over at Elasticsearch we've been looking into a few performance-related items, and some of the aggs that the SIEM dashboard uses caught our eye.

Benchmarks?

Do we benchmark any of the dashboards? The Elasticsearch team uses Rally extensively, so perhaps we could find a way to translate the dashboard requests into some kind of Rally track? It'd help both of us keep an eye on performance, and it'd make changes easier to think about and collaborate on, since we'd have a shared dataset to look at.

Usage of filter aggs

There seems to be widespread use of filter aggs, which is non-ideal. Filter aggs are relatively expensive, especially when compared to filtering in the query component of a search request. Each individual filter agg needs to load the bitset of docs that contain that value, and check it against the doc one-by-one (as opposed to query filters which can use a leap-frog mechanism to minimize checks).

So the first thing would be trying to move filter aggs up into the query where possible, if they are being used to exclude documents.
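As a rough sketch of what "moving the filter up" could look like, here is a before/after pair of request bodies built as plain Python dicts. The field names (`event.category`, `user.name`), the agg names and the helper `hoist_filter_into_query` are hypothetical and only for illustration; they are not taken from the actual SIEM queries.

```python
def hoist_filter_into_query(query, filter_clause):
    """Wrap an existing query in a bool query and add `filter_clause`
    in filter context, so it narrows the doc set before aggregation."""
    return {"bool": {"must": [query], "filter": [filter_clause]}}


# Before: a filter agg is used only to exclude documents (hypothetical fields).
before = {
    "size": 0,
    "query": {"range": {"@timestamp": {"gte": "now-12h"}}},
    "aggs": {
        "only_auth": {
            "filter": {"term": {"event.category": "authentication"}},
            "aggs": {"users": {"cardinality": {"field": "user.name"}}},
        }
    },
}

# After: the same restriction lives in the query, and the child agg
# moves up one level because the wrapping filter agg is no longer needed.
after = {
    "size": 0,
    "query": hoist_filter_into_query(
        before["query"], before["aggs"]["only_auth"]["filter"]
    ),
    "aggs": before["aggs"]["only_auth"]["aggs"],
}
```

This only works when every agg in the request wants the same restriction. If sibling aggs need the unfiltered doc set, the filter has to stay where it is (or the request has to be split).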

If they are being used for counts (like here), there are some options:

  1. Try to rewrite some of those to operate as terms aggs. E.g. if multiple filters share the same field (event.module or something), a terms agg will give you doc counts for all the different event modules. Terms is pretty aggressively optimized because it is so widely used. It's hard to say for sure if it would help, but from some informal testing (see rally test at end) it tends to be noticeably faster.

  2. For fields that are non-overlapping and sparse, a value_count agg can be useful. E.g. if only a subset of docs have a certain field and you want to know how many there are, a value_count on that field will return the count without having to bucket them. It's a relatively niche usage here, but handy if applicable.

  3. Rewrite into an msearch and skip aggregating altogether. Each msearch clause will be a single search request filtering for specifically the criteria needed. With size: 0 you don't incur any fetch overhead, and with track_total_hits: true you can still get the total count.

    3b. If you don't need exact counts, setting track_total_hits to an integer threshold (or to false, which skips counting entirely) lets Elasticsearch stop counting early and take advantage of the new block-max WAND optimization, so results return very fast. With a threshold you can still say "> 100,000 results", etc.
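Option 3 could be sketched roughly like this: one count-only search per criterion, serialized into the newline-delimited body that `_msearch` expects. The helper name `build_count_msearch`, the index pattern and the field names are assumptions for illustration, not the real dashboard requests.

```python
import json


def build_count_msearch(index, base_filters, count_filters, track_total_hits=True):
    """Build an NDJSON _msearch body with one count-only search per clause.

    Each search uses size: 0 (no fetch phase) and puts its criteria in
    filter context. `track_total_hits` may be True (exact count) or an
    integer threshold (counting stops once the threshold is reached).
    """
    lines = []
    for clause in count_filters:
        lines.append(json.dumps({"index": index}))  # header line
        lines.append(json.dumps({                   # body line
            "size": 0,
            "track_total_hits": track_total_hits,
            "query": {"bool": {"filter": base_filters + [clause]}},
        }))
    # The msearch body must be terminated by a newline.
    return "\n".join(lines) + "\n"


# Hypothetical usage: count successes and failures over the last 12 hours.
body = build_count_msearch(
    "auditbeat-*",
    [{"range": {"@timestamp": {"gte": "now-12h"}}}],
    [
        {"term": {"event.outcome": "success"}},
        {"term": {"event.outcome": "failure"}},
    ],
)
```

The counts then come back in request order under `responses[i].hits.total.value`. With an integer threshold such as `track_total_hits=100000`, the accompanying `relation` is `"gte"` once the threshold is hit, which is what lets the UI show "> 100,000".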

I ran a simple test comparing msearch ("count"), filter, filters, term and value_count. As you can see, msearch is fastest by a large margin, followed by term and value_count. Filter/filters are generally slower.

[image: benchmark results for msearch ("count"), filter, filters, term and value_count]

terms instead of filter for partitioning

Related to 1) above, if there is a scenario where you wish to partition the same field into multiple buckets, a terms agg will be faster (and a simpler query) than a series of filter aggs. For example, this request uses two filter aggs to create "success" and "failure" buckets.

Instead, a single terms agg on the field will produce both buckets more cheaply. In addition, the child filter agg on event.outcome: success is unnecessary, because by the nature of the parent bucket all docs in that bucket are already success/failure. You can just grab the count from the bucket's doc_count.

If there are unrelated values in the field and you only want "success"/"failure", you can use the include/exclude functionality of a terms agg to only include terms you care about.
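A minimal sketch of the single-terms-agg version, assuming the field is `event.outcome` and that only "success" and "failure" matter. The helper names and the sample buckets are made up for illustration; the numbers are not real results.

```python
def outcome_terms_agg(field="event.outcome", include=("success", "failure")):
    """One terms agg that partitions `field`, restricted via `include`
    to the exact values we care about."""
    return {
        "terms": {
            "field": field,
            "include": list(include),
            "size": len(include),
        }
    }


def counts_from_buckets(buckets):
    """Read per-value counts straight from each bucket's doc_count,
    with no child filter agg required."""
    return {b["key"]: b["doc_count"] for b in buckets}


request_aggs = {"outcomes": outcome_terms_agg()}

# Hypothetical shape of aggregations.outcomes.buckets in a response.
sample_buckets = [
    {"key": "success", "doc_count": 120},
    {"key": "failure", "doc_count": 7},
]
counts = counts_from_buckets(sample_buckets)
```

A value that has no matching docs simply produces no bucket, so callers probably want something like `counts.get("failure", 0)` rather than indexing directly.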

AutoDateHistogram minimum_interval

There's some optimization work done in ES (coming in 7.8/7.9) which will improve auto-date-histo speed noticeably. But in the meantime, specifying a minimum_interval will help prevent extra work. E.g. auto-date-histo will start with second-level intervals and round up from there. If querying a 12h time range it almost never makes sense to look at second-level intervals, so that part of the rounding is wasted effort.

This does remove some of the "fire and forget" convenience of auto-date-histo, but it can translate into notable performance improvements. I'm not sure what the best option is here, but if there's a way to intelligently set minimum_interval it'd probably help.
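One possible heuristic, sketched under assumptions: pick the coarsest calendar unit that still fits inside "time range / target buckets", since anything finer than that can't be the final interval anyway. In the auto_date_histogram request body the parameter is spelled `minimum_interval` and accepts `second`, `minute`, `hour`, `day`, `month` or `year`. The helper names, the month/year approximations and the `@timestamp` field are assumptions, not anything the dashboards currently do.

```python
from datetime import timedelta

# Coarsest first. Month and year lengths are rough approximations,
# which is good enough for choosing a lower bound.
_UNITS = [
    ("year", timedelta(days=365)),
    ("month", timedelta(days=30)),
    ("day", timedelta(days=1)),
    ("hour", timedelta(hours=1)),
    ("minute", timedelta(minutes=1)),
]


def pick_minimum_interval(time_range, buckets=10):
    """Return the coarsest unit that is no longer than time_range / buckets."""
    per_bucket = time_range / buckets
    for name, length in _UNITS:
        if length <= per_bucket:
            return name
    return "second"


def auto_date_histogram(field, time_range, buckets=10):
    """Build an auto_date_histogram agg with a range-aware minimum_interval."""
    return {
        "auto_date_histogram": {
            "field": field,
            "buckets": buckets,
            "minimum_interval": pick_minimum_interval(time_range, buckets),
        }
    }


# A 12h range with 10 target buckets gives 72 minutes per bucket,
# so rounding can start at "hour" instead of "second".
agg = auto_date_histogram("@timestamp", timedelta(hours=12))
```

Whether this is the right trade-off depends on how the charts use the buckets, so it's probably worth benchmarking before adopting it.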

Closing

Sorry for the long ticket! I decided to file this as a ticket instead of email/slack/google doc/etc because it seemed easier to work through on github. Feel free to ping me if you have questions, happy to help out! It's hard to say for sure if any of these suggestions will actually help (although the msearch case is very compelling due to how it works), which is why I led with the question about benchmarks. Setting those up might be a good first step so we can quantitatively tweak the queries/aggs.

@elasticmachine
Contributor

Pinging @elastic/siem (Team:SIEM)

@MindyRS MindyRS added the Team: SecuritySolution Security Solutions Team working on SIEM, Endpoint, Timeline, Resolver, etc. label Oct 27, 2020