Truncate lastFailureMessage for siem-detection-engine-rule-status documents #109815

rudolf · 2021-08-24T11:25:36Z

siem-detection-engine-rule-status documents store lastFailureMessage as a string that is indexed as type: "text", but some failure messages are so large that these documents reach up to 26MB.

These large documents cause migrations to fail because a batch of 1000 documents easily exceeds Elasticsearch's http.max_content_length, which defaults to 100mb.
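To put rough numbers on this, here is a back-of-the-envelope sketch. The function names are hypothetical, 1MB is assumed to mean 1024 * 1024 bytes, and the per-document overhead of a bulk request is ignored, so treat the results as approximate upper bounds rather than exact limits.

```typescript
// Default value of Elasticsearch's http.max_content_length (100mb), in bytes.
const MAX_CONTENT_LENGTH_BYTES = 100 * 1024 * 1024;

// How many documents of a given size fit into a single request body.
// Hypothetical helper: ignores bulk action metadata and JSON overhead.
function maxDocsPerRequest(
  docSizeBytes: number,
  limitBytes: number = MAX_CONTENT_LENGTH_BYTES
): number {
  return Math.floor(limitBytes / docSizeBytes);
}

// Average size each document may have for a whole batch to fit in one request.
function averageDocBudgetBytes(
  batchSize: number,
  limitBytes: number = MAX_CONTENT_LENGTH_BYTES
): number {
  return limitBytes / batchSize;
}

// Only a handful of 26MB documents fit under the default limit.
const docsThatFit = maxDocsPerRequest(26 * 1024 * 1024);

// A batch of 1000 documents only fits if documents average roughly 100KB each.
const budgetPerDoc = averageDocBudgetBytes(1000);
```

Under these assumptions, a batch fails as soon as it contains four documents of the largest observed size, far below the batch size of 1000.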

Even though we're fixing the root problem with migrations in #107641, these large documents still consume a lot of resources such as storage and, particularly, heap. In my tests they easily cause Elasticsearch to run into circuit_breaking_exceptions.

Can we truncate these failure messages to limit the size of these documents and/or not index the entire contents of this field?

elasticmachine · 2021-08-24T11:25:38Z

Pinging @elastic/security-detections-response (Team:Detections and Resp)

elasticmachine · 2021-08-24T15:36:40Z

Pinging @elastic/security-solution (Team: SecuritySolution)

banderror · 2021-08-25T07:08:53Z

A note on why this can happen:

Every rule execution runs a loop under the hood. During each iteration of this loop, Detection Engine queries the source indices (which contain the source events) for a small time period. If it finds any events matching the rule criteria, it creates detection alerts from them and attempts to bulk index them.

When querying and bulk indexing within an iteration, we may catch exceptions. All of these exceptions are collected and returned from the loop. Detection Engine then joins all of their messages and changes the status of the rule to failed with the "Bulk Indexing of signals failed" message. So the errors in that message are not necessarily all indexing errors; it can contain search errors as well.

So when the loop runs many iterations, it's possible to get a very large list of caught exceptions and a long status message. We definitely need to truncate this list.

banderror · 2021-08-25T07:23:37Z

A note on how to truncate:

- Truncate on write:
  - get unique error messages from the list
  - truncate the list (leave max 10-20 errors, for example)
  - join its items to a string
  - truncate the resulting string (leave max 1024 characters, for example)
- Write a migration for siem-detection-engine-rule-status that would truncate existing status documents:
  - truncate the existing message fields (leave max 1024 characters, for example)
- Consider adjusting the mappings of siem-detection-engine-rule-status (e.g. using ignore_above, but I guess it's only for keywords?)
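
The truncation steps above could be sketched roughly as follows. This is a minimal illustration, not the actual Kibana implementation: the function names, the `RuleStatusAttributes` shape, the `', '` separator and the limits (20 errors, 1024 characters) are all assumptions taken from the example values in this comment.

```typescript
// Hypothetical limits, based on the example values suggested above.
const MAX_ERRORS = 20;
const MAX_MESSAGE_LENGTH = 1024;

// Truncate on write: dedupe -> cap the list -> join -> cap the string.
function truncateErrors(
  errors: string[],
  maxErrors: number = MAX_ERRORS,
  maxLength: number = MAX_MESSAGE_LENGTH
): string {
  // 1. Keep unique messages, preserving first-seen order.
  const unique = Array.from(new Set(errors));
  // 2. Keep at most maxErrors messages.
  const limited = unique.slice(0, maxErrors);
  // 3. Join the remaining messages into one string (separator is an assumption).
  const joined = limited.join(', ');
  // 4. Cap the length of the resulting string.
  return joined.length > maxLength ? joined.slice(0, maxLength) : joined;
}

// Assumed, simplified shape of the status document attributes.
interface RuleStatusAttributes {
  lastFailureMessage?: string | null;
  lastSuccessMessage?: string | null;
  [key: string]: unknown;
}

// Cap a single optional message field, leaving null/undefined untouched.
function truncateMessage<T extends string | null | undefined>(
  message: T,
  maxLength: number = MAX_MESSAGE_LENGTH
): T | string {
  return typeof message === 'string' && message.length > maxLength
    ? message.slice(0, maxLength)
    : message;
}

// Migration step: cap the message fields of an existing status document.
function truncateStatusMessages(
  attributes: RuleStatusAttributes,
  maxLength: number = MAX_MESSAGE_LENGTH
): RuleStatusAttributes {
  return {
    ...attributes,
    lastFailureMessage: truncateMessage(attributes.lastFailureMessage, maxLength),
    lastSuccessMessage: truncateMessage(attributes.lastSuccessMessage, maxLength),
  };
}
```

Deduplicating before capping the list matters here: the same search or indexing error is likely to repeat on many loop iterations, so removing duplicates first keeps more distinct errors within the limit.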

banderror · 2021-08-25T07:26:01Z

A note on the impact and which version to target from @rudolf:

This has a high impact on users and causes downtime for them and support escalations. Even if we are already fixing some of the problems, this is also causing OOM errors for users with smaller 1GB Kibana instances, which is harder to address. At the moment it doesn't seem to be affecting a large number of users, but I suspect this will increase as more users adopt these features.

This sounds like it might be a trivial fix. If that's the case, I hope we can fix it by 7.14.2.

…detection-engine-rule-status documents (#112257)

**Ticket:** #109815

## Summary

**Background:** `siem-detection-engine-rule-status` documents stores the `lastFailureMessage` a string which is indexed as `type: "text"` but some failure messages are so large that these documents are up to 26MB. These large documents cause migrations to fail because a batch of 1000 documents easily exceed Elasticsearch's `http.max_content_length` which defaults to 100mb.

This PR truncates `lastFailureMessage` and `lastSuccessMessage` in the following cases:

1. When we write new or update existing status SOs:
   - The lists of errors/warnings are deduped -> truncated to max `20` items -> joined to a string
   - The resulting strings are truncated to max `10240` characters
2. When we migrate `siem-detection-engine-rule-status` SOs to 7.15.2:
   - The two message fields are truncated to max `10240` characters

### Checklist

Delete any items that are not applicable to this PR.

- [ ] [Unit or functional tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html) were updated or added to match the most common scenarios

…detection-engine-rule-status documents (#112257) (#115038) **Ticket:** #109815 Co-authored-by: Georgii Gorbachev <[email protected]>

…r siem-detection-engine-rule-status documents (#112257) (#115166) **Ticket:** #109815

rudolf added the Team:Detections and Resp Security Detection Response Team label Aug 24, 2021

mshustov mentioned this issue Aug 24, 2021

Migrations should dynamically adjust batch size to prevent failing on 413 errors from Elasticsearch #107641

Closed

peluja1012 added the Team: SecuritySolution Security Solutions Team working on SIEM, Endpoint, Timeline, Resolver, etc. label Aug 24, 2021

peluja1012 added bug Fixes for quality problems that affect the customer experience impact:high Addressing this issue will have a high level of impact on the quality/strength of our product. labels Aug 24, 2021

peluja1012 assigned banderror Aug 24, 2021

peluja1012 added the v7.14.2 label Aug 31, 2021

banderror mentioned this issue Sep 15, 2021

[Security Solution][Detections] Truncate lastFailureMessage for siem-detection-engine-rule-status documents #112257

Merged

1 task

banderror added v7.15.2 Team:Detection Rule Management Security Detection Rule Management Team and removed v7.14.2 labels Oct 14, 2021

banderror closed this as completed Oct 15, 2021
