-
Notifications
You must be signed in to change notification settings - Fork 8.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Truncate lastFailureMessage for siem-detection-engine-rule-status documents #109815
Comments
Pinging @elastic/security-detections-response (Team:Detections and Resp) |
Pinging @elastic/security-solution (Team: SecuritySolution) |
A note on why this can happen: Every rule execution runs a loop under the hood. During each iteration of this loop Detection Engine queries source indices (containing source events), a small time period per iteration. If it finds any events matching the rule criteria, it creates detection alerts from them and attempts to bulk index them. When querying and bulk indexing within an iteration, it’s possible that we catch exceptions. All the exceptions are being collected and returned from this loop. Then Detection Engine joins all of them (all their messages) and changes the status of the rule to failed with this So when the loop contains a lot of iterations, it’s possible to get a very large list of caught exceptions and a long status message. So we definitely need to truncate this list. |
A note on how to truncate:
|
A note on the impact and which version to target from @rudolf: This has a high impact on users and causes downtime for them and support escalations. Even if we are fixing some of the problems already this is also causing OOM errors for users with smaller 1GB Kibana instances which is harder to address. At the moment it doesn't seem to be affecting a large number of users, but I suspect this will increase as more users adopt these features. This sounds like it might be a trivial fix, if that's the case I hope we can fix it by 7.14.2 |
…detection-engine-rule-status documents (#112257) **Ticket:** #109815 ## Summary **Background:** `siem-detection-engine-rule-status` documents stores the `lastFailureMessage` a string which is indexed as `type: "text"` but some failure messages are so large that these documents are up to 26MB. These large documents cause migrations to fail because a batch of 1000 documents easily exceed Elasticsearch's `http.max_content_length` which defaults to 100mb. This PR truncates `lastFailureMessage` and `lastSuccessMessage` in the following cases: 1. When we write new or update existing status SOs: - The lists of errors/warnings are deduped -> truncated to max `20` items -> joined to a string - The resulting strings are truncated to max `10240` characters 2. When we migrate `siem-detection-engine-rule-status` SOs to 7.15.2: - The two message fields are truncated to max `10240` characters ### Checklist Delete any items that are not applicable to this PR. - [ ] [Unit or functional tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html) were updated or added to match the most common scenarios
…detection-engine-rule-status documents (elastic#112257) **Ticket:** elastic#109815 ## Summary **Background:** `siem-detection-engine-rule-status` documents stores the `lastFailureMessage` a string which is indexed as `type: "text"` but some failure messages are so large that these documents are up to 26MB. These large documents cause migrations to fail because a batch of 1000 documents easily exceed Elasticsearch's `http.max_content_length` which defaults to 100mb. This PR truncates `lastFailureMessage` and `lastSuccessMessage` in the following cases: 1. When we write new or update existing status SOs: - The lists of errors/warnings are deduped -> truncated to max `20` items -> joined to a string - The resulting strings are truncated to max `10240` characters 2. When we migrate `siem-detection-engine-rule-status` SOs to 7.15.2: - The two message fields are truncated to max `10240` characters ### Checklist Delete any items that are not applicable to this PR. - [ ] [Unit or functional tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html) were updated or added to match the most common scenarios
…detection-engine-rule-status documents (#112257) (#115038) **Ticket:** #109815 ## Summary **Background:** `siem-detection-engine-rule-status` documents stores the `lastFailureMessage` a string which is indexed as `type: "text"` but some failure messages are so large that these documents are up to 26MB. These large documents cause migrations to fail because a batch of 1000 documents easily exceed Elasticsearch's `http.max_content_length` which defaults to 100mb. This PR truncates `lastFailureMessage` and `lastSuccessMessage` in the following cases: 1. When we write new or update existing status SOs: - The lists of errors/warnings are deduped -> truncated to max `20` items -> joined to a string - The resulting strings are truncated to max `10240` characters 2. When we migrate `siem-detection-engine-rule-status` SOs to 7.15.2: - The two message fields are truncated to max `10240` characters ### Checklist Delete any items that are not applicable to this PR. - [ ] [Unit or functional tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html) were updated or added to match the most common scenarios Co-authored-by: Georgii Gorbachev <[email protected]>
…r siem-detection-engine-rule-status documents (elastic#112257) **Ticket:** elastic#109815 **Background:** `siem-detection-engine-rule-status` documents stores the `lastFailureMessage` a string which is indexed as `type: "text"` but some failure messages are so large that these documents are up to 26MB. These large documents cause migrations to fail because a batch of 1000 documents easily exceed Elasticsearch's `http.max_content_length` which defaults to 100mb. This PR truncates `lastFailureMessage` and `lastSuccessMessage` in the following cases: 1. When we write new or update existing status SOs: - The lists of errors/warnings are deduped -> truncated to max `20` items -> joined to a string - The resulting strings are truncated to max `10240` characters 2. When we migrate `siem-detection-engine-rule-status` SOs to 7.15.2: - The two message fields are truncated to max `10240` characters Delete any items that are not applicable to this PR. - [ ] [Unit or functional tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html) were updated or added to match the most common scenarios
…r siem-detection-engine-rule-status documents (#112257) (#115166) **Ticket:** #109815 **Background:** `siem-detection-engine-rule-status` documents stores the `lastFailureMessage` a string which is indexed as `type: "text"` but some failure messages are so large that these documents are up to 26MB. These large documents cause migrations to fail because a batch of 1000 documents easily exceed Elasticsearch's `http.max_content_length` which defaults to 100mb. This PR truncates `lastFailureMessage` and `lastSuccessMessage` in the following cases: 1. When we write new or update existing status SOs: - The lists of errors/warnings are deduped -> truncated to max `20` items -> joined to a string - The resulting strings are truncated to max `10240` characters 2. When we migrate `siem-detection-engine-rule-status` SOs to 7.15.2: - The two message fields are truncated to max `10240` characters Delete any items that are not applicable to this PR. - [ ] [Unit or functional tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html) were updated or added to match the most common scenarios
siem-detection-engine-rule-status
documents stores the lastFailureMessage a string which is indexed astype: "text"
but some failure messages are so large that these documents are up to 26MB.These large documents cause migrations to fail because a batch of 1000 documents easily exceed Elasticsearch's
http.max_content_length
which defaults to 100mb.Even though we're fixing the root problem with migrations in #107641 these large documents still consume a lot of resources like storage and particularly heap and in my tests easily cause Elasticsearch to run into circuit_breaking_exceptions.
Can we truncate these failure messages to limit the size of these documents and/or not index the entire contents of this field?
The text was updated successfully, but these errors were encountered: