Truncate lastFailureMessage for siem-detection-engine-rule-status documents #109815

Closed
rudolf opened this issue Aug 24, 2021 · 5 comments · Fixed by #112257, #115038 or #115166
Assignees
Labels
bug (Fixes for quality problems that affect the customer experience)
impact:high (Addressing this issue will have a high level of impact on the quality/strength of our product.)
Team:Detection Rule Management (Security Detection Rule Management Team)
Team:Detections and Resp (Security Detection Response Team)
Team: SecuritySolution (Security Solutions Team working on SIEM, Endpoint, Timeline, Resolver, etc.)
v7.15.2

Comments

@rudolf
Contributor

rudolf commented Aug 24, 2021

siem-detection-engine-rule-status documents store lastFailureMessage, a string that is indexed as type: "text", but some failure messages are so large that these documents grow to as much as 26MB.

These large documents cause migrations to fail because a batch of 1000 documents easily exceeds Elasticsearch's http.max_content_length, which defaults to 100mb.
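To make the arithmetic concrete, here is a rough back-of-the-envelope sketch. It assumes the 26MB figure is the largest document size, treats MB as 1024 * 1024 bytes, and ignores any per-request overhead; the variable names are only illustrative.

```typescript
// Back-of-the-envelope estimate only; all names here are illustrative.
const MB = 1024 * 1024;
const maxContentLengthBytes = 100 * MB; // default http.max_content_length (100mb)
const largestDocBytes = 26 * MB; // largest document size reported in this issue
const migrationBatchSize = 1000; // documents per migration batch

// Worst case: every document in the batch is as large as the largest one.
const worstCaseBatchBytes = migrationBatchSize * largestDocBytes;

// How many worst-case documents fit into a single request.
const docsThatFit = Math.floor(maxContentLengthBytes / largestDocBytes);

// Average document size at which a full batch reaches the limit (about 102 KiB).
const avgDocBytesAtLimit = maxContentLengthBytes / migrationBatchSize;
```

Under these assumptions only three of the largest documents fit into one request, and a full batch of 1000 hits the limit once the average document is only around 100 KiB.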

Even though we're fixing the root problem with migrations in #107641, these large documents still consume a lot of resources, such as storage and particularly heap, and in my tests they easily cause Elasticsearch to run into circuit_breaking_exceptions.

Can we truncate these failure messages to limit the size of these documents and/or not index the entire contents of this field?

@rudolf rudolf added the Team:Detections and Resp Security Detection Response Team label Aug 24, 2021
@elasticmachine
Contributor

Pinging @elastic/security-detections-response (Team:Detections and Resp)

@peluja1012 peluja1012 added the Team: SecuritySolution Security Solutions Team working on SIEM, Endpoint, Timeline, Resolver, etc. label Aug 24, 2021
@elasticmachine
Contributor

Pinging @elastic/security-solution (Team: SecuritySolution)

@peluja1012 peluja1012 added bug Fixes for quality problems that affect the customer experience impact:high Addressing this issue will have a high level of impact on the quality/strength of our product. labels Aug 24, 2021
@banderror
Contributor

A note on why this can happen:

Every rule execution runs a loop under the hood. During each iteration of this loop, Detection Engine queries the source indices (containing source events) for a small time period. If it finds any events matching the rule criteria, it creates detection alerts from them and attempts to bulk index them.

When querying and bulk indexing within an iteration, it’s possible that we catch exceptions. All the exceptions are collected and returned from this loop. Detection Engine then joins all of them (all their messages) and changes the status of the rule to failed with the "Bulk Indexing of signals failed" message. So the errors listed there are not necessarily all indexing errors; the message can contain search errors as well.

When the loop runs a lot of iterations, it’s possible to get a very large list of caught exceptions and therefore a very long status message. So we definitely need to truncate this list.
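The behaviour described above can be sketched roughly as follows. This is not the Detection Engine code: `IterationOutcome`, `runAllIterations`, and the exact format of the joined message are hypothetical and exist only to show why the status message grows with the number of iterations.

```typescript
// Hypothetical outcome of one loop iteration; not Kibana's real types.
interface IterationOutcome {
  createdAlerts: number;
  errors: string[]; // search or bulk indexing errors reported by the iteration
}

// Runs every iteration, collects all error messages (returned or thrown),
// and joins them into a single status message without any truncation.
function runAllIterations(
  iterations: Array<() => IterationOutcome>
): { status: 'succeeded' | 'failed'; message: string } {
  const collected: string[] = [];
  for (const runIteration of iterations) {
    try {
      const outcome = runIteration();
      for (const err of outcome.errors) {
        collected.push(err);
      }
    } catch (e) {
      collected.push(e instanceof Error ? e.message : String(e));
    }
  }
  if (collected.length === 0) {
    return { status: 'succeeded', message: '' };
  }
  // Assumed message format: a fixed prefix followed by every collected error.
  return {
    status: 'failed',
    message: 'Bulk Indexing of signals failed: ' + collected.join(', '),
  };
}
```

Because nothing is deduplicated or capped, the same error repeated across many iterations makes the message grow linearly with the iteration count.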

@banderror
Contributor

A note on how to truncate:

  1. Truncate on write:
    • get unique error messages from the list
    • truncate the list (leave max 10-20 errors, for example)
    • join its items to a string
    • truncate the resulting string (leave max 1024 characters, for example)
  2. Write a migration for siem-detection-engine-rule-status that would truncate existing status documents:
    • truncate the existing message fields (leave max 1024 characters, for example)
  3. Consider adjusting the mappings of siem-detection-engine-rule-status (e.g. using ignore_above, but I guess it's only for keywords?)
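A minimal sketch of the "truncate on write" steps from item 1, using the example limits suggested above (20 errors, 1024 characters). The function names and the `', '` separator are assumptions for illustration and do not come from the Kibana codebase.

```typescript
// Example limits from the list above; the real values are a design decision.
const EXAMPLE_MAX_ERRORS = 20;
const EXAMPLE_MAX_MESSAGE_LENGTH = 1024;

// Keeps the first occurrence of each message, then caps the list length.
function dedupeAndLimit(messages: string[], maxItems: number = EXAMPLE_MAX_ERRORS): string[] {
  const unique = messages.filter((msg, index, all) => all.indexOf(msg) === index);
  return unique.slice(0, maxItems);
}

// Caps a string at maxLength UTF-16 code units (what String.prototype.slice counts).
function truncateText(text: string, maxLength: number = EXAMPLE_MAX_MESSAGE_LENGTH): string {
  return text.length > maxLength ? text.slice(0, maxLength) : text;
}

// Full pipeline: unique messages -> truncated list -> joined string -> truncated string.
function buildFailureMessage(errors: string[]): string {
  return truncateText(dedupeAndLimit(errors).join(', '));
}
```

Truncating both the list and the final string matters: capping only the number of errors still lets a single huge error through, and capping only the string wastes the budget on repeated copies of the same error.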

@banderror
Contributor

A note on the impact and which version to target from @rudolf:

This has a high impact on users: it causes downtime for them and support escalations. Even though we are already fixing some of the problems, this is also causing OOM errors for users with smaller 1GB Kibana instances, which is harder to address. At the moment it doesn't seem to be affecting a large number of users, but I suspect this will increase as more users adopt these features.

This sounds like it might be a trivial fix. If that's the case, I hope we can fix it by 7.14.2.

@banderror banderror added v7.15.2 Team:Detection Rule Management Security Detection Rule Management Team and removed v7.14.2 labels Oct 14, 2021
banderror added a commit that referenced this issue Oct 14, 2021
…detection-engine-rule-status documents (#112257)

**Ticket:** #109815

## Summary

**Background:** `siem-detection-engine-rule-status` documents store `lastFailureMessage`, a string that is indexed as `type: "text"`, but some failure messages are so large that these documents grow to as much as 26MB. These large documents cause migrations to fail because a batch of 1000 documents easily exceeds Elasticsearch's `http.max_content_length`, which defaults to 100mb.

This PR truncates `lastFailureMessage` and `lastSuccessMessage` in the following cases:

1. When we write new or update existing status SOs:
    - The lists of errors/warnings are deduped -> truncated to max `20` items -> joined to a string
    - The resulting strings are truncated to max `10240` characters
2. When we migrate `siem-detection-engine-rule-status` SOs to 7.15.2:
    - The two message fields are truncated to max `10240` characters
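The migration step in item 2 could look roughly like the sketch below. It assumes a simplified saved-object shape and is not the actual Kibana migration: `RuleStatusDoc` and `truncateStatusMessages` are hypothetical names, and only the `10240` limit and the two field names come from the summary above.

```typescript
// Limit stated in the PR summary for both message fields.
const MIGRATION_MAX_MESSAGE_LENGTH = 10240;

// Hypothetical, simplified document shape; Kibana's real saved-object types differ.
interface RuleStatusDoc {
  id: string;
  attributes: {
    lastFailureMessage?: string | null;
    lastSuccessMessage?: string | null;
    [key: string]: unknown;
  };
}

// Returns a copy of the document with both message fields capped in length.
// Missing or null fields and all other attributes are left untouched.
function truncateStatusMessages(doc: RuleStatusDoc): RuleStatusDoc {
  const attributes = { ...doc.attributes };
  if (typeof attributes.lastFailureMessage === 'string') {
    attributes.lastFailureMessage = attributes.lastFailureMessage.slice(
      0,
      MIGRATION_MAX_MESSAGE_LENGTH
    );
  }
  if (typeof attributes.lastSuccessMessage === 'string') {
    attributes.lastSuccessMessage = attributes.lastSuccessMessage.slice(
      0,
      MIGRATION_MAX_MESSAGE_LENGTH
    );
  }
  return { ...doc, attributes };
}
```

The function returns a new object instead of mutating its input, which keeps the transformation easy to unit test.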

### Checklist

Delete any items that are not applicable to this PR.

- [ ] [Unit or functional tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html) were updated or added to match the most common scenarios
kibanamachine pushed a commit to kibanamachine/kibana that referenced this issue Oct 14, 2021
…detection-engine-rule-status documents (elastic#112257)

kibanamachine added a commit that referenced this issue Oct 14, 2021
…detection-engine-rule-status documents (#112257) (#115038)

Co-authored-by: Georgii Gorbachev <[email protected]>
banderror added a commit to banderror/kibana that referenced this issue Oct 15, 2021
…r siem-detection-engine-rule-status documents (elastic#112257)

banderror added a commit that referenced this issue Oct 15, 2021
…r siem-detection-engine-rule-status documents (#112257) (#115166)

4 participants