Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Security Solution] detection_engine/rules/_find returns 500 when query matches many rules #119853

Closed
xcrzx opened this issue Nov 29, 2021 · 6 comments · Fixed by #121110
Closed
Assignees
Labels
bug Fixes for quality problems that affect the customer experience Feature:Rule Management Security Solution Detection Rule Management area impact:high Addressing this issue will have a high level of impact on the quality/strength of our product. Team:Detection Rule Management Security Detection Rule Management Team Team:Detections and Resp Security Detection Response Team Team: SecuritySolution Security Solutions Team working on SIEM, Endpoint, Timeline, Resolver, etc. v8.0.0 v8.1.0

Comments

@xcrzx
Copy link
Contributor

xcrzx commented Nov 29, 2021

Describe the bug:

detection_engine/rules/_find API endpoint returns a 500 error when the provided query matches many rules.

Steps to reproduce:

  1. Create a large number of detection rules. In my case, there were ~ 2100 rules. You can use bulk duplicate action to do that quickly.
  2. curl 'http://localhost:5601/kbn/api/detection_engine/rules/_find?page=1&per_page=10000&sort_field=enabled&sort_order=desc'

Response: {"message":"An internal server error occurred","status_code":500}

Any additional context (logs, chat logs, magical formulas, etc.):

Inside the detection_engine/rules/_find route we call savedObjectsClient.find and pass all found ruleIds to it:

await savedObjectsClient.find<LegacyIRuleActionsAttributesSavedObjectAttributes>({
    type: legacyRuleActionsSavedObjectType,
    perPage: 10000,
    hasReference: references, // <-we pass all ruleIds as a references array (~2100 items)
});

That call later transforms into the following Elasticsearch query:

{
  "method": "POST",
  "path": "/.kibana_8.1.0/_search",
  "body": {
    "size": 10000,
    "seq_no_primary_term": true,
    "from": 0,
    "query": {
      "bool": {
        "filter": [
          {
            "bool": {
              "should": [
                {
                  "nested": {
                    "path": "references",
                    "query": {
                      "bool": {
                        "must": [
                          {
                            "term": {
                              "references.id": "8ad6dded-47b6-11ec-95d4-55575fdf691c"
                            }
                          },
                          { "term": { "references.type": "alert" } }
                        ]
                      }
                    }
                  }
                }
                // ~2100 other should clauses
              ],
              "minimum_should_match": 1
            }
          },
          {
            "bool": {
              "should": [
                {
                  "bool": {
                    "must": [
                      {
                        "term": { "type": "siem-detection-engine-rule-actions" }
                      },
                      { "terms": { "namespaces": ["default", "*"] } }
                    ],
                    "must_not": [{ "exists": { "field": "namespace" } }]
                  }
                }
              ],
              "minimum_should_match": 1
            }
          }
        ]
      }
    }
  }
}

Which results into an error:

all shards failed: search_phase_execution_exception: [too_many_nested_clauses] 
Reason: Query contains too many nested clauses; 
maxClauseCount is set to 4096
@xcrzx xcrzx added bug Fixes for quality problems that affect the customer experience triage_needed v8.0.0 Team:Detections and Resp Security Detection Response Team Team: SecuritySolution Security Solutions Team working on SIEM, Endpoint, Timeline, Resolver, etc. v8.1.0 Team:Detection Rule Management Security Detection Rule Management Team labels Nov 29, 2021
@elasticmachine
Copy link
Contributor

Pinging @elastic/security-detections-response (Team:Detections and Resp)

@elasticmachine
Copy link
Contributor

Pinging @elastic/security-solution (Team: SecuritySolution)

@pgayvallet
Copy link
Contributor

I opened #120900 to discuss the possible fix at the SavedObjectRepository.find level.

@kobelb
Copy link
Contributor

kobelb commented Dec 9, 2021

@xcrzx can we make legacyGetBulkRuleActionsSavedObject make multiple requests to SavedObjectsClient#find with a smaller number of alerts?

Also, do we need to load 10,000 detection rules? That's going to consume a lot of memory, and users almost certainly won't want to be viewing all 10,000 detection rules.

@pgayvallet
Copy link
Contributor

Also, do we need to load 10,000 detection rules? That's going to consume a lot of memory, and users almost certainly won't want to be viewing all 10,000 detection rules.

Yea, that was going to be my next question. If the request is initiated from the UI, I'm assuming adding pagination would solve this altogether? Displaying 10k, or even 1k, item in a simple page of a table doesn't make much sense in term of usability?

@xcrzx
Copy link
Contributor Author

xcrzx commented Dec 13, 2021

can we make legacyGetBulkRuleActionsSavedObject make multiple requests to SavedObjectsClient#find with a smaller number of alerts?

Yes, it's possible. Still, I think it would be good to optimize the ES query sent by SavedObjectsClient.find. Currently, it's putting every reference.id in its own clause, which generates 200Kb JSON per 1000 rules requested. That creates unnecessary pressure on the network, increases parsing time, etc.

So optimizing the query on the SO client side + chunking requests on the solution side would make the most sense, IMO.

Also, do we need to load 10,000 detection rules? That's going to consume a lot of memory, and users almost certainly won't want to be viewing all 10,000 detection rules.

Yea, that was going to be my next question. If the request is initiated from the UI, I'm assuming adding pagination would solve this altogether? Displaying 10k, or even 1k, item in a simple page of a table doesn't make much sense in term of usability?

We need all rules to implement sorting and filtering. Unfortunately, current Alerting framework limitations don't allow us to do that on the server side, so we have to work around these limitations by fetching all rules and filtering/sorting them on the client side. Of course, we are aware of all flaws and scalability issues of in-memory implementation. Still, at this point, we decided to go with it as another option would be to not deliver very basic features like sorting and filtering to our users at all.

@banderror banderror added impact:high Addressing this issue will have a high level of impact on the quality/strength of our product. Feature:Rule Management Security Solution Detection Rule Management area labels Dec 13, 2021
@xcrzx xcrzx linked a pull request Dec 14, 2021 that will close this issue
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Fixes for quality problems that affect the customer experience Feature:Rule Management Security Solution Detection Rule Management area impact:high Addressing this issue will have a high level of impact on the quality/strength of our product. Team:Detection Rule Management Security Detection Rule Management Team Team:Detections and Resp Security Detection Response Team Team: SecuritySolution Security Solutions Team working on SIEM, Endpoint, Timeline, Resolver, etc. v8.0.0 v8.1.0
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants