[Security Solution][Alerts] Decouple gap detection from additional lookback #138933

Open · Tracked by #165878
marshallmain opened this issue Aug 16, 2022 · 6 comments

Labels: consider-next, discuss, enhancement, Feature:Detection Alerts, sdh-linked, Team:Detection Engine, Team: SecuritySolution, technical debt

Comments

@marshallmain (Contributor)

Security rules attempt to warn users if the rules are not running on schedule. The "additional lookback" parameter on each rule controls how much a rule is allowed to drift from its rule interval before a "gap detected" warning is written. However, the "additional lookback" parameter also controls the overlap of time ranges that consecutive rule executions will query. It's not always clear to users what a good choice of "additional lookback" would be, and in some cases we've seen users pick relatively large values which then suppress warnings about rule scheduling drift. The "additional lookback" value can also be different for each individual rule, meaning that warnings about drift can show for some rules but not others - even though all rules are likely to be drifting if any rules are.
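
To make the coupling concrete, here is a minimal sketch (hypothetical names, not the actual rule executor code) assuming each execution queries [now - interval - additional lookback, now]: the query overlap between consecutive executions and the drift allowance are the same value, so a large additional lookback silently raises how far a rule can drift before a gap is reported.

```ts
// Hypothetical illustration of the current coupling; not the actual executor code.
// Each execution queries [now - interval - additionalLookback, now], so consecutive
// executions overlap by additionalLookback when the rule runs exactly on schedule.
interface ExecutionTiming {
  previousStart: number; // epoch ms when the previous execution started
  currentStart: number; // epoch ms when the current execution started
  intervalMs: number; // the rule's configured interval
  additionalLookbackMs: number; // the rule's "additional lookback" parameter
}

// Returns the unqueried time in ms; a non-zero result triggers the "gap detected" warning.
function detectGap({ previousStart, currentStart, intervalMs, additionalLookbackMs }: ExecutionTiming): number {
  const driftMs = currentStart - (previousStart + intervalMs);
  // The gap only becomes visible once drift exceeds the additional lookback,
  // because until then the overlapping query ranges still cover all source data.
  return Math.max(0, driftMs - additionalLookbackMs);
}

// A 5m-interval rule with a 30m additional lookback can drift 20 minutes with no warning:
detectGap({
  previousStart: 0,
  currentStart: 25 * 60 * 1000, // started 20 minutes late
  intervalMs: 5 * 60 * 1000,
  additionalLookbackMs: 30 * 60 * 1000,
}); // => 0, no gap reported
```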

In one SDH we saw a customer choose additional lookback values of 30m or 1h for many of their rules, while using rule intervals of around 1-5 minutes. As they enabled more rules, they exceeded the number of rules per minute Kibana could process and rules started to get backed up and drift away from their scheduled execution times. However, the long additional lookback values suppressed any warnings about this drift in the Security Solution UI.

We should try to warn users more consistently any time their rules are not running on schedule. There are at least 3 ways we might implement this. All 3 solutions below would compute drift and drift tolerance in the same way for all rules in a cluster, which would resolve the problem where some rules can report drift warnings while others don't, even though they all drift the same way.

1. Add a "security solution drift tolerance" config and check the drift compared to this tolerance in the rule executors

This is the simplest to implement and avoids exposing framework implementation details (poll_interval) to solutions. The default value of "security solution drift tolerance" would be something like 5 seconds, slightly longer than the default poll_interval. Any users that increase the poll interval config value would need to increase the drift tolerance as well to avoid getting false positive warnings about drift.
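
A minimal sketch of what the executor-side check could look like, assuming a hypothetical xpack.securitySolution.driftToleranceMs setting (the name is illustrative only); option (2) below would be the same comparison with Task Manager's poll_interval as the tolerance:

```ts
// Sketch of option (1): compare drift against a solution-level tolerance.
// `xpack.securitySolution.driftToleranceMs` is an illustrative setting name only.
// Option (2) would be the identical comparison with Task Manager's poll_interval
// as the tolerance instead of a dedicated config value.
const DEFAULT_DRIFT_TOLERANCE_MS = 5_000; // slightly longer than the default poll_interval

function getDriftWarning(
  scheduledStartMs: number,
  actualStartMs: number,
  toleranceMs: number = DEFAULT_DRIFT_TOLERANCE_MS
): string | undefined {
  const driftMs = actualStartMs - scheduledStartMs;
  return driftMs > toleranceMs
    ? `Rule execution started ${driftMs}ms after its scheduled time (tolerance ${toleranceMs}ms)`
    : undefined;
}
```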

2. Expose the Task Manager poll_interval config to rule executors and check drift compared to poll_interval

This reduces the number of configuration options in kibana.yml, but couples the Security Solution rule implementation to a specific config setting in Task Manager. Coupling the solution implementation to details of the framework is not ideal.

3. Implement gap detection in the Task Manager or alerting framework by comparing drift to poll_interval

This solution would also reduce configuration options compared to (1), while avoiding coupling between the solution and framework implementations. However, it would require more technical changes to move that logic out of the security rule executors while still exposing the resulting gap information to the solution for gap remediation (i.e. running the core search logic multiple times over different time ranges to cover the gap).
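
For reference, a rough sketch of the remediation idea mentioned above, i.e. splitting a detected gap into interval-sized chunks and running the core search once per chunk (hypothetical helpers, not the existing Security Solution implementation):

```ts
// Hypothetical helpers illustrating gap remediation: split the detected gap into
// interval-sized chunks and run the same core search once per chunk.
interface TimeRange {
  from: number; // epoch ms, inclusive
  to: number; // epoch ms, exclusive
}

function splitGapIntoRanges(gapStart: number, gapEnd: number, intervalMs: number): TimeRange[] {
  const ranges: TimeRange[] = [];
  for (let from = gapStart; from < gapEnd; from += intervalMs) {
    ranges.push({ from, to: Math.min(from + intervalMs, gapEnd) });
  }
  return ranges;
}

async function remediateGap(
  gapStart: number,
  gapEnd: number,
  intervalMs: number,
  runCoreSearch: (range: TimeRange) => Promise<void>
): Promise<void> {
  for (const range of splitGapIntoRanges(gapStart, gapEnd, intervalMs)) {
    await runCoreSearch(range); // same search logic, different time range
  }
}
```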

marshallmain added the discuss, enhancement, Team: SecuritySolution, Feature:Detection Alerts, and Team:Detection Alerts labels on Aug 16, 2022
@elasticmachine (Contributor)

Pinging @elastic/security-solution (Team: SecuritySolution)

@kobelb (Contributor) commented Aug 16, 2022

As of 8.4, we are now exposing the rule/action queue length and queue duration in Stack Monitoring:

[Screenshot: rule/action queue length and queue duration charts in Stack Monitoring]

This was the first phase in providing external observability for alerting rules, and it allows users to monitor their system for drift in alerting rule execution. It also allows users to create alerting rules on their alerting health so they can be notified when alerting rules drift beyond their threshold.

If we don't want to rely on Stack Monitoring for this and want to display something in the cluster that is running the alerting rules, that's fine also. I would prefer that we do this at the framework level and that we don't tie it to the poll interval, but instead allow users to configure an explicit drift tolerance, similar to option (1). A delay slightly greater than the poll interval is tolerable for all alerting rules. For security detection rules, which should be generating all alerts, gap remediation makes a slight delay completely tolerable; observability rules are concerned with the current state of the systems being monitored, so slight delays are negligible. However, large delays are a concern for both security solution and observability rules.

@marshallmain (Contributor, Author) commented Aug 17, 2022

On the security side we'd definitely like to be able to display the gap warnings in the Security Rule management UI. We currently report a failure when the gap is too large to be covered by the automated remediation.
[Screenshot: rule failure shown in the Security Solution UI when the gap is too large to remediate]

However, it would be useful for us to report a warning any time the drift exceeds the configured drift tolerance, even if it can be automatically remediated. Our Rule Execution log UI automatically picks up any framework level errors (and I assume warnings, but I haven't tested that).

If we implement this at the framework level, could we write a warning in the framework event log when excessive drift is detected?

@banderror (Contributor)

@marshallmain I think it's the right direction overall. Some thoughts and questions:

UX questions:

  • Have you discussed it with the PM and UX folks?
  • Does there necessarily need to be a "tolerance" value? Why make it binary?
  • Let's say we compared the drift and tolerance values. How would we process this result and then expose it in the UI?
  • What about gap detection in its current form? Gaps are not the same as drift because they cover rule downtime and missed source events (e.g. when the Kibana process crashes)

@marshallmain (Contributor, Author) commented Aug 31, 2022

Summary from today's meeting on this issue (Framework-level drift detection and warning):

Glossary

  • Global warning: A warning that is not specific to any particular rule. It does not need to be global in the sense of appearing on every page in Kibana; it could appear everywhere, or only on certain pages such as the Detection Rules Management page.

We should continue to discuss the best form for this warning about rules not running on schedule. One option is to display a global warning somewhere (detailed below) if rules are not running on schedule, since we expect that in most cases many rules will start to drift from their schedules concurrently. The other option is to display a warning on each rule that is not running on schedule.

Where we choose to display the warning should take into account the actions we expect users to take to resolve the warning. Per @jethr0null, when there are too many rules to run on schedule the first preferred option is to guide users to provision more resources for Kibana. If provisioning more resources is not feasible for the user, the secondary option would be to disable some of their rules.

New proposed warning location

The initial proposal in this issue was to warn users on each rule if that particular rule is not running on schedule. However, @mikecote suggested that we could potentially warn users sooner and with better context if we create a "global" warning when rules are not running on schedule. The warning might display at the top of the Rules Management page in the Security Solution, for example, but might also display in other places in Kibana like Stack Management or Observability. A global warning would fit better with the expected use case here, since a rule drifting from its schedule is typically not actually a problem with the specific rule experiencing drift, but instead means that Kibana is under provisioned. In a global warning we could also include a list of the rules that are taking the most execution time in Kibana, providing quick insight into possible root causes for the warning.

How exactly we would compute the global warning is still an open question, and there are a few options we discussed - though the list below is not an exhaustive list of viable options and we should continue exploring these and others.

  • Reuse the mechanism that is developed for the future Kibana autoscaling project, as that should take the load from rules into account when determining if more/larger Kibana instances are needed. The warning could appear when it detects that more instances are needed, and the warning could automatically resolve if autoscaling is enabled. If autoscaling is disabled, it could provide guidance to manually scale Kibana.
  • Warn if taskScheduleDelay exceeds some threshold configured in kibana.yml. This may be susceptible to a thundering herd problem and generate a warning if many rules are enabled simultaneously. The warning would disappear if Kibana has enough capacity for the enabled rules on average as the rules spread out. This option tries to get the warning in front of the user as quickly as possible, at the cost of possible false positives.
  • Warn if the interval between subsequent rule executions exceeds the configured rule interval by more than some threshold value. This option is similar to the one above, but could reduce false positive warnings in the thundering herd scenario by comparing the start times of subsequent executions of the same rule. The warning would not appear until at least one rule executes for the second time. (Both threshold-based options are sketched below.)
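
A rough sketch of the two threshold-based options above (field and function names are illustrative, not existing Task Manager APIs):

```ts
// Illustrative only; `taskScheduleDelayMs` and the thresholds are hypothetical inputs.

// Option A: warn as soon as the task schedule delay exceeds a configured threshold.
// Fast feedback, but a thundering herd of newly enabled rules can trip it even
// when steady-state capacity is fine.
function warnOnScheduleDelay(taskScheduleDelayMs: number, thresholdMs: number): boolean {
  return taskScheduleDelayMs > thresholdMs;
}

// Option B: compare start times of consecutive executions of the same rule.
// Only fires once a rule has executed at least twice, which reduces false
// positives while simultaneously enabled rules spread out.
function warnOnExecutionInterval(
  previousStartMs: number,
  currentStartMs: number,
  configuredIntervalMs: number,
  thresholdMs: number
): boolean {
  return currentStartMs - previousStartMs > configuredIntervalMs + thresholdMs;
}
```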

Possible to layer warnings

We are able to have both a global warning and warnings specific to individual rules. This may be useful if individual rules can run off-schedule for reasons other than Kibana being under provisioned. We could then separate the global "Kibana is under provisioned" warning from individual rule "running off schedule" warnings.

When layered, we could take the approach that:

  • A single rule running off schedule is a problem on its own, so we have a warning at the individual rule level
  • Many rules running off schedule would indicate Kibana is under provisioned, so a global warning would appear and advise users to increase Kibana resources or disable some rules

Warning users when they're close to capacity

It could be useful to warn users when they're close to capacity instead of waiting until rules are running off schedule. This likely ties closely to the autoscaling work.

Displaying global warnings

While we would like any of these warnings to be computed at the stack level, we want to be able to display the warning inside solutions. For the global warning, one option is to include it in the alerting health API and have solutions call the API and display the warning as appropriate.
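
As a sketch of what solution-side consumption could look like if the health API were extended (the drift field below does not exist in today's GET /api/alerting/_health response and is purely illustrative):

```ts
// Hypothetical solution-side consumption; the `drift` field does not exist in the
// current GET /api/alerting/_health response and would need to be added.
interface AlertingHealthWithDrift {
  drift?: {
    rulesBehindSchedule: number;
    maxDriftMs: number;
  };
}

async function getGlobalDriftWarning(
  fetchAlertingHealth: () => Promise<AlertingHealthWithDrift>
): Promise<string | undefined> {
  const { drift } = await fetchAlertingHealth();
  if (drift && drift.rulesBehindSchedule > 0) {
    return `${drift.rulesBehindSchedule} rules are running behind schedule (max drift ${drift.maxDriftMs}ms)`;
  }
  return undefined;
}
```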

Open questions

  • How does autoscaling relate to this issue? Especially regarding reuse of the mechanism for determining when Kibana needs to scale, and the release timing of the autoscaling work.
  • Is "rules not running on schedule" a problem on its own that could have multiple root causes, e.g. an individual rule has some issue that makes it run too late, or are we simply using it as a proxy to determine when Kibana is under provisioned? If we're just using it as a proxy for overall system resource provisioning, a global warning makes sense. If instead a rule could run off-schedule for other reasons, we may want to have the ability to warn on specific rules that are off schedule.
  • If Kibana is under provisioned and all rules run off schedule, would it be a problem if every rule generates a warning on each rule execution? What's the desired behavior in this situation?

@mikecote (Contributor) commented Sep 7, 2022

Thanks for the recap @marshallmain! I'm interested to see, from a security product perspective, what the expectation of the user is once they see this warning (disable rules, scale Kibana, contact their administrators, simply be informed that rules are behind, etc.).

yctercero added the Team:Detection Engine and technical debt labels and removed the Team:Detection Alerts label on May 13, 2023