[Security Solution][Alerts] Decouple gap detection from additional lookback #138933

Open · Tracked by #165878
marshallmain opened this issue Aug 16, 2022 · 6 comments

Labels: consider-next, discuss, enhancement, Feature:Detection Alerts, sdh-linked, Team:Detection Engine, Team: SecuritySolution, technical debt

Comments

@marshallmain (Contributor)

Security rules attempt to warn users if the rules are not running on schedule. The "additional lookback" parameter on each rule controls how much a rule is allowed to drift from its rule interval before a "gap detected" warning is written. However, the "additional lookback" parameter also controls the overlap of time ranges that consecutive rule executions will query. It's not always clear to users what a good choice of "additional lookback" would be, and in some cases we've seen users pick relatively large values which then suppress warnings about rule scheduling drift. The "additional lookback" value can also be different for each individual rule, meaning that warnings about drift can show for some rules but not others - even though all rules are likely to be drifting if any rules are.
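
To make the coupling concrete, here is a minimal sketch (hypothetical names, not the actual rule executor code) assuming each execution queries [now - interval - additional lookback, now]: the query overlap between consecutive executions and the drift allowance are the same value, so a large additional lookback silently raises how far a rule can drift before a gap is reported.

```ts
// Hypothetical illustration of the current coupling; not the actual executor code.
// Each execution queries [now - interval - additionalLookback, now], so consecutive
// executions overlap by additionalLookback when the rule runs exactly on schedule.
interface ExecutionTiming {
  previousStart: number; // epoch ms when the previous execution started
  currentStart: number; // epoch ms when the current execution started
  intervalMs: number; // the rule's configured interval
  additionalLookbackMs: number; // the rule's "additional lookback" parameter
}

// Returns the unqueried time in ms; a non-zero result triggers the "gap detected" warning.
function detectGap({ previousStart, currentStart, intervalMs, additionalLookbackMs }: ExecutionTiming): number {
  const driftMs = currentStart - (previousStart + intervalMs);
  // The gap only becomes visible once drift exceeds the additional lookback,
  // because until then the overlapping query ranges still cover all source data.
  return Math.max(0, driftMs - additionalLookbackMs);
}

// A 5m-interval rule with a 30m additional lookback can drift 20 minutes with no warning:
detectGap({
  previousStart: 0,
  currentStart: 25 * 60 * 1000, // started 20 minutes late
  intervalMs: 5 * 60 * 1000,
  additionalLookbackMs: 30 * 60 * 1000,
}); // => 0, no gap reported
```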

In one SDH we saw a customer choose additional lookback values of 30m or 1h for many of their rules, while using rule intervals of around 1-5 minutes. As they enabled more rules, they exceeded the number of rules per minute Kibana could process and rules started to get backed up and drift away from their scheduled execution times. However, the long additional lookback values suppressed any warnings about this drift in the Security Solution UI.

We should try to warn users more consistently any time their rules are not running on schedule. There are at least 3 ways we might implement this. All 3 solutions below would compute drift and drift tolerance in the same way for all rules in a cluster, which would resolve the problem where some rules can report drift warnings while others don't, even though they all drift the same way.

1. Add a "security solution drift tolerance" config and check the drift compared to this tolerance in the rule executors

This is the simplest to implement and avoids exposing framework implementation details (poll_interval) to solutions. The default value of "security solution drift tolerance" would be something like 5 seconds, slightly longer than the default poll_interval. Any users that increase the poll interval config value would need to increase the drift tolerance as well to avoid getting false positive warnings about drift.
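
A minimal sketch of what the executor-side check could look like, assuming a hypothetical xpack.securitySolution.driftToleranceMs setting (the name is illustrative only); option (2) below would be the same comparison with Task Manager's poll_interval as the tolerance:

```ts
// Sketch of option (1): compare drift against a solution-level tolerance.
// `xpack.securitySolution.driftToleranceMs` is an illustrative setting name only.
// Option (2) would be the identical comparison with Task Manager's poll_interval
// as the tolerance instead of a dedicated config value.
const DEFAULT_DRIFT_TOLERANCE_MS = 5_000; // slightly longer than the default poll_interval

function getDriftWarning(
  scheduledStartMs: number,
  actualStartMs: number,
  toleranceMs: number = DEFAULT_DRIFT_TOLERANCE_MS
): string | undefined {
  const driftMs = actualStartMs - scheduledStartMs;
  return driftMs > toleranceMs
    ? `Rule execution started ${driftMs}ms after its scheduled time (tolerance ${toleranceMs}ms)`
    : undefined;
}
```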

2. Expose the Task Manager poll_interval config to rule executors and check drift compared to poll_interval

This reduces the number of configuration options in kibana.yml, but couples the Security Solution rule implementation to a specific config setting in Task Manager. Coupling the solution implementation to details of the framework is not ideal.

3. Implement gap detection in the Task Manager or alerting framework by comparing drift to poll_interval

This solution would also reduce configuration options compared to (1), while avoiding coupling between the solution and framework implementations. However, it would require more technical changes to move that logic out of the security rule executors while still exposing the resulting gap information to the solution for gap remediation (i.e. running the core search logic multiple times over different time ranges to cover the gap).
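
For reference, a rough sketch of the remediation idea mentioned above, i.e. splitting a detected gap into interval-sized chunks and running the core search once per chunk (hypothetical helpers, not the existing Security Solution implementation):

```ts
// Hypothetical helpers illustrating gap remediation: split the detected gap into
// interval-sized chunks and run the same core search once per chunk.
interface TimeRange {
  from: number; // epoch ms, inclusive
  to: number; // epoch ms, exclusive
}

function splitGapIntoRanges(gapStart: number, gapEnd: number, intervalMs: number): TimeRange[] {
  const ranges: TimeRange[] = [];
  for (let from = gapStart; from < gapEnd; from += intervalMs) {
    ranges.push({ from, to: Math.min(from + intervalMs, gapEnd) });
  }
  return ranges;
}

async function remediateGap(
  gapStart: number,
  gapEnd: number,
  intervalMs: number,
  runCoreSearch: (range: TimeRange) => Promise<void>
): Promise<void> {
  for (const range of splitGapIntoRanges(gapStart, gapEnd, intervalMs)) {
    await runCoreSearch(range); // same search logic, different time range
  }
}
```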

marshallmain added the discuss, enhancement, Team: SecuritySolution, Feature:Detection Alerts, and Team:Detection Alerts labels on Aug 16, 2022
@elasticmachine (Contributor)

Pinging @elastic/security-solution (Team: SecuritySolution)

@kobelb (Contributor) commented Aug 16, 2022

As of 8.4, we are now exposing the rule/action queue length and queue duration in Stack Monitoring:

[Screenshot: rule/action queue length and queue duration charts in Stack Monitoring]

This was the first phase in providing external observability for alerting rules, and it allows users to monitor their system for drift in alerting rule execution. It also allows users to create alerting rules on their alerting health so they can be notified when alerting rules drift beyond their threshold.

If we don't want to rely on Stack Monitoring for this and want to display something in the cluster that is running the alerting rules, that's fine also. I would prefer that we do this at the framework level and that we don't tie it to the poll interval, but instead allow users to configure an explicit drift tolerance, similar to option (1). A delay slightly greater than the poll interval is tolerable for all alerting rules. For security detection rules, which should be generating all alerts, gap remediation makes a slight delay completely tolerable; observability rules are concerned with the current state of the systems being monitored, so slight delays are negligible. However, large delays are a concern for both security solution and observability rules.

@marshallmain (Contributor, Author) commented Aug 17, 2022

On the security side we'd definitely like to be able to display the gap warnings in the Security Rule management UI. We currently report a failure when the gap is too large to be covered by the automated remediation.
[Screenshot: rule failure shown in the Security Solution UI when the gap is too large to remediate]

However, it would be useful for us to report a warning any time the drift exceeds the configured drift tolerance, even if it can be automatically remediated. Our Rule Execution log UI automatically picks up any framework level errors (and I assume warnings, but I haven't tested that).

If we implement this at the framework level, could we write a warning in the framework event log when excessive drift is detected?

@banderror (Contributor)

@marshallmain I think it's the right direction overall. Some thoughts and questions:

UX questions:

  • Have you discussed it with the PM and UX folks?
  • Does there necessarily need to be a "tolerance" value? Why make it binary?
  • Let's say we compared the drift and tolerance values. How would we process this result and then expose it in the UI?
  • What about gap detection in its current form? Gaps are not the same as drift because they cover rule downtime and missed source events (e.g. when the Kibana process crashes)

@marshallmain (Contributor, Author) commented Aug 31, 2022

Summary from today's meeting on this issue (Framework-level drift detection and warning):

Glossary

  • Global warning: A warning that is not specific to any particular rule. It does not need to be global in the sense of appearing on every page in Kibana; it could appear everywhere, or only on certain pages such as the Detection Rules Management page.

We should continue to discuss the best form for this warning about rules not running on schedule. One option is to display a global warning somewhere (detailed below) if rules are not running on schedule, since we expect that in most cases many rules will start to drift from their schedules concurrently. The other option is to display a warning on each rule that is not running on schedule.

Where we choose to display the warning should take into account the actions we expect users to take to resolve the warning. Per @jethr0null, when there are too many rules to run on schedule the first preferred option is to guide users to provision more resources for Kibana. If provisioning more resources is not feasible for the user, the secondary option would be to disable some of their rules.

New proposed warning location

The initial proposal in this issue was to warn users on each rule if that particular rule is not running on schedule. However, @mikecote suggested that we could potentially warn users sooner and with better context if we create a "global" warning when rules are not running on schedule. The warning might display at the top of the Rules Management page in the Security Solution, for example, but might also display in other places in Kibana like Stack Management or Observability. A global warning would fit better with the expected use case here, since a rule drifting from its schedule is typically not actually a problem with the specific rule experiencing drift, but instead means that Kibana is under provisioned. In a global warning we could also include a list of the rules that are taking the most execution time in Kibana, providing quick insight into possible root causes for the warning.

How exactly we would compute the global warning is still an open question, and there are a few options we discussed - though the list below is not an exhaustive list of viable options and we should continue exploring these and others.

  • Reuse the mechanism that is developed for the future Kibana autoscaling project, as that should take the load from rules into account when determining if more/larger Kibana instances are needed. The warning could appear when it detects that more instances are needed, and the warning could automatically resolve if autoscaling is enabled. If autoscaling is disabled, it could provide guidance to manually scale Kibana.
  • Warn if taskScheduleDelay exceeds some threshold configured in kibana.yml. This may be susceptible to a thundering herd problem and generate a warning if many rules are enabled simultaneously. The warning would disappear if Kibana has enough capacity for the enabled rules on average as the rules spread out. This option tries to get the warning in front of the user as quickly as possible, at the cost of possible false positives.
  • Warn if the interval between subsequent rule executions exceeds the configured rule interval by more than some threshold value. This option is similar to the one above, but could reduce false positive warnings in the thundering herd scenario by comparing the start times of subsequent executions of the same rule. The warning would not appear until at least one rule executes for the second time. (Both threshold-based options are sketched below.)
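
A rough sketch of the two threshold-based options above (field and function names are illustrative, not existing Task Manager APIs):

```ts
// Illustrative only; `taskScheduleDelayMs` and the thresholds are hypothetical inputs.

// Option A: warn as soon as the task schedule delay exceeds a configured threshold.
// Fast feedback, but a thundering herd of newly enabled rules can trip it even
// when steady-state capacity is fine.
function warnOnScheduleDelay(taskScheduleDelayMs: number, thresholdMs: number): boolean {
  return taskScheduleDelayMs > thresholdMs;
}

// Option B: compare start times of consecutive executions of the same rule.
// Only fires once a rule has executed at least twice, which reduces false
// positives while simultaneously enabled rules spread out.
function warnOnExecutionInterval(
  previousStartMs: number,
  currentStartMs: number,
  configuredIntervalMs: number,
  thresholdMs: number
): boolean {
  return currentStartMs - previousStartMs > configuredIntervalMs + thresholdMs;
}
```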

Possible to layer warnings

We are able to have both a global warning and warnings specific to individual rules. This may be useful if individual rules can run off-schedule for reasons other than Kibana being under provisioned. We could then separate the global "Kibana is under provisioned" warning from individual rule "running off schedule" warnings.

When layered, we could take the approach that:

  • A single rule running off schedule is a problem on its own, so we have a warning at the individual rule level
  • Many rules running off schedule would indicate Kibana is under provisioned, so a global warning would appear and advise users to increase Kibana resources or disable some rules

Warning users when they're close to capacity

It could be useful to warn users when they're close to capacity instead of waiting until rules are running off schedule. This likely ties closely to the autoscaling work.

Displaying global warnings

While we would like any of these warnings to be computed at the stack level, we want to be able to display the warning inside solutions. For the global warning, one option is to include it in the alerting health API and have solutions call the API and display the warning as appropriate.
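
As a sketch of what solution-side consumption could look like if the health API were extended (the drift field below does not exist in today's GET /api/alerting/_health response and is purely illustrative):

```ts
// Hypothetical solution-side consumption; the `drift` field does not exist in the
// current GET /api/alerting/_health response and would need to be added.
interface AlertingHealthWithDrift {
  drift?: {
    rulesBehindSchedule: number;
    maxDriftMs: number;
  };
}

async function getGlobalDriftWarning(
  fetchAlertingHealth: () => Promise<AlertingHealthWithDrift>
): Promise<string | undefined> {
  const { drift } = await fetchAlertingHealth();
  if (drift && drift.rulesBehindSchedule > 0) {
    return `${drift.rulesBehindSchedule} rules are running behind schedule (max drift ${drift.maxDriftMs}ms)`;
  }
  return undefined;
}
```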

Open questions

  • How does autoscaling relate to this issue? Especially regarding reuse of the mechanism for determining when Kibana needs to scale, and the release timing of the autoscaling work.
  • Is "rules not running on schedule" a problem on its own that could have multiple root causes, e.g. an individual rule has some issue that makes it run too late, or are we simply using it as a proxy to determine when Kibana is under provisioned? If we're just using it as a proxy for overall system resource provisioning, a global warning makes sense. If instead a rule could run off-schedule for other reasons, we may want to have the ability to warn on specific rules that are off schedule.
  • If Kibana is under provisioned and all rules run off schedule, would it be a problem if every rule generates a warning on each rule execution? What's the desired behavior in this situation?

@mikecote (Contributor) commented Sep 7, 2022

Thanks for the recap @marshallmain! I'm interested to see, from a security product perspective, what the expectation of the user is once they see this warning (disable rules, scale Kibana, contact their administrators, simply be informed that rules are behind, etc.).

yctercero added the Team:Detection Engine and technical debt labels and removed the Team:Detection Alerts label on May 13, 2023