[Security Solution][Alerts] Decouple gap detection from additional lookback #138933
Pinging @elastic/security-solution (Team: SecuritySolution)
As of 8.4, we now expose the rule/action queue length and queue duration in Stack Monitoring. This was the first phase in providing external observability for alerting rules, and it allows users to monitor their system for drift in alerting rule execution. It also allows users to create alerting rules on their alerting health so they can be notified when alerting rules drift beyond their threshold. If we don't want to rely on Stack Monitoring for this and would rather display something in the cluster that is running the alerting rules, that's fine also. I would prefer that we do this at the framework level and that we don't tie it to the poll interval, but instead allow users to configure an explicit drift tolerance, similar to option 1. A delay slightly greater than the poll interval is tolerable for all alerting rules: for security detection rules, which should be generating all alerts, gap remediation makes a slight delay completely tolerable; and observability rules are concerned with the current state of the systems being monitored, so slight delays are negligible. However, large delays are a concern for both security solution and observability rules.
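To make the drift-tolerance idea concrete, here is a minimal TypeScript sketch of what a framework-level check could look like, assuming a configurable tolerance. The names (`checkDrift`, `driftToleranceMs`, `DriftCheckResult`) are made up for illustration and are not part of the existing alerting framework.

```ts
// Minimal sketch of a framework-level drift check. All names here are
// hypothetical and not part of the actual alerting framework API.

interface DriftCheckResult {
  driftMs: number; // how late the execution started
  exceedsTolerance: boolean;
}

/**
 * Compare when a rule was scheduled to run against when it actually
 * started, and flag it if the delay exceeds the configured tolerance.
 */
function checkDrift(
  scheduledRunAt: Date,
  actualStartedAt: Date,
  driftToleranceMs: number
): DriftCheckResult {
  const driftMs = Math.max(0, actualStartedAt.getTime() - scheduledRunAt.getTime());
  return { driftMs, exceedsTolerance: driftMs > driftToleranceMs };
}

// Example: a rule scheduled for 12:00:00 that started at 12:00:12,
// with a 5-second tolerance, would be flagged as drifting.
const result = checkDrift(
  new Date('2022-08-01T12:00:00Z'),
  new Date('2022-08-01T12:00:12Z'),
  5_000
);
// result: { driftMs: 12000, exceedsTolerance: true }
```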
@marshallmain I think it's the right direction overall. Some thoughts and questions:
UX questions:
Summary from today's meeting on this issue (Framework-level drift detection and warning):

**Glossary**

We should continue to discuss the best form for this warning about rules not running on schedule. One option is to display a global warning somewhere (detailed below) if rules are not running on schedule, since we expect that in most cases many rules will start to drift from their schedules concurrently. The other option is to display a warning on each rule that is not running on schedule. Where we choose to display the warning should take into account the actions we expect users to take to resolve the warning. Per @jethr0null, when there are too many rules to run on schedule the first preferred option is to guide users to provision more resources for Kibana. If provisioning more resources is not feasible for the user, the secondary option would be to disable some of their rules.

**New proposed warning location**

The initial proposal in this issue was to warn users on each rule if that particular rule is not running on schedule. However, @mikecote suggested that we could potentially warn users sooner and with better context if we create a "global" warning when rules are not running on schedule. The warning might display at the top of the Rules Management page in the Security Solution, for example, but might also display in other places in Kibana like Stack Management or Observability. A global warning would fit better with the expected use case here, since a rule drifting from its schedule is typically not actually a problem with the specific rule experiencing drift, but instead means that Kibana is under provisioned. In a global warning we could also include a list of the rules that are taking the most execution time in Kibana, providing quick insight into possible root causes for the warning. How exactly we would compute the global warning is still an open question, and there are a few options we discussed - though the list below is not an exhaustive list of viable options and we should continue exploring these and others.

**Possible to layer warnings**

We are able to have both a global warning and warnings specific to individual rules. This may be useful if individual rules can run off-schedule for reasons other than Kibana being under provisioned. We could then separate the global "Kibana is under provisioned" warning from individual rule "running off schedule" warnings. When layered, we could take the approach that:

**Warning users when they're close to capacity**

It could be useful to warn users when they're close to capacity instead of waiting until rules are running off schedule. This likely ties closely to the autoscaling work.

**Displaying global warnings**

While we would like any of these warnings to be computed at the stack level, we want to be able to display the warning inside solutions. For the global warning, one option is to include it in the alerting health API and have solutions call the API and display the warning as appropriate (see the sketch at the end of this comment).

**Open questions**
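Regarding the "Displaying global warnings" option above, the sketch below shows how a solution could consume such a flag. The `rules_on_schedule` and `most_expensive_rules` fields are hypothetical and are not part of today's `/api/alerting/_health` response; they only stand in for whatever shape a framework-level drift signal might take.

```ts
// Hypothetical sketch: a solution-side check that asks the alerting health
// API whether rules are running on schedule. The response shape below is
// invented for illustration, not the real health API contract.

interface HypotheticalAlertingHealth {
  rules_on_schedule: boolean;
  most_expensive_rules?: Array<{ id: string; name: string; avg_duration_ms: number }>;
}

async function shouldShowGlobalDriftWarning(
  fetchHealth: () => Promise<HypotheticalAlertingHealth>
): Promise<{ show: boolean; topRules: string[] }> {
  const health = await fetchHealth();
  return {
    show: !health.rules_on_schedule,
    // Surface the rules consuming the most execution time as likely root causes.
    topRules: (health.most_expensive_rules ?? []).map((r) => r.name),
  };
}
```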
Thanks for the recap @marshallmain! I'm interested to see from a security product perspective what the expectation of the user is once they see this warning (disable rules, scale Kibana, contact their administrators, just to let them know rules are behind, etc.).
Security rules attempt to warn users if the rules are not running on schedule. The "additional lookback" parameter on each rule controls how much a rule is allowed to drift from its rule interval before a "gap detected" warning is written. However, the "additional lookback" parameter also controls the overlap of time ranges that consecutive rule executions will query. It's not always clear to users what a good choice of "additional lookback" would be, and in some cases we've seen users pick relatively large values which then suppress warnings about rule scheduling drift. The "additional lookback" value can also be different for each individual rule, meaning that warnings about drift can show for some rules but not others - even though all rules are likely to be drifting if any rules are.
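For context, here is a simplified sketch of the lookback-based check described above (not the actual Security Solution code). Each execution queries roughly `[now - interval - additional lookback, now]`, so a gap is only reported once the drift exceeds the additional lookback.

```ts
// Simplified illustration of the lookback-based gap check described above;
// not the actual Security Solution implementation. Consecutive executions
// overlap by `additionalLookbackMs` when the rule runs exactly on schedule.

interface GapCheckInput {
  previousQueryEnd: Date; // the `now` of the previous execution
  currentRunStart: Date;  // the `now` of this execution
  intervalMs: number;     // rule interval
  additionalLookbackMs: number;
}

/** Returns the unqueried gap in ms, or 0 if the query windows still overlap. */
function detectGapMs(input: GapCheckInput): number {
  const { previousQueryEnd, currentRunStart, intervalMs, additionalLookbackMs } = input;
  // Start of this execution's query window.
  const currentQueryStart =
    currentRunStart.getTime() - intervalMs - additionalLookbackMs;
  return Math.max(0, currentQueryStart - previousQueryEnd.getTime());
}

// With a 1m interval and a 30m additional lookback, the rule has to drift
// more than 30 minutes past its schedule before any gap is reported -- which
// is how large lookback values end up suppressing drift warnings.
```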
In one SDH we saw a customer choose additional lookback values of 30m or 1h for many of their rules, while using rule intervals of around 1-5 minutes. As they enabled more rules, they exceeded the number of rules per minute Kibana could process and rules started to get backed up and drift away from their scheduled execution times. However, the long additional lookback values suppressed any warnings about this drift in the Security Solution UI.
We should try to warn users more consistently any time their rules are not running on schedule. There are at least 3 ways we might implement this. All 3 solutions below would compute drift and drift tolerance in the same way for all rules in a cluster, which would resolve the problem where some rules can report drift warnings while others don't, even though they all drift the same way.
1. Add a "security solution drift tolerance" config and check the drift compared to this tolerance in the rule executors
This is the simplest to implement and avoids exposing framework implementation details (poll_interval) to solutions. The default value of "security solution drift tolerance" would be something like 5 seconds, slightly longer than the default
poll_interval
. Any users that increase the poll interval config value would need to increase the drift tolerance as well to avoid getting false positive warnings about drift.2. Expose the Task Manager
poll_interval
config to rule executors and check drift compared topoll_interval
This reduces the amount of configuration options in
kibana.yml
, but couples the Security solution rule implementation to a specific config setting in the Task Manager. Coupling solution implementation to details of the framework is not ideal.3. Implement gap detection in the Task Manager or alerting framework by comparing drift to
poll_interval
This solution would also have reduced configuration options compared to (1), but avoids coupling the solution and framework implementations. However, it would require more changes on a technical level to move that logic out of the security rule executors but expose the resulting gap information to the solution for gap remediation (i.e. running the core search logic multiple times over different time ranges to cover the gap).
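A rough sketch of what option 3 could look like from the rule executor's point of view, assuming the framework hands a pre-computed gap to the executor. The `gap` field and surrounding types are hypothetical and are not part of the current rule type contract.

```ts
// Rough sketch of option 3: the framework computes the gap once (using its
// own poll_interval-based tolerance) and hands it to the rule executor, which
// only has to remediate it. All types and fields here are invented for
// illustration.

interface HypotheticalGapInfo {
  // Time range that was scheduled to be covered but has not been queried yet.
  gapStart: Date;
  gapEnd: Date;
}

interface HypotheticalExecutorOptions {
  startedAt: Date;
  gap?: HypotheticalGapInfo; // present only when the framework detected drift beyond tolerance
}

async function securityRuleExecutor(
  options: HypotheticalExecutorOptions,
  intervalMs: number,
  runSearch: (from: Date, to: Date) => Promise<void>
): Promise<void> {
  // Gap remediation: cover the missed range first, then run the normal window.
  if (options.gap) {
    await runSearch(options.gap.gapStart, options.gap.gapEnd);
  }
  const normalFrom = new Date(options.startedAt.getTime() - intervalMs);
  await runSearch(normalFrom, options.startedAt);
}
```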