[Alerting] Investigate resilience / side effects of excessively long running rules #111259
Pinging @elastic/kibana-alerting-services (Team:Alerting Services)
The description of the current behavior provided above is accurate. For recurring tasks (like all rules), task manager sets a timeout marker when it claims the task. Another thing to note is that while multiple Kibana instances are specifically called out in the above description, this behavior occurs whether there is one Kibana instance or many. The task claiming logic queries for tasks whose scheduled run time or timeout has passed. Suppose a rule is configured to run every minute and consistently takes 10 minutes to run. The following would occur:
This causes a rule that was supposed to run every minute to actually run every ~5 minutes. To further complicate things, rule executions can return state that is passed on to subsequent executions. For example, the ES query rule type passes the timestamp of the last matched document between rule executions. If a particular rule execution took a particularly long time:
It seems to me like there are 2 issues here:
I think in scenario (1), we definitely want to avoid having Kibana kick off a second version of a rule when the first hasn't finished running, because we don't want to overload an already possibly overloaded cluster with another long-running rule. The second case is less clear to me. If we add the appropriate UX changes to make the user aware of the performance of the rule and they configure it aggressively anyway, should we let them do it? Or not, because they may not understand the full impact of having multiple instances of a rule running in parallel and possibly updating the state out of order?
When the task manager is claiming tasks, it is looking for tasks matching one of two clauses. For recurring tasks, the first clause claims those tasks whose previous runs have completed successfully. The second clause tries to handle the case where Kibana gets restarted or crashes mid task execution, the logic being that if a task is still marked as running after the configured timeout, it must not actually be running. I am wondering if we can pass a heartbeat callback into task manager tasks that, when triggered, pushes out the task's timeout.
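For illustration, a rough sketch of how that heartbeat idea might look from a task's perspective; the `heartbeat` callback and its wiring are hypothetical, not an existing task manager API:

```ts
// Hypothetical: task manager passes a heartbeat() callback into the task runner.
// Each call pushes the task's timeout marker further into the future, so a task
// that is demonstrably making progress is not re-claimed by another Kibana.
interface HeartbeatTaskRunnerContext {
  heartbeat: () => Promise<void>; // hypothetical callback provided by task manager
}

async function runLongRuleTask({ heartbeat }: HeartbeatTaskRunnerContext) {
  const pages = await fetchPagesToProcess();
  for (const page of pages) {
    await processPage(page);
    // Signal that we're still alive so the claim isn't considered timed out.
    await heartbeat();
  }
}

// Placeholders so the sketch stands alone; real tasks would do actual work here.
declare function fetchPagesToProcess(): Promise<unknown[]>;
declare function processPage(page: unknown): Promise<void>;
```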
We could have the framework recognize when the rule is "cancelled" by task manager, and write an event log entry indicating it went over the timeout, and try to nullify the rule's later effects, like queuing actions, updating state, etc. What about rule registry - guessing we don't want that updated either.
Do we? ES will cancel searches if the connection that started them closes. Could we close the connection somehow? Or provide a default (maybe not overrideable?) timeout, which we'd conveniently set to a value near the task timeout. There's probably something we can do ...
Ya, easy to imagine. I think we'll need to make sure this can't happen. And we should give rules which are being cancelled the opportunity of returning new state to be stored; that will be the reward for implementing timeout cancellation :-)
Ya, interesting idea. What if the rule is in an endless loop though? Second timeout? Gets complicated. :-). And feels more important to stop "things" at a timeout first, before we consider letting them being extended. But it's a good idea to allow for extra slop to handle short-lived busy periods. Open an issue for later?
+1 We'd probably want to implement this if we figure out how to "cancel" rules at timeouts, to provide some user control over it. And we'll need diagnostics and/or UI.
I wonder if we can rely on conflicts (409 errors) to prevent this from happening? Like capturing the SO version when the task is claimed and passing it along when updating the task. In the case of a conflict, it would mean the task timed out and another (or the same) Kibana re-claimed the SO and changed the version. In the framework, as a stepping stone, to ensure the framework doesn't do anything funky in timeout cases, we could capture whether the task got cancelled during execution and skip any further writes if it was.
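A minimal sketch of the version-conflict idea, assuming the usual saved objects behaviour of rejecting an update with a 409 when the supplied version no longer matches; the helper, task type, and field names here are illustrative:

```ts
// Illustrative only: capture the SO version when the task is claimed and pass it
// back on update. If another Kibana re-claimed the task after a timeout, the
// version will have changed and the update fails with a 409 conflict instead of
// silently overwriting newer state.
interface MinimalSoClient {
  update(
    type: string,
    id: string,
    attributes: Record<string, unknown>,
    options?: { version?: string }
  ): Promise<unknown>;
}

async function updateTaskStateWithVersionCheck(
  soClient: MinimalSoClient,
  taskId: string,
  claimedVersion: string,
  newState: Record<string, unknown>
): Promise<{ updated: boolean; reason?: 'conflict' }> {
  try {
    await soClient.update(
      'task',
      taskId,
      { state: JSON.stringify(newState) },
      { version: claimedVersion }
    );
    return { updated: true };
  } catch (err: any) {
    // Saved object errors are Boom-style; a 409 means someone else claimed the
    // task since we started, so we drop our stale result rather than overwrite.
    if (err?.output?.statusCode === 409) {
      return { updated: false, reason: 'conflict' };
    }
    throw err;
  }
}
```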
Love y'all. A couple of thoughts:
It would be worthwhile collecting all the telemetry we feel we're missing and slipping it into the 7.16 release. Please add any telemetry you identify to this issue so that we can fit it in. :)
Absolutely true on both points. 👍
In relation to users:
In relation to rule type implementers:
I love the idea... but I'm racking my brain wrt failure cases and the additional overhead.
I'm not too worried about endless loops, as we have a "self healing" mechanism for that: the event loop and CPU get blocked, all of Kibana exhibits problems, and that forces us to address the issue.
Yes please :)
We don't have anything mutating tasks other than TM, and that was intentional, but I think it's only a matter of time, as richer access to the state might be a prerequisite for some complex rule type implementations. For example, in discussing the Metric and Log Threshold rule types with @jasonrhodes and @Kerry350 and rethinking their "look back" implementation, we theorised the use of state to track past alert instances and use that to detect when something is missing. We might need richer access to that state to remove things that were tracked before and that the user wants to forget. Another thought on 409s: phew... lots of thoughts. Great work so far! I'm really looking forward to seeing the broken-down issues we'd like to explore further.
Yup. +1 on an RFC. I suspect there are some "quick wins" - more notification-y stuff when timeouts occur (telemetry, health, event log), but there's definitely a lot of work we're going to have to do to figure out the right long-term stuff - RFC seems right.
Thanks for all the input! After research and some POCs, here's what I've come up with: we should be trying to handle both the case where a rule is taking an unexpectedly long time to execute due to cluster load (in which case we should be taking steps to prevent multiple parallel executions) AND the case where a rule expectedly takes a long time to execute.

**Rule always takes a long time to execute**

In this scenario, a particular rule type always takes a long time to execute. This may only be true within a particular cluster, where rule type A, which in development looks performant, always takes upwards of 10 minutes to execute based on the volume and characteristics of the data it is processing. In this case, our goal as a framework is to try to surface these durations to the user and to provide rule type producers with more tools to react to these types of scenarios. In the near term, I propose we do the following:

1. Allow rule types to pass in their own timeout values #111804
2. Allow rule types to set default and minimum schedule intervals #51203
3. Surface execution durations to the user in the stack management UI #111805

**Rule occasionally takes a long time to execute**

In this scenario, the long rule execution is an aberration, and our goal as a framework should be to make sure that these aberrant executions are cancellable, or, if they are not cancellable and they return long past the attempted cancellation, that their state does not overwrite the state of other executions that have started & completed while it was running. In the near term, I propose we do the following:

1. Provide framework level support for trying to cancel rule executions. POC for cancel behavior (Issue to come)
Note that while we can provide a cancellation mechanism at the framework level, it will be up to each rule type implementation to respond to it (see the sketch below this comment).
2. Ensure that when the long-running rule returns, its outdated state does not override state from subsequent executions.
3. Provide telemetry to determine how often we see these long running rules - added this to the general alerting services telemetry issue.

Longer term, I propose we investigate the following:
[1] As an alternative, I also investigated [using versions to control task updates](https://github.com//pull/111573) if we want to allow "cancelled" tasks to update state as long as no other Kibana has picked up the task before the "cancelled" task returns. This is also a viable option but makes the task manager logic a little more complicated (even more than it already is!)
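As referenced above, a rough sketch of what a cancellation-aware rule executor might look like; the `isCancelled` service and the surrounding wiring are assumptions for illustration, not the actual alerting API:

```ts
// Hypothetical shape of a cancellation-aware rule executor. The framework would
// flip the cancellation flag when the task timeout is reached; a well-behaved
// executor checks it between units of work and returns early without scheduling
// actions or returning state that could clobber a newer execution.
interface CancellationServices {
  isCancelled: () => boolean; // hypothetical: set by the framework at timeout
}

interface ExecutorResult {
  state?: Record<string, unknown>;
}

async function exampleRuleExecutor(services: CancellationServices): Promise<ExecutorResult> {
  const batches = await planWork();
  let processed = 0;
  for (const batch of batches) {
    if (services.isCancelled()) {
      // Stop early; skip action scheduling and let the framework decide what
      // (if anything) to persist for a cancelled run.
      return {};
    }
    await processBatch(batch);
    processed++;
  }
  return { state: { processedBatches: processed } };
}

// Placeholders so the sketch stands alone.
declare function planWork(): Promise<unknown[]>;
declare function processBatch(batch: unknown): Promise<void>;
```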
I created issues for the items under the near term proposals above.
Great observations and recommendations @ymao1 ! :elasticheart:
Can we enforce this somehow? Or at least nudge teams into doing it?
++ 100% agree.
Agree with everything here.
@gmmorris Yes, happy to do so. We are currently working on the Rules page to get an Observability solution in place. The current plan is to simply bring over the Security version. However, I know monitoring/health are top concerns for Security as well and we have a task to update these screens. It would definitely be helpful to know Stack's needs during this process.
@mdefazio I tagged you on the specific UI issue for this. Thanks for your help!
Not clear how we could enforce it, but I'm also hopeful there might be some way to provide an augmented es/so client (or use the existing one with some config we're not aware of) to enforce the timeout. So the rule would presumably end up throwing an error on an es/so call after the timeout was exceeded. Maybe it's just a wrapper around the clients that checks the timeout and throws an error if invoked past it. Seems like some unexplored areas to me. Other than that, logging a warning on the timeouts would be good, if we're not doing that already.
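To make that wrapper idea concrete, a sketch of a generic deadline wrapper around a client; all names here are made up for illustration:

```ts
// Illustrative only: wrap an arbitrary async client so that any method call made
// after the deadline throws. A rule executor using the wrapped client would then
// surface an error shortly after the task timeout instead of running indefinitely.
function withDeadline<T extends object>(client: T, deadlineMs: number): T {
  const deadline = Date.now() + deadlineMs;
  return new Proxy(client, {
    get(target, prop, receiver) {
      const value = Reflect.get(target, prop, receiver);
      if (typeof value !== 'function') {
        return value;
      }
      return (...args: unknown[]) => {
        if (Date.now() > deadline) {
          throw new Error(
            `Client call '${String(prop)}' rejected: rule execution exceeded its timeout`
          );
        }
        return value.apply(target, args);
      };
    },
  });
}

// Usage sketch: hand the rule executor a wrapped client whose deadline roughly
// matches the task timeout.
// const boundedEsClient = withDeadline(realEsClient, 5 * 60 * 1000);
```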
BTW found this oldie :)
Similar to the event log idea above, but if we want this in the UI, then we'll want it in the execution status on the rule, not just in the event log. I'm poking through that stuff right now for the RAC work; there may already be a decent fit, as we have these "failure" reasons, so we could probably add a new reason of "execute-timeout" or similar: see x-pack/plugins/alerting/common/alert.ts, lines 27 to 34 (at 6991f22).
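For concreteness, a sketch of what adding such a reason could look like; the existing enum members shown here are approximations and should be checked against alert.ts:

```ts
// Sketch: the existing execution-status failure reasons (members approximate),
// plus a new one for executions that were cut off at the task timeout.
export enum AlertExecutionStatusErrorReasons {
  Read = 'read',
  Decrypt = 'decrypt',
  Execute = 'execute',
  Unknown = 'unknown',
  License = 'license',
  // New: the execution exceeded its timeout and was cancelled by the framework.
  Timeout = 'timeout',
}
```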
@pmuellr Yes! That's exactly what I explored in my POC. The Elasticsearch client search function has a built-in per-request timeout.
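A minimal sketch of that kind of bounded search, assuming the 7.x @elastic/elasticsearch client and its per-request transport options (option names should be verified against the client version in use):

```ts
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

// Sketch: cap the search at (roughly) the task timeout so a runaway query surfaces
// as an error in the rule executor instead of running long past cancellation.
async function boundedSearch(index: string, query: Record<string, unknown>, timeoutMs: number) {
  return client.search(
    { index, body: { query } },
    // Per-request transport option; assumed to be honoured by the client version in use.
    { requestTimeout: timeoutMs }
  );
}
```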
Yep! Task manager is already logging a warning when tasks time out:
Do we feel this issue influences the priority of this related issue? I know they are only related, but it does make the point that other task types can affect rule and action execution.
Yes and no :-) We may need some special attention on long running connectors, and then I think we'd lump all the other task types into "other task types". I think both of these are lower priority than rules, as we generally haven't seen a problem with them, and in fact at least one task specifically has a long timeout because it was designed to "spread out its work" over time (can't remember which one right now tho). So - yes, we probably have some work to do for connectors and "other task types" - and presumably any changes we'd make for "other tasks" could affect the rule cancellation work. But I think the rule cancellation work takes priority over everything else.
Turning this issue into a Meta issue. Thanks for all the hard work @ymao1! Follow up issues:
Removing from project board since it is now a Meta issue. |
We've seen certain rules run for excessively long durations, far exceeding what we had expected.
For example, we've seen large customers experience execution durations of well over 10 minutes (our general expectation was that rules would take several seconds, definitely not several minutes).
This is concerning, as such a behaviour could have side effects we're not aware of.
We should investigate the possible implications of this, and track how often this happens using logging and telemetry.
What might be happening?
Task Manager is configured to assume that a task (in this case, a rule task) has "timed out" after 5 minutes, at which point it tries to run it again.
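For context, task timeouts are declared per task type when the task definition is registered, roughly along these lines (the registration shape here is approximate and should be checked against the task_manager plugin):

```ts
// Approximate shape of a task manager task definition. The point is that each task
// type declares how long it may run before task manager considers it timed out.
declare const taskManager: {
  registerTaskDefinitions(definitions: Record<string, unknown>): void;
};

taskManager.registerTaskDefinitions({
  'alerting:example-rule-type': {
    title: 'Example rule task',
    timeout: '5m', // after this, task manager assumes the run failed and may re-claim the task
    createTaskRunner: () => ({
      async run() {
        // ... execute the rule ...
        return { state: {} };
      },
      async cancel() {
        // invoked by task manager when the timeout is exceeded (if implemented)
      },
    }),
  },
});
```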
Due to the distributed nature of Kibana instances, the attempt to rerun the rule task might be picked up by any Kibana instance (not necessarily the Kibana that ran it before). As we have no way of knowing if a rule is in fact still running or whether it has in fact crashed and "timed out", we assume the rule has in fact failed (as we had not expected a healthy rule task to ever run this long) and try to rerun it.
At that point, if the rule is in fact still running (and has simply exceeded 5 minutes), we likely end up with two instances of the same rule running in parallel. We aren't sure what the side effects of this might be, but likely one instance will end up overwriting the result of the other - this is an unhealthy state, would likely have unexpected consequences, and shouldn't happen.
Additionally, it's worth noting that most of the time this execution duration far exceeds the interval configured by the customer. This means that a rule might be configured to run every 1 minute, but ends up running every 11 minutes, or worse, every random amount of time above 10 minutes.
What should we do?
Feel free to change this list of actions, but this is what I think we should do off the top of my head:
Thoughts on guardrails
Adding guardrails around this at framework level is very difficult, but presumably our goal is to reduce the likelihood of this kind of thing happening.
Directions worth exploring:

- We already track the `p90`/`p99` of each task type in Task Manager's health stats anyway; can we use that perhaps (when it tends to exceed `5m`)?
- Can we `cancel` the rule when it exceeds the `5m` execution time? We know we can't (currently) cancel the ES/SO queries performed by the implementation, but perhaps we can cancel it at framework level so that its result doesn't actually get saved at framework level?

I'm sure there are more directions, but these feel like a good starting point.