[DISCUSS] Task manager update API to allow changing a task's interval #45152
Comments
Updated the description to something more explicit of what is needed from the task manager. |
Pinging @elastic/kibana-stack-services (Team:Stack Services) |
Making this a Discuss Issue (for a while at least) as I feel this requires some debate.
Current State
So currently, if I'm understanding things correctly, when an Alert is created/enabled we schedule a task without specifying a As part of the implementation of Alerting we provide a TaskTypeDefinition for TaskManager which gets called whenever an Alert needs to be evaluated. The issue we're hitting is that as the Alert is run immediately after creation/enablement, resulting in the next
Issues To Think About
There are a couple of things that jump out at me here
Suggested Plan Of Action
Note: Steps 2 & 3 will likely be done together in a single PR, as separating them might break existing update functionality in Alerting... but I'm not 100% sure yet. Any thoughts? |
Another issue to think about that we haven't tackled yet: TM should really support So, just something to keep in the back of our heads, nothing to do about it now. |
It could be that this was done to handle the case where an action actually knows WHEN it should retry (eg, Slack 429 rate limiting responses, that indicate the time when you can try again). Since this already doesn't work quite right with our current slack action, I don't think this should hold us back from implementing as you suggest (have alerting use TM scheduling). Oh, actually that would be in actions, not alerts, but maybe actions are also doing their own scheduling? Will need to look closer at that, I think. In any case, TM clearly has to deal with scheduling, so as much as possible it's probably best to have it be the source of truth and single way to deal with it. |
Not clear why we need a separate |
This isn't quite in addition, as Task Manager doesn't have an That said, the Store has
|
I think Actions return a boolean |
So, playing around with the code today I came to the conclusion that there's actually a difference between
Any thoughts @pmuellr ? There is another, related, question that comes to mind- should rescheduling update the |
argh, just realised the unit tests validating dates in TaskStore are unreliable as we mock the time and so actually ignore the time returned by the repository… if the wrong date is present - we reset it to the mock, so it doesn’t catch breakages. 😬 |
yup, I can confirm nothing is broken, it just obfuscated a potential bug. This is now fixed on the branch for this issue. |
Here's another issue to iron out: When we schedule a Task with an interval we run the task and the interval comes into effect when scheduling the next run. |
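To make the interval behaviour described above concrete, here is a minimal sketch of how a next run time could be derived once a run has finished. The helpers `intervalToMs` and `nextRunAt` are hypothetical names for illustration, not Task Manager functions, and the supported units ("s", "m", "h") are an assumption.

```typescript
// Hypothetical helper (not a Task Manager API): converts an interval such as
// "30s", "5m" or "1h" into milliseconds. The supported units are an assumption.
function intervalToMs(interval: string): number {
  const match = /^(\d+)([smh])$/.exec(interval);
  if (!match) {
    throw new Error(`Invalid interval: ${interval}`);
  }
  const unitMs: Record<string, number> = { s: 1000, m: 60000, h: 3600000 };
  return Number(match[1]) * unitMs[match[2]];
}

// Hypothetical helper: the interval only takes effect when the next run is
// scheduled, so the next runAt is derived from a reference time of the run
// that just happened. Whether that reference is the start or the end of the
// run is left to the caller here.
function nextRunAt(from: Date, interval: string): Date {
  return new Date(from.getTime() + intervalToMs(interval));
}
```

Under this model, changing the interval of an already scheduled task has no visible effect until the task runs again, which is the behaviour being questioned in this thread.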
Further investigation has unearthed a few more moving parts to take into account:
Still contemplating these cases and the repercussion of rescheduling them.
|
By design, I believe we're comfortable with this feature. Though
I believe they behave the same? but are coded differently. We changed the task manager's interval feature to match what we're doing in Alerting and what we think tasks should do in general.
I think @peterschretlen any thoughts on how we expect alerts to re-schedule when changing the interval?
I'm not sure we would need a reschedule API for alerting. We would need to update the task's
++ for the reschedule API if we develop it.
Yeah, I think we'll have to add some logic to the update API.
Agreed, I found this issue #50272 |
They're mildly different actually, but not for any reason I can identify.
I've gone with
TaskManager doesn't currently have an Instead I prefer to expose a
Hence, better to separate |
Haha no worries, yeah, my thoughts exactly but hopefully we won't need an
additional API. 🤞
…On Wed, Nov 20, 2019, 14:57 Patrick Mueller ***@***.***> wrote:
Sorry for the late reply - I didn't realize we didn't have a public
update() method, and I was worried about having two ways of updating the
schedule, given the new reschedule() api. With no public update() method,
I'm obviously happy with having a new reschedule() api :-).
I guess if we DO add an update() API of some sort, we'll need to figure
out what the semantics of that are - does it automagically reschedule, or
just update relevant things without necessarily rescheduling, or maybe
takes an additional option indicating whether rescheduling should happen.
Kinda thing.
|
Going back and forth on this with @mikecote I've come to the conclusion that this current model is a little dangerous, especially with the expected growth of the use of Task Manager in Alerting. This is due to the fact that we use a single document to manage both configuration and running-state of a task. This model doesn't scale, especially in a distributed system, and I think we should hold off on merging this change as it introduces a greater likelihood of clashes that in some cases might be hard to recover from in a manner that can be reasoned about. I'm prioritising looking into how we might be able to separate the configuration of a task from its running-state, in such a way that configuration updates can't clash with state updates. |
Since this request is being driven by alerting, perhaps we should revisit the assumption that a reschedule API (or any kind of API for configuration update) is needed. Rather than fighting the issues with optimistic concurrency control, what if we consider task configuration immutable? To update the configuration, you have to tear down and re-create the task. That pushes responsibility on the client. I don’t know if that would just introduce other problems, but if we’re seeing lots of collisions like this maybe we need to revisit the assumption we need a reschedule api on task manager. At some point I think it makes sense to try and solve the underlying problem of separating updates to state and configuration. As-is though that problem is by design, and it may not be worth revisiting the design as part of this issue. |
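As a rough illustration of the "immutable configuration" idea, the sketch below tears a task down and re-creates it with a new interval while carrying its state over. Everything in it (`InMemoryTasks`, `recreateWithInterval`, the shape of `Task`) is hypothetical and in-memory only; it is not how Task Manager stores tasks.

```typescript
// Hypothetical shapes for illustration only.
interface TaskConfig {
  readonly taskType: string;
  readonly interval: string;
}

interface Task {
  id: string;
  config: TaskConfig;
  state: Record<string, unknown>;
}

// A toy in-memory store standing in for whatever persists tasks.
class InMemoryTasks {
  private tasks = new Map<string, Task>();
  private nextId = 1;

  schedule(config: TaskConfig, state: Record<string, unknown> = {}): Task {
    const task: Task = { id: `task-${this.nextId++}`, config, state };
    this.tasks.set(task.id, task);
    return task;
  }

  remove(id: string): Task {
    const task = this.tasks.get(id);
    if (!task) {
      throw new Error(`Task ${id} not found`);
    }
    this.tasks.delete(id);
    return task;
  }

  get(id: string): Task | undefined {
    return this.tasks.get(id);
  }
}

// Configuration is never edited in place: the client removes the old task and
// schedules a new one, passing the previous state along explicitly.
function recreateWithInterval(store: InMemoryTasks, id: string, interval: string): Task {
  const old = store.remove(id);
  return store.schedule({ ...old.config, interval }, old.state);
}
```

Note that the new task gets a new id, so the client has to store the new reference, and any state written between the remove and the schedule would be lost in a real distributed setup.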
Some thoughts from @mikecote on potential issues with immutable config:
|
Agreed, as the original need for
If the immutable task approach were to be handled in TM we would have to ensure 100% safety in carrying over state, so this would still require some kind of reconciliation across states. |
If we take a step back and look at the original problem, it basically comes down to the alert not picking up the updated interval right away. We could solve this problem by just running the alert after update ("run now" feature) and let the post execution runAt calculation handle the rest. This would sort of mean we still don't use the This will at least guarantee the alert won't lose state but at the same time not guarantee the updated interval will take effect immediately. But it will eventually which I think is more important than potentially corrupting a task. This then moves the remaining problems to the "run now" API for things like "what if it's already running now" (which could be an error or pretend it in fact did run immediately). We can also add some UX to handle these corner cases like a note to say "it may take longer to update the alert's check interval". This idea is still very fresh and probably has its downsides I didn't get a chance to think of, but food for thought. |
I'm not sure I follow the idea. Also... sorry, but what's OCC? |
The scenario here is that we would / could swallow the error and let the next execution return a new runAt with the updated interval.
Optimistic Concurrency Control. The cause of the lovely 409 errors. |
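For readers who have not run into it, the clash being discussed can be sketched with a toy versioned store. `VersionedStore` and `VersionConflictError` are hypothetical names for illustration; the only thing taken from the thread is that a stale write is rejected with a 409.

```typescript
// Hypothetical error type mirroring a 409 version conflict.
class VersionConflictError extends Error {
  readonly statusCode = 409;
  constructor(id: string) {
    super(`Version conflict while updating ${id}`);
  }
}

interface VersionedDoc<T> {
  id: string;
  version: number;
  attributes: T;
}

// A toy store: every write must name the version it read. If another writer
// got there first, the write is rejected instead of silently overwriting.
class VersionedStore<T> {
  private docs = new Map<string, VersionedDoc<T>>();

  create(id: string, attributes: T): VersionedDoc<T> {
    const doc: VersionedDoc<T> = { id, version: 1, attributes };
    this.docs.set(id, doc);
    return { ...doc };
  }

  update(id: string, expectedVersion: number, attributes: T): VersionedDoc<T> {
    const current = this.docs.get(id);
    if (!current) {
      throw new Error(`Document ${id} not found`);
    }
    if (current.version !== expectedVersion) {
      throw new VersionConflictError(id);
    }
    const next: VersionedDoc<T> = { id, version: current.version + 1, attributes };
    this.docs.set(id, next);
    return { ...next };
  }
}
```

If the task runner and an interval update both read version 1 of the same task document, whichever writes second fails with the conflict, which is why keeping configuration and running state in one document makes these collisions more likely.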
ahhh 🤦‍♂️ of course. |
Ah, you mean Alerting would handle that? |
Yeah I think we just do that for now, it would check the box for alerting and we can look at maybe letting TM do so later. We could just focus on a run now API which basically calls the claim process for a specific task and skips it if it's not I think we learned a lot with this exercise about what updating a task means and how it can go wrong in a few ways. I'm not opposed to still explore the delete + schedule new one either but figured I would put this as an option. Curious what @peterschretlen and @pmuellr think? |
Yeah the "run now" approach sounds promising. And clean - I suspect it would have fewer edge cases than a "delete + schedule" approach. |
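The "run now" option discussed in the comments above could look roughly like the sketch below from the Alerting side. The `runNow` callback, the `updateIntervalAndRunNow` function and the alert shape are all assumptions made for illustration; no such signatures are confirmed in this thread.

```typescript
// Hypothetical callback standing in for a Task Manager "run now" API.
type RunNow = (taskId: string) => Promise<void>;

// Hypothetical Alerting-side update: persist the new interval, ask for an
// immediate run, and swallow a failure so that the next regular execution
// picks up the new interval when it computes its runAt.
async function updateIntervalAndRunNow(
  alert: { taskId: string; interval: string },
  newInterval: string,
  runNow: RunNow,
  log: (message: string) => void = () => {}
): Promise<boolean> {
  alert.interval = newInterval;
  try {
    await runNow(alert.taskId);
    return true; // the new interval takes effect straight away
  } catch (err) {
    const reason = err instanceof Error ? err.message : String(err);
    // Not fatal: the interval still applies after the next scheduled run.
    log(`Could not run task ${alert.taskId} immediately: ${reason}`);
    return false;
  }
}
```

The boolean result is only there so a caller can surface something like "it may take longer to update the alert's check interval"; the task's state is never touched by the update itself.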
Cool, I'll look into that api, that should be fine. |
I've begun looking into the work needed for One thought though is that we need to give some information about the Task run back to the API caller (I presume), so I was wondering if it's just about giving back success vs. failure (w/ error message) information? Or do we need to give more information like updated state? Either way there's work in TM to do, as we don't currently expose anything about success/failure other than internally in the runner, but this seems like a good opportunity to figure out how to do it as we said we'd want that for diagnosis anyway. |
Yeah, we should develop this #50214 in preparation for this #50215. We'll be able to re-use this run now API from task manager when creating an API to run alerts immediately. I think basically setting a We can worry about bypassing available workers and such at a later time if necessary. We can make the function communicate when the task isn't idle, got updated during attempt to set This will be mostly for server logging purposes for the alert update API to communicate if the task didn't get queued. That way we've done our best to make the new |
hmm, I wasn't going to update the If there's no need to return such a response (and we rely on the log instead) then I'd still prefer to utilise the The one downside to this approach is that whichever Kibana gets the request will handle the execution (rather than allowing any Kibana to pick it up) but this feels much safer than naively updating the field (as it's not just 409s on update I'm worried about, but also 409s due to synchronisation issues). |
You're right, this is probably the approach we should go with. Thinking about it, the alert run now API wouldn't respond until execution is complete. |
We are now going to have two "successful" outcomes to running
Any ideas how we should notify the user about these things? |
We discussed the above issue on slack and decided the following: |
There should be an update API in the task manager that allows changing certain properties of a task. When a task's interval changes, it should, if possible, readjust the runAt.