Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DISCUSS] Task manager update API to allow changing a task's interval #45152

Closed
gmmorris opened this issue Sep 9, 2019 · 40 comments · Fixed by #53143
Closed

[DISCUSS] Task manager update API to allow changing a task's interval #45152

gmmorris opened this issue Sep 9, 2019 · 40 comments · Fixed by #53143
Assignees
Labels
discuss Feature:Task Manager Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) v7.6.0

Comments

@gmmorris
Copy link
Contributor

gmmorris commented Sep 9, 2019

There should be an update API in the task manager that allows to change certain properties within a task. Whenever changing the task's interval, it should if possible, re-adjust the runAt.

@gmmorris gmmorris self-assigned this Sep 9, 2019
@mikecote mikecote changed the title Adjust the scheduled task's runAt when updating an alert if the interval has changed Task manager update API to allow changing a task's interval Sep 18, 2019
@mikecote
Copy link
Contributor

Updated the description to something more explicit of what is needed from the task manager.

@elasticmachine
Copy link
Contributor

Pinging @elastic/kibana-stack-services (Team:Stack Services)

@gmmorris gmmorris changed the title Task manager update API to allow changing a task's interval [DISCUSS] Task manager update API to allow changing a task's interval Nov 14, 2019
@gmmorris
Copy link
Contributor Author

gmmorris commented Nov 14, 2019

Making this a Discuss Issue (for a while at least) as I feel this requires some debate.

Current State

So currently, if I'm understanding thing correctly, when an Alert is created/enabled we schedule a task without specifying a runAt field or interval field to Task Manager.
What this means is the the Alert is always run "immediately" after creation/enablement.

As part of the implementation of Alerting we provide a TaskTypeDefinition for TaskManager which gets called whenever an Alert needs to be evaluated.
In the TaskTypeDefinition we use the ability of tasks to override the runAt field as part of their run to use the interval (which is specified on the Alert) to calculate when the next run should take place.

The issue we're hitting is that as the Alert is run immediately after creation/enablement, resulting in the next runAt being calculated based on the interval as it was at the that time, it doesn't get a chance to recalculate the runAt until it gets run again. This means that if we update the alert at any time, including changing the interval, the changes won't take affect until it is rerun by Task Manager at some later point.

Issues To Think About

There are a couple of things that jump out at me here

  1. This means that the use of interval in Alerting is different than the use of interval in regular tasks in Task Manager, but I can't see a good reason for this distinction. It means we have two places to maintain and keep in mind when thinking about intervals, and also means behaviour can diverge over time. For example, if we implement updating as part of this issue, then the semantics between updating a task vs. an alert could easily diverge.
    In fact, as the interval is stored differently in alerts, an Alerting task doesn't actually appear to have an interval at all from Task Manager's perspective, which harms maintainability already.
  2. In relation to the semantic around updating an interval, there is a nuance we need to discuss: if an interval on a task/alert is updated, _do we recalculate the runAt as (scheduledAt / lastRun) + interval or updatedTime + interval? My instinct is that the interval should be calculated relative to the last run (or in lieu of a previous run, the scheduledAt time), but want o hear thoughts.

Suggested Plan Of Action

  1. Even though it isn't actually addressing the cause for this Issue, we begin by adding a reschedule api in Task Manager which is used to reschedule an already scheduled task, which will support updating the interval and will calculate the correct runAt based on the semantics that we'll agree after discussing issue 2 above.

  2. Then we move Alerting to use Task Manager's built in intervals (as per Convert alerts to use task manager intervals #46001) as this will reduce the dangers in issue 1 above.

  3. Change Alerting's update api to use Task Manager's new reschedule api.

Note: Steps 2 & 3 will likely be done together in a single PR, as separating them might break existing update functionality in Alerting... but not 100% yet.


Any thoughts?

@pmuellr
Copy link
Member

pmuellr commented Nov 14, 2019

Another issue to think about that we haven't tackled yet: TM should really support cron type scheduling as well as the somewhat vague "interval" it already supports. Watcher itself is kinda uber-flexible here, I think I'd be happy with just an interval or cron specification allowed, for now - https://www.elastic.co/guide/en/elasticsearch/reference/current/trigger-schedule.html .

So, just something to keep in the back of our heads, nothing to do about it now.

@pmuellr
Copy link
Member

pmuellr commented Nov 14, 2019

  1. Then we move Alerting to use Task Manager's built in intervals (as per Convert alerts to use task manager intervals #46001) as this will reduce the dangers in issue 1 above.

It could be that this was done to handle the case where an action actually knows WHEN it should retry (eg, Slack 429 rate limiting responses, that indicate the time when you can try again). Since this already doesn't work quite right with our current slack action, I don't think this should hold us back from implementing as you suggest (have alerting use TM scheduling). Oh, actually that would be in actions, not alerts, but maybe actions are also doing their own scheduling? Will need to look closer at that, I think.

In any case, TM clearly has to deal with scheduling, so as much as possible it's probably best to have it be the source of truth and single way to deal with it.

@pmuellr
Copy link
Member

pmuellr commented Nov 14, 2019

Not clear why we need a separate reschedule API, since presumably an update can also essentially do a reschedule. I'm always a little leery of having "more than one way to do it". :-)

@gmmorris
Copy link
Contributor Author

gmmorris commented Nov 14, 2019

Not clear why we need a separate reschedule API, since presumably an update can also essentially do a reschedule. I'm always a little leery of having "more than one way to do it". :-)

This isn't quite in addition, as Task Manager doesn't have an update api at the moment, only a schedule.

That said, the Store has schedule and update and I'm now considering the addition of reschedule because they are actually quite different.

  • schedule - task a TaskTypeDefinition, instanciate a Task and schedule it to run immediately (or in the future if a runAt is specified).
  • reschedule - take an existing task and recalculate its existing runAt field, taking into account its current state, previous interval if it has one. Ideally we'll end up with a normal Task reshceduling using this API, instead of it happening inside of TaskRunner.
  • update - Take an existing Task instance and update whichever fields I tell you to, no questions asked. Very much an internal thing used to manage state.

@gmmorris
Copy link
Contributor Author

Oh, actually that would be in actions, not alerts, but maybe actions are also doing their own scheduling? Will need to look closer at that, I think.

I think Actions return a boolean retry which then Alerting acts upon.

@gmmorris
Copy link
Contributor Author

gmmorris commented Nov 14, 2019

So, playing around with the code today I came to the conclusion that there's actually a difference between update and reschedule in the TaskStore:

  • update is a naive update as it takes whatever attributes you give it, such as runAt etc. and it used by the TaskManager, TaskRunner etc. while manipulating an existing Task
  • reschedule on the other hand takes whatever attributes of a Task and updates the existing task with these fields (like update), but it also needs to apply some scheduling semantics, such as resetting the runAt, and presumably wiping some scheduling state, such as retryAt and status fields. So, it is actually a mix between schedule and update, so reschedule does feel right to me.

Any thoughts @pmuellr ?

There is another, related, question that comes to mind- should rescheduling update the scheduledAt field? I think @mikecote will have thoughts on this...

@gmmorris
Copy link
Contributor Author

argh, just realised the unit tests validating dates in TaskStore are unreliable as we mock the time and so actually ignore the time returned by the repository… if the wrong date is present - we reset it to the mock, so it doesn’t catch breakages. 😬
I don’t think anything is broken, but mocking time is dangerous… I’ll look into cleaning that up while I’m in there.

@gmmorris
Copy link
Contributor Author

argh, just realised the unit tests validating dates in TaskStore are unreliable as we mock the time and so actually ignore the time returned by the repository… if the wrong date is present - we reset it to the mock, so it doesn’t catch breakages. 😬
I don’t think anything is broken, but mocking time is dangerous… I’ll look into cleaning that up while I’m in there.

yup, I can confirm nothing is broken, it just obfuscated a potential bug. This is now fixed on the branch for this issue.

@gmmorris
Copy link
Contributor Author

Here's another issue to iron out:

When we schedule a Task with an interval we run the task and the interval comes into effect when the scheduling the next run.
In the case where we're using the reschedule api with a new interval, the default behaviour you might expect is for the rescheduled task to be run immediately (like when calling schedule) and then for the new interval to apply to the next scheduled run.
My instinct, though, is not to run the task immediately, but rather just adjust the interval based on when the task was last ran - wanted to verify we agree on this behaviour.

@gmmorris
Copy link
Contributor Author

Further investigation has unearthed a few more moving parts to take into account:

  1. This work is essentially introducing a second place where a task get rescheduled, as this also happens as part of the TaskRunner's work when a recurring task completes. I'll see if we can merge these into a single place for the logic to reside, which will help make TaskRunner smaller, which was already a concern I had as it does many different things.

  2. There are several states a task might be in when reschedule is called which all need to be addressed:

  • A non recurring Task whose RunAt is in the future (we don't currently create these as far as I'm aware, but the API allows it)
  • A recurring task which has been scheduled to run in the future at some point
  • A task (recurring or otherwise) which is mid run and we need to decide whether to reschedule in such a state of flux
  • A task (recurring or otherwise) which has failed and is . about to be retried in the near future, does rescheduling reset the runAt same as it would with an idle task?

Still contemplating these cases and the repercussion of rescheduling them.

  1. It turns out we don't know when a recurring task was last run , only when it was scheduled (startedAt gets wiped when a task completes and gets rescheduled) so calculating the runAt when an interval is changed is tricky as scheduledAt + interval is very likely to be in the past, resulting in an immediate run of the task. I'm looking into using the existing interval to subtract from the current runAt to get the previous run time and then add the new interval to it, but that makes out scheduling logic even more messy, so I'm trying to simplify the moving parts while I'm in there.

@mikecote
Copy link
Contributor

The issue we're hitting is that as the Alert is run immediately after creation/enablement

By design, I believe we're comfortable with this feature. Though it doesn't get a chance to recalculate the runAt until it gets run again is definitely part of the problem. A user can update an alert anytime (not just after initially creating the alert) and the issue outlined exists.

This means that the use of interval in Alerting is different than the use of interval in regular tasks in Task Manager

I believe they behave the same? but are coded differently. We changed the task manager's interval feature to match what we're doing in Alerting and what we think tasks should do in general.

In relation to the semantic around updating an interval, there is a nuance we need to discuss: if an interval on a task/alert is updated, _do we recalculate the runAt as (scheduledAt / lastRun) + interval or updatedTime + interval?

I think now + interval would be the easiest unless we feel we should go down the path of calculating the last time the task ran by doing (runAt - previous interval) + new interval

@peterschretlen any thoughts on how we expect alerts to re-schedule when changing the interval?

Change Alerting's update api to use Task Manager's new reschedule api.

I'm not sure we would need a reschedule API for alerting. We would need to update the task's interval parameter which I'm assuming would cause a re-calculation of runAt. This could bring a few more complications like "how do I know if the interval changed" and "did I previously have an interval" but I think making the update API a bit smarter would be my pick.

My instinct, though, is not to run the task immediately, but rather just adjust the interval based on when the task was last ran - wanted to verify we agree on this behaviour.

++ for the reschedule API if we develop it.

There are several states a task might be in when reschedule is called which all need to be addressed

Yeah, I think we'll have to add some logic to the update API.

@pmuellr

Another issue to think about that we haven't tackled yet: TM should really support cron type scheduling as well as the somewhat vague "interval" it already supports.

Agreed, I found this issue #50272

@gmmorris
Copy link
Contributor Author

gmmorris commented Nov 18, 2019

I believe they behave the same? but are coded differently. We changed the task manager's interval feature to match what we're doing in Alerting and what we think tasks should do in general.

They're mildly different actually, but not for any reason I can identify.
Alerting defaults to Date.now() if the startedAt + interval are in the past (TM doesn't do this), and the interval is stored in params instead of the interval field.
In anycase - the logic is duplicated between the two, and mildly different - merging these seems the right direction.

I think now + interval would be the easiest unless we feel we should go down the path of calculating the last time the task ran by doing (runAt - previous interval) + new interval

I've gone with (runAt - previous interval) + new interval, it wasn't that messy, and it feels more correct, but can easily strip this out.

I'm not sure we would need a reschedule API for alerting. We would need to update the task's interval parameter which I'm assuming would cause a re-calculation of runAt. This could bring a few more complications like "how do I know if the interval changed" and "did I previously have an interval" but I think making the update API a bit smarter would be my pick.

TaskManager doesn't currently have an update api, only a schedule api, so either way we're exposing a new api.
I don't want to expose the update api as it feels internal and naive and lets you change lots of internal fields (anything on ConcreteTaskInstance that I don't think we should let any PlugIn update blindly as it can easily cause a broken state in a task...

Instead I prefer to expose a reschedule api that only lets you change fields that were available in the original schedule.

There are several states a task might be in when reschedule is called which all need to be addressed

Yeah, I think we'll have to add some logic to the update API.

Hence, better to separate update from reschedule by keeping update internal, and exposing a new api.

@gmmorris
Copy link
Contributor Author

gmmorris commented Nov 20, 2019 via email

@gmmorris
Copy link
Contributor Author

Going back on forth on this with @mikecote I've come to the conclusion that this current model is a little dangerous, especially with the expected growth of the use of Task Manager in Alerting.
The current model relies heavily on the versioning of documents and naively tries to update documents in the hope it hasn't been touched.

This is due to the fact that we use a single document to manage both configuration and running-state of a task. This model doesn't scale, especially in a distributed system, and I think we should hold off on merging this change as it introduces a greater likelihood of clashes that in some cases might be hard to recover from in a manner that can be reasoned about.

I'm prioritising looking into how we might be able to separate the configuration of a task from it's running-state, in such a way that configuration updates can't clash with state updates.
The sooner we tackle this the better.

@peterschretlen
Copy link
Contributor

Since this request is being driven by alerting, perhaps we should revisit the assumption that a reschedule API (or any kind of API for configuration update) is needed.

Rather than fighting the issues with optimistic concurrency control, what if we consider task configuration immutable?

To update the configuration, you have to tear down and re-create the task. That pushes responsibility on the client. I don’t know if that would just introduce other problems, but if we’re seeing lots of collisions like this maybe we need to revisit the assumption we need a reschedule api on task manager.

At some point I think it makes sense to try and solve the underlying problem of separating updates to state and configuration. As-is though that problem is by design, and it may not be worth revisiting the design as part of this issue.

@peterschretlen
Copy link
Contributor

Some thoughts from @mikecote on potential issues with immutable config:

There could be some challenges if the task state doesn’t transfer (resets throttling, loses alert instance state, etc) but maybe another use case to move alert instance logic out or we copy it over.

@gmmorris
Copy link
Contributor Author

Agreed, as the original need for reschedule (as opposed to remove + schedule) was that we would have to handle failure mid way and recovery, but we're now doing that anyway in TM and due to the generic nature of TM, it's a more complicated task to achieve there.

Some thoughts from @mikecote on potential issues with immutable config:

There could be some challenges if the task state doesn’t transfer (resets throttling, loses alert instance state, etc) but maybe another use case to move alert instance logic out or we copy it over.

If the immutable task approach were to be handled in TM we would have to ensure 100% safety in carrying over state, so this would still require some kind of reconciliation across states.
If, knowing the specific needs of Alerting, we can live with some state loss then it's probably preferable to handle this at Alerting level rather than Task Manager accepting that kind of data loss in general.

@mikecote
Copy link
Contributor

If we take a step back and look at the original problem, it basically comes down the alert not picking up the updated interval right away. We could solve this problem by just running the alert after update ("run now" feature) and let the post execution runAt calculation handle the rest. This would sort of mean we still don't use the interval feature of task manager but maybe because of OCC we shouldn't.

This will at least guarantee the alert won't lose state but at the same time not guarantee the updated interval will take effect immediately. But it will eventually which I think is more important than potentially corrupting a task.

This then moves the remaining problems to the "run now" API for things like "what if it's already running now" (which could be an error or pretend it in fact did run immediately). We can also add some UX to handle these corner cases like a note to say "it may take longer to update the alert's check interval".

This idea is still very fresh and probably has its downsides I didn't get a chance to think of, but food for thought.

@gmmorris
Copy link
Contributor Author

I'm not sure I follow the idea.
Wouldn't that still require setting the runAt of the task and encountering the same issue?
I might be missing something.

Also... sorry, but what's OCC?

@mikecote
Copy link
Contributor

mikecote commented Nov 22, 2019

Wouldn't that still require setting the runAt of the task and encountering the same issue?

The scenario here is that we would / could swallow the error and let the next execution return a new runAt with the updated interval.

what's OCC?

Optimistic Concurrency Control. The cause of the lovely 409 errors.

@gmmorris
Copy link
Contributor Author

ahhh 🤦‍♂ ofcourse.

@gmmorris
Copy link
Contributor Author

Wouldn't that still require setting the runAt of the task and encountering the same issue?

The scenario here is that we would / could swallow the error and let the next execution return a new runAt with the updated interval.

Ah, you mean Alerting would handle that?
Because TM can't afford to ignore the error if it happens in the lifecycle.

@mikecote
Copy link
Contributor

Ah, you mean Alerting would handle that?

Yeah I think we just do that for now, it would check the box for alerting and we can look at maybe letting TM do so later. We could just focus on a run now API which basically calls the claim process for a specific task and skips it if it's not idle state.

I think we learned a lot with this exercise about what updating a task means and how it can go wrong in a few ways. I'm not opposed to still explore the delete + schedule new one either but figured I would put this as an option.

Curious what @peterschretlen and @pmuellr think?

@peterschretlen
Copy link
Contributor

Yeah the "run now" approach sounds promising. And clean - I suspect it would have fewer edge cases than a "delete + schedule" approach.

@gmmorris
Copy link
Contributor Author

Cool, I'll look into that api, that should be fine.
I'll also amend / add issues and pick them up.

@gmmorris
Copy link
Contributor Author

I've began looking into the work needed for run now.
I believe that if we can plug it into the normal lifecycle of Task Manager, essentially triggering an early polling phase with the requested task being handled first (perhaps even above Worker capacity, though not sure we want to do that), then it shouldn't add too much complexity to Task Manager.

One thought though is that we need to give some information about the Task run back to the API caller (I presume), so I was wondering if it's just about giving back success vs. failure (w/ error message) information? Or do we need to give more information like updated state?

Either way there's work in TM to do, as we don't currently expose anything about success/failure other than internally in the runner, but this seems like a good opportunity to figure out how to do it as we said we'd want that for diagnosis anyway.

@mikecote
Copy link
Contributor

Yeah, we should develop this #50214 in preparation for this #50215. We'll be able to re-use this run now API from task manager when creating an API to run alerts immediately.

I think basically setting a runAt to now is good enough and maybe calling a function to search for tasks based on availability of workers. If you want to go one step further, maybe the minimum between now and the existing runAt in the scenario the task manager has a backlog of tasks to run.

We can worry about bypassing available workers and such at a later time if necessary. We can make the function communicate when the task isn't idle, got updated during attempt to set runAt (409) or queued / success.

This will be mostly for server logging purposes for the alert update API to communicate if the task didn't get queued. That way we've done our best to make the new interval take effect but worst case it'll happen at the next run. Also with a server log this will make easier to debug in the future.

@gmmorris
Copy link
Contributor Author

hmm, I wasn't going to update the runAt but rather extend the claim so that on the next poll it'll pick that task up no matter what.
The thinking is that this would mean we avoid the issue of multiple Kibana updating a task in parallel, so it feels safer, and it also means we can reply to the API call with meaningful data (such as task has been run successfully / failed to run with the following error).

If there's no need to return such a response (and we rely on the log instead) then I'd still prefer to utilise the claim step rather than update the field, but then we can simply reply with whether the task has been successfully queued or not.

The one downside to this approach is that whichever Kibana gets the request will handle the execution (rather than allowing any Kibana to pick it up) but this feels much safer than naively updating the field (as it's not just 409s on update I'm worried about, but also 409s due to synchronisation issues).

@mikecote
Copy link
Contributor

You're right, this is probably the approach we should go. Thinking of it, the alert run now API wouldn't respond until execution is complete.

@bmcconaghy bmcconaghy added Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) and removed Team:Stack Services labels Dec 12, 2019
@gmmorris
Copy link
Contributor Author

runNow has been merged into master and I'm working on using it to rerun the alert the moment it's interval is changed.
At the moment, the update method on the AlertsClient returns the updated alert when it .is updated successfully and throws an error if it fails.

We are now going to have two "successful" outcomes to running update:

  1. We've updated the alert, and the interval has successfully been rerun so the interval is up to date.
  2. We've updated the alert, but running it has failed ,so it will update the interval on the next run.

Any ideas how we should notify the user about these things?
The first case can simply return the updated alert. The second, on the other hand, is tricky, as we want the user to understand that the alert has actually been updated successfully, but we also want them to understand that the alert won't run as per the updated interval, but rather the previous value.

Any ideas?

@gmmorris
Copy link
Contributor Author

We discussed the above issue on slack and decided the following:
We won't wait for the runNow to complete, so return value remains the same - but we will log if it fails.

@kobelb kobelb added the needs-team Issues missing a team label label Jan 31, 2022
@botelastic botelastic bot removed the needs-team Issues missing a team label label Jan 31, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
discuss Feature:Task Manager Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) v7.6.0
Projects
None yet
7 participants