Task Manager v2 #42055
Pinging @elastic/kibana-stack-services
Maps and oss telemetry are utilizing the task manager. Both would benefit from the overwrite / ignore 409 option in the task manager as they currently log the following warning at server startup:
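To make the requested option concrete, here is a minimal sketch of what an "ignore 409" scheduling helper could look like at plugin startup; the `ensureScheduled` name, the `TaskManager` surface, and the error shape are assumptions for illustration, not the actual task manager API.

```ts
// Hypothetical sketch only: an "ignore 409" style helper for scheduling a
// recurring task at startup. The TaskManager surface and error shape below
// are assumptions, not the real plugin API.
interface TaskInstance {
  id: string;
  taskType: string;
  params: Record<string, unknown>;
  state: Record<string, unknown>;
}

interface TaskManager {
  schedule(task: TaskInstance): Promise<TaskInstance>;
}

async function ensureScheduled(taskManager: TaskManager, task: TaskInstance): Promise<void> {
  try {
    await taskManager.schedule(task);
  } catch (err: any) {
    // A 409 means the task document already exists (e.g. created by another
    // Kibana instance or a previous startup), which is fine for singleton tasks.
    if (err?.statusCode === 409) {
      return;
    }
    throw err;
  }
}
```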
I've spent the last couple of days familiarising myself with Task Manager's internals and I'm going to start experimenting with approaches we can take to address the scaling issues mentioned under #1 Using updateByQuery to claim tasks. I think I've identified a couple of quick wins that feel worth trying out, but I don't want to start making random changes without metrics we can work off of ("If you can't measure it, you can't improve it"), so my first priority is to figure out how to get measurements of the current TM and establish a baseline. Looking at @mikecote's update-by-query-demo it looks like we have a good starting point, but I'd like to collect stats for running this amount of tasks through Kibana itself, so I'm going to see about setting up an equivalent test against Kibana & ES instances and the actual TM. Once I have that environment set up and have measured a baseline of performance, I'm going to try the following experiments:
Once we have metrics for the baseline, the concurrent
There's now a branch on my fork that includes an integration test which spawns tasks and measures how many tasks are run per second and what the lead time is from a scheduled task reaching its runAt time until it actually runs. You can see it in this draft PR. The results look like this:
and you'll see the server logging the number of tasks run within consecutive 5-second windows.
This will help us measure a baseline for comparison, but it's obviously not a proper stress-test environment.
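For illustration, here is a rough sketch of how task runs could be counted in consecutive 5-second windows, similar in spirit to what the integration test logs; the function names and the in-memory map are assumptions, not code from the branch.

```ts
// Illustrative only: count completed task runs in consecutive 5-second windows,
// similar in spirit to what the integration test logs. Names are assumptions.
const WINDOW_MS = 5000;
const runsPerWindow = new Map<number, number>();

// Called whenever a test task finishes running.
function recordTaskRun(now: number = Date.now()): void {
  const windowStart = Math.floor(now / WINDOW_MS) * WINDOW_MS;
  runsPerWindow.set(windowStart, (runsPerWindow.get(windowStart) ?? 0) + 1);
}

// Log the throughput observed in each window so far.
function logThroughput(): void {
  for (const [windowStart, count] of runsPerWindow) {
    console.log(`tasks run in the 5s window starting ${new Date(windowStart).toISOString()}: ${count}`);
  }
}
```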
Awesome; and I like the idea of marking these as skip in the FT, meaning we can keep them with the rest of the FT tests and easily turn them on locally for some testing.
Glad you like it. :) A local run does seem to confirm that our current implementation caps out at a handful of tasks per second, which is what Mike already found, so I've moved on to trying to implement the first quick win (parallelise claim ownership). I'll update when I have a working prototype.
@mikecote can you add this to your list above (it feels wrong just editing your comment): Support Single Instance Tasks. Following the removal of numWorkers it is now impossible to prevent two instances of the same task type from running concurrently. This is needed for Reporting as multiple instances of Chromium can cause issues.
@gmmorris as discussed, feel free to edit the description at your own leisure. You can re-organize this as you like. The meta issue is for the team to edit and collaborate.
I've pushed a change which applies the first experiment. I'll add some more extensive docs around this, but the core change is this:
The results from a local run on my laptop:
This suggests that this change has the following impact (sketchy numbers, as it's my laptop, but proportionally we can still learn a lot):
It would be interesting to run this in a more substantial environment such as AWS.
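To sketch the idea behind the "parallelise claim ownership" experiment (under assumed names, not the actual change): claim a batch of candidate tasks, then attempt to mark them all as running concurrently rather than one at a time, accepting that some will be lost to version conflicts.

```ts
// Sketch of the "parallelise claim ownership" experiment, under assumed names:
// fetch a batch of candidate tasks, then mark them as running concurrently
// instead of sequentially, treating version conflicts as expected losses.
interface ClaimedTask {
  id: string;
  version: string;
}

interface TaskStore {
  fetchAvailableTasks(size: number): Promise<ClaimedTask[]>;
  markAsRunning(task: ClaimedTask): Promise<boolean>; // false on a 409 conflict
}

async function claimAvailableTasks(store: TaskStore, availableWorkers: number): Promise<ClaimedTask[]> {
  const candidates = await store.fetchAvailableTasks(availableWorkers);

  // Attempt all ownership updates in parallel; another Kibana instance winning
  // the race on some of them is fine, we simply drop those tasks.
  const results = await Promise.all(
    candidates.map(async (task) => ((await store.markAsRunning(task)) ? task : null))
  );

  return results.filter((task): task is ClaimedTask => task !== null);
}
```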
As much as I like this change, I'm disappointed by how little of an impact it has actually had. That's my next task.
If we were to add a field onto tasks which helps us identify when a Kibana instance tried to take ownership of a task and when it might be eligible to have its ownership taken by another instance (this is to prevent a situation where ownership is taken, but
OR
I have my own preference, but I don't want to anchor, so what do you think? :) But why do we even need this?
I prefer option 2, adding the field. Is there already a field for the UUID of the Kibana server that claims the task? Joel and I have recently added fields like that for Reporting-over-ESQueue, and we've just labeled them
Thanks @tsullivan, how do you feel about the fact that the …
If we use …
Regarding the UUID: yes, I've added a field for it.
Another way I see it is we could introduce a new claiming status.
hmm good idea, I like that. 🤔
In cases where claiming a task failed, I'm not sure what value the existing field would hold.
If claiming failed and …
If claiming failed and …
In general, I lean towards adding more fields to help ensure a task's state is stable and a run doesn't fall through the cracks. They give us more debugging power if we ever need to diagnose something going wrong.
This would be my concern too. I like Mike's idea to introduce the new claiming status.
It's not so much that claiming might fail; it's that an instance will have claimed (as part of the fetch, as a single call) and for some reason then never got a chance to update from 'claiming' to 'running', and in the meantime … I think that's enough of an indicator and lets us avoid a new field. Does that make sense?
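Here is a sketch of how the claim query could treat a stale 'claiming' task as available again without a new field, relying on the existing retry timestamp; the field names (`task.status`, `task.runAt`, `task.retryAt`) are assumptions based on this discussion, not the actual mapping.

```ts
// A sketch (not the actual implementation) of a claim query that treats a task
// stuck in 'claiming' as available again once its retry timestamp has passed,
// avoiding a new field. Field names are assumptions based on this discussion.
function availableTasksQuery(now: Date) {
  return {
    bool: {
      should: [
        // Idle tasks whose runAt has arrived.
        {
          bool: {
            must: [
              { term: { 'task.status': 'idle' } },
              { range: { 'task.runAt': { lte: now.toISOString() } } },
            ],
          },
        },
        // Tasks left in 'claiming' or 'running' past their retryAt, e.g. because
        // the instance that claimed them never got to mark them as running.
        {
          bool: {
            must: [
              { terms: { 'task.status': ['claiming', 'running'] } },
              { range: { 'task.retryAt': { lte: now.toISOString() } } },
            ],
          },
        },
      ],
    },
  };
}
```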
Update on Using updateByQuery to claim tasks:
In the meantime, we have extended the … Work is now being done to use this new functionality as part of the operations on tasks (the goal is for the two update steps, marking a claimed task as running and marking a task run as complete, to be done in bulk whenever possible). In local perf tests this has bumped up the number of tasks Kibana can handle per second considerably, due to parallelisation of the mark-tasks-as-running step and the drop in conflicts between multiple Kibana instances attempting to take ownership of the same task. We're now trying to figure out how to perform some more extensive perf tests.
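For context, here is a simplified sketch of what claiming tasks through a single updateByQuery request can look like; the index name, field names, and script are illustrative assumptions, not the shipped implementation.

```ts
import { Client } from '@elastic/elasticsearch';

// Simplified sketch of claiming tasks with a single updateByQuery request: each
// matching document is flipped to 'claiming' and stamped with this instance's
// ownerId on the Elasticsearch side, so conflicts between Kibana instances are
// resolved in one round trip. Index name, fields and script are illustrative.
async function claimTasks(es: Client, ownerId: string, maxDocs: number) {
  return es.updateByQuery({
    index: '.kibana_task_manager',
    max_docs: maxDocs,
    conflicts: 'proceed', // tasks claimed by another instance are skipped, not fatal
    body: {
      query: {
        bool: {
          must: [
            { term: { 'task.status': 'idle' } },
            { range: { 'task.runAt': { lte: 'now' } } },
          ],
        },
      },
      script: {
        lang: 'painless',
        source: "ctx._source.task.status = 'claiming'; ctx._source.task.ownerId = params.ownerId;",
        params: { ownerId },
      },
    },
  });
}
```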
Submitted an issue to Elasticsearch regarding the potential optimisations we've discussed.
Do we have any thoughts on the prioritisation considerations between the following:
At the moment I'm treating them as the above priority, but I'm not sure if there was some past conversation around this before my time.
Had a chat with @bmcconaghy and he feels the correct prioritisation is:
As the NP (New Platform) work, broadly, is high priority.
A new issue has appeared in relation to the Performance Improvements we released in 7.5. At the moment this doesn't break anything new, as it appears the components using TM already rely on inline scripts and are failing anyway, but long term we don't want this to be the case, so the issue will be discussed further.
Feels like the
Closing as outdated.
This meta issue is still a WIP. More feature requests will be added to the list over time.
1. Using updateByQuery to claim tasks
There are performance issues scaling the current implementation to handle hundreds or thousands of tasks per second. Using updateByQuery will allow task manager to claim tasks more efficiently by doing all the optimistic concurrency on the Elasticsearch side within a single request.
2. Queued requests (#43589)
Instead of throwing errors when task manager isn't ready, it would be nice to queue the requests until they can be processed (e.g. scheduling a task).
3. Support Single Instance Tasks (#54916)
Following the removal of numWorkers it is now impossible to prevent two instances of the same task type from running concurrently. This is needed for Reporting as multiple instances of Chromium can cause issues.
We will aim to support a simple toggle that allows you to mark a certain task type as a "Single Instance" task, which won't allow TM to run two instances of that task concurrently on the same host (see the sketch after this list).
4. Convert search to use KQL
This is the only function within the task manager store that still uses `callCluster` and replicates the saved objects' `rawToSavedObject` functionality. It would be nice to convert it to use KQL once the support is implemented (#41136).
5. Encrypted attributes support
Now that task manager uses saved objects, we can pass encrypted attributes as parameters to a given task. This would be useful in alerting to pass the API Key to use when executing an action on behalf of an alert. This may require changes to the encrypted saved object plugin to allow custom ids (task manager supports this).
6. History
There should be a separate index containing the history of each task. The data stored should be denormalized from the task manager (`attempts`, `params`, `state`, etc). This will probably require task manager to never clean up finished tasks.
7. UI
There should be a UI for end users with the following functionality:
8. HTTP API
There should be the following routes:
- `/api/task/_search`
- `/api/task/{id}/history/_search`
- `/api/task/{id}/_run`
9. Documentation
Add asciidoc documentation for the task manager.
10. Testability
Task manager should be easier to test integrations with. Right now we have to use `retry.retryForTime` to wait until a task finishes execution. It would be nice to have a way to speed up and/or have a hook to know when a task has finished running.
11. Scheduling
Ability to not throw an error when a 409 is returned.
12. Cron syntax support
Ability to have more advanced scheduling.
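As referenced under item 3, here is a minimal sketch of how a per-task-type "single instance" toggle could be enforced on one Kibana host; the `maxConcurrency` flag and surrounding types are hypothetical, not the task manager's actual registration API.

```ts
// Minimal sketch of how a "single instance" toggle on a task type could be
// enforced within one Kibana host: an assumed maxConcurrency flag plus an
// in-memory count of running tasks per type. Hypothetical, not the real API.
interface TaskTypeDefinition {
  type: string;
  maxConcurrency?: number; // 1 => "Single Instance" task type
  run(): Promise<void>;
}

const runningCounts = new Map<string, number>();

// Returns false (instead of starting the task) when the concurrency cap for
// this task type has already been reached on this host.
async function runTask(def: TaskTypeDefinition): Promise<boolean> {
  const running = runningCounts.get(def.type) ?? 0;
  const limit = def.maxConcurrency ?? Infinity;

  if (running >= limit) {
    return false;
  }

  runningCounts.set(def.type, running + 1);
  try {
    await def.run();
    return true;
  } finally {
    runningCounts.set(def.type, (runningCounts.get(def.type) ?? 1) - 1);
  }
}
```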