
Task Manager v2 #42055

Closed
mikecote opened this issue Jul 26, 2019 · 24 comments
Assignees
Labels
Feature:Task Manager Meta Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams)

Comments

@mikecote
Contributor

mikecote commented Jul 26, 2019

This meta issue is still a work in progress. More feature requests will be added to the list over time.

1. Using updateByQuery to claim tasks

There are performance issues scaling the current implementation to handle hundreds or thousands of tasks per second. Using updateByQuery will allow Task Manager to claim tasks more efficiently by doing all the optimistic concurrency on the Elasticsearch side within a single request.
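
For illustration, a minimal sketch of the idea, assuming the @elastic/elasticsearch client and illustrative field names (task.status, task.runAt, task.ownerId); this is not the actual Task Manager implementation:

```ts
import { Client } from '@elastic/elasticsearch';

const es = new Client({ node: 'http://localhost:9200' });

// Claim up to `capacity` runnable tasks in one round trip. Elasticsearch
// performs the optimistic concurrency control; documents that conflict
// (because another Kibana claimed them first) are simply skipped.
async function claimAvailableTasks(ownerId: string, capacity: number) {
  return es.updateByQuery({
    index: '.kibana_task_manager',
    max_docs: capacity,
    conflicts: 'proceed',
    refresh: true,
    body: {
      query: {
        bool: {
          filter: [
            { term: { 'task.status': 'idle' } },
            { range: { 'task.runAt': { lte: 'now' } } },
          ],
        },
      },
      script: {
        lang: 'painless',
        source: `
          ctx._source.task.status = 'claiming';
          ctx._source.task.ownerId = params.ownerId;
        `,
        params: { ownerId },
      },
    },
  });
}
```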

2. Queued requests (#43589)

Instead of throwing errors when Task Manager isn't ready, it would be nice to queue the requests until they can be processed (e.g. scheduling a task).
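
A minimal sketch of what queuing could look like; whenReady, markAsReady and scheduleInternal are hypothetical names used purely to illustrate the idea:

```ts
// Sketch only: park schedule requests on a "ready" promise instead of throwing
// while Task Manager is still starting up.
class QueuedScheduler {
  private markReady!: () => void;
  private readonly whenReady = new Promise<void>((resolve) => {
    this.markReady = resolve;
  });

  // Called once Task Manager has finished initialising its index.
  public markAsReady() {
    this.markReady();
  }

  // Callers no longer get an error when Task Manager isn't ready; their
  // request is simply queued until it can be processed.
  public async schedule(taskInstance: object) {
    await this.whenReady;
    return this.scheduleInternal(taskInstance);
  }

  private async scheduleInternal(taskInstance: object) {
    // ...write the task to the task manager index...
    return taskInstance;
  }
}
```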

3. Support Single Instance Tasks (#54916)

Following the removal of numWorkers, it is now impossible to prevent two instances of the same task type from running concurrently. This is needed for Reporting, as multiple instances of Chromium can cause issues.
We will aim to support a simple toggle that allows you to mark a certain task type as a "Single Instance" task, which won't allow TM to run two instances of the task concurrently on the same host.
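
As a purely hypothetical sketch of what that toggle might look like on a task definition (the singleInstance flag is an illustrative name, not a confirmed API):

```ts
taskManager.registerTaskDefinitions({
  'reporting:execute': {
    title: 'Reporting: execute job',
    // Hypothetical flag: never run two instances of this task type
    // concurrently on the same host (e.g. to avoid multiple Chromium
    // processes interfering with each other).
    singleInstance: true,
    createTaskRunner: () => ({
      async run() {
        // spawn Chromium, render the report, etc.
      },
    }),
  },
});
```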

4. Convert search to use KQL

This is the only function within the task manager store that still uses callCluster and replicates the saved objects rawToSavedObject functionality. It would be nice to convert it to use KQL once support is implemented (#41136).
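
Once that lands, the search could presumably look something like the sketch below; the filter option and field paths are assumptions based on the linked issue, and savedObjectsClient is assumed to be in scope:

```ts
// Sketch only: replace the callCluster-based search with a saved objects find
// call that accepts a KQL filter (assumed from #41136).
const { saved_objects: tasks } = await savedObjectsClient.find({
  type: 'task',
  filter: 'task.attributes.status: "idle" and task.attributes.taskType: "vis_telemetry"',
  sortField: 'runAt',
  sortOrder: 'asc',
  perPage: 10,
});
```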

5. Encrypted attributes support

Now that task manager uses saved objects, we can pass encrypted attributes as parameters to a given task. This would be useful in alerting to pass the API key to use when executing an action on behalf of an alert. This may require changes to the encrypted saved object plugin to allow custom ids (task manager supports this).
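
As a rough sketch, assuming the encrypted saved objects plugin's registerType API could be pointed at the task type (the attribute choice and schedule call are illustrative):

```ts
// Sketch only: encrypt the task's params attribute so sensitive values such as
// an alerting API key are never stored in plain text.
encryptedSavedObjects.registerType({
  type: 'task',
  attributesToEncrypt: new Set(['params']),
});

// Alerting could then schedule a task whose params carry the API key, and the
// decrypted value would be handed to the task runner at execution time.
await taskManager.schedule({
  taskType: 'alerting:execute',
  params: { alertId: '123', apiKey: '<api key>' },
  state: {},
});
```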

6. History

There should be a separate index containing the history of each task. The data stored should be denormalized from Task Manager (attempts, params, state, etc.). This will probably require Task Manager to never clean up finished tasks.

7. UI

There should be a UI for end users with the following functionality:

  • See what tasks are scheduled
  • See the history for a given task
  • Re-execute a task

8. HTTP API

There should be the following routes:

  • GET /api/task/_search
  • GET /api/task/{id}/history/_search
  • POST /api/task/{id}/_run

9. Documentation

Add asciidoc documentation for the task manager.

10. Testability

Task Manager should be easier to test integrations with. Right now we have to use retry.retryForTime to wait until a task has finished executing. It would be nice to have a way to speed this up and/or a hook to know when a task has finished running.

11. Scheduling

  • Ability to not throw an error when 409 is returned.
  • Ability to overwrite.

12. Cron syntax support

Ability to have more advanced scheduling.

@elasticmachine
Contributor

Pinging @elastic/kibana-stack-services

@Bamieh
Member

Bamieh commented Aug 7, 2019

Maps and oss telemetry are utilizing the task manager. Both would benefit from the overwrite / ignore 409 option in the task manager as they currently log the following warning at server startup:

server    log   [10:24:48.829] [warning][telemetry] Error scheduling task, received [task:oss_telemetry-vis_telemetry]: version conflict, document already exists (current version [3]): [version_conflict_engine_exception] [task:oss_telemetry-vis_telemetry]: version conflict, document already exists (current version [3]), with { index_uuid="mo3VOo_6QtuxedDXE_x0qA" & shard="0" & index=".kibana_task_manager_1" }

https://github.com/elastic/kibana/blob/master/x-pack/legacy/plugins/maps/server/maps_telemetry/telemetry_task.js#L31

https://github.com/elastic/kibana/blob/master/x-pack/legacy/plugins/oss_telemetry/server/lib/tasks/index.ts#L47
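
For context, today these callers schedule with a fixed id and handle the 409 themselves; a hedged sketch of what an opt-in could look like (the option name is illustrative, not a confirmed API):

```ts
// Today: swallow the version conflict in the caller.
try {
  await taskManager.schedule({
    id: 'oss_telemetry-vis_telemetry',
    taskType: 'vis_telemetry',
    params: {},
    state: { stats: {}, runs: 0 },
  });
} catch (e) {
  // version_conflict_engine_exception: the task was already scheduled.
}

// Hypothetical opt-in: let Task Manager treat "already exists" as success.
await taskManager.schedule(
  {
    id: 'oss_telemetry-vis_telemetry',
    taskType: 'vis_telemetry',
    params: {},
    state: { stats: {}, runs: 0 },
  },
  { ignoreConflict: true } // illustrative option name
);
```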

@gmmorris gmmorris self-assigned this Sep 12, 2019
@gmmorris
Contributor

gmmorris commented Sep 16, 2019

I've spent the last couple of days familiarising myself with Task Manager's internals and I'm going to start experimenting with approaches we can take to address the scaling issues mentioned under #1 Using updateByQuery to claim tasks.

There are a couple of quick wins I think I've identified that feel worth trying out, but I don't want to start making random changes without metrics we can work off of ("If you can't measure it, you can't improve it"), so my first priority is to figure out how to measure the current TM and get a baseline.

Looking at @mikecote's update-by-query-demo, it looks like we have a good starting point, but I'd like to collect stats for running this number of tasks through Kibana itself, so I'm going to see about setting up an equivalent test against Kibana & ES instances and the actual TM.

Once I have that environment set up and have measured a performance baseline, I'm going to try the following experiments:

  1. Change TaskPool.attemptToRun so that the task.claimOwnership of each task is run concurrently (currently, we move sequentially from task to task, with only task.run running concurrently). This will allow us to mostly parallelise the work being done in ES.
    This is an easy change to make, and once we can measure it against the baseline we can see if this quick win has any impact.

  2. Decouple claiming ownership of a task from scheduling a retry for said task, as they don't actually have to happen at the same time. This can be achieved by changing TaskStore.fetchAvailableTasks to an update by query as per Mike's experiment, and performing the bulk update of the retryAt field (currently made as part of task.claimOwnership) asynchronously to the actual task execution. This should reduce the number of updates we run against ES prior to attempting the work.

Once we have metrics for the baseline, the concurrent claimOwnership, and the decoupled schedule-retry and attempt-work steps, we'll hopefully have a better idea of what we can handle scale-wise.
The goal, broadly, is to reduce the number of calls we make to ES and to parallelise these calls wherever we can decouple them.
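
As a rough sketch of experiment 1 (the TaskRunner shape and availableWorkers are assumptions based on the description above, not the actual code):

```ts
interface TaskRunner {
  claimOwnership(): Promise<boolean>;
  run(): Promise<void>;
}

// Claim ownership of a batch of candidate tasks concurrently instead of one
// task at a time, then run whatever we managed to claim.
async function attemptToRun(candidates: TaskRunner[], availableWorkers: number) {
  const batch = candidates.slice(0, availableWorkers);

  // All claimOwnership calls go out to Elasticsearch in parallel.
  const claims = await Promise.all(
    batch.map(async (task) => ({ task, claimed: await task.claimOwnership() }))
  );

  // Only the tasks we successfully claimed are run (also in parallel).
  await Promise.all(
    claims.filter(({ claimed }) => claimed).map(({ task }) => task.run())
  );
}
```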

@gmmorris
Contributor

There's now a branch on my fork that includes an integration test which spawns tasks and measures how many tasks are run per second and what the lead time is from a scheduled task reaching its runAt time until it actually runs.
The test is currently marked as skip, so it could in theory be merged into master (for future use), but I'll hold off on that for now.

You can see it in this draft PR.

The results look like this:

❯ node ../scripts/functional_test_runner.js --config=test/plugin_api_perf/config.js
 debg Loading config file from "/Users/gidi/development/elastic/kibana/x-pack/test/plugin_api_perf/config.js"
 debg Loading config file from "/Users/gidi/development/elastic/kibana/x-pack/test/api_integration/config.js"
 debg Loading config file from "/Users/gidi/development/elastic/kibana/x-pack/test/functional/config.js"
 debg Loading config file from "/Users/gidi/development/elastic/kibana/test/common/config.js"
 debg Loading config file from "/Users/gidi/development/elastic/kibana/test/functional/config.js"
 info Config loaded
 debg randomness seed: 1568634458073
 info Starting tests

 └-: task_manager_perf
   └-> "before all" hook
   └-: stressing task manager
     └-> "before all" hook
     └-> should run 10 tasks every second for a minute
       └-> "before each" hook: global before each
       └-> "before each" hook
       │ debg Stress Test Result: {"runningAverageTasks":4.714285714285714,"averagesTaken":[3,5,5,4,5,5,5,5,5,4,5,5,5,5],"runningAverageLeadTime":8071.703571428571,"averagesTakenLeadTime":[2019,3306.4,5996.6,8051.5,8723,8932.2,10581.2,9329.2,9342.4,9345.75,9346.4,8721.8,9962.6,9345.8],"leadTimeQueue":[]}
       └- ✓ pass  (1.0m) "task_manager_perf stressing task manager should run 10 tasks every second for a minute"
     └-> "after all" hook
   └-> "after all" hook


1 passing (1.0m)

You'll also see the server logging the number of tasks run within consecutive 5-second windows:

   │ proc [kibana] I have processed 0 tasks in the past 5s
   │ succ
   │
   │      Elasticsearch and Kibana are ready for functional testing. Start the functional tests
   │      in another terminal session by running this command from this directory:
   │
   │          node ../scripts/functional_test_runner
   │
   │
   │ proc [kibana] I have processed 0 tasks in the past 5s
   │ proc [kibana] I have processed 0 tasks in the past 5s
   │ proc [kibana] I have processed 3 tasks in the past 5s
   │ proc [kibana] I have processed 5 tasks in the past 5s
   │ proc [kibana] I have processed 5 tasks in the past 5s
   │ proc [kibana] I have processed 4 tasks in the past 5s
   │ proc [kibana] I have processed 5 tasks in the past 5s
   │ proc [kibana] I have processed 5 tasks in the past 5s
   │ proc [kibana] I have processed 5 tasks in the past 5s
   │ proc [kibana] I have processed 5 tasks in the past 5s
   │ proc [kibana] I have processed 5 tasks in the past 5s
   │ proc [kibana] I have processed 4 tasks in the past 5s
   │ proc [kibana] I have processed 5 tasks in the past 5s
   │ proc [kibana] I have processed 5 tasks in the past 5s
   │ proc [kibana] I have processed 5 tasks in the past 5s
   │ proc [kibana] I have processed 5 tasks in the past 5s
   │ proc [kibana] I have processed 4 tasks in the past 5s

This will help us measure a baseline for comparison, but it's obviously not a proper stress-test environment.
It's just meant to give us a way of measuring the impact of the changes we make from here on out.

@pmuellr
Member

pmuellr commented Sep 16, 2019

There's now a branch on my fork that includes an integration test which spawns tasks and measures how many tasks are run per second and what the lead time is from a scheduled task reaching its runAt time until it actually runs.
The test is currently marked as skip, so it could in theory be merged into master (for future use), but I'll hold off on that for now.

Awesome; I like the idea of marking these as skip in the FT, meaning we can keep them with the rest of the FT tests and easily turn them on locally for some testing.

@gmmorris
Contributor

gmmorris commented Sep 16, 2019

Glad you like it. :)

A local run does seem to confirm that our current implementation caps out at a handful of tasks per second, which is what Mike already found, so I've moved on to trying to implement the first quick win (parallelise claim ownership).

I'll update when I have a working prototype.

@gmmorris
Contributor

@mikecote can you add this to your list above (it feels wrong just editing your comment):

Support Single Instance Tasks

Following the removal of numWorkers, it is now impossible to prevent two instances of the same task type from running concurrently. This is needed for Reporting, as multiple instances of Chromium can cause issues.
We will aim to support a simple toggle that allows you to mark a certain task type as a "Single Instance" task, which won't allow TM to run two instances of the task concurrently on the same host.

@mikecote
Contributor Author

@gmmorris as discussed, feel free to edit the description at your leisure. You can re-organize it as you like. The meta issue is for the team to edit and collaborate on.

@gmmorris
Contributor

I've pushed a change which applies the first experiment.

I'll add some more extensive docs around this, but the core change is this:
We queue the tasks we wish to take ownership of and drain that queue, parallelising the claimOwnership step while ensuring we never attempt to claim ownership of more tasks than we have capacity for.
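
A simplified sketch of that queue-and-drain pattern (names and capacity handling are approximations of the actual change, not the code in the PR):

```ts
interface TaskRunner {
  claimOwnership(): Promise<boolean>;
  run(): Promise<void>;
}

// Tasks we want to own are pushed onto a queue and drained with bounded
// concurrency, so we never try to claim more tasks than we have capacity for.
class ClaimQueue {
  private readonly queue: TaskRunner[] = [];
  private inFlight = 0;

  constructor(private readonly capacity: number) {}

  public push(task: TaskRunner) {
    this.queue.push(task);
    this.drain();
  }

  private drain() {
    while (this.inFlight < this.capacity && this.queue.length > 0) {
      const task = this.queue.shift()!;
      this.inFlight++;
      task
        .claimOwnership()
        .then((claimed) => (claimed ? task.run() : undefined))
        .catch(() => {
          // errors swallowed for brevity; the real code surfaces them
        })
        .finally(() => {
          this.inFlight--;
          this.drain();
        });
    }
  }
}
```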

The results from a local run on my laptop:


 └-: task_manager_perf
   └-> "before all" hook
   └-: stressing task manager
     └-> "before all" hook
     └-> should run 10 tasks every second for a minute
       └-> "before each" hook: global before each
       └-> "before each" hook
       │ debg Stress Test Result: {"runningAverageTasks":20,"averagesTaken":[1,11,18,25,25,20,25,20,25,20,20,25,20,25],"runningAverageLeadTime":1310.2527777777777,"averagesTakenLeadTime":[1921,1572,1401.888888888889,1190.96,1192.6,1162.15,1246.12,1282.05,1227,1201.7,1247.8,1234.36,1244.75,1219.16],"leadTimeQueue":[1237,1235,1237,1239,1239,1238,1240,1243,1241,1242,1194,1196,1194,1197,1197]}
       └- ✓ pass  (1.0m) "task_manager_perf stressing task manager should run 10 tasks every second for a minute"
     └-> "after all" hook
   └-> "after all" hook


1 passing (1.0m)
{
	"runningAverageTasks": 20,
	"averagesTaken": [1, 11, 18, 25, 25, 20, 25, 20, 25, 20, 20, 25, 20, 25],
	"runningAverageLeadTime": 1310.2527777777777,
	"averagesTakenLeadTime": [1921, 1572, 1401.888888888889, 1190.96, 1192.6, 1162.15, 1246.12, 1282.05, 1227, 1201.7, 1247.8, 1234.36, 1244.75, 1219.16],
	"leadTimeQueue": [1237, 1235, 1237, 1239, 1239, 1238, 1240, 1243, 1241, 1242, 1194, 1196, 1194, 1197, 1197]
}

This suggests that the change has the following impact (rough numbers, as it's my laptop, but proportionally we can still learn a lot):

  1. We're able to run about 4-5 times as many tasks per second as we could before.
  2. The lead time from when a task should have run until it actually ran dropped from about 8s to 1s.

It would be interesting to run this in a more substantial environment such as AWS.

@gmmorris
Contributor

As much as I like this change, I'm disappointed by how little of an impact it has actually had.
This suggests the time spent talking to ES really is the bottleneck and we should focus on reducing that, but that is also the more complex and dangerous change, which is why I deferred it until after parallelising the claimOwnership step.

That's my next task.

@gmmorris
Contributor

gmmorris commented Sep 30, 2019

If we were to add a field onto tasks which helps us identify when a Kibana instance tried to take ownership of a task and when it might be eligible to have its ownership taken by another instance (this is to prevent a situation where ownership is taken but retryAt isn't updated properly, and so a task might fall between the cracks), would you prefer:

  1. an ownershipExpiresAt field that specifies a date and time after which another Kibana instance is allowed to try to take ownership.

OR

  2. an ownedAt field that specifies when a Kibana instance took ownership, so that in code we'll specify the interval after which we're allowed to try to take ownership again.

I have my own preference, but I don't want to anchor - so what do you think? :)


But why do we even need this?
I'm trying to merge the take-ownership and query-for-tasks steps using an updateByQuery, but as a first step I want to keep the update of retryAt where it lives at the moment, as it's a little too complex to make part of the updateByQuery.
This change creates a dangerous possible state where a task is owned by a Kibana instance but never gets its retryAt field updated, and that could lead to a task falling between the cracks.
I want the query step to look at this ownership field to know when a task might have fallen between the cracks and pick it up.

@tsullivan
Member

I prefer option 2, adding the ownedAt field. But to stay consistent with the verb terminology in other places in Task Manager, maybe claimedAt would be better.

Is there already a field for the UUID of the Kibana server that claims the task? Joel and I have recently added fields like that for Reporting-over-ESQueue, and we've just labeled them kibana_id and kibana_name.

@gmmorris
Contributor

Thanks @tsullivan. How do you feel, though, about the fact that the retryAt field, which is the main way in which we invalidate a certain state, is in the future?
I was originally thinking of using retryAt for both expiring ownership and expiring a run, but felt that could harm sustainability due to confusion about what that field actually means.

If we use claimedAt/ownedAt then it would be a DateTime in the past while retryAt is in the future, despite both being used to expire a certain state.
Would these two being different in this sense concern you?

Regarding the UUID: yes, I've added an owner field which is just that. :)

@mikecote
Contributor Author

I was originally thinking of using retryAt for both expiring ownership and expiring a run, but felt that could harm sustainability due to confusion about what that field actually means

Another way I see it: we could introduce a new claiming status and re-use the retryAt field. That would solve your concern about it not being clear.
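
To make that concrete, the claim query could then pick up both idle tasks that are due and claims/runs whose retryAt has expired; a sketch with assumed field names:

```ts
// Sketch only: the query half of the claim step. Idle tasks whose runAt has
// passed are claimable, and so are tasks stuck in 'claiming' (or 'running')
// whose retryAt has passed, i.e. whose claim has effectively expired.
const claimableTasksQuery = {
  bool: {
    should: [
      {
        bool: {
          filter: [
            { term: { 'task.status': 'idle' } },
            { range: { 'task.runAt': { lte: 'now' } } },
          ],
        },
      },
      {
        bool: {
          filter: [
            { terms: { 'task.status': ['claiming', 'running'] } },
            { range: { 'task.retryAt': { lte: 'now' } } },
          ],
        },
      },
    ],
  },
};
```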

@gmmorris
Contributor

Hmm, good idea, I like that. 🤔

@tsullivan
Member

In cases where claiming a task failed, I'm not sure what value the existing retryAt field will have towards resolving the state of the task.

If claiming failed and retryAt is in the past, it makes sense to retry right away.

If claiming failed and retryAt is in the future, it sounds like the deployment of Kibana servers has out-of-sync settings: system clock, for one.

In general, I lean towards adding more fields to help ensure a task state is stable, and a run doesn't fall through the cracks. They give us more debugging power if we ever need to diagnose something going wrong.

I was originally thinking of using retryAt for both expiring ownership and expiring a run, but felt that could harm sustainability due to confusion about what that field actually means.

This would be my concern too. I like Mike's idea to introduce the new claiming status since that would create a new variant of task state that would be understood and can be explicitly handled.

@gmmorris
Contributor

It's not so much that claiming might fail; it's that an instance will have claimed a task (as part of the fetch, in a single call) and for some reason never gets a chance to update it from 'claiming' to 'running', and in the meantime retryAt will pass and the claim will, in essence, expire.

I think that's enough of an indicator and lets us avoid a new field.

Does that make sense?

@gmmorris
Contributor

gmmorris commented Oct 17, 2019

Update on Using updateByQuery to claim tasks:

Task Manager now uses updateByQuery to claim available tasks in a single call.
This has been merged into master and will be released as part of version 7.5.

In the meantime, we have extended the SavedObjectsClient(s) by adding a bulkUpdate operation which can update multiple saved objects in a single call.
This has been integrated as well, and will be released as part of 7.6.

Work is now being done to use this new functionality as part of the operations on tasks (the goal is for the two update steps, marking a claimed task as running and marking a task run as complete, to be done in bulk whenever possible).
PR for this is over here, and hopefully this will be integrated shortly.

In local perf tests this has bumped up the number of tasks Kibana can handle per second considerably, due to parallelisation of the mark-tasks-as-running step and the drop in conflicts between multiple Kibana instances attempting to take ownership of the same task.
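
For readers following along, a hedged sketch of what marking a batch of claimed tasks as running via the new bulkUpdate might look like (attribute names and the surrounding variables are assumptions, not the code in the PR):

```ts
// Sketch only: one bulkUpdate call instead of one update per claimed task.
// `savedObjectsClient`, `claimedTasks` and `defaultTimeout` are assumed to be
// in scope; attribute names are illustrative.
const { saved_objects: markedAsRunning } = await savedObjectsClient.bulkUpdate(
  claimedTasks.map((task) => ({
    type: 'task',
    id: task.id,
    version: task.version, // per-document optimistic concurrency
    attributes: {
      status: 'running',
      startedAt: new Date().toISOString(),
      retryAt: new Date(Date.now() + defaultTimeout).toISOString(),
    },
  }))
);
```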

We're now trying to figure out how to perform some more extensive perf tests.

@gmmorris
Contributor

Submitted an issue to Elasticsearch regarding the potential optimisations we've discussed in updateByQuery:

elastic/elasticsearch#48624

cc @peterschretlen

@gmmorris
Contributor

Do we have any thoughts on the prioritisation considerations between the following:

  1. Update tasks, including runAt [DISCUSS] Task manager update API to allow changing a task's interval #45152
  2. Task Manager UI Task manager UI #40029
  3. New Platform [Task Manager] Convert plugin to New Platform #49629

At the moment I'm treating them as being in the above priority order, but I'm not sure if there was some past conversation around this before my time.

@gmmorris
Contributor

gmmorris commented Oct 29, 2019

Had a chat with @bmcconaghy and he feels the correct prioritisation is:

  1. Update tasks, including runAt [DISCUSS] Task manager update API to allow changing a task's interval #45152
  2. New Platform [Task Manager] Convert plugin to New Platform #49629
  3. Task Manager UI Task manager UI #40029

This is because the NP work, broadly, is high priority.

@gmmorris
Contributor

gmmorris commented Nov 1, 2019

A new issue has appeared in relation to the Performance Improvements we released in 7.5.
It seems users can disable inline scripts, which means Task Manager can no longer work.
#49853

At the moment this doesn't break anything new, as it appears the components using TM already rely on inline scripts and are failing anyway, but long term we don't want this to be the case, so the issue will be discussed further.

@gmmorris
Contributor

gmmorris commented Nov 11, 2019

Feels like the scheduleIfNotExists issue should be prioritised, so I've updated the priorities in TM:

  1. add scheduleIfNotExists api Task vis_telemetry "oss_telemetry-vis_telemetry" failed in attempt to run warning in Kibana logs #32050
  2. Update tasks, including runAt [DISCUSS] Task manager update API to allow changing a task's interval #45152
  3. New Platform [Task Manager] Convert plugin to New Platform #49629
  4. Advanced Scheduling Advance scheduling (cron syntax, etc). #50272
  5. Task Manager UI Task manager UI #40029

@bmcconaghy bmcconaghy added Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) and removed Team:Stack Services labels Dec 12, 2019
@ymao1
Contributor

ymao1 commented Mar 18, 2021

Closing as outdated.

@ymao1 ymao1 closed this as completed Mar 18, 2021
@kobelb kobelb added the needs-team Issues missing a team label label Jan 31, 2022
@botelastic botelastic bot removed the needs-team Issues missing a team label label Jan 31, 2022