
Task Manager v2 #42055

Closed
mikecote opened this issue Jul 26, 2019 · 24 comments
Assignees
Labels
Feature:Task Manager Meta Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams)

Comments

@mikecote
Contributor

mikecote commented Jul 26, 2019

This meta issue is still a work in progress. More feature requests will be added to the list over time.

1. Using updateByQuery to claim tasks

There are performance issues scaling the current implementation to handle hundreds or thousands of tasks per second. Using updateByQuery will allow Task Manager to claim tasks more efficiently by doing all the optimistic concurrency on the Elasticsearch side within a single request.
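
For illustration, a minimal sketch of the idea, assuming the @elastic/elasticsearch client and illustrative field names (task.status, task.runAt, task.ownerId); this is not the actual Task Manager implementation:

```ts
import { Client } from '@elastic/elasticsearch';

const es = new Client({ node: 'http://localhost:9200' });

// Claim up to `capacity` runnable tasks in one round trip. Elasticsearch
// performs the optimistic concurrency control; documents that conflict
// (because another Kibana claimed them first) are simply skipped.
async function claimAvailableTasks(ownerId: string, capacity: number) {
  return es.updateByQuery({
    index: '.kibana_task_manager',
    max_docs: capacity,
    conflicts: 'proceed',
    refresh: true,
    body: {
      query: {
        bool: {
          filter: [
            { term: { 'task.status': 'idle' } },
            { range: { 'task.runAt': { lte: 'now' } } },
          ],
        },
      },
      script: {
        lang: 'painless',
        source: `
          ctx._source.task.status = 'claiming';
          ctx._source.task.ownerId = params.ownerId;
        `,
        params: { ownerId },
      },
    },
  });
}
```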

2. Queued requests (#43589)

Instead of throwing errors when Task Manager isn't ready, it would be nice to queue the requests until they can be processed (e.g. scheduling a task).
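
A minimal sketch of what queuing could look like; whenReady, markAsReady and scheduleInternal are hypothetical names used purely to illustrate the idea:

```ts
// Sketch only: park schedule requests on a "ready" promise instead of throwing
// while Task Manager is still starting up.
class QueuedScheduler {
  private markReady!: () => void;
  private readonly whenReady = new Promise<void>((resolve) => {
    this.markReady = resolve;
  });

  // Called once Task Manager has finished initialising its index.
  public markAsReady() {
    this.markReady();
  }

  // Callers no longer get an error when Task Manager isn't ready; their
  // request is simply queued until it can be processed.
  public async schedule(taskInstance: object) {
    await this.whenReady;
    return this.scheduleInternal(taskInstance);
  }

  private async scheduleInternal(taskInstance: object) {
    // ...write the task to the task manager index...
    return taskInstance;
  }
}
```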

3. Support Single Instance Tasks (#54916)

Following the removal of numWorkers, it is now impossible to prevent two instances of the same task type from running concurrently. This is needed for Reporting, as multiple instances of Chromium can cause issues.
We will aim to support a simple toggle that allows you to mark a certain task type as a "Single Instance" task, which won't allow TM to run two instances of the task concurrently on the same host.
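
As a purely hypothetical sketch of what that toggle might look like on a task definition (the singleInstance flag is an illustrative name, not a confirmed API):

```ts
taskManager.registerTaskDefinitions({
  'reporting:execute': {
    title: 'Reporting: execute job',
    // Hypothetical flag: never run two instances of this task type
    // concurrently on the same host (e.g. to avoid multiple Chromium
    // processes interfering with each other).
    singleInstance: true,
    createTaskRunner: () => ({
      async run() {
        // spawn Chromium, render the report, etc.
      },
    }),
  },
});
```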

4. Convert search to use KQL

This is the only function within the task manager store that still uses callCluster and replicates the saved objects rawToSavedObject functionality. It would be nice to convert it to use KQL once support is implemented (#41136).
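
Once that lands, the search could presumably look something like the sketch below; the filter option and field paths are assumptions based on the linked issue, and savedObjectsClient is assumed to be in scope:

```ts
// Sketch only: replace the callCluster-based search with a saved objects find
// call that accepts a KQL filter (assumed from #41136).
const { saved_objects: tasks } = await savedObjectsClient.find({
  type: 'task',
  filter: 'task.attributes.status: "idle" and task.attributes.taskType: "vis_telemetry"',
  sortField: 'runAt',
  sortOrder: 'asc',
  perPage: 10,
});
```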

5. Encrypted attributes support

Now that task manager uses saved objects, we can pass encrypted attributes as parameters to a given task. This would be useful in alerting to pass the API key to use when executing an action on behalf of an alert. This may require changes to the encrypted saved object plugin to allow custom ids (task manager supports this).
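
As a rough sketch, assuming the encrypted saved objects plugin's registerType API could be pointed at the task type (the attribute choice and schedule call are illustrative):

```ts
// Sketch only: encrypt the task's params attribute so sensitive values such as
// an alerting API key are never stored in plain text.
encryptedSavedObjects.registerType({
  type: 'task',
  attributesToEncrypt: new Set(['params']),
});

// Alerting could then schedule a task whose params carry the API key, and the
// decrypted value would be handed to the task runner at execution time.
await taskManager.schedule({
  taskType: 'alerting:execute',
  params: { alertId: '123', apiKey: '<api key>' },
  state: {},
});
```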

6. History

There should be a separate index containing the history of each task. The data stored should be denormalized from Task Manager (attempts, params, state, etc.). This will probably require Task Manager to never clean up finished tasks.

7. UI

There should be a UI for end users with the following functionality:

  • See what tasks are scheduled
  • See the history for a given task
  • Re-execute a task

8. HTTP API

There should be the following routes:

  • GET /api/task/_search
  • GET /api/task/{id}/history/_search
  • POST /api/task/{id}/_run

9. Documentation

Add asciidoc documentation for the task manager.

10. Testability

Task Manager should be easier to test integrations with. Right now we have to use retry.retryForTime to wait until a task has finished executing. It would be nice to have a way to speed this up and/or a hook to know when a task has finished running.

11. Scheduling

  • Ability to not throw an error when 409 is returned.
  • Ability to overwrite.

12. Cron syntax support

Ability to have more advanced scheduling.

@elasticmachine
Contributor

Pinging @elastic/kibana-stack-services

@Bamieh
Member

Bamieh commented Aug 7, 2019

Maps and oss telemetry are utilizing the task manager. Both would benefit from the overwrite / ignore 409 option in the task manager as they currently log the following warning at server startup:

server    log   [10:24:48.829] [warning][telemetry] Error scheduling task, received [task:oss_telemetry-vis_telemetry]: version conflict, document already exists (current version [3]): [version_conflict_engine_exception] [task:oss_telemetry-vis_telemetry]: version conflict, document already exists (current version [3]), with { index_uuid="mo3VOo_6QtuxedDXE_x0qA" & shard="0" & index=".kibana_task_manager_1" }

https://github.com/elastic/kibana/blob/master/x-pack/legacy/plugins/maps/server/maps_telemetry/telemetry_task.js#L31

https://github.com/elastic/kibana/blob/master/x-pack/legacy/plugins/oss_telemetry/server/lib/tasks/index.ts#L47
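
For context, today these callers schedule with a fixed id and handle the 409 themselves; a hedged sketch of what an opt-in could look like (the option name is illustrative, not a confirmed API):

```ts
// Today: swallow the version conflict in the caller.
try {
  await taskManager.schedule({
    id: 'oss_telemetry-vis_telemetry',
    taskType: 'vis_telemetry',
    params: {},
    state: { stats: {}, runs: 0 },
  });
} catch (e) {
  // version_conflict_engine_exception: the task was already scheduled.
}

// Hypothetical opt-in: let Task Manager treat "already exists" as success.
await taskManager.schedule(
  {
    id: 'oss_telemetry-vis_telemetry',
    taskType: 'vis_telemetry',
    params: {},
    state: { stats: {}, runs: 0 },
  },
  { ignoreConflict: true } // illustrative option name
);
```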

@gmmorris gmmorris self-assigned this Sep 12, 2019
@gmmorris
Contributor

gmmorris commented Sep 16, 2019

I've spent the last couple of days familiarising myself with Task Manager's internals and I'm going to start experimenting with approaches we can take to address the scaling issues mentioned under #1 Using updateByQuery to claim tasks.

There are a couple of quick wins I think I've identified that feel worth trying out, but I don't want to start making random changes without metrics we can work off of ("If you can't measure it, you can't improve it"), so my first priority is to figure out how to measure the current TM and get a baseline.

Looking at @mikecote's update-by-query-demo, it looks like we have a good starting point, but I'd like to collect stats for running this number of tasks through Kibana itself, so I'm going to see about setting up an equivalent test against Kibana & ES instances and the actual TM.

Once I have that environment set up and have measured a performance baseline, I'm going to try the following experiments:

  1. Change TaskPool.attemptToRun so that the task.claimOwnership of each task is run concurrently (currently, we move sequentially from task to task, with only task.run running concurrently). This will allow us to mostly parallelise the work being done in ES.
    This is an easy change to make, and once we can measure it against the baseline we can see if this quick win has any impact.

  2. Decouple claiming ownership of a task from scheduling a retry for said task, as they don't actually have to happen at the same time. This can be achieved by changing TaskStore.fetchAvailableTasks to an update by query as per Mike's experiment, and performing the bulk update of the retryAt field (currently made as part of task.claimOwnership) asynchronously to the actual task execution. This should reduce the number of updates we run against ES prior to attempting the work.

Once we have metrics for the baseline, the concurrent claimOwnership, and the decoupled schedule-retry and attempt-work steps, we'll hopefully have a better idea of what we can handle scale-wise.
The goal, broadly, is to reduce the number of calls we make to ES and to parallelise these calls wherever we can decouple them.
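
As a rough sketch of experiment 1 (the TaskRunner shape and availableWorkers are assumptions based on the description above, not the actual code):

```ts
interface TaskRunner {
  claimOwnership(): Promise<boolean>;
  run(): Promise<void>;
}

// Claim ownership of a batch of candidate tasks concurrently instead of one
// task at a time, then run whatever we managed to claim.
async function attemptToRun(candidates: TaskRunner[], availableWorkers: number) {
  const batch = candidates.slice(0, availableWorkers);

  // All claimOwnership calls go out to Elasticsearch in parallel.
  const claims = await Promise.all(
    batch.map(async (task) => ({ task, claimed: await task.claimOwnership() }))
  );

  // Only the tasks we successfully claimed are run (also in parallel).
  await Promise.all(
    claims.filter(({ claimed }) => claimed).map(({ task }) => task.run())
  );
}
```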

@gmmorris
Contributor

There's now a branch on my fork that includes an integration test which spawns tasks and measures how many tasks are run per second and what the lead time is from a scheduled task reaching its runAt time until it actually runs.
The test is currently marked as skip, so it could in theory be merged into master (for future use), but I'll hold off on that for now.

You can see it in this draft PR.

The results look like this:

❯ node ../scripts/functional_test_runner.js --config=test/plugin_api_perf/config.js
 debg Loading config file from "/Users/gidi/development/elastic/kibana/x-pack/test/plugin_api_perf/config.js"
 debg Loading config file from "/Users/gidi/development/elastic/kibana/x-pack/test/api_integration/config.js"
 debg Loading config file from "/Users/gidi/development/elastic/kibana/x-pack/test/functional/config.js"
 debg Loading config file from "/Users/gidi/development/elastic/kibana/test/common/config.js"
 debg Loading config file from "/Users/gidi/development/elastic/kibana/test/functional/config.js"
 info Config loaded
 debg randomness seed: 1568634458073
 info Starting tests

 └-: task_manager_perf
   └-> "before all" hook
   └-: stressing task manager
     └-> "before all" hook
     └-> should run 10 tasks every second for a minute
       └-> "before each" hook: global before each
       └-> "before each" hook
       │ debg Stress Test Result: {"runningAverageTasks":4.714285714285714,"averagesTaken":[3,5,5,4,5,5,5,5,5,4,5,5,5,5],"runningAverageLeadTime":8071.703571428571,"averagesTakenLeadTime":[2019,3306.4,5996.6,8051.5,8723,8932.2,10581.2,9329.2,9342.4,9345.75,9346.4,8721.8,9962.6,9345.8],"leadTimeQueue":[]}
       └- ✓ pass  (1.0m) "task_manager_perf stressing task manager should run 10 tasks every second for a minute"
     └-> "after all" hook
   └-> "after all" hook


1 passing (1.0m)

You'll also see the server logging the number of tasks run within consecutive 5-second windows:

   │ proc [kibana] I have processed 0 tasks in the past 5s
   │ succ
   │
   │      Elasticsearch and Kibana are ready for functional testing. Start the functional tests
   │      in another terminal session by running this command from this directory:
   │
   │          node ../scripts/functional_test_runner
   │
   │
   │ proc [kibana] I have processed 0 tasks in the past 5s
   │ proc [kibana] I have processed 0 tasks in the past 5s
   │ proc [kibana] I have processed 3 tasks in the past 5s
   │ proc [kibana] I have processed 5 tasks in the past 5s
   │ proc [kibana] I have processed 5 tasks in the past 5s
   │ proc [kibana] I have processed 4 tasks in the past 5s
   │ proc [kibana] I have processed 5 tasks in the past 5s
   │ proc [kibana] I have processed 5 tasks in the past 5s
   │ proc [kibana] I have processed 5 tasks in the past 5s
   │ proc [kibana] I have processed 5 tasks in the past 5s
   │ proc [kibana] I have processed 5 tasks in the past 5s
   │ proc [kibana] I have processed 4 tasks in the past 5s
   │ proc [kibana] I have processed 5 tasks in the past 5s
   │ proc [kibana] I have processed 5 tasks in the past 5s
   │ proc [kibana] I have processed 5 tasks in the past 5s
   │ proc [kibana] I have processed 5 tasks in the past 5s
   │ proc [kibana] I have processed 4 tasks in the past 5s

This will help us measure a baseline for comparison, but it's obviously not a proper stress-test environment.
It's just meant to give us a way of measuring the impact of the changes we make from here on out.

@pmuellr
Member

pmuellr commented Sep 16, 2019

There's now a branch on my fork that includes an integration test which spawns tasks and measures how many tasks are run per second and what the lead time is from a scheduled task reaching its runAt time until it actually runs.
The test is currently marked as skip, so it could in theory be merged into master (for future use), but I'll hold off on that for now.

Awesome; I like the idea of marking these as skip in the FT, meaning we can keep them with the rest of the FT tests and easily turn them on locally for some testing.

@gmmorris
Contributor

gmmorris commented Sep 16, 2019

Glad you like it. :)

A local run does seem to confirm that our current implementation caps out at a handful of tasks per second, which is what Mike already found, so I've moved on to trying to implement the first quick win (parallelise claim ownership).

I'll update when I have a working prototype.

@gmmorris
Contributor

@mikecote can you add this to your list above (it feels wrong just editing your comment):

Support Single Instance Tasks

Following the removal of numWorkers, it is now impossible to prevent two instances of the same task type from running concurrently. This is needed for Reporting, as multiple instances of Chromium can cause issues.
We will aim to support a simple toggle that allows you to mark a certain task type as a "Single Instance" task, which won't allow TM to run two instances of the task concurrently on the same host.

@mikecote
Contributor Author

@gmmorris as discussed, feel free to edit the description at your leisure. You can re-organize it as you like. The meta issue is for the team to edit and collaborate on.

@gmmorris
Contributor

I've pushed a change which applies the first experiment.

I'll add some more extensive docs around this, but the core change is this:
We queue the tasks we wish to take ownership of and drain that queue, parallelising the claimOwnership step while ensuring we never attempt to claim ownership of more tasks than we have capacity for.
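
A simplified sketch of that queue-and-drain pattern (names and capacity handling are approximations of the actual change, not the code in the PR):

```ts
interface TaskRunner {
  claimOwnership(): Promise<boolean>;
  run(): Promise<void>;
}

// Tasks we want to own are pushed onto a queue and drained with bounded
// concurrency, so we never try to claim more tasks than we have capacity for.
class ClaimQueue {
  private readonly queue: TaskRunner[] = [];
  private inFlight = 0;

  constructor(private readonly capacity: number) {}

  public push(task: TaskRunner) {
    this.queue.push(task);
    this.drain();
  }

  private drain() {
    while (this.inFlight < this.capacity && this.queue.length > 0) {
      const task = this.queue.shift()!;
      this.inFlight++;
      task
        .claimOwnership()
        .then((claimed) => (claimed ? task.run() : undefined))
        .catch(() => {
          // errors swallowed for brevity; the real code surfaces them
        })
        .finally(() => {
          this.inFlight--;
          this.drain();
        });
    }
  }
}
```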

The results from a local run on my laptop:


 └-: task_manager_perf
   └-> "before all" hook
   └-: stressing task manager
     └-> "before all" hook
     └-> should run 10 tasks every second for a minute
       └-> "before each" hook: global before each
       └-> "before each" hook
       │ debg Stress Test Result: {"runningAverageTasks":20,"averagesTaken":[1,11,18,25,25,20,25,20,25,20,20,25,20,25],"runningAverageLeadTime":1310.2527777777777,"averagesTakenLeadTime":[1921,1572,1401.888888888889,1190.96,1192.6,1162.15,1246.12,1282.05,1227,1201.7,1247.8,1234.36,1244.75,1219.16],"leadTimeQueue":[1237,1235,1237,1239,1239,1238,1240,1243,1241,1242,1194,1196,1194,1197,1197]}
       └- ✓ pass  (1.0m) "task_manager_perf stressing task manager should run 10 tasks every second for a minute"
     └-> "after all" hook
   └-> "after all" hook


1 passing (1.0m)
{
	"runningAverageTasks": 20,
	"averagesTaken": [1, 11, 18, 25, 25, 20, 25, 20, 25, 20, 20, 25, 20, 25],
	"runningAverageLeadTime": 1310.2527777777777,
	"averagesTakenLeadTime": [1921, 1572, 1401.888888888889, 1190.96, 1192.6, 1162.15, 1246.12, 1282.05, 1227, 1201.7, 1247.8, 1234.36, 1244.75, 1219.16],
	"leadTimeQueue": [1237, 1235, 1237, 1239, 1239, 1238, 1240, 1243, 1241, 1242, 1194, 1196, 1194, 1197, 1197]
}

This suggests that the change has the following impact (rough numbers, as it's my laptop, but proportionally we can still learn a lot):

  1. We're able to run about 4-5 times as many tasks per second as we could before.
  2. The lead time from when a task should have run until it actually ran dropped from about 8s to 1s.

It would be interesting to run this in a more substantial environment such as AWS.

@gmmorris
Contributor

As much as I like this change, I'm disappointed by how little of an impact it has actually had.
This suggests the time spent talking to ES really is the bottleneck and we should focus on reducing that, but that is also the more complex and dangerous change, which is why I deferred it until after parallelising the claimOwnership step.

That's my next task.

@gmmorris
Contributor

gmmorris commented Sep 30, 2019

If we were to add a field onto tasks which helps us identify when a Kibana instance tried to take ownership of a task and when it might be eligible to have its ownership taken by another instance (this is to prevent a situation where ownership is taken but retryAt isn't updated properly, and so a task might fall between the cracks), would you prefer:

  1. an ownershipExpiresAt field that specifies a date and time after which another Kibana instance is allowed to try to take ownership.

OR

  2. an ownedAt field that specifies when a Kibana instance took ownership, so that in code we'll specify the interval after which we're allowed to try to take ownership again.

I have my own preference, but I don't want to anchor - so what do you think? :)


But why do we even need this?
I'm trying to merge the take-ownership and query-for-tasks steps using an updateByQuery, but as a first step I want to keep the update of retryAt where it lives at the moment, as it's a little too complex to make part of the updateByQuery.
This change creates a dangerous possible state where a task is owned by a Kibana instance but never gets its retryAt field updated, and that could lead to a task falling between the cracks.
I want the query step to look at this ownership field to know when a task might have fallen between the cracks and pick it up.

@tsullivan
Member

I prefer option 2, adding the ownedAt field. But to stay consistent with the verb terminology in other places in Task Manager, maybe claimedAt would be better.

Is there already a field for the UUID of the Kibana server that claims the task? Joel and I have recently added fields like that for Reporting-over-ESQueue, and we've just labeled them kibana_id and kibana_name.

@gmmorris
Contributor

Thanks @tsullivan. How do you feel, though, about the fact that the retryAt field, which is the main way in which we invalidate a certain state, is in the future?
I was originally thinking of using retryAt for both expiring ownership and expiring a run, but felt that could harm sustainability due to confusion about what that field actually means.

If we use claimedAt/ownedAt then it would be a DateTime in the past while retryAt is in the future, despite both being used to expire a certain state.
Would these two being different in this sense concern you?

Regarding the UUID: yes, I've added an owner field which is just that. :)

@mikecote
Contributor Author

I was originally thinking of using retryAt for both expiring ownership and expiring a run, but felt that could harm sustainability due to confusion about what that field actually means

Another way I see it: we could introduce a new claiming status and re-use the retryAt field. That would solve your concern about it not being clear.
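
To make that concrete, the claim query could then pick up both idle tasks that are due and claims/runs whose retryAt has expired; a sketch with assumed field names:

```ts
// Sketch only: the query half of the claim step. Idle tasks whose runAt has
// passed are claimable, and so are tasks stuck in 'claiming' (or 'running')
// whose retryAt has passed, i.e. whose claim has effectively expired.
const claimableTasksQuery = {
  bool: {
    should: [
      {
        bool: {
          filter: [
            { term: { 'task.status': 'idle' } },
            { range: { 'task.runAt': { lte: 'now' } } },
          ],
        },
      },
      {
        bool: {
          filter: [
            { terms: { 'task.status': ['claiming', 'running'] } },
            { range: { 'task.retryAt': { lte: 'now' } } },
          ],
        },
      },
    ],
  },
};
```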

@gmmorris
Contributor

Hmm, good idea, I like that. 🤔

@tsullivan
Member

In cases where claiming a task failed, I'm not sure what value the existing retryAt field will have towards resolving the state of the task.

If claiming failed and retryAt is in the past, it makes sense to retry right away.

If claiming failed and retryAt is in the future, it sounds like the deployment of Kibana servers has out-of-sync settings: system clock, for one.

In general, I lean towards adding more fields to help ensure a task state is stable, and a run doesn't fall through the cracks. They give us more debugging power if we ever need to diagnose something going wrong.

I was originally thinking of using retryAt for both expiring ownership and expiring a run, but felt that could harm sustainability due to confusion about what that field actually means.

This would be my concern too. I like Mike's idea to introduce the new claiming status since that would create a new variant of task state that would be understood and can be explicitly handled.

@gmmorris
Contributor

It's not so much that claiming might fail; it's that an instance will have claimed a task (as part of the fetch, in a single call) and for some reason never gets a chance to update it from 'claiming' to 'running', and in the meantime retryAt will pass and the claim will, in essence, expire.

I think that's enough of an indicator and lets us avoid a new field.

Does that make sense?

@gmmorris
Contributor

gmmorris commented Oct 17, 2019

Update on Using updateByQuery to claim tasks:

Task Manager now uses updateByQuery to claim available tasks in a single call.
This has been merged into master and will be released as part of version 7.5.

In the meantime, we have extended the SavedObjectsClient(s) by adding a bulkUpdate operation which can update multiple saved objects in a single call.
This has been integrated as well, and will be released as part of 7.6.

Work is now being done to use this new functionality as part of the operations on tasks (the goal is for the two update steps, marking a claimed task as running and marking a task run as complete, to be done in bulk whenever possible).
PR for this is over here, and hopefully this will be integrated shortly.

In local perf tests this has bumped up the number of tasks Kibana can handle per second considerably, due to parallelisation of the mark-tasks-as-running step and the drop in conflicts between multiple Kibana instances attempting to take ownership of the same task.
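
For readers following along, a hedged sketch of what marking a batch of claimed tasks as running via the new bulkUpdate might look like (attribute names and the surrounding variables are assumptions, not the code in the PR):

```ts
// Sketch only: one bulkUpdate call instead of one update per claimed task.
// `savedObjectsClient`, `claimedTasks` and `defaultTimeout` are assumed to be
// in scope; attribute names are illustrative.
const { saved_objects: markedAsRunning } = await savedObjectsClient.bulkUpdate(
  claimedTasks.map((task) => ({
    type: 'task',
    id: task.id,
    version: task.version, // per-document optimistic concurrency
    attributes: {
      status: 'running',
      startedAt: new Date().toISOString(),
      retryAt: new Date(Date.now() + defaultTimeout).toISOString(),
    },
  }))
);
```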

We're now trying to figure out how to perform some more extensive perf tests.

@gmmorris
Contributor

Submitted an issue to Elasticsearch regarding the potential optimisations we've discussed in updateByQuery:

elastic/elasticsearch#48624

cc @peterschretlen

@gmmorris
Contributor

Do we have any thoughts on the prioritisation considerations between the following:

  1. Update tasks, including runAt [DISCUSS] Task manager update API to allow changing a task's interval #45152
  2. Task Manager UI Task manager UI #40029
  3. New Platform [Task Manager] Convert plugin to New Platform #49629

At the moment I'm treating them as being in the above priority order, but I'm not sure if there was some past conversation around this before my time.

@gmmorris
Contributor

gmmorris commented Oct 29, 2019

Had a chat with @bmcconaghy and he feels the correct prioritisation is:

  1. Update tasks, including runAt [DISCUSS] Task manager update API to allow changing a task's interval #45152
  2. New Platform [Task Manager] Convert plugin to New Platform #49629
  3. Task Manager UI Task manager UI #40029

This is because the NP work, broadly, is high priority.

@gmmorris
Contributor

gmmorris commented Nov 1, 2019

A new issue has appeared in relation to the Performance Improvements we released in 7.5.
It seems users can disable inline scripts, which means Task Manager can no longer work.
#49853

At the moment this doesn't break anything new, as it appears the components using TM already rely on inline scripts and are failing anyway, but long term we don't want this to be the case, so the issue will be discussed further.

@gmmorris
Contributor

gmmorris commented Nov 11, 2019

Feels like the scheduleIfNotExists issue should be prioritised, so I've updated the priorities in TM:

  1. add scheduleIfNotExists api Task vis_telemetry "oss_telemetry-vis_telemetry" failed in attempt to run warning in Kibana logs #32050
  2. Update tasks, including runAt [DISCUSS] Task manager update API to allow changing a task's interval #45152
  3. New Platform [Task Manager] Convert plugin to New Platform #49629
  4. Advanced Scheduling Advance scheduling (cron syntax, etc). #50272
  5. Task Manager UI Task manager UI #40029

@bmcconaghy bmcconaghy added Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) and removed Team:Stack Services labels Dec 12, 2019
@ymao1
Contributor

ymao1 commented Mar 18, 2021

Closing as outdated.

@ymao1 ymao1 closed this as completed Mar 18, 2021
@kobelb kobelb added the needs-team Issues missing a team label label Jan 31, 2022
@botelastic botelastic bot removed the needs-team Issues missing a team label label Jan 31, 2022