
Worker manager launch configs #191

Merged 6 commits into main from feat/0191-worker-manager-launch-configs on Jun 20, 2024

Conversation

Contributor

@lotas lotas commented May 23, 2024

Below is the proposal to introduce Launch Configurations as standalone entities in Worker Manager.

It also outlines some "self-healing" diagnostic metrics that could potentially reduce the number of errors we see during provisioning (such as missing resources or exceeded quotas).

It also proposes exposing more events from worker manager to make it more transparent and to allow external systems to react and build advanced monitoring solutions around it.

Rendered: rfcs/0191-worker-manager-launch-configs.md

@lotas lotas requested a review from a team as a code owner May 23, 2024 12:02
@lotas lotas requested review from petemoore and matt-boris and removed request for a team May 23, 2024 12:02
Contributor

@ahal ahal left a comment


Thanks, this proposal looks great! You can ignore my comment about the name from the meeting too. This makes more sense after giving it a read-through.

rfcs/0191-worker-manager-launch-configs.md — two outdated review comments, resolved
@gabrielBusta
Member

This is a well-thought-out proposal! I really like the inclusion of additional events and the new API endpoints.
A thought: if all launch configurations in a worker pool have a low weight due to a high number of failures in every region, would that cause a halt in provisioning? Maybe that's a scenario worth considering.

@lotas
Contributor Author

lotas commented May 24, 2024

This is a well-thought-out proposal! I really like the inclusion of additional events and the new API endpoints. A thought: if all launch configurations in a worker pool have a low weight due to a high number of failures in every region, would that cause a halt in provisioning? Maybe that's a scenario worth considering.

Thanks @gabrielBusta, this is indeed a likely scenario and, to be honest, I'm not really sure how to tackle it.
I think if each region has the same amount of errors, we might give them all 0 weight and pause provisioning temporarily. But because we would only check the last 30-60 min for errors (needs to be figured out), it would "get back to normal" afterwards.
This logic is still a bit blurry. Do you have something in mind?

@ahal
Contributor

ahal commented May 24, 2024

My impression was that the weight would only affect the probability relative to all launch configs, but that overall the rate of provisioning wouldn't change. E.g., if all launch configs have a weight of 0, that's the same as all launch configs having a weight of 1.

Is there a big downside to trying to provision too much when there are lots of errors? If so, then I think slowing down the overall rate would be a nifty feature... though I don't think we should ever turn it off completely (even for an hour). We probably want to attempt to provision with at least one of the launch configs at least once every 1-5 minutes.


Each launch configuration will have a dynamic `weight` property that will be adjusted automatically based on the following events and metrics:

* total number of successful worker provisioning attempts / registrations
Contributor

This talks about total numbers of success/failure, but the summary at the top talks about using a ratio. Is this different somehow? (It seems ideal to me to use a ratio here, otherwise the influence this has will differ a lot based on the size of a worker pool.)

Contributor Author

Good point.

My initial concern was to slow down provisioning for brand-new configs that have not yet seen any successful workers. In that context, having "total X good" would be valuable info for knowing whether it's fine to go full speed.

Maybe I'm mixing too much in here: weight distribution between configs and the initial provisioning rate for new configs. It's probably worth separating those, and also keeping the initial provisioning rate out of scope, since it's not yet clear how to handle it properly.
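A minimal sketch of how these two signals might combine, assuming a cautious ramp-up for brand-new configs plus a windowed fail-to-success ratio; all names and thresholds here are illustrative, not part of the RFC:

```typescript
// Illustrative only: combines a lifetime success count (caution for brand-new
// configs) with a recent fail-to-success ratio (health of established configs).
interface LaunchConfigStats {
  totalSuccesses: number;   // lifetime successful provisioning attempts / registrations
  windowSuccesses: number;  // successes within the recent window (e.g. last hour)
  windowFailures: number;   // failures within the same window
}

function computeWeight(stats: LaunchConfigStats): number {
  const windowTotal = stats.windowSuccesses + stats.windowFailures;
  // Recent health: 1.0 when everything succeeds, approaching 0 when everything fails.
  const ratio = windowTotal === 0 ? 1 : stats.windowSuccesses / windowTotal;
  // Ramp-up: a brand-new config starts slow and reaches full speed after ~10 successes.
  const rampUp = Math.min(1, (stats.totalSuccesses + 1) / 10);
  // Keep the weight strictly positive so the config is still tried occasionally.
  return Math.max(0.05, ratio * rampUp);
}
```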


Weight will be used to determine the likelihood of selecting a specific launch configuration for provisioning.

This will be calculated at provisioning time.
Contributor

How exactly will these combine together? Will there be hardcoded coefficients for them? Will they be adjustable by worker pool, or somehow dependent on parts of the worker pool configuration?

Contributor Author

This is THE Question :)

As Dustin commented above, it should probably be a separate process/listener that calculates those based on the events and keeps them internal.

I don't think it's worth making it configurable, as it might lead to weird scenarios and also interfere with this internal logic.

I think it is only worth exposing pause/resume actions for launch configs to let external tools handle extreme scenarios.
But this will probably also raise the question of why we don't pause/resume provisioning for the whole worker pool as well.

Contributor

@djmitche djmitche left a comment


This is exciting!

It's also worth talking about how this would interact with standalone and static workers. I assume that's fairly trivial, but it might be relevant when, e.g., a config is updated. Maybe static workers can detect that they are running with an out-of-date launchConfig and restart to get the new config.

rfcs/0191-worker-manager-launch-configs.md — outdated review comment, resolved
* total number of failed worker provisioning attempts / registrations
* has any worker claimed a task
* fail-to-success ratio over a specific time period (i.e. last hour)
* number of non-stopped workers currently running
Contributor

It's worth considering how these will be calculated efficiently -- a lot of the trouble with provisioning has come from taking too long to calculate things, and thus falling behind more and more.

In fact, maybe it makes sense to leave the RFC vague on what the health statistics are. In the implementation, maybe a different process within worker-manager can listen to pulse, monitor clouds, and so on to update weights as quickly as possible, without the risk of blocking the provisioner process.

Contributor Author

Yeah, I wasn't sure yet either how to keep count of those.

What I think is important is:

  1. know the total numbers
  2. know "the current state" (ratio of failed vs successful)

For the 1st we can keep stats in the launch config's db entity, updating counters with each success/failure (hopefully it wouldn't cause too much concurrency and locking?).

For the 2nd we need something faster that could be queried by the provisioner without heavy queries. It could also be a separate table in the db with aggregated stats, updated near-real-time by an external process (that would listen to pulse).

Does it make sense? :)
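A rough sketch of that second part, assuming a separate listener process that consumes provisioning/registration events (e.g. from pulse) and keeps a cheap rolling aggregate per launch config which the provisioner can read without heavy queries; the event shape and class names are assumptions for illustration:

```typescript
// Illustrative aggregator: append events as they arrive, prune anything older
// than the window when stats are read. Could be backed by a small db table
// instead of memory; the shape is the point, not the storage.
type ProvisioningOutcome = 'success' | 'failure';

interface ProvisioningEvent {
  launchConfigId: string;
  outcome: ProvisioningOutcome;
  timestamp: number; // milliseconds since epoch
}

class LaunchConfigAggregator {
  private events = new Map<string, ProvisioningEvent[]>();

  constructor(private windowMs = 60 * 60 * 1000) {} // e.g. last hour

  // Called for each incoming event; cheap append.
  record(event: ProvisioningEvent): void {
    const list = this.events.get(event.launchConfigId) ?? [];
    list.push(event);
    this.events.set(event.launchConfigId, list);
  }

  // Called by the provisioner; O(events in window), no heavy queries.
  stats(launchConfigId: string, now = Date.now()): { successes: number; failures: number } {
    const cutoff = now - this.windowMs;
    const recent = (this.events.get(launchConfigId) ?? []).filter(e => e.timestamp >= cutoff);
    this.events.set(launchConfigId, recent); // prune old entries
    const failures = recent.filter(e => e.outcome === 'failure').length;
    return { successes: recent.length - failures, failures };
  }
}
```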

* fail-to-success ratio over a specific time period (i.e. last hour)
* number of non-stopped workers currently running

Weight will be used to determine the likelihood of selecting a specific launch configuration for provisioning.
Contributor

When we talked about this sort of health-monitoring in the context of aws-provisioner, jhford noted that if we never try a configuration, we'll have no way to know if its status has changed (region back online, etc.). I suspect weights should be constrained to be strictly >0.

Contributor Author

@lotas lotas May 28, 2024

Yes, I kept that in mind. We don't want to fully "punish" a region, since most of those errors could/should be temporary.

I also think that we should narrow the scope of this RFC to the "weights" only, without touching the provisioning rate. I think I've tried to cover two issues at once: even distribution across regions and dealing with broken configs.

Should there be a separate RFC for dealing with new launch configs after this one lands?
Maybe we could reuse/repurpose the existing scalingRatio once this gets implemented, so that it takes the new metrics into account and lowers the value for totally new configs.
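For the weight-distribution part on its own, a minimal sketch of weighted selection with a floor, so that even a badly failing config keeps a small non-zero chance of being tried and can recover; the floor value and names are illustrative:

```typescript
interface WeightedLaunchConfig {
  launchConfigId: string;
  weight: number; // dynamically adjusted from the health metrics discussed above
}

const MIN_WEIGHT = 0.05; // never fully exclude a config, so it can recover

function pickLaunchConfig(configs: WeightedLaunchConfig[]): WeightedLaunchConfig | undefined {
  if (configs.length === 0) return undefined;
  // Apply the floor, then do a standard weighted random pick.
  const floored = configs.map(c => ({ ...c, weight: Math.max(c.weight, MIN_WEIGHT) }));
  const total = floored.reduce((sum, c) => sum + c.weight, 0);
  let r = Math.random() * total;
  for (const c of floored) {
    r -= c.weight;
    if (r <= 0) return c;
  }
  return floored[floored.length - 1]; // guard against floating-point drift
}
```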

* `workerManager.resumeLaunchConfig(workerPoolId, launchConfigId)` - to resume specific active launch configuration

Last two endpoints would allow external systems to pause/resume specific launch configurations based on their own criteria.
This might be useful when dynamic adjustment of weights is not enough to prevent provisioning workers in a specific region.
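A hedged usage sketch of how an external tool might call these endpoints. Only `resumeLaunchConfig` is named in the excerpt above; the `pauseLaunchConfig` counterpart, the client interface, and the example identifiers are assumptions for illustration:

```typescript
interface WorkerManagerClient {
  // Assumed counterpart to the resume endpoint named in the excerpt above.
  pauseLaunchConfig(workerPoolId: string, launchConfigId: string): Promise<void>;
  // From the RFC: resume a specific active launch configuration.
  resumeLaunchConfig(workerPoolId: string, launchConfigId: string): Promise<void>;
}

// Example: an external monitor pausing a config during a cloud-region incident
// and resuming it once the incident is resolved. Pool and config ids are hypothetical.
async function reactToRegionIncident(wm: WorkerManagerClient, resolved: boolean): Promise<void> {
  const workerPoolId = 'proj-example/linux-large';
  const launchConfigId = 'lc-us-east1-standard';
  if (resolved) {
    await wm.resumeLaunchConfig(workerPoolId, launchConfigId);
  } else {
    await wm.pauseLaunchConfig(workerPoolId, launchConfigId);
  }
}
```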
Contributor

This is great!

In the past, Generic-worker has supported something very similar, basically a hash of its config. It would consult aws-provisioner or worker-manager and, if the hash had changed, stop taking tasks and shut down (for cloud-based workers).

I don't know if it would make sense to do the same with launch configs. In terms of handling load, it's definitely better to have a working worker keep working. But when updating a worker config, e.g., to rotate a secret or upgrade worker version, it makes sense to have old workers terminate as soon as possible.

Maybe it would make sense to support "draining" a launch config, and a worker can check for that status before claiming a task. Maybe that doesn't make sense in this RFC, but something to look forward to.

Contributor Author

Nice!
I was thinking that a launch config should have two flags:

  1. is_archived to mark old configs as inactive
  2. is_paused to mark if current launch config is paused (and excluded from provisioning)

Based on this, a worker would only need to know its launch_config_id, which it can use to check whether it has been archived.

I believe this would replace or extend the existing deploymentId logic?

Contributor

That makes a lot of sense!
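A sketch of the worker-side check described above, assuming a worker knows only its own launch config id and can look up the proposed is_archived / is_paused flags; the lookup function is a hypothetical stand-in, not a defined API:

```typescript
interface LaunchConfigFlags {
  isArchived: boolean; // config superseded by a newer worker pool definition
  isPaused: boolean;   // config temporarily excluded from provisioning
}

// The worker checks its own launch config before claiming more work; an archived
// config means it is running an out-of-date definition and should drain and shut down.
async function shouldKeepClaiming(
  lookup: (launchConfigId: string) => Promise<LaunchConfigFlags>,
  launchConfigId: string,
): Promise<boolean> {
  const flags = await lookup(launchConfigId);
  return !flags.isArchived;
}
```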

Last two endpoints would allow external systems to pause/resume specific launch configurations based on their own criteria.
This might be useful when dynamic adjustment of weights is not enough to prevent provisioning workers in a specific region.

Existing endpoints will continue to accept the same payload as before for backward compatibility.
Contributor

I suspect that the create/updateWorkerPool methods will need to return a set of launch configs, too?

Contributor Author

Yes, right. Since it would be a breaking change anyway.

Worth keeping in mind that tc-admin/ciadmin might need to be patched.

At some point, provisioning workers in region A fails with quota exceeded errors.
Weight of A would be adjusted proportionally to the error rate - `1 - (failed / total)`.

Note: To avoid permanently disabling launch config, we would only adjust the weight for a specific time period (i.e. *last hour*).
Contributor

This makes sense, too, as long as that ramp-up is gradual. I can imagine a case where region A goes down and tasks start piling up. One hour later, the weight adjustment disappears and suddenly worker-manager tries to create 100 workers in region A, all of which fail. Better to start with just a few and "see how it goes".

Contributor Author

Thanks Dustin.
I think we should separate those two problems: adjusting weights between regions and slowing down the provisioning rate.

I'm inclined to update this RFC to make it only about the weight distribution, and leave the scaling ratio logic for an upcoming one.

Unless you have something in mind?
If, for an updated worker pool, all launch configs had a lower weight, they would still be equally likely to be selected for provisioning.

What if we made this weight dual-purpose: both the selection likelihood and the scaling ratio 🤔
So for any weight < 1, it would also act as a scalingRatio... nah, probably too complicated :)

Contributor

I think it's worth keeping things simple. And, scaling ratio is distinct enough to leave out of this RFC.
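On the earlier point about avoiding a sudden return to full capacity when the error window expires, one possible approach (illustrative only, not part of the RFC) is an exponentially decayed error count, so a config's weight recovers gradually rather than jumping back all at once:

```typescript
// Errors lose half their influence every halfLifeMs, so weight recovery is smooth.
class DecayingErrorRate {
  private errors = 0;
  private lastUpdate = Date.now();

  constructor(private halfLifeMs = 30 * 60 * 1000) {}

  private decay(now: number): void {
    const elapsed = now - this.lastUpdate;
    this.errors *= Math.pow(0.5, elapsed / this.halfLifeMs);
    this.lastUpdate = now;
  }

  recordError(now = Date.now()): void {
    this.decay(now);
    this.errors += 1;
  }

  // Weight shrinks with recent errors and creeps back up as they decay away.
  weight(now = Date.now()): number {
    this.decay(now);
    return Math.max(0.05, 1 / (1 + this.errors));
  }
}
```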

@lotas lotas force-pushed the feat/0191-worker-manager-launch-configs branch 2 times, most recently from bc1a363 to d5f471e Compare May 28, 2024 10:49
@lotas
Contributor Author

lotas commented May 29, 2024

It's also worth talking about how this would interact with standalone and static workers. I assume that's fairly trivial, but it might be relevant when, e.g., a config is updated. Maybe static workers can detect that they are running with an out-of-date launchConfig and restart to get the new config.

Hmm... I thought static and standalone workers don't have anything configured on the worker pool side, so they shouldn't be affected at all.
Unless I'm missing something?
@petemoore @djmitche

@djmitche
Contributor

Standalone workers don't have anything configured in worker-manager. Static do -- they get their config from worker-manager, but they are not dynamically provisioned.

@lotas
Contributor Author

lotas commented Jun 4, 2024

Standalone workers don't have anything configured in worker-manager. Static do -- they get their config from worker-manager, but they are not dynamically provisioned.

Indeed... To my surprise, I discovered that static worker pools use a slightly different schema, which kind of breaks my assumption that all worker pool definitions are similar :)
We could probably update the static worker pool schema to have a single launchConfig so that the proposed solutions work uniformly for all pools, or we could keep treating non-aws/gcp/azure providers as before.

@djmitche
Contributor

djmitche commented Jun 4, 2024

I think that makes sense. Then when that launchConfig changes, workers with the old launchConfig can detect and react to that (if implemented).

Member

@petemoore petemoore left a comment


One thing we experience at the moment is that a single task is submitted to a task queue (worker pool) and worker manager may launch several workers before the pending task count drops. It currently doesn't track that it has already launched a worker in response to the pending task. For example, I created a new pool yesterday, set the maxCapacity to 5, and submitted a single task. Worker Manager will consistently create 5 workers, and only stop because it hits max capacity, before one of the workers has completed booting, logged into the desktop, and claimed the task.

It isn't clear to me how this RFC will solve that problem. If each launch configuration gets different weights, I think this RFC may influence which launch configuration is used for each of the spawned workers, but will it introduce any protections so that only a single worker is launched to service the (single) task that was created? I think the only way to solve that is to balance pending tasks against "eventual worker capacity", where eventual worker capacity is effectively the number of running workers plus the total number that are currently being provisioned. I think what is happening at the moment is that as soon as a worker starts running, and we have a pending task, worker manager considers that it needs to launch a new worker, because the pending count is still 1, rather than recognising that a worker has already been created to service this task; it just hasn't finished initialising yet.

I think the RFC doc could also explicitly summarise what happens in the case of bad machine images, e.g. where the worker won't even start up. This is a reasonably common problem. If worker manager can conclude that an instance is actually running but, after some period, is not calling the Queue, then it is probably a bad worker, and worker manager should start throwing alerts somehow and highlighting the issue in the UI; maybe we should have a status page, etc. Ideally there would be some kind of backoff and retrying, but if, after launching e.g. 10 workers, all of them were running according to the cloud provider yet none of them called the Queue, we should probably freeze the worker pool entirely, prevent new instances from being launched, and maybe require the worker pool to be manually reactivated. Eventually (future RFC) we may introduce some rollback feature with versioned worker pool definitions, but for now just deactivating the pool might be enough, along with making this very visible in the UI and ideally alerting too.

I may have wandered off topic here, but I wanted to make sure that this RFC handles two specific issues we commonly experience today:

  1. Bad machine images
  2. Not launching multiple workers for a single pending task

Maybe it does already, but it would be good to explicitly mention these two topics in the RFC directly and explain how it handles them. Thanks!

@lotas
Contributor Author

lotas commented Jun 5, 2024

Thanks Pete, this is also important, and I think I finally understood the root problem there: worker statuses!

So in that particular case, the steps that led to launching many workers without any of them doing anything were the following:

pending tasks: 1
desired capacity: 1
requested: 1

This looks good, but as soon as that worker was fully launched, it turned into existingCapacity,
so on the next run we have:

existing capacity: 1
pending: 1
desired capacity: 2 (+1)
requested: 1

and so it goes on, because the workers start but don't call claimWork, and that pending count keeps spawning new workers.

I think this is simply a bug in the provisioner, or the fact that it doesn't know the actual number of tasks being claimed: it treats existing capacity as something that is already working on tasks, when in fact it isn't.
Since we refactored the queue internals, we could also tell how many tasks are being claimed for a particular pool,
and then do some simple additions/subtractions to check whether claimed + pending <> existing + requested.

UPD: created taskcluster/taskcluster#7052
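A sketch of the accounting described in the comment above, assuming the queue can report how many tasks are currently claimed for a pool; all names are illustrative, not the provisioner's actual variables:

```typescript
interface PoolSnapshot {
  pendingTasks: number;      // tasks waiting to be claimed
  claimedTasks: number;      // tasks currently claimed by this pool's workers
  existingCapacity: number;  // running workers (possibly still booting / idle)
  requestedCapacity: number; // workers requested but not yet running
}

// New capacity is only requested when pending + claimed work exceeds what
// existing + requested capacity should eventually cover, so a single pending
// task no longer spawns a new worker on every provisioning loop.
function additionalCapacityNeeded(s: PoolSnapshot, maxCapacity: number): number {
  const demand = s.pendingTasks + s.claimedTasks;
  const supply = s.existingCapacity + s.requestedCapacity;
  const shortfall = Math.max(0, demand - supply);
  return Math.min(shortfall, Math.max(0, maxCapacity - supply));
}
```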

petemoore previously approved these changes Jun 19, 2024
Member

@petemoore petemoore left a comment


looks perfect

@lotas lotas force-pushed the feat/0191-worker-manager-launch-configs branch from f0315cd to cd6535e Compare June 19, 2024 15:55
@lotas lotas changed the base branch from feat/0190-change-task-run-priority to main June 19, 2024 15:55
@lotas lotas dismissed petemoore’s stale review June 19, 2024 15:55

The base branch was changed.

@lotas lotas merged commit 802091b into main Jun 20, 2024
1 check passed
@lotas lotas deleted the feat/0191-worker-manager-launch-configs branch June 20, 2024 07:38