
Worker manager launch configs #191

Merged 6 commits into main from feat/0191-worker-manager-launch-configs on Jun 20, 2024

Conversation

Contributor

@lotas lotas commented May 23, 2024

Below is the proposal to introduce Launch Configurations as standalone entities in Worker Manager.

It also outlines some "self-healing" diagnostic metrics that could potentially reduce the number of errors we see during provisioning (such as missing resources or exceeded quotas).

It also proposes exposing more events from worker manager to make it more transparent and to allow external systems to react and build advanced monitoring solutions around it.

Rendered: rfcs/0191-worker-manager-launch-configs.md

@lotas lotas requested a review from a team as a code owner May 23, 2024 12:02
@lotas lotas requested review from petemoore and matt-boris and removed request for a team May 23, 2024 12:02
Contributor

@ahal ahal left a comment


Thanks, this proposal looks great! You can ignore my comment about the name from the meeting too. This makes more sense after giving it a read-through.

rfcs/0191-worker-manager-launch-configs.md — two outdated review comments, resolved
@gabrielBusta
Member

This is a well-thought-out proposal! I really like the inclusion of additional events and the new API endpoints.
A thought: if all launch configurations in a worker pool have a low weight due to a high number of failures in every region, would that cause a halt in provisioning? Maybe that's a scenario worth considering.

@lotas
Contributor Author

lotas commented May 24, 2024

This is a well-thought-out proposal! I really like the inclusion of additional events and the new API endpoints. A thought: if all launch configurations in a worker pool have a low weight due to a high number of failures in every region, would that cause a halt in provisioning? Maybe that's a scenario worth considering.

Thanks @gabrielBusta, this is indeed a likely scenario and, to be honest, I'm not really sure how to tackle it.
I think if each region has the same amount of errors, we might give them all 0 weight and pause provisioning temporarily. But because we would only check the last 30-60 min for errors (needs to be figured out), it would "get back to normal" afterwards.
This logic is still a bit blurry. Do you have something in mind?

@ahal
Contributor

ahal commented May 24, 2024

My impression was that the weight would only affect the probability relative to all launch configs, but that overall the rate of provisioning wouldn't change. E.g., if all launch configs have a weight of 0, that's the same as all launch configs having a weight of 1.

Is there a big downside to trying to provision too much when there are lots of errors? If so, then I think slowing down the overall rate would be a nifty feature... though I don't think we should ever turn it off completely (even for an hour). We probably want to attempt to provision with at least one of the launch configs at least once every 1-5 minutes.


Each launch configuration will have a dynamic `weight` property that will be adjusted automatically based on the following events and metrics:

* total number of successful worker provisioning attempts / registrations
Contributor

This talks about total numbers of success/failure, but the summary at the top talks about using a ratio. Is this different somehow? (It seems ideal to me to use a ratio here, otherwise the influence this has will differ a lot based on the size of a worker pool.)

Contributor Author

Good point.

My initial concern was to slow down provisioning for brand-new configs that have not yet seen any successful workers. In that context, having "total X good" would be valuable info for knowing whether it's fine to go full speed.

Maybe I'm mixing too much in here: weight distribution between configs and the initial provisioning rate for new configs. It's probably worth separating those, and also keeping the initial provisioning rate out of scope, since it's not yet clear how to handle it properly.
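A minimal sketch of how these two signals might combine, assuming a cautious ramp-up for brand-new configs plus a windowed fail-to-success ratio; all names and thresholds here are illustrative, not part of the RFC:

```typescript
// Illustrative only: combines a lifetime success count (caution for brand-new
// configs) with a recent fail-to-success ratio (health of established configs).
interface LaunchConfigStats {
  totalSuccesses: number;   // lifetime successful provisioning attempts / registrations
  windowSuccesses: number;  // successes within the recent window (e.g. last hour)
  windowFailures: number;   // failures within the same window
}

function computeWeight(stats: LaunchConfigStats): number {
  const windowTotal = stats.windowSuccesses + stats.windowFailures;
  // Recent health: 1.0 when everything succeeds, approaching 0 when everything fails.
  const ratio = windowTotal === 0 ? 1 : stats.windowSuccesses / windowTotal;
  // Ramp-up: a brand-new config starts slow and reaches full speed after ~10 successes.
  const rampUp = Math.min(1, (stats.totalSuccesses + 1) / 10);
  // Keep the weight strictly positive so the config is still tried occasionally.
  return Math.max(0.05, ratio * rampUp);
}
```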


Weight will be used to determine the likelihood of selecting a specific launch configuration for provisioning.

This will be calculated at provisioning time.
Contributor

How exactly will these combine together? Will there be hardcoded coefficients for them? Will they be adjustable by worker pool, or somehow dependent on parts of the worker pool configuration?

Contributor Author

This is THE Question :)

As Dustin commented above, it should probably be a separate process/listener that calculates those based on the events and keeps them internal.

I don't think it's worth making it configurable, as it might lead to weird scenarios and also interfere with this internal logic.

I think it is only worth exposing pause/resume actions for launch configs to let external tools handle extreme scenarios.
But this will probably also raise the question of why we don't pause/resume provisioning for the whole worker pool as well.

Contributor

@djmitche djmitche left a comment


This is exciting!

It's also worth talking about how this would interact with standalone and static workers. I assume that's fairly trivial, but it might be relevant when, e.g., a config is updated. Maybe static workers can detect that they are running with an out-of-date launchConfig and restart to get the new config.

rfcs/0191-worker-manager-launch-configs.md — outdated review comment, resolved
* total number of failed worker provisioning attempts / registrations
* has any worker claimed a task
* fail-to-success ratio over a specific time period (i.e. last hour)
* number of non-stopped workers currently running
Contributor

It's worth considering how these will be calculated efficiently -- a lot of the trouble with provisioning has come from taking too long to calculate things, and thus falling behind more and more.

In fact, maybe it makes sense to leave the RFC vague on what the health statistics are. In the implementation, maybe a different process within worker-manager can listen to pulse, monitor clouds, and so on to update weights as quickly as possible, without the risk of blocking the provisioner process.

Contributor Author

Yeah, I wasn't sure yet either how to keep count of those.

What I think is important is:

  1. know the total numbers
  2. know "the current state" (ratio of failed vs successful)

For the 1st we can keep stats in the launch config's db entity, updating counters with each success/failure (hopefully it wouldn't cause too much concurrency and locking?).

For the 2nd we need something faster that could be queried by the provisioner without heavy queries. It could also be a separate table in the db with aggregated stats, updated near-real-time by an external process (that would listen to pulse).

Does it make sense? :)
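A rough sketch of that second part, assuming a separate listener process that consumes provisioning/registration events (e.g. from pulse) and keeps a cheap rolling aggregate per launch config which the provisioner can read without heavy queries; the event shape and class names are assumptions for illustration:

```typescript
// Illustrative aggregator: append events as they arrive, prune anything older
// than the window when stats are read. Could be backed by a small db table
// instead of memory; the shape is the point, not the storage.
type ProvisioningOutcome = 'success' | 'failure';

interface ProvisioningEvent {
  launchConfigId: string;
  outcome: ProvisioningOutcome;
  timestamp: number; // milliseconds since epoch
}

class LaunchConfigAggregator {
  private events = new Map<string, ProvisioningEvent[]>();

  constructor(private windowMs = 60 * 60 * 1000) {} // e.g. last hour

  // Called for each incoming event; cheap append.
  record(event: ProvisioningEvent): void {
    const list = this.events.get(event.launchConfigId) ?? [];
    list.push(event);
    this.events.set(event.launchConfigId, list);
  }

  // Called by the provisioner; O(events in window), no heavy queries.
  stats(launchConfigId: string, now = Date.now()): { successes: number; failures: number } {
    const cutoff = now - this.windowMs;
    const recent = (this.events.get(launchConfigId) ?? []).filter(e => e.timestamp >= cutoff);
    this.events.set(launchConfigId, recent); // prune old entries
    const failures = recent.filter(e => e.outcome === 'failure').length;
    return { successes: recent.length - failures, failures };
  }
}
```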

* fail-to-success ratio over a specific time period (i.e. last hour)
* number of non-stopped workers currently running

Weight will be used to determine the likelihood of selecting a specific launch configuration for provisioning.
Contributor

When we talked about this sort of health-monitoring in the context of aws-provisioner, jhford noted that if we never try a configuration, we'll have no way to know if its status has changed (region back online, etc.). I suspect weights should be constrained to be strictly >0.

Contributor Author

@lotas lotas May 28, 2024

Yes, I kept that in mind. We don't want to fully "punish" a region, since most of those errors could/should be temporary.

I also think that we should narrow the scope of this RFC to the "weights" only, without touching the provisioning rate. I think I've tried to cover two issues at once: even distribution across regions and dealing with broken configs.

Should there be a separate RFC for dealing with new launch configs after this one lands?
Maybe we could reuse/repurpose the existing scalingRatio once this gets implemented, so that it takes the new metrics into account and lowers the value for totally new configs.
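For the weight-distribution part on its own, a minimal sketch of weighted selection with a floor, so that even a badly failing config keeps a small non-zero chance of being tried and can recover; the floor value and names are illustrative:

```typescript
interface WeightedLaunchConfig {
  launchConfigId: string;
  weight: number; // dynamically adjusted from the health metrics discussed above
}

const MIN_WEIGHT = 0.05; // never fully exclude a config, so it can recover

function pickLaunchConfig(configs: WeightedLaunchConfig[]): WeightedLaunchConfig | undefined {
  if (configs.length === 0) return undefined;
  // Apply the floor, then do a standard weighted random pick.
  const floored = configs.map(c => ({ ...c, weight: Math.max(c.weight, MIN_WEIGHT) }));
  const total = floored.reduce((sum, c) => sum + c.weight, 0);
  let r = Math.random() * total;
  for (const c of floored) {
    r -= c.weight;
    if (r <= 0) return c;
  }
  return floored[floored.length - 1]; // guard against floating-point drift
}
```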

* `workerManager.resumeLaunchConfig(workerPoolId, launchConfigId)` - to resume specific active launch configuration

Last two endpoints would allow external systems to pause/resume specific launch configurations based on their own criteria.
This might be useful when dynamic adjustment of weights is not enough to prevent provisioning workers in a specific region.
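A hedged usage sketch of how an external tool might call these endpoints. Only `resumeLaunchConfig` is named in the excerpt above; the `pauseLaunchConfig` counterpart, the client interface, and the example identifiers are assumptions for illustration:

```typescript
interface WorkerManagerClient {
  // Assumed counterpart to the resume endpoint named in the excerpt above.
  pauseLaunchConfig(workerPoolId: string, launchConfigId: string): Promise<void>;
  // From the RFC: resume a specific active launch configuration.
  resumeLaunchConfig(workerPoolId: string, launchConfigId: string): Promise<void>;
}

// Example: an external monitor pausing a config during a cloud-region incident
// and resuming it once the incident is resolved. Pool and config ids are hypothetical.
async function reactToRegionIncident(wm: WorkerManagerClient, resolved: boolean): Promise<void> {
  const workerPoolId = 'proj-example/linux-large';
  const launchConfigId = 'lc-us-east1-standard';
  if (resolved) {
    await wm.resumeLaunchConfig(workerPoolId, launchConfigId);
  } else {
    await wm.pauseLaunchConfig(workerPoolId, launchConfigId);
  }
}
```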
Contributor

This is great!

In the past, Generic-worker has supported something very similar, basically a hash of its config. It would consult aws-provisioner or worker-manager and, if the hash had changed, stop taking tasks and shut down (for cloud-based workers).

I don't know if it would make sense to do the same with launch configs. In terms of handling load, it's definitely better to have a working worker keep working. But when updating a worker config, e.g., to rotate a secret or upgrade worker version, it makes sense to have old workers terminate as soon as possible.

Maybe it would make sense to support "draining" a launch config, and a worker can check for that status before claiming a task. Maybe that doesn't make sense in this RFC, but something to look forward to.

Contributor Author

Nice!
I was thinking that a launch config should have two flags:

  1. is_archived to mark old configs as inactive
  2. is_paused to mark if current launch config is paused (and excluded from provisioning)

Based on this, a worker would only need to know its launch_config_id, which it can use to check whether it has been archived.

I believe this would replace or extend the existing deploymentId logic?

Contributor

That makes a lot of sense!
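A sketch of the worker-side check described above, assuming a worker knows only its own launch config id and can look up the proposed is_archived / is_paused flags; the lookup function is a hypothetical stand-in, not a defined API:

```typescript
interface LaunchConfigFlags {
  isArchived: boolean; // config superseded by a newer worker pool definition
  isPaused: boolean;   // config temporarily excluded from provisioning
}

// The worker checks its own launch config before claiming more work; an archived
// config means it is running an out-of-date definition and should drain and shut down.
async function shouldKeepClaiming(
  lookup: (launchConfigId: string) => Promise<LaunchConfigFlags>,
  launchConfigId: string,
): Promise<boolean> {
  const flags = await lookup(launchConfigId);
  return !flags.isArchived;
}
```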

Last two endpoints would allow external systems to pause/resume specific launch configurations based on their own criteria.
This might be useful when dynamic adjustment of weights is not enough to prevent provisioning workers in a specific region.

Existing endpoints will continue to accept the same payload as before for backward compatibility.
Contributor

I suspect that the create/updateWorkerPool methods will need to return a set of launch configs, too?

Contributor Author

Yes, right. Since it would be a breaking change anyway.

Worth keeping in mind that tc-admin/ciadmin might need to be patched.

At some point, provisioning workers in region A fails with quota exceeded errors.
Weight of A would be adjusted proportionally to the error rate - `1 - (failed / total)`.

Note: To avoid permanently disabling launch config, we would only adjust the weight for a specific time period (i.e. *last hour*).
Contributor

This makes sense, too, as long as that ramp-up is gradual. I can imagine a case where region A goes down and tasks start piling up. One hour later, the weight adjustment disappears and suddenly worker-manager tries to create 100 workers in region A, all of which fail. Better to start with just a few and "see how it goes".

Contributor Author

Thanks Dustin.
I think we should separate those two problems: adjusting weights between regions and slowing down the provisioning rate.

I'm inclined to update this RFC to make it only about the weight distribution, and leave the scaling ratio logic for an upcoming one.

Unless you have something in mind?
If, for an updated worker pool, all launch configs had a lower weight, they would still be equally likely to be selected for provisioning.

What if we made this weight dual-purpose: both the selection likelihood and the scaling ratio 🤔
So for any weight < 1, it would also act as a scalingRatio... nah, probably too complicated :)

Contributor

I think it's worth keeping things simple. And, scaling ratio is distinct enough to leave out of this RFC.
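On the earlier point about avoiding a sudden return to full capacity when the error window expires, one possible approach (illustrative only, not part of the RFC) is an exponentially decayed error count, so a config's weight recovers gradually rather than jumping back all at once:

```typescript
// Errors lose half their influence every halfLifeMs, so weight recovery is smooth.
class DecayingErrorRate {
  private errors = 0;
  private lastUpdate = Date.now();

  constructor(private halfLifeMs = 30 * 60 * 1000) {}

  private decay(now: number): void {
    const elapsed = now - this.lastUpdate;
    this.errors *= Math.pow(0.5, elapsed / this.halfLifeMs);
    this.lastUpdate = now;
  }

  recordError(now = Date.now()): void {
    this.decay(now);
    this.errors += 1;
  }

  // Weight shrinks with recent errors and creeps back up as they decay away.
  weight(now = Date.now()): number {
    this.decay(now);
    return Math.max(0.05, 1 / (1 + this.errors));
  }
}
```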

@lotas lotas force-pushed the feat/0191-worker-manager-launch-configs branch 2 times, most recently from bc1a363 to d5f471e Compare May 28, 2024 10:49
@lotas
Contributor Author

lotas commented May 29, 2024

It's also worth talking about how this would interact with standalone and static workers. I assume that's fairly trivial, but it might be relevant when, e.g., a config is updated. Maybe static workers can detect that they are running with an out-of-date launchConfig and restart to get the new config.

Hmm... I thought static and standalone workers don't have anything configured on the worker pool side, so they shouldn't be affected at all.
Unless I'm missing something?
@petemoore @djmitche

@djmitche
Contributor

Standalone workers don't have anything configured in worker-manager. Static do -- they get their config from worker-manager, but they are not dynamically provisioned.

@lotas
Contributor Author

lotas commented Jun 4, 2024

Standalone workers don't have anything configured in worker-manager. Static do -- they get their config from worker-manager, but they are not dynamically provisioned.

Indeed... To my surprise, I discovered that static worker pools use a slightly different schema, which kind of breaks my assumption that all worker pool definitions are similar :)
We could probably update the static worker pool schema to have a single launchConfig so that the proposed solutions work uniformly for all pools, or we could keep treating non-aws/gcp/azure providers as before.

@djmitche
Contributor

djmitche commented Jun 4, 2024

I think that makes sense. Then when that launchConfig changes, workers with the old launchConfig can detect and react to that (if implemented).

Member

@petemoore petemoore left a comment


One thing we experience at the moment is that a single task is submitted to a task queue (worker pool) and worker manager may launch several workers before the pending task count drops. It currently doesn't track that it has already launched a worker in response to the pending task. For example, I created a new pool yesterday, set the maxCapacity to 5, and submitted a single task. Worker Manager will consistently create 5 workers, and only stop because it hits max capacity, before one of the workers has completed booting, logged into the desktop, and claimed the task.

It isn't clear to me how this RFC will solve that problem. If each launch configuration gets different weights, I think this RFC may influence which launch configuration is used for each of the spawned workers, but will it introduce any protections so that only a single worker is launched to service the (single) task that was created? I think the only way to solve that is to balance pending tasks against "eventual worker capacity", where eventual worker capacity is effectively the number of running workers plus the total number that are currently being provisioned. I think what is happening at the moment is that as soon as a worker starts running, and we have a pending task, worker manager considers that it needs to launch a new worker, because the pending count is still 1, rather than recognising that a worker has already been created to service this task; it just hasn't finished initialising yet.

I think the RFC doc could also explicitly summarise what happens in the case of bad machine images, e.g. where the worker won't even start up. This is a reasonably common problem. If worker manager can conclude that an instance is actually running but, after some period, is not calling the Queue, then it is probably a bad worker, and worker manager should start throwing alerts somehow and highlighting the issue in the UI; maybe we should have a status page, etc. Ideally there would be some kind of backoff and retrying, but if, after launching e.g. 10 workers, all of them were running according to the cloud provider yet none of them called the Queue, we should probably freeze the worker pool entirely, prevent new instances from being launched, and maybe require the worker pool to be manually reactivated. Eventually (future RFC) we may introduce some rollback feature with versioned worker pool definitions, but for now just deactivating the pool might be enough, along with making this very visible in the UI and ideally alerting too.

I may have wandered off topic here, but I wanted to make sure that this RFC handles two specific issues we commonly experience today:

  1. Bad machine images
  2. Not launching multiple workers for a single pending task

Maybe it does already, but it would be good to explicitly mention these two topics in the RFC directly and explain how it handles them. Thanks!

@lotas
Contributor Author

lotas commented Jun 5, 2024

Thanks Pete, this is also important, and I think I finally understood the root problem there: worker statuses!

So in that particular case, the steps that led to launching many workers without any of them doing anything were the following:

pending tasks: 1
desired capacity: 1
requested: 1

This looks good, but as soon as that worker was fully launched, it turned into existingCapacity,
so on the next run we have:

existing capacity: 1
pending: 1
desired capacity: 2 (+1)
requested: 1

and so it goes on, because the workers start but don't call claimWork, and that pending count keeps spawning new workers.

I think this is simply a bug in the provisioner, or the fact that it doesn't know the actual number of tasks being claimed: it treats existing capacity as something that is already working on tasks, when in fact it isn't.
Since we refactored the queue internals, we could also tell how many tasks are being claimed for a particular pool,
and then do some simple additions/subtractions to check whether claimed + pending <> existing + requested.

UPD: created taskcluster/taskcluster#7052
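A sketch of the accounting described in the comment above, assuming the queue can report how many tasks are currently claimed for a pool; all names are illustrative, not the provisioner's actual variables:

```typescript
interface PoolSnapshot {
  pendingTasks: number;      // tasks waiting to be claimed
  claimedTasks: number;      // tasks currently claimed by this pool's workers
  existingCapacity: number;  // running workers (possibly still booting / idle)
  requestedCapacity: number; // workers requested but not yet running
}

// New capacity is only requested when pending + claimed work exceeds what
// existing + requested capacity should eventually cover, so a single pending
// task no longer spawns a new worker on every provisioning loop.
function additionalCapacityNeeded(s: PoolSnapshot, maxCapacity: number): number {
  const demand = s.pendingTasks + s.claimedTasks;
  const supply = s.existingCapacity + s.requestedCapacity;
  const shortfall = Math.max(0, demand - supply);
  return Math.min(shortfall, Math.max(0, maxCapacity - supply));
}
```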

petemoore previously approved these changes Jun 19, 2024
Member

@petemoore petemoore left a comment


looks perfect

@lotas lotas force-pushed the feat/0191-worker-manager-launch-configs branch from f0315cd to cd6535e Compare June 19, 2024 15:55
@lotas lotas changed the base branch from feat/0190-change-task-run-priority to main June 19, 2024 15:55
@lotas lotas dismissed petemoore’s stale review June 19, 2024 15:55

The base branch was changed.

@lotas lotas merged commit 802091b into main Jun 20, 2024
1 check passed
@lotas lotas deleted the feat/0191-worker-manager-launch-configs branch June 20, 2024 07:38