Worker manager launch configs #191
Conversation
Thanks, this proposal looks great! You can ignore my comment about the name from the meeting too. This makes more sense after giving it a read through.
This is a well-thought-out proposal! I really like the inclusion of additional events and the new API endpoints.

Thanks @gabrielBusta, this is indeed a likely scenario and, to be honest, I'm not really sure how to tackle this.

My impression was that the weight would only affect the probability relative to all launch configs, but that overall the rate of provisioning wouldn't change. E.g., if all launch configs have a weight of
Is there a big downside to trying to provision too much when there are lots of errors? If so, then I think slowing down the overall rate would be a nifty feature, though I don't think we should ever turn it off completely (even for an hour). We probably want to attempt to provision in at least one of the launch configs at least once every 1-5 minutes.
Each launch configuration will have a dynamic `weight` property that will be adjusted automatically based on the following events and metrics:

* total number of successful worker provisioning attempts / registrations
This talks about total numbers of success/failure, but the summary at the top talks about using a ratio. Is this different somehow? (It seems ideal to me to use a ratio here, otherwise the influence this has will differ a lot based on the size of a worker pool.)
Good point.
My initial concern was to slow down the provisioning for brand new configs that have not yet seen any successful workers. So in that context, having "total X good" would be valuable info for knowing whether it's fine to go full speed.
Maybe I'm mixing too much in here? Weight distribution between configs, and the initial provisioning rate for new configs. It's probably worth separating those, and also keeping the initial provisioning rate out of scope, since it's not yet clear how to handle it properly.
Weight will be used to determine the likelihood of selecting a specific launch configuration for provisioning.

This will be calculated at provisioning time.
How exactly will these combine? Will there be hardcoded coefficients for them? Will they be adjustable by worker pool, or somehow dependent on parts of the worker pool configuration?
This is THE Question :)
As Dustin commented above, it should probably be a separate process/listener that would calculate those based on the events, and keep them internal.
I don't think it's worth making it configurable, as it might lead to weird scenarios and also interfere with this internal logic.
I think it is only worth exposing `pause`/`resume` actions for launch configs, to let external tools handle extreme scenarios.
But this will probably also raise the question: why not `pause`/`resume` provisioning for the whole worker pool as well?
This is exciting!
It's worth also talking about how this would interact with standalone and static workers. I assume that's fairly trivial, but might be relevant when e.g., a config is updated. Maybe static workers can detect that they are running with an out-of-date launchConfig and restart to get the new config.
* total number of failed worker provisioning attempts / registrations
* has any worker claimed a task
* fail-to-success ratio over a specific time period (i.e. last hour)
* number of non-stopped workers currently running
It's worth considering how these will be calculated efficiently -- a lot of the trouble with provisioning has come from taking too long to calculate things, and thus falling behind more and more.
In fact, maybe it makes sense to leave the RFC vague on what the health statistics are. In the implementation, maybe a different process within worker-manager can listen to pulse, monitor clouds, and so on to update weights as quickly as possible, without the risk of blocking the provisioner process.
Yeah, I wasn't sure yet either how to keep count of those.
What I think is important is:
- know the total numbers
- know "the current state" (ratio of failed vs successful)
For the 1st, we can keep stats in the launch config's DB entity, updating counters with each success/failure (hopefully it wouldn't cause too much concurrency and locking?)
For the 2nd, we need something faster that the provisioner could query without heavy queries. It could also be a separate table in the DB with aggregated stats, updated in near real time by an external process (that would listen to pulse); a rough sketch of that idea follows below.
Does it make sense? :)
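To make that two-tier idea a bit more concrete, here is a minimal sketch, assuming a hypothetical `LaunchConfigStats` record and helper names (none of these exist in worker-manager today): lifetime counters live alongside the launch config, while a separate pulse-listener process keeps a windowed ratio the provisioner can read cheaply.

```ts
// Hypothetical shape for the stats kept per launch config.
interface LaunchConfigStats {
  launchConfigId: string;
  totalSuccess: number;   // lifetime counter, stored with the launch config
  totalFailure: number;
  windowSuccess: number;  // counters for the current window (e.g. last hour)
  windowFailure: number;
}

// Called by the listener process for each provisioning/registration event
// received over pulse; the provisioner itself never does this bookkeeping.
function recordProvisioningResult(stats: LaunchConfigStats, ok: boolean): void {
  if (ok) {
    stats.totalSuccess += 1;
    stats.windowSuccess += 1;
  } else {
    stats.totalFailure += 1;
    stats.windowFailure += 1;
  }
}

// Cheap read for the provisioner: failure ratio over the current window only.
function windowFailureRatio(stats: LaunchConfigStats): number {
  const total = stats.windowSuccess + stats.windowFailure;
  return total === 0 ? 0 : stats.windowFailure / total;
}
```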
* fail-to-success ratio over a specific time period (i.e. last hour)
* number of non-stopped workers currently running

Weight will be used to determine the likelihood of selecting a specific launch configuration for provisioning.
When we talked about this sort of health-monitoring in the context of aws-provisioner, jhford noted that if we never try a configuration, we'll have no way to know if its status has changed (region back online, etc). I suspect weights should be constrained to be strictly >0.
yes yes, I kept that in mind. We don't want to fully "punish" the region, since most of those errors could/should be temporary.
I also think that we should narrow the scope of this RFC to the "weights" only, without touching the provisioning rate. I think I've tried to cover both issues - even region distribution and dealing with broken configs.
Should there be a separate RFC for dealing with new launch configs after this one lands?
Maybe we could reuse/repurpose the existing `scalingRatio` once this gets implemented, so that it would take the new metrics into account and lower that value for totally new configs.
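As a concrete illustration of weight-based selection with weights kept strictly positive (so a temporarily broken config still gets occasional attempts and can recover, per the discussion above), here is a minimal sketch; the `MIN_WEIGHT` value, the interface, and the function name are illustrative assumptions, not part of the RFC.

```ts
interface WeightedLaunchConfig {
  launchConfigId: string;
  weight: number; // dynamically adjusted, nominally in (0, 1]
}

// Never let a config drop to zero, so recovery can be detected.
const MIN_WEIGHT = 0.05;

// Weighted random choice among the launch configs of a worker pool.
function pickLaunchConfig(configs: WeightedLaunchConfig[]): WeightedLaunchConfig {
  const clamped = configs.map(c => ({ ...c, weight: Math.max(c.weight, MIN_WEIGHT) }));
  const total = clamped.reduce((sum, c) => sum + c.weight, 0);
  let roll = Math.random() * total;
  for (const c of clamped) {
    roll -= c.weight;
    if (roll <= 0) {
      return c;
    }
  }
  return clamped[clamped.length - 1]; // floating-point fallback
}
```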
* `workerManager.resumeLaunchConfig(workerPoolId, launchConfigId)` - to resume a specific active launch configuration

The last two endpoints would allow external systems to pause/resume specific launch configurations based on their own criteria.
This might be useful when dynamic adjustment of weights is not enough to prevent provisioning workers in a specific region.
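If the proposed endpoints land as described, an external monitoring tool might call them through the Taskcluster JS client roughly like this. Note that `pauseLaunchConfig`/`resumeLaunchConfig` are the proposed API from this RFC (the client does not expose them today), and the root URL, credentials, worker pool ID, and launch config ID are placeholders.

```ts
import taskcluster from 'taskcluster-client';

// Root URL and credentials are placeholders for a real deployment.
const workerManager = new taskcluster.WorkerManager({
  rootUrl: 'https://tc.example.com',
  credentials: {clientId: '...', accessToken: '...'},
});

// Temporarily stop provisioning from a misbehaving launch config...
await workerManager.pauseLaunchConfig('proj-example/ci-linux', 'lc-us-east-1-abc123');

// ...and resume it once the underlying issue (e.g. a cloud quota) is resolved.
await workerManager.resumeLaunchConfig('proj-example/ci-linux', 'lc-us-east-1-abc123');
```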
This is great!
In the past, Generic-worker has supported something very similar, basically a hash of its config. It would consult aws-provisioner or worker-manager and, if the hash had changed, stop taking tasks and shut down (for cloud-based workers).
I don't know if it would make sense to do the same with launch configs. In terms of handling load, it's definitely better to have a working worker keep working. But when updating a worker config, e.g., to rotate a secret or upgrade worker version, it makes sense to have old workers terminate as soon as possible.
Maybe it would make sense to support "draining" a launch config, and a worker can check for that status before claiming a task. Maybe that doesn't make sense in this RFC, but something to look forward to.
Nice!
I was thinking that a launch config should have two flags:

* `is_archived` to mark old configs as inactive
* `is_paused` to mark whether the current launch config is paused (and excluded from provisioning)

Based on this, a worker would only need to know its `launch_config_id`, which it can use to check whether its config has been archived yet.
I believe this would replace or extend the existing `deploymentId` logic?
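A minimal sketch of what that worker-side check could look like, assuming worker-manager grew some way for a worker to read its launch config's flags; the `fetchLaunchConfig` callback and the field names are purely hypothetical and mirror the flags proposed above.

```ts
interface LaunchConfigStatus {
  isArchived: boolean; // old config, superseded by an updated worker pool definition
  isPaused: boolean;   // temporarily excluded from provisioning
}

// `fetchLaunchConfig` stands in for whatever call a worker would make;
// its existence is an assumption of this sketch, not an existing API.
async function shouldShutDown(
  fetchLaunchConfig: (id: string) => Promise<LaunchConfigStatus>,
  launchConfigId: string,
): Promise<boolean> {
  const status = await fetchLaunchConfig(launchConfigId);
  // A worker running on an archived config is out of date and should stop
  // claiming tasks and terminate, much like the old deploymentId check.
  return status.isArchived;
}
```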
That makes a lot of sense!
The last two endpoints would allow external systems to pause/resume specific launch configurations based on their own criteria.
This might be useful when dynamic adjustment of weights is not enough to prevent provisioning workers in a specific region.

Existing endpoints will continue to accept the same payload as before for backward compatibility.
I suspect that the create/updateWorkerPool methods will need to return a set of launch configs, too?
Yes, right. Since it would be a breaking change anyway.
Worth keeping in mind that tc-admin/ciadmin might need to be patched.
At some point, provisioning workers in region A fails with quota-exceeded errors.
The weight of A would be adjusted proportionally to the error rate - `1 - (failed / total)`.

Note: To avoid permanently disabling a launch config, we would only adjust the weight for a specific time period (i.e. the *last hour*).
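A rough sketch of that adjustment, to make the formula concrete: the weight follows `1 - (failed / total)` computed only over a recent window, so old errors age out and the config recovers on its own. The one-hour window and all names here are illustrative assumptions, not values fixed by the RFC; a positive floor like the one in the earlier selection sketch could still be applied on top.

```ts
const WINDOW_MS = 60 * 60 * 1000; // "last hour"

interface ProvisioningAttempt {
  timestamp: number; // epoch milliseconds
  ok: boolean;
}

// Weight derived from the fail-to-success ratio over the recent window only.
function launchConfigWeight(attempts: ProvisioningAttempt[], now = Date.now()): number {
  const recent = attempts.filter(a => now - a.timestamp <= WINDOW_MS);
  if (recent.length === 0) {
    return 1; // no recent data: treat the config as healthy again
  }
  const failed = recent.filter(a => !a.ok).length;
  return 1 - failed / recent.length;
}
```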
This makes sense, too, as long as that ramp-up is gradual. I can imagine a case where region A goes down and tasks start piling up. One hour later, the weight adjustment disappears and suddenly worker-manager tries to create 100 workers in region A, all of which fail. Better to start with just a few and "see how it goes".
Thanks Dustin.
I think we should separate those two problems - adjusting the weight between regions, and slowing down the provisioning rate.
I'm inclined to update this RFC to make it only about the weight distribution, and leave the scaling ratio logic for an upcoming one.
Unless you have something in mind?
If, for an updated worker pool, all launch configs had a lower weight, they would still be equally likely to be selected for provisioning.
What if... we made this weight dual-purpose - giving both the likelihood and the scaling ratio 🤔
So for any weight < 1, it would also act as a `scalingRatio`... nah, probably too complicated :)
I think it's worth keeping things simple. And, scaling ratio is distinct enough to leave out of this RFC.
Hmm... I thought static and standalone workers don't have anything configured on the worker pool side, so they shouldn't be affected at all.
Standalone workers don't have anything configured in worker-manager. Static workers do -- they get their config from worker-manager, but they are not dynamically provisioned.
Indeed... To my surprise, I discovered that static worker pools use a slightly different schema, which kind of breaks my assumption that all worker pool definitions are similar :)
I think that makes sense. Then when that launchConfig changes, workers with the old launchConfig can detect and react to that (if implemented).
One thing we experience at the moment is that a single task is submitted to a task queue (worker pool), and worker manager may launch several workers before the pending task count drops. It currently doesn't track that it has already launched a worker in response to the pending task. For example, I created a new pool yesterday, set the maxCapacity to 5, and submitted a single task. Worker Manager will consistently create 5 workers, and only stop because it hits max capacity, before one of the workers has completed booting, logged into the desktop, and claimed the task.
It isn't clear to me how this RFC will solve that problem. If each launch configuration gets different weights, I think this RFC may influence which launch configuration is used for each of the spawned workers, but will it introduce any protections so that only a single worker is launched to service the (single) task that was created? I think the only way to solve that is to balance pending tasks against "eventual worker capacity", where eventual worker capacity is effectively the number of running workers plus the total number that are currently being provisioned (sketched after this comment). I think what is happening at the moment is that as soon as a worker starts running, and we have a pending task, worker manager considers that it needs to launch a new worker, because the pending count is still 1, rather than recognising that a worker has already been created to service this task but just hasn't finished initialising yet.
I think the RFC doc could also explicitly summarise what happens in the case of bad machine images, e.g. where the worker won't even start up. This is a reasonably common problem. If worker manager can conclude that an instance is actually running but, after some period, is not calling the Queue, then it is probably a bad worker, and worker manager should start throwing alerts somehow and highlighting the issue in the UI; maybe we should have a status page, etc. Ideally there would be some kind of backoff and retrying, but if you can see the instances are running and yet, after launching e.g. 10 workers that were all running according to the cloud provider, none of them called the Queue, we probably should freeze the worker pool entirely, prevent new instances from being launched, and maybe require the worker pool to be manually reactivated. Eventually (future RFC) we may introduce some rollback feature with versioned worker pool definitions, but for now just deactivating the pool might be enough, making this very visible in the UI, and ideally alerting too.
I may have wandered off topic here, but I wanted to make sure that this RFC handles two specific issues we commonly experience today:
- Bad machine images
- Not launching multiple workers for a single pending task
Maybe it does already, but it would be good to explicitly mention these two topics in the RFC directly and explain how it handles them. Thanks!
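To illustrate the "eventual worker capacity" suggestion from the comment above, here is a minimal sketch: new workers are requested only for pending tasks not already covered by workers that are running or still being provisioned. This is an illustration of the suggestion under assumed names, not how worker-manager currently behaves.

```ts
interface PoolSnapshot {
  pendingTasks: number;         // pending count from the queue
  runningCapacity: number;      // workers already able to claim work
  provisioningCapacity: number; // workers requested but not yet registered
  maxCapacity: number;
}

function workersToRequest(pool: PoolSnapshot): number {
  const eventualCapacity = pool.runningCapacity + pool.provisioningCapacity;
  // Only the tasks not already covered by eventual capacity need new workers,
  // bounded by the remaining headroom under maxCapacity.
  const uncoveredTasks = Math.max(pool.pendingTasks - eventualCapacity, 0);
  const headroom = Math.max(pool.maxCapacity - eventualCapacity, 0);
  return Math.min(uncoveredTasks, headroom);
}
```

For the single-task example above (pendingTasks = 1, one worker already provisioning), this would return 0 instead of launching workers up to maxCapacity.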
Thanks Pete, this is important too, and I think I finally understood the root problem there - worker statuses! In that particular case, the steps that led to launching many workers without any of them doing anything were roughly: `pending tasks: 1` looks good, but as soon as that worker was fully launched it turned into `existing capacity: 1`, and it goes on, because workers start but don't call `claimWork`, and that pending count keeps spawning new workers. I think this is simply a bug in the provisioner, or the fact that it doesn't know the actual number of tasks being claimed, as it treats existing capacity as something that is already working on tasks when in fact it isn't. UPD: created taskcluster/taskcluster#7052
looks perfect
To avoid changing the ID too much when capacity and lifecycle values are changed, which happens more frequently than the config itself.
Below is the proposal to introduce Launch Configurations as standalone entities in Worker Manager.
It also outlines some "self-healing" diagnostic metrics that could potentially reduce the number of errors we see during provisioning (like resources missing or quotas exceeded).
It also proposes exposing more events from worker manager to make it more transparent and to allow external systems to react and build advanced monitoring solutions around it.
Rendered: rfcs/0191-worker-manager-launch-configs.md