Worker pool image automation #47

Open · ahal opened this issue Jan 24, 2024 · 10 comments

ahal commented Jan 24, 2024

Following a monopacker cross-training session with aerickson, Ben and I had a chat around potential avenues for automation. I wanted to jot down some of the ideas while they are fresh in my mind. We can figure out how to put them into action via proper RFCs later.

I'll try to order them from easiest to wildest.

Automate monopacker builds

One of the pain points aerickson mentioned was that it's hard to tell whether you've broken another build while working on the current one (the builds often share the same scripts). A simple first step could be to have tasks that build each image definition (without publishing), so it's clear when something breaks and what broke it.
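As a rough illustration, such a task could loop over the definitions and fail loudly. This is a minimal sketch, not monopacker's documented interface: the `monopacker build --no-publish` invocation and the `builders/*.yaml` layout are assumptions.

```python
# Minimal sketch of a build-all check; the `monopacker build --no-publish`
# CLI and the `builders/*.yaml` layout are assumptions, not monopacker's
# documented interface.
import pathlib
import subprocess
import sys

def build_all(definitions_dir: str = "builders") -> int:
    failures = []
    for definition in sorted(pathlib.Path(definitions_dir).glob("*.yaml")):
        # Build each image without publishing, so a broken shared script
        # surfaces as a failure for every image it affects.
        result = subprocess.run(["monopacker", "build", "--no-publish", definition.stem])
        if result.returncode != 0:
            failures.append(definition.stem)
    if failures:
        print(f"broken image definitions: {', '.join(failures)}", file=sys.stderr)
    return 1 if failures else 0

if __name__ == "__main__":
    sys.exit(build_all())
```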

Automate image dependency upgrades

The next step could be cron tasks that look for new versions of certain dependencies: things like Taskcluster, generic-worker, worker-runner, etc. These tasks could run the builds and publish the images, and we could update pools to use them at our leisure.
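For example, the version check itself could poll GitHub's releases API (a real endpoint). A minimal sketch for the Taskcluster monorepo, which versions generic-worker and worker-runner together; the hardcoded pinned version stands in for whatever the image definitions currently reference:

```python
# Sketch of the dependency check a cron task could run. The endpoint is
# GitHub's real releases API; PINNED is a placeholder that would be read
# from the image definitions in practice.
import json
import urllib.request

PINNED = "60.0.0"  # placeholder; read from the image definitions in practice

def latest_taskcluster_release() -> str:
    url = "https://api.github.com/repos/taskcluster/taskcluster/releases/latest"
    with urllib.request.urlopen(url) as response:
        return json.load(response)["tag_name"].lstrip("v")

if latest_taskcluster_release() != PINNED:
    print("new Taskcluster release; trigger an image rebuild and publish")
```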

aerickson mentioned image storage as a potential concern; we may need a strategy for cleaning up old, unused images.
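One possible cleanup strategy, sketched here with boto3 for AWS AMIs (other clouds would need an equivalent). The 90-day retention window is an arbitrary illustration, and a real version would have to confirm no pool still references an image before deleting it:

```python
# Hypothetical cleanup pass for old AMIs; the retention window and filters
# are illustrative only.
from datetime import datetime, timedelta, timezone

import boto3

def prune_old_amis(max_age_days: int = 90) -> None:
    ec2 = boto3.client("ec2")
    cutoff = datetime.now(timezone.utc) - timedelta(days=max_age_days)
    for image in ec2.describe_images(Owners=["self"])["Images"]:
        created = datetime.fromisoformat(image["CreationDate"].replace("Z", "+00:00"))
        if created < cutoff:
            # A real implementation must first check that no worker pool
            # still references this image.
            ec2.deregister_image(ImageId=image["ImageId"])
```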

Automate worker pool upgrades

The next step would be to automate using these images in the various pools. We could have a cron task, running out of ci-config, that looks for new images and then creates a pull request to update the pools (bear in mind ci-config is moving to GitHub).

I don't think we would want these changes to go live automatically, but automated PRs or phab revisions would be very welcome!
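The PR-creation step could use GitHub's standard pulls API. A hedged sketch: the branch with the pool-config change is assumed to have been pushed already, and the repo path and PR text are placeholders.

```python
# Sketch of opening the automated PR; GitHub's POST /repos/{repo}/pulls API
# is real, but the repo path and branch-preparation step are placeholders.
import os

import requests

def open_update_pr(repo: str, branch: str, image_id: str) -> str:
    response = requests.post(
        f"https://api.github.com/repos/{repo}/pulls",
        headers={"Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}"},
        json={
            "title": f"Update worker pools to image {image_id}",
            "head": branch,  # branch containing the pool config change
            "base": "main",
            "body": "Automated worker image update; merge at your leisure.",
        },
    )
    response.raise_for_status()
    return response.json()["html_url"]
```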

In-repo image upgrades

This is the pinnacle of automation. The same cron task from the previous section (in ci-config) would run and look for newer versions of the images. The images themselves would still be defined in ci-config, but the image a given pool uses would be pinned in-repo.

The cron task would iterate through all projects, look at which image each repo is using, and, if there is a newer one, create a PR to update it (see the sketch below). Maintainers for each repo could decide to merge or close the PR.

This has many benefits:

  1. Image updates can be backed out if they cause problems
  2. Image updates can be tested directly in pull requests for the various repos
  3. Different repos can more easily use different versions of the image

It's worth noting that Gecko won't be using pull requests, so we'd need to submit a phab revision in that case.
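To make the shape of that per-project check concrete, here is an illustrative sketch. The helper callables and their semantics are hypothetical; only the loop's logic comes from the proposal above.

```python
# Hypothetical per-project check for the in-repo variant; the helpers are
# injected as parameters because their real implementations don't exist yet.
from typing import Callable

def check_project(
    project: str,
    read_pinned_image: Callable[[str], str],  # e.g. parse worker-type from the repo's CI config
    latest_image: Callable[[str], str],       # e.g. newest published image for that pool
    open_upgrade_request: Callable[[str, str, str], None],
) -> None:
    pinned = read_pinned_image(project)
    latest = latest_image(pinned)
    if latest != pinned:
        # Maintainers decide whether to merge; Gecko would get a Phabricator
        # revision instead of a pull request.
        open_upgrade_request(project, pinned, latest)
```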

ahal commented Jan 24, 2024

Also, we can probably apply similar ideas to our Azure images.

ahal commented Jan 24, 2024

To expand a bit on how "in-repo image upgrades" would work, I think ci-config would need to generate pools for every available image, as well as a latest pool that always points at the most recent image. Then repos would simply set the worker-type to either latest or one with a date in the name. At that point, you update images by updating pools.
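A small sketch of that pool generation; the date-based image IDs and the pool-naming scheme are invented for illustration:

```python
# Generate one worker pool per published image, plus a "latest" alias.
# Image IDs are assumed to sort chronologically (e.g. dates).
def generate_pools(base: str, image_ids: list[str]) -> dict[str, str]:
    ordered = sorted(image_ids)
    pools = {f"{base}-{image_id}": image_id for image_id in ordered}
    pools[f"{base}-latest"] = ordered[-1]
    return pools

# {'b-linux-2024.01.10': '2024.01.10', 'b-linux-2024.01.24': '2024.01.24',
#  'b-linux-latest': '2024.01.24'}
print(generate_pools("b-linux", ["2024.01.24", "2024.01.10"]))
```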

To avoid too many unused pools lying around, we could have a check that inspects which pools each configured project is using and warns when there are unused pools/images.
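The check itself is just a set difference once both inputs can be collected; the pool names below are placeholders:

```python
# Warn about pools (and hence images) that no configured project references.
def unused_pools(defined: set[str], referenced: set[str]) -> set[str]:
    return defined - referenced

stale = unused_pools(
    defined={"b-linux-2024.01.10", "b-linux-2024.01.24", "b-linux-latest"},
    referenced={"b-linux-latest"},
)
for pool in sorted(stale):
    print(f"warning: pool {pool} appears unused and could be cleaned up")
```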

petemoore commented:

I don't think this approach scales, and it gives too much control to ci-config, which is outside of project authority. I would prefer an approach where projects have full autonomy, something like how Dockerfiles in-tree are used for building Docker images, under the full autonomy of the development team. I think we need to provide APIs and services that allow teams to build their own images, rather than owning the image configurations ourselves and allowing people to use what we created.

petemoore commented:

Something along the lines of taskcluster/taskcluster-rfcs#122.

ahal commented Jan 25, 2024

Having full control of image building from within each project sounds wonderful! Is this something the TC team is planning to work towards? It might still be worth tackling some of this in the meantime as we're feeling the pain and need some kind of improvement here soon :)

I'm not sure why this approach wouldn't scale, however; to me it seems like it could be almost entirely automated, other than needing to merge PRs. But perhaps I'm missing something.

I agree that having full in-project control is the ideal; I'm just struggling to see a concrete path that gets us there.

markcor commented Jan 25, 2024

> Also, we can probably apply similar ideas to our Azure images.

With the Azure images, much of this is in line with what we are doing and planning to do:
https://github.com/mozilla-platform-ops/worker-images

bhearsum commented:

> Having full control of image building from within each project sounds wonderful! Is this something the TC team is planning to work towards? It might still be worth tackling some of this in the meantime as we're feeling the pain and need some kind of improvement here soon :)
>
> I'm not sure why this approach wouldn't scale, however; to me it seems like it could be almost entirely automated, other than needing to merge PRs. But perhaps I'm missing something.
>
> I agree that having full in-project control is the ideal; I'm just struggling to see a concrete path that gets us there.

While not necessarily a blocker, a potential downside of doing this is that it would put image building in the critical path of running builds/tests. Obviously this is already the case for Docker images, but depending on how much time it adds to the critical path, it may have significant downsides. This is one of the big upsides of putting a reference to an already-existing image in the tree: you gain in-tree control over what things are built on without putting anything new in the critical path.

(I also agree, however, that the ideal state is everything in the tree.)

petemoore commented:

> I'm not sure why this approach wouldn't scale, however; to me it seems like it could be almost entirely automated, other than needing to merge PRs. But perhaps I'm missing something.

Apologies, it is certainly a more automated approach than we currently have, and a definite improvement. By not scaling, I really mean that any time a human from a different team needs to intervene to approve something (such as merging a PR), we potentially block each other. We don't have 24/7 coverage in teams, so people will invariably need to wait. The more images that are created and managed, the more human resources you need to handle the requests; you are always constrained by the number of people who can respond. Agreed, it is a lot better than the current approach, but I think it would be good to aim for one that doesn't require any central approval, so teams can have full autonomy.

ahal commented Jan 26, 2024

I think a key point here is that it would be project maintainers merging the PRs, not releng or relops (well, we would merge the PRs for new images, but not for new worker pools).

Tbh, I really don't feel comfortable about this stuff just automatically going live into production without any human intervention.

Edit: Re-reading your comment, I don't think that's what you're suggesting; rather, I think you misunderstood my proposal. I'm proposing we move away from centralized gatekeepers here. See this line from the initial comment:

> Maintainers for each repo could decide to merge or close the PR.

@hwine
Copy link

hwine commented Jan 29, 2024

/me notes that "decentralizing" some of this does change security boundaries, at least for the Fx CI case. I.e. RRA at some point, please.
