Start a new slave per build #22448
Comments
Can’t ccache caches be shared between machines somehow, perhaps by storing the cache on a network share? This way we could just ccache llvm.
That sounds like an excellent solution to the problem!
A distributed ccache does sound useful here. Slave per build will take a few more minutes per cycle because of the time it takes to start slaves, but it's not a great cost. I wonder if we could use a simple distributed rustc cache to speed up half of the auto builds.
Sounds a bit heavy-handed.
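For concreteness, the shared-ccache idea floated above might look roughly like the sketch below when run on a slave. The mount point, umask, and compiler wrappers are placeholders, not anything the bots actually use today:

```python
import os
import subprocess

# Point ccache at a shared network mount so every slave reuses the same
# LLVM object cache. Paths and settings here are hypothetical.
env = dict(os.environ)
env["CCACHE_DIR"] = "/mnt/buildcache/ccache"  # assumed NFS/SMB mount shared by the slaves
env["CCACHE_UMASK"] = "002"                   # let every slave write new cache entries
env["CC"] = "ccache gcc"
env["CXX"] = "ccache g++"

# Build with the shared cache in effect; a warm cache turns most of the
# LLVM compiles into cache hits.
subprocess.check_call(["make", "-j8"], env=env)
```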
Is this still an issue or is this handled by the job group(?) stuff that was implemented for the windows buildbots? |
I believe that using
Triage: not aware of any changes here. |
This likely isn't going to happen, we'd basically have to rewrite our entire CI system. |
It is far too common nowadays that one build on our buildbot ends up corrupting all future builds, requiring a manual login to the bot to fix the state of affairs. The most common cause is a rogue process that remains running on Windows, preventing the executable from being recreated (e.g. causing the compiler to fail to link on all subsequent test runs).
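The manual fix today amounts to logging in and killing whatever is still holding the binaries open, roughly like the sketch below (the process names are examples only):

```python
import subprocess

# Rough illustration of the manual cleanup step on the Windows bot:
# forcibly kill leftover test/compiler processes that keep binaries locked.
LEFTOVERS = ["rustc.exe", "compiletest.exe"]  # example names, not an exhaustive list

def kill_leftovers():
    for name in LEFTOVERS:
        # taskkill exits non-zero when no such process exists; that's fine here.
        subprocess.call(["taskkill", "/F", "/IM", name])
```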
It would likely be much more robust to start a new slave each time we start a new build. This way we're guaranteed a 100% clean slate when starting a build (e.g. no lingering processes). This would, however, require caching LLVM builds elsewhere and having the first step of the build download the most recent LLVM build.
As such, this is not an easy change to implement at all, hence the wishlist status for our infrastructure.
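A very rough sketch of the "download a prebuilt LLVM as the first build step" part, assuming some cache storage exists (the URL and artifact naming scheme here are made up):

```python
import subprocess
import urllib.request
from pathlib import Path

# Hypothetical cache location; whatever storage the infrastructure ends up
# with would replace this.
CACHE_URL = "https://builds.example.org/llvm-cache"

def fetch_cached_llvm(llvm_rev: str, dest: Path) -> bool:
    """Try to download and unpack a prebuilt LLVM for `llvm_rev` into `dest`.
    Returns False on a cache miss, in which case LLVM is built from source."""
    dest.mkdir(parents=True, exist_ok=True)
    tarball = dest / f"llvm-{llvm_rev}.tar.xz"
    try:
        urllib.request.urlretrieve(f"{CACHE_URL}/llvm-{llvm_rev}.tar.xz", str(tarball))
    except OSError:
        return False  # cache miss: fall back to building LLVM locally
    subprocess.check_call(["tar", "xf", str(tarball), "-C", str(dest)])
    return True
```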
cc @brson