Cache failed builds #224

Closed
MagicRB opened this issue Jul 15, 2024 · 14 comments

@MagicRB
Contributor

MagicRB commented Jul 15, 2024

Currently, if a build fails, buildbot-nix will attempt to rebuild it every time it pops up in an evaluation, which is wasteful unless the failure was temporary. The lix fork of buildbot-nix already caches failed builds, so we can just pull those changes into upstream (this repository).

MagicRB added this to the Release 1.0 milestone Jul 15, 2024
@Mic92
Member

Mic92 commented Jul 15, 2024

The lix fork creates build steps for every derivation, I saw. I think this might be ok for packages, but it can get really slow for NixOS configurations, where we have to build a lot of small derivations. At least it was in hercules-ci. But I might be wrong.

@MagicRB
Contributor Author

MagicRB commented Jul 15, 2024

Oh yeah, that is a horrible idea. I was thinking about caching the failures of top-level derivations.

@MagicRB
Contributor Author

MagicRB commented Jul 16, 2024

@Mic92 as far as I can tell, their failed build caching doesn't work across builds: https://git.lix.systems/lix-project/buildbot-nix/src/branch/main/buildbot_nix/__init__.py#L180. I don't see a persistent global failed build cache anywhere, so we cannot pull anything afaict.
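
For illustration, a persistent failed build cache of the kind discussed here could look roughly like the sketch below. This is not code from buildbot-nix or the lix fork; the class name `FailedBuildCache`, the table layout, and the example store path are all assumptions made up for this sketch. It only uses the standard library `sqlite3` module, and it keys failures by derivation path so that a top-level derivation that already failed can be skipped in later evaluations.

```python
import sqlite3
import time


class FailedBuildCache:
    """Hypothetical persistent cache of failed derivations, keyed by .drv path."""

    def __init__(self, path: str = ":memory:") -> None:
        # Pass a file path instead of ":memory:" to survive restarts.
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS failed_builds ("
            "drv_path TEXT PRIMARY KEY, failed_at REAL NOT NULL, url TEXT)"
        )
        self.db.commit()

    def record_failure(self, drv_path: str, url: str = "") -> None:
        # Remember when the derivation failed and where the failing build lives.
        self.db.execute(
            "INSERT OR REPLACE INTO failed_builds VALUES (?, ?, ?)",
            (drv_path, time.time(), url),
        )
        self.db.commit()

    def is_known_failure(self, drv_path: str) -> bool:
        row = self.db.execute(
            "SELECT 1 FROM failed_builds WHERE drv_path = ?", (drv_path,)
        ).fetchone()
        return row is not None

    def forget(self, drv_path: str) -> None:
        # Drop an entry, e.g. when a rebuild is requested after a temporary failure.
        self.db.execute("DELETE FROM failed_builds WHERE drv_path = ?", (drv_path,))
        self.db.commit()


cache = FailedBuildCache()
drv = "/nix/store/example-hello.drv"  # made-up path, for illustration only
cache.record_failure(drv)
```

A scheduler could consult `is_known_failure` before triggering a build, and `forget` gives a way out when the failure was temporary.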

@puckipedia
Contributor

> The lix fork creates build steps for every derivation, I saw.

If this were the case, any build even remotely dependent on nixpkgs would create thousands of build steps, which would bog down any buildbot instance instantly. Any evaluation done by our buildbot instance also shows this isn't how it works. We've just replaced the part of the scheduler that registers all builds in one go with one that only schedules a build once the builds it depends on succeed.

@Mic92
Member

Mic92 commented Jul 23, 2024

I cannot see the website unfortunately. There seems to be an issue with the authentication:

> You do not have the can-perform-mutations buildbot role

@puckipedia
Contributor

..I entirely forgot we put the entire site behind ACLs, as a temporary thing until we got buildbot-native ACLs working, and then kinda forgot to write the buildbot ACLs. Sorry! Attached a screenshot.

loooong screenshot

lix-eval

@MagicRB
Contributor Author

MagicRB commented Jul 26, 2024

@puckipedia I must be missing something, because your code does trigger a scheduler for each derivation, as can be seen in the logs by the flood of "Scheduling" messages, which are coming from here, and then right after that you trigger a scheduler to do the actual build of that one derivation.

@puckipedia
Contributor

that's not "every derivation"; otherwise I'd say that this repo's buildbot-nix in fact also builds "every derivation". Compare:

also a long screenshot, though not quite as long

https://buildbot.thalheim.io/#/builders/4/builds/333

I'd say that the qemu dependency itself would fill like twenty pages if every derivation was scheduled independently :)

(also, -29 pending builds?)

@puckipedia
Contributor

And mind you, the "Scheduling" logic there is just replicating what the Buildbot Trigger class does. However, it does so entirely dynamically, rather than scheduling all builds at the start of the run in getSchedulersAndProperties. This allows us finer control to determine when a check is scheduled; and we use this to avoid wasting CPU on building every check when a check it depends on has failed. In this example, we avoid rebuilding Lix halfway through about 28 times, which is a huge time waster:

You know the gist by now.

yup that's right, another screenshot of the lix buildbot, showing 28 skipped checks because a build that they all depended upon failed.
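
The idea described above, scheduling a check only once the checks it depends on have succeeded and skipping it otherwise, can be sketched in a few lines. This is a simplified, sequential illustration and not the lix fork's code: the function `run_checks`, the check names, and the `build` callback are all hypothetical, and the actual scheduler works dynamically inside Buildbot rather than in a single loop.

```python
from graphlib import TopologicalSorter
from typing import Callable


def run_checks(
    deps: dict[str, set[str]],
    build: Callable[[str], bool],
) -> dict[str, str]:
    """Build checks in dependency order; skip a check if any dependency did not succeed."""
    results: dict[str, str] = {}
    # static_order() yields every check after all of the checks it depends on.
    for check in TopologicalSorter(deps).static_order():
        if any(results[dep] != "success" for dep in deps.get(check, set())):
            results[check] = "skipped"  # never handed to a builder
        elif build(check):
            results[check] = "success"
        else:
            results[check] = "failed"
    return results


# Hypothetical example: two tests depend on "lix", which fails to build.
built: list[str] = []


def fake_build(name: str) -> bool:
    built.append(name)
    return name != "lix"


deps = {"lix": set(), "test-a": {"lix"}, "test-b": {"lix"}, "docs": set()}
results = run_checks(deps, fake_build)
```

Here the two dependent tests end up as skipped without ever being passed to `fake_build`, while the unrelated `docs` check still builds.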

@MagicRB
Contributor Author

MagicRB commented Jul 30, 2024

Ah okay, please excuse my complete and utter blindness (turns out reading untyped code is very hard, ugh, I hate python with a passion), I got my arrays mixed up :( .

Your system does look nice, I do have to admit that. I'll see how much of it will be ergonomic to pull in, since the codebases have diverged a fair amount. (I would have written it slightly differently, but I also would have rewritten buildbot in Haskell if I was left to do whatever I want, so it may be a good thing I didn't write it 🤣 might be more readable for literally everyone but me this way)

@MagicRB
Contributor Author

MagicRB commented Jul 30, 2024

> And mind you, the "Scheduling" logic there is just replicating what the Buildbot Trigger class does. However, it does so entirely dynamically, rather than scheduling all builds at the start of the run in getSchedulersAndProperties. This allows us finer control to determine when a check is scheduled; and we use this to avoid wasting CPU on building every check when a check it depends on has failed. In this example, we avoid rebuilding Lix halfway through about 28 times, which is a huge time waster:
> You know the gist by now.
>
> yup that's right, another screenshot of the lix buildbot, showing 28 skipped checks because a build that they all depended upon failed.

That is very reasonable. I'll need to read more, as I am slightly struggling to make sense of the code. I'll figure it out tho, thanks for the help, I appreciate it.

@MagicRB
Contributor Author

MagicRB commented Sep 14, 2024

Closed by #255

@Mic92
Member

Mic92 commented Oct 9, 2024

@puckipedia we recently converted to async/await syntax and, thanks to better type checks, found a potential bug in your original code.

On this line: https://git.lix.systems/lix-project/buildbot-nix/src/commit/48828cb33fbd99ae2e442c29a888217cd892b22e/buildbot_nix/__init__.py#L292
it creates a deferred object but it is never yielded, effectively leaking it.

See also the async method here: https://git.lix.systems/lix-project/buildbot-nix/src/commit/48828cb33fbd99ae2e442c29a888217cd892b22e/buildbot_nix/__init__.py#L210

@puckipedia
Contributor

note that defer and async have slightly different behaviours in this case; from what i can tell, it's safe to not drive the defer and let it execute in the background; that's what is happening here
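
The difference mentioned here matters mostly on the async side: calling an `async def` function without awaiting or scheduling it only creates a coroutine object, and its body never runs. The sketch below shows just that half, using the standard library `asyncio` rather than Twisted so that it is self-contained; the function `update_status` and the log messages are made up for illustration. Whether the un-yielded Deferred in the linked code really completes on its own is not verified here.

```python
import asyncio

log: list[str] = []


async def update_status(name: str) -> None:
    # Hypothetical stand-in for some fire-and-forget work.
    log.append(f"start {name}")
    await asyncio.sleep(0)
    log.append(f"done {name}")


async def main() -> None:
    # Calling the coroutine function only creates a coroutine object; nothing runs.
    dropped = update_status("dropped")
    # Wrapping the coroutine in a task schedules it on the event loop in the background.
    task = asyncio.create_task(update_status("background"))
    await asyncio.sleep(0.01)  # give the loop a moment to run the task
    dropped.close()  # silences the "never awaited" RuntimeWarning
    await task


asyncio.run(main())
```

After this runs, `log` contains only the entries for the background task, so code that relied on un-driven work completing needs an explicit task (or an equivalent wrapper) once it is converted to async/await.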
