support partial/failed builds #668

Closed
cgwalters opened this issue Jul 25, 2019 · 7 comments

@cgwalters
Member

For RHCOS we push an oscontainer to a registry, which is outside of S3.

This leads to questions about which copy is canonical. There are a few options here. First, we could store the oscontainer in S3 too, and then synchronize it afterwards. (A tricky detail here is that we need to compress before doing so to get the correct sha256 in the machine.)

The second option is for cosa to have a notion of "in progress" builds - basically we add a new entry to builds.json with a meta.json that's just {"building": "true"} or something.

If buildprep discovers that the tip build is building: true, it would delete that one and replace it with an incremented version number.
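As a rough illustration of that flow (the `building` key and this helper are hypothetical, not existing cosa code):

```python
import json

# Hypothetical sketch: if the newest entry in builds.json points at a
# meta.json marked "building", drop the placeholder and bump the version.
def next_version(builds_json_path):
    with open(builds_json_path) as f:
        builds = json.load(f)

    latest = builds["builds"][0]  # newest build ID, e.g. "31.20190725.1"
    with open(f"builds/{latest}/meta.json") as f:
        meta = json.load(f)

    if meta.get("building") == "true":
        # The previous run never finished: drop the placeholder entry and
        # bump the trailing serial so the new build gets a fresh ID.
        builds["builds"].pop(0)
        prefix, serial = latest.rsplit(".", 1)
        return f"{prefix}.{int(serial) + 1}"
    return None  # tip build completed normally; normal versioning applies
```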

@cgwalters
Member Author

Another argument for having a representation of a "failed" build in cosa is that today if e.g. we fail during or after uploading cloud images (AMIs, etc.), then these effectively "leak".

One can do pruning by starting from the "strong set" of successful builds and looking for cloud images that don't match it, but it's more accurate to have a pruner that e.g. walks the set of failed builds and deletes their cloud resources after, say, a day.
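A minimal sketch of such a pruner, assuming a hypothetical list of failed-build records and a stand-in cloud client (neither exists in cosa today):

```python
import time

ONE_DAY = 24 * 60 * 60

def prune_failed_builds(failed_builds, cloud):
    """Delete cloud resources left behind by failed builds older than a day.

    `failed_builds` is a hypothetical list of dicts like
    {"id": ..., "timestamp": <epoch seconds>, "amis": [...]}, and `cloud` is
    a stand-in object with a deregister_image() method.
    """
    now = time.time()
    for build in failed_builds:
        if now - build["timestamp"] < ONE_DAY:
            continue  # grace period for builds that may still be retried
        for ami in build.get("amis", []):
            cloud.deregister_image(ami)
```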

darkmuggle self-assigned this Jul 25, 2019
@jlebon
Member

jlebon commented Jul 26, 2019

The second option is for cosa to have a notion of "in progress" builds - basically we add a new entry to builds.json with a meta.json that's just {"building": "true"} or something.

I like this idea, and I think this is something we'll need too in FCOS when implementing signing through fedora messaging.

Minor bikeshed: instead of a building: true in meta.json, I wonder if we could instead put it directly in builds.json under a separate e.g. pending-build key? That way (1) we reinforce that there can only be a single build in such a state at a time, (2) it makes pruning trivial, and (3) it maintains the invariant that all builds in builds[] are completed.
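For concreteness, the proposed builds.json shape might look something like this (written as a Python literal; the pending-build key is the proposal here, not the current schema):

```python
# Hypothetical builds.json layout: a single pending build tracked under its
# own key, with builds[] holding only completed builds.
builds_json = {
    "pending-build": "30.20190726.0",   # at most one build in this state
    "builds": [                         # all of these are completed builds
        "30.20190725.0",
        "30.20190724.0",
    ],
    "timestamp": "2019-07-26T14:00:00Z",
}
```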

@cgwalters
Member Author

If there can only be one pending-build at a time, then that means to do "accurate GC", the pipeline would need to prune any cloud resources from the failed build before starting the next one.

(BTW, the pruning here should clearly live in cosa, something like cosa delete-build or so?)

Maybe we aren't disagreeing actually... I think I agree with builds[] being completed. So we'd have at most one pending-build as you say, and failed builds go to failed-builds[].

The buildprep logic, though, would need to take the newer entry (by timestamp?) from builds[] and failed-builds[] to allocate a version number.
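Something like the following, assuming hypothetical failed-builds[] entries that carry a timestamp alongside the usual builds[] list:

```python
def newest_build_id(builds, failed_builds):
    """Return the ID of the newest build across completed and failed builds,
    so the next allocated version number can't collide with either.

    Both arguments are assumed to be lists of dicts like
    {"id": "...", "timestamp": <epoch seconds>}; failed_builds is hypothetical.
    """
    candidates = builds + failed_builds
    if not candidates:
        return None
    return max(candidates, key=lambda b: b["timestamp"])["id"]
```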

jlebon added a commit to jlebon/fedora-coreos-pipeline that referenced this issue Jul 30, 2019
Use cosa to compress the qcow2, so we get multi-threading, before
archiving. Otherwise, we risk overloading the Jenkins master PVC before
too long.

Another strategy eventually would be to still upload to S3 (see [1]
which is related to this). But even then, we'd still want to compress
first.

[1] coreos/coreos-assembler#668
@cgwalters
Member Author

Bigger picture here...there's an interesting intersection/overlap between e.g. Jenkins artifacts and cosa's builds. Jenkins, for example, clearly has support for sticking its artifacts in things like S3. But that's tied pretty closely to Jenkins, and consequently makes Jenkins a bit more of a "pet".

It also means local/dev builds would need to do something custom.

This issue illustrates the flip side, though: we need to carefully define the interface between a pipeline and cosa.

I think what we're doing is right, just wanted to write this down.

jlebon added a commit to coreos/fedora-coreos-pipeline that referenced this issue Jul 30, 2019
@cgwalters
Member Author

Stuck a WIP for this in #885

The other option is to include hours:minutes:seconds in build IDs or so.

@cgwalters
Member Author

The other option is to include hours:minutes:seconds in build IDs or so.

Cowardly punted and did this for RHCOS now.
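A minimal sketch of a time-stamped build ID, loosely modeled on RHCOS-style IDs; the exact prefix and layout here are illustrative:

```python
from datetime import datetime, timezone

def new_build_id(prefix="43.81", serial=0):
    # Embed the time of day in the ID so retried builds on the same day never
    # reuse a version number.  The layout is illustrative, e.g. producing
    # something like 43.81.201907251530.0; seconds could be appended the same
    # way if minute granularity isn't enough.
    now = datetime.now(timezone.utc)
    return f"{prefix}.{now.strftime('%Y%m%d%H%M')}.{serial}"
```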

jlebon added a commit to jlebon/fedora-coreos-pipeline that referenced this issue Nov 22, 2019
By default Jenkins tries to be conservative and writes back lots of data
so that it can resume pipelines from a specific point if it gets
interrupted.

We don't care about that here. We want this to be a native functionality
in cosa (though we're not entirely there yet, see:
coreos/coreos-assembler#668). And in the
future we want to split the pipeline into multiple jobs exactly to make
rerunning things easier.

For more information, see:
https://jenkins.io/doc/book/pipeline/scaling-pipeline/
dustymabe pushed a commit to dustymabe/fedora-coreos-pipeline that referenced this issue Dec 13, 2019
jlebon added a commit to jlebon/fedora-coreos-pipeline that referenced this issue Dec 16, 2019
jlebon added a commit to coreos/fedora-coreos-pipeline that referenced this issue Dec 16, 2019
@jlebon
Member

jlebon commented Sep 21, 2023

Nowadays, the oscontainer is both pushed to the registry and part of the build dir. Also, with the pipeline rework, the problem of "orchestrating" across multiple Jenkins instances is no longer an issue. Builds often fail, and the bits that did pass make it to S3 and show up in the build list, but we just never release them, so they're never exposed to customers/users. Not sure; probably not worth adding this concept at this point. Feel free to reopen if someone disagrees.

jlebon closed this as completed Sep 21, 2023