support partial/failed builds #668

Closed
cgwalters opened this issue Jul 25, 2019 · 7 comments

@cgwalters
Member

For RHCOS we push an oscontainer to a registry, which is outside of S3.

This leads to questions about which copy is canonical. There are a few options here. First, we could store the oscontainer in S3 too, and then synchronize it afterwards. (A tricky detail here is that we need to compress before doing so to get the correct sha256 in the machine.)

The second option is for cosa to have a notion of "in progress" builds - basically we add a new entry to builds.json with a meta.json that's just {"building": "true"} or something.

If buildprep discovers that the tip build is building: true, it would delete that one and replace it with an incremented version number.
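As a rough illustration of that flow (the `building` key and this helper are hypothetical, not existing cosa code):

```python
import json

# Hypothetical sketch: if the newest entry in builds.json points at a
# meta.json marked "building", drop the placeholder and bump the version.
def next_version(builds_json_path):
    with open(builds_json_path) as f:
        builds = json.load(f)

    latest = builds["builds"][0]  # newest build ID, e.g. "31.20190725.1"
    with open(f"builds/{latest}/meta.json") as f:
        meta = json.load(f)

    if meta.get("building") == "true":
        # The previous run never finished: drop the placeholder entry and
        # bump the trailing serial so the new build gets a fresh ID.
        builds["builds"].pop(0)
        prefix, serial = latest.rsplit(".", 1)
        return f"{prefix}.{int(serial) + 1}"
    return None  # tip build completed normally; normal versioning applies
```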

@cgwalters
Member Author

Another argument for having a representation of a "failed" build in cosa is that today if e.g. we fail during or after uploading cloud images (AMIs, etc.), then these effectively "leak".

One can do pruning by starting from the "strong set" of successful builds and looking for cloud images that don't match it, but it's more accurate to have a pruner that e.g. walks the set of failed builds and deletes their cloud resources after, say, a day.
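A minimal sketch of such a pruner, assuming a hypothetical list of failed-build records and a stand-in cloud client (neither exists in cosa today):

```python
import time

ONE_DAY = 24 * 60 * 60

def prune_failed_builds(failed_builds, cloud):
    """Delete cloud resources left behind by failed builds older than a day.

    `failed_builds` is a hypothetical list of dicts like
    {"id": ..., "timestamp": <epoch seconds>, "amis": [...]}, and `cloud` is
    a stand-in object with a deregister_image() method.
    """
    now = time.time()
    for build in failed_builds:
        if now - build["timestamp"] < ONE_DAY:
            continue  # grace period for builds that may still be retried
        for ami in build.get("amis", []):
            cloud.deregister_image(ami)
```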

darkmuggle self-assigned this Jul 25, 2019
@jlebon
Member

jlebon commented Jul 26, 2019

The second option is for cosa to have a notion of "in progress" builds - basically we add a new entry to builds.json with a meta.json that's just {"building": "true"} or something.

I like this idea, and I think this is something we'll need too in FCOS when implementing signing through fedora messaging.

Minor bikeshed: instead of a building: true in meta.json, I wonder if we could instead put it directly in builds.json under a separate e.g. pending-build key? That way (1) we reinforce that there can only be a single build in such a state at a time, (2) it makes pruning trivial, and (3) it maintains the invariant that all builds in builds[] are completed.
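For concreteness, the proposed builds.json shape might look something like this (written as a Python literal; the pending-build key is the proposal here, not the current schema):

```python
# Hypothetical builds.json layout: a single pending build tracked under its
# own key, with builds[] holding only completed builds.
builds_json = {
    "pending-build": "30.20190726.0",   # at most one build in this state
    "builds": [                         # all of these are completed builds
        "30.20190725.0",
        "30.20190724.0",
    ],
    "timestamp": "2019-07-26T14:00:00Z",
}
```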

@cgwalters
Member Author

If there can only be one pending-build at a time, then that means to do "accurate GC", the pipeline would need to prune any cloud resources from the failed build before starting the next one.

(BTW, the pruning here should clearly live in cosa, something like cosa delete-build or so?)

Maybe we aren't disagreeing actually... I think I agree with builds[] being completed. So we'd have at most one pending-build as you say, and failed builds go to failed-builds[].

The buildprep logic, though, would need to take the newer entry (by timestamp?) from builds[] and failed-builds[] to allocate a version number.
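Something like the following, assuming hypothetical failed-builds[] entries that carry a timestamp alongside the usual builds[] list:

```python
def newest_build_id(builds, failed_builds):
    """Return the ID of the newest build across completed and failed builds,
    so the next allocated version number can't collide with either.

    Both arguments are assumed to be lists of dicts like
    {"id": "...", "timestamp": <epoch seconds>}; failed_builds is hypothetical.
    """
    candidates = builds + failed_builds
    if not candidates:
        return None
    return max(candidates, key=lambda b: b["timestamp"])["id"]
```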

jlebon added a commit to jlebon/fedora-coreos-pipeline that referenced this issue Jul 30, 2019
Use cosa to compress the qcow2, so we get multi-threading, before
archiving. Otherwise, we risk overloading the Jenkins master PVC before
too long.

Another strategy eventually would be to still upload to S3 (see [1]
which is related to this). But even then, we'd still want to compress
first.

[1] coreos/coreos-assembler#668
@cgwalters
Member Author

Bigger picture here...there's an interesting intersection/overlap between e.g. Jenkins artifacts and cosa's builds. Jenkins, for example, clearly has support for sticking its artifacts in things like S3. But that's tied pretty closely to Jenkins, and consequently makes Jenkins a bit more of a "pet".

It also means local/dev builds would need to do something custom.

This issue illustrates the flip side, though: we need to carefully define the interface between a pipeline and cosa.

I think what we're doing is right, just wanted to write this down.

jlebon added a commit to coreos/fedora-coreos-pipeline that referenced this issue Jul 30, 2019
@cgwalters
Member Author

Stuck a WIP for this in #885

The other option is to include hours:minutes:seconds in build IDs or so.

@cgwalters
Member Author

The other option is to include hours:minutes:seconds in build IDs or so.

Cowardly punted and did this for RHCOS now.
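A minimal sketch of a time-stamped build ID, loosely modeled on RHCOS-style IDs; the exact prefix and layout here are illustrative:

```python
from datetime import datetime, timezone

def new_build_id(prefix="43.81", serial=0):
    # Embed the time of day in the ID so retried builds on the same day never
    # reuse a version number.  The layout is illustrative, e.g. producing
    # something like 43.81.201907251530.0; seconds could be appended the same
    # way if minute granularity isn't enough.
    now = datetime.now(timezone.utc)
    return f"{prefix}.{now.strftime('%Y%m%d%H%M')}.{serial}"
```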

jlebon added a commit to jlebon/fedora-coreos-pipeline that referenced this issue Nov 22, 2019
By default Jenkins tries to be conservative and writes back lots of data
so that it can resume pipelines from a specific point if it gets
interrupted.

We don't care about that here. We want this to be a native functionality
in cosa (though we're not entirely there yet, see:
coreos/coreos-assembler#668). And in the
future we want to split the pipeline into multiple jobs exactly to make
rerunning things easier.

For more information, see:
https://jenkins.io/doc/book/pipeline/scaling-pipeline/
dustymabe pushed a commit to dustymabe/fedora-coreos-pipeline that referenced this issue Dec 13, 2019
jlebon added a commit to jlebon/fedora-coreos-pipeline that referenced this issue Dec 16, 2019
jlebon added a commit to coreos/fedora-coreos-pipeline that referenced this issue Dec 16, 2019
@jlebon
Member

jlebon commented Sep 21, 2023

Nowadays, the oscontainer is both pushed to the registry and part of the build dir. Also, with the pipeline rework, the problem of "orchestrating" across multiple Jenkins instances is no longer an issue. Builds often fail, and the bits that did pass make it to S3 and show up in the build list, but we just never release them, so they're never exposed to customers/users. Not sure; probably not worth adding this concept at this point. Feel free to reopen if someone disagrees.

jlebon closed this as completed Sep 21, 2023