Pin OS image for release builds #757
Conversation
Looks like the first commit here overlaps with #732 (but has a different hash)?
I'd still like to understand how far we are from pivoting to the OS image which was built into the release image. Related discussion in #281.
+1. Probably just drop the
Well, I think there is some benefit to collecting these together. It would be great to be able to commit some environment variables to
I will just close #732 and deal with all the commits here.
Done.
That is a wonderful idea, and that is indeed the place it all should live. An env var to the build script still has the advantage of easily testing a custom release build.
Could do it, though I think it helps to give a list of possible ones when the buildName does not match any.
I say we should do it only if we really need to. On the other hand, I would like it if we could even bubble down the default channel.
We could do that later, on a best-effort basis, if we need it. Something like this.
Well, 124df53 has the bubbled-up version if folks want to take a look ;). About bubbling down, that would simplify these APIs slightly, and folks could always supply their own libvirt image or AMI ID if they want to bypass our detection logic. Do we expect other folks to use this package and want more flexibility than "the latest maipo"? If we have any external consumers, they haven't told godocs about their package ;).
One thing related to this - I think we should consider always pinning the RHCOS build even in git master - then the RHCOS team owns sending PRs to bump it. Though we need to be aware in this model that e.g. if we want to get a change to kubelet landed, it'd have to merge into origin and then a secondary PR here.
@cgwalters That seems reasonable to me.
@cgwalters @ashcrow We had decided that master would always float so that we don't allow sub-components to drift too far from a working state. If we always pin the OS, it becomes very easy for the OS team to drift too far from a working state.
at least (was reading the code this morning).
You say "drift from a working state" there twice but it isn't clear to me - what would be some concrete examples? Big picture with this approach then, the state inputs to an install drop down to (installer, update payload) for master, which helps us get closer to the installer release state of simply (installer, ). That said we don't need to debate it in this PR - we can consider it later after this lands.
That looks like
The RHCOS team bumps the kernel, podman, and libc in the same week. At the end of the week, you submit a PR to the installer and then find out that that image no longer works. By the time you've figured out that it was the kernel bump, you've again updated podman. You submit another PR, and it breaks again, this time because there is some conflict between libc and podman. You then spend the rest of the month trying to get a working image without blocking the entire team. I've seen it happen back in the CoreOS days. While I don't agree with the overall testing approach (testing every PR to every component with a full integration test is pretty inefficient), this is currently how we are testing everything in OpenShift. We can revisit this once we have something stable and ship a product, but in the meantime, I'd rather we just stick with the status quo.
Feels like an OpenShift API server flake (openshift/origin#21612). /retest
With respect - I think what we're doing here is fairly different from that in numerous ways (OS is dedicated to exactly one clustering impl, OS is lifecycle bound with it, OS has more dedicated engineers, and probably the biggest: OS is tracking a more "stable" stream rather than "latest upstream", etc.)
I think we'd be constantly sending PRs - likely one a day. And the way I am thinking about things here, we would likely be testing individual component changes as well before we even submit the PR here. And if for some reason pinning turns out to be a problem, it'd clearly be easy to just go back to not pinning, right? Finally - if you still disagree, can you help think of an alternative path that gets us RHCOS builds integrated with the CI gating?
/test e2e-aws |
While I agree that RHCOS had better not have the wild instability CL had, I am on the side of master always floating. I think that pinning RHCOS is a particular problem for the pod and runtime team as well. If they make a change to the kubelet or CRI-O, I think we won't actually get any CI coverage for that change until it comes back through an RHCOS update. Am I wrong? Do we build and use per-PR RHCOS builds with a per-PR-updated runtime and kubelet? If we don't get testing of the runtime and kubelet until the RHCOS bump, I think the fear of 'drifting from working' is very real, and quite unacceptable.
We do not. The runtime packages are currently manually pushed to a repo by folks in that team which we automatically pull from during builds. Kubelet changes come through the origin package repository, though we are working towards getting the kubelet PRs on top of RHCOS builds for kubelet CI.
So we have two independent problems that are somewhat coupled:
The current CI structure for most components says “master of a component github is truth”. That means a merge builds an image which is pushed to an integration stream. All other components test from the integration stream. This means gating and break fixing are handled by merges to master.

The coreos team is not currently obeying the integration stream pattern, which is going to be a problem because it means you aren’t a part of the release image. So you need to fix that very soon or we’ve basically failed at release image (no one has made an argument to me why you are special and don’t have to be part of the release image, so I assume you’re working on that and this is a temporary stopgap). When you fix that, you get break reversion by pushing an older image. But you still need to implement gating before your built image gets pushed.

If the team is >2 weeks from being in the integration image stream, I’d like to know why. If this is a short term thing until you get there, I might be ok. If this is an attempt to not be part of the release payload, then this is the wrong direction.
This is next on my list, but I've been trying to get up to speed on hacking on the MCO. However, this is "crossing the threads" a bit - this issue is about the "bootimage", the initial RHCOS the installer uses.
OK so... how hard would it be for us to have the "bootimage" (i.e. AMI, libvirt qcow2 URL) embedded in the release payload too? I think I floated this once before and @wking was skeptical but... it's just a small amount of data we could stick in the release payload metadata; we don't even need to extract the release payload fully, just do a few HTTP requests like
We do have to be very careful and make sure we are prioritizing work.
I should note some of this is in our backlog and a few cards are prioritized. One of which is in this coming sprint. |
But still leave an override env var so that it can be overridden. Use the following env var to pin the image while building the binary:

```console
$ # export the release-image variable
$ export RELEASE_IMAGE=registry.redhat.io/openshift/origin-release:v4.0
$ # build the openshift-install binary
$ ./hack/build.sh
$ # distribute the binary to customers and run with the pinned image being picked up
$ ./bin/openshift-install create cluster...
```

The only way to override it would be to set the env var OPENSHIFT_INSTALL_RELEASE_IMAGE_OVERRIDE while running the openshift-install binary.
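The intended precedence can be sketched in shell (the variable names below are illustrative; the real lookup happens inside the installer's Go code, this only models the behavior):

```shell
# Sketch of the intended precedence: the runtime override env var wins,
# otherwise the value baked in at build time is used.
pinned_release_image="registry.redhat.io/openshift/origin-release:v4.0"  # baked in via hack/build.sh
effective_image="${OPENSHIFT_INSTALL_RELEASE_IMAGE_OVERRIDE:-$pinned_release_image}"
echo "effective release image: ${effective_image}"
```

With the override unset, the pinned image is used; exporting OPENSHIFT_INSTALL_RELEASE_IMAGE_OVERRIDE replaces it without rebuilding the binary.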
One environment variable to be used when building the pinned release binary: RHCOS_BUILD_NAME, a string representing the build name (or rather number, e.g. 47.48). If empty, the latest build will be picked up. This could just be hardcoded in the hack/build.sh script before cutting the release branch so that it stays baked in.
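One way hack/build.sh could bake the value in is via Go ldflags. A hedged sketch (the symbol path `github.com/openshift/installer/pkg/rhcos.buildName` is a hypothetical placeholder, not the actual variable in the repo; the sketch echoes the build command rather than running it so it stays self-contained):

```shell
# RHCOS_BUILD_NAME empty means "use the latest build"; a release branch
# would hardcode a value here, e.g. RHCOS_BUILD_NAME=47.48.
RHCOS_BUILD_NAME="${RHCOS_BUILD_NAME:-}"
# -X sets a string variable in the binary at link time (symbol path is illustrative).
LDFLAGS="-X github.com/openshift/installer/pkg/rhcos.buildName=${RHCOS_BUILD_NAME}"
echo go build -ldflags "${LDFLAGS}" -o bin/openshift-install ./cmd/openshift-install
```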
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: rajatchopra, wking.
/retest Please review the full test history for this PR and help us cut down flakes.
@rajatchopra: The following test failed, say
/retest
Moving related conversation to #987
Three environment variables have been introduced at build time that pin the exact OS image to be pulled.
OPENSHIFT_INSTALL_RHCOS_BASE_URL: the base URL from which to pick the image. Defaults to "https://releases-rhcos.svc.ci.openshift.org/storage/releases" (as it was before this PR).
OPENSHIFT_INSTALL_RHCOS_DEFAULT_CHANNEL: the channel where the image is available. Defaults to "maipo" (as it was before this PR).
OPENSHIFT_INSTALL_RHCOS_BUILD_NAME: a string representing the build name. If empty (the default), the latest build will be picked up. If the provided build name is wrong/unavailable, the installer will error out.
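Putting the three variables together, the fetch URL might be composed roughly as follows. This is a hedged sketch: the builds.json/meta.json path layout is an assumption about the release bucket, not something stated in this PR.

```shell
BASE_URL="${OPENSHIFT_INSTALL_RHCOS_BASE_URL:-https://releases-rhcos.svc.ci.openshift.org/storage/releases}"
CHANNEL="${OPENSHIFT_INSTALL_RHCOS_DEFAULT_CHANNEL:-maipo}"
BUILD_NAME="${OPENSHIFT_INSTALL_RHCOS_BUILD_NAME:-}"
if [ -z "${BUILD_NAME}" ]; then
  # No pin: resolve the latest build from the channel's build index (assumed file name).
  url="${BASE_URL}/${CHANNEL}/builds.json"
else
  # Pinned: fetch metadata for the named build directly; a fetch failure here is the
  # "wrong/unavailable build name" error case described above.
  url="${BASE_URL}/${CHANNEL}/${BUILD_NAME}/meta.json"
fi
echo "${url}"
```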
This PR is a follow up on #732