Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pin OS image for release builds #757

Merged
merged 2 commits into from
Dec 15, 2018

Conversation

rajatchopra
Copy link

@rajatchopra rajatchopra commented Nov 28, 2018

Three environment variables have been introduce for build time that will pin the exact OS image to be pulled.
OPENSHIFT_INSTALL_RHCOS_BASE_URL: points to the base url from where to pick the image from. Defaults to "https://releases-rhcos.svc.ci.openshift.org/storage/releases" (as it was before this PR).
OPENSHIFT_INSTALL_RHCOS_DEFAULT_CHANNEL: points to the channel where the image is available. Defaults to "maipo" (as it was before this PR).
OPENSHIFT_INSTALL_RHCOS_BUILD_NAME: a string representing the build name. If empty (the default is empty), then the latest one will be picked up. If the provided build name is wrong/unavailable, then the installer will error out.

This PR is a follow up on #732

@openshift-ci-robot openshift-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Nov 28, 2018
@rajatchopra rajatchopra changed the title Pin OS image for builds Pin OS image for release builds Nov 28, 2018
@wking
Copy link
Member

wking commented Nov 28, 2018

Looks like the first commit here overlaps with #732 (but has a different hash)?

@rajatchopra
Copy link
Author

Looks like the first commit here overlaps with #732?

Yes. They are changing the same place in the build script and I wasn't keen on chasing the conflicts. Could close #732 if this is looking alright. Or let it merge and the commit will resolve by itself.

@wking
Copy link
Member

wking commented Nov 28, 2018

Or let it merge and the commit will resolve by itself.

That works if your new commit here was on top of #732's e890b7b instead of on top of c5657fc. You can get that with:

$ git rebase --onto e890b7b HEAD^

@wking
Copy link
Member

wking commented Nov 29, 2018

I'd still like to understand how far we are from pivoting to the OS image which was built into the release image. Related discussion in #281.

pkg/rhcos/builds.go Outdated Show resolved Hide resolved
pkg/rhcos/builds.go Outdated Show resolved Hide resolved
@abhinavdahiya
Copy link
Contributor

  1. The envs are similar to what we use as options for installer. These are build envs and they shouldn't resemble user envs.
  2. I think we are going to be branching for each new release and want git history to should show what image we fixed to. we could add fix overrides in build script to fix to a specific version when we branch but that's no different to use doing that in code.

@wking
Copy link
Member

wking commented Nov 29, 2018

The envs are similar to what we use as options for installer. These are build envs and they shouldn't resemble user envs.

+1. Probably just drop the OPENSHIFT_INSTALL_ prefixes.

I think we are going to be branching for each new release and want git history to should show what image we fixed to. we could add fix overrides in build script to fix to a specific version when we branch but that's no different to use doing that in code.

Well, I think there is some benefit to collecting these together. It would be great to be able to commit some environment variables to hack/release.sh and build with those pins without having to poke around in the bowels of pkg/. But yeah, at the end of the day, these are going to be strings in the repo, so we should drop the env-var approach if it seems too invasive. Currently the env-var approach seems minimal enough for me to want it, but if it comes down to personal preference I'm fine being overridden ;).

pkg/rhcos/builds.go Outdated Show resolved Hide resolved
@rajatchopra
Copy link
Author

Looks like the first commit here overlaps with #732 (but has a different hash)?

I will just close #732 and deal with all the commits here.

+1. Probably just drop the OPENSHIFT_INSTALL_ prefixes

Done.

It would be great to be able to commit some environment variables to hack/release.sh

That is a wonderful idea. That is indeed the place it all should be in. env var to build script still has advantage of custom testing a release build easily.
So when we make a release just write/update the image url/buildName into the hack/release.sh

If we have a hard-coded buildName, I'd rather not bother hitting builds.json at all.

Could do it, though I think it helps to give a list of possible ones when the buildName does not match any.

I'm also fine bubbling that build argument up into AMI and QEMU

I say we should do it only if we really need to. On the other hand I would like if we could bubble down the default channel even.

@wking wking dismissed crawford’s stale review November 30, 2018 00:16

Both suggestions landed in 6521f80.

@wking
Copy link
Member

wking commented Nov 30, 2018

If we have a hard-coded buildName, I'd rather not bother hitting builds.json at all.

Could do it, though I think it helps to give a list of possible ones when the buildName does not match any.

We could do that later, on a best-effort case, if we need it. Something like this.

I'm also fine bubbling that build argument up into AMI and QEMU

I say we should do it only if we really need to. On the other hand I would like if we could bubble down the default channel even.

Well, 124df53 has the bubbled-up version if folks want to take a look ;). About bubbling down, that would simplify these APIs slightly, and folks could always supply their own libvirt image or AMI ID if they want to bypass our detection logic. Do we expect other folks to use this package and want more flexibility than "the latest maipo"? If we have any external consumers, they haven't told godocs about their package ;).

@cgwalters
Copy link
Member

One thing related to this - I think we should consider always pinning the RHCOS build even in git master - then the RHCOS team owns sending PRs to bump it.

Though we need to be aware in this model that e.g. if we want to get a change to kubelet landed it'd have to merge into origin and then a secondary PR here.

@ashcrow
Copy link
Member

ashcrow commented Nov 30, 2018

@cgwalters That seems reasonable to me.

@crawford
Copy link
Contributor

crawford commented Dec 1, 2018

@cgwalters @ashcrow We had decided that master would always float so that we don't allow sub-components to drift too far from a working state. If we always pin the OS, it becomes very easy for the OS team to drift too far from a working state.

@cgwalters
Copy link
Member

If we have any external consumers, they haven't told godocs about their package ;).

https://github.com/openshift/machine-config-operator/blob/77b0e7bc3c815fd23e4e220b780be2eda45eb250/pkg/operator/bootstrap.go#L10

at least (was reading the code this morning).

@cgwalters
Copy link
Member

@cgwalters @ashcrow We had decided that master would always float so that we don't allow sub-components to drift too far from a working state. If we always pin the OS, it becomes very easy for the OS team to drift too far from a working state.

You say "drift from a working state" there twice but isn't clear to me - what would be some concrete examples?

Big picture with this approach then, the state inputs to an install drop down to (installer, update payload) for master, which helps us get closer to the installer release state of simply (installer, ). That said we don't need to debate it in this PR - we can consider it later after this lands.

@wking
Copy link
Member

wking commented Dec 3, 2018

If we have any external consumers, they haven't told godocs about their package ;).

https://github.com/openshift/machine-config-operator/blob/77b0e7bc3c815fd23e4e220b780be2eda45eb250/pkg/operator/bootstrap.go#L10 at least (was reading the code this morning).

That looks like pkg/types, not pkg/rhcos?

@crawford
Copy link
Contributor

crawford commented Dec 3, 2018

You say "drift from a working state" there twice but isn't clear to me - what would be some concrete examples?

The RHCOS team bumps the kernel, podman, and libc in the same week. At the end of the week, you submit a PR to the installer and then find out that that image no longer works. By the time you've figured out that it was the kernel bump, you've again updated podman. You submit another PR, and it breaks again, this time because there is some conflict between libc and podman. You then spend the rest of the month trying to get a working image without blocking the entire team. I've seen it happen back in the CoreOS days.

While I don't agree with the overall testing approach (testing every PR to every component with a full integration test is pretty inefficient), this is currently how we are testing everything in OpenShift. We can revisit this once we have something stable and ship a product, but in the meantime, I'd rather we just stick with the status quo.

@openshift-ci-robot openshift-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 7, 2018
@openshift-ci-robot openshift-ci-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Dec 7, 2018
@wking
Copy link
Member

wking commented Dec 8, 2018

e2e-aws:

ERROR: logging before flag.Parse: E1207 23:53:16.846995      49 streamwatcher.go:109] Unable to decode an event from the watch stream: http2: server sent GOAWAY and closed the connection; LastStreamID=3, ErrCode=NO_ERROR, debug=""
level=warning msg="RetryWatcher - getting event failed! Re-creating the watcher. Last RV: 2212"
level=warning msg="Failed to connect events watcher: Get https://ci-op-jftfn5q4-1d3f3-api.origin-ci-int-aws.dev.rhcloud.com:6443/api/v1/namespaces/kube-system/events?resourceVersion=2212&watch=true: dial tcp 35.169.222.233:6443: connect: connection refused"
level=warning msg="Failed to connect events watcher: Get https://ci-op-jftfn5q4-1d3f3-api.origin-ci-int-aws.dev.rhcloud.com:6443/api/v1/namespaces/kube-system/events?resourceVersion=2212&watch=true: dial tcp 35.169.222.233:6443: connect: connection refused"
level=debug msg="added kube-scheduler.156e3220df892228: ip-10-0-2-93_3c8ab69b-fa7b-11e8-be6a-12e836c48fc4 became leader"
level=debug msg="added kube-controller-manager.156e322150b30a61: ip-10-0-2-93_3fcc31ad-fa7b-11e8-a16c-12e836c48fc4 became leader"
level=fatal msg="Error executing openshift-install: waiting for bootstrap-complete: timed out waiting for the condition"

Feels like an OpenShift API server flake (openshift/origin#21612).

/retest

@cgwalters
Copy link
Member

I've seen it happen back in the CoreOS days.

With respect - I think what we're doing here is fairly different from that in numerous ways (OS is dedicated to exactly one clustering impl, OS is lifecycle bound with it, OS has more dedicated engineers, and probably the biggest: OS is tracking a more "stable" stream rather than "latest upstream", etc.)

At the end of the week, you submit a PR to the installer and then find out that that image no longer works.

I think we'd be constantly sending PRs - likely one a day. And the way I am thinking about things here, we would likely be testing individual component changes as well before we even submit the PR here.

And if for some reason pinning turns out to be a problem, it'd clearly be easy to just go back to not pinning right?

Finally - if you still disagree, can you help think of an alternative path that gets us RHCOS builds integrated with the CI gating?

@ashcrow
Copy link
Member

ashcrow commented Dec 11, 2018

/test e2e-aws

@eparis
Copy link
Member

eparis commented Dec 12, 2018

While I agree that RHCOS better not have the wild instability the CL had, I am on the side of master always floating.

I think that pinning RHCOS is a particular problem for the pod and runtime team as well. If they make a change to the kubelet or CRIO, I think, we won't actually get an CI coverage for that change until it comes back through an RHCOS update. Am I wrong? do we build and use per-PR RHCOS builds with per PR updated runtime and kubelet?

If we don't get testing of runtime and kubelet until the RHCOS bump I think the fear of 'drifting from working' is very real, and quite unacceptable.

@ashcrow
Copy link
Member

ashcrow commented Dec 12, 2018

do we build and use per-PR RHCOS builds with per PR updated runtime and kubelet?

We do not. The runtime packages are currently manually pushed to a repo by folks in that team which we automatically pull from during builds. Kubelet changes come through the origin package repository, though we are working towards getting the kubelet PRs on top of RHCOS builds for kubelet CI.

@smarterclayton
Copy link
Contributor

So we have two independent problems that are somewhat coupled:

  1. Gating breaks
  2. Catching breaks early

The current CI structure for most components says “master of a component github is truth”. That means a merge builds an image which is pushed to an integration stream. All other components test from the integration stream. This means gating and break fixing are handled by merges to master.

The coreos team is not currently obeying the integration stream pattern, which is going to be a problem because it means you aren’t a part of the release image. So you need to fix that very soon or we’ve basically failed at release image (no one has made an arguemnt to me why you are special and don’t have to be part of the release image, so I assume you’re working on that and this is a temporary stopgap).

When you fix that, you get break reversion by pushing an older image. But you still need to implement gating before your built image gets pushed.

If the team is >2 weeks from being in the integration image stream, I’d like to know why. If this is a short term thing until you get there, I might be ok. If this is an attempt to not being part of the release payload, then this is the wrong direction.

@cgwalters
Copy link
Member

cgwalters commented Dec 12, 2018

If the team is >2 weeks from being in the integration image stream, I’d like to know why.

This is next on my list but I've been trying to get up to speed on hacking on the MCO since the way OSImageURL works today isn't how it should IMO, and that's a prereq for having the oscontainer in the update payload - openshift/machine-config-operator#183

However this is "crossing the threads" a bit - this issue is about the "bootimage", the initial RHCOS the installer uses.

If this is a short term thing until you get there, I might be ok. If this is an attempt to not being part of the release payload, then this is the wrong direction.

OK so...how hard would it be for us to have the "bootimage" (i.e. AMI, libvirt qcow2 URL) embedded in the release payload too? I think I floated this once before and @wking was skeptical but...it's just a small amount of data we could stick in the release payload metadata, we don't even need to extract the release payload fully, just do a few HTTP requests like skopeo inspect does.

@ashcrow
Copy link
Member

ashcrow commented Dec 12, 2018

We do have to be very careful and make sure we are prioritizing work. The coreos team is currently working off our backlog using the priority we are given. If working on a CI system is higher priority than other work then @imcleod and Hina would be good contacts. All accepted and prioritized work goes through one of them first.

@ashcrow
Copy link
Member

ashcrow commented Dec 12, 2018

I should note some of this is in our backlog and a few cards are prioritized. One of which is in this coming sprint.

Rajat Chopra added 2 commits December 14, 2018 12:17
But still leave an override env var so that it can be overridden. Use the following env var to pin the image while building the binary:
```
$ # export the release-image variable
$ export RELEASE_IMAGE=registry.redhat.io/openshift/origin-release:v4.0"
$ # build the openshift-install binary
$ ./hack/build.sh
$ # distribute the binary to customers and run with pinned image being picked up
$ ./bin/openshift-install create cluster...
```
The only way to override it would be to use an env var OPENSHIFT_INSTALL_RELEASE_IMAGE_OVERRIDE while running the openshift-install binary.
One environment variable to be used when building the pinned release binary:
    RHCOS_BUILD_NAME: a string representing the build name(or rather number e.g. 47.48). If empty, then the latest one will be picked up.
This could just be hardcoded in the hack/build.sh script before making the release branch so that it stays baked in.
@openshift-ci-robot openshift-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Dec 14, 2018
@wking
Copy link
Member

wking commented Dec 14, 2018

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Dec 14, 2018
@openshift-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: rajatchopra, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-ci-robot
Copy link
Contributor

openshift-ci-robot commented Dec 15, 2018

@rajatchopra: The following test failed, say /retest to rerun them all:

Test name Commit Details Rerun command
ci/prow/e2e-libvirt 6521f80 link /test e2e-libvirt

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@cgwalters
Copy link
Member

Moving related conversation to #987

@rajatchopra rajatchopra deleted the pin_os branch January 15, 2019 18:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.