the image for kube-apiserver v1.19.0-rc.2 contains a binary of unknown origin #1438

Closed
neolit123 opened this issue Jul 24, 2020 · 10 comments · Fixed by #1455
Assignees: justaugustus
Labels: area/release-eng, kind/bug, priority/critical-urgent, sig/release

Comments

@neolit123
Member

neolit123 commented Jul 24, 2020

What happened:

https://kubernetes.slack.com/archives/CJH2GBF7Y/p1595600184366100

@aanm deployed a v1.19.0-rc.2 kubeadm-based cluster and discovered that the kube-apiserver reports build SHA dd1511ca82c2e08847a1e4f712f4f1924f5babc8, which is unknown (not a commit in k/k).

What you expected to happen:

The binary inside the kube-apiserver image should match the Anago release commit:
kubernetes/kubernetes@27bb2a4

How to reproduce it (as minimally and precisely as possible):

$ kubeadm init --kubernetes-version=v1.19.0-rc.2
# wait for the api-server to come up...
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"19+", GitVersion:"v1.19.0-rc.2", GitCommit:"27bb2a4a0a5cb8330178d19e57fa61fffa895c98", GitTreeState:"clean", BuildDate:"2020-07-21T17:39:35Z", GoVersion:"go1.14.6", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"19+", GitVersion:"v1.19.0-rc.2", GitCommit:"dd1511ca82c2e08847a1e4f712f4f1924f5babc8", GitTreeState:"clean", BuildDate:"2020-07-21T16:25:15Z", GoVersion:"go1.14.6", Compiler:"gc", Platform:"linux/amd64"}

Notes

  • possibly the other control-plane images are affected too
  • the binary artifacts e.g. https://storage.googleapis.com/kubernetes-release/release/${RELEASE}/bin/linux/amd64/${BIN} have the correct commit SHA.
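
Both notes can be double-checked without a cluster. A minimal sketch, assuming a local k/k clone and that kube-apiserver's --version=raw flag prints the embedded version info:

$ git -C kubernetes cat-file -e dd1511ca82c2e08847a1e4f712f4f1924f5babc8   # fails: not a commit in k/k
$ wget https://storage.googleapis.com/kubernetes-release/release/v1.19.0-rc.2/bin/linux/amd64/kube-apiserver
$ chmod +x kube-apiserver
$ ./kube-apiserver --version=raw   # should report GitCommit:"27bb2a4a0a5cb8330178d19e57fa61fffa895c98"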

/kind bug

@neolit123 added the area/release-eng, kind/bug, and sig/release labels on Jul 24, 2020
@liggitt
Member

liggitt commented Jul 25, 2020

/priority critical-urgent

can we capture the rc2 build/release logs before they expire?

@kubernetes/release-engineering

@k8s-ci-robot added the priority/critical-urgent label and removed the needs-priority label on Jul 25, 2020
@hasheddan
Contributor

Links to logs for each run are here: kubernetes/sig-release#1156

@BenTheElder
Member

#1428 probably deserves a second look ... (or perhaps a first, from new eyes).

@puerco
Member

puerco commented Jul 25, 2020

Tim merged that one so anago could finish transitioning the artifacts from stage to release; everything was already built by that point. If this was caused by a hack, it should be one of the earlier ones (there were several that day).

@saschagrunert
Member

The wrong commit is the release commit from the mock staging bucket. If we look into the (mock) staged sources, we can see:

> wget https://storage.googleapis.com/kubernetes-release-gcb/stage/v1.19.0-rc.1.121+223a4e974f77c5/src.tar.gz
> tar xf src.tar.gz
> cd src/k8s.io/kubernetes/
> git show dd1511ca82c2e08847a1e4f712f4f1924f5babc8
commit dd1511ca82c2e08847a1e4f712f4f1924f5babc8 (tag: v1.19.0-rc.2)
Author: Anago GCB <[email protected]>
Date:   Tue Jul 21 16:24:14 2020 +0000

    Release commit for Kubernetes v1.19.0-rc.2

To me it looks like we pushed the wrong container image into the registry. The image built during the nomock stage contains the right git commit SHA (verified locally). If I re-pull k8s.gcr.io/kube-apiserver-amd64:v1.19.0-rc.2, it really is the image from the mock stage bucket…
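
A quick way to confirm that without standing up etcd, assuming the --version=raw flag behaves the same inside the image (a sketch):

> sudo podman run --rm k8s.gcr.io/kube-apiserver-amd64:v1.19.0-rc.2 kube-apiserver --version=raw
# reports GitCommit:"dd1511ca82c2e08847a1e4f712f4f1924f5babc8", i.e. the mock stage release commit rather than 27bb2a4a0a5cb8330178d19e57fa61fffa895c98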

@puerco
Member

puerco commented Jul 25, 2020

OK, then that means the wrong image somehow got referenced in the promoter manifest, right? The digest in the manifest matches the image in the staging registry. Let's investigate one step further back.
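
For reference, that kind of comparison can be scripted with a registry tool such as crane or skopeo (a sketch; the exact staging image name is an assumption):

> crane digest gcr.io/k8s-staging-kubernetes/kube-apiserver-amd64:v1.19.0-rc.2
> crane digest k8s.gcr.io/kube-apiserver-amd64:v1.19.0-rc.2
# matching digests (and a matching entry in the promoter manifest) mean the promotion copied exactly what was staged, so the problem must sit earlier, in what got staged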

@tpepper
Member

tpepper commented Jul 28, 2020

Looking at the x86_64 Linux binaries in diffoscope and comparing them with a local (stripped) build from the v1.19.0-rc.2 tag, they look functionally equivalent. This indirectly affirms the suspicion that the repeated re-build-and-releases had local, non-pushed changelog commits and thus different commit IDs.
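
A minimal sketch of how such a comparison can be reproduced (the make target and output path are the usual k/k ones, and kube-apiserver.released stands for the binary downloaded from the release bucket mentioned above):

> git clone https://github.com/kubernetes/kubernetes && cd kubernetes
> git checkout v1.19.0-rc.2
> make WHAT=cmd/kube-apiserver                                # local linux/amd64 build
> strip -o kube-apiserver.local _output/bin/kube-apiserver    # strip to match the released binary
> diffoscope kube-apiserver.local kube-apiserver.released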

@saschagrunert
Member

saschagrunert commented Jul 29, 2020

Looks like we have the same issue with rc.3:

> etcd &
> sudo podman run -it --net=host k8s.gcr.io/kube-apiserver-amd64:v1.19.0-rc.3 kube-apiserver --etcd-servers http://127.0.0.1:2379

then:

> curl http://localhost:8080/version
{
  "major": "1",
  "minor": "19+",
  "gitVersion": "v1.19.0-rc.3",
  "gitCommit": "aaf86f7c3a07cd29fa306370b50950f950d0f64d",
  "gitTreeState": "clean",
  "buildDate": "2020-07-29T07:28:27Z",
  "goVersion": "go1.15rc1",
  "compiler": "gc",
  "platform": "linux/amd64"
}

aaf86f7c3a07cd29fa306370b50950f950d0f64d is the commit from the mock stage image build. This can be verified by looking at the staged sources:

> wget https://storage.googleapis.com/kubernetes-release-gcb/stage/v1.19.0-rc.2.125+182a67fa7bef52/src.tar.gz
> tar xf src.tar.gz
> cd src/k8s.io/kubernetes/
> git show aaf86f7c3a07cd29fa306370b50950f950d0f64d
commit aaf86f7c3a07cd29fa306370b50950f950d0f64d (tag: v1.19.0-rc.3)
Author: Anago GCB <[email protected]>
Date:   Wed Jul 29 07:27:28 2020 +0000

    Release commit for Kubernetes v1.19.0-rc.3

Now, let's look at the nomock stage container images: kube-apiserver.tar

> sudo podman load -i stage_v1.19.0-rc.2.125+182a67fa7bef52_v1.19.0-rc.3_release-images_amd64_kube-apiserver.tar 
> etcd &
> sudo podman run -it --net=host k8s.gcr.io/kube-apiserver-amd64:v1.19.0-rc.3 kube-apiserver --etcd-servers http://127.0.0.1:2379

then:

> curl http://localhost:8080/version
{
  "major": "1",
  "minor": "19+",
  "gitVersion": "v1.19.0-rc.3",
  "gitCommit": "9ee7e7c2c15d6148abcbef79276c67230100de14",
  "gitTreeState": "clean",
  "buildDate": "2020-07-29T08:36:49Z",
  "goVersion": "go1.15rc1",
  "compiler": "gc",
  "platform": "linux/amd64"
}

9ee7e7c2c15d6148abcbef79276c67230100de14 is the right commit.


TL;DR: I still think we push/promote the images from the mock stage rather than those from the nomock stage, because the tarballs in the buckets are definitely correct.

@justaugustus
Member

Catching up from some OOO days and the threads in Slack (https://kubernetes.slack.com/archives/CJH2GBF7Y/p1596014847435000?thread_ts=1596006315.427000&cid=CJH2GBF7Y, https://kubernetes.slack.com/archives/CJH2GBF7Y/p1596217066459900)...

I'll summarize what happened and what I believe should be next steps.

On 7/20, the vanity domain flip (VDF) was kicked off, which changed the backing registries for k8s.gcr.io from gcr.io/google-containers to {asia,eu,us}.gcr.io/k8s-artifacts-prod. We made an accompanying change to the Release Engineering tooling to allow us to push images to the new staging repo for core images (gcr.io/k8s-staging-kubernetes): #1230

The release process runs on the kubernetes-release-test project, which is a Google-owned GCP project that Kubernetes Release Managers (@kubernetes/release-managers) have access to.

The GCB service account for this project has direct access to push images to the following registries:

  • staging-k8s.gcr.io (old staging)
  • gcr.io/google-containers (old prod)

Given the promotion process for community-owned images, we do not have the same ability to push container images directly to new production (nor do we need it). To get around this, we skip the image push on official releases and instead validate that the images have been promoted. You can see that change in #1199.

The trouble with this is that the timing is off.
The current images (rc.2, rc.3) are being built from the mock stage assets, which contain a release commit that never gets pushed to GitHub. That's the reason for the commit difference, though the binaries should be essentially equivalent (since we build against explicit build IDs in an official stage/release).
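
This is visible directly in the staged sources inspected above: the offending commit is authored by the Anago GCB bot, and the v1.19.0-rc.2 tag that actually exists on GitHub points elsewhere (a sketch, reusing the src.tar.gz checkout from the earlier comment):

> git -C src/k8s.io/kubernetes log -1 --format='%H %an' v1.19.0-rc.2
dd1511ca82c2e08847a1e4f712f4f1924f5babc8 Anago GCB
> git ls-remote https://github.com/kubernetes/kubernetes 'refs/tags/v1.19.0-rc.2*'
# the upstream tag resolves to the real release commit (27bb2a4a0a5cb8330178d19e57fa61fffa895c98), not to dd1511ca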

Here's what I think needs to happen:

Mock

Mock stage

Build and dry-run push to gcr.io/k8s-staging-kubernetes. If dry runs are not possible for container pushes, push instead to the old staging repo or to a subdirectory of gcr.io/k8s-staging-kubernetes, like gcr.io/k8s-staging-kubernetes/mock.

Mock release

Validate image manifests from the mock push location as one of the prerequisites.

Official

Official stage

Build and push to gcr.io/k8s-staging-kubernetes.
(This means the images are now built against the real release commit instead of the mock one.)

Image promotion

After the official stage is complete:

  • Generate a PR to handle the image promotion
  • Ensure the images have been promoted to k8s-artifacts-prod endpoints BEFORE starting the official release

Official release

Validate image manifests have been promoted as one of the prerequisites.
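
A minimal sketch of what that prerequisite check could look like (a hypothetical helper loop, not the actual anago/krel code; crane is used here only as a convenient digest tool, and the staging image names are assumptions):

VERSION=v1.19.0-rc.3
for img in kube-apiserver kube-controller-manager kube-scheduler kube-proxy; do
  # the digest of the image we staged...
  staged="$(crane digest "gcr.io/k8s-staging-kubernetes/${img}-amd64:${VERSION}")"
  # ...must already be live behind the production endpoint before the official release starts
  prod="$(crane digest "k8s.gcr.io/${img}-amd64:${VERSION}")"
  if [ "${staged}" != "${prod}" ]; then
    echo "ERROR: ${img} has not been promoted (or digests differ)" >&2
    exit 1
  fi
done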

I'll take a look at this, since I've touched the code paths around container images most recently.

/assign

@justaugustus
Member

Opened a fix in #1455.
