
OpenShift + Pipeline on GCP is broken #1742

Closed
vdemeester opened this issue Dec 12, 2019 · 11 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@vdemeester
Member

Expected Behavior

Deploying Tekton Pipeline on an OpenShift cluster running in GCP should work.

Actual Behavior

With an OCP 4.2 cluster installed on GCP and the Red Hat OpenShift Pipelines Operator 0.8.0, I see that creating a runtime object (TaskRun or PipelineRun) does not create any resources such as pods. Checking the pipeline controller log shows nothing; the controller is actually looping forever.

Quoting @bbrowning

I tracked this down some today and the problem is an infinite retry loop in the k8schain library used by Knative, Tekton, and various other projects. For some reason, in OpenShift on GCP, this library cannot contact the expected Google metadata server.

The main issue is with the "k8s.io/kubernetes/pkg/credentialprovider/gcp" import, and what is magically happening there, especially here. This metadata URL is being disallowed by OpenShift and thus this loops forever (with backoff, but still).

Steps to Reproduce the Problem

  1. Install OpenShift on GCP
  2. Install tekton on it (using the OpenShift Pipelines operator or directly applying the release yaml)
  3. Observe that the controller logs nothing and the expected resources (pods) are never created.

Additional Info

One easy way to fix it would be to put the following magic imports behind build tags (upstream in go-containerregistry):

	_ "k8s.io/kubernetes/pkg/credentialprovider/aws"
	_ "k8s.io/kubernetes/pkg/credentialprovider/azure"
	_ "k8s.io/kubernetes/pkg/credentialprovider/gcp"

/assign
/kind bug

@tekton-robot tekton-robot added the kind/bug Categorizes issue or PR as related to a bug. label Dec 12, 2019
@imjasonh
Member

To summarize, so that I'm sure I understand the issue:

  1. k8schain uses the credentialprovider/gcp magic import to fetch GCP creds from GCP metadata, in order to fetch image metadata (only from GCP?) to inject Tekton's entrypoint binary.

  2. OpenShift-on-GCP blocks GCP metadata requests, so when k8schain is used it fails continuously, and doesn't surface that failure or fall back to anonymous.

Is that correct?

One solution would be to not use k8schain, but I'm not sure what we'd use instead to get the necessary credentials to fetch image data. Using k8schain without the magic imports is also possible (as suggested in the above bug report), but this would presumably break auth for users who rely on it today.
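
For reference, a minimal sketch of how a controller typically uses k8schain together with go-containerregistry to fetch image config (the namespace, service account and image reference are illustrative, not Tekton's actual code, and the exact k8schain.New signature has varied across versions):

	package main

	import (
		"fmt"
		"log"

		"github.com/google/go-containerregistry/pkg/authn/k8schain"
		"github.com/google/go-containerregistry/pkg/name"
		"github.com/google/go-containerregistry/pkg/v1/remote"
		"k8s.io/client-go/kubernetes"
		"k8s.io/client-go/rest"
	)

	func main() {
		cfg, err := rest.InClusterConfig()
		if err != nil {
			log.Fatal(err)
		}
		client, err := kubernetes.NewForConfig(cfg)
		if err != nil {
			log.Fatal(err)
		}

		// Importing k8schain transitively pulls in the credentialprovider
		// magic imports; their init() functions run before any of this code.
		kc, err := k8schain.New(client, k8schain.Options{
			Namespace:          "default",
			ServiceAccountName: "default",
		})
		if err != nil {
			log.Fatal(err)
		}

		// Resolve the image config, e.g. to find its entrypoint.
		ref, err := name.ParseReference("gcr.io/example/image:latest")
		if err != nil {
			log.Fatal(err)
		}
		img, err := remote.Image(ref, remote.WithAuthFromKeychain(kc))
		if err != nil {
			log.Fatal(err)
		}
		cf, err := img.ConfigFile()
		if err != nil {
			log.Fatal(err)
		}
		fmt.Println(cf.Config.Entrypoint)
	}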

cc @jonjohnsonjr

@vdemeester
Member Author

@imjasonh I think this is true even without the entrypoint magic, as k8schain is also used in knative/pkg, which we depend on for the controller.

Created google/go-containerregistry#630 upstream 👼

@vdemeester
Member Author

2. OpenShift-on-GCP blocks GCP metadata requests, so when `k8schain` is used it fails continuously, and doesn't surface that failure or fall back to anonymous.

Yes, https://github.com/kubernetes/kubernetes/blob/master/pkg/credentialprovider/gcp/metadata.go#L239 blocks (as it loops forever with backoff), and thus the rest of the code never gets executed (which means, for the controller, that it is never ready to reconcile anything 😅 ).
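
Roughly, the failure mode looks like this: a simplified, illustrative sketch (not the actual kubernetes code) of an init-time check that polls the metadata endpoint with endless backoff:

	package main

	import (
		"log"
		"net/http"
		"time"
	)

	// waitForMetadata models the problematic behaviour: retry the metadata
	// server forever. If the address is blocked (as on OpenShift-on-GCP),
	// this never returns, and anything scheduled to run afterwards (such as
	// the controller's reconcile loop) never starts.
	func waitForMetadata(url string) {
		backoff := time.Second
		for {
			resp, err := http.Get(url)
			if err == nil {
				resp.Body.Close()
				if resp.StatusCode == http.StatusOK {
					return // reachable: the provider can be enabled
				}
			}
			log.Printf("metadata not reachable, retrying in %v", backoff)
			time.Sleep(backoff)
			if backoff < 30*time.Second {
				backoff *= 2 // capped exponential backoff, but it never gives up
			}
		}
	}

	func main() {
		// 169.254.169.254 is the GCE metadata address.
		waitForMetadata("http://169.254.169.254/computeMetadata/v1/")
		log.Println("only reached when the metadata server is reachable")
	}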

@imjasonh
Member

Ah okay, so just importing the magic import causes the controller to block forever, when installed on OpenShift-on-GCP. Is that correct? These magic imports seem like more trouble than they're worth to be honest. 👿

Did this work until recently? AFAIK we've had an indirect dependency on the magic imports for quite a while.

@vdemeester
Member Author

Ah okay, so just importing the magic import causes the controller to block forever, when installed on OpenShift-on-GCP. Is that correct? These magic imports seem like more trouble than they're worth to be honest. 👿

Did this work until recently? AFAIK we've had an indirect dependency on the magic imports for quite a while.

We only tried that recently on GCP so… I am guessing it never worked before. It is the same for Knative, by the way. Yeah, I am really not a huge fan of magic imports and the use of init() in those credential packages. Having, at least, a way to disable those is a "best-effort" fix for now, I think 👼

@vdemeester
Member Author

vdemeester commented Dec 12, 2019

I can confirm that with master...vdemeester:k8schain-quick-fix and GOFLAGS="-tags=disable_gcp" … it works 👼

~/s/g/t/p/e/taskruns k8schain-quick-fix *2 λ kubectl create -f git-resource.yaml
taskrun.tekton.dev/git-resource-tag-dg5kt created
taskrun.tekton.dev/git-resource-branch-8hvqq created
taskrun.tekton.dev/git-resource-ref-vxscg created
~/s/g/t/p/e/taskruns k8schain-quick-fix *2 λ kubectl get pods
NAME                                  READY   STATUS     RESTARTS   AGE
git-resource-branch-8hvqq-pod-hq2h9   0/2     Init:0/3   0          3s
git-resource-ref-vxscg-pod-flmqj      0/2     Init:0/3   0          3s
git-resource-tag-dg5kt-pod-zcl9j      0/2     Init:0/3   0          3s

@bbrowning

@imjasonh This is not technically OpenShift-specific either. We've had reports in Knative of other managed Kubernetes services on GCP hitting this same issue. Basically, anyone who can hit that metadata URL can gain credentials that a random user on a K8s cluster shouldn't necessarily be able to get. That's why OpenShift and other managed K8s distros block that metadata URL from pods in the cluster unless the pods are running with host networking.
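
To illustrate why that URL gets blocked: any process that can reach the GCE metadata server can request an access token for the node's service account, roughly like this (an illustrative snippet using the documented metadata endpoint and header):

	package main

	import (
		"fmt"
		"io/ioutil"
		"log"
		"net/http"
	)

	func main() {
		// Documented GCE metadata endpoint for the default service account's
		// access token; the Metadata-Flavor header is required.
		url := "http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/token"
		req, err := http.NewRequest("GET", url, nil)
		if err != nil {
			log.Fatal(err)
		}
		req.Header.Set("Metadata-Flavor", "Google")

		resp, err := http.DefaultClient.Do(req)
		if err != nil {
			// On OpenShift-on-GCP this request is blocked for ordinary pods.
			log.Fatal(err)
		}
		defer resp.Body.Close()

		body, err := ioutil.ReadAll(resp.Body)
		if err != nil {
			log.Fatal(err)
		}
		fmt.Println(string(body)) // JSON containing an OAuth2 access token
	}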

@vdemeester
Member Author

Upstream issue: kubernetes/kubernetes#86245

@vdemeester
Member Author

This can be considered complete as #1882 has been merged, so there is just a build tag to set and we are good to go.
@bobcatfish should we close?

@vdemeester
Member Author

As we are tracking that downstream and the required bump of go-containerregistry is in, I'll go ahead and close this one.

/close

@tekton-robot
Collaborator

@vdemeester: Closing this issue.

In response to this:

As we are tracking that downstream and the required bump of go-containerregistry is in, I'll go ahead and close this one.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
