e2e TestGitPipelineRun test fails on newly installed pipeline #3627

Closed
barthy1 opened this issue Dec 10, 2020 · 22 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness.

Comments

@barthy1
Member

barthy1 commented Dec 10, 2020

Expected Behavior

The TestGitPipelineRun test should pass.

Actual Behavior

The TestGitPipelineRun test fails because of a timeout. The corresponding taskrun is created, but no pod is started.
Usually the second attempt is successful.

# oc get tr -A
NAMESPACE         NAME                                     SUCCEEDED   REASON   STARTTIME   COMPLETIONTIME
arendelle-8nxtw   git-check-pipeline-run-git-check-72mth

# oc get pod -n arendelle-8nxtw
No resources found in arendelle-8nxtw namespace.

Steps to Reproduce the Problem

  1. Install a k8s cluster using kind
  2. Install the nightly pipeline build
kubectl apply --filename https://storage.googleapis.com/tekton-releases-nightly/pipeline/latest/release.yaml
  3. Run the test just after the pipeline is installed
# go test -v -tags=e2e -count=1 ./test -run ^TestGitPipelineRun/tekton_@_master_with_PR_refspec
  4. Wait 10 minutes and get the error
# go test -v -tags=e2e -count=1 ./test -run ^TestGitPipelineRun/tekton_@_master_with_PR_refspec
Using kubeconfig at `/root/.kube/config` with cluster ``
Cluster information
- Version: v1.19.1
- BuildDate: 2020-09-14T07:30:52Z
- GoVersion: go1.15
- Platform: linux/amd64
=== RUN   TestGitPipelineRun
=== RUN   TestGitPipelineRun/tekton_@_master_with_PR_refspec
=== PAUSE TestGitPipelineRun/tekton_@_master_with_PR_refspec
=== CONT  TestGitPipelineRun/tekton_@_master_with_PR_refspec
    init_test.go:147: Create namespace arendelle-8nxtw to deploy to
    init_test.go:163: Verify SA "default" is created in namespace "arendelle-8nxtw"
    git_checkout_test.go:103: Creating Git PipelineResource git-source-resource
    git_checkout_test.go:119: Creating PipelineRun git-check-pipeline-run
panic: test timed out after 10m0s
...
FAIL	github.com/tektoncd/pipeline/test	600.088s
FAIL

Additional Info

  • Kubernetes version:
# kubectl version
Client Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.0", GitCommit:"af46c47ce925f4c4ad5cc8d1fca46c7b77d13b38", GitTreeState:"clean", BuildDate:"2020-12-08T17:59:43Z", GoVersion:"go1.15.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.1", GitCommit:"206bcadf021e76c27513500ca24182692aabd17e", GitTreeState:"clean", BuildDate:"2020-09-14T07:30:52Z", GoVersion:"go1.15", Compiler:"gc", Platform:"linux/amd64"}
  • Tekton Pipeline version:
# tkn version
Client version: 0.11.0
Pipeline version: v20201210-d58a1ae68f
Triggers version: unknown

/kind bug

@barthy1 barthy1 added the kind/bug Categorizes issue or PR as related to a bug. label Dec 10, 2020
@nikhil-thomas
Member

Does kubectl describe taskrun git-check-pipeline-run-git-check-72mth -n arendelle-8nxtw give any additional information?

@chmouel
Member

chmouel commented Dec 14, 2020

You probably want to set the TEST_KEEP_NAMESPACES environment variable to be able to inspect the logs of the failed container:

pipeline/test/init_test.go

Lines 111 to 116 in d702ca1

if os.Getenv("TEST_KEEP_NAMESPACES") == "" && !t.Failed() {
	t.Logf("Deleting namespace %s", namespace)
	if err := cs.KubeClient.Kube.CoreV1().Namespaces().Delete(ctx, namespace, metav1.DeleteOptions{}); err != nil {
		t.Errorf("Failed to delete namespace %s: %s", namespace, err)
	}
}
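
For example, reusing the same test invocation from the reproduction steps above, any non-empty value keeps the test namespace around for inspection (the value 1 here is an arbitrary choice):

# TEST_KEEP_NAMESPACES=1 go test -v -tags=e2e -count=1 ./test -run ^TestGitPipelineRun/tekton_@_master_with_PR_refspec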

@barthy1
Member Author

barthy1 commented Dec 14, 2020

@nikhil-thomas I don't have the taskrun mentioned in this issue any more. Providing describe output for a new taskrun below; from my point of view there is no additional info there.

# oc describe taskrun git-check-pipeline-run-git-check-7m7kw -n arendelle-lf2x8
Name:         git-check-pipeline-run-git-check-7m7kw
Namespace:    arendelle-lf2x8
Labels:       app.kubernetes.io/managed-by=tekton-pipelines
              tekton.dev/pipeline=git-check-pipeline-run
              tekton.dev/pipelineRun=git-check-pipeline-run
              tekton.dev/pipelineTask=git-check
Annotations:  <none>
API Version:  tekton.dev/v1beta1
Kind:         TaskRun
Metadata:
  Creation Timestamp:  2020-12-14T09:40:42Z
  Generation:          1
  Managed Fields:
    API Version:  tekton.dev/v1beta1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:labels:
          .:
          f:tekton.dev/pipeline:
          f:tekton.dev/pipelineRun:
          f:tekton.dev/pipelineTask:
        f:ownerReferences:
          .:
          k:{"uid":"5a46cb1b-7c3e-4c58-9c0a-68fe9bd4942f"}:
            .:
            f:apiVersion:
            f:blockOwnerDeletion:
            f:controller:
            f:kind:
            f:name:
            f:uid:
      f:spec:
        .:
        f:resources:
          .:
          f:inputs:
        f:serviceAccountName:
        f:taskSpec:
          .:
          f:resources:
            .:
            f:inputs:
          f:steps:
        f:timeout:
      f:status:
        .:
        f:podName:
    Manager:    controller
    Operation:  Update
    Time:       2020-12-14T09:40:42Z
  Owner References:
    API Version:           tekton.dev/v1beta1
    Block Owner Deletion:  true
    Controller:            true
    Kind:                  PipelineRun
    Name:                  git-check-pipeline-run
    UID:                   5a46cb1b-7c3e-4c58-9c0a-68fe9bd4942f
  Resource Version:        2042242
  Self Link:               /apis/tekton.dev/v1beta1/namespaces/arendelle-lf2x8/taskruns/git-check-pipeline-run-git-check-7m7kw
  UID:                     99d06efb-23d4-4730-8958-0c7692d81135
Spec:
  Resources:
    Inputs:
      Name:  gitsource
      Resource Ref:
        Name:            git-source-resource
  Service Account Name:  default
  Task Spec:
    Resources:
      Inputs:
        Name:  gitsource
        Type:  git
    Steps:
      Args:
        --git-dir=/workspace/gitsource/.git
        show
      Image:  alpine/git
      Name:
      Resources:
  Timeout:  1h0m0s
Events:
  Type    Reason   Age   From     Message
  ----    ------   ----  ----     -------
  Normal  Started  12m   TaskRun

@chmouel no pods are actually created, so I cannot inspect the logs.

@barthy1
Member Author

barthy1 commented Dec 14, 2020

Update from my side: 18 minutes after starting the test, the taskrun finished successfully, and it took just several seconds. But obviously the test failed because of the 10-minute waiting timeout. So I observe some kind of delay in taskrun execution the first time I start it.

@afrittoli
Member

Update from my side: 18 minutes after starting the test, the taskrun finished successfully, and it took just several seconds. But obviously the test failed because of the 10-minute waiting timeout. So I observe some kind of delay in taskrun execution the first time I start it.

Do you wait for all pods to be ready in the tekton-pipelines namespace before starting to run the tests?
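
For reference, one way to block until every pod in the tekton-pipelines namespace reports Ready before kicking off the tests is a kubectl wait along these lines (a sketch; the 120s timeout is an arbitrary choice):

# kubectl wait --namespace tekton-pipelines --for=condition=Ready pods --all --timeout=120s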

@afrittoli
Member

Update from my side: 18 minutes after starting the test, the taskrun finished successfully, and it took just several seconds. But obviously the test failed because of the 10-minute waiting timeout. So I observe some kind of delay in taskrun execution the first time I start it.

Do you wait for all pods to be ready in the tekton-pipelines namespace before starting to run the tests?

Looking at the task that installs Tekton, it looks like you do. I wonder if there is something else causing the delay... it would be useful if you could share the controller logs.

@barthy1
Member Author

barthy1 commented Feb 17, 2021

Do you wait for all pods to be ready in the tekton-pipelines namespace before starting to run the tests?

Yes.

Looking at the task that installs Tekton it looks like you do, I wonder if there is something else causing the delay... it would be useful if you could share the controller logs?

Sure, I will provide the logs when I do a clean setup.

@barthy1
Member Author

barthy1 commented Feb 17, 2021

The logs from the controller, from startup until the test started to execute:
https://gist.github.com/barthy1/58e7fe5bc5dd67335c0c24a93894ae83#file-tekton_controller_logs

@tekton-robot
Collaborator

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale with a justification.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle stale

Send feedback to tektoncd/plumbing.

@tekton-robot tekton-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 18, 2021
@tekton-robot
Collaborator

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten with a justification.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle rotten

Send feedback to tektoncd/plumbing.

@tekton-robot tekton-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jun 17, 2021
@ghost

ghost commented Jun 29, 2021

@barthy1 does this remain an issue in more recent e2e test runs?

@barthy1
Member Author

barthy1 commented Jul 2, 2021

@sbwsg

does this remain an issue in more recent e2e test runs?

Yes, I see the same behaviour with the latest main code

@ghost

ghost commented Jul 2, 2021

There's a very sudden jump in the logs from 17:20 to 17:37 where nothing appears to be happening, and then an error is reported related to aws credentials:

{"level":"info","ts":"2021-02-17T17:20:12.695Z","logger":"tekton-pipelines-controller","caller":"controller/controller.go:530","msg":"Reconcile succeeded","commit":"736e98e","knative.dev/controller":"github.aaakk.us.kg.tektoncd.pipeline.pkg.reconciler.pipelinerun.Reconciler","knative.dev/kind":"tekton.dev.PipelineRun","knative.dev/traceid":"eb0c2a9b-396c-481d-b7e5-1207486aea80","knative.dev/key":"arendelle-2rtch/task-run-pipeline-run-cancel-retries-1-pdnjuwjf","duration":0.000033238}
E0217 17:37:55.174255       1 aws_credentials.go:77] while getting AWS credentials NoCredentialProviders: no valid providers in chain. Deprecated.
	For verbose messaging see aws.Config.CredentialsChainVerboseErrors

That error comes from k8schain (more accurately from the k8s-pkg-credentialprovider package), so I believe that this process is stuck waiting for the k8schain package to do ... something (maybe a timeout initializing AWS creds? But where did the 17-minute delay come from?). I'm just not sure what is happening yet. I've been seeing a similar problem locally in kind while working on a separate issue. For me locally the delay is exactly 10 minutes.

Pulling in some more folks here who I mentioned the delay to yesterday on Slack. fyi @vdemeester and @imjasonh .

@vdemeester
Member

This sounds very similar to the GCP error we have with OpenShift (where k8s-pkg-credentialprovider queries a magic URL and waits forever for an answer, completely locking the controller).
The fork exists only to remove the dependency on k8s.io/kubernetes (which brings way too much into the dependency graph), and it is used by google/go-containerregistry.

Something might have changed in a recent version of this package and/or on the AWS service side 🤔

@ghost

ghost commented Jul 2, 2021

Using the runtime/trace package I'm able to isolate 8 calls to http.dialConnFor, each about 75s long, that all look like they're spawned from the aws-sdk-go package. So at least in my case I think this accounts for the ten-minute delay:

0.000342165	 .         	198	task k8schain (id 1, parent 0) created
0.000345207	 .     3042	198	namespace=default
0.000346755	 .     1548	198	saName=default
0.000348249	 .     1494	198	region k8schain.New started (duration: 10m2.20341841s)
0.005701440	 .  5353191	198	new goroutine 260: github.com/aws/aws-sdk-go/internal/sync/singleflight.(*Group).doCall
0.005805444	 .   104004	260	new goroutine 261: net/http.(*Transport).dialConnFor
75.329407376	75.323601932	260	new goroutine 464: net/http.(*Transport).dialConnFor
150.583420726	75.254013350	260	new goroutine 785: net/http.(*Transport).dialConnFor
225.953546323	75.370125597	260	new goroutine 991: net/http.(*Transport).dialConnFor
301.129792264	75.176245941	260	new goroutine 1258: net/http.(*Transport).dialConnFor
376.398534187	75.268741923	260	new goroutine 1497: net/http.(*Transport).dialConnFor
451.715118090	75.316583903	260	new goroutine 1809: net/http.(*Transport).dialConnFor
527.040350774	75.325232684	260	new goroutine 2069: net/http.(*Transport).dialConnFor
602.203773864	75.163423090	198	total-k8schain-time=10m2.203418063s
602.203775323	 .     1459	198	task end
GC:22.952305ms
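
Assuming the trace above was captured to a local file (for example by wrapping the k8schain.New call in a runtime/trace task/region and surrounding the run with trace.Start/trace.Stop, which is an assumption about how it was collected), the user-defined tasks and the long network dials can be browsed with the standard tooling; the file name trace.out is a placeholder:

# go tool trace trace.out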

@ghost

ghost commented Jul 2, 2021

Searching around for similar issues, it does sound like the AWS SDK introduced a security feature that has caused lots of difficult-to-debug delays for other projects.

Here's one example with a good explanation of what's going on: zalando-incubator/kube-ingress-aws-controller#309 (comment)

Edit: The problem I'm facing now is that all of this AWS stuff is happening way down deep in our dependency tree. Not sure yet how to mitigate or work around it.

@ghost

ghost commented Jul 2, 2021

One workaround that appears to speed things up is to set nonsense AWS credentials in environment variables. I've updated the pipelines controller deployment with the following change:

λ git diff
diff --git a/config/controller.yaml b/config/controller.yaml
index d07084c5d..728776b4c 100644
--- a/config/controller.yaml
+++ b/config/controller.yaml
@@ -89,6 +89,12 @@ spec:
         - name: config-registry-cert
           mountPath: /etc/config-registry-cert
         env:
+        - name: AWS_ACCESS_KEY_ID
+          value: foobarbaz
+        - name: AWS_SECRET_ACCESS_KEY
+          value: foobarbaz
+        - name: AWS_DEFAULT_REGION
+          value: the-event-horizon

Now my first call to k8schain.New() seems to complete much quicker.
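
A roughly equivalent way to try the same workaround without editing config/controller.yaml is to patch the running Deployment directly (assuming the controller Deployment is named tekton-pipelines-controller in the tekton-pipelines namespace, as in the default release manifests):

# kubectl set env deployment/tekton-pipelines-controller -n tekton-pipelines \
    AWS_ACCESS_KEY_ID=foobarbaz AWS_SECRET_ACCESS_KEY=foobarbaz AWS_DEFAULT_REGION=the-event-horizon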

@barthy1 could you try doing something similar with your pipelines deployment and see if that changes the slowness?

@barthy1
Member Author

barthy1 commented Jul 3, 2021

@sbwsg after I added the suggested env variables, I do not observe the slowness mentioned in this issue any more 👍

@ghost

ghost commented Jul 7, 2021

Awesome, thanks for the update. This is very concerning. I wonder how many projects are embedding k8schain or otherwise initializing the aws sdk and therefore slowing everything down just because it's a transitive dependency.

@ghost

ghost commented Jul 9, 2021

/remove-lifecycle rotten
/lifecycle frozen

@tekton-robot tekton-robot added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. labels Jul 9, 2021
tekton-robot pushed a commit that referenced this issue Jul 12, 2021
Several issues have now reared their head which are directly caused
by an update to the aws-sdk. The update results in extremely long
delays in the execution of tasks after the Pipelines controller is
first deployed in a cluster. The aws-sdk is initialized through
a transitive dependency that pipelines pulls in via k8schain.

Here are the recent issues directly related to this aws-sdk bug:

- #3627 (Since December!)
- #4084

One quick way to work around this problem is to set fake AWS
credentials in the environment of the deployed controller. This
apparently causes the aws-sdk to skip whatever process it has
introduced that causes massive delays in execution. So this commit
does exactly that - set fake aws creds in the deployments env vars.

This is an unfortunate hack to try and mitigate the problem until
a better solution presents itself.
@bobcatfish
Collaborator

Does #4073 fix this?

@barthy1
Member Author

barthy1 commented Jul 16, 2021

@bobcatfish yes, with the latest pipeline code I do not observe the problems mentioned in this issue.

@barthy1 barthy1 closed this as completed Jul 16, 2021