e2e TestGitPipelineRun test fails on newly installed pipeline #3627
Comments
You probably would want to set the … (see Lines 111 to 116 in d702ca1).
@nikhil-thomas I don't have the taskrun mentioned in this issue any more. Providing the describe output for a new taskrun; from my point of view there is no additional info there.
@chmouel no pods are actually created, so I cannot inspect the logs.
Update from my side: 18 minutes after starting the test, the taskrun finished successfully, and it took just a few seconds to run. But obviously the test failed because of the 10-minute waiting timeout. So I observe some kind of delay in taskrun execution when I start it for the first time.
Do you wait for all pods to be ready in the …?
Looking at the task that installs Tekton, it looks like you do. I wonder if there is something else causing the delay... It would be useful if you could share the controller logs?
Yes.
Sure, I will provide the logs when I do a clean setup.
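(As an aside, here is a minimal client-go sketch of what "wait for all pods to be ready" before running the e2e tests could look like. The tekton-pipelines namespace and the polling loop are assumptions for illustration, not necessarily what the install task actually does.)

```go
// Minimal sketch: poll until every pod in the tekton-pipelines namespace
// reports the Ready condition, or give up after 10 minutes.
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	err = wait.PollImmediate(5*time.Second, 10*time.Minute, func() (bool, error) {
		pods, err := client.CoreV1().Pods("tekton-pipelines").List(context.Background(), metav1.ListOptions{})
		if err != nil || len(pods.Items) == 0 {
			return false, err
		}
		for _, p := range pods.Items {
			ready := false
			for _, c := range p.Status.Conditions {
				if c.Type == corev1.PodReady && c.Status == corev1.ConditionTrue {
					ready = true
				}
			}
			if !ready {
				return false, nil // at least one pod not Ready yet; keep polling
			}
		}
		return true, nil
	})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("all tekton-pipelines pods are ready")
}
```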
The logs from the controller, from startup until the test started to execute.
Issues go stale after 90d of inactivity.
/lifecycle stale
Send feedback to tektoncd/plumbing.
Stale issues rot after 30d of inactivity.
/lifecycle rotten
Send feedback to tektoncd/plumbing.
@barthy1 does this remain an issue in more recent e2e test runs?
Yes, I see the same behaviour with the latest main code.
There's a very sudden jump in the logs from 17:20 to 17:37 where nothing appears to be happening, and then an error is reported related to aws credentials:
That error comes from … Pulling in some more folks here who I mentioned the delay to yesterday on Slack. FYI @vdemeester and @imjasonh.
This sounds heavily similar to the GCP error we have with OpenShift (where the URL that …). Something might have changed in the recent version of this package and/or on the service side of AWS 🤔
Using the …
Searching around for similar issues, it does sound like the aws-sdk introduced a security feature that has caused lots of difficult-to-debug delays for other projects. Here's one example with a good explanation of what's going on: zalando-incubator/kube-ingress-aws-controller#309 (comment)

Edit: The problem I'm facing now is that all of this AWS stuff is happening way down deep in our dependency tree. Not sure yet how to mitigate or work around it.
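(To make the slow path concrete, here is a minimal sketch, not Tekton's code, and assuming the aws-sdk-go v1 API: the default credential chain consults environment variables before the EC2 instance metadata service (IMDS), so fake env credentials stop the lookup before it ever reaches the metadata endpoint that stalls on non-EC2 clusters.)

```go
// Sketch of the order the aws-sdk-go v1 default credential chain uses.
// The IMDS-backed provider at the end is the slow path when there is no
// metadata service reachable from the cluster.
package main

import (
	"fmt"

	"github.com/aws/aws-sdk-go/aws/credentials"
	"github.com/aws/aws-sdk-go/aws/credentials/ec2rolecreds"
	"github.com/aws/aws-sdk-go/aws/ec2metadata"
	"github.com/aws/aws-sdk-go/aws/session"
)

func main() {
	sess := session.Must(session.NewSession())

	chain := credentials.NewChainCredentials([]credentials.Provider{
		&credentials.EnvProvider{},               // AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY
		&credentials.SharedCredentialsProvider{}, // ~/.aws/credentials
		&ec2rolecreds.EC2RoleProvider{ // IMDS call; this is what hangs off EC2
			Client: ec2metadata.New(sess),
		},
	})

	creds, err := chain.Get()
	if err != nil {
		fmt.Println("credential lookup failed:", err)
		return
	}
	fmt.Println("resolved credentials from provider:", creds.ProviderName)
}
```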
One workaround that appears to speed things up is to set nonsense AWS credentials in environment variables. I've updated the pipelines controller deployment with the following change:

```
λ git diff
diff --git a/config/controller.yaml b/config/controller.yaml
index d07084c5d..728776b4c 100644
--- a/config/controller.yaml
+++ b/config/controller.yaml
@@ -89,6 +89,12 @@ spec:
         - name: config-registry-cert
           mountPath: /etc/config-registry-cert
         env:
+        - name: AWS_ACCESS_KEY_ID
+          value: foobarbaz
+        - name: AWS_SECRET_ACCESS_KEY
+          value: foobarbaz
+        - name: AWS_DEFAULT_REGION
+          value: the-event-horizon
```

Now my first call to … @barthy1 could you try doing something similar with your pipelines deployment and see if that changes the slowness?
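(An equivalent change can likely be applied to a running cluster without editing the manifest, for example `kubectl -n tekton-pipelines set env deployment/tekton-pipelines-controller AWS_ACCESS_KEY_ID=foobarbaz AWS_SECRET_ACCESS_KEY=foobarbaz AWS_DEFAULT_REGION=the-event-horizon`; the tekton-pipelines namespace and controller deployment name are the defaults and may differ in other installs.)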
@sbwsg after I added the suggested env variables, I do not observe the slowness mentioned in this issue any more 👍
Awesome, thanks for the update. This is very concerning. I wonder how many projects are embedding k8schain or otherwise initializing the aws-sdk, and are therefore slowing everything down just because it's a transitive dependency.
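(For a sense of how this happens, here is a rough sketch of a k8schain consumer. The exact k8schain.NewInCluster signature varies across go-containerregistry versions, and the namespace, service account, and image reference below are placeholders. Resolving registry credentials walks the cloud-specific keychains, and the ECR helper is where the aws-sdk credential lookup gets triggered.)

```go
// Sketch of a typical k8schain consumer. Simply depending on k8schain pulls
// in the ECR credential helper, and with it the aws-sdk, even if no AWS
// registry is ever used.
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/google/go-containerregistry/pkg/authn/k8schain"
	"github.com/google/go-containerregistry/pkg/name"
)

func main() {
	ctx := context.Background()

	// Build a keychain from the pod's service account and pull secrets.
	keychain, err := k8schain.NewInCluster(ctx, k8schain.Options{
		Namespace:          "default",
		ServiceAccountName: "default",
	})
	if err != nil {
		log.Fatal(err)
	}

	repo, err := name.NewRepository("gcr.io/some-project/some-image")
	if err != nil {
		log.Fatal(err)
	}

	// Resolving credentials consults the cloud-specific helpers (GCR, ECR, ACR);
	// the ECR helper is the one that ends up doing aws-sdk credential lookups.
	auth, err := keychain.Resolve(repo)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(auth)
}
```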
/remove-lifecycle rotten
Several issues have now reared their head which are directly caused by an update to the aws-sdk. The update results in extremely long delays in the execution of tasks after the Pipelines controller is first deployed in a cluster. The aws-sdk is initialized through a transitive dependency that pipelines pulls in via k8schain.

Here are the recent issues directly related to this aws-sdk bug:

- #3627 (since December!)
- #4084

One quick way to work around this problem is to set fake AWS credentials in the environment of the deployed controller. This apparently causes the aws-sdk to skip whatever process it has introduced that causes massive delays in execution. So this commit does exactly that: it sets fake AWS creds in the deployment's env vars. This is an unfortunate hack to try and mitigate the problem until a better solution presents itself.
Does #4073 fix this?
@bobcatfish yes, with the latest pipeline code I do not observe the problems mentioned in this issue.
Expected Behavior

The TestGitPipelineRun test should pass.

Actual Behavior

The TestGitPipelineRun test fails because of a timeout. The corresponding taskrun is created but no pod is started. Usually a second attempt is successful.

Steps to Reproduce the Problem

kind

Additional Info

/kind bug