Bug 1717604: data/aws/vpc/master-elb: Bump load-balancer timeouts to 20m #2279
Conversation
@wking: This pull request references Bugzilla bug 1717604, which is invalid:
Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/bugzilla refresh
@wking: This pull request references Bugzilla bug 1717604, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Can you drop a301036? It doesn't belong to the BZ attached.
We don't have a timeout for infra provisioning, AFAIK.
/test e2e-libvirt
Force-pushed from a301036 to ad26356 (Compare)
Up from their default 10 minutes, using the knob that dates back to the original network load balancer support [1]. This should help us avoid the [2]:

    Error: timeout while waiting for state to become 'active' (last state: 'provisioning', timeout: 10m0s)

that cropped up again this week. 20m matches the timeout we set for routes and security groups in 246f4a1 (data/aws: 20-minute create timeouts for routes and security groups, 2019-04-26, openshift#1682). Sometimes even 20m will not be enough [3], but this should make us a bit more resilient anyway.

[1]: hashicorp/terraform-provider-aws@1af53b1#diff-f4b0dbdc7e3eede6ba70cd286c834f37R92
[2]: https://bugzilla.redhat.com/show_bug.cgi?id=1717604
[3]: https://bugzilla.redhat.com/show_bug.cgi?id=1717604#c15
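Concretely, the change amounts to something like the following Terraform sketch. The resource name, variable, and `name` value are illustrative assumptions, not the installer's actual data/aws/vpc/master-elb config:

```hcl
# Minimal sketch (not the installer's actual config): a network load
# balancer with its create/update/delete timeouts raised from the
# provider default of 10m to 20m.
variable "subnet_ids" {
  type = list(string)
}

resource "aws_lb" "api_internal" {
  name               = "example-api-int" # illustrative name
  internal           = true
  load_balancer_type = "network"
  subnets            = var.subnet_ids

  timeouts {
    create = "20m"
    update = "20m"
    delete = "20m"
  }
}
```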
Force-pushed from ad26356 to 1ec2758 (Compare)
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: abhinavdahiya, wking. The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing `/approve` in a comment.
/hold Hmm, it doesn't look like it is working…
It looks like we need to set … Or maybe not; I was looking at the wrong function: https://github.com/terraform-providers/terraform-provider-aws/blob/3ac0b235870a52cb8b5090d5f7d0b2825d9e76fc/aws/resource_aws_lb.go#L300
Ah, is this "upgrade uses the tip release image as the source and then upgrades to tip+PR-change as the target"? In which case, do we care about running upgrade tests in the installer repo?

```console
$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/2279/pull-ci-openshift-installer-master-e2e-aws-upgrade/2029/artifacts/e2e-aws-upgrade/installer/.openshift_install.log | grep 'Built from commit' | head -n1
time="2019-08-28T16:21:09Z" level=debug msg="Built from commit 6318c72cd078fb26f835e28117d7d02bc50d81a5"
```

6318c72 is the parent of 1ec2758 and does not include my bump.
So this got bit by the … /retest
/hold cancel
But upgrade CI is unaffected by in-flight installer PRs, so this must be a flake. /retest
Manually merging to hopefully unjam our CI system, which is suffering from hitting these timeouts in a lot of jobs. The PR has passed e2e-aws.
@wking: All pull requests linked via external trackers have merged. Bugzilla bug 1717604 has been moved to the MODIFIED state. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
@wking: The following tests failed, say /retest to rerun all failed tests:
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
/cherrypick release-4.1
@wking: #2279 failed to apply on top of branch "release-4.1":
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
We've been hitting the default 10m timeout recently, starting around [1]. Use the creation timeout knob [2] to wait a bit longer before giving up on AWS. 20m matches the value we've used when bumping this timeout for other resources in 1ec2758 (data/aws/vpc/master-elb: Bump load-balancer timeouts to 20m, 2019-08-27, openshift#2279) and previous.

There's no guarantee that 20m will be sufficient; the issue could be due to internal AWS issues like a shortage of on-demand instances of the requested type in the requested availability zone. But it gives AWS an even easier target to hit ;).

[1]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_console/2708/pull-ci-openshift-console-master-e2e-aws-console-olm/8584
[2]: https://www.terraform.io/docs/providers/aws/r/instance.html#timeouts
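The analogous change for an instance resource, per the timeouts doc in [2], would look something like this sketch (the AMI ID, instance type, and resource name are placeholders, not the installer's real values):

```hcl
# Sketch only: raise the create timeout for an EC2 instance from the
# provider default of 10m to 20m. All values are placeholders.
resource "aws_instance" "master" {
  ami           = "ami-0123456789abcdef0" # placeholder
  instance_type = "m4.xlarge"             # placeholder

  timeouts {
    create = "20m"
  }
}
```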
Up from their default 10 minutes, using the knob that dates back to the original network load balancer support. This should help us avoid the

    Error: timeout while waiting for state to become 'active' (last state: 'provisioning', timeout: 10m0s)

that cropped up again this week, while still generally keeping us under the 30m timeout we set for the whole infrastructure-provisioning step. Sometimes even 20m will not be enough, but this should make us a bit more resilient anyway.
While I was comparing the docs with our config, I also noticed that we should have dropped `idle_timeout` (which does not apply to network load balancers) when we made the shift from classic to network load balancers in 16dfbb3 (#594), so I'm doing that too; a sketch follows.
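As a sketch of that cleanup, with invented names: on an `aws_lb` of type "network", `idle_timeout` is ignored (it only applies to application load balancers), so the fix is simply to drop the argument:

```hcl
# Sketch only: idle_timeout was a leftover from the classic
# load-balancer days; on a network load balancer it has no effect,
# so it is removed rather than tuned.
resource "aws_lb" "api_external" {
  name               = "example-api-ext" # illustrative name
  internal           = false
  load_balancer_type = "network"
  subnets            = var.subnet_ids # assumes the variable from the earlier sketch

  # idle_timeout = 3600  <- removed; only meaningful for
  #                         application load balancers
}
```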