ec_deployment: Retry destroy on Timeout Exceeded #308
Conversation
Adds retries when the plan tracking message of a deployment shutdown contains the Timeout Exceeded message, with a maximum of 3 retries. Signed-off-by: Marc Lopez <[email protected]>
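Since the diff itself isn't shown in this conversation, here is a minimal Go sketch of the behavior the description outlines; the names (`destroyWithRetries`, `shutdownFunc`) are hypothetical stand-ins, not the provider's actual API:

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// maxRetries mirrors the ceiling described in the PR: the destroy is
// attempted at most 1 + 3 times before the error surfaces to Terraform.
const maxRetries = 3

// errTimeoutExceeded stands in for the plan tracking failure message
// that the provider matches on.
var errTimeoutExceeded = errors.New("Timeout Exceeded")

// shutdownFunc is a stand-in for the call that shuts a deployment down
// and tracks its plan to completion.
type shutdownFunc func(deploymentID string) error

// destroyWithRetries retries the shutdown whenever the tracked plan
// failed with a "Timeout Exceeded" message; any other error is
// returned immediately.
func destroyWithRetries(shutdown shutdownFunc, deploymentID string) error {
	var err error
	for attempt := 0; attempt <= maxRetries; attempt++ {
		err = shutdown(deploymentID)
		if err == nil {
			return nil
		}
		// Only the transient "Timeout Exceeded" plan failure is retried.
		if !strings.Contains(err.Error(), "Timeout Exceeded") {
			return err
		}
	}
	// All retries exhausted: return the error as it normally would be.
	return err
}

func main() {
	// Simulate a shutdown that fails twice transiently, then succeeds.
	calls := 0
	shutdown := func(id string) error {
		calls++
		if calls < 3 {
			return errTimeoutExceeded
		}
		return nil
	}
	fmt.Println(destroyWithRetries(shutdown, "my-deployment-id")) // <nil>
}
```

Per the description, the real change matches on the plan tracking message of the deployment shutdown rather than on a wrapped error, but the bounded-retry shape is the same.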
Seems to work as expected; there was a transient test failure on the Terraform binary download, but otherwise the tests passed:
Tests actually passed after a re-run: https://devops-ci.elastic.co/job/elastic+terraform-provider-ec+pull-request/481/
I have a couple of questions:
@elastic/cloud-orchestration, what is your recommendation here? How should a client deal with a "Timeout Exceeded" response to a destroy deployment request (or a deployment change request in general)?
To my knowledge, the only (relatively) common cause of destroy plan failures is when Docker is in a bad state and unresponsive to delete container API calls, which is something that generally requires manual intervention to fix (for now!). That is, short-term retries probably wouldn't help until we implement automatic detection and recovery from Docker failures. Are there other failure scenarios that this change is targeting?
For deployment change failures in general, retries for timeouts may be helpful. In particular, plans that create new containers (genesis plans, scale up plans, etc) will sometimes fail due to resource contention on busy allocators. This is especially true for automated test suites in which many clusters are being created and destroyed simultaneously, typically thrashing one or two allocators due to our default allocator fill strategy creating hotspots. In these cases, trying again a couple minutes later may well succeed.
Usually these occur when an entire plan times out while waiting for an instance to start or stop, which, as @ean5533 says, could be due to some underlying infra problem that a retry won't necessarily resolve, or to an allocator being CPU bound and just slow. The only way to tell the difference is by the allocator reporting progress and/or errors, which we've been working on recently. My sense is we might want retries, but we don't yet have granular enough info to know when that is the case.
Hi everyone, there's been quite a lot of interest in this overnight. Ideally, we wouldn't have to take this route and would have the product itself retry this to avoid the transient "Timeout Exceeded" plan failure. However, the destroy "Timeout Exceeded" failures are common enough that it makes sense to introduce a retry at this level, especially since the retry actually resolves the transient failure and Terraform supports that primitive natively. As I mention in the PR description, there's a ceiling of 3 retries; additionally, and not explicitly written here, because of how we track plans and detect when a plan has finished, there's always a lag of at least 8s (initial API call + 2 seconds * 4 retries) between retries, so we're not flooding the API. Answering the comments below, @jowiho's questions:
This retry is specific to the destroy operation (analogous to the deployment shutdown API) and will make a best effort of up to 3 retries. If they all fail, the error is returned as it normally would be.
There's no incremental backoff mechanism built in, but as I explained above, we're waiting at least 8s between retries (possibly more because of network I/O).
They do not; this seems to be specific to stopping the containers, where the allocator doesn't stop them in time. I tried doubling the minimum timeout for any API operation to 4096, which, reading the codebase, is only hit on shutdown (due to the capacity being 0), but it didn't have much of an effect in reducing these errors.
Not really; we're calling the shutdown API, which is async and returns a 200, and we then track the progress of the pending plan to detect failures (see the sketch after these replies). @ean5533's questions:
I believe that might be the cause. Interestingly enough, the containers are created successfully on the very allocator where Docker is in a "bad state", and only fail when they're stopped, even though the deployment's lifespan was probably 5-15m. By all means, I'm happy to provide any assistance required to help eliminate this problem completely. The Jenkins acceptance test log should contain plenty of deployment IDs that can be used to investigate the timeline and logs of these failures.
This change only targets the deployment shutdown Timeout Exceeded failures, no other failure scenarios are tackled with this change.
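For context on the pacing described above, here is a minimal sketch of the asynchronous shutdown-then-track flow, again using hypothetical names (`trackPlan`, `planStatus`) rather than the provider's actual tracking code; it only illustrates why consecutive destroy retries end up at least ~8s apart:

```go
package main

import (
	"fmt"
	"time"
)

// These constants mirror the timing described above: 4 polls at 2s
// intervals, plus the initial API call, means consecutive destroy
// retries are always at least ~8s apart, so the API isn't flooded.
const (
	pollInterval = 2 * time.Second
	pollRetries  = 4
)

// planStatus is a stand-in for the plan state returned by the tracker.
type planStatus struct {
	Finished bool
	Message  string // e.g. "Timeout Exceeded" on a failed shutdown plan
}

// trackPlan polls a (hypothetical) plan endpoint until the plan
// finishes. The shutdown API itself is async and returns a 200
// immediately, so failures only surface through this tracking loop.
func trackPlan(fetch func() planStatus) (planStatus, error) {
	for i := 0; i < pollRetries; i++ {
		time.Sleep(pollInterval)
		if s := fetch(); s.Finished {
			return s, nil
		}
	}
	return planStatus{}, fmt.Errorf("plan still pending after %d polls", pollRetries)
}

func main() {
	// Simulate a plan that finishes on the third poll (~6s in).
	polls := 0
	status, err := trackPlan(func() planStatus {
		polls++
		return planStatus{Finished: polls == 3, Message: "shutdown plan finished"}
	})
	fmt.Println(status, err)
}
```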
That is really interesting, and not something I was aware existed out in the wild. Would definitely be worth one of us Orch folks peeking into at some point.
Go code LGTM
re Terraform binary download failure:
We're pretty strict on this with the main Cloud stuff; we can be a bit more lax here, but in general (I'm sure you know this) downloading things during tests is something to avoid (flakiness, added time, etc.). The way to fix this is to bake the requirements into the images so there's nothing to download; there is prior art for this and Infra has all the tooling in place, and given this repo is only going to see more development as time goes on, it might be worth it.
Alternatively, we should at least add retries around the download part of the test (as a step in the pipeline, or whatever is easiest).
I'm not suggesting we need to do the above to merge this, but if this is common, an issue should at least be created to follow up.
I just re-ran the tests before merging and they passed, again. So I'm merging this for good.
Description
Adds retries when the plan tracking message of a deployment shutdown
contains the Timeout Exceeded message, with a maximum of 3 retries.
Motivation and Context
No more failures on flaky "Timeout Exceeded" plan errors.
How Has This Been Tested?
Acceptance tests.
Types of Changes