
ec_deployment: Retry destroy on Timeout Exceeded #308

Merged

Conversation

marclop
Contributor

@marclop marclop commented May 5, 2021

Description

Adds retries when the plan tracking message of a deployment shutdown
contains the Timeout Exceeded message, with a maximum of 3 retries.
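
A minimal sketch of the approach (illustrative Go only, not the actual provider code; shutdownDeployment here stands in for the shutdown API call plus plan tracking):

package main

import (
	"errors"
	"fmt"
	"strings"
)

const maxRetries = 3

// shutdownDeployment stands in for shutting the deployment down and tracking
// the resulting plan until it finishes, returning the plan failure (if any).
// This placeholder always reports the transient failure.
func shutdownDeployment(id string) error {
	return errors.New("Timeout Exceeded")
}

// destroyWithRetries retries the shutdown only while the failure message
// contains the "Timeout Exceeded" substring, with up to maxRetries retries.
func destroyWithRetries(id string) error {
	err := shutdownDeployment(id)
	for retries := 0; retries < maxRetries; retries++ {
		if err == nil || !strings.Contains(err.Error(), "Timeout Exceeded") {
			// Success, or a failure unrelated to the transient timeout.
			return err
		}
		err = shutdownDeployment(id)
	}
	return err
}

func main() {
	fmt.Println(destroyWithRetries("example-deployment-id"))
}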

Motivation and Context

No more failures on flaky "Timeout Exceeded" errors.

How Has This Been Tested?

Acceptance tests.

Types of Changes

  • Enhancement

Adds retries when the plan tracking message of a deployment shutdown
contains the Timeout Exceeded message, with a maximum of 3 retries.

Signed-off-by: Marc Lopez <[email protected]>
@marclop marclop self-assigned this May 5, 2021
@marclop
Contributor Author

marclop commented May 6, 2021

Seems to work as expected. There was a transient test failure on the Terraform binary download, but otherwise the tests passed:

17:23:54  + make testacc-ci
17:23:54  make[1]: Entering directory '/var/lib/jenkins/workspace/elastic+terraform-provider-ec+pull-request'
17:23:54  -> Running terraform acceptance tests...
[..]
18:11:38  cannot run Terraform provider tests: unexpected error: Get "https://checkpoint-api.hashicorp.com/v1/check/terraform?arch=amd64&os=linux&signature=&version=": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
18:11:38  FAIL	github.com/elastic/terraform-provider-ec/ec/acc	2823.271s
18:11:38  FAIL

Tests actually passed after a re-run: https://devops-ci.elastic.co/job/elastic+terraform-provider-ec+pull-request/481/

marclop added 2 commits May 6, 2021 07:00
Signed-off-by: Marc Lopez <[email protected]>
Signed-off-by: Marc Lopez <[email protected]>
@marclop marclop added the acceptance (Improve or add a test for a resource), enhancement (New feature or request), and Team:Delivery labels May 6, 2021
@marclop marclop added this to the v0.2.0 milestone May 6, 2021
@marclop marclop marked this pull request as ready for review May 6, 2021 03:53
@marclop marclop requested a review from a team as a code owner May 6, 2021 03:53
@jowiho

jowiho commented May 10, 2021

I have a couple of questions:

  • Is retrying the best strategy when the API returns a timeout? What exactly timed out, and how do you know it won't time out again?
  • Should there be a back-off period in between retries? Or can you retry with a higher timeout?
  • Why is this mechanism specific to destroying deployments? Don't other APIs return timeouts too?
  • Is there a better way than string matching to detect a timeout response? Does the API return an error code? We recently improved API responses for failed plans.

@elastic/cloud-orchestration what is your recommendation here? How should a client deal with a "Timeout Exceeded" response to a destroy deployment request (or a deployment change request in general)?

@ean5533

ean5533 commented May 10, 2021

To my knowledge, the only (relatively) common cause of destroy plan failures is when Docker is in a bad state and unresponsive to delete container API calls, which is something that generally requires manual intervention to fix (for now!). That is, short-term retries probably wouldn't help until we implement automatic detection and recovery from Docker failures.

Are there other failure scenarios that this change is targeting?

@ean5533

ean5533 commented May 10, 2021

For deployment change failures in general, retries for timeouts may be helpful. In particular, plans that create new containers (genesis plans, scale up plans, etc) will sometimes fail due to resource contention on busy allocators. This is especially true for automated test suites in which many clusters are being created and destroyed simultaneously, typically thrashing one or two allocators due to our default allocator fill strategy creating hotspots. In these cases, trying again a couple minutes later may well succeed.

@jhalterman

jhalterman commented May 10, 2021

Usually these occur when an entire plan times out while waiting for an instance to start or stop, which, as @ean5533 says, could be due to some underlying infra problem that a retry won't necessarily resolve, or to an allocator being CPU bound and just slow. The only way to know the difference is by the allocator reporting progress and/or errors, which we've been working on recently. My sense is we might want retries, but we don't yet have granular enough info to know when that is the case.

@marclop
Contributor Author

marclop commented May 11, 2021

Hi everyone, there's been quite a lot of interest in this overnight. Ideally, we wouldn't have to take this route and the product itself would retry this to avoid the transient "Timeout Exceeded" plan failure. However, the destroy "Timeout Exceeded" failures are common enough that it makes sense to introduce a retry at this level, especially since the retry actually solves the transient failure and Terraform supports that primitive natively.

As I mention in the PR description, there's a ceiling of a maximum of 3 retries. It's not explicitly written here, but because of the nature of how we track plans and detect when a plan has finished, there's always a lag of at least 8s (initial API call + 2 seconds * 4 retries) between retries, so we're not flooding the API.
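
For illustration, the lower bound on the spacing works out like this (a back-of-the-envelope sketch; the 2-second poll interval and 4 polls per attempt are the figures quoted above, not measured values):

package main

import (
	"fmt"
	"time"
)

func main() {
	// Figures from the comment above: the plan tracker polls roughly every
	// 2 seconds and goes through at least 4 polls per shutdown attempt, so
	// consecutive retries end up at least ~8s apart (plus network latency).
	pollInterval := 2 * time.Second
	minPollsPerAttempt := 4
	minSpacing := time.Duration(minPollsPerAttempt) * pollInterval
	fmt.Println("minimum spacing between retries:", minSpacing) // prints 8s
}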

Answering the comments below:

@jowiho's

  • Is retrying the best strategy when the API returns a timeout? What exactly timed out, and how do you know it won't time out again?

This retry is specific to the destroy operation (analogous to the deployment shutdown API), and it performs a best effort of up to 3 retries. If they all fail, the error is returned as it normally would be.

  • Should there be a back-off period in between retries? Or can you retry with a higher timeout?

There's no incremental backoff mechanism built in, but as I explained in the paragraph above, we're waiting at least 8s between retries (it might be more because of network I/O).

  • Why is this mechanism specific to destroying deployments? Don't other APIs return timeouts too?

They do not; this seems to be something specific to stopping the containers, where the allocator doesn't seem to stop them in time. I tried doubling the minimum timeout for any API operation (which, reading the codebase, is only hit on shutdown due to the capacity being 0) to 4096, but it didn't have much of an effect in reducing these errors.

  • Is there a better way than string matching to detect a timeout response? Does the API return an error code? We recently improved API responses for failed plans.

Not really. The reason is that we're calling the shutdown API, which is async and returns a 200; then, to track the progress of the pending plan, GET /deployment/<id> is called, which also returns a 200, and we read the plan log and process it accordingly.
I can go into details if you wish, but since the deployment has N applications and each can fail individually, a multierror is returned which wraps each of the errors. The only mechanism we have right now, without spending considerable time refactoring quite a lot of code, is matching the substring we want to see in the "failure".
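
In other words, the detection boils down to a substring check on the aggregated error message, roughly like this (an illustrative sketch; the helper name and error text are made up, not the provider's actual code):

package main

import (
	"errors"
	"fmt"
	"strings"
)

// shouldRetryShutdown reports whether a failed shutdown should be retried by
// matching the "Timeout Exceeded" substring anywhere in the wrapped error.
func shouldRetryShutdown(err error) bool {
	return err != nil && strings.Contains(err.Error(), "Timeout Exceeded")
}

func main() {
	// Illustrative multierror-style message: each application's plan failure
	// is flattened into a single aggregated error string.
	err := errors.New(`2 errors occurred:
	* resource "elasticsearch": plan failed: Timeout Exceeded
	* resource "kibana": plan failed: Timeout Exceeded`)
	fmt.Println(shouldRetryShutdown(err)) // true
}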

@ean5533

To my knowledge, the only (relatively) common cause of destroy plan failures is when Docker is in a bad state and unresponsive to delete container API calls, which is something that generally requires manual intervention to fix (for now!). That is, short-term retries probably wouldn't help until we implement automatic detection and recovery from Docker failures.

I believe that might be the cause. Interestingly enough, the containers are successfully created on that allocator where Docker is in a "bad state", but they only fail when they're stopped, even though the deployment lifespan was probably 5-15m.

By all means, I'm happy to provide any assistance required to help eliminate this problem completely. In the Jenkins acceptance test log there should be plenty of deployment IDs that can be used to investigate the timeline and logs of these failures.

Are there other failure scenarios that this change is targeting?

This change only targets the deployment shutdown Timeout Exceeded failures, no other failure scenarios are tackled with this change.

@ean5533

ean5533 commented May 11, 2021

I believe that might be the cause. Interestingly enough, the containers are successfully created on that allocator where Docker is in a "bad state", but they only fail when they're stopped, even though the deployment lifespan was probably 5-15m.

That is really interesting, and not something I was aware existed out in the wild. Would definitely be worth one of us Orch folks peeking into at some point.


@Aran-K Aran-K left a comment


Go code LGTM

re Terraform binary download failure:
We're pretty strict about this with the main Cloud stuff; we can be a bit more lax here, but in general (I'm sure you know this) downloading things during tests is something to avoid (flakiness, added time, etc.). The way to fix this is to bake the requirements into the images so there's nothing to download. There is prior art for this and Infra has all the tooling in place, and given this repo is only going to see more development as time goes on, it might be worth it.
Alternatively, we should at least add retries around the download part of the test (just as a step in the pipeline or whatever is easiest).

Not suggesting we need to do the above to merge this, but if this is common, an issue should at least be created to follow up.

@marclop
Contributor Author

marclop commented May 12, 2021

I just re-ran the tests before merging and they passed again, so I'm merging this for good.

@marclop marclop merged commit 28607b8 into elastic:master May 12, 2021
@marclop marclop deleted the f/add-destroy-retry-on-timeout-exceeded branch May 12, 2021 11:24