
ec_deployment: Retry destroy on Timeout Exceeded #308

Merged

Conversation

marclop
Contributor

@marclop marclop commented May 5, 2021

Description

Adds retries when the plan tracking message of a deployment shutdown
contains the Timeout Exceeded message, with a maximum of 3 retries.
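
A minimal sketch of the approach (illustrative Go only, not the actual provider code; shutdownDeployment here stands in for the shutdown API call plus plan tracking):

package main

import (
	"errors"
	"fmt"
	"strings"
)

const maxRetries = 3

// shutdownDeployment stands in for shutting the deployment down and tracking
// the resulting plan until it finishes, returning the plan failure (if any).
// This placeholder always reports the transient failure.
func shutdownDeployment(id string) error {
	return errors.New("Timeout Exceeded")
}

// destroyWithRetries retries the shutdown only while the failure message
// contains the "Timeout Exceeded" substring, with up to maxRetries retries.
func destroyWithRetries(id string) error {
	err := shutdownDeployment(id)
	for retries := 0; retries < maxRetries; retries++ {
		if err == nil || !strings.Contains(err.Error(), "Timeout Exceeded") {
			// Success, or a failure unrelated to the transient timeout.
			return err
		}
		err = shutdownDeployment(id)
	}
	return err
}

func main() {
	fmt.Println(destroyWithRetries("example-deployment-id"))
}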

Motivation and Context

No more failures on flaky "Timeout Exceeded" errors.

How Has This Been Tested?

Acceptance tests.

Types of Changes

  • Enhancement

Adds retries when the plan tracking message of a deployment shutdown
contains the Timeout Exceeded message, with a maximum of 3 retries.

Signed-off-by: Marc Lopez <[email protected]>
@marclop marclop self-assigned this May 5, 2021
@marclop
Contributor Author

marclop commented May 6, 2021

Seems to work as expected. There was a transient test failure on the Terraform binary download, but otherwise the tests passed:

17:23:54  + make testacc-ci
17:23:54  make[1]: Entering directory '/var/lib/jenkins/workspace/elastic+terraform-provider-ec+pull-request'
17:23:54  -> Running terraform acceptance tests...
[..]
18:11:38  cannot run Terraform provider tests: unexpected error: Get "https://checkpoint-api.hashicorp.com/v1/check/terraform?arch=amd64&os=linux&signature=&version=": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
18:11:38  FAIL	github.com/elastic/terraform-provider-ec/ec/acc	2823.271s
18:11:38  FAIL

Tests actually passed after a re-run: https://devops-ci.elastic.co/job/elastic+terraform-provider-ec+pull-request/481/

marclop added 2 commits May 6, 2021 07:00
Signed-off-by: Marc Lopez <[email protected]>
Signed-off-by: Marc Lopez <[email protected]>
@marclop marclop added the acceptance (Improve or add a test for a resource), enhancement (New feature or request), and Team:Delivery labels May 6, 2021
@marclop marclop added this to the v0.2.0 milestone May 6, 2021
@marclop marclop marked this pull request as ready for review May 6, 2021 03:53
@marclop marclop requested a review from a team as a code owner May 6, 2021 03:53
@jowiho

jowiho commented May 10, 2021

I have a couple of questions:

  • Is retrying the best strategy when the API returns a timeout? What exactly timed out, and how do you know it won't time out again?
  • Should there be a back-off period in between retries? Or can you retry with a higher timeout?
  • Why is this mechanism specific to destroying deployments? Don't other APIs return timeouts too?
  • Is there a better way than string matching to detect a timeout response? Does the API return an error code? We recently improved API responses for failed plans.

@elastic/cloud-orchestration what is your recommendation here? How should a client deal with a "Timeout Exceeded" response to a destroy deployment request (or a deployment change request in general)?

@ean5533

ean5533 commented May 10, 2021

To my knowledge, the only (relatively) common cause of destroy plan failures is when Docker is in a bad state and unresponsive to delete container API calls, which is something that generally requires manual intervention to fix (for now!). That is, short-term retries probably wouldn't help until we implement automatic detection and recovery from Docker failures.

Are there other failure scenarios that this change is targeting?

@ean5533

ean5533 commented May 10, 2021

For deployment change failures in general, retries for timeouts may be helpful. In particular, plans that create new containers (genesis plans, scale up plans, etc) will sometimes fail due to resource contention on busy allocators. This is especially true for automated test suites in which many clusters are being created and destroyed simultaneously, typically thrashing one or two allocators due to our default allocator fill strategy creating hotspots. In these cases, trying again a couple minutes later may well succeed.

@jhalterman

jhalterman commented May 10, 2021

Usually these occur when an entire plan times out while waiting for an instance to start or stop, which, as @ean5533 says, could be due to some underlying infra problem that a retry won't necessarily resolve, or to an allocator being CPU bound and just slow. The only way to know the difference is by the allocator reporting progress and/or errors, which we've been working on recently. My sense is we might want retries, but we don't yet have granular enough info to know when that is the case.

@marclop
Contributor Author

marclop commented May 11, 2021

Hi everyone, there's been quite a lot of interest in this overnight. Ideally, we wouldn't have to take this route and the product itself would retry this to avoid the transient "Timeout Exceeded" plan failure. However, the destroy "Timeout Exceeded" failures are common enough that it makes sense to introduce a retry at this level, especially since the retry actually solves the transient failure and Terraform supports that primitive natively.

As I mention in the PR description, there's a ceiling of a maximum of 3 retries. It's not explicitly written here, but because of the nature of how we track plans and detect when a plan has finished, there's always a lag of at least 8s (initial API call + 2 seconds * 4 retries) between retries, so we're not flooding the API.
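
For illustration, the lower bound on the spacing works out like this (a back-of-the-envelope sketch; the 2-second poll interval and 4 polls per attempt are the figures quoted above, not measured values):

package main

import (
	"fmt"
	"time"
)

func main() {
	// Figures from the comment above: the plan tracker polls roughly every
	// 2 seconds and goes through at least 4 polls per shutdown attempt, so
	// consecutive retries end up at least ~8s apart (plus network latency).
	pollInterval := 2 * time.Second
	minPollsPerAttempt := 4
	minSpacing := time.Duration(minPollsPerAttempt) * pollInterval
	fmt.Println("minimum spacing between retries:", minSpacing) // prints 8s
}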

Answering the comments below:

@jowiho's

  • Is retrying the best strategy when the API returns a timeout? What exactly timed out, and how do you know it won't time out again?

This retry is specific to the destroy operation (analogous to the deployment shutdown API), and it performs a best effort of up to 3 retries. If they all fail, the error is returned as it normally would be.

  • Should there be a back-off period in between retries? Or can you retry with a higher timeout?

There's no incremental backoff mechanism built in, but as I explained in the paragraph above, we're waiting at least 8s between retries (it might be more because of network I/O).

  • Why is this mechanism specific to destroying deployments? Don't other APIs return timeouts too?

They do not; this seems to be something specific to stopping the containers, where the allocator doesn't seem to stop them in time. I tried doubling the minimum timeout for any API operation (which, reading the codebase, is only hit on shutdown due to the capacity being 0) to 4096, but it didn't have much of an effect in reducing these errors.

  • Is there a better way than string matching to detect a timeout response? Does the API return an error code? We recently improved API responses for failed plans.

Not really. The reason is that we're calling the shutdown API, which is async and returns a 200; then, to track the progress of the pending plan, GET /deployment/<id> is called, which also returns a 200, and we read the plan log and process it accordingly.
I can go into details if you wish, but since the deployment has N applications and each can fail individually, a multierror is returned which wraps each of the errors. The only mechanism we have right now, without spending considerable time refactoring quite a lot of code, is matching the substring we want to see in the "failure".
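
In other words, the detection boils down to a substring check on the aggregated error message, roughly like this (an illustrative sketch; the helper name and error text are made up, not the provider's actual code):

package main

import (
	"errors"
	"fmt"
	"strings"
)

// shouldRetryShutdown reports whether a failed shutdown should be retried by
// matching the "Timeout Exceeded" substring anywhere in the wrapped error.
func shouldRetryShutdown(err error) bool {
	return err != nil && strings.Contains(err.Error(), "Timeout Exceeded")
}

func main() {
	// Illustrative multierror-style message: each application's plan failure
	// is flattened into a single aggregated error string.
	err := errors.New(`2 errors occurred:
	* resource "elasticsearch": plan failed: Timeout Exceeded
	* resource "kibana": plan failed: Timeout Exceeded`)
	fmt.Println(shouldRetryShutdown(err)) // true
}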

@ean5533

To my knowledge, the only (relatively) common cause of destroy plan failures is when Docker is in a bad state and unresponsive to delete container API calls, which is something that generally requires manual intervention to fix (for now!). That is, short-term retries probably wouldn't help until we implement automatic detection and recovery from Docker failures.

I believe that might be the cause. Interestingly enough, the containers are successfully created on that allocator where Docker is in a "bad state", but they only fail when they're stopped, even though the deployment lifespan was probably 5-15m.

By all means, I'm happy to provide any assistance required to help eliminate this problem completely. In the Jenkins acceptance test log there should be plenty of deployment IDs that can be used to investigate the timeline and logs of these failures.

Are there other failure scenarios that this change is targeting?

This change only targets the deployment shutdown Timeout Exceeded failures, no other failure scenarios are tackled with this change.

@ean5533

ean5533 commented May 11, 2021

I believe that might be the cause. Interestingly enough, the containers are successfully created on that allocator where Docker is in a "bad state", but they only fail when they're stopped, even though the deployment lifespan was probably 5-15m.

That is really interesting, and not something I was aware existed out in the wild. Would definitely be worth one of us Orch folks peeking into at some point.


@Aran-K Aran-K left a comment


Go code LGTM

re Terraform binary download failure:
We're pretty strict about this with the main Cloud stuff; we can be a bit more lax here, but in general (I'm sure you know this) downloading things during tests is something to avoid (flakiness, added time, etc.). The way to fix this is to bake the requirements into the images so there's nothing to download. There is prior art for this and Infra has all the tooling in place, and given this repo is only going to see more development as time goes on, it might be worth it.
Alternatively, we should at least add retries around the download part of the test (just as a step in the pipeline or whatever is easiest).

Not suggesting we need to do the above to merge this, but if this is common, an issue should at least be created to follow up.

@marclop
Contributor Author

marclop commented May 12, 2021

I just re-ran the tests before merging and they passed again, so I'm merging this for good.

@marclop marclop merged commit 28607b8 into elastic:master May 12, 2021
@marclop marclop deleted the f/add-destroy-retry-on-timeout-exceeded branch May 12, 2021 11:24