intermittent errors pulling state after successful plan/apply #119

Open
cludden opened this issue May 8, 2020 · 5 comments

Comments

@cludden

cludden commented May 8, 2020

Often, when running many parallel plans or applies against a single resource with different workspaces, we hit this intermittent error, which "fails" the step even though the apply itself completed:

▼ ▼ ▼ ▼ ▼ ▼ ▼ ▼ ▼ ▼ Terraform Apply ▼ ▼ ▼ ▼ ▼ ▼ ▼ ▼ ▼ ▼

# ..

Apply complete! Resources: 0 added, 2 changed, 0 destroyed.


Outputs:

  # ..

▲ ▲ ▲ ▲ ▲ ▲ ▲ ▲ ▲ ▲ Terraform Apply ▲ ▲ ▲ ▲ ▲ ▲ ▲ ▲ ▲ ▲

Failed To Run Terraform Apply!

2020/05/08 19:15:08 Apply Error: Error running `state pull`: exit status 1, Output: 

@ljfranklin
Owner

@cludden I'm not aware of any reasons why running state pull would fail regularly. I added some additional logging around that command here. That change is present in the ljfranklin/terraform-resource:latest and ljfranklin/terraform-resource:0.12.24 images. Try running for a bit with that change to see if you get a more informative error message.

@cludden
Author

cludden commented Jul 6, 2020

After much head scratching, I believe this is due to an S3 race condition and happens very intermittently (roughly 1 in 500 builds). Would you accept a PR that adds some retry logic around this step?
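For illustration, the kind of retry logic proposed here could look roughly like the sketch below. It is not code from the resource: the function name `pull_state_with_retry`, the attempt count, and the backoff values are all hypothetical, and it assumes the `terraform` binary is on the `PATH`. The command runner is injectable so the retry behaviour can be exercised without Terraform installed.

```python
import subprocess
import time


def pull_state_with_retry(run=None, attempts=5, base_delay=1.0, sleep=time.sleep):
    """Run `terraform state pull`, retrying with exponential backoff.

    `run` is a hypothetical hook: any callable returning an object with
    `returncode`, `stdout` and `stderr` attributes. By default it shells
    out to the real CLI.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")

    if run is None:
        def run():
            return subprocess.run(
                ["terraform", "state", "pull"],
                capture_output=True,
                text=True,
            )

    last = None
    for attempt in range(attempts):
        last = run()
        if last.returncode == 0:
            return last.stdout  # the state JSON printed by the CLI
        if attempt < attempts - 1:
            # Back off before the next try: base_delay, 2x, 4x, ...
            sleep(base_delay * 2 ** attempt)

    raise RuntimeError(
        f"state pull failed after {attempts} attempts: {last.stderr.strip()}"
    )
```

A blanket retry like this treats every non-zero exit as transient, which would also mask real failures such as bad credentials, so a real implementation would probably want to inspect the error output first.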

@ljfranklin
Owner

@cludden the S3 backend already retries 5 times by default: https://www.terraform.io/docs/backends/types/s3.html#max_retries. Try checking whether the Terraform code treats your error as retryable, e.g. 404 might not be. Or maybe the sleep between retries is too short. In any case, if at all possible I'd rather any retry logic live in Terraform itself so that all Terraform users get the benefit.

@cludden
Author

cludden commented Jul 10, 2020

I think the race condition is not in Terraform but in the resource: because of the S3 consistency model, calling state pull so quickly after a successful apply has pushed an updated state file can return stale data. Some details from the "Amazon S3 data consistency model" section of the S3 docs:

However, information about the changes must replicate across Amazon S3, which can take some time, and so you might observe the following behaviors:

A process writes a new object to Amazon S3 and immediately lists keys within its bucket. Until the change is fully propagated, the object might not appear in the list.

A process replaces an existing object and immediately tries to read it. Until the change is fully propagated, Amazon S3 might return the previous data.

A process deletes an existing object and immediately tries to read it. Until the deletion is fully propagated, Amazon S3 might return the deleted data.

A process deletes an existing object and immediately lists keys within its bucket. Until the deletion is fully propagated, Amazon S3 might list the deleted object.

Here is an updated error with the additional logging you added (thanks again, btw!):

▼ ▼ ▼ ▼ ▼ ▼ ▼ ▼ ▼ ▼ Terraform Apply ▼ ▼ ▼ ▼ ▼ ▼ ▼ ▼ ▼ ▼

<redacted>

Apply complete! Resources: 1 added, 0 changed, 0 destroyed.


Outputs:

<redacted>
▲ ▲ ▲ ▲ ▲ ▲ ▲ ▲ ▲ ▲ Terraform Apply ▲ ▲ ▲ ▲ ▲ ▲ ▲ ▲ ▲ ▲

Failed To Run Terraform Apply!

2020/06/18 22:36:21 Apply Error: Error running `state pull`: exit status 1, Output: Failed to refresh state: state data in S3 does not have the expected content.


This may be caused by unusually long delays in S3 processing a previous state
update.  Please wait for a minute or two and try again. If this problem
persists, and neither S3 nor DynamoDB are experiencing an outage, you may need
to manually verify the remote state and update the Digest value stored in the
DynamoDB table to the following value: 575f3c723db817133af25135a8afa327
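The error above suggests that the backend compares the state object it reads from S3 against a digest stored in DynamoDB. As a rough sketch of how the remote state could be verified manually, the snippet below hashes a downloaded state file and compares the result to the digest from the message. It assumes the digest is the hex MD5 of the raw state file bytes (the 32-character value is consistent with that); the helper name `state_matches_digest` is hypothetical.

```python
import hashlib


def state_matches_digest(state_bytes: bytes, expected_md5_hex: str) -> bool:
    """Return True if the MD5 of the raw state bytes equals the expected digest.

    Assumption: the stored Digest value is a lowercase hex MD5 of the
    state file exactly as stored in S3.
    """
    actual = hashlib.md5(state_bytes).hexdigest()
    return actual == expected_md5_hex.strip().lower()
```

With a state file downloaded from the bucket, usage would be along the lines of `state_matches_digest(open("terraform.tfstate", "rb").read(), "575f3c723db817133af25135a8afa327")`, where the local filename is a placeholder. A mismatch shortly after an apply, followed by a match a little later, would be consistent with the stale-read explanation.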

@ljfranklin
Owner

@cludden Terraform is retrying for 10 seconds before returning that error: https://github.com/hashicorp/terraform/blob/10d94fb764dd7762f3e8343fb7d987056fe9c830/backend/remote-state/s3/client.go#L57-L95. Maybe that hardcoded 10 second value should be bumped or made configurable. I'd still suggest opening an issue/PR on Terraform itself. The goal is that users can run terraform state pull and it just works without users needing to roll their own retry wrappers.
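To make the trade-off concrete, here is a minimal sketch of a time-bounded poll of the kind described: keep re-checking until the data looks right or a fixed window elapses. It is not Terraform's code. The 10 second default mirrors the value mentioned in this comment, while the function name `poll_until` and the 2 second interval are placeholders chosen for illustration.

```python
import time


def poll_until(check, timeout=10.0, interval=2.0,
               clock=time.monotonic, sleep=time.sleep):
    """Call `check()` until it returns a non-None value or `timeout` elapses.

    `check` is a hypothetical callable, e.g. one that re-reads the state
    and returns it only when its digest matches the expected value.
    """
    deadline = clock() + timeout
    while True:
        result = check()
        if result is not None:
            return result
        # Give up if another full interval would overshoot the deadline.
        if clock() + interval > deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        sleep(interval)
```

Making the window configurable then amounts to passing a larger `timeout`, for example `poll_until(check, timeout=60.0)`, at the cost of slower failure when the mismatch is genuine rather than a propagation delay.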
