
store and verify s3 remote state checksum to avoid consistency issues. #14746

Merged
jbardin merged 2 commits into master from jbardin/s3-consistency on May 24, 2017

Conversation

@jbardin (Member) commented May 22, 2017

Updates to objects in S3 are only eventually consistent. This was an accepted limitation of storing state in S3, but because successive terraform runs are usually executed infrequently, it rarely posed a problem.

Now that we natively support state locking, we have inadvertently provided a way for users who aren't taking care to avoid consistency issues to encounter them more frequently. When locks are available, terraform can be executed concurrently without fear of concurrent modification, but runs can also happen close enough together in time to easily see S3's eventual consistency in action. This is especially true when blocking on a lock using -lock-timeout.

This PR allows the RemoteClient to use a DynamoDB table, when available, to store a checksum of the last written state, so the object can be verified by the next client to call Get. If a Get call fails to match the recorded checksum, the client will sleep and retry until a timeout, currently set to 10 seconds. Terraform currently doesn't have any sort of user feedback around RefreshState/Get, so we poll only for a short time before returning an error.
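As a rough sketch of that read-verify-retry flow (illustrative only, not the code in this PR; getMD5 and consistencyRetryPollInterval appear in the change, while the interface, function name, and timeout constant here are placeholders):

package s3state

import (
	"bytes"
	"crypto/md5"
	"fmt"
	"time"
)

// stateStore abstracts the two reads the sketch needs: fetching the state
// object from S3 and fetching the checksum recorded in the DynamoDB table.
type stateStore interface {
	get() ([]byte, error)    // raw state object from S3
	getMD5() ([]byte, error) // checksum recorded by the last Put
}

const (
	consistencyRetryTimeout      = 10 * time.Second // "currently set to 10 seconds"
	consistencyRetryPollInterval = 2 * time.Second
)

// getWithChecksum re-reads the state until its MD5 matches the recorded
// checksum, or until the timeout elapses.
func getWithChecksum(c stateStore) ([]byte, error) {
	deadline := time.Now().Add(consistencyRetryTimeout)

	for {
		data, err := c.get()
		if err != nil {
			return nil, err
		}

		expected, err := c.getMD5()
		if err != nil || len(expected) == 0 {
			// No recorded checksum (e.g. no DynamoDB table configured); nothing to verify.
			return data, nil
		}

		sum := md5.Sum(data)
		if bytes.Equal(sum[:], expected) {
			return data, nil
		}

		if time.Now().After(deadline) {
			return nil, fmt.Errorf("invalid state checksum: expected %x, got %x", expected, sum[:])
		}

		// S3 hasn't converged yet; wait briefly and re-read.
		time.Sleep(consistencyRetryPollInterval)
	}
}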

This has the drawback of requiring that all clients accessing the state agree on whether a lock_table is configured, but that seems like a good tradeoff for better state management. We should follow up with some documentation about the feature, and possibly rename lock_table to dynamo_table in the config to remove the idea that it's solely tied to state locks.

fixes #14639

@jbardin (Member, Author) commented May 22, 2017

=== RUN   TestBackend_impl
--- PASS: TestBackend_impl (0.00s)
=== RUN   TestBackendConfig
--- PASS: TestBackendConfig (1.53s)
=== RUN   TestBackend
--- PASS: TestBackend (5.49s)
	backend_test.go:248: creating S3 bucket terraform-remote-s3-test-592363cf in us-east-1
=== RUN   TestBackendLocked
--- PASS: TestBackendLocked (17.82s)
	backend_test.go:248: creating S3 bucket terraform-remote-s3-test-592363d4 in us-east-1
	testing.go:222: TestBackend: testing state locking for *s3.Backend
=== RUN   TestBackendExtraPaths
--- PASS: TestBackendExtraPaths (8.80s)
	backend_test.go:248: creating S3 bucket terraform-remote-s3-test-592363e6 in us-east-1
=== RUN   TestRemoteClient_impl
--- PASS: TestRemoteClient_impl (0.00s)
=== RUN   TestRemoteClient
--- PASS: TestRemoteClient (2.67s)
	backend_test.go:248: creating S3 bucket terraform-remote-s3-test-592363ef in us-east-1
=== RUN   TestRemoteClientLocks
--- PASS: TestRemoteClientLocks (12.90s)
	backend_test.go:248: creating S3 bucket terraform-remote-s3-test-592363f1 in us-east-1
=== RUN   TestRemoteClient_clientMD5
--- PASS: TestRemoteClient_clientMD5 (9.09s)
=== RUN   TestRemoteClient_stateChecksum
--- PASS: TestRemoteClient_stateChecksum (17.93s)
	backend_test.go:248: creating S3 bucket terraform-remote-s3-test-59236407 in us-east-1
PASS

@jbardin force-pushed the jbardin/s3-consistency branch 2 times, most recently from 11b9f0e to 7543800 on May 23, 2017 at 00:59
@apparentlymart (Contributor) left a comment


Functionally this seems good to me!

My inline comments are nits and UX things.

if err := c.putMD5(sum[:]); err != nil {
// if this errors out, we unfortunately have to error out altogether,
// since the next Get will inevitably fail.
return fmt.Errorf("[WARNING] failed to store state MD5: %s", err)
Contributor


Seems weird for an error message to start with [WARNING]...

Member Author


Oops, copied a log message into an error, will fix!
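For context, the Put side under discussion works roughly like this (a sketch; putMD5 is the helper quoted above, the rest of the names are placeholders):

package s3state

import (
	"crypto/md5"
	"fmt"
)

// checksumRecorder abstracts writing the state object to S3 and recording
// its checksum in the DynamoDB table.
type checksumRecorder interface {
	put(data []byte) error   // upload the state object to S3
	putMD5(sum []byte) error // record its checksum in DynamoDB
}

// putWithChecksum uploads the state and then records its MD5 so the next
// Get can verify that it has read a fully consistent copy.
func putWithChecksum(c checksumRecorder, data []byte) error {
	if err := c.put(data); err != nil {
		return err
	}

	sum := md5.Sum(data)
	if err := c.putMD5(sum[:]); err != nil {
		// If the checksum can't be recorded, the next Get would fail its
		// verification, so return an error rather than just logging a warning.
		return fmt.Errorf("failed to store state MD5: %s", err)
	}

	return nil
}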

consistencyRetryPollInterval = 2 * time.Second

// checksum didn't match the remote state
errBadChecksum = errors.New("invalid state checksum")
Contributor


Do you think there's something better we could say here? Based on what you'd said to me before, it seemed like it's expected for S3 to occasionally take longer than 10 seconds to converge, which would suggest this is an error message users would actually see in normal use, so ideally this message would include both an understandable statement of the problem and a suggested next step.

Perhaps:

errBadChecksumFmt = `state data in S3 does not have the expected content.

This may be caused by unusually-long delays in S3 processing a previous state update.
Please wait for a minute or two and try again. If this problem persists, and neither S3 nor
DynamoDB are experiencing an outage, the checksum stored in the DynamoDB table
may need to be manually updated to the following value:
    %x
`

(Where that %x marker would be passed the expected checksum value below.)

This approach has the disadvantage of not using a consistent error instance for all cases and thus making it harder for the caller to recognize the error, so if that's a concern then of course we could equally do a custom error type that has the expected state checksum as a field. That would then allow us to keep the error message itself simple and produce the more verbose error message at a higher layer, which may be preferable.
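For illustration, a custom error type along those lines might look like this (a sketch, not code from the PR):

package s3state

import "fmt"

// badChecksumError carries the expected and observed checksums so a higher
// layer can format a verbose, user-facing message.
type badChecksumError struct {
	Expected []byte // checksum recorded in the DynamoDB table
	Actual   []byte // checksum of the state object actually read from S3
}

func (e *badChecksumError) Error() string {
	return fmt.Sprintf("state data in S3 does not have the expected content (expected checksum %x, got %x)",
		e.Expected, e.Actual)
}

A caller could then recognize the error with a type assertion and decide at the UI layer how much detail to surface.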

Member Author


I was thinking this was too far removed from the UI to provide good feedback, but it looks like this will end up being put directly into a ui.Error.

I'm only checking the error value in one place right now too, so I think the UX trumps the clean value comparison in this case. I'll add your error message string.

}

if getSum, err := client.getMD5(); err == nil {
t.Fatalf("expecetd getMD5 error, got none. checksum: %x", getSum)
Contributor


expecetd

}

if !bytes.Equal(getSum, sum[:]) {
t.Fatalf("getMd5 returned the wrong checksum: expected %x, got %x", sum[:], getSum)
Contributor


getMd5 should be getMD5 here, I think?

Member Author


yup, I must have been tired this day.

jbardin added 2 commits May 24, 2017 13:39
Updates to objects in S3 are only eventually consistent. If the
RemoteClient has a DynamoDB table available, use that to store a
checksum of the last written state, so the object can be verified by the
next client to call Get.

Terraform currently doesn't have any sort of user feedback around
RefreshState/Get, so we poll only for a short time before returning an
error.
Have the s3 RemoteClient return a detailed error message to the user in
the case of a mismatched state checksum.
@jbardin force-pushed the jbardin/s3-consistency branch from 7543800 to 91be40a on May 24, 2017 at 17:54
@jbardin (Member, Author) commented May 24, 2017

Fixed the typos, and added a detailed error message.

@jbardin merged commit ef1d539 into master May 24, 2017
@jbardin deleted the jbardin/s3-consistency branch May 24, 2017 20:48
@ghost commented Apr 12, 2020

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.

If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@ghost locked and limited conversation to collaborators Apr 12, 2020