
store and verify s3 remote state checksum to avoid consistency issues. #14746

Merged
jbardin merged 2 commits into master from jbardin/s3-consistency on May 24, 2017

Conversation

@jbardin (Member) commented May 22, 2017

Updates to objects in S3 are only eventually consistent. This was an accepted limitation of storing state in S3, but because successive terraform runs are usually executed infrequently, it rarely posed a problem.

Now that we natively support state locking, we have inadvertently provided a way for users who aren't taking care to avoid consistency issues to encounter them more frequently. When locks are available, terraform can be executed concurrently without fear of concurrent modification, but runs can also happen close enough together in time to easily see S3's eventual consistency in action. This is especially true when blocking on a lock using -lock-timeout.

This PR allows the RemoteClient to use a DynamoDB table, when available, to store a checksum of the last written state, so the object can be verified by the next client to call Get. If a Get call fails to match the recorded checksum, the client will sleep and retry until a timeout, currently set to 10 seconds. Terraform currently doesn't have any sort of user feedback around RefreshState/Get, so we poll only for a short time before returning an error.
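As a rough sketch of that read-verify-retry flow (illustrative only, not the code in this PR; getMD5 and consistencyRetryPollInterval appear in the change, while the interface, function name, and timeout constant here are placeholders):

package s3state

import (
	"bytes"
	"crypto/md5"
	"fmt"
	"time"
)

// stateStore abstracts the two reads the sketch needs: fetching the state
// object from S3 and fetching the checksum recorded in the DynamoDB table.
type stateStore interface {
	get() ([]byte, error)    // raw state object from S3
	getMD5() ([]byte, error) // checksum recorded by the last Put
}

const (
	consistencyRetryTimeout      = 10 * time.Second // "currently set to 10 seconds"
	consistencyRetryPollInterval = 2 * time.Second
)

// getWithChecksum re-reads the state until its MD5 matches the recorded
// checksum, or until the timeout elapses.
func getWithChecksum(c stateStore) ([]byte, error) {
	deadline := time.Now().Add(consistencyRetryTimeout)

	for {
		data, err := c.get()
		if err != nil {
			return nil, err
		}

		expected, err := c.getMD5()
		if err != nil || len(expected) == 0 {
			// No recorded checksum (e.g. no DynamoDB table configured); nothing to verify.
			return data, nil
		}

		sum := md5.Sum(data)
		if bytes.Equal(sum[:], expected) {
			return data, nil
		}

		if time.Now().After(deadline) {
			return nil, fmt.Errorf("invalid state checksum: expected %x, got %x", expected, sum[:])
		}

		// S3 hasn't converged yet; wait briefly and re-read.
		time.Sleep(consistencyRetryPollInterval)
	}
}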

This has the drawback of requiring that all clients accessing the state agree on whether a lock_table is configured, but that seems like a good tradeoff for better state management. We should follow up with some documentation about the feature, and possibly rename lock_table to dynamo_table in the config to remove the idea that it's solely tied to state locks.

fixes #14639

@jbardin (Member, Author) commented May 22, 2017

=== RUN   TestBackend_impl
--- PASS: TestBackend_impl (0.00s)
=== RUN   TestBackendConfig
--- PASS: TestBackendConfig (1.53s)
=== RUN   TestBackend
--- PASS: TestBackend (5.49s)
	backend_test.go:248: creating S3 bucket terraform-remote-s3-test-592363cf in us-east-1
=== RUN   TestBackendLocked
--- PASS: TestBackendLocked (17.82s)
	backend_test.go:248: creating S3 bucket terraform-remote-s3-test-592363d4 in us-east-1
	testing.go:222: TestBackend: testing state locking for *s3.Backend
=== RUN   TestBackendExtraPaths
--- PASS: TestBackendExtraPaths (8.80s)
	backend_test.go:248: creating S3 bucket terraform-remote-s3-test-592363e6 in us-east-1
=== RUN   TestRemoteClient_impl
--- PASS: TestRemoteClient_impl (0.00s)
=== RUN   TestRemoteClient
--- PASS: TestRemoteClient (2.67s)
	backend_test.go:248: creating S3 bucket terraform-remote-s3-test-592363ef in us-east-1
=== RUN   TestRemoteClientLocks
--- PASS: TestRemoteClientLocks (12.90s)
	backend_test.go:248: creating S3 bucket terraform-remote-s3-test-592363f1 in us-east-1
=== RUN   TestRemoteClient_clientMD5
--- PASS: TestRemoteClient_clientMD5 (9.09s)
=== RUN   TestRemoteClient_stateChecksum
--- PASS: TestRemoteClient_stateChecksum (17.93s)
	backend_test.go:248: creating S3 bucket terraform-remote-s3-test-59236407 in us-east-1
PASS

@jbardin force-pushed the jbardin/s3-consistency branch 2 times, most recently from 11b9f0e to 7543800 on May 23, 2017 at 00:59
@apparentlymart (Contributor) left a comment


Functionally this seems good to me!

My inline comments are nits and UX things.

if err := c.putMD5(sum[:]); err != nil {
// if this errors out, we unfortunately have to error out altogether,
// since the next Get will inevitably fail.
return fmt.Errorf("[WARNING] failed to store state MD5: %s", err)
Contributor


Seems weird for an error message to start with [WARNING]...

Member Author


Oops, copied a log message into an error, will fix!
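For context, the Put side under discussion works roughly like this (a sketch; putMD5 is the helper quoted above, the rest of the names are placeholders):

package s3state

import (
	"crypto/md5"
	"fmt"
)

// checksumRecorder abstracts writing the state object to S3 and recording
// its checksum in the DynamoDB table.
type checksumRecorder interface {
	put(data []byte) error   // upload the state object to S3
	putMD5(sum []byte) error // record its checksum in DynamoDB
}

// putWithChecksum uploads the state and then records its MD5 so the next
// Get can verify that it has read a fully consistent copy.
func putWithChecksum(c checksumRecorder, data []byte) error {
	if err := c.put(data); err != nil {
		return err
	}

	sum := md5.Sum(data)
	if err := c.putMD5(sum[:]); err != nil {
		// If the checksum can't be recorded, the next Get would fail its
		// verification, so return an error rather than just logging a warning.
		return fmt.Errorf("failed to store state MD5: %s", err)
	}

	return nil
}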

consistencyRetryPollInterval = 2 * time.Second

// checksum didn't match the remote state
errBadChecksum = errors.New("invalid state checksum")
Contributor


Do you think there's something better we could say here? Based on what you'd said to me before, it seemed like it's expected for S3 to occasionally take longer than 10 seconds to converge, which would suggest this is an error message users would actually see in normal use, so ideally this message would include both an understandable statement of the problem and a suggested next step.

Perhaps:

errBadChecksumFmt = `state data in S3 does not have the expected content.

This may be caused by unusually-long delays in S3 processing a previous state update.
Please wait for a minute or two and try again. If this problem persists, and neither S3 nor
DynamoDB are experiencing an outage, the checksum stored in the DynamoDB table
may need to be manually updated to the following value:
    %x
`

(Where that %x marker would be passed the expected checksum value below.)

This approach has the disadvantage of not using a consistent error instance for all cases and thus making it harder for the caller to recognize the error, so if that's a concern then of course we could equally do a custom error type that has the expected state checksum as a field. That would then allow us to keep the error message itself simple and produce the more verbose error message at a higher layer, which may be preferable.
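For illustration, a custom error type along those lines might look like this (a sketch, not code from the PR):

package s3state

import "fmt"

// badChecksumError carries the expected and observed checksums so a higher
// layer can format a verbose, user-facing message.
type badChecksumError struct {
	Expected []byte // checksum recorded in the DynamoDB table
	Actual   []byte // checksum of the state object actually read from S3
}

func (e *badChecksumError) Error() string {
	return fmt.Sprintf("state data in S3 does not have the expected content (expected checksum %x, got %x)",
		e.Expected, e.Actual)
}

A caller could then recognize the error with a type assertion and decide at the UI layer how much detail to surface.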

Member Author


I was thinking this was too far removed from the UI to provide good feedback, but it looks like this will end up being put directly into a ui.Error.

I'm only checking the error value in one place right now too, so I think the UX trumps the clean value comparison in this case. I'll add your error message string.

}

if getSum, err := client.getMD5(); err == nil {
t.Fatalf("expecetd getMD5 error, got none. checksum: %x", getSum)
Contributor


expecetd

}

if !bytes.Equal(getSum, sum[:]) {
t.Fatalf("getMd5 returned the wrong checksum: expected %x, got %x", sum[:], getSum)
Contributor


getMd5 should be getMD5 here, I think?

Member Author


yup, I must have been tired this day.

jbardin added 2 commits May 24, 2017 13:39
Updates to objects in S3 are only eventually consistent. If the
RemoteClient has a DynamoDB table available, use that to store a
checksum of the last written state, so the object can be verified by the
next client to call Get.

Terraform currently doesn't have any sort of user feedback around
RefreshState/Get, so we poll only for a short time before returning an
error.
Have the s3 RemoteClient return a detailed error message to the user in
the case of a mismatched state checksum.
@jbardin force-pushed the jbardin/s3-consistency branch from 7543800 to 91be40a on May 24, 2017 at 17:54
@jbardin (Member, Author) commented May 24, 2017

Fixed the typos, and added a detailed error message.

@jbardin merged commit ef1d539 into master May 24, 2017
@jbardin deleted the jbardin/s3-consistency branch May 24, 2017 20:48
@ghost commented Apr 12, 2020

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.

If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@ghost locked and limited conversation to collaborators Apr 12, 2020