-
Notifications
You must be signed in to change notification settings - Fork 9.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
store and verify s3 remote state checksum to avoid consistency issues. #14746
Conversation
|
11b9f0e
to
7543800
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Functionally this seems good to me!
My inline comments are nits and UX things.
backend/remote-state/s3/client.go
Outdated
if err := c.putMD5(sum[:]); err != nil { | ||
// if this errors out, we unfortunately have to error out altogether, | ||
// since the next Get will inevitably fail. | ||
return fmt.Errorf("[WARNING] failed to store state MD5: %s", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems weird for an error message to start with [WARNING]
...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops, copied a log message into an error, will fix!
backend/remote-state/s3/client.go
Outdated
consistencyRetryPollInterval = 2 * time.Second | ||
|
||
// checksum didn't match the remote state | ||
errBadChecksum = errors.New("invalid state checksum") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think there's something better we could say here? Based on what you'd said to me before, it seemed like it's expected for S3 to occasionally take longer than 10 seconds to converge which would suggest this is an error message users would actually see in normal use, so ideally this message would include both an understandable statement of the problem and a suggested next step.
Perhaps:
errBadChecksumFmt = `state data in S3 does not have the expected content.
This may be caused by unusually-long delays in S3 processing a previous state update.
Please wait for a minute or two and try again. If this problem persists, and neither S3 nor
DynamoDB are experiencing an outage, the checksum stored in the DynamoDB table
may need to be manually updated to the following value:
%x
`
(Where that %x marker would be passed expected
as its value below.)
This approach has the disadvantage of not using a consistent error instance for all cases and thus making it harder for the caller to recognize the error, so if that's a concern then of course we could equally do a custom error type that has the expected state checksum as a field. That would then allow us to keep the error message itself simple and produce the more verbose error message at a higher layer, which may be preferable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking this was too far removed from the ui to provide good feedback, but It looks like this will end up being put directly into a ui.Error.
I'm only checking the error value in one place right now too, so I think the UX trumps the clean value comparison in this case. I'll add your error message string.
} | ||
|
||
if getSum, err := client.getMD5(); err == nil { | ||
t.Fatalf("expecetd getMD5 error, got none. checksum: %x", getSum) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
expecetd
} | ||
|
||
if !bytes.Equal(getSum, sum[:]) { | ||
t.Fatalf("getMd5 returned the wrong checksum: expected %x, got %x", sum[:], getSum) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
getMd5
should be getMD5
here, I think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yup, I must have been tired this day.
Updates to objects in S3 are only eventually consistent. If the RemoteClient has a DynamoDB table available, use that to store a checksum of the last written state, so the object can be verified by the next client to call Get. Terraform currently doesn't have any sort of user feedback around RefreshState/Get, so we poll only for a short time before returning an error.
Have the s3 RemoteClient return a detailed error message to the user in the case of a mismatch state checksum.
7543800
to
91be40a
Compare
Fixed the typos, and added a detailed error message. |
I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further. |
Updates to objects in S3 are only eventually consistent. This was an accepted limitation of storing state in S3, but due to the infrequency which successive terraform runs are usually executed, it rarely posed a problem.
Now that we natively support state locking, we inadvertently have provided a way for users who aren't taking care to avoid consistency issue to encounter them more frequently. When locks are available, terraform can be executing concurrently without fear of concurrent modification, but can also be executed temporally close enough to easily see S3 eventually consistency in action, which is especially true when blocking on a lock using
-lock-timeout
The PR will allow the RemoteClient to use a DynamoDB table when available to store a checksum of the last written state, so the object can be verified by the next client to call Get. If a Get call fails to match the recorded checksum, the Client will sleep and try again up until a timeout, currently set to 10 seconds. Terraform currently doesn't have any sort of user feedback around RefreshState/Get, so we poll only for a short time before returning an error.
This has the drawback of requiring all access to the state to have a lock_table configured, or not, but that seems like a good tradeoff for better state management. We should follow up with some documentation about the feature, and possibly rename
lock_table
todynamo_table
in the config to remove the idea that it's solely tied to state locks.fixes #14639