[WIP] Safety features for Terraform State #6540
@phinze I think this is ready for review now. This is the initial set of "state safety" changes we discussed a couple weeks ago for possible inclusion in 0.7.
This is amazing! Would love to see this merged soon 🎉
This is now broken by the change from #6811. An entirely different approach will be needed for preventing the application of stale plans now that planning doesn't actually update the persistent state.
The lineage of a state is an identifier shared by a set of states whose serials are meaningfully comparable because they are produced by progressive Refresh/Apply operations from the same initial empty state. This is initialized as a type-4 (random) UUID when a new state is initialized and then preserved on all other changes. Since states before this change will not have lineage but users may wish to set a lineage for an existing state in order to get the safety benefits it will grow to imply, an empty lineage is considered to be compatible with all lineages.
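The lineage rule described above can be sketched in Go as follows. This is only an illustration under stated assumptions, not the code from this PR: the names `newLineage` and `sameLineage` are hypothetical, and the UUID is built by hand from `crypto/rand` rather than with whatever helper Terraform actually uses.

```go
package main

import (
	"crypto/rand"
	"fmt"
)

// newLineage returns a random (version 4) UUID string, to be generated once
// when a brand-new state is created. Hypothetical helper for illustration.
func newLineage() (string, error) {
	var b [16]byte
	if _, err := rand.Read(b[:]); err != nil {
		return "", err
	}
	b[6] = (b[6] & 0x0f) | 0x40 // set the version field to 4
	b[8] = (b[8] & 0x3f) | 0x80 // set the RFC 4122 variant bits
	return fmt.Sprintf("%x-%x-%x-%x-%x", b[0:4], b[4:6], b[6:8], b[8:10], b[10:16]), nil
}

// sameLineage reports whether two states may be treated as belonging to the
// same lineage. An empty lineage (a state written before lineage existed)
// is considered compatible with every lineage.
func sameLineage(a, b string) bool {
	return a == "" || b == "" || a == b
}
```

Serials would only be compared after `sameLineage` returns true; across different lineages a serial comparison carries no meaning.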
After running "terraform plan -out=tfplan" and then "terraform apply tfplan" the plan file is left on disk, and previously it could be applied a second time. Here we add a new constraint that prevents the use of a plan that was not produced from the current state, thus avoiding that problem. It will also reduce race conditions (on the human timescale) between running "plan" and later running "apply", in environments where multiple people/processes are using Terraform with the same remote state. This hazard cannot be eliminated entirely without proper locking, but with this change in place the race condition exists only between two concurrent *applies*, as opposed to spanning the whole time period between plan and apply.
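One way to picture the stale-plan constraint is the sketch below. It assumes, purely for illustration, that a plan records the lineage and serial of the state it was computed from and that every apply advances the serial; the commit message does not spell out the mechanism, and `stateRef` and `checkPlanFresh` are hypothetical names.

```go
package main

import "fmt"

// stateRef identifies one snapshot of a state (hypothetical type).
type stateRef struct {
	Lineage string
	Serial  int64
}

// checkPlanFresh returns an error when the plan was not produced from the
// current state. Assumption: the plan stores the stateRef it was built from.
func checkPlanFresh(plan, current stateRef) error {
	// An empty lineage is compatible with every lineage.
	if plan.Lineage != "" && current.Lineage != "" && plan.Lineage != current.Lineage {
		return fmt.Errorf("plan lineage %q does not match state lineage %q",
			plan.Lineage, current.Lineage)
	}
	// Any state change since planning makes the plan stale.
	if plan.Serial != current.Serial {
		return fmt.Errorf("plan is stale: created at serial %d, state is now at serial %d",
			plan.Serial, current.Serial)
	}
	return nil
}
```

Under these assumptions, applying "tfplan" once advances the serial, so a second apply of the same file fails the serial comparison.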
Accidentally losing a state with resources in it can be anywhere from annoying to catastrophic. A common cause of such a problem is accidentally clobbering a remote state with an unrelated local state or vice-versa. Here we introduce safety checks that exploit the new "lineage" concept to ensure that once a state location is established it can only be updated with new states from the same lineage as the initial state. We also make the remote state caching mechanism treat a lineage mismatch as a conflict, thus ensuring that automatic state syncing will stop if the user manages to get into a broken, mismatched state.
As a special exception, we allow *empty* states to be "clobbered" regardless of lineage. This enables remote state to be configured easily in the common case where earlier user actions have implicitly caused an empty local state to be created, and also allows the remote state left behind from a config that has been destroyed to be overwritten by a new, unrelated state from a different config.
The priority here is preventing actions that should never occur, so the UX is not polished. Later we may wish to make changes at higher levels of abstraction to either prevent these situations from arising in the first place (e.g. making remote configuration automatic based on config) or give the user more guidance on resolution.
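The overwrite rule, including the empty-state exception, might look roughly like this. It is a minimal sketch rather than the PR's implementation: `stateMeta`, `checkOverwrite`, and `errLineageMismatch` are hypothetical names, and "empty" is modeled simply as a resource count of zero.

```go
package main

import "errors"

// stateMeta carries just the fields this check needs (hypothetical type).
type stateMeta struct {
	Lineage   string
	Resources int // number of resources tracked in the state
}

var errLineageMismatch = errors.New("refusing to overwrite state from a different lineage")

// checkOverwrite decides whether incoming may replace existing at a given
// state location.
func checkOverwrite(existing, incoming stateMeta) error {
	// Special exception: an empty state may be clobbered regardless of lineage.
	if existing.Resources == 0 {
		return nil
	}
	// States without a lineage are compatible with every lineage.
	if existing.Lineage == "" || incoming.Lineage == "" {
		return nil
	}
	if existing.Lineage != incoming.Lineage {
		return errLineageMismatch
	}
	return nil
}
```

A remote state cache could treat `errLineageMismatch` the same way it treats any other conflict and stop syncing until the user resolves it.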
I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.
Through real-world use of Terraform in production I have encountered a variety of "gotchas" relating to state management.
This PR is an attempt to tackle a couple of these where I think I have a reasonable path forward. These changes are grouped into a single PR because they share a common building-block:
Serial is only expected to produce meaningful results when Lineage matches.
The "features" themselves are:
terraform remote config to enable or re-configure remote state when a local state is already present. (formerly the separate PR [WIP] Tracking Remote State Lineage to Reduce Mistakes #4618)
The primary goal here is to return an error when the user seems to be accidentally doing something dangerous, with as little impact as possible on "legitimate" workflows. In a future version of Terraform we may make more fundamental changes to these features to help the user avoid these mistakes in the first place, but this is intended as a short-term fix to reduce the risk of a state-related catastrophe.