[WIP] Safety features for Terraform State #6540

apparentlymart · 2016-05-08T21:35:37Z

Through real-world use of Terraform in production I have encountered a variety of "gotchas" relating to state management.

This PR is an attempt to tackle a couple of these where I think I have a reasonable path forward. These changes are grouped into a single PR because they share a common building-block:

- "State lineage" concept, which allows us to determine when one state is an earlier or later version of another state vs. when the two states are entirely unrelated. Comparing Serial is only expected to produce meaningful results when Lineage matches.
  - Some existing tests need work here because they need to meet the lineage-matching requirements that didn't exist before.

The "features" themselves are:

- Detect when a plan applies to a different state than the current state and refuse to apply that plan. (formerly the separate PR [WIP] Refuse to apply a stale plan #4616) Without locking this is not a 100% guarantee, but it makes things safer than they were before and I expect it to be pretty effective at human timescales. (Something like remote: Introduce locking mechanism into remote backend interface #5036 in addition could make this even safer; see some discussion over there for more details.)
- When syncing local and remote state, fail with an error if the two states are of different lineage. This includes running terraform remote config to enable or re-configure remote state when a local state is already present. (formerly the separate PR [WIP] Tracking Remote State Lineage to Reduce Mistakes #4618)
  - As a special case, if the local cache has a different lineage but is entirely empty of resources, just silently drop it and replace it with the remote state, since the loss of an empty state is trivial to recover from and this will reduce friction when users are bootstrapping a new config.

The primary goal here is to return an error when the user seems to be accidentally doing something dangerous, with as little impact as possible on "legitimate" workflows. In a future version of Terraform we may make more fundamental changes to these features to help the user avoid these mistakes in the first place, but this is intended as a short-term fix to reduce the risk of a state-related catastrophe.

Rethink how the "no stale plans" change can work in light of core: Do not persist state after plans #6811

apparentlymart · 2016-05-16T16:00:00Z

@phinze I think this is ready for review now. This is the initial set of "state safety" changes we discussed a couple weeks ago for possible inclusion in 0.7.

gkze · 2016-05-17T23:00:34Z

This is amazing! Would love to see this merged soon 🎉

apparentlymart · 2016-05-31T19:59:48Z

This is now broken by the change from #6811. An entirely different approach will be needed for preventing the application of stale plans now that planning doesn't actually update the persistent state.

The lineage of a state is an identifier shared by a set of states whose serials are meaningfully comparable because they are produced by progressive Refresh/Apply operations from the same initial empty state. It is set to a type-4 (random) UUID when a new state is created and is then preserved across all subsequent changes. States written before this change will not have a lineage, but users may wish to set one on an existing state in order to get the safety benefits that lineage will come to provide, so an empty lineage is considered to be compatible with all lineages.
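To make the rule above concrete, here is a minimal Go sketch of the comparison it describes. The `State` type and the `NewLineage`, `SameLineage` and `CompareSerial` names are hypothetical and chosen only for illustration; they are not taken from this PR's diff, and the real implementation may be structured differently.

```go
package main

import (
	"crypto/rand"
	"fmt"
)

// State is a hypothetical stand-in holding only the fields this sketch needs.
type State struct {
	Lineage string // empty for states written before lineage existed
	Serial  int64
}

// NewLineage returns a random (version 4) UUID string for a brand new state.
func NewLineage() string {
	var b [16]byte
	if _, err := rand.Read(b[:]); err != nil {
		panic(err)
	}
	b[6] = (b[6] & 0x0f) | 0x40 // set version 4
	b[8] = (b[8] & 0x3f) | 0x80 // set RFC 4122 variant
	return fmt.Sprintf("%x-%x-%x-%x-%x", b[0:4], b[4:6], b[6:8], b[8:10], b[10:16])
}

// SameLineage reports whether the serials of a and b are meaningfully
// comparable. An empty lineage is treated as compatible with any lineage.
func SameLineage(a, b State) bool {
	return a.Lineage == "" || b.Lineage == "" || a.Lineage == b.Lineage
}

// CompareSerial returns -1, 0 or 1 when a is older than, equal to or newer
// than b. ok is false when the lineages differ, in which case the serials
// say nothing about how the two states relate.
func CompareSerial(a, b State) (cmp int, ok bool) {
	if !SameLineage(a, b) {
		return 0, false
	}
	switch {
	case a.Serial < b.Serial:
		return -1, true
	case a.Serial > b.Serial:
		return 1, true
	}
	return 0, true
}

func main() {
	local := State{Lineage: NewLineage(), Serial: 3}
	remote := State{Lineage: local.Lineage, Serial: 5}
	unrelated := State{Lineage: NewLineage(), Serial: 9}

	cmp, ok := CompareSerial(local, remote)
	fmt.Println("same lineage:", cmp, ok)

	_, ok = CompareSerial(local, unrelated)
	fmt.Println("unrelated comparable:", ok)
}
```

Returning a separate `ok` flag, rather than folding a mismatch into the ordering result, keeps callers from mistaking "unrelated" for "equal".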

After running "terraform plan -out=tfplan" and then "terraform apply tfplan" the plan file is left on disk and could previously be applied a second time. Here we add a new constraint that prevents the use of a plan that was not produced from the current state, thus avoiding that problem. It will also reduce race conditions (on the human timescale) between running "plan" and later running "apply", in environments where multiple people/processes are using Terraform with the same remote state. This hazard cannot be eliminated entirely without proper locking, but with this change in place the race condition exists only between two concurrent *applies*, rather than across the whole time period between plan and apply.

Accidentally losing a state with resources in it can be anywhere from annoying to catastrophic. A common cause of such a problem is accidentally clobbering a remote state with an unrelated local state or vice-versa. Here we introduce safety checks that exploit the new "lineage" concept to ensure that once a state location is established it can only be updated with new states from the same lineage as the initial state. We also make the remote state caching mechanism treat a lineage mismatch as a conflict, thus ensuring that automatic state syncing will stop if the user manages to get into a broken, mismatched state.

As a special exception, we allow *empty* states to be "clobbered" regardless of lineage. This enables remote state to be configured easily in the common case where earlier user actions have implicitly caused an empty local state to be created, and also allows the remote state left behind from a config that has been destroyed to be overwritten by a new, unrelated state from a different config.

The priority here is preventing actions that should never occur, so the UX is not polished. Later we may wish to make changes at higher levels of abstraction, either to prevent these situations from arising in the first place (e.g. making remote configuration automatic based on config) or to give the user more guidance on resolution.

apparentlymart · 2016-06-06T15:01:49Z

In light of #6811 I'm closing this in favor of #7026, which has the same introduction of lineage and the second feature from this PR, but skips the "stale plans" feature that now no longer makes sense.

ghost · 2020-04-19T02:24:03Z

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.

If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

apparentlymart added enhancement core labels May 8, 2016

This was referenced May 8, 2016

[WIP] Refuse to apply a stale plan #4616

Closed

[WIP] Tracking Remote State Lineage to Reduce Mistakes #4618

Closed

apparentlymart force-pushed the f-state-safety branch from f66f674 to de1310d Compare May 8, 2016 22:48

apparentlymart mentioned this pull request May 13, 2016

remote: Introduce locking mechanism into remote backend interface #5036

Closed

apparentlymart force-pushed the f-state-safety branch 8 times, most recently from b32e4a8 to ee28ef8 Compare May 16, 2016 15:51

apparentlymart changed the title ~~[WIP] Safety features for Terraform State~~ Safety features for Terraform State May 16, 2016

This was referenced May 19, 2016

Re-configuring remote state raises conflict on pull #5410

Closed

core: Do not persist state after plans #6811

Merged

apparentlymart changed the title ~~Safety features for Terraform State~~ [WIP] Safety features for Terraform State May 31, 2016

apparentlymart added 3 commits June 6, 2016 07:48

apparentlymart force-pushed the f-state-safety branch from ee28ef8 to de82005 Compare June 6, 2016 14:49

apparentlymart mentioned this pull request Jun 6, 2016

Prevent overwriting states with other unrelated states #7026

Closed

apparentlymart closed this Jun 6, 2016

stack72 deleted the f-state-safety branch November 25, 2016 16:27

ghost locked and limited conversation to collaborators Apr 19, 2020
