[WIP] Safety features for Terraform State #6540

Closed
wants to merge 3 commits into from

Contributor

@apparentlymart apparentlymart commented May 8, 2016

Through real-world use of Terraform in production I have encountered a variety of "gotchas" relating to state management.

This PR is an attempt to tackle a couple of these where I think I have a reasonable path forward. These changes are grouped into a single PR because they share a common building-block:

  • "State lineage" concept, which allows us to determine when one state is an earlier or later version of another state vs. when the two states are entirely unrelated. Comparing Serial is only expected to produce meaningful results when Lineage matches.
    • Some existing tests need work here because they need to meet the lineage-matching requirements that didn't exist before.

The "features" themselves are:

  • Detect when a plan applies to a different state than the current state and refuse to apply that plan. (formerly the separate PR [WIP] Refuse to apply a stale plan #4616) Without locking this is not a 100% guarantee, but it makes things safer than they were before and I expect it to be pretty effective at human timescales. (Adding something like remote: Introduce locking mechanism into remote backend interface #5036 could make this even safer; see some discussion over there for more details.)
  • When syncing local and remote state, fail with an error if the two states are of different lineage. This includes running terraform remote config to enable or re-configure remote state when a local state is already present. (formerly the separate PR [WIP] Tracking Remote State Lineage to Reduce Mistakes #4618)
    • As a special case, if the local cache has a different lineage but contains no resources at all, silently drop it and replace it with the remote state. The loss of an empty state is trivial to recover from, and this reduces friction when users are bootstrapping a new config.

The primary goal here is to return an error when the user seems to be accidentally doing something dangerous, with as little impact as possible on "legitimate" workflows. In a future version of Terraform we may make more fundamental changes to these features to help the user avoid these mistakes in the first place, but this is intended as a short-term fix to reduce the risk of a state-related catastrophe.

@apparentlymart apparentlymart force-pushed the f-state-safety branch 8 times, most recently from b32e4a8 to ee28ef8 Compare May 16, 2016 15:51
@apparentlymart apparentlymart changed the title [WIP] Safety features for Terraform State Safety features for Terraform State May 16, 2016
@apparentlymart
Contributor Author

@phinze I think this is ready for review now. This is the initial set of "state safety" changes we discussed a couple weeks ago for possible inclusion in 0.7.

@gkze
Contributor

gkze commented May 17, 2016

This is amazing! Would love to see this merged soon 🎉

@apparentlymart apparentlymart changed the title Safety features for Terraform State [WIP] Safety features for Terraform State May 31, 2016
@apparentlymart
Contributor Author

This is now broken by the change from #6811. An entirely different approach will be needed for preventing the application of stale plans now that planning doesn't actually update the persistent state.

The lineage of a state is an identifier shared by a set of states whose
serials are meaningfully comparable because they are produced by
progressive Refresh/Apply operations from the same initial empty state.

The lineage is set to a version 4 (random) UUID when a new state is
created and is then preserved across all subsequent changes.

States created before this change have no lineage, and users may wish
to set one on an existing state in order to gain the safety benefits
that lineage will come to provide. An empty lineage is therefore
considered to be compatible with all lineages.
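
The lineage rules described in this commit message can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the `State` struct, `NewLineage`, `SameLineage`, and `CompareSerial` are hypothetical names, and only the two fields discussed here are modeled.

```go
package main

import (
	"crypto/rand"
	"errors"
	"fmt"
)

// State is a hypothetical, pared-down stand-in for a Terraform state.
type State struct {
	Lineage string // random UUID assigned once, when the state is created
	Serial  int64  // increases as the state changes
}

// ErrUnrelated signals that two states do not share a lineage, so their
// serials cannot be meaningfully compared.
var ErrUnrelated = errors.New("states have different lineage")

// NewLineage returns a version 4 (random) UUID in canonical text form.
func NewLineage() (string, error) {
	var b [16]byte
	if _, err := rand.Read(b[:]); err != nil {
		return "", err
	}
	b[6] = (b[6] & 0x0f) | 0x40 // set the version field to 4
	b[8] = (b[8] & 0x3f) | 0x80 // set the RFC 4122 variant bits
	return fmt.Sprintf("%x-%x-%x-%x-%x",
		b[0:4], b[4:6], b[6:8], b[8:10], b[10:16]), nil
}

// SameLineage reports whether two states may be compared by serial.
// An empty lineage (a state predating lineage) matches any lineage.
func SameLineage(a, b *State) bool {
	if a.Lineage == "" || b.Lineage == "" {
		return true
	}
	return a.Lineage == b.Lineage
}

// CompareSerial returns -1, 0, or 1 when a is older than, equal to, or
// newer than b. It returns ErrUnrelated when the lineages differ.
func CompareSerial(a, b *State) (int, error) {
	if !SameLineage(a, b) {
		return 0, ErrUnrelated
	}
	switch {
	case a.Serial < b.Serial:
		return -1, nil
	case a.Serial > b.Serial:
		return 1, nil
	default:
		return 0, nil
	}
}

func main() {
	lineage, err := NewLineage()
	if err != nil {
		panic(err)
	}
	older := &State{Lineage: lineage, Serial: 3}
	newer := &State{Lineage: lineage, Serial: 4}
	legacy := &State{Serial: 9} // no lineage recorded
	other := &State{Lineage: "some-other-lineage", Serial: 4}

	fmt.Println(CompareSerial(older, newer))  // same lineage: comparable
	fmt.Println(CompareSerial(older, legacy)) // empty lineage: comparable
	fmt.Println(CompareSerial(older, other))  // different lineage: error
}
```

Returning an error rather than an arbitrary ordering for unrelated states keeps callers from treating a serial comparison as meaningful when it is not.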

After running "terraform plan -out=tfplan" and then
"terraform apply tfplan", the plan file is left on disk and could
previously be applied a second time.

Here we add a new constraint that prevents the use of a plan that was
not produced from the current state, thus avoiding that problem.

It will also reduce race conditions (on the human timescale) between
running "plan" and later running "apply", in environments where multiple
people/processes are using Terraform with the same remote state. This
hazard cannot be eliminated entirely without proper locking, but with
this change in place the race condition exists only between two concurrent
*applies*, rather than across the whole time period between plan
and apply.
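
One way such a constraint could work is sketched below. This is an assumption for illustration only: the commit message does not say how the plan is tied to its state, so the sketch supposes the plan records the lineage and serial of the state it was computed from. `Plan`, `CheckPlanFresh`, and `ErrStalePlan` are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
)

// State is a hypothetical, pared-down stand-in for a Terraform state.
type State struct {
	Lineage string
	Serial  int64
}

// Plan is a hypothetical plan that remembers which state it was
// produced from (an assumption of this sketch).
type Plan struct {
	StateLineage string
	StateSerial  int64
}

// ErrStalePlan is returned when a plan no longer matches the current state.
var ErrStalePlan = errors.New("plan was not produced from the current state")

// CheckPlanFresh refuses a plan unless it was computed from exactly the
// state that is current now. For simplicity this sketch compares lineage
// strictly and does not special-case an empty lineage.
func CheckPlanFresh(p *Plan, current *State) error {
	if p.StateLineage != current.Lineage || p.StateSerial != current.Serial {
		return ErrStalePlan
	}
	return nil
}

func main() {
	state := &State{Lineage: "example-lineage", Serial: 7}
	plan := &Plan{StateLineage: state.Lineage, StateSerial: state.Serial}

	// First apply: the plan still matches the state it came from.
	fmt.Println(CheckPlanFresh(plan, state))

	// Applying changes the state, which advances its serial.
	state.Serial++

	// Re-applying the leftover plan file is now rejected.
	fmt.Println(CheckPlanFresh(plan, state))
}
```

The check narrows the window for conflicts but is not atomic, which is why the commit message notes that two concurrent applies can still race without locking.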

Accidentally losing a state with resources in it can be anywhere from
annoying to catastrophic. A common cause of such a problem is accidentally
clobbering a remote state with an unrelated local state or vice-versa.

Here we introduce safety checks that exploit the new "lineage" concept
to ensure that once a state location is established it can only be
updated with new states from the same lineage as the initial state.

We also make the remote state caching mechanism treat a lineage mismatch
as a conflict, thus ensuring that automatic state syncing will stop if
the user manages to get into a broken, mismatched state.

As a special exception, we allow *empty* states to be "clobbered"
regardless of lineage. This enables remote state to be configured
easily in the common case where earlier user actions have implicitly
caused an empty local state to be created, and also allows the remote
state left behind from a config that has been destroyed to be overwritten
by a new, unrelated state from a different config.
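
The overwrite rule and its empty-state exception might look roughly like this. It is a sketch under stated assumptions, not the PR's implementation: `State`, its `Resources` field, `Empty`, `CheckOverwrite`, and `ErrLineageMismatch` are hypothetical names, and both arguments are assumed to be non-nil.

```go
package main

import (
	"errors"
	"fmt"
)

// State is a hypothetical, pared-down stand-in for a Terraform state.
type State struct {
	Lineage   string
	Serial    int64
	Resources []string // resource addresses; stands in for real resource data
}

// Empty reports whether the state tracks no resources.
func (s *State) Empty() bool {
	return len(s.Resources) == 0
}

// ErrLineageMismatch is returned when an unrelated state would clobber
// a non-empty state.
var ErrLineageMismatch = errors.New("lineage mismatch: refusing to overwrite state")

// CheckOverwrite decides whether incoming may replace existing at a
// state location (local cache or remote storage).
func CheckOverwrite(existing, incoming *State) error {
	// Special exception: an empty state may be clobbered regardless of
	// lineage, since losing it is trivial to recover from.
	if existing.Empty() {
		return nil
	}
	// States that predate lineage are compatible with every lineage.
	if existing.Lineage == "" || incoming.Lineage == "" {
		return nil
	}
	if existing.Lineage != incoming.Lineage {
		return ErrLineageMismatch
	}
	return nil
}

func main() {
	remote := &State{Lineage: "lineage-a", Serial: 12, Resources: []string{"aws_instance.web"}}
	emptyLocal := &State{Lineage: "lineage-b", Serial: 1}
	unrelated := &State{Lineage: "lineage-b", Serial: 5, Resources: []string{"aws_s3_bucket.logs"}}

	// The empty local cache may be replaced by the remote state.
	fmt.Println(CheckOverwrite(emptyLocal, remote))

	// A non-empty remote state may not be replaced by an unrelated state.
	fmt.Println(CheckOverwrite(remote, unrelated))
}
```

Checking emptiness before lineage is what lets a fresh working directory adopt an existing remote state without manual cleanup.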

The priority here is preventing actions that should never occur, so the
UX is not polished. Later we may wish to make changes at higher
levels of abstraction, either to prevent these situations from arising
in the first place (e.g. making remote configuration automatic based on
config) or to give the user more guidance on resolution.
@apparentlymart
Contributor Author

In light of #6811 I'm closing this in favor of #7026, which has the same introduction of lineage and the second feature from this PR, but skips the "stale plans" feature that now no longer makes sense.

@stack72 stack72 deleted the f-state-safety branch November 25, 2016 16:27
@ghost

ghost commented Apr 19, 2020

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.

If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@ghost ghost locked and limited conversation to collaborators Apr 19, 2020