Persist state more frequently #14834

Merged: 5 commits into master on May 25, 2017

Conversation

@jbardin (Member) commented May 25, 2017

State implementations may not commit state to storage until PersistState is called. With the new separate backends running the show, PersistState wasn't called during the apply execution, and if the process was terminated early there would be no state saved.

This PR provides two new pieces to help mitigate the issue. The first is in the StateHook, which writes the state after each change during the apply walk: it now periodically calls PersistState, with a minimum 10-second interval since the previous PersistState call.
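
A minimal sketch of that behavior, using illustrative types rather than Terraform's actual hook and state interfaces (StateWriter, StateHook, and PostStateUpdate here are stand-ins):

```go
package statehook

import (
	"sync"
	"time"
)

// StateWriter is an illustrative stand-in for a state implementation that
// caches writes and only commits to durable storage when PersistState is
// called.
type StateWriter interface {
	WriteState(state []byte) error
	PersistState() error
}

// persistInterval is the minimum time between PersistState calls.
const persistInterval = 10 * time.Second

// StateHook records every state change and flushes to permanent storage at
// most once per persistInterval.
type StateHook struct {
	mu          sync.Mutex
	Writer      StateWriter
	lastPersist time.Time
}

// PostStateUpdate is a hypothetical hook callback invoked after each state
// change during the apply walk.
func (h *StateHook) PostStateUpdate(state []byte) error {
	h.mu.Lock()
	defer h.mu.Unlock()

	if err := h.Writer.WriteState(state); err != nil {
		return err
	}

	// Flush to permanent storage only if at least persistInterval has
	// elapsed since the previous PersistState call.
	if time.Since(h.lastPersist) >= persistInterval {
		if err := h.Writer.PersistState(); err != nil {
			return err
		}
		h.lastPersist = time.Now()
	}
	return nil
}
```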

The second improvement is that when the backend receives notice that the operation is being cancelled, it immediately calls PersistState to attempt to flush any cached state to permanent storage.
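
A sketch of that cancellation path, again with placeholder names (Persister and flushOnCancel are not the real backend API): wait for the operation's context to be cancelled, then persist immediately rather than waiting for the next interval.

```go
package cancelflush

import (
	"context"
	"log"
)

// Persister is the slice of the state API this sketch needs; in the real
// backend this would be the state manager's PersistState method.
type Persister interface {
	PersistState() error
}

// flushOnCancel blocks until the operation's context is cancelled, then
// immediately flushes any cached state, since the user may terminate the
// process shortly after cancelling.
func flushOnCancel(ctx context.Context, p Persister) {
	<-ctx.Done()
	if err := p.PersistState(); err != nil {
		log.Printf("error persisting state after cancellation: %s", err)
	}
}
```

In practice something like this would be started in its own goroutine alongside the running operation, e.g. `go flushOnCancel(stopCtx, stateMgr)`.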

Since the StateWriters can now be accessed concurrently, the outer structs all need to be concurrency-safe. Mutexes are added to the methods that mutate the structs to serialize access, along with tests run under the race detector.
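
To illustrate the locking, here is a sketch of an in-memory StateWriter whose mutating methods are serialized with a mutex; the real implementations wrap local or remote storage, but the locking pattern is the same.

```go
package statelock

import "sync"

// InmemState is an illustrative in-memory state store. WriteState caches the
// latest state; PersistState commits the cache to "permanent" storage.
type InmemState struct {
	mu        sync.Mutex
	cached    []byte
	persisted []byte
}

// WriteState caches the new state. The mutex serializes concurrent callers,
// e.g. the StateHook and the backend's cancellation handler.
func (s *InmemState) WriteState(state []byte) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.cached = append([]byte(nil), state...)
	return nil
}

// PersistState commits the cached state.
func (s *InmemState) PersistState() error {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.persisted = append([]byte(nil), s.cached...)
	return nil
}
```

A test that calls WriteState and PersistState from several goroutines and is run with `go test -race` would flag any path that mutates the struct without holding the lock.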

Fixes #14487

jbardin added 5 commits May 24, 2017 16:16
Have StateHook periodically call PersistState to flush any cached state
to permanent storage. This uses a minimum 10-second interval between
calls to PersistState.
When the backend operation is cancelled, immediately call PersistState.
There is a high likelihood that the user will terminate the process
early if the provider doesn't return in a timely manner, so persist as
much state as possible.

@apparentlymart (Contributor) left a comment

This seems fine to me in principle! Better to have the state persisted more often.

One detail I'm a little concerned about is what impact this might have on the current Terraform Enterprise UI. Currently the UI will list every single state write performed by a particular run, and already this causes confusion for customers when they see e.g. the state created by the refresh followed by the state created by the apply itself.

We have had some internal conversations about changing the Enterprise data model and UI to have a special understanding about the "initial state" and "final state" of an operation, and de-emphasizing (or hiding altogether) intermediate states, but until now that felt low priority.

I'm not sure we need to block on addressing this, but it seems worth starting a conversation with some of the folks working on Enterprise to make sure this is on their radar and we don't create issues for the support team.

@jbardin (Member, Author) commented May 25, 2017

@apparentlymart,

That's a good point, but I would lean heavily on the side of TFE not depending on the behavior of the state in the UI.

@apparentlymart (Contributor) left a comment

Sorry, I meant to mark this as approved earlier... the note about looping in the Enterprise team was more intended as an async thing, to make sure it doesn't lead to surprises.

@jbardin jbardin merged commit 1d58576 into master May 25, 2017
@jbardin jbardin deleted the jbardin/state-hook branch May 25, 2017 22:47

@manuwela (Contributor) commented Jun 8, 2017

@jbardin, @apparentlymart: since this was supposed to fix #14487 and has now been reverted in 0.9.8, when is the actual fix expected?

It creates a bit of a problem for us to use Terraform 0.9+ versions, since the S3 backend has nothing populated in some error scenarios. But we do have to move to v0.9+ for some other unrelated fixes/features.

@apparentlymart (Contributor)

Due to this revert being unplanned, we've unfortunately had to prioritize other ongoing work ahead of it for the moment, but we do plan to come back to this soon, once we've got some of the big logistical work for 0.10 out of the way.

However, if there are situations where Terraform is failing without doing a final write of the state, that is a bug in itself and worth addressing separately, so a new top-level issue would be welcome if you have more details to share about what kinds of errors you are experiencing. Periodic state writes during apply would just be a band-aid for that.

Also worth noting that this PR wasn't reverted in its entirety, but rather just the one commit about periodically persisting state. The part of this change where state is persisted immediately on SIGINT is still in 0.9.8, which doesn't help with the "not saving state on error" problem, but it does address a related concern that others might find useful.

@manuwela (Contributor)

Thanks for the reply @apparentlymart

@manuwela (Contributor) commented Jul 3, 2017

Hi @apparentlymart, @jbardin: any update on when this will be taken up, since there have been a couple of 0.9.x releases and one 0.10 beta release since then? Thanks.

@jbardin (Member, Author) commented Jul 3, 2017

@manuwelakanade,

While we do hope to improve how state is persisted in the backends, we would like to stress that this is working as intended right now. Because Terraform operates concurrently and state checkpoints aren't atomic, there is no guarantee that any state seen while running reflects the actual state of the infrastructure, which makes incremental state updates less useful in most cases.

As apparentlymart mentioned, if you have a case where an updated state isn't persisted at all from a run, please file a new issue, as that could be very serious. In the meantime, there are other related issues pertaining to state serialization that need to be sorted out before we can attempt to reinstate incremental updates in remote backends.

@glasser (Contributor) commented Jul 6, 2017

Is this specifically a problem with Atlas? Maybe this revert could be undone, with the feature only enabled for backends that claim to support it? (This is probably what we'll do in our fork.)

@jbardin (Member, Author) commented Jul 6, 2017

@glasser, no, this has turned out to be a problem with state handling in general; Atlas was just the first to encounter an issue because of its strict handling of state checksums and serial numbers.

Terraform 0.10-beta2 contains some major refactoring of how state serialization is handled, and we can see how that works in practice before moving forward.

@ghost commented Apr 8, 2020

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.

If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@ghost ghost locked and limited conversation to collaborators Apr 8, 2020

Merging this pull request may close: remote state is not incrementally updated (#14487)

4 participants