Persist state more frequently #14834

Merged: 5 commits into master on May 25, 2017

Conversation

@jbardin (Member) commented May 25, 2017

State implementations may not commit state to storage until PersistState is called. With the new separate backends running the show, PersistState wasn't called during the apply execution, and if the process was terminated early there would be no state saved.

This PR provides two new pieces to help mitigate the issue. The first is in the StateHook, which writes the state after each change during the apply walk: it now periodically calls PersistState, with a minimum 10-second interval since the previous PersistState call.
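
A minimal sketch of that behavior, using illustrative types rather than Terraform's actual hook and state interfaces (StateWriter, StateHook, and PostStateUpdate here are stand-ins):

```go
package statehook

import (
	"sync"
	"time"
)

// StateWriter is an illustrative stand-in for a state implementation that
// caches writes and only commits to durable storage when PersistState is
// called.
type StateWriter interface {
	WriteState(state []byte) error
	PersistState() error
}

// persistInterval is the minimum time between PersistState calls.
const persistInterval = 10 * time.Second

// StateHook records every state change and flushes to permanent storage at
// most once per persistInterval.
type StateHook struct {
	mu          sync.Mutex
	Writer      StateWriter
	lastPersist time.Time
}

// PostStateUpdate is a hypothetical hook callback invoked after each state
// change during the apply walk.
func (h *StateHook) PostStateUpdate(state []byte) error {
	h.mu.Lock()
	defer h.mu.Unlock()

	if err := h.Writer.WriteState(state); err != nil {
		return err
	}

	// Flush to permanent storage only if at least persistInterval has
	// elapsed since the previous PersistState call.
	if time.Since(h.lastPersist) >= persistInterval {
		if err := h.Writer.PersistState(); err != nil {
			return err
		}
		h.lastPersist = time.Now()
	}
	return nil
}
```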

The second improvement is that when the backend receives notice that the operation is being cancelled, it immediately calls PersistState to attempt to flush any cached state to permanent storage.
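
A sketch of that cancellation path, again with placeholder names (Persister and flushOnCancel are not the real backend API): wait for the operation's context to be cancelled, then persist immediately rather than waiting for the next interval.

```go
package cancelflush

import (
	"context"
	"log"
)

// Persister is the slice of the state API this sketch needs; in the real
// backend this would be the state manager's PersistState method.
type Persister interface {
	PersistState() error
}

// flushOnCancel blocks until the operation's context is cancelled, then
// immediately flushes any cached state, since the user may terminate the
// process shortly after cancelling.
func flushOnCancel(ctx context.Context, p Persister) {
	<-ctx.Done()
	if err := p.PersistState(); err != nil {
		log.Printf("error persisting state after cancellation: %s", err)
	}
}
```

In practice something like this would be started in its own goroutine alongside the running operation, e.g. `go flushOnCancel(stopCtx, stateMgr)`.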

Since the StateWriters can now be accessed concurrently, the outer structs all need to be concurrency-safe. Mutexes are added to the methods that mutate the structs to serialize access, along with tests run under the race detector.
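
To illustrate the locking, here is a sketch of an in-memory StateWriter whose mutating methods are serialized with a mutex; the real implementations wrap local or remote storage, but the locking pattern is the same.

```go
package statelock

import "sync"

// InmemState is an illustrative in-memory state store. WriteState caches the
// latest state; PersistState commits the cache to "permanent" storage.
type InmemState struct {
	mu        sync.Mutex
	cached    []byte
	persisted []byte
}

// WriteState caches the new state. The mutex serializes concurrent callers,
// e.g. the StateHook and the backend's cancellation handler.
func (s *InmemState) WriteState(state []byte) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.cached = append([]byte(nil), state...)
	return nil
}

// PersistState commits the cached state.
func (s *InmemState) PersistState() error {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.persisted = append([]byte(nil), s.cached...)
	return nil
}
```

A test that calls WriteState and PersistState from several goroutines and is run with `go test -race` would flag any path that mutates the struct without holding the lock.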

Fixes #14487

jbardin added 5 commits May 24, 2017 16:16
Have StateHook periodically call PersistState to flush any cached state
to permanent storage. This uses a minimum 10-second interval between
calls to PersistState.
When the backend operation is cancelled, immediately call PersistState.
There is a high likelihood that the user will terminate the process
early if the provider doesn't return in a timely manner, so persist as
much state as possible.

@apparentlymart (Contributor) left a comment

This seems fine to me in principle! Better to have the state persisted more often.

One detail I'm a little concerned about is what impact this might have on the current Terraform Enterprise UI. Currently the UI will list every single state write performed by a particular run, and already this causes confusion for customers when they see e.g. the state created by the refresh followed by the state created by the apply itself.

We have had some internal conversations about changing the Enterprise data model and UI to have a special understanding about the "initial state" and "final state" of an operation, and de-emphasizing (or hiding altogether) intermediate states, but until now that felt low priority.

I'm not sure we need to block on addressing this, but it seems worth starting a conversation with some of the folks working on Enterprise to make sure this is on their radar and we don't create issues for the support team.

@jbardin (Member, Author) commented May 25, 2017

@apparentlymart,

That's a good point, but I would lean heavily on the side of TFE not depending on the behavior of the state in the UI.

@apparentlymart (Contributor) left a comment

Sorry, I meant to mark this as approved earlier... the note about looping in the Enterprise team was more intended as an async thing, to make sure it doesn't lead to surprises.

@jbardin jbardin merged commit 1d58576 into master May 25, 2017
@jbardin jbardin deleted the jbardin/state-hook branch May 25, 2017 22:47

@manuwela (Contributor) commented Jun 8, 2017

@jbardin, @apparentlymart: since this was supposed to fix #14487 and has now been reverted in 0.9.8, when is the actual fix expected?

It creates a bit of a problem for us to use Terraform 0.9+ versions, since the S3 backend has nothing populated in some error scenarios. But we do have to move to v0.9+ for some other unrelated fixes/features.

@apparentlymart (Contributor)

Due to this revert being unplanned, we've unfortunately had to prioritize other ongoing work ahead of it for the moment, but we do plan to come back to this soon, once we've got some of the big logistical work for 0.10 out of the way.

However, if there are situations where Terraform is failing without doing a final write of the state, that is a bug in itself and worth addressing separately, so a new top-level issue would be welcome if you have more details to share about what kinds of errors you are experiencing. Periodic state writes during apply would just be a band-aid for that.

Also worth noting that this PR wasn't reverted in its entirety, but rather just the one commit about periodically persisting state. The part of this change where state is persisted immediately on SIGINT is still in 0.9.8, which doesn't help with the "not saving state on error" problem, but it does address a related concern that others might find useful.

@manuwela (Contributor)

Thanks for the reply @apparentlymart

@manuwela (Contributor) commented Jul 3, 2017

Hi @apparentlymart, @jbardin: any update on when this will be taken up, since there have been a couple of 0.9.x releases and one 0.10 beta release since then? Thanks.

@jbardin (Member, Author) commented Jul 3, 2017

@manuwelakanade,

While we do hope to improve how state is persisted in the backends, we would like to stress that this is working as intended right now. Because Terraform operates concurrently and state checkpoints aren't atomic, there is no guarantee that any state seen while running reflects the actual state of the infrastructure, which makes incremental state updates less useful in most cases.

As apparentlymart mentioned, if you have a case where an updated state isn't persisted at all from a run, please file a new issue, as that could be very serious. In the meantime, there are other related issues pertaining to state serialization that need to be sorted out before we can attempt to reinstate incremental updates in remote backends.

@glasser (Contributor) commented Jul 6, 2017

Is this specifically a problem with Atlas? Maybe this revert could be undone, with the feature only enabled for backends that claim to support it? (This is probably what we'll do in our fork.)

@jbardin (Member, Author) commented Jul 6, 2017

@glasser, no, this has turned out to be a problem with state handling in general; Atlas was just the first to encounter an issue because of its strict handling of state checksums and serial numbers.

Terraform 0.10-beta2 contains some major refactoring of how state serialization is handled, and we can see how that works in practice before moving forward.

@ghost commented Apr 8, 2020

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.

If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@ghost ghost locked and limited conversation to collaborators Apr 8, 2020

Merging this pull request may close: remote state is not incrementally updated (#14487)

4 participants