
Ability to gracefully recover from unexpected errors on long running resource operations. #21652

Closed
chrisst opened this issue Jun 7, 2019 · 5 comments

Comments

@chrisst
Contributor

chrisst commented Jun 7, 2019

Use-cases

There are a handful of resources in GCP that take a very long time (up to an hour) to create successfully from a single POST request. We currently poll the resource for success, but there are often scenarios where Terraform gets interrupted before the polling has seen a successful state: for example, a network interruption long enough to kill the polling operation, or a Jenkins worker getting killed mid-job, either of which results in a tainted resource the next time Terraform is run. For these particular long-running resources it would be great if a subsequent run of Terraform were able to recover more gracefully instead of incurring a very costly resource recreation.

Attempted Solutions

When this is encountered locally, running untaint is viable, but when running in automation it's more difficult: it requires building a script that lists the resources in state, makes best guesses at which ones could have been tainted (or parses the state file), and then runs untaint on them.

Proposal

I have a handful of different possible ideas:

  • Adding something like ignore_taint to lifecycle would allow a user to acknowledge that this resource doesn't need to be recreated if there was an error during apply but the resource does currently exist. Giving the end user this control means that the provider doesn't have to make any assumptions about what else is happening in the config, such as provisioners or how downstream resources rely upon it.

  • Give the provider the ability to override the taint behavior on certain error conditions for a given resource. This would allow errors like timeouts, gateway failures, or SIGTERMs to skip tainting the resource, while other categories of errors coming back from the API behave normally.

  • An ability for a resource to "resume" the create operation. Some APIs provide the concept of a request ID which can be used to 'resume' a long-running create. Terraform could use its own request ID (and possibly allow a resource to override it), which would allow a resource to jump back to waiting on a stateChangeConf when it is rerun (see the sketch after this list).
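To make the third idea concrete, here is a rough sketch of the provider-side shape of an idempotency token. It is written against a purely hypothetical API client (startCreateWithRequestID, waitForOperation, and the operation type are stand-ins), and persisting the token across runs is exactly the part Terraform core or the SDK would need to provide:

```go
package example

import (
	"crypto/rand"
	"encoding/hex"
)

// operation is a hypothetical handle for a long-running API operation.
type operation struct{ Name, ResourceID string }

// startCreateWithRequestID is a hypothetical API call: it creates the
// resource, or, if the same requestID was already seen, returns the operation
// started by the earlier call instead of creating a duplicate. That
// idempotency is what makes "resume" possible.
func startCreateWithRequestID(name, requestID string) (*operation, error) {
	return &operation{Name: "op-" + requestID, ResourceID: name}, nil
}

// waitForOperation is a hypothetical API call that polls until the operation
// succeeds or fails.
func waitForOperation(op *operation) error { return nil }

// newRequestID builds a random idempotency token; a real provider would more
// likely use a UUID library.
func newRequestID() (string, error) {
	b := make([]byte, 16)
	if _, err := rand.Read(b); err != nil {
		return "", err
	}
	return hex.EncodeToString(b), nil
}

// createOrResume shows the flow: the first run mints a token, later runs
// reuse it, so an interrupted apply can get back the same operation and keep
// waiting on it (e.g. via a StateChangeConf) instead of recreating the object.
func createOrResume(name, savedRequestID string) (*operation, error) {
	reqID := savedRequestID
	if reqID == "" {
		var err error
		if reqID, err = newRequestID(); err != nil {
			return nil, err
		}
	}
	op, err := startCreateWithRequestID(name, reqID)
	if err != nil {
		return nil, err
	}
	return op, waitForOperation(op)
}
```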

References

@apparentlymart
Contributor

Hi @chrisst! Thanks for sharing these use-cases.

Currently Terraform gives providers a few different options in how they respond to "apply" requests:

  • In the happy path, everything has completed successfully and the provider returns a representation of the new state with no errors, and Terraform saves it.
  • In the very-sad path, the operation went so wrong that nothing was created at all, and so the provider returns just errors and a null new state object.
  • In the kinda-sad path, the operation partially completed enough that some objects were left behind, and so the provider returns the errors along with a partial new state object that contains enough information for a subsequent request to destroy the object, and Terraform saves it marked as "tainted" in order to remember that it needs to be destroyed on the next plan. (A sketch of how a provider signals each of these outcomes follows below.)
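As a minimal sketch, assuming the legacy helper/schema SDK, these three outcomes roughly correspond to whether the Create function has set an ID and whether it returns an error; the remote API calls are reduced to placeholder values here:

```go
package example

import (
	"errors"

	"github.com/hashicorp/terraform-plugin-sdk/helper/schema"
)

// resourceExampleCreate sketches how a helper/schema provider signals each of
// the three outcomes above.
func resourceExampleCreate(d *schema.ResourceData, meta interface{}) error {
	// 1. Issue the create request against the remote API.
	id, createErr := "example-id", error(nil) // placeholder for a real API call

	if createErr != nil {
		// Very-sad path: nothing was created. Returning an error while the
		// ID is still unset tells Terraform there is no new object to track.
		return createErr
	}

	// Something now exists remotely, so record its ID before polling.
	d.SetId(id)

	// 2. Poll the remote API until the object is fully ready.
	waitErr := errors.New("placeholder: polling was interrupted")

	if waitErr != nil {
		// Kinda-sad path: the ID is set, so Terraform saves this partial
		// state and marks the resource tainted, scheduling a replacement.
		return waitErr
	}

	// Happy path: return nil and Terraform saves the complete new state.
	return nil
}
```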

What we've understood from what you've described here is a new fourth case which Terraform doesn't have any explicit support for:

  • Ambivalent path: the operation has been cancelled by the user (with SIGINT), but the remote system was left in a sufficiently happy state that a subsequent plan could potentially continue the work that the provider has started, turning the incomplete object into a complete one without replacing it.

Although Terraform doesn't currently explicitly model this ambivalent path, for a provider that explicitly supports cancellation and is able to respond to it by creating this sort of partial result, there is a way to imply this behavior today: respond to the cancellation by returning a partial state object with no errors, and then make sure the resource type has suitable CustomizeDiff and/or Update logic to recognize that partial result and treat it as a funny sort of "drift", where the remote system doesn't match the configuration.

The above assumes that it's possible to encode the incomplete state well enough within the resource schema that a subsequent plan can recognize it and treat it as an update. That seems possible in theory, but I'm sure there are situations where it's hard to do in practice, particularly within the assumptions of the current SDK, which can already make it hard to deal with normal drift in a robust way.
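As a rough sketch of what that implied ambivalent path might look like on the Create side, assuming the partial result can be captured in a hypothetical pending_operation attribute (startCreate, waitForDone, and wasInterrupted are placeholders for a real API client and cancellation check):

```go
package example

import (
	"github.com/hashicorp/terraform-plugin-sdk/helper/schema"
)

// Hypothetical stand-ins for a real API client and its long-running operation.
type operation struct{ Name, ResourceID string }

func startCreate(name string) (*operation, error) {
	return &operation{Name: "op-" + name, ResourceID: name}, nil
}
func waitForDone(op *operation) error   { return nil }
func wasInterrupted(err error) bool     { return false }

// resourceExampleCreate records the unfinished operation in the hypothetical
// pending_operation attribute and returns success when the wait is
// interrupted, so the saved state is not tainted. A matching CustomizeDiff
// and/or Update (not shown) would notice the non-empty marker, plan an
// in-place update, and finish the original create.
func resourceExampleCreate(d *schema.ResourceData, meta interface{}) error {
	op, err := startCreate(d.Get("name").(string))
	if err != nil {
		return err // nothing was created; normal very-sad path
	}
	d.SetId(op.ResourceID)

	if err := waitForDone(op); err != nil {
		if wasInterrupted(err) {
			// Ambivalent path: persist the partial result with no error.
			d.Set("pending_operation", op.Name)
			return nil
		}
		return err // real failures keep the normal tainting behavior
	}

	d.Set("pending_operation", "")
	return nil
}
```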

With that said though, we'd be interested to explore some real-world examples and see if this implicit ambivalent path can be workable for them right now. Where possible we like to experiment with new concepts via provider-specific features at first, because it tends to be easier to design for individual cases first and gather experience to inform a possible general feature later. If you have some specific examples of resource types where the create takes a particularly long time that we could use as motivating examples, I'd love to think through the implications of implementing them using this "implied ambivalent path" in the short term, and to capture any specific drawbacks we identify in order to inform the design of a possible explicit protocol feature.

Thanks!

@apparentlymart added the core and waiting-response labels on Nov 12, 2019
@chrisst
Contributor Author

chrisst commented Nov 15, 2019

@apparentlymart thanks for the response! I think my favorite example of something taking a significantly long time is google_composer_environment, which can take up to an hour to successfully complete. For many of the Google resources that take a while to provision, the API returns an operation object (composer operation) that we then poll against to determine the result. Other APIs support passing in an idempotency token such as requestId. Using a requestId can allow subsequent Create calls to return the same Operation object, allowing a user to resume polling for success/failure of the original Create call.

I'll play around with storing a partial state that includes the operation ID and "resuming" polling within the update, since Composer doesn't support requestIds at this time.
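A minimal sketch of what that resume-on-update could look like with the SDK's StateChangeConf, assuming a hypothetical pending_operation attribute holds the stored operation name and getOperationStatus stands in for the real Composer operations API:

```go
package example

import (
	"time"

	"github.com/hashicorp/terraform-plugin-sdk/helper/resource"
	"github.com/hashicorp/terraform-plugin-sdk/helper/schema"
)

// getOperationStatus is a hypothetical helper that asks the API for the
// current status of a long-running operation, by name.
func getOperationStatus(name string) (interface{}, string, error) {
	// Call the real operations endpoint here and map its response to a
	// status string such as "PENDING", "RUNNING", or "DONE".
	return struct{}{}, "DONE", nil
}

// resourceExampleUpdate resumes polling when an earlier apply stored an
// unfinished operation name. pending_operation is a hypothetical computed
// attribute; a real google_composer_environment change would pick its own name.
func resourceExampleUpdate(d *schema.ResourceData, meta interface{}) error {
	if opName, _ := d.Get("pending_operation").(string); opName != "" {
		conf := &resource.StateChangeConf{
			Pending:    []string{"PENDING", "RUNNING"},
			Target:     []string{"DONE"},
			Refresh:    func() (interface{}, string, error) { return getOperationStatus(opName) },
			Timeout:    d.Timeout(schema.TimeoutUpdate),
			MinTimeout: 10 * time.Second,
		}
		if _, err := conf.WaitForState(); err != nil {
			return err
		}
		// The original create is now complete; clear the marker.
		d.Set("pending_operation", "")
	}

	// ... handle any other changed attributes as a normal update ...
	return nil
}
```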

@ghost removed the waiting-response label on Nov 15, 2019
@apparentlymart
Contributor

Thanks @chrisst! I'm excited to hear how those experiments play out. 💃

@apparentlymart added the waiting-response label on Nov 16, 2019
@hashibot
Contributor

Hello again!

We didn't hear back from you, so I'm going to close this in the hope that a previous response gave you the information you needed. If not, please do feel free to re-open this and leave another comment with the information my human friends requested above. Thanks!

@ghost

ghost commented Jan 20, 2020

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.

If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@ghost locked and limited conversation to collaborators on Jan 20, 2020
@ghost removed the waiting-response label on Jan 20, 2020