Ability to gracefully recover from unexpected errors on long running resource operations. #21652
Comments
Hi @chrisst! Thanks for sharing these use-cases. Currently Terraform gives providers a few different options in how they respond to "apply" requests.
What we've understood from what you've described here is a new fourth case which Terraform doesn't have any explicit support for: an operation that is interrupted partway through, leaving a partially-created object behind that a later run ought to be able to complete rather than recreate.
Although Terraform doesn't currently model this ambivalent path explicitly, for a provider that explicitly supports cancellation and is able to respond to it by creating this sort of partial result, there is a way to imply this behavior today: respond to the cancellation by returning a partial state object with no errors, and then make sure the resource type's read and diff behavior can recognize that partial object on a later run and plan an update to finish the work.

The above assumes that it's possible to encode the incomplete state well enough within the resource schema that a subsequent plan can recognize it and treat it as an update. That seems possible in theory, but I'm sure there are situations where it's hard to do in practice, particularly within the assumptions of the current SDK, which can already make it hard to deal with normal drift in a robust way.

With that said, we'd be interested to explore some real-world examples and see whether this implicit ambivalent path can be workable for them right now. Where possible we like to experiment with new concepts via provider-specific features at first, because it tends to be easier to design for individual cases first and gather experience to inform a possible general feature later. If you have some specific examples of resource types where the create takes a particularly long time that we could use as motivating examples, I'd love to think through the implications of implementing them using the "implied ambivalent path" in the short term, and to capture any specific drawbacks we identify in order to inform the design of a possible explicit protocol feature. Thanks!
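To make that suggestion concrete, here is a minimal sketch of what the create side of such a resource could look like. This is illustrative only, not code from any real provider: it assumes the v2 plugin SDK's context-aware resource functions, a hypothetical Client/Operation wrapper around the API, an "operation_id" attribute in the resource schema for recording the in-flight operation, and a waitForOperation polling helper (sketched further down, under Use-cases).

```go
package example

import (
	"context"

	"github.com/hashicorp/terraform-plugin-sdk/v2/diag"
	"github.com/hashicorp/terraform-plugin-sdk/v2/helper/schema"
)

// Hypothetical stand-ins for a real API client and its long-running
// operation type (e.g. a Google-style Operation); not part of any real SDK.
type Client struct{}

type Operation struct {
	Name   string
	Status string
	Error  string
}

func (c *Client) CreateEnvironment(ctx context.Context, name string) (*Operation, error) {
	// A real implementation would issue the POST here; stubbed for the sketch.
	return &Operation{Name: "operations/" + name, Status: "PENDING"}, nil
}

func (c *Client) GetOperation(ctx context.Context, name string) (*Operation, error) {
	// A real implementation would issue the GET here; stubbed for the sketch.
	return &Operation{Name: name, Status: "DONE"}, nil
}

// resourceExampleCreate illustrates the "implied ambivalent path": on
// cancellation it returns the partial state with no error, so the resource is
// recorded untainted and a later run can finish the pending operation.
func resourceExampleCreate(ctx context.Context, d *schema.ResourceData, meta interface{}) diag.Diagnostics {
	client := meta.(*Client)

	op, err := client.CreateEnvironment(ctx, d.Get("name").(string))
	if err != nil {
		return diag.FromErr(err)
	}

	// Record enough state immediately that a later run can find both the
	// object and the operation that is still working on it.
	d.SetId(d.Get("name").(string))
	if err := d.Set("operation_id", op.Name); err != nil {
		return diag.FromErr(err)
	}

	// waitForOperation is the usual status-polling loop (sketched below).
	if err := waitForOperation(ctx, client, op.Name, d.Timeout(schema.TimeoutCreate)); err != nil {
		if ctx.Err() != nil {
			// Terraform asked us to stop (e.g. Ctrl-C). Return the partial
			// state with no error instead of failing, leaving it untainted.
			return nil
		}
		// Genuine API failures still return an error (and taint) as usual.
		return diag.FromErr(err)
	}

	// Completed normally: clear the marker so later plans see a settled object.
	if err := d.Set("operation_id", ""); err != nil {
		return diag.FromErr(err)
	}
	return nil
}
```

Note that this only helps when the provider actually gets a chance to respond to the cancellation; a killed worker or a hard network drop would still leave the usual tainted record, which is where the resume and request-ID ideas discussed below come in.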
@apparentlymart thanks for the response! I think my favorite example of something taking a significantly long time is google_composer_environment, which can take up to an hour to complete successfully. For many of the Google resources that take a while to provision, the API returns an operation object (composer operation) that we then poll against to determine the result. Other APIs support passing in an idempotency token such as a request ID. I'll play around with storing a partial state that includes the operation ID and "resuming" polling within the update, as Composer doesn't support request IDs at this time.
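The "resume" half of that experiment could look roughly like the following, continuing the hypothetical Client, operation_id attribute, and waitForOperation helper from the sketch above. Again, this is an illustration of the idea, not the google provider's actual code.

```go
// Assumes the same imports and hypothetical types as the create sketch above.
// If an earlier run was interrupted and left an operation_id in state, the
// next apply "resumes" the create by polling that same operation inside
// Update instead of recreating anything.
func resourceExampleUpdate(ctx context.Context, d *schema.ResourceData, meta interface{}) diag.Diagnostics {
	client := meta.(*Client)

	if v, ok := d.GetOk("operation_id"); ok && v.(string) != "" {
		// Pick the pending operation back up where the interrupted run stopped.
		if err := waitForOperation(ctx, client, v.(string), d.Timeout(schema.TimeoutUpdate)); err != nil {
			return diag.FromErr(err)
		}
		if err := d.Set("operation_id", ""); err != nil {
			return diag.FromErr(err)
		}
	}

	// ...normal handling for any other changed arguments would follow here.
	return nil
}
```

Whether a later plan actually proposes an update for the half-created object depends on being able to encode the incomplete state in the schema, which is exactly the caveat raised in the comment above.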
Thanks @chrisst! I'm excited to hear how those experiments play out. 💃
Hello again! We didn't hear back from you, so I'm going to close this in the hope that a previous response gave you the information you needed. If not, please do feel free to re-open this and leave another comment with the information my human friends requested above. Thanks!
I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further. |
Use-cases
There are a handful of resources in GCP that take a really long time (up to an hour) to create successfully from a single POST request. We currently poll the resource for success, but there are often scenarios where Terraform gets interrupted before the polling has seen a successful state. For example, a network interruption long enough to kill the polling operation, or a Jenkins worker getting killed mid-job, results in a tainted resource the next time Terraform is run. For these particular long-running resources it would be great if a subsequent run of Terraform were able to recover more gracefully instead of incurring a very costly resource recreation.
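The polling referred to here is typically a status loop like the one below. It is a simplified, hypothetical sketch (the Client and Operation types are the same stand-ins used in the earlier sketches, and the import is the v2 plugin SDK's helper/resource package plus standard "fmt" and "time"; real providers build this from generated API clients), but it shows where an interruption turns into a tainted resource.

```go
// If this process is killed, or the connection drops for long enough, before
// the operation reaches DONE, the create function returns an error and
// Terraform records the (otherwise healthy) resource as tainted.
func waitForOperation(ctx context.Context, client *Client, opName string, timeout time.Duration) error {
	stateConf := &resource.StateChangeConf{
		Pending: []string{"PENDING", "RUNNING"},
		Target:  []string{"DONE"},
		Timeout: timeout,
		Refresh: func() (interface{}, string, error) {
			op, err := client.GetOperation(ctx, opName) // hypothetical client call
			if err != nil {
				return nil, "", err
			}
			if op.Error != "" {
				return op, op.Status, fmt.Errorf("operation %s failed: %s", opName, op.Error)
			}
			return op, op.Status, nil
		},
	}
	_, err := stateConf.WaitForStateContext(ctx)
	return err
}
```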
Attempted Solutions
When encountered locally, running terraform untaint is viable, but in automation it's more difficult: it requires building a script to list the resources in state, make a best guess at which ones may have been tainted (or parse the state file), and run untaint on them.

Proposal
I have a handful of different possible ideas:
1. Adding something like ignore_taint to lifecycle would allow a user to acknowledge that this resource doesn't need to be recreated if there was an error during apply but the resource does currently exist. Giving the end user this control means the provider doesn't have to make any assumptions about what else is happening in the config, such as provisioners, or about how downstream resources rely on it.
2. Give the provider the ability to override the taint behavior on certain error conditions for a given resource. This would allow errors like timeouts, gateway failures, or SIGTERMs to skip tainting the resource, while other categories of errors coming back from the API behave normally.
3. An ability for a resource to "resume" the create operation. Some APIs provide the concept of a request ID which can be used to 'resume' a long-running create. Terraform could use its own request ID, and possibly allow a resource to override it, so that a resource can jump back to waiting on a StateChangeConf when it is re-run. (A rough sketch of this idea follows the list.)
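As an illustration of the third idea, here is a rough sketch of how a provider could use a stored request ID to make a repeated create idempotent. As with the earlier sketches, the client method, the request_id attribute, and the use of github.com/google/uuid are assumptions made for the example, not an existing Terraform or provider feature.

```go
// Assumes the same hypothetical Client and waitForOperation helper as the
// earlier sketches, plus the github.com/google/uuid package.
func resourceExampleCreateWithRequestID(ctx context.Context, d *schema.ResourceData, meta interface{}) diag.Diagnostics {
	client := meta.(*Client)

	// Reuse a request ID recorded by an earlier, interrupted run if there is
	// one; otherwise mint one and store it before calling the API. This only
	// helps if the partial state (including request_id) survived that run.
	requestID, ok := d.GetOk("request_id")
	if !ok {
		requestID = uuid.New().String()
		if err := d.Set("request_id", requestID); err != nil {
			return diag.FromErr(err)
		}
	}

	// With the same request ID, re-issuing the call after an interruption is a
	// server-side no-op: the API hands back the original operation instead of
	// creating a duplicate object. (Hypothetical client method.)
	op, err := client.CreateEnvironmentWithRequestID(ctx, d.Get("name").(string), requestID.(string))
	if err != nil {
		return diag.FromErr(err)
	}
	d.SetId(d.Get("name").(string))

	if err := waitForOperation(ctx, client, op.Name, d.Timeout(schema.TimeoutCreate)); err != nil {
		return diag.FromErr(err)
	}
	return nil
}
```

The open design question is the one raised in the proposal itself: whether Terraform core should mint and persist that ID so it survives interruptions even when the provider never gets a chance to return state, or whether it remains a per-provider convention.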