Research alternative lifecycles to support resources which can't work with create_before_destroy #35763

Open
natedogith1 opened this issue Sep 23, 2024 · 21 comments


@natedogith1

Terraform Version

Terraform v1.9.6
on windows_386

Use Cases

Some resources need to be removed from their dependents before they can be deleted.

I've encountered this issue with google_monitoring_alert_policy referencing google_monitoring_notification_channel and google_compute_region_url_map referencing google_compute_region_backend_service. There are also several GitHub issues from other users who have encountered this problem.

Attempted Solutions

create_before_destroy has frequently been offered as a solution. This can work if there are no name conflicts. The problem is that, by design, this change propagates to the entire dependency graph, which means it can't be used if any dependency doesn't support create_before_destroy.
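For reference, a minimal sketch of that attempted workaround (argument values are illustrative, and required arguments may be omitted):

```hcl
resource "google_monitoring_notification_channel" "email" {
  display_name = "ops-email" # illustrative values
  type         = "email"

  lifecycle {
    # The replacement is created before the old object is destroyed, so a
    # dependent alert policy can be updated to the new channel first.
    # Terraform propagates this ordering through the dependency graph,
    # which is the limitation described above.
    create_before_destroy = true
  }
}
```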

Proposal

Add support for removing resources from dependents before destroying. The create_before_destroy documentation should probably also be updated to mention how it affects ordering with other resources being updated.
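To make the idea concrete, a purely hypothetical configuration sketch (the detach_before_destroy argument does not exist in Terraform; it only illustrates the proposed opt-in):

```hcl
resource "google_monitoring_notification_channel" "chan" {
  display_name = "ops-email" # illustrative values
  type         = "email"

  lifecycle {
    # Hypothetical argument: before destroying this channel, first update
    # any dependents (e.g. an alert policy) to remove their references,
    # without propagating a create_before_destroy ordering to them.
    detach_before_destroy = true
  }
}
```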

References

@natedogith1 natedogith1 added the "enhancement" and "new" (new issue not yet triaged) labels Sep 23, 2024
@jbardin
Member

jbardin commented Sep 23, 2024

Hi @natedogith1,

There are going to be some problems we first need to overcome to do something like this, which would probably require major changes not just to Terraform, but to providers and the protocol they use. The biggest hurdle is that since there's no value to use in the first update call, it would have to be some sort of "incomplete update" operation, which the provider would opt into with a planned partial set of changes that it could accomplish without the final planned values. This also encounters a "prior knowledge" problem, where the initial plan needs to be made differently in anticipation of a problem which hasn't happened yet. Providers generally can't tell when an update is going to fail because of conditions on another resource; even if you asked a provider whether the plan will be OK, it cannot tell you, because the failure won't happen until the final API call is made. Since it's not going to be possible to predict the failure, providers would have to defensively rewrite all actions which might result in this type of change to always assume it is required.

Given the amount of work that it would take to implement a new resource lifecycle in core, and to create a new protocol to implement across providers and the frameworks they use, it's generally far less friction to update providers to work well within the current protocol. With few exceptions, resources should all be designed to handle create_before_destroy, and most of the time it's a bug in the resource if that setting can't be used.

@natedogith1
Author

> With few exceptions, resources should all be designed with the ability to handle create_before_destroy, and most of the time that represents a bug in the resource if it can't be used.

I would expect most resources have a user-specified unique id (e.g. google_compute_instance's name) that would cause create_before_destroy to fail.

@jbardin
Member

jbardin commented Sep 24, 2024

Many resources do require a unique name, but have an input like name_prefix instructing the provider to generate a unique name with a given prefix, which avoids a conflict when replacing the instance.
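As a sketch of that pattern (google_compute_instance_template is one resource with a name_prefix argument; required blocks are abbreviated):

```hcl
resource "google_compute_instance_template" "app" {
  # The provider appends a unique suffix to this prefix, so the
  # replacement's generated name never collides with the old one.
  name_prefix  = "app-"
  machine_type = "e2-medium"
  # disk and network_interface blocks omitted for brevity

  lifecycle {
    create_before_destroy = true
  }
}
```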

@natedogith1
Author

From my relatively limited experience, it seems only a few resources have something like name_prefix. And switching over to a randomly generated name because some dependent might need create_before_destroy isn't ideal.

I don't know Terraform's internals, but something like create_before_destroy that instead only applies on deletion, and thus shouldn't need to propagate, might be easier to implement and would likely handle most of the use cases.

Actually, update -> destroy instead of the current default of destroy -> update seems logically sound at first glance, so it might be possible to have that as a default. Though I don't know if there'd be issues with re-creation causing the order of update/destroy to be swapped.

@jbardin
Member

jbardin commented Sep 24, 2024

The orderings we have for dependency operations in Terraform are not arbitrary, they are the result of requiring a deterministic, consistent ordering with no dependency cycles. If you make the destroy operation depend on the update, then you have a cycle because the create also depends on the destroy. This is why you need create_before_destroy to get the inverse ordering around updates.

I made some notes here with simplified examples to explain the basics of destroy ordering: https://github.com/hashicorp/terraform/blob/main/docs/destroying.md

@natedogith1
Author

I don't think there would be a cycle if destroy depending on update was conditional on whether it was a re-deploy. Though it would mean the ordering would depend on whether it's a redeploy, which might not be desirable as a default.

Could there be an option that behaves like create_before_destroy except it also removes the dependency between the creation and deletion of the replaced resource? That seems like it would allow consistent edges of the dependency graph without introducing cycles and without requiring propagating the change.

@jbardin
Member

jbardin commented Sep 25, 2024

Having an order of operations which is conditional is not desirable specifically because it is not consistent. The ordering is important for correct resource behavior, and if that was different based on combinations of changes it would not be possible to write a configuration with predictable results. You can't just rely on example situations with only a couple operations, you have to imagine these combined with hundreds of other arbitrary operations on resources, locals, variables, outputs, modules, providers, data sources, etc, all interconnected and all working in all valid permutations.

Removing the dependency between the creation and deletion isn't possible, both because we guarantee that order, and removing the order means we don't know which one will happen first. That leaves things in a nondeterministic state, where what you want may work, but only some of the time while possibly breaking something else.

@natedogith1
Author

Terraform currently guarantees an ordering between creation and deletion, but lets the user specify what that ordering is. I don't see what the issue is with not knowing the order: if the resource supports create_before_destroy, I would expect any order to work, similar to how I assume resources without dependencies between them are already handled.

@jbardin
Member

jbardin commented Sep 25, 2024

All related actions need a defined order, because they do not happen in isolation. Not being ordered might work in small examples, but causes nondeterministic behavior in complex ones. As I mentioned before, you have to incorporate this within arbitrarily large configurations, where there are other things dependent on both the destroy and the create operations, and using any generic resource type designed with whatever constraints were possible within the ordering we have guaranteed.

@natedogith1
Author

If all actions need a defined order, how does Terraform order things that have no detected dependencies between each other? I would have assumed they have no defined order and Terraform would do them in parallel.

@jbardin
Member

jbardin commented Sep 25, 2024

Yes, if there is no dependency, direct or transitive, then the operations are concurrent.

@natedogith1
Author

If Terraform supports running actions concurrently, I'm not sure what the problem would be with letting the user tell Terraform that a specific resource can be created and destroyed at the same time (similar to how the user can already tell Terraform that a resource can be created before it's destroyed).

@natedogith1
Author

Working through an a <- b <- c dependency graph where b has "create_before_destroy, except unlink b's create and destroy so it doesn't need to propagate", I see how that could cause a loop (c -> b -> a -> a_destroy -> b_destroy -> c). I think lifecycle.replace_triggered_by could be used to work around this issue without propagating the changes.
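A sketch of that replace_triggered_by workaround, using two placeholder null_resource instances standing in for real resources: "b" is forced to be replaced whenever "a" changes, so Terraform plans separate create and destroy nodes for "b" rather than an in-place update.

```hcl
resource "null_resource" "a" {
  triggers = {
    revision = "1" # bump this value to force replacement of "a"
  }
}

resource "null_resource" "b" {
  lifecycle {
    # Whenever "a" is replaced, plan a replacement of "b" as well.
    replace_triggered_by = [null_resource.a]
  }
}
```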

I still think splitting update into optional pre-delete and post-create phases would be a good idea, rather than requiring a redeploy (the provider could either assume all removals are deletions and need to happen in pre-delete, or it could be informed what Terraform is deleting (identified using something like the import id) so it could handle removal of non-destroyed items in post-create).
create_before_destroy isn't usable because common resources don't support it (e.g. google_compute_instance, google_compute_address, azurerm_linux_virtual_machine, azurerm_network_interface).

@natedogith1
Author

For google_compute_backend_service, it looks like there's an in-progress issue in the provider to work around it by updating the url map from the backend service resource: hashicorp/terraform-provider-google#4543.

@jbardin
Member

jbardin commented Sep 26, 2024

If you make the destroy operation depend on the update, then you have an impossible ordering; the cycle is a side effect of the undefined order, and simply "breaking" that cycle is usually not the correct resolution. If you then remove the edge between the create and destroy, while that may be unordered in isolation, you have implicitly recreated the same create_before_destroy ordering when connected to the update, bringing us back to the same problem you started with. replace_triggered_by cannot change the ordering, it only forces a replace action from the plan in the same order it would otherwise happen.

Changing the order also has ramifications on all upstream resource operations too, because it's reversing an edge between destroy and create subgraphs in the same way as create_before_destroy, cascading the change throughout the graph in order to resolve the ordering, and again returning to the same problem you were trying to originally solve.

There may be other valid patterns to use for certain resources, but Terraform needs to work with any resource which follows the lifecycle we've defined. Use cases outside of the defined lifecycle may also be possible, but that requires redefining what Terraform can do and the contract it has with providers.

@natedogith1
Author

I think the issue is that update is logically a combination of create and destroy, but Terraform doesn't include the destroy dependencies. I expect it can get away with this either because dependencies are often on produced values rather than on the underlying resource, or because providers are working around the problem by making changes to resources not mentioned in the plan (e.g. google_compute_disk seems to look for all VMs that use it and detaches itself from them). replace_triggered_by should serve as a workaround by forcing Terraform to use separate delete and create nodes instead of a combined update node.

@jbardin
Member

jbardin commented Sep 27, 2024

I'm not sure where you are going now, but it sounds like we are back in the realm of "redesign terraform" rather than add a specific enhancement. We have an API contract with providers to maintain the order we have defined already, changing that is a breaking change to all providers, so relatively small conceptual changes may actually require major architectural work. There will always be some use cases which can't fit into a generically defined workflow, but we've found that most resources can be made to fit this workflow which has enabled Terraform to interoperate across so many provider types.

Also take into consideration that this is a system which has been refined over the course of nearly a decade, so there is an immense body of work which has already been thoroughly considered (sometimes even mistakenly implemented and then removed). Not to say that there aren't changes that can still be made (I have #35553 and #35481 in mind here), but as time goes on, novel alterations which fit into the existing system become very few and far between.

@natedogith1
Author

Could an optional pre-destroy update be added? That doesn't seem like it should be a breaking change unless all API changes are breaking changes.

> Provider's generally can't tell when an update is going to fail because of conditions on another resource, even if you asked a provider if the plan will be OK it cannot tell you because the failure won't happen until the final API call is made. Because it's not going to be possible to predict the failure, providers would have to defensively rewrite all actions which might result in this type of change to always assume it is required.

Is it an issue if they always rewrite such changes? The provider might even be able to detect whether a change needs to be rewritten to happen in pre-destroy if it's told which resources are being destroyed (e.g. google_compute_instance could, in pre-destroy, detach only the disks that Terraform indicates are going to be destroyed).

@jbardin
Member

jbardin commented Sep 27, 2024

We don't have a mechanism to plan for multiple independent updates. You can see our replies to a similar question starting here: #31707 (comment)

The issue with providers being re-written to adopt such a change is getting developers to rewrite the providers to support a new protocol. As we've learned from supporting other complex use cases which can already be done with the current lifecycle, providers will be written via the path of least resistance, and complex code to support more rare edge cases via a new part of the protocol is unlikely to be adopted en masse (a new part of the protocol which is strictly hypothetical at this point and not proven to be possible with any appreciable number of resources).

This is all hinging on the argument that some resources don't support create_before_destroy. It seems that fixing the resources to support create_before_destroy (which they should anyway) would be far easier than a multi-year rollout of a major architectural change in the Terraform ecosystem.

@natedogith1
Author

I think lots of resources couldn't support create_before_destroy. Anything where names have to be unique (this can be worked around by adding randomness to the name, but that's not great if the name is meant to be human-readable (e.g. DNS names) or if name length limits prevent adding randomness (e.g. Active Directory doesn't support names longer than 15 characters)). Anything that expects to be linked to a single resource but can be re-linked (e.g. GCP disks can be moved between VMs but don't support being attached to multiple VMs in write mode; I don't think this can be reasonably worked around).
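The "add randomness to the name" workaround being discussed looks roughly like the following sketch (the image variable and name are illustrative, and most required arguments of google_compute_instance are omitted); its drawback is exactly the one noted above, that the resulting name is no longer fully human-chosen:

```hcl
variable "image" {
  type = string
}

resource "random_id" "suffix" {
  byte_length = 2
  keepers = {
    # Regenerate the suffix when the image changes, so the replacement
    # instance gets a fresh, non-conflicting name.
    image = var.image
  }
}

resource "google_compute_instance" "vm" {
  name = "web-${random_id.suffix.hex}" # e.g. "web-a1b2"
  # machine_type, boot_disk, network_interface, etc. omitted for brevity

  lifecycle {
    create_before_destroy = true
  }
}
```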

@jbardin
Member

jbardin commented Oct 1, 2024

While in our experience most resources can be made to work with create_before_destroy, yes it is possible that there may be resources unable to support that ordering. Not being able to somehow make create_before_destroy work in the provider does however mean that the resource would inherently have limited support in Terraform, because the create_before_destroy ordering can be imposed entirely outside of the control of the provider if it's required by other things in the dependency graph.

This means that to make use of something like this in practice, it would require multiple conditions:

  • The replaced resource that cannot use create_before_destroy has a dependent which uses a "registration pattern", requiring updates of the changed dependency before the delete can happen.
  • The dependent resource has already been rewritten to implement our yet-to-be-defined alternative update ordering protocol, and is opted in within the configuration.
  • Nothing else downstream requires create_before_destroy which is going to force the resource in question to be replaced in that order anyway.

So yes, I agree there may be an issue for a small subset of resources, but it's not a simple problem, and there is no way to just drop in a new order of operations. Rewriting large portions of the most intricate parts of Terraform, along with any of the resources which could be downstream from problematic resources in all the providers is a major undertaking. We can leave this here as an open request for research ideas, but there is unlikely to be anything we can do in the near term.

@jbardin jbardin added the "core" and "thinking" labels and removed the "new" (new issue not yet triaged) label Oct 1, 2024
@jbardin jbardin changed the title Add support for update -> delete -> create -> update ordering Research alternative lifecycles to support resources which can't work with create_before_destroy Oct 1, 2024