Removing state of dependent providers (partial fix for progressive apply) #27728
Comments
Hi @dak1n1! Thanks for sharing this use-case and proposal. Based on your write-up so far I'm afraid I'm having trouble following how this proposal differs in the implementation requirements from the existing partial apply proposal, but I suspect that my imagination is being limited by my understanding of the original proposal I wrote. If you'd be willing, it would be helpful to see a fuller example of what the user workflow would look like in the situation you're considering were we to implement your proposal, including what new configuration the user might write (if any) and what sequence of Terraform commands the user would run in order to achieve the goal you've described of replacing an existing Kubernetes cluster. Thanks again!
Thanks for looking at my proposal! I'll show you what I have in mind. My proposal differs from progressive apply in that it only solves a part of the issue, specifically for cases where you have stacked resources, like an EKS cluster with Kubernetes resources stacked on top. Since Kubernetes users hit this issue so often, I wanted to support their use case; it seems like a smaller subset of the larger problem, and may have a simpler solution than the one proposed in the Progressive Apply issue. My goal is to achieve this in the simplest manner possible, without making changes to the user's workflow or any big architectural changes to Terraform. I didn't intend to impose my own implementation idea here... but I'll give it a try. Though, disclaimer: I don't know the Terraform Core architecture at all; I'm only familiar with provider development.

Here is the scenario: a user has two modules, an eks-cluster module that creates the cluster, and a kubernetes-config module containing the Kubernetes/Helm resources that run on it. The user makes an update to their eks-cluster module which will cause the cluster to be recreated, which in turn causes the apply to fail, because the Kubernetes provider is initialized with the old cluster's credentials. But if we were to establish a kind of dependency between the Kubernetes provider and the underlying cluster, such as in the example below, we could tell Terraform to delete the state of this provider any time the dependent resource is re-created.
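As a rough sketch (not a concrete design), such a dependency could be expressed on the provider block. `replace_on_change` here is only the option proposed in this issue, and the eks-cluster module outputs are illustrative:

```hcl
# Hypothetical sketch: "replace_on_change" does not exist in Terraform today,
# and the eks-cluster module outputs shown here are illustrative.
provider "kubernetes" {
  host                   = module.eks-cluster.cluster_endpoint
  cluster_ca_certificate = base64decode(module.eks-cluster.cluster_ca_cert)
  token                  = module.eks-cluster.token

  # Proposed behavior: when the referenced cluster is planned for replacement,
  # remove every resource owned by this provider from state, so the resources
  # are re-created on the new cluster instead of being read with outdated
  # credentials.
  replace_on_change = module.eks-cluster
}
```

With something like this, the same single `terraform apply` that replaces the cluster would also re-create all of the Kubernetes resources from a clean state.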
This new field would trigger the deletion of the state for this provider when the resource it depends on is marked for deletion. By deleting the state of the Kubernetes resources, the provider will never initialize with outdated credentials. This will solve the issue for anyone who is replacing a Kubernetes cluster: the apply will succeed the same as if it were the initial apply.

Right now, the initial apply works reliably. But if you try to replace the underlying cluster, it will always fail, because during the plan phase, Terraform tries to read the Kubernetes resources using outdated credentials. My proposal is to create the conditions of an "initial apply" during resource deletion. This will also solve the problem of deleting a cluster that has outdated credentials. By deleting the state of the dependent resources, we can cover an edge case that is not solved by #27408 (specifically, the case of long-running applies where the token expires before the Kubernetes resources can be deleted). The idea of stacking resources is also used by AWS CloudFormation.
This proposal is a similar idea to a CloudFormation stack (though not identical). It's just giving users the option to tell Terraform "don't worry about deleting these resources; they will disappear when the underlying cluster is deleted". It's similar to deleting an RDS instance that hosts a database: the database itself automatically disappears once the underlying RDS VM is removed. There's no need to call delete on all those dependent resources explicitly; removing them from state is adequate. I know this isn't a fix for the whole issue faced with Progressive Apply, but I figured it might bring some faster relief to users who are struggling. Thanks for reading!
I actually would like to close this issue, after doing some further reflection about this approach. While it seemed useful to break down this large, complex problem into a tiny chunk that could be solved, I'm not happy with the approach I'm proposing here. Specifically, it's because I ran into a scenario where simply deleting the Kubernetes provider's state prior to deleting the EKS cluster wasn't an adequate solution. One of the Kubernetes resources had created other cloud infrastructure, which was left orphaned with this approach. (Potentially, both Load Balancers and cloud storage volumes could be orphaned when the associated Kubernetes resources are not deleted properly.) So I have a different idea that I think could be more effective. But it will take quite some time to prioritize collecting the information needed. TL;DR: I'll be back with better data at a later time. Thanks!
I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.
Current Terraform Version
Use-cases
I'm one of the maintainers of the Kubernetes and Helm providers. All Kubernetes and Helm resources are built on top of a Kubernetes cluster. I would like to be able to express this dependency in a Terraform config, so that users have a more intuitive experience with our providers. Many users want to do this in a single apply, which is currently not possible.
A Kubernetes module or resource depends on up-to-date credentials from a cloud-managed cluster such as EKS, GKE, or AKS. The goal is to have all Kubernetes resources created on the new cluster when the underlying cluster is replaced, and to avoid initializing the provider with outdated credentials during this replacement process.
Attempted Solutions
I have some example Terraform modules for AKS, EKS, and GKE. Any one of them can be used as a reproducer for this issue. They each have a README which describes how to replace the cluster.
In order to successfully replace a cluster without hitting a progressive apply issue, the user needs to manually run `terraform state rm module.kubernetes-config`. This removes all resources owned by the Kubernetes and Helm providers. Starting with a clean state like this allows the `terraform apply` to replace the underlying cluster and create all the Kubernetes/Helm resources from scratch on the new cluster. Without this work-around, credentials from the old cluster are loaded into the Kubernetes/Helm providers (or perhaps omitted entirely), and the apply fails with authentication-related errors. This is one of the most common issues faced by our users.

Currently, the work-around mentioned above is the only way to get a single-apply scenario to work. Alternatively, users can apply the Kubernetes/Helm changes separately from the underlying cluster. But targeting the underlying cluster module during apply (`terraform apply -target=module.aks-cluster`) does not seem to be adequate when there are Kubernetes/Helm resources in state already. It doesn't work for cluster replacement, specifically.

We also can't work around it by adding a "recreate trigger" like the null provider has, because the problem comes into existence the moment the provider is initialized with the outdated credentials. So basically we're looking for a way to completely defer reading the Kubernetes/Helm resources until the new cluster exists. Removing the Kubernetes/Helm resources from state has been the only way to accomplish this so far.
Proposal
Have an option to remove all existing state for a provider or module.
My goal with this proposal is to make sure that the information being passed into the dependent provider is fully up-to-date before initializing the dependent provider. If that is impossible, then removing the state for the dependent provider seems to be a sufficient mechanism for accomplishing the same thing.
The new config option could look like `replace_on_change`. This would allow us to mark the Kubernetes resources as having been destroyed by the change to the EKS/GKE/AKS cluster. (This is literally what happens on the cluster... the delete/recreate of the cluster destroys all dependent resources on that cluster.) So in effect, the Kubernetes provider wouldn't initialize until after the new cluster exists. It would remove all of the dependent resources from state, so that they can be re-created on the new cluster.
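Since the option could apply to either a provider or a module, a module-level variant might look like the following hypothetical sketch (`replace_on_change` is only the proposed option, and the module layout mirrors the example modules mentioned above):

```hcl
# Hypothetical sketch: "replace_on_change" is the option proposed in this
# issue, not an existing Terraform feature; module and variable names are
# illustrative.
module "kubernetes-config" {
  source       = "./kubernetes-config"
  cluster_name = module.eks-cluster.cluster_name

  # Proposed behavior: if the cluster this module depends on is replaced,
  # drop all of this module's resources from state so they are re-created
  # on the new cluster rather than read with the old cluster's credentials.
  replace_on_change = module.eks-cluster
}
```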
Some amount of provider dependency exists already. That's how we can create the Kubernetes cluster and then pass the authentication credentials from the cloud provider (aws, google, azure, etc) into the Kubernetes provider during the initial create.
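For context, a minimal sketch of that existing credential-passing pattern for EKS (the data sources are real, but the cluster name here is illustrative):

```hcl
# A sketch of the existing credential-passing pattern described above; the
# cluster name is illustrative.
data "aws_eks_cluster" "cluster" {
  name = "example-cluster"
}

data "aws_eks_cluster_auth" "cluster" {
  name = "example-cluster"
}

provider "kubernetes" {
  host                   = data.aws_eks_cluster.cluster.endpoint
  cluster_ca_certificate = base64decode(data.aws_eks_cluster.cluster.certificate_authority[0].data)
  token                  = data.aws_eks_cluster_auth.cluster.token
}
```

The trouble described in this issue arises when those values come from a cluster that is about to be replaced, or from a token that has since expired.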
This pattern also works for destroys, but only if the information being passed into the dependent provider hasn't changed (host, certs, token, etc.). Otherwise, we actually see failures with that too (the GKE or EKS token expires, and then `terraform destroy` fails). So having a way to express this dependency would also benefit us during destroys.

References
This seems like an issue that impacts a lot of users. Here's what I found by just browsing for a couple of hours; I'm sure there are many more:
helm_resource state refresh terraform-provider-helm#315 (comment)