
Destroy fails with Error: Unauthorized when removing kubernetes resources and access token is used. #27741

Closed
jaceq opened this issue Feb 11, 2021 · 7 comments
Labels
bug, new (new issue not yet triaged)

Comments

@jaceq

jaceq commented Feb 11, 2021

Terraform Version

0.14.5

But this seems to affect other 0.14.x versions as well.

Terraform Configuration Files

data "google_client_config" "XXX" {}

provider "kubernetes" {
  host                   = "https://${google_container_cluster.XXX.endpoint}"
  token                  = data.google_client_config.XXX.access_token
  cluster_ca_certificate = base64decode(google_container_cluster.XXX.master_auth.0.cluster_ca_certificate)
}

Crash Output

Error: Unauthorized

Expected Behavior

A destroy operation should succeed.

Actual Behavior

During the destroy operation, the error:

Error: Unauthorized

shows up; after that, resources from the kubernetes provider remain in state (and aren't deleted).
Things get more complicated when the same state contains a GKE / EKS cluster, as ours does, and we end up in a situation where the cluster itself gets deleted but the kubernetes resources remain in state. Given that at that stage the cluster isn't there anymore, the kubernetes provider fails and this renders the state unusable.

Steps to Reproduce

Configure a state with a GKE / EKS cluster and a couple of kubernetes resources, and build it.
Use an 'access token' to configure the kubernetes provider.
Wait for at least one hour!
Try to destroy.

Additional Context

Within the discussion here: terraform-aws-modules/terraform-aws-eks#1162, someone figured this out.
Long story short, it seems that as of terraform 0.14 there is NO refresh operation done before destroy.
This leads to a situation where the data source:

data "google_client_config" "XXX" {}

isn't refreshed, and hence no new token for the kubernetes provider is fetched. (Given that the token seems to be valid for 1 hour, an apply and destroy operation will succeed if they are done within that timeframe.)

A workaround is to run a refresh manually before the destroy operation.

References

terraform-aws-modules/terraform-aws-eks#1162

@jbardin
Member

jbardin commented Feb 11, 2021

Hi @jaceq

Since the error mentioned here is coming from the provider, and terraform can only safely remove resources from the state when the provider reports them as being removed, there's not much that can be done from within terraform itself. I believe the ability to remove all state related to a particular provider would be covered by the request in #27728.

Attempting to create and destroy resources when a provider itself depends on those resources is not recommended and can be quite difficult to achieve with the design of terraform. There are probably more users in the community forum familiar with these multi-layered setups for EKS, which may be a better source of information. We use GitHub issues for tracking bugs and enhancements, rather than for questions.

With that out of the way, it's not clear exactly what is failing in this case, other than the provider returning an error. If this is a case of the provider configuration being stale and needing a new token, does running terraform refresh immediately before destroy prevent the failure? If this is failing later during the destroy operation, we would need the log output to see where the issue might be.

As mentioned above, destroying the infrastructure that the provider itself depends on is often likely to fail, and in these cases the usual recommendation is to have multiple configurations: one to set up the base infrastructure, and one to deploy the additional layer on top of that infrastructure.
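
For illustration, a minimal sketch of that split could look like the following (the backend settings, bucket name, and output names below are hypothetical, not taken from this issue): the first configuration owns the cluster and exports its connection details, and the second configuration reads them via terraform_remote_state and configures the kubernetes provider there.

# Configuration 1: creates the GKE cluster and exports its connection details
output "endpoint" {
  value = google_container_cluster.XXX.endpoint
}

output "ca_certificate" {
  value = google_container_cluster.XXX.master_auth.0.cluster_ca_certificate
}

# Configuration 2: applied separately, once the cluster exists
data "terraform_remote_state" "cluster" {
  backend = "gcs"
  config = {
    bucket = "my-tf-state" # hypothetical bucket name
    prefix = "cluster"
  }
}

data "google_client_config" "current" {}

provider "kubernetes" {
  host                   = "https://${data.terraform_remote_state.cluster.outputs.endpoint}"
  token                  = data.google_client_config.current.access_token
  cluster_ca_certificate = base64decode(data.terraform_remote_state.cluster.outputs.ca_certificate)
}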

@jbardin jbardin added the waiting-response (An issue/pull request is waiting for a response from the community) label Feb 11, 2021
@jaceq
Author

jaceq commented Feb 11, 2021

Hi @jbardin

Thanks for quick response.
This issue is not coming from the provider (and also didn't exist in terraform versions before 0.14); it comes from the fact that apparently there is no 'refresh' done prior to executing 'destroy'.
What happens, basically, is that the data source used to generate an access token (which in turn is used to delete those resources) isn't re-read, so the token is stale, and 'destroy' operations on the affected resources fail due to unauthorized requests being made. It's an issue in the logical flow of actions in the terraform binary itself, from what I understand.
btw. I use GKE (but people with EKS have the same issues); if you have access to GKE I can compose a very simple state where this fails.
All in all, from what I see, data sources are not re-read prior to destroy (please correct me if I'm wrong on that) - and this is what breaks things.

@ghost ghost removed the waiting-response (An issue/pull request is waiting for a response from the community) label Feb 11, 2021
@jaceq
Author

jaceq commented Feb 11, 2021

Just to be more clear (as I get why you got the idea this is provider-based): currently it seems that if a data-source-based token is used for authorization (for any provider), any deletions of resources belonging to such a provider will fail (if done after the TTL of the previous token has expired).

@dak1n1

dak1n1 commented Feb 11, 2021

This commit is scheduled to go into Terraform 0.15 according to the changelog. I believe this will resolve the issue with authentication during destroys (in most cases), since it will refresh the data source containing the Kubernetes credentials prior to the destroy.

However, we will still hit this case during long-running applies/destroys, since I believe the data source is only refreshed once during an apply/destroy. An example of this failing is when you're using EKS with a long-running apply or destroy. An EKS token is only valid for 15 minutes, so if the apply or destroy runs for longer than that, we'll still hit this issue until progressive apply is solved.

For these reasons, it is easiest to keep the Kubernetes resources in a separate state from the underlying cluster, and use two applies. However, if you really need a single-apply configuration, we have some examples in the Kubernetes provider repo that demonstrate working configurations for AKS, EKS, and GKE.

If you have the option of using an exec block like this, it can ensure your token is always up-to-date, but this only works if you're able to install the binary on the system running Terraform:

provider "kubernetes" {
  host                   = var.cluster_endpoint
  cluster_ca_certificate = base64decode(var.cluster_ca_cert)
  exec {
    api_version = "client.authentication.k8s.io/v1alpha1"
    args        = ["eks", "get-token", "--cluster-name", var.cluster_name]
    command     = "aws"
  }
}

I still need to add the equivalent of this to the GKE and AKS examples, though. There is more work to be done there.
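
For GKE, one possible equivalent is to use an exec block as well. The sketch below is an illustration rather than the finished example mentioned above, and it assumes the gke-gcloud-auth-plugin binary (shipped with the Google Cloud SDK) is installed on the machine running Terraform:

provider "kubernetes" {
  host                   = "https://${google_container_cluster.XXX.endpoint}"
  cluster_ca_certificate = base64decode(google_container_cluster.XXX.master_auth.0.cluster_ca_certificate)
  exec {
    # The plugin is invoked whenever the provider needs credentials,
    # so the token is refreshed instead of going stale (assumed behavior
    # of exec-based auth, as with the EKS example above).
    api_version = "client.authentication.k8s.io/v1beta1"
    command     = "gke-gcloud-auth-plugin"
  }
}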

@jaceq
Author

jaceq commented Feb 11, 2021

Wow @dak1n1, indeed it seems the commit you mentioned will solve this issue.
For now, given we use a wrapper around the terraform binary, I am forcing an explicit refresh before destroy.
I will test that properly tomorrow (given it's evening in my timezone now), as I will be deleting a couple of states.
As for the short validity of the token, that could possibly also be an issue overall; however (and I am not sure how long a GKE token is valid), in my setup this doesn't affect me.

@jbardin
Member

jbardin commented Feb 11, 2021

Thanks for the additional info @dak1n1! That is the PR I was about to mention.

I'm going to close this as the initial issue reported is a duplicate of #27172. @dak1n1 did a great job of summing up the other considerations, and we have open proposals already for improving the workflow in general. Since any major change to the workflow is unlikely in the near term due to the large architectural changes required, we still suggest using separate configurations as shown in the linked documentation.

Thanks!

@jbardin jbardin closed this as completed Feb 11, 2021
@ghost

ghost commented Mar 14, 2021

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.

If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@ghost ghost locked as resolved and limited conversation to collaborators Mar 14, 2021