
v2.0.1 Authentication failures with token retrieved via aws_eks_cluster_auth #1131

Closed
tomaspinho opened this issue Jan 26, 2021 · 37 comments

@tomaspinho
Contributor

Terraform Version, Provider Version and Kubernetes Version

Terraform version: 0.12.24
Kubernetes provider version: 2.0.1
Kubernetes version: v1.16.15-eks-ad4801

Affected Resource(s)

Terraform Configuration Files

data "aws_eks_cluster" "c" {
  name = var.k8s_name
}

data "aws_eks_cluster_auth" "c" {
  name = var.k8s_name
}

provider "kubernetes" {
  host = data.aws_eks_cluster.c.endpoint

  cluster_ca_certificate = base64decode(data.aws_eks_cluster.c.certificate_authority.0.data)

  token = data.aws_eks_cluster_auth.c.token
}

Debug Output

Panic Output

Steps to Reproduce

Expected Behavior

What should have happened?
Resources should have been created/modified/deleted.

Actual Behavior

What actually happened?

Error: the server has asked for the client to provide credentials
Error: Failed to update daemonset: Unauthorized
Error: Failed to update deployment: Unauthorized
Error: Failed to update deployment: Unauthorized
Error: Failed to update service account: Unauthorized
Error: Failed to update service account: Unauthorized
Error: Failed to delete Job! API error: Unauthorized
Error: Failed to update service account: Unauthorized
Error: the server has asked for the client to provide credentials
Error: the server has asked for the client to provide credentials
Error: Failed to update deployment: Unauthorized
Error: Failed to update service account: Unauthorized
Error: the server has asked for the client to provide credentials
Error: Failed to delete Job! API error: Unauthorized
Error: Failed to update daemonset: Unauthorized

Important Factoids

No, we're just using EKS.

References

Community Note

  • Please vote on this issue by adding a +1 reaction to the original issue to help the community and maintainers prioritize this request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
@tomaspinho tomaspinho added the bug label Jan 26, 2021
@angelabad

Hi, same problem here with Terraform v0.14.5, but different error message:

Error: Kubernetes cluster unreachable: invalid configuration: no configuration has been provided, try setting KUBERNETES_MASTER environment variable

And the configuration is the same as with the previous provider version.

provider "kubernetes" {
  host                   = data.aws_eks_cluster.eks.endpoint
  cluster_ca_certificate = base64decode(data.aws_eks_cluster.eks.certificate_authority[0].data)
  token                  = data.aws_eks_cluster_auth.eks.token
}

@tomaspinho tomaspinho changed the title v2.0.1 Authentication failures with token retried via aws_eks_cluster_auth v2.0.1 Authentication failures with token retrieved via aws_eks_cluster_auth Jan 27, 2021
@dak1n1
Contributor

dak1n1 commented Feb 9, 2021

Can you try running terraform refresh to see if that pulls in a new token? The token generated by aws_eks_cluster_auth is only valid for 15 minutes. For this reason, we recommend using an exec plugin to keep the token up to date automatically. Here's an example of that configuration:

provider "kubernetes" {
  host                   = data.aws_eks_cluster.eks.endpoint
  cluster_ca_certificate = base64decode(data.aws_eks_cluster.eks.certificate_authority[0].data)
  exec {
    api_version = "client.authentication.k8s.io/v1alpha1"
    args        = ["eks", "get-token", "--cluster-name", var.cluster_name]
    command     = "aws"
  }
}

Alternatively, running the Kubernetes provider in separate terraform apply from the EKS cluster creation should work every time. (I'm not sure offhand if your EKS cluster is being created in the same apply, but just guessing since it's a common configuration).
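
If you do want to keep everything in one configuration, one rough way to approximate that two-apply flow is a targeted first apply. This is just a sketch, assuming the cluster lives in a module named module.eks:

# First apply: create only the EKS cluster and its dependencies
terraform apply -target=module.eks

# Second apply: everything else, now that the cluster endpoint and credentials are known
terraform apply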

There's also a working EKS example you can compare with your configs. There are some improvements coming soon for the example, since we're working on related authentication issues.

@nikitazernov

@dak1n1 I am considering this as a temporary workaround.

@tomaspinho
Contributor Author

Can you try running terraform refresh to see if that pulls in a new token? The token generated by aws_eks_cluster_auth is only valid for 15 minutes. For this reason, we recommend using an exec plugin to keep the token up to date automatically. Here's an example of that configuration:

provider "kubernetes" {
  host                   = data.aws_eks_cluster.eks.endpoint
  cluster_ca_certificate = base64decode(data.aws_eks_cluster.eks.certificate_authority[0].data)
  exec {
    api_version = "client.authentication.k8s.io/v1alpha1"
    args        = ["eks", "get-token", "--cluster-name", var.cluster_name]
    command     = "aws"
  }
}

Alternatively, running the Kubernetes provider in separate terraform apply from the EKS cluster creation should work every time. (I'm not sure offhand if your EKS cluster is being created in the same apply, but just guessing since it's a common configuration).

There's also a working EKS example you can compare with your configs. There are some improvements coming soon for the example, since we're working on related authentication issues.

Not sure about the 15mins issue, as we've been using this provider for almost a year now and the token validity has never been a problem. In fact, downgrading the provider to <2.0 works as expected.

I'll try force refreshing the token and report back the results.

@dak1n1
Contributor

dak1n1 commented Feb 10, 2021

Not sure about the 15mins issue, as we've been using this provider for almost a year now and the token validity has never been a problem. In fact, downgrading the provider to <2.0 works as expected.

I'll try force refreshing the token and report back the results.

Thanks! And about the downgrade fixing this -- that makes sense. Depending on your provider configuration, prior to 2.0, the Kubernetes provider may have actually been reading the KUBECONFIG environment variable (despite your valid configuration which includes a token and does not reference the kubeconfig file). This was a source of confusion that we were aiming to alleviate. The authentication workflow still needs some work though.

@tomaspinho
Contributor Author

Not sure about the 15mins issue, as we've been using this provider for almost a year now and the token validity has never been a problem. In fact, downgrading the provider to <2.0 works as expected.

I'll try force refreshing the token and report back the results.

Thanks! And about the downgrade fixing this -- that makes sense. Depending on your provider configuration, prior to 2.0, the Kubernetes provider may have actually been reading the KUBECONFIG environment variable (despite your valid configuration which includes a token and does not reference the kubeconfig file). This was a source of confusion that we were aiming to alleviate. The authentication workflow still needs some work though.

The KUBECONFIG issue is not present in our environment as we run Terraform in GitLab CI and never use that file to authenticate to clusters from it.

@loungerider

loungerider commented Feb 10, 2021

Terraform version: 0.14.5
Kubernetes provider version: 2.0.2
Kubernetes version: v1.18.9

I tried an apply with a clean state using the exec instead of the token in the kubernetes provider on the initial run when the eks cluster is created. I get the same Error: Unauthorized results for both when trying to apply my kubernetes resources.

Using the exec

terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 3.26.0"
    }
    kubernetes = {
      source  = "hashicorp/kubernetes"
      version = "~> 2.0.2"
    }
  }
}

provider "aws" {
  region = var.region
}

data "aws_eks_cluster_auth" "cluster_token" {
  name = module.eks.name
}
provider "kubernetes" {
  host                   = module.eks.endpoint
  cluster_ca_certificate = base64decode(module.eks.certificate)
  exec {
    api_version = "client.authentication.k8s.io/v1alpha1"
    args        = ["eks", "get-token", "--cluster-name", module.eks.name]
    command     = "aws"
  }
}

The kubernetes resources are created correctly on a retry of the pipeline as stated in the comments above; using the token or exec method.

@dak1n1
Contributor

dak1n1 commented Feb 17, 2021

Terraform version: 0.14.5
Kubernetes provider version: 2.0.2
Kubernetes version: v1.18.9

I tried an apply with a clean state using the exec instead of the token in the kubernetes provider on the initial run when the eks cluster is created. I get the same Error: Unauthorized results for both when trying to apply my kubernetes resources.

Using the exec

terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 3.26.0"
    }
    kubernetes = {
      source  = "hashicorp/kubernetes"
      version = "~> 2.0.2"
    }
  }
}

provider "aws" {
  region = var.region
}

data "aws_eks_cluster_auth" "cluster_token" {
  name = module.eks.name
}
provider "kubernetes" {
  host                   = module.eks.endpoint
  cluster_ca_certificate = base64decode(module.eks.certificate)
  exec {
    api_version = "client.authentication.k8s.io/v1alpha1"
    args        = ["eks", "get-token", "--cluster-name", module.eks.name]
    command     = "aws"
  }
}

The kubernetes resources are created correctly on a retry of the pipeline as stated in the comments above; using the token or exec method.

@loungerider Thanks for testing this. I believe the issue in your case has to do with certain parameters passed into the Kubernetes provider which are unknown at the time of the provider initialization. I'm guessing module.eks.endpoint is unknown at plan time, but also the data source is probably being read too soon.

In the data source, the value of name = module.eks.name is likely known before the cluster is ready. So the data source will read the cluster too early, and pass invalid credentials into the Kubernetes provider. I'll show you an example that will make the data source wait until the cluster is ready:

data "aws_eks_cluster" "default" {
  name = module.eks.cluster_id
}

# This data source is only needed if you're passing the token into the provider using `token =`.
data "aws_eks_cluster_auth" "default" {
  name = module.eks.cluster_id
}

provider "kubernetes" {
  # This defers provider initialization until the cluster is ready
  host                   = data.aws_eks_cluster.default.endpoint
  cluster_ca_certificate = base64decode(data.aws_eks_cluster.default.certificate_authority[0].data)

  # This keeps the token up-to-date during subsequent applies, even if they run longer than the token TTL.
  exec {
    api_version = "client.authentication.k8s.io/v1alpha1"
    args        = ["eks", "get-token", "--cluster-name", module.eks.name]
    command     = "aws"
  }
}

I'm assuming you're using the EKS module here, which has an output that waits for the cluster API to be ready (cluster_id). That's why the data source needs to know about cluster_id. Another option would be to add a depends_on explicitly to wait for this field (depends_on = [module.eks.cluster_id])
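
As a rough sketch of that depends_on variant (an assumption to adapt: recent Terraform versions require depends_on to reference a whole module or resource, so this uses module.eks rather than an individual output):

data "aws_eks_cluster" "default" {
  name = module.eks.cluster_id

  # wait for the whole EKS module before reading the cluster details
  depends_on = [module.eks]
}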

I also added a data source to read the cluster's hostname and CA cert data, so it will be able to read the new hostname and certs, if those ever change, such as on the first apply, or during cluster replacement.

Although a single-apply scenario like this is less reliable than running apply twice, it is possible to do; it just has these gotchas to be aware of.

@jw-maynard

jw-maynard commented Feb 21, 2021

@dak1n1 I'm getting the same errors with the following:

Terraform version: 0.14.6
Kubernetes provider version: 2.0.2
EKS version: v1.18.9 -> v1.19.6

As you can see, the only change I'm attempting is to upgrade EKS from 1.18 to 1.19. Without posting all the code, here are the relevant portions:

resource "null_resource" "wait_for_cluster" {
  depends_on = [aws_eks_cluster.cluster]

  provisioner "local-exec" {
    command     = "for i in `seq 1 60`; do if `command -v wget > /dev/null`; then wget --no-check-certificate -O - -q $ENDPOINT/healthz >/dev/null && exit 0 || true; else curl -k -s $ENDPOINT/healthz >/dev/null && exit 0 || true;fi; sleep 5; done; echo TIMEOUT && exit 1"
    interpreter = ["/bin/sh", "-c"]
    environment = {
      ENDPOINT = aws_eks_cluster.cluster.endpoint
    }
  }
}

data "aws_eks_cluster" "eks_cluster" {
  name       = aws_eks_cluster.cluster.name
  depends_on = [null_resource.wait_for_cluster]
}

data "aws_eks_cluster_auth" "eks_cluster" {
  name       = aws_eks_cluster.cluster.name
  depends_on = [null_resource.wait_for_cluster]
}

provider "kubernetes" {
  host                   = data.aws_eks_cluster.eks_cluster.endpoint
  cluster_ca_certificate = base64decode(data.aws_eks_cluster.eks_cluster.certificate_authority.0.data)
  token                  = data.aws_eks_cluster_auth.eks_cluster.token
}

provider "helm" {
  kubernetes {
    host                   = data.aws_eks_cluster.eks_cluster.endpoint
    token                  = data.aws_eks_cluster_auth.eks_cluster.token
    cluster_ca_certificate = base64decode(data.aws_eks_cluster.eks_cluster.certificate_authority.0.data)
  }
}

My module follows the same conventions as the module you mentioned above, except that I'm using the token instead of the exec method. We use Terraform Cloud for our workflow and I don't believe the AWS CLI is installed on those workers. The docs also warn against trying to install extra software on workers, and even if you decide to ignore that advice, doing so is kinda hacky to say the least. So IMO using the AWS CLI to generate creds should not be a solution to this issue.

I've tried running this multiple times and always get errors like these:

Error: Kubernetes cluster unreachable: invalid configuration: no configuration has been provided, try setting KUBERNETES_MASTER environment variable
Error: Get "http://localhost/apis/rbac.authorization.k8s.io/v1/namespaces/default/rolebindings/edit": dial tcp 127.0.0.1:80: connect: connection refused

My first question would be: is the token being stored somewhere in the state? I would assume the data source would be refreshed every run in case something changed (in this case I assume the token would be new with every run), so the 15-minute expiration should only be an issue on initial cluster creation, where the token is created before the cluster. In the case above I would assume that should never happen due to the dependency chain of aws_eks_cluster -> null_resource -> aws_eks_cluster_auth.

If the token is refreshed every time, then why am I seeing this error when specifying an upgrade to an already provisioned cluster? The upgrade does not change the cluster name; it should be updated in place. The existing cluster should be there, so the token should be created and the provider should be able to read the cluster state and make an appropriate plan. I also find it very curious that I don't see any errors like this related to resources provisioned by the helm provider. I don't know if that's because the errors in the kubernetes provider are ending the plan before it gets to helm, or if there is something different in how Helm does things that dodges this issue.

I may try downgrading my provider to < 2.0 to see if this works there. If that's the case, it's not a hidden KUBECONFIG file issue as you mentioned above, because we run this on TFC and don't generate a KUBECONFIG file in our TF code for clusters. If I do try this, I will try to remember to post the results here.

@ghost ghost removed waiting-response labels Feb 21, 2021
@jw-maynard

Did some further digging and we may be barking up the wrong tree: hashicorp/terraform-provider-aws#10269 (comment)

@dak1n1
Contributor

dak1n1 commented Feb 23, 2021

@jw-maynard I'm glad you found that other issue! It sounds like the EKS cluster could be getting replaced rather than updated in-place. Could you do a terraform plan to confirm this? (There should be a line that tells you if a change "forces replacement").

What I saw in your configuration is what we call a "single apply" scenario (that is, a configuration which contains both the EKS cluster (aws_eks_cluster.cluster) and the Kubernetes resources that will live on that cluster). In a single apply scenario, any replacement of an underlying Kubernetes cluster will cause the Kubernetes provider to fail to initialize, unless you do a specific workaround that I'll mention below.

This is a known limitation in Terraform core, which I recently saw described well in this comment. It's a problem any time you have a provider that depends on a resource (in this case, the Kubernetes provider is dependent on information from aws_eks_cluster.cluster, which is read from the data source... but that information is not available when the provider is initialized, because, presumably, the cluster is getting replaced).

If an underlying Kubernetes cluster is going to be replaced, and you already have Kubernetes resources provisioned using the Kubernetes provider, you'll have to work around this issue by doing a terraform state rm on the module containing all the Kubernetes resources (there's an example here). That way the Kubernetes resources will be recreated on the new cluster, and the terraform plan will succeed. Otherwise, the provider tries to initialize using an empty credentials block, since it does not yet know the credentials associated with the cluster being replaced.
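
For reference, a minimal sketch of that workaround; the module address is just a placeholder for wherever your Kubernetes resources actually live:

# Drop the Kubernetes resources from state so the plan no longer tries to refresh them
# against a cluster that is being replaced
terraform state rm 'module.kubernetes_resources'

# The next apply will then recreate those resources on the new cluster
terraform apply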

This workaround is only needed in single-apply scenarios where you have the cluster and the Kubernetes resources sharing a single state. In general, it's more reliable to keep the Kubernetes resources in a separate state from the EKS cluster resource (for example, a different workspace in TFC, or a different root module). Two applies will work every time, but a single apply involves some work-arounds, depending on the scenario.

@jw-maynard

@dak1n1 It never gets that far because the plan errors, but I know that version upgrades in EKS are an update-in-place scenario for sure. I guess they could have introduced a bug in the aws provider, but I don't think so.

I did a lot of digging around in logs at the TRACE level for this plan and found some differences in how a successful plan handles the two data sources compared to how it handles them in a plan where I try to upgrade the version. Unfortunately I'm not familiar enough with the inner workings of TF and its providers to know if this is fixable in the provider or not. I'm happy to share my findings privately with anyone at HashiCorp who's willing to listen. Single apply scenarios seem to be something that a fair number of people would like to be able to do when working with Kubernetes on cloud providers.

I can share what I think is the difference in the two runs. The failed one ends up here for both EKS data sources (I'm only sharing aws_eks_cluster_auth, but aws_eks_cluster has the same log line):

2021/02/21 20:39:29 [TRACE] evalReadDataPlan: module.kubernetes_cluster.data.aws_eks_cluster_auth.eks_cluster configuration is fully known, but we're forcing a read plan to be created

This appears to be coming from https://github.com/hashicorp/terraform/blob/618a3edcd13f5231a77a699b7ba2a3fba352b7a3/terraform/eval_read_data_plan.go#L65, which tells me that n.forcePlanRead(ctx) is true. Since the successful runs hit a log line that comes from L107 (linked below), it seems the failures run into something inside the if block from L63 to L103 and fall apart there.

In a working run where the version is not updated, I don't see the above at all, but I do see this:

2021/02/21 20:37:10 [TRACE] EvalReadData: module.kubernetes_cluster.data.aws_eks_cluster_auth.eks_cluster configuration is complete, so reading from provider
2021/02/21 20:37:10 [TRACE] GRPCProvider: ReadDataSource
2021-02-21T20:37:10.945Z [INFO]  plugin.terraform-provider-aws_v3.29.0_x5: 2021/02/21 20:37:10 [DEBUG] Reading EKS Cluster: {
  Name: "kubernetes01"
}: timestamp=2021-02-21T20:37:10.943Z
2021/02/21 20:37:10 [WARN] Provider "registry.terraform.io/hashicorp/aws" produced an unexpected new value for module.kubernetes_cluster.data.aws_eks_cluster_auth.eks_cluster.
      - .token: inconsistent values for sensitive attribute

Then there's a call to eks/DescribeCluster. This EvalReadData appears to be logged inside readDataSource here: https://github.com/hashicorp/terraform/blob/618a3edcd13f5231a77a699b7ba2a3fba352b7a3/terraform/eval_read_data_plan.go#L107

So in the failed state it seems like the data source is not even updating, for some reason. That's odd, considering the cluster would be updated in place. The fact that there's no read of the data source in the failure, when something is changing, makes me feel like there's a logic bug somewhere, maybe in core, but I don't feel knowledgeable enough to articulate it in an issue over there.

All that being said, I am aware of the pitfalls of single-apply scenarios, and this certainly may be one of those issues. The unfortunate part is that, as with the EKS module you posted above, there are some things in EKS that require managing resources inside the cluster (aws-auth being a notable one), and it seems clunky to have to use two modules to fully provision one resource (EKS) to our specs.

@loungerider

@dak1n1 This config worked for me. Thanks!

data "aws_eks_cluster" "default" {
  name       = module.eks.name
  depends_on = [module.eks.name]
}

data "aws_eks_cluster_auth" "default" {
  name = module.eks.name
}

provider "kubernetes" {
  host                   = data.aws_eks_cluster.default.endpoint
  cluster_ca_certificate = base64decode(data.aws_eks_cluster.default.certificate_authority[0].data)
  exec {
    api_version = "client.authentication.k8s.io/v1alpha1"
    args        = ["eks", "get-token", "--cluster-name", module.eks.name]
    command     = "aws"
  }
}

@albertrdixon

Using exec is not a viable solution when running in terraform cloud using remote execution. Our current thinking is to implement a workaround to essentially taint the aws_eks_cluster_auth data source so it gets refreshed for every plan. It would be ideal if the kubernetes provider had native support for getting and refreshing managed kubernetes service authentication tokens / credentials in order to support environments in which the only guaranteed tooling is terraform itself.

@vitali-s

We faced the same issue when running destroy (introduced in Terraform 0.14). Actually, multiple providers are affected: helm, kubernetes, kubernetes-alpha. In 0.14, data sources are no longer refreshed on destroy, which causes provider issues; this was implemented as part of:
hashicorp/terraform#15386

Related issue is (which is closed):
hashicorp/terraform#27172

For example, any provider using the aws_eks_cluster_auth data source will fail on destroy:

data "aws_eks_cluster_auth" "cluster" {
  name = var.cluster_name
}

The proposed workaround is to run plan or refresh (which may not be the best solution for every team).
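
For example, a sketch of that workaround before a destroy (the -refresh-only flag assumes Terraform 0.15.4 or newer; older versions can use terraform refresh instead):

# Re-read data sources (including aws_eks_cluster_auth) so the state holds a fresh token
terraform apply -refresh-only
# or, on older versions:
terraform refresh

# Then run the destroy
terraform destroy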

@ayalaluquez

I had a similar problem running an apply from CI/CD.

Error: XXXX failed to create kubernetes rest client for update of resource: Unauthorized

The apply worked locally because I had the AWS region configured in my AWS credentials, but not in the pipeline.

This configuration worked for me:

data "aws_eks_cluster" "default" {
  name = var.cluster_name
}

data "aws_eks_cluster_auth" "default" {
  name = var.cluster_name
}

provider "kubernetes" {
  host                   = data.aws_eks_cluster.default.endpoint
  cluster_ca_certificate = base64decode(data.aws_eks_cluster.default.certificate_authority[0].data)
  exec {
    api_version = "client.authentication.k8s.io/v1alpha1"
    args        = ["eks", "get-token", "--cluster-name", var.cluster_name, "--region", var.aws_region]
    command     = "aws"
  }
}

@krzysztof-magosa

Can you try running terraform refresh to see if that pulls in a new token? The token generated by aws_eks_cluster_auth is only valid for 15 minutes. For this reason, we recommend using an exec plugin to keep the token up to date automatically. Here's an example of that configuration:

provider "kubernetes" {
  host                   = data.aws_eks_cluster.eks.endpoint
  cluster_ca_certificate = base64decode(data.aws_eks_cluster.eks.certificate_authority[0].data)
  exec {
    api_version = "client.authentication.k8s.io/v1alpha1"
    args        = ["eks", "get-token", "--cluster-name", var.cluster_name]
    command     = "aws"
  }
}

Alternatively, running the Kubernetes provider in separate terraform apply from the EKS cluster creation should work every time. (I'm not sure offhand if your EKS cluster is being created in the same apply, but just guessing since it's a common configuration).

There's also a working EKS example you can compare with your configs. There are some improvements coming soon for the example, since we're working on related authentication issues.

Correct me if I'm wrong, but that assumes you have the aws CLI available and configured, which isn't the case when you run Terraform in a dynamic environment like Atlantis or Terraform Cloud...

vijay-veeranki added a commit to ministryofjustice/cloud-platform-infrastructure that referenced this issue Sep 28, 2021
This is to fix the authentication error, caused by this issue:
hashicorp/terraform-provider-kubernetes#1131
vijay-veeranki added a commit to ministryofjustice/cloud-platform-infrastructure that referenced this issue Sep 28, 2021
@jbg

jbg commented Feb 10, 2022

We run into this issue with virtually every apply now that we use Atlantis:

  1. PR is opened with some changes to the TF config
  2. Atlantis runs terraform plan and comments the output on the PR
  3. Someone looks at the plan and approves it
  4. PR opener comments atlantis apply (which causes Atlantis to run terraform apply)
  5. Apply fails with Unauthorized if there are any kubernetes_* resources

This happens whenever the time between step 2 and step 4 is more than 15 minutes.

The workaround of calling aws eks get-token from the provider configuration would only work if we add the AWS CLI to the Atlantis container image. We can do that, but it seems like a bit of a hack.

Is it a limitation of Terraform that this provider cannot refresh the token during apply? Is there a related Terraform issue?

@alexsomesan
Member

@jbg without logs and samples of your configuration, there isn't a lot to go on in your report. Also, no Terraform, providers and cluster versions involved. Please help us help you.

@jbg

jbg commented Feb 10, 2022

Sorry @alexsomesan I thought that the issue was well understood from earlier discussion in this issue. Is it not the case that a) the token expires after 15 minutes, and b) this provider does not request a new token during the apply stage?

(TF 1.1.5, terraform-provider-kubernetes 2.8.0, k8s 1.21, but the issue has existed for more than a year while always using the latest version of TF and the provider, and through k8s 1.19->1.20->1.21. It just affected us less before we moved to using Atlantis because we usually applied very soon after planning.)

@alexsomesan
Member

The provider will only request a new token if you configure it to use the cloud provider's auth plugin, by using the exec block in provider config (explained here).

If you use the data source from the AWS provider, that will only refresh once per operation (apply or plan). If your apply takes longer than the token expiration period, by using the data source you run the risk of using an expired token at some point in your apply run.

@jbg

jbg commented Feb 10, 2022

This is not an issue of the apply taking more than 15 minutes. The problem occurs if the gap between plan and apply is more than 15 minutes. Applying a single resource (which takes mere seconds) still demonstrates the issue. It appears that the token is not refreshed (data source is not re-read) at apply time.

The exec hack, given that it requires an executable external to Terraform that can provide a token, isn't suitable for all environments. In particular I doubt it can work in Terraform Cloud, but even in our self-hosted setup it's an unpleasant approach.

@alexsomesan
Member

I would not call the exec solution a hack. It's the default and preferred mechanism for credentials access on both EKS and GKE and it's what the official tooling from both cloud providers uses by default.

Have a look at the contents of a kubeconfig file produced by the AWS CLI:

➤ aws eks update-kubeconfig --name k8s-dev
Added new context arn:aws:eks:eu-central-1:XXXXXXXXXXXX:cluster/k8s-dev to /Users/alex/.kube/config

➤ kubectl config view
apiVersion: v1
clusters:
- cluster:
    certificate-authority-data: DATA+OMITTED
    server: https://XXXXXXXXXXXXXXXXXXXXXXXX.gr7.eu-central-1.eks.amazonaws.com
  name: arn:aws:eks:eu-central-1:XXXXXXXXXXXX:cluster/k8s-dev
contexts:
- context:
    cluster: arn:aws:eks:eu-central-1:XXXXXXXXXXXX:cluster/k8s-dev
    user: arn:aws:eks:eu-central-1:XXXXXXXXXXXX:cluster/k8s-dev
  name: arn:aws:eks:eu-central-1:XXXXXXXXXXXX:cluster/k8s-dev
current-context: arn:aws:eks:eu-central-1:XXXXXXXXXXXX:cluster/k8s-dev
kind: Config
preferences: {}
users:
- name: arn:aws:eks:eu-central-1:XXXXXXXXXXXX:cluster/k8s-dev
  user:
    exec:
      apiVersion: client.authentication.k8s.io/v1alpha1
      args:
      - --region
      - eu-central-1
      - eks
      - get-token
      - --cluster-name
      - k8s-dev
      command: aws
      env: null
      interactiveMode: IfAvailable
      provideClusterInfo: false

The same happens on GKE, and for good reason.

Most IAM systems advise using short-lived credentials obtained via some sort of dynamic role impersonation. EKS doesn't allow setting the lifespan of the token for the same reason. They want users to adopt role impersonation, which is the least risky way to handle credentials. This really isn't a hack.

Back on the topic of Terraform, there is a solid reason why the data source is not refreshed before apply in your scenario. Since Atlantis is supplying a pre-generated plan to the terraform apply command, the contract implies that those should be the only changes enacted by Terraform during the apply. If it were to refresh data sources, that could propagate new values through the plan, potentially incurring changes to resources after the plan had been reviewed and approved, thus negating the value of that process.
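
In other words, the Atlantis-style workflow looks roughly like this (a sketch):

terraform plan -out=tfplan   # the aws_eks_cluster_auth token is read here
# ... review/approval happens, more than 15 minutes pass, the token expires ...
terraform apply tfplan       # applies the saved plan; data sources are not re-read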

In conclusion, there really isn't any better way of handling these short-lived credentials other than auth plugins.

@jbg

jbg commented Feb 10, 2022

Since Atlantis is supplying a pre-generated plan to the terraform apply command, the contract implies that those should be the only changes enacted by Terraform during the apply. If it were to refresh data sources, that could propagate new values through the plan, potentially incurring changes to resources after the plan had been reviewed and approved, thus negating the value of that process.

Thanks, this is the key insight I was missing: it is indeed not possible for the data source to be refreshed at apply time.

It's unfortunate, though, that this means Terraform Cloud users are out of luck. We can build the AWS CLI into our Atlantis image and set up processes for keeping it up to date, which is an inconvenience but not that bad, but on some platforms there is no similar solution that would allow the exec approach to be used.

@alexsomesan
Member

TFC allows one to use custom agents, as Docker containers. It should be easy to add the auth plugins to those.
It does imply managing your own worker pool, which isn't something everyone may want to do. The TFC development team is aware of this limitation, but they may not be aware of the number of users affected. It may help to add weight to the issue by letting them know about it through their support request channels.

@eightseventhreethree

eightseventhreethree commented Feb 16, 2022

This issue at the very least should prompt a review of all of the official documentation, since you cannot actually use the provider in its documented state.

@jbg

jbg commented Feb 24, 2022

A related issue is that this provider seems to update the state with the changes that it attempted to apply, as if the apply were successful, even though authentication failed due to expired credentials.

So if you plan a change, then wait 15 minutes, and then try to apply the plan, you will get an error like "Error: the server has asked for the client to provide credentials". Then if you try to plan again with -refresh=false, there will be "No changes. Your infrastructure matches the configuration". On large states this increases the pain considerably, as it creates the need for repeated refreshing of the state, which can take tens of minutes or more.

@trallnag

trallnag commented Mar 11, 2022

I'm just using local-exec to deploy the few Kubernetes resources I want to "manage" with Terraform. At the moment I don't want to split my rather small Terraform state into at least two layers just to be able to use the Kubernetes provider properly with an AWS EKS Kubernetes cluster 💁‍♀️
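
For anyone curious, a minimal sketch of that local-exec pattern; the resource name, variable names, and manifest path are all assumptions:

resource "null_resource" "k8s_manifests" {
  # re-run whenever the manifest content changes
  triggers = {
    manifest_sha = filesha256("${path.module}/manifests/app.yaml")
  }

  provisioner "local-exec" {
    command = "aws eks update-kubeconfig --name ${var.cluster_name} && kubectl apply -f ${path.module}/manifests/app.yaml"
  }
}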

@aidanmelen

aidanmelen commented Apr 8, 2022

You can get around this with Kubernetes Service Account Tokens. The code snippet would look something like this:

# create service account
resource "kubernetes_service_account_v1" "terraform_admin" {
  metadata {
    name      = "terraform-admin"
    namespace = "kube-system"
    labels    = local.labels
  }
}

# grant privileges to the service account
module "terraform_admin" {
  source  = "aidanmelen/kubernetes/rbac"
  version = "v0.1.1"

  labels = local.labels

  cluster_roles = {
    "cluster-admin" = {
      create_cluster_role       = false
      cluster_role_binding_name = "terraform-admin-global"
      cluster_role_binding_subjects = [
        {
          kind = "ServiceAccount"
          name = kubernetes_service_account_v1.terraform_admin.metadata[0].name
        }
      ]
    }
  }
}

# retrieve the service account token from its secret
data "kubernetes_secret" "terraform_admin" {
  metadata {
    name      = kubernetes_service_account_v1.terraform_admin.metadata[0].name
    namespace = kubernetes_service_account_v1.terraform_admin.metadata[0].namespace
  }
}

# call provider with long-lived service account token
provider "kubernetes" {
  alias                  = "terraform-admin"
  host                   = "https://kubernetes.docker.internal:6443"
  cluster_ca_certificate = data.kubernetes_secret.terraform_admin.data["ca.crt"]
  token                  = data.kubernetes_secret.terraform_admin.data["token"]
}

Please see authn-authz example from the aidanmelen/kubernetes/rbac module for more information.

⚠️ This comes with a security trade-off, since this token will need to be manually rotated.

@ame24924

ame24924 commented May 22, 2022

I ran into the same problem in TFC.
The cause was that I used an IAM role in the AWS provider.

provider "aws" {
  assume_role {
    role_arn = var.assume_role_arn
  }
}

I solved this problem by explicitly specifying the IAM role when getting a token, such as:

  exec {
    api_version = "client.authentication.k8s.io/v1alpha1"
    command     = "aws"
    args = ["eks", "get-token", "--cluster-name", module.eks.name, "--role-arn", var.assume_role]
  }

Also, you may have to add your AWS region.
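
For example, a variant with both flags might look like this (the variable names are just placeholders):

  exec {
    api_version = "client.authentication.k8s.io/v1alpha1"
    command     = "aws"
    args        = ["eks", "get-token", "--cluster-name", module.eks.name, "--role-arn", var.assume_role, "--region", var.aws_region]
  }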

@tylerbillgo

Using exec is not a viable solution when running in terraform cloud using remote execution. Our current thinking is to implement a workaround to essentially taint the aws_eks_cluster_auth data source so it gets refreshed for every plan. It would be ideal if the kubernetes provider had native support for getting and refreshing managed kubernetes service authentication tokens / credentials in order to support environments in which the only guaranteed tooling is terraform itself.

This does work when using Terraform Cloud. It's how we have it working.

@romankydybets

In general, the Terraform Kubernetes provider documentation recommends the exec approach:
https://registry.terraform.io/providers/hashicorp/kubernetes/latest/docs#exec-plugins

@tstraley

It's the default and preferred mechanism for credentials access on both EKS and GKE and it's what the official tooling from both cloud providers uses by default.

Have a look at the contents of a kubeconfig file produced by the AWS CLI:

➤ aws eks update-kubeconfig --name k8s-dev

But the difference here is that you are using the AWS CLI to produce that kubeconfig. Of course it is sensible to also use the aws cli exec pattern to get the token within that kubeconfig.

In Terraform, the expectation is that I should be able to utilize Terraform to interact with AWS for all things I need including getting a valid kubernetes auth token.

I completely understand the limitations within Terraform that prevent this data source from being resolved during apply if a plan state is provided, but maybe a reasonable solution would be to make the token TTL configurable (obviously an ask for the aws provider, not this kubernetes provider).

@harryfinbow

Running the code below with the commented out block gives us the same error message as above: Error: Kubernetes cluster unreachable: invalid configuration: no configuration has been provided, try setting KUBERNETES_MASTER environment variable. However, when I uncomment the Helm provider it works fine.

data "aws_eks_cluster_auth" "cluster" {
  name = data.terraform_remote_state.kubernetes.outputs.cluster_name
}

provider "kubernetes" {
  host                   = data.terraform_remote_state.kubernetes.outputs.cluster_endpoint
  cluster_ca_certificate = data.terraform_remote_state.kubernetes.outputs.cluster_ca_certificate
  token                  = data.aws_eks_cluster_auth.cluster.token
}

/*
provider "helm" {
  kubernetes {
    host                   = data.terraform_remote_state.kubernetes.outputs.cluster_endpoint
    cluster_ca_certificate = data.terraform_remote_state.kubernetes.outputs.cluster_ca_certificate
    token                  = data.aws_eks_cluster_auth.cluster.token
  }
}
*/

This seems pretty weird to me, not sure if it's helpful.

@romankydybets

Running the code below with the commented out block gives us the same error message as above: Error: Kubernetes cluster unreachable: invalid configuration: no configuration has been provided, try setting KUBERNETES_MASTER environment variable. However, when I uncomment the Helm provider it works fine.

data "aws_eks_cluster_auth" "cluster" {
  name = data.terraform_remote_state.kubernetes.outputs.cluster_name
}

provider "kubernetes" {
  host                   = data.terraform_remote_state.kubernetes.outputs.cluster_endpoint
  cluster_ca_certificate = data.terraform_remote_state.kubernetes.outputs.cluster_ca_certificate
  token                  = data.aws_eks_cluster_auth.cluster.token
}

/*
provider "helm" {
  kubernetes {
    host                   = data.terraform_remote_state.kubernetes.outputs.cluster_endpoint
    cluster_ca_certificate = data.terraform_remote_state.kubernetes.outputs.cluster_ca_certificate
    token                  = data.aws_eks_cluster_auth.cluster.token
  }
}
*/

This seems pretty weird to me, not sure if it's helpful.

I found a big related issue: terraform-aws-modules/terraform-aws-eks#1234

This is my setup, and it has been working for 2 years:

data "aws_eks_cluster" "cluster" {
  name = var.eks_cluster_name
}

data "aws_eks_cluster_auth" "cluster" {
  name = var.eks_cluster_name
}

provider "kubernetes" {
  host                   = var.eks_endpoint
  cluster_ca_certificate = base64decode(data.aws_eks_cluster.cluster.certificate_authority[0].data)
  token                  = data.aws_eks_cluster_auth.cluster.token
  # if you face issues with the token TTL, just uncomment this
  # exec {
  #   api_version = "client.authentication.k8s.io/v1alpha1"
  #   args        = [
  #     "eks", "get-token",
  #     "--cluster-name", var.eks_cluster_name,
  #     "--region", var.region,
  #     "--profile", var.environment
  #   ]
  #   command     = "aws"
  # }
}


provider "helm" {
  kubernetes {
    host                   = var.eks_endpoint
    cluster_ca_certificate = base64decode(data.aws_eks_cluster.cluster.certificate_authority[0].data)
    token                  = data.aws_eks_cluster_auth.cluster.token
    # exec {
    #   api_version = "client.authentication.k8s.io/v1alpha1"
    #   args        = [
    #     "eks", "get-token",
    #     "--cluster-name", var.eks_cluster_name,
    #     "--region", var.region,
    #     "--profile", var.environment
    #   ]
    #   command     = "aws"
    # }
  }
}

@ppodevlabs

A few days after I created my EKS cluster, we are facing the same issue:

Error: Kubernetes cluster unreachable: invalid configuration: no configuration has been provided, try setting KUBERNETES_MASTER environment variable

It seems that the kubernetes and helm providers cannot communicate with my cluster. Here is my configuration:

provider "kubernetes" {
  host                   = data.aws_eks_cluster.eks.endpoint
  cluster_ca_certificate = base64decode(data.aws_eks_cluster.eks.certificate_authority[0].data)
  token                  = data.aws_eks_cluster_auth.eks.token
}

provider "helm" {
  kubernetes {
    host                   = data.aws_eks_cluster.eks.endpoint
    cluster_ca_certificate = base64decode(data.aws_eks_cluster.eks.certificate_authority[0].data)
    token                  = data.aws_eks_cluster_auth.eks.token
  }
}

provider "kubectl" {
  apply_retry_count      = 10
  host                   = data.aws_eks_cluster.eks.endpoint
  cluster_ca_certificate = base64decode(data.aws_eks_cluster.eks.certificate_authority[0].data)
  load_config_file       = false
  token                  = data.aws_eks_cluster_auth.eks.token
}

I've tried running terraform refresh --target module.eks, and it completes, but it doesn't seem to help, as I have the same error when I try a refresh, plan, or apply.


github-actions bot commented Dec 6, 2023

Marking this issue as stale due to inactivity. If this issue receives no comments in the next 30 days it will automatically be closed. If this issue was automatically closed and you feel this issue should be reopened, we encourage creating a new issue linking back to this one for added context. This helps our maintainers find and focus on the active issues. Maintainers may also remove the stale label at their discretion. Thank you!
