Issue communicating with cluster -- dial tcp: i/o timeout #1167

Closed
bryankaraffa opened this issue Feb 18, 2021 · 6 comments

Comments

@bryankaraffa

Terraform Version, Provider Version and Kubernetes Version

Terraform v0.14.6
+ provider registry.terraform.io/hashicorp/aws v3.28.0
+ provider registry.terraform.io/hashicorp/external v2.0.0
+ provider registry.terraform.io/hashicorp/helm v2.0.2
+ provider registry.terraform.io/hashicorp/kubernetes v2.0.2
+ provider registry.terraform.io/hashicorp/local v2.0.0
+ provider registry.terraform.io/hashicorp/null v3.0.0
+ provider registry.terraform.io/hashicorp/random v3.0.1
+ provider registry.terraform.io/hashicorp/template v2.2.0

Affected Resource(s)

  • kubernetes_storage_class
  • kubernetes_config_map
  • presumably all kubernetes_* resources, but we only use these two

Terraform Configuration Files

# Copy-paste your Terraform configurations here - for large Terraform configs,
# please use a service like Dropbox and share a link to the ZIP file. For
# security, you can also encrypt the files using our GPG public key.

Debug Output

The Kubernetes provider is failing to communicate with the Kubernetes cluster:
https://gist.github.com/bryankaraffa/8d473ee172075d2016211f59655fd099#file-failed-http-request-to-kubernetes-cluster-with-terraform-binaries-txt

The exact same request works with another binary [curl]:
https://gist.github.com/bryankaraffa/8d473ee172075d2016211f59655fd099#file-same-request-works-with-curl-txt

Expected Behavior

We expect the Kubernetes provider to be able to refresh and communicate with the Kubernetes cluster, because there are no network route or connectivity issues [validated with curl]. Also, this behavior can only be reproduced on two of our team's local machines; other team members and our CI/CD system can run the plan with no issues.

With the v2.x provider, we expect any local KUBECONFIG environment variable or configuration in ~/.kube/config to be ignored by the Kubernetes provider [we are passing the cluster host, token, and certificate statically in providers.tf].

Actual Behavior

The Kubernetes provider fails to communicate with the cluster as if there were a network route / connectivity issue [which we validated with curl is not the case].

Important Factoids

I suspect this is a local configuration issue or conflict. This behavior can only be reproduced on two of our team's local machines; other team members and our CI/CD system can run the plan with no issues. From the troubleshooting we have done, and because we can reproduce this behavior 100% of the time on two particular team members' machines, I feel we've eliminated the possibility that this is an intermittent issue with general internet connectivity.

References

Did not find anything related

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
@alexsomesan
Member

@bryankaraffa Are you creating the EKS cluster in the same apply operation where you are seeing these errors?

@bryankaraffa
Author

@alexsomesan -- Yes, the kubernetes_config_map, kubernetes_storage_class, and aws_eks_cluster resources are in the same Terraform module [module.cluster], so they are created and managed in the same apply operation.

@ghost ghost removed waiting-response labels Feb 24, 2021
@dak1n1
Contributor

dak1n1 commented Feb 25, 2021

I suspect this is a local configuration issue or conflict. This behavior can only be reproduced on two of our team's local machines; other team members and our CI/CD system can run the plan with no issues.

This made me think of an issue we're going to solve with the next release: when one of the KUBE environment variables is set on a system, that variable can override explicit configuration settings in the provider. It might be helpful to check the affected machines for KUBE* environment variables. I run this on my system to check: env | grep KUBE. The Kubernetes provider reads the following environment variables and will use them to configure the provider:

  • KUBE_CONFIG_PATH
  • KUBE_HOST
  • KUBE_CLUSTER_CA_CERT_DATA
  • KUBE_CLIENT_CERT_DATA
  • KUBE_CLIENT_KEY_DATA
  • KUBE_USERNAME
  • KUBE_PASSWORD
  • KUBE_CTX
  • KUBE_CTX_USER
  • KUBE_CTX_CLUSTER
  • KUBE_TOKEN
  • KUBE_INSECURE

Potentially, one of these could override your token, though I'm not sure an invalid token would give you an i/o timeout.
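
As a rough sketch of the check described above, the following script reports which of the listed variables are exported in the current environment. It prints names only, so tokens and keys do not end up in a terminal log or a pasted comment. The function name kube_vars_set is hypothetical and is not part of the provider or any tool; the script also assumes printenv is available, which is the case on most Linux and macOS systems.

```shell
#!/bin/sh
# Hypothetical helper: report which of the provider's KUBE_* variables are
# exported in the current environment. Only names are printed, never values.
kube_vars_set() {
  for name in KUBE_CONFIG_PATH KUBE_HOST KUBE_CLUSTER_CA_CERT_DATA \
              KUBE_CLIENT_CERT_DATA KUBE_CLIENT_KEY_DATA KUBE_USERNAME \
              KUBE_PASSWORD KUBE_CTX KUBE_CTX_USER KUBE_CTX_CLUSTER \
              KUBE_TOKEN KUBE_INSECURE; do
    # printenv exits non-zero when the variable is not in the environment;
    # its output (the value) is discarded.
    if printenv "$name" >/dev/null 2>&1; then
      echo "$name"
    fi
  done
}

kube_vars_set
```

If this prints anything on an affected machine but nothing on a working one, unsetting those variables (for example with unset KUBE_HOST) before running terraform plan is a cheap way to test whether they are the cause.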

Another possibility is a race condition: the provider could be attempting to read the API before the EKS cluster is ready. To avoid this scenario, I recommend using a data source instead of referencing the module here:

  host                   = module.cluster.eks_cluster_host
  cluster_ca_certificate = module.cluster.eks_cluster_ca_certificate

Here's the configuration I've got in a PR to add to our EKS example:

# Wait for cluster API to be ready before reading.
data "aws_eks_cluster" "default" {
  name = module.cluster.cluster_id
}

provider "kubernetes" {
  host                   = data.aws_eks_cluster.default.endpoint
  cluster_ca_certificate = base64decode(data.aws_eks_cluster.default.certificate_authority[0].data)
  exec {
    api_version = "client.authentication.k8s.io/v1alpha1"
    args        = ["eks", "get-token", "--cluster-name", var.cluster_name]
    command     = "aws"
  }
}

The part I want to emphasize is the name = module.cluster.cluster_id. That cluster_id attribute will block the data source from reading until the cluster's API is ready. That has worked reliably in my testing, and it's based on the EKS module here.

@github-actions

Marking this issue as stale due to inactivity. If this issue receives no comments in the next 30 days it will automatically be closed. If this issue was automatically closed and you feel this issue should be reopened, we encourage creating a new issue linking back to this one for added context. This helps our maintainers find and focus on the active issues. Maintainers may also remove the stale label at their discretion. Thank you!

@github-actions github-actions bot added the stale label Mar 23, 2022
@bryankaraffa
Author

Closing this for now because we can no longer reproduce it with the latest versions of the providers.

@github-actions

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators May 15, 2022

4 participants