Issue communicating with cluster -- dial tcp: i/o timeout #1167

bryankaraffa · 2021-02-18T19:30:22Z

Terraform Version, Provider Version and Kubernetes Version

Terraform v0.14.6
+ provider registry.terraform.io/hashicorp/aws v3.28.0
+ provider registry.terraform.io/hashicorp/external v2.0.0
+ provider registry.terraform.io/hashicorp/helm v2.0.2
+ provider registry.terraform.io/hashicorp/kubernetes v2.0.2
+ provider registry.terraform.io/hashicorp/local v2.0.0
+ provider registry.terraform.io/hashicorp/null v3.0.0
+ provider registry.terraform.io/hashicorp/random v3.0.1
+ provider registry.terraform.io/hashicorp/template v2.2.0

Affected Resource(s)

kubernetes_storage_class
kubernetes_config_map
Presumably all kubernetes_* resources, but we only use these two

Terraform Configuration Files

# Copy-paste your Terraform configurations here - for large Terraform configs,
# please use a service like Dropbox and share a link to the ZIP file. For
# security, you can also encrypt the files using our GPG public key.

Debug Output

Kubernetes provider is failing to communicate with kubernetes cluster:
https://gist.github.com/bryankaraffa/8d473ee172075d2016211f59655fd099#file-failed-http-request-to-kubernetes-cluster-with-terraform-binaries-txt

The same exact request works with another binary [curl]:
https://gist.github.com/bryankaraffa/8d473ee172075d2016211f59655fd099#file-same-request-works-with-curl-txt

Expected Behavior

We expect the Kubernetes provider to be able to refresh / communicate with the Kubernetes cluster because there are no network route or connectivity issues [validated with curl]. Also, this behavior can only be reproduced on two of our team's local machines; other team members and our CI/CD system can run the plan with no issues.

With the v2.x provider, we are expecting any local KUBECONFIG env or configuration in ~/.kube/config to be ignored by the kubernetes provider [we are passing cluster host, token, and certificate statically - providers.tf]

Actual Behavior

The Kubernetes provider fails to communicate with the cluster as if there were a network route or connectivity problem [which we ruled out with curl]

Important Factoids

I suspect this is a local configuration issue or conflict. This behavior can only be reproduced on two of our team's local machines; other team members and our CI/CD system can run the plan with no issues. Because we can reproduce it 100% of the time on those two machines, I believe our troubleshooting has ruled out intermittent internet connectivity as the cause.

References

Did not find anything related

Community Note

Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
If you are interested in working on this issue or have submitted a pull request, please leave a comment


alexsomesan · 2021-02-24T15:56:32Z

@bryankaraffa Are you creating the EKS cluster in the same apply operation where you are seeing these errors?

bryankaraffa · 2021-02-24T18:46:45Z

@alexsomesan -- Yes the kubernetes_config_map, kubernetes_storage_class, and aws_eks_cluster resources are in the same terraform module [module.cluster] and so they are getting created/managed with the same apply operation

dak1n1 · 2021-02-25T01:13:00Z

"I suspect this is a local configuration / conflict... This behavior is only able to be reproduced on 2 of our team's local machines -- other team members, and our CI/CD system can run the plan with no issues."

This made me think of an issue we're going to solve with the next release: when one of the KUBE environment variables is set on a system, that variable can override explicit configuration settings in the provider. It might be helpful to check the affected machines for KUBE* environment variables. I run this on my system to check: env | grep KUBE. The Kubernetes provider reads the following environment variables and will use them to configure the provider:

KUBE_CONFIG_PATH
KUBE_HOST
KUBE_CLUSTER_CA_CERT_DATA
KUBE_CLIENT_CERT_DATA
KUBE_CLIENT_KEY_DATA
KUBE_USERNAME
KUBE_PASSWORD
KUBE_CTX
KUBE_CTX_USER
KUBE_CTX_CLUSTER
KUBE_TOKEN
KUBE_INSECURE

Potentially, one of these could override your token, though I'm not sure an invalid token would give you an i/o timeout.
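A plain env | grep KUBE also matches unrelated names such as KUBECONFIG and prints the values, which may include tokens. As a minimal sketch, assuming a POSIX shell with grep -E and cut available, the check can be narrowed to exactly the variables listed above and limited to printing names only. The helper name filter_kube_vars is hypothetical and not part of the provider or any tool.

```shell
#!/bin/sh
# Suffixes of the KUBE_* variables the provider reads, taken from the list above.
KUBE_VARS='CONFIG_PATH|HOST|CLUSTER_CA_CERT_DATA|CLIENT_CERT_DATA|CLIENT_KEY_DATA|USERNAME|PASSWORD|CTX|CTX_USER|CTX_CLUSTER|TOKEN|INSECURE'

# filter_kube_vars (hypothetical helper): reads NAME=value lines on stdin and
# prints only the names of the provider-relevant variables, never their values.
filter_kube_vars() {
  grep -E "^KUBE_(${KUBE_VARS})=" | cut -d= -f1
}

# Check the current environment; no output means none of the variables are set.
env | filter_kube_vars
```

If anything is printed on an affected machine, unsetting those variables in that shell (for example, unset KUBE_HOST) before running terraform plan is a quick way to rule them in or out.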

Another possibility is a race condition: the provider could be attempting to read the API before the EKS cluster is ready. To avoid this scenario, I recommend using a data source instead of referencing the module here:

  host                   = module.cluster.eks_cluster_host
  cluster_ca_certificate = module.cluster.eks_cluster_ca_certificate

Here's the configuration I've got in a PR to add to our EKS example:

# Wait for cluster API to be ready before reading.
data "aws_eks_cluster" "default" {
  name = module.cluster.cluster_id
}

provider "kubernetes" {
  host                   = data.aws_eks_cluster.default.endpoint
  cluster_ca_certificate = base64decode(data.aws_eks_cluster.default.certificate_authority[0].data)
  exec {
    api_version = "client.authentication.k8s.io/v1alpha1"
    args        = ["eks", "get-token", "--cluster-name", var.cluster_name]
    command     = "aws"
  }
}

The part I want to emphasize is the name = module.cluster.cluster_id. That cluster_id attribute will block the data source from reading until the cluster's API is ready. That has worked reliably in my testing, and it's based on the EKS module here.


github-actions · 2022-03-23T00:00:38Z

Marking this issue as stale due to inactivity. If this issue receives no comments in the next 30 days it will automatically be closed. If this issue was automatically closed and you feel this issue should be reopened, we encourage creating a new issue linking back to this one for added context. This helps our maintainers find and focus on the active issues. Maintainers may also remove the stale label at their discretion. Thank you!

bryankaraffa · 2022-04-14T17:58:28Z

Closing this for now because we can no longer reproduce it with the latest versions of the providers.

github-actions · 2022-05-15T02:31:15Z

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

bryankaraffa added the bug label Feb 18, 2021

aareet added the waiting-response label Feb 24, 2021

ghost removed waiting-response labels Feb 24, 2021

dak1n1 mentioned this issue Feb 25, 2021

Provider allows mutually-exclusive configuration options #1179

Open

osterman mentioned this issue Mar 8, 2021

Fail with I/O timeout due to bad configuration of the Kubernetes provider cloudposse/terraform-aws-eks-cluster#104

Closed

dak1n1 added the theme/auth label Mar 22, 2021

johnctitus pushed a commit to rackspace-infrastructure-automation/aws-terraform-eks that referenced this issue Mar 30, 2021

Update kubernetes provider configuration based on hashicorp/terraform…

50e4ff6

…-provider-kubernetes#1167

johnctitus pushed a commit to rackspace-infrastructure-automation/aws-terraform-eks that referenced this issue Mar 30, 2021

Update kubernetes provider configuration based on hashicorp/terraform…

2963942

…-provider-kubernetes#1167

github-actions bot added the stale label Mar 23, 2022

bryankaraffa closed this as completed Apr 14, 2022

github-actions bot removed the stale label Apr 14, 2022

github-actions bot locked as resolved and limited conversation to collaborators May 15, 2022
