
[Bug]: context deadline exceeded due to metrics-server pod not terminating on terraform destroy #353

Closed · 1 task done
camba1 opened this issue Mar 16, 2022 · 16 comments
Labels: bug Something isn't working
camba1 (Contributor) commented Mar 16, 2022

Welcome to Amazon SSP EKS Accelerator!

  • Yes, I've searched similar issues on GitHub and didn't find any.

Amazon EKS Accelerator Release version

3.5

What is your environment, configuration and the example used?

Terraform v1.0.10
on darwin_amd64

  • provider registry.terraform.io/gavinbunney/kubectl v1.13.1
  • provider registry.terraform.io/hashicorp/aws v4.5.0
  • provider registry.terraform.io/hashicorp/cloudinit v2.2.0
  • provider registry.terraform.io/hashicorp/helm v2.4.1
  • provider registry.terraform.io/hashicorp/kubernetes v2.8.0
  • provider registry.terraform.io/hashicorp/local v2.1.0
  • provider registry.terraform.io/hashicorp/null v3.1.0
  • provider registry.terraform.io/terraform-aws-modules/http v2.4.1

Using the getting started guide

What did you do and What did you see instead?

What did I want to do?
Tried to terraform destroy the cluster

What did I expect:
The cluster to be deprovisioned

What happened?
Terraform destroy failed with error: context deadline exceeded
Exploring the cluster showed that the error occurred because the metrics-server pod was stuck in the Terminating state.

Once the pod was deleted forcefully (kubectl delete pod metrics-server-694d47d564-4xv72 --grace-period=0 --force -n metrics-server), and then the related namespace was deleted, Terraform was able to destroy the cluster
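For reference, the manual cleanup came down to roughly these two commands (the namespace matches the -n flag above; the pod name is of course specific to this cluster):

  kubectl delete pod metrics-server-694d47d564-4xv72 --grace-period=0 --force -n metrics-server
  kubectl delete namespace metrics-server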

Additional Information

No response

camba1 added the bug label Mar 16, 2022
@askulkarni2 (Contributor)

Thanks @camba1 for reporting the issue. @Zvikan can you please have a look at this one?

@naris-silpakit (Contributor)

Terraform failed to destroy metrics-server because of a dependency issue in the getting started example: the VPC resources were being deleted before the EKS cluster was completely deleted. #356 resolves this. Since it's merged, I'll close this ticket now.

danvau7 commented Apr 4, 2022

Still getting a similar issue regarding Error: context deadline exceeded. Sometimes terraform destroy works, sometimes it doesn't. Any ideas? The hangup is always around the kubernetes-addons module.

@naris-silpakit (Contributor)

Hi @danvau7, thanks for following up on this. We're working on a list of known issues, including this one. We're also working on a fix for this problem in particular, but don't have an estimate for when that will be pushed out. In the meantime, if you are trying to deploy the examples you can avoid the issue by deploying the VPC separately from the rest of the example.
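For example, a split deployment could look roughly like this (the VPC module label here is illustrative; match it to the module name used in the example you deploy):

  # stand up the VPC on its own first, then the rest of the example
  terraform apply -target=module.aws_vpc
  terraform apply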

danvau7 commented Apr 5, 2022

Setting the following in the aws-eks-accelerator-for-terraform module helps significantly; I would perhaps recommend adding it to the main example in the README.md:

  cluster_timeouts = {
    create = "120m"
    update = "120m"
    delete = "120m"
  }
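For context, here is roughly where that block sits; the module label and source are illustrative and should match your own configuration:

  module "eks" {
    source = "github.com/aws-samples/aws-eks-accelerator-for-terraform"

    # ... cluster name, VPC/subnet IDs, managed node groups, etc.

    cluster_timeouts = {
      create = "120m"
      update = "120m"
      delete = "120m"
    }
  }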

Modifying the Helm configs, as in the following, also helps. I changed the default timeout value from 1200 to 3600.

  enable_argocd = true
  # Optional Map value
  argocd_helm_config = {
    name             = "argo-cd"
    chart            = "argo-cd"
    repository       = "https://argoproj.github.io/argo-helm"
    version          = "3.35.4"
    namespace        = "argocd"
    timeout          = "3600" # changed from 1200
    create_namespace = true
  }

Zvikan (Contributor) commented Apr 5, 2022

Sorry for the late reply.
As Naris mentioned, our examples currently take a single TF apply and destroy approach. The problem with that is that other modules and resources (e.g. VPC resources) may be deleted first, which can leave addons unable to delete properly and reporting back timeouts / deadline exceeded.

The ideal destroy order would be:

  1. k8s Addons
  2. k8s managed addons (e.g. VPC CNI)
  3. nodes
  4. cluster
  5. VPC and everything else.

You can use the -target arg while running TF apply/destroy to target the specific modules/resources you want to be deleted.
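A sketch of that ordered teardown using -target (the module labels are illustrative and depend on how your configuration names the blueprints and addons modules):

  # 1. k8s addons first
  terraform destroy -target=module.kubernetes_addons
  # 2-4. managed addons, nodes and the cluster itself
  terraform destroy -target=module.eks_blueprints
  # 5. VPC and everything that remains
  terraform destroy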

Here's what I suggest checking next time you face this issue:

  • Run kubectl describe pod <POD_NAME>, where the pod name is the one with the timeout/deadline exceeded issue, and check under Events whether there are any issues around the sandbox/CNI
  • Run kubectl get nodes and check the status of the nodes

If any of the above is true, check the terraform destroy order; you may see that VPC resources were deleted before the addons, which leaves the nodes unhealthy and the cluster in a state where addons can't be deleted properly (this may be true not just for metrics-server, but for other addons too).

schwichti (Contributor) commented May 6, 2022

Sadly, I am also facing this issue:

module.eks-addons.module.aws_for_fluent_bit[0].module.irsa.kubernetes_namespace_v1.irsa[0]: Still destroying... [id=logging, 4m50s elapsed]
module.eks-addons.module.ingress_nginx[0].module.helm_addon.module.irsa[0].kubernetes_namespace_v1.irsa[0]: Still destroying... [id=nginx, 4m50s elapsed]
╷
│ Error: context deadline exceeded
│
│

I am using version 3.3.0 of the EKS Blueprints module.

schwichti (Contributor) commented May 6, 2022

The pods hang in the Terminating state. Here is the kubectl describe pod output for the nginx controller:

Name:                      ingress-nginx-controller-559c9cc878-wgg42
Namespace:                 nginx
Priority:                  0
Node:                      ip-XX-XX-XX-XX.eu-central-1.compute.internal/XX.XX.XX.XX
Start Time:                Thu, 05 May 2022 23:09:12 +0200
Labels:                    app.kubernetes.io/component=controller
                           app.kubernetes.io/instance=ingress-nginx
                           app.kubernetes.io/name=ingress-nginx
                           pod-template-hash=559c9cc878
Annotations:               kubernetes.io/psp: eks.privileged
Status:                    Terminating (lasts 10h)
Termination Grace Period:  300s
IP:                        10.1.0.249
IPs:
  IP:           XX.XX.XX.XXX
Controlled By:  ReplicaSet/ingress-nginx-controller-559c9cc878
Containers:
  controller:
    Container ID:  docker://631f2c8a81683c06978875f60218319a9ecf7a62ff97a5835ea158dab1c86819
    Image:         k8s.gcr.io/ingress-nginx/controller:v1.0.4@sha256:545cff00370f28363dad31e3b59a94ba377854d3a11f18988f5f9e56841ef9ef
    Image ID:      docker-pullable://k8s.gcr.io/ingress-nginx/controller@sha256:545cff00370f28363dad31e3b59a94ba377854d3a11f18988f5f9e56841ef9ef
    Ports:         80/TCP, 443/TCP, 8443/TCP
    Host Ports:    0/TCP, 0/TCP, 0/TCP
    Args:
      /nginx-ingress-controller
      --publish-service=$(POD_NAMESPACE)/ingress-nginx-controller
      --election-id=ingress-controller-leader
      --controller-class=k8s.io/ingress-nginx
      --configmap=$(POD_NAMESPACE)/ingress-nginx-controller
      --validating-webhook=:8443
      --validating-webhook-certificate=/usr/local/certificates/cert
      --validating-webhook-key=/usr/local/certificates/key
    State:          Running
      Started:      Thu, 05 May 2022 23:09:19 +0200
    Ready:          True
    Restart Count:  0
    Requests:
      cpu:      100m
      memory:   90Mi
    Liveness:   http-get http://:10254/healthz delay=10s timeout=1s period=10s #success=1 #failure=5
    Readiness:  http-get http://:10254/healthz delay=10s timeout=1s period=10s #success=1 #failure=3
    Environment:
      POD_NAME:                     ingress-nginx-controller-559c9cc878-wgg42 (v1:metadata.name)
      POD_NAMESPACE:                nginx (v1:metadata.namespace)
      LD_PRELOAD:                   /usr/local/lib/libmimalloc.so
      AWS_DEFAULT_REGION:           eu-central-1
      AWS_REGION:                   eu-central-1
      AWS_ROLE_ARN:                 arn:aws:iam::XXXXXXXXXX:role/simphera-reference-dev-eks-ingress-nginx-sa-irsa
      AWS_WEB_IDENTITY_TOKEN_FILE:  /var/run/secrets/eks.amazonaws.com/serviceaccount/token
    Mounts:
      /usr/local/certificates/ from webhook-cert (ro)
      /var/run/secrets/eks.amazonaws.com/serviceaccount from aws-iam-token (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-r824n (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   True
  PodScheduled      True
Volumes:
  aws-iam-token:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  86400
  webhook-cert:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  ingress-nginx-admission
    Optional:    false
  kube-api-access-r824n:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              kubernetes.io/os=linux
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:                      <none>

That's the output of kubectl get nodes:

NAME                                          STATUS     ROLES    AGE   VERSION
ip-XX-XX-XX-XX.eu-central-1.compute.internal    NotReady   <none>   11h   v1.21.5-eks-9017834
ip-XX-XX-XX-XX.eu-central-1.compute.internal   NotReady   <none>   11h   v1.21.5-eks-9017834

After running the following command for the hanging pods, I was able to complete terraform destroy:

kubectl delete pod --grace-period=0 --force --namespace nginx ingress-nginx-controller-559c9cc878-wgg42

@schwichti (Contributor)

@Zvikan do you have a hint for me how I can fix it?

Zvikan (Contributor) commented May 9, 2022

Hey @schwichti, I can see that your nodes are in status "NotReady"; this may be a similar issue to the one I explained above.
Did you run the whole example with a single terraform destroy command?
If so, you may have faced a situation where some of the VPC resources were deleted before the addons and the cluster, which leads to this situation where the nodes are not healthy and TF is then unable to destroy the addons.

Did you try to do a cleanup/destroy process using the -target arg as I mentioned above?

@schwichti (Contributor)

@Zvikan I am not using any of the examples, but created my own module based on EKS Blueprints: https://github.com/dspace-group/simphera-reference-architecture-aws/tree/feat_update_dependencies . In fact, I am able to complete the destroy when I delete the Helm charts manually (see above), so I don't strictly need -target. However, I would like to destroy everything in one run without any trouble.

I think you are right that VPC resources were deleted before the addons, but I cannot see why that is the case. In https://github.com/aws-ia/terraform-aws-eks-blueprints/pull/356/files you added an explicit dependency from the blueprints module to the VPC. I do not see why that is necessary, because there is an implicit dependency. It also appears that this explicit dependency was later removed from the examples again.

Zvikan (Contributor) commented May 9, 2022

@schwichti the implicit dependency only covers some of the VPC resources, like the subnet IDs. What can cause this domino effect is your NAT gateway or other VPC-related resources, which leaves the nodes in an unstable state and therefore TF unable to clean up the addons properly.

And we've removed the explicit dependency between the addons and/or the accelerator (now known as EKS Blueprints) and the VPC, due to upstream changes in newer versions where we can't use depends_on without running into this issue.

We've been trying to keep a single TF apply/destroy without any issues, but we entered the dependency rabbit hole and ran into the following problems:

  • explicit dependencies can add a LOT of time to the TF plan, due to the nature of how TF works and the size of the project
  • implicit dependencies don't cover all use cases (as in the example I gave here with the VPC resources)

And more.

So where do we go from here? We've been thinking a lot and came to the decision that the best next step is to take a step back and suggest deploying modules in the correct order via -target (if you wish to have it all under the same repo and in a single TF state...).

@schwichti (Contributor)

@Zvikan do I understand you correctly that adding an explicit dependency on the VPC resources could solve the issue in my case (at the price of slower plans)?

Using the -target option is quite unsatisfying for me and can only be a temporary solution.

Zvikan (Contributor) commented May 10, 2022

@schwichti Yes, by adding an explicit dependency you control the deployment (apply/destroy) flow in your case, achieving what I described above.
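A minimal sketch of what such an explicit dependency could look like, assuming a kubernetes-addons style module and a VPC module labelled module.vpc (both labels, and the addon flag, are illustrative):

  module "kubernetes_addons" {
    source = "github.com/aws-ia/terraform-aws-eks-blueprints//modules/kubernetes-addons"

    eks_cluster_id       = module.eks_blueprints.eks_cluster_id
    enable_ingress_nginx = true

    # create the addons after, and destroy them before, the VPC so that NAT
    # gateways etc. are still available while the Helm releases are removed
    depends_on = [module.vpc]
  }

As noted above, this comes at the cost of longer plans and possible depends_on limitations in newer versions, so treat it as a workaround rather than the recommended path.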

@georgelza

I do think a destroy that includes an EKS cluster should, without being told, automatically delete the add-ons first.

strowi commented Oct 26, 2022

Hi, I started playing with eks-blueprints a couple of weeks ago and ran into problems similar to this and #524.
Not sure this is 100% related, but after some testing (especially with the ingress-nginx addon), I noticed the LoadBalancer Service was stuck pending, although the Helm chart had been removed.

So I manually ran helm delete ingress-nginx, which removed the Service successfully. This led me to think the problem is Terraform and its handling of helm_release. I stumbled upon an open issue regarding the helm provider.

The main problem seems to be: Helm destruction is immediate and Terraform continues, which causes problems. The only solution I've seen so far is a fixed wait timeout. ;(
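For what it's worth, one way to express that fixed wait in plain Terraform (a sketch only, not something the blueprints module does for you; names and durations are illustrative) is a time_sleep resource wedged between the VPC and the Helm release, so that on destroy Terraform removes the chart, pauses while the load balancer it created is cleaned up, and only then tears down the VPC:

  resource "time_sleep" "lb_cleanup" {
    # grace period on destroy for the Service's load balancer to disappear
    destroy_duration = "300s"

    depends_on = [module.vpc]
  }

  resource "helm_release" "ingress_nginx" {
    name       = "ingress-nginx"
    repository = "https://kubernetes.github.io/ingress-nginx"
    chart      = "ingress-nginx"
    namespace  = "nginx"

    # destroy order: helm_release -> time_sleep (wait) -> module.vpc
    depends_on = [time_sleep.lb_cleanup]
  }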

alidonmez pushed a commit to alidonmez/terraform-aws-eks-blueprints-1 that referenced this issue Mar 25, 2023
Updated k8s.gcr.io references to registry.k8s.io