
Hanging node group after delete #1325

Closed
austinbv opened this issue Sep 11, 2019 · 20 comments

@austinbv

What happened?
I performed an eksctl delete nodegroup --cluster prod-eks --name ng-1; the drain failed because of existing DaemonSets and some local data.

I drained the nodes manually with kubectl using kubectl drain -l 'alpha.eksctl.io/nodegroup-name=ng-1' --force --ignore-daemonsets --delete-local-data

I ran eksctl delete nodegroup --cluster prod-eks --name ng-1 again and got this error:

2019-09-11T18:20:08-05:00 [!]  error getting instance role ARN for nodegroup "ng-1"

The CloudFormation delete also failed, with these events:


2019-08-28 14:06:18 UTC-0500 | eksctl-mim-prod-eks-nodegroup-ng-1 | DELETE_FAILED | The following resource(s) failed to delete: [NodeInstanceRole].
2019-08-28 14:06:17 UTC-0500 | NodeInstanceRole | DELETE_FAILED | Cannot delete entity, must detach all policies first. (Service: AmazonIdentityManagement; Status Code: 409; Error Code: DeleteConflict; Request ID: e9ebc137-c9c6-11e9-a56a-e1f2488279d7)

All instances were terminated, but running eksctl get nodegroup --cluster prod-eks I can still see:

→ eksctl get nodegroup --cluster mim-prod-eks
CLUSTER         NODEGROUP       CREATED                 MIN SIZE        MAX SIZE        DESIRED CAPACITY        INSTANCE TYPE   IMAGE ID
prod-eks    ng-1            2019-08-14T16:28:19Z    1               4               3                       t3.medium       ami-0f2e8e5663e16b436
prod-eks    ng-6            2019-09-11T19:21:31Z    1               10              4                       t3.large        ami-0d3998d69ebe9b214

What you expected to happen?
eksctl would no longer list the deleted node group

How to reproduce it?
Not sure why it failed tbh

Anything else we need to know?
Very standard install

Versions
Please paste in the output of these commands:

$ eksctl version
[ℹ]  version.Info{BuiltAt:"", GitCommit:"", GitTag:"0.5.3"}
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.3", GitCommit:"2d3c76f9091b6bec110a5e63777c332469e0cba2", GitTreeState:"clean", BuildDate:"2019-08-19T12:36:28Z", GoVersion:"go1.12.9", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"13+", GitVersion:"v1.13.10-eks-5ac0f1", GitCommit:"5ac0f1d9ab2c254ea2b0ce3534fd72932094c6e1", GitTreeState:"clean", BuildDate:"2019-08-20T22:39:46Z", GoVersion:"go1.11.13", Compiler:"gc", Platform:"linux/amd64"}

Logs
Include the output of the command line when running eksctl. If possible, eksctl should be run with debug logs. For example:
eksctl get clusters -v 4
Make sure you redact any sensitive information before posting.
If the output is long, please consider a Gist.

@jonasteif

hey

Have you been able to fix this issue in any way? The same thing happened to me yesterday and I can't find a way to permanently delete the node group from my EKS cluster.

@davidhole

I had this issue, which in my case I found a solution for.

For me it was related to dangling ENIs left behind by auto-scaling instances up and down (spot instances in my case). These ENIs were still attached to the node group security group, so the security group could not be deleted when deleting the CloudFormation stack (initiated by eksctl).

Deleting these ENIs (they have a status of "available", are not attached to an instance, and list the node group security group) allowed CloudFormation to properly delete the node group's stack, and it now appears completely deleted to eksctl.

Deleting these dangling ENIs every so often (depending on how quickly they build up for you) is also good policy, as they have caused other issues for me (and others) as well:

See:
aws/amazon-vpc-cni-k8s#59
aws/amazon-vpc-cni-k8s#608
etc
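The selection described above can be sketched with jq over `aws ec2 describe-network-interfaces` output. A minimal sketch, run here on a canned sample — the ENI and security group IDs are placeholders, not from this issue; on a real account, pipe in the live CLI output instead:

```shell
# Keep ENIs that are unattached ("available") but still reference the node
# group security group. Placeholder IDs; on a real account feed this from:
#   aws ec2 describe-network-interfaces
dangling=$(cat <<'EOF' | jq -r '.NetworkInterfaces[]
  | select(.Status == "available")
  | select(any(.Groups[]; .GroupId == "sg-0nodegroup"))
  | .NetworkInterfaceId'
{"NetworkInterfaces": [
  {"NetworkInterfaceId": "eni-0inuse",    "Status": "in-use",
   "Groups": [{"GroupId": "sg-0nodegroup"}]},
  {"NetworkInterfaceId": "eni-0dangling", "Status": "available",
   "Groups": [{"GroupId": "sg-0nodegroup"}]},
  {"NetworkInterfaceId": "eni-0other",    "Status": "available",
   "Groups": [{"GroupId": "sg-0other"}]}
]}
EOF
)
echo "$dangling"   # -> eni-0dangling
# Each surviving ID would then be passed to:
#   aws ec2 delete-network-interface --network-interface-id <id>
```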

@mr-karan

mr-karan commented Oct 4, 2019

+1 faced the same issue

@ddavtian

ddavtian commented Dec 3, 2019

Facing the same issue here and not sure how to proceed.

In trying to delete the cluster I see the following error

eksctl delete cluster --name floral-rainbow-1574743755
eksctl version 0.10.2
using region us-east-1
deleting EKS cluster "floral-rainbow-1574743755"
cleaning up LoadBalancer services
no eksctl-managed CloudFormation stacks found for "floral-rainbow-1574743755"

I went to the AWS console, I see the EKS cluster there, trying to delete the cluster manually I am seeing the following error.

ResourceInUseException
Cluster has node groups attached

Drilling into the node group, I see it listed there. I tried to manually delete the node group from the AWS console and it errored out as well with DELETE_FAILED.

With kubectl I am not seeing the nodes anymore but I see the following resources

kubectl get all --all-namespaces
NAMESPACE     NAME                           READY   STATUS    RESTARTS   AGE
kube-system   pod/coredns-77f96c54b6-c78x4   0/1     Pending   0          4d2h
kube-system   pod/coredns-77f96c54b6-j8jh4   0/1     Pending   0          4d2h

NAMESPACE     NAME                 TYPE        CLUSTER-IP    EXTERNAL-IP   PORT(S)         AGE
default       service/kubernetes   ClusterIP   10.100.0.1    <none>        443/TCP         6d20h
kube-system   service/kube-dns     ClusterIP   10.100.0.10   <none>        53/UDP,53/TCP   6d20h

NAMESPACE     NAME                        DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
kube-system   daemonset.apps/aws-node     0         0         0       0            0           <none>          6d20h
kube-system   daemonset.apps/kube-proxy   0         0         0       0            0           <none>          6d20h

NAMESPACE     NAME                      READY   UP-TO-DATE   AVAILABLE   AGE
kube-system   deployment.apps/coredns   0/2     2            0           6d20h

NAMESPACE     NAME                                 DESIRED   CURRENT   READY   AGE
kube-system   replicaset.apps/coredns-77f96c54b6   2         2         0       6d20h

Any help is appreciated, as I don't know of a way to clean up and remove this cluster now.

Thanks

@bubeamos

Hey @ddavtian, did you manage to delete the EKS cluster? How did you go about it?

@ddavtian

ddavtian commented Feb 1, 2020

@chidiebube I did. The issue is that there is a broken ConfigMap in the cluster, and that needs to be manually fixed first. Looking at my command history, try poking around with this to make sure the YAML is valid:

kubectl edit -n kube-system configmap/aws-auth

Fix it and then try the removal again from the AWS Console.
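For reference, a well-formed aws-auth ConfigMap with a single node group role looks roughly like this — the account ID and role name below are placeholders, not from this issue; keep your own entries and just fix the YAML:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: aws-auth
  namespace: kube-system
data:
  mapRoles: |
    # placeholder ARN -- use your node group's NodeInstanceRole
    - rolearn: arn:aws:iam::111122223333:role/eksctl-prod-eks-nodegroup-ng-1-NodeInstanceRole-XXXX
      username: system:node:{{EC2PrivateDNSName}}
      groups:
        - system:bootstrappers
        - system:nodes
```

A common way this map breaks is bad indentation or a stray entry left behind after a failed node group delete.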

@lclpedro

lclpedro commented Feb 6, 2020

+1 faced the same issue

@jakazzy

jakazzy commented Mar 9, 2020

> (quoting @davidhole's dangling-ENI fix above)

Thanks, this worked for me.

@tedostrem

+1

@musha68k

+1

@Gfeuillen

+1

We also experienced this issue, what worked for us:

  • Delete dangling ENIs (as mentioned above)
  • Resume the deletion by manually deleting the associated CloudFormation stack
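The stack name for the second step follows the `eksctl-<cluster>-nodegroup-<name>` pattern seen in the CloudFormation error earlier in this thread. A sketch, using the cluster and node group names from this issue as placeholders:

```shell
CLUSTER="prod-eks"    # placeholder -- your cluster name
NODEGROUP="ng-1"      # placeholder -- the stuck node group
STACK="eksctl-${CLUSTER}-nodegroup-${NODEGROUP}"
echo "$STACK"   # -> eksctl-prod-eks-nodegroup-ng-1

# Once the dangling ENIs are gone, retry the stack delete:
#   aws cloudformation delete-stack --stack-name "$STACK"
#   aws cloudformation wait stack-delete-complete --stack-name "$STACK"
```

Confirming the exact name first (e.g. in the CloudFormation console, or with `aws cloudformation list-stacks`) is safer than trusting the pattern.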

@cordiaz

cordiaz commented Apr 18, 2020

+1

I ran into the same problem :((

These steps worked for me:

  1. Delete all ENIs associated with the EKS cluster
  2. Delete all security groups associated with the EKS cluster

After that, I could delete the node group and the cluster...

Yeah!

@seanamosw

I'll also say this is not an eksctl-specific issue. Our EKS cluster was not created or managed with eksctl, and we had the same issue with dangling ENIs.

@polanfong

polanfong commented Apr 23, 2020

Same issue here. Although eksctl said it deleted the node group, the CloudFormation stack delete had failed. The message "must detach all policies first" made me look at the node group's NodeInstanceRole in IAM. I removed the last remaining policy (CloudWatchLogsFullAccess) on that role, and that worked for me.
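The manual cleanup above can be sketched as a dry run that prints the detach commands to run. The role name and attached policy below are placeholders (taken from the error and comment above, not verified against any account); on a real account, populate the ARN list from the `aws iam list-attached-role-policies` call shown in the comment:

```shell
ROLE="NodeInstanceRole"   # placeholder -- the role named in the DELETE_FAILED event
# On a real account, get the attached policy ARNs with:
#   aws iam list-attached-role-policies --role-name "$ROLE" \
#       --query 'AttachedPolicies[].PolicyArn' --output text
arns="arn:aws:iam::aws:policy/CloudWatchLogsFullAccess"

# Dry run: print the detach commands instead of executing them.
cmds=$(for arn in $arns; do
  echo "aws iam detach-role-policy --role-name $ROLE --policy-arn $arn"
done)
echo "$cmds"
# -> aws iam detach-role-policy --role-name NodeInstanceRole --policy-arn arn:aws:iam::aws:policy/CloudWatchLogsFullAccess
```

Once every policy is detached, re-running the CloudFormation stack delete should get past the DeleteConflict error.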

@MG40

MG40 commented May 12, 2020

Same issue. I deleted the Auto Scaling group, the NAT gateways, and the VPCs (thanks to the billing alerts). I couldn't find any cluster to delete.

@MG40

MG40 commented May 14, 2020

There was another way: the CloudFormation stacks were still there, so I went ahead and deleted them. That worked too, the second time around!

@michaelbeaumont

See #2172 and potentially fixed by #2762

@dable-mj111

dable-mj111 commented Nov 10, 2020

@MG40 +1. I also deleted the hanging nodegroups by deleting the associated cloudformation stack.

@ajinkya933

In my case the problem was:

$eksctl delete nodegroup -f secondnode.yaml --approve
[!]  continuing with deletion, error occurred: error getting instance role ARN for nodegroup "second": stack not found for nodegroup "second"
[!]  no nodes found in nodegroup "second" (label selector: "alpha.eksctl.io/nodegroup-name=second")
[!]  removing nodegroup from auth ConfigMap: instance identity ARN "" not found in auth ConfigMap

My issue was that the node group was not getting deleted.

How I fixed it:

Under IAM > Roles, I deleted every role that did not match the one returned by:

aws iam list-roles \
    | jq -r ".Roles[] \
    | select(.RoleName \
    | startswith(\"eksctl-$AWS_CLUSTER_NAME\") and contains(\"NodeInstanceRole\")) \
    .RoleName"

eksctl-kubeflow-example-nodegroup-ng-185-NodeInstanceRole-1DDJJXQBG9EM6

After that I added a new node group with eksctl create nodegroup -f secondnode.yaml, and it attached successfully.

@sakshampaliwal

I faced the same issue. I tried deleting the nodegroup both through the GUI and using a command, but it wouldn't delete. It seemed to get stuck. However, after waiting for 10 minutes, it finally got deleted.
