
Hanging node group after delete #1325

Closed
austinbv opened this issue Sep 11, 2019 · 20 comments

@austinbv

What happened?
I performed an eksctl delete nodegroup --cluster prod-eks --name ng-1; the drain failed because of existing DaemonSets and some local data.

I drained the nodes manually with kubectl using kubectl drain -l 'alpha.eksctl.io/nodegroup-name=ng-1' --force --ignore-daemonsets --delete-local-data

I ran eksctl delete nodegroup --cluster prod-eks --name ng-1 again and got this error:

2019-09-11T18:20:08-05:00 [!]  error getting instance role ARN for nodegroup "ng-1"

The CloudFormation delete also failed, with these events:


2019-08-28 14:06:18 UTC-0500 | eksctl-mim-prod-eks-nodegroup-ng-1 | DELETE_FAILED | The following resource(s) failed to delete: [NodeInstanceRole].
2019-08-28 14:06:17 UTC-0500 | NodeInstanceRole | DELETE_FAILED | Cannot delete entity, must detach all policies first. (Service: AmazonIdentityManagement; Status Code: 409; Error Code: DeleteConflict; Request ID: e9ebc137-c9c6-11e9-a56a-e1f2488279d7)

All instances were terminated, but running eksctl get nodegroup --cluster prod-eks I can still see:

→ eksctl get nodegroup --cluster mim-prod-eks
CLUSTER         NODEGROUP       CREATED                 MIN SIZE        MAX SIZE        DESIRED CAPACITY        INSTANCE TYPE   IMAGE ID
prod-eks    ng-1            2019-08-14T16:28:19Z    1               4               3                       t3.medium       ami-0f2e8e5663e16b436
prod-eks    ng-6            2019-09-11T19:21:31Z    1               10              4                       t3.large        ami-0d3998d69ebe9b214

What you expected to happen?
eksctl would no longer list the deleted node group

How to reproduce it?
Not sure why it failed tbh

Anything else we need to know?
Very standard install

Versions
Please paste in the output of these commands:

$ eksctl version
[ℹ]  version.Info{BuiltAt:"", GitCommit:"", GitTag:"0.5.3"}
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.3", GitCommit:"2d3c76f9091b6bec110a5e63777c332469e0cba2", GitTreeState:"clean", BuildDate:"2019-08-19T12:36:28Z", GoVersion:"go1.12.9", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"13+", GitVersion:"v1.13.10-eks-5ac0f1", GitCommit:"5ac0f1d9ab2c254ea2b0ce3534fd72932094c6e1", GitTreeState:"clean", BuildDate:"2019-08-20T22:39:46Z", GoVersion:"go1.11.13", Compiler:"gc", Platform:"linux/amd64"}

Logs
Include the output of the command line when running eksctl. If possible, eksctl should be run with debug logs. For example:
eksctl get clusters -v 4
Make sure you redact any sensitive information before posting.
If the output is long, please consider a Gist.

@jonasteif

hey

Have you been able to fix this issue in any way? The same thing happened to me yesterday and I can't find a way to permanently delete the node group from my EKS cluster.

@davidhole

I had this issue, which in my case I found a solution for.

For me it was related to dangling ENIs left behind by auto-scaling instances up and down (spot instances in my case). These ENIs were still attached to the node group security group, so the security group could not be deleted when deleting the CloudFormation stack (initiated by eksctl).

Deleting these ENIs (they have a status of "available", are not attached to an instance, and list the node group security group) allowed CloudFormation to properly delete the node group's stack, and it now appears completely deleted to eksctl.

Deleting these dangling ENIs every so often (depending on how quickly they build up for you) is also good policy, as they have caused other issues for me (and others) as well:

See:
aws/amazon-vpc-cni-k8s#59
aws/amazon-vpc-cni-k8s#608
etc
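The selection described above can be sketched with jq over `aws ec2 describe-network-interfaces` output. A minimal sketch, run here on a canned sample — the ENI and security group IDs are placeholders, not from this issue; on a real account, pipe in the live CLI output instead:

```shell
# Keep ENIs that are unattached ("available") but still reference the node
# group security group. Placeholder IDs; on a real account feed this from:
#   aws ec2 describe-network-interfaces
dangling=$(cat <<'EOF' | jq -r '.NetworkInterfaces[]
  | select(.Status == "available")
  | select(any(.Groups[]; .GroupId == "sg-0nodegroup"))
  | .NetworkInterfaceId'
{"NetworkInterfaces": [
  {"NetworkInterfaceId": "eni-0inuse",    "Status": "in-use",
   "Groups": [{"GroupId": "sg-0nodegroup"}]},
  {"NetworkInterfaceId": "eni-0dangling", "Status": "available",
   "Groups": [{"GroupId": "sg-0nodegroup"}]},
  {"NetworkInterfaceId": "eni-0other",    "Status": "available",
   "Groups": [{"GroupId": "sg-0other"}]}
]}
EOF
)
echo "$dangling"   # -> eni-0dangling
# Each surviving ID would then be passed to:
#   aws ec2 delete-network-interface --network-interface-id <id>
```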

@mr-karan

mr-karan commented Oct 4, 2019

+1 faced the same issue

@ddavtian

ddavtian commented Dec 3, 2019

Facing the same issue here and not sure how to proceed.

In trying to delete the cluster I see the following error

eksctl delete cluster --name floral-rainbow-1574743755
eksctl version 0.10.2
using region us-east-1
deleting EKS cluster "floral-rainbow-1574743755"
cleaning up LoadBalancer services
no eksctl-managed CloudFormation stacks found for "floral-rainbow-1574743755"

I went to the AWS console, I see the EKS cluster there, trying to delete the cluster manually I am seeing the following error.

ResourceInUseException
Cluster has node groups attached

Drilling into the node group, I see it listed there. I tried to manually delete the node group from the AWS console and it errored out as well with DELETE_FAILED.

With kubectl I am not seeing the nodes anymore but I see the following resources

kubectl get all --all-namespaces
NAMESPACE     NAME                           READY   STATUS    RESTARTS   AGE
kube-system   pod/coredns-77f96c54b6-c78x4   0/1     Pending   0          4d2h
kube-system   pod/coredns-77f96c54b6-j8jh4   0/1     Pending   0          4d2h

NAMESPACE     NAME                 TYPE        CLUSTER-IP    EXTERNAL-IP   PORT(S)         AGE
default       service/kubernetes   ClusterIP   10.100.0.1    <none>        443/TCP         6d20h
kube-system   service/kube-dns     ClusterIP   10.100.0.10   <none>        53/UDP,53/TCP   6d20h

NAMESPACE     NAME                        DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
kube-system   daemonset.apps/aws-node     0         0         0       0            0           <none>          6d20h
kube-system   daemonset.apps/kube-proxy   0         0         0       0            0           <none>          6d20h

NAMESPACE     NAME                      READY   UP-TO-DATE   AVAILABLE   AGE
kube-system   deployment.apps/coredns   0/2     2            0           6d20h

NAMESPACE     NAME                                 DESIRED   CURRENT   READY   AGE
kube-system   replicaset.apps/coredns-77f96c54b6   2         2         0       6d20h

Any help is appreciated, as I don't know of a way to clean up and remove this cluster now.

Thanks

@bubeamos

Hey @ddavtian, did you manage to delete the EKS cluster? How did you go about it?

@ddavtian

ddavtian commented Feb 1, 2020

@chidiebube I did. The issue is that there is a broken ConfigMap in the cluster, and that needs to be manually fixed first. Looking at my command history, try poking around with this to make sure the YAML is valid:

kubectl edit -n kube-system configmap/aws-auth

Fix it and then try the removal again from the AWS Console.
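For reference, a well-formed aws-auth ConfigMap with a single node group role looks roughly like this — the account ID and role name below are placeholders, not from this issue; keep your own entries and just fix the YAML:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: aws-auth
  namespace: kube-system
data:
  mapRoles: |
    # placeholder ARN -- use your node group's NodeInstanceRole
    - rolearn: arn:aws:iam::111122223333:role/eksctl-prod-eks-nodegroup-ng-1-NodeInstanceRole-XXXX
      username: system:node:{{EC2PrivateDNSName}}
      groups:
        - system:bootstrappers
        - system:nodes
```

A common way this map breaks is bad indentation or a stray entry left behind after a failed node group delete.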

@lclpedro

lclpedro commented Feb 6, 2020

+1 faced the same issue

@jakazzy

jakazzy commented Mar 9, 2020

> (quoting @davidhole's dangling-ENI fix above)

Thanks, this worked for me.

@tedostrem

+1

@musha68k

+1

@Gfeuillen

+1

We also experienced this issue, what worked for us:

  • Delete dangling ENIs (as mentioned above)
  • Resume the deletion by manually deleting the associated CloudFormation stack
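The stack name for the second step follows the `eksctl-<cluster>-nodegroup-<name>` pattern seen in the CloudFormation error earlier in this thread. A sketch, using the cluster and node group names from this issue as placeholders:

```shell
CLUSTER="prod-eks"    # placeholder -- your cluster name
NODEGROUP="ng-1"      # placeholder -- the stuck node group
STACK="eksctl-${CLUSTER}-nodegroup-${NODEGROUP}"
echo "$STACK"   # -> eksctl-prod-eks-nodegroup-ng-1

# Once the dangling ENIs are gone, retry the stack delete:
#   aws cloudformation delete-stack --stack-name "$STACK"
#   aws cloudformation wait stack-delete-complete --stack-name "$STACK"
```

Confirming the exact name first (e.g. in the CloudFormation console, or with `aws cloudformation list-stacks`) is safer than trusting the pattern.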

@cordiaz

cordiaz commented Apr 18, 2020

+1

I ran into the same problem :((

These steps worked for me:

  1. Delete all ENIs associated with the EKS cluster
  2. Delete all security groups associated with the EKS cluster

After that, I could delete the node group and the cluster...

Yeah!

@seanamosw

I'll also say this is not an eksctl-specific issue. Our EKS cluster was not created or managed with eksctl, and we had the same issue with dangling ENIs.

@polanfong

polanfong commented Apr 23, 2020

Same issue here. Although eksctl said it deleted the node group, the CloudFormation stack delete had failed. The message "must detach all policies first" made me look at the node group's NodeInstanceRole in IAM. I removed the last remaining policy (CloudWatchLogsFullAccess) on that role, and that worked for me.
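The manual cleanup above can be sketched as a dry run that prints the detach commands to run. The role name and attached policy below are placeholders (taken from the error and comment above, not verified against any account); on a real account, populate the ARN list from the `aws iam list-attached-role-policies` call shown in the comment:

```shell
ROLE="NodeInstanceRole"   # placeholder -- the role named in the DELETE_FAILED event
# On a real account, get the attached policy ARNs with:
#   aws iam list-attached-role-policies --role-name "$ROLE" \
#       --query 'AttachedPolicies[].PolicyArn' --output text
arns="arn:aws:iam::aws:policy/CloudWatchLogsFullAccess"

# Dry run: print the detach commands instead of executing them.
cmds=$(for arn in $arns; do
  echo "aws iam detach-role-policy --role-name $ROLE --policy-arn $arn"
done)
echo "$cmds"
# -> aws iam detach-role-policy --role-name NodeInstanceRole --policy-arn arn:aws:iam::aws:policy/CloudWatchLogsFullAccess
```

Once every policy is detached, re-running the CloudFormation stack delete should get past the DeleteConflict error.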

@MG40

MG40 commented May 12, 2020

Same issue. I deleted the Auto Scaling group, the NAT gateways, and the VPCs (thanks to the billing alerts). I couldn't find any cluster to delete.

@MG40

MG40 commented May 14, 2020

There was another way: the CloudFormation stacks were still there, so I went ahead and deleted them. That worked too, the second time around!

@michaelbeaumont

See #2172 and potentially fixed by #2762

@dable-mj111

dable-mj111 commented Nov 10, 2020

@MG40 +1. I also deleted the hanging nodegroups by deleting the associated cloudformation stack.

@ajinkya933

In my case the problem was:

$eksctl delete nodegroup -f secondnode.yaml --approve
[!]  continuing with deletion, error occurred: error getting instance role ARN for nodegroup "second": stack not found for nodegroup "second"
[!]  no nodes found in nodegroup "second" (label selector: "alpha.eksctl.io/nodegroup-name=second")
[!]  removing nodegroup from auth ConfigMap: instance identity ARN "" not found in auth ConfigMap

My issue was that the node group was not getting deleted.

How I fixed it:

Under IAM > Roles, I deleted every role that did not match the one returned by:

aws iam list-roles \
    | jq -r ".Roles[] \
    | select(.RoleName \
    | startswith(\"eksctl-$AWS_CLUSTER_NAME\") and contains(\"NodeInstanceRole\")) \
    .RoleName"

eksctl-kubeflow-example-nodegroup-ng-185-NodeInstanceRole-1DDJJXQBG9EM6

After that I added a new node group with eksctl create nodegroup -f secondnode.yaml, and it attached successfully.

@sakshampaliwal

I faced the same issue. I tried deleting the nodegroup both through the GUI and using a command, but it wouldn't delete. It seemed to get stuck. However, after waiting for 10 minutes, it finally got deleted.
