
A node was created with error and Karpenter doesn't terminate it #1111

Closed
Noksa opened this issue Jan 9, 2022 · 10 comments
Labels: blocked (Unable to make progress due to some dependency), bug (Something isn't working), launch-templates (Questions related to AWS launch templates)

Comments

Noksa commented Jan 9, 2022

Version

Karpenter: v0.5.4

Kubernetes: v1.21.5

Expected Behavior

Karpenter should at least delete the 'bad' node.

Actual Behavior

For some reason, one of the nodes was created with the following error in the logs:

2022-01-09T11:06:54.813Z	ERROR	controller.provisioning	Could not launch node, launching instances, All attempts fail:
#1: got instance i-0a3e4018687b745ba but PrivateDnsName was not set
#2: got instance i-0a3e4018687b745ba but PrivateDnsName was not set
#3: got instance i-0a3e4018687b745ba but PrivateDnsName was not set	{"commit": "7e79a67", "provisioner": "storage-db-smb-eu-central-1"}

After these messages Karpenter just ignores this instance, even though it is actually up and running.
Karpenter reported that it couldn't create the node, so the instance is no longer tracked by Karpenter: it sits there empty, and Karpenter doesn't delete it.
At the same time, the instance has joined the k8s cluster as a node:

apiVersion: v1
kind: Node
metadata:
  annotations:
    csi.volume.kubernetes.io/nodeid: '{"ebs.csi.aws.com":"i-0a3e4018687b745ba"}'
    node.alpha.kubernetes.io/ttl: "0"
    volumes.kubernetes.io/controller-managed-attach-detach: "true"
  creationTimestamp: "2022-01-09T11:07:42Z"
  labels:
    beta.kubernetes.io/arch: amd64
    beta.kubernetes.io/instance-type: c5a.xlarge
    beta.kubernetes.io/os: linux
    failure-domain.beta.kubernetes.io/region: eu-central-1
    failure-domain.beta.kubernetes.io/zone: eu-central-1a
    kubernetes.io/arch: amd64
    kubernetes.io/hostname: ip-192-168-54-192
    kubernetes.io/os: linux
    node.kubernetes.io/instance-type: c5a.xlarge
    topology.ebs.csi.aws.com/zone: eu-central-1a
    topology.kubernetes.io/region: eu-central-1
    topology.kubernetes.io/zone: eu-central-1a
  name: ip-192-168-54-192.eu-central-1.compute.internal
  resourceVersion: "171830741"
  uid: c8deca20-68d4-4452-bef3-e38babd85a82
spec:
  providerID: aws:///eu-central-1a/i-0a3e4018687b745ba
...
status:
...
   - address: ip-192-168-54-192.eu-central-1.compute.internal                                                                                                                   
     type: InternalDNS

What I did:
I scaled a StatefulSet from 0 to 3 replicas, and Karpenter created 4 instances instead of 3 (due to the error, I guess).

Steps to Reproduce the Problem

I don't know exactly, because this is the first time I've encountered this issue.

Resource Specs and Logs

Can we have a workaround for such cases? Karpenter should track instances that were actually created, even if they are not (yet) treated as nodes.

PS: I tried restarting Karpenter, but it didn't find this instance/node and couldn't delete it.
I think this is because the finalizer was not added by Karpenter due to the error.
Maybe Karpenter should add the finalizer anyway, despite the errors.
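
As a stopgap, a manual cleanup along these lines should work (a sketch; the node name and instance ID are the ones from the manifest and log above):

# Remove the unmanaged node object, then terminate the orphaned instance.
kubectl delete node ip-192-168-54-192.eu-central-1.compute.internal
aws ec2 terminate-instances --instance-ids i-0a3e4018687b745ba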

@Noksa Noksa added the bug Something isn't working label Jan 9, 2022
@Noksa Noksa changed the title A node was created with error and Karpenter doesn't delete it A node was created with error and Karpenter doesn't terminate it Jan 9, 2022
@olemarkus (Contributor)

Same as #1067, I think


Noksa commented Jan 9, 2022

Yeah, looks the same.

I think it may happen because PrivateDnsName is set only when an instance is in the running state:

privateDnsName
    (IPv4 only) The private DNS hostname name assigned to the instance. This DNS hostname can only be used inside the Amazon EC2 network. This name is not available until the instance enters the running state. 

Ref: https://docs.aws.amazon.com/AWSEC2/latest/APIReference/API_Instance.html

If for some reason an instance can't reach the running state, it won't have a PrivateDnsName.
The most common scenario is when EC2 doesn't have capacity for the desired instance type and we have to wait a bit until EC2 can provision it (I'm not sure whether Karpenter sees the same behavior, but when I used ASGs I saw such cases a lot). There may also be less common scenarios where provisioning a new instance simply takes a bit longer, and the instance receives its PrivateDnsName after Karpenter has already decided that it doesn't have one.
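
One way to observe this (a sketch, assuming the AWS CLI is available): DescribeInstances leaves PrivateDnsName empty while the instance is still pending and only fills it in once the state becomes running, e.g.:

# Check the instance state and PrivateDnsName together (instance ID from the error log above).
aws ec2 describe-instances \
  --instance-ids i-0a3e4018687b745ba \
  --query 'Reservations[].Instances[].[State.Name,PrivateDnsName]' \
  --output text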

@ellistarn ellistarn added the launch-templates Questions related to AWS launch templates label Jan 9, 2022
@ellistarn (Contributor)

EC2 can often take several seconds to assign a Private DNS Name. It's usually much quicker, but we've seen this happen. This will be resolved permanently when we move towards id-based naming in kubernetes 1.23.

Are you using a custom launch template? It must inject the label karpenter.sh/provisioner-name="storage-db-smb-eu-central-1". Karpenter does this automatically for launch templates it generates, but can't enforce it with custom launch templates. There is almost certainly a docs gap here (@geoffcline). This is one of the many pain points of custom launch templates that makes me wonder if we simply shouldn't support them (@suket22). Further, any of the labels karpenter would've applied (provisioner.spec.labels) will be missing.

This label allows karpenter to manage the node, even if there was a loss of availability when the node came online. Further, karpenter will inject the finalizer into the node, as long as the label exists.
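
For reference, a minimal sketch of what this could look like in a custom launch template's user data on EKS (the cluster name below is a placeholder; adjust for your AMI and bootstrap mechanism):

#!/bin/bash
# Sketch only: pass the provisioner label to kubelet via the EKS bootstrap script
# so karpenter can recognize and manage the node. "my-cluster" is a placeholder.
/etc/eks/bootstrap.sh my-cluster \
  --kubelet-extra-args '--node-labels=karpenter.sh/provisioner-name=storage-db-smb-eu-central-1'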

Some paths forward:

  • Include this in your custom LT (and update docs)
  • We could build an AWS-specific controller that describes instances, discovers nodes, and rebuilds labels from AWS tags.
  • I'd love to learn why you're using a custom LT and if there are other features we can provide to simplify this.


Noksa commented Jan 9, 2022

Yeah, I'm using a custom LT.
The reason is #1008.
In my case:

  • Some pods must use Ubuntu as the OS.
  • Some pods must use a specific AMI where we preinstalled some packages on the node (for example tcpdump), so we built that AMI and I have to use it instead of the default.

I think that once #1008 is implemented I will be able to get rid of the custom LT.


Noksa commented Jan 9, 2022

@ellistarn
Thanks for the clarification about the label and provisioners.
But could you say why other nodes that are created correctly do have this label, even though they are provisioned with the same custom LT?

As an example:

apiVersion: v1
kind: Node
metadata:
  annotations:
    node.alpha.kubernetes.io/ttl: "0"
  creationTimestamp: "2022-01-09T17:25:53Z"
  finalizers:
  - karpenter.sh/termination
  labels:
    karpenter.sh/capacity-type: on-demand
    karpenter.sh/provisioner-name: storage-db-smb-eu-central-1
    node.kubernetes.io/instance-type: c5a.xlarge
    only-db: "true"
    topology.kubernetes.io/zone: eu-central-1b
...

It looks like only the 'bad' node without a PrivateDnsName is missing the label and the finalizer.
Karpenter also applied a label from the Provisioner spec (only-db: "true" in my case), but, as I mentioned, only to the 'good' nodes.


olemarkus commented Jan 9, 2022

When Karpenter is able to create the instance, it also creates the node object itself with the correct label. When it fails, it is kubelet that creates the node object, but kubelet has not been configured to add the label, thus it is missing.


Noksa commented Jan 9, 2022

@olemarkus Thanks for the clarification.
When an instance has been created and has joined the k8s cluster (even if Karpenter failed to create the node object itself), Karpenter should still be able to add the required labels/finalizers, because it knows at least the instance ID from when it tried to create the node object.
Could we use such behavior to ensure that the node object has all the required labels/finalizers/etc. that Karpenter should have applied?

I think the core problem is that Karpenter relies on a node object created by itself. It would probably help if Karpenter could check/modify node objects that were created by kubelet when Karpenter failed to do it.

@olemarkus (Contributor)

It would take some state maintenance that karpenter doesn't do right now. The easier solution is to use the instance ID as the node name, which I added support for earlier and which kOps clusters already make use of.

@ellistarn ellistarn added the blocked Unable to make progress due to some dependency label Jan 28, 2022

ellistarn commented Jan 28, 2022

Hey @Noksa, going to close this since things should be working now that you have the correct labels in your custom launch template. Will track follow-ups in #1008.


Noksa commented Jan 29, 2022

Sure, I'm waiting for #1008; it covers almost all of my cases.

suket22 closed this as completed Feb 3, 2022