
A node was created with error and Karpenter doesn't terminate it #1111

Closed
Noksa opened this issue Jan 9, 2022 · 10 comments
Labels: blocked (Unable to make progress due to some dependency), bug (Something isn't working), launch-templates (Questions related to AWS launch templates)

Comments

Noksa commented Jan 9, 2022

Version

Karpenter: v0.5.4

Kubernetes: v1.21.5

Expected Behavior

Karpenter should at least delete the 'bad' node.

Actual Behavior

For some reason, one of the nodes was created with the following error in the logs:

2022-01-09T11:06:54.813Z	ERROR	controller.provisioning	Could not launch node, launching instances, All attempts fail:
#1: got instance i-0a3e4018687b745ba but PrivateDnsName was not set
#2: got instance i-0a3e4018687b745ba but PrivateDnsName was not set
#3: got instance i-0a3e4018687b745ba but PrivateDnsName was not set	{"commit": "7e79a67", "provisioner": "storage-db-smb-eu-central-1"}

After these messages Karpenter just ignores this instance, even though it is actually up and running.
Karpenter reported that it couldn't create the node, so the instance is no longer tracked by Karpenter: it sits there empty, and Karpenter doesn't delete it.
At the same time, the instance has joined the k8s cluster as a node:

apiVersion: v1
kind: Node
metadata:
  annotations:
    csi.volume.kubernetes.io/nodeid: '{"ebs.csi.aws.com":"i-0a3e4018687b745ba"}'
    node.alpha.kubernetes.io/ttl: "0"
    volumes.kubernetes.io/controller-managed-attach-detach: "true"
  creationTimestamp: "2022-01-09T11:07:42Z"
  labels:
    beta.kubernetes.io/arch: amd64
    beta.kubernetes.io/instance-type: c5a.xlarge
    beta.kubernetes.io/os: linux
    failure-domain.beta.kubernetes.io/region: eu-central-1
    failure-domain.beta.kubernetes.io/zone: eu-central-1a
    kubernetes.io/arch: amd64
    kubernetes.io/hostname: ip-192-168-54-192
    kubernetes.io/os: linux
    node.kubernetes.io/instance-type: c5a.xlarge
    topology.ebs.csi.aws.com/zone: eu-central-1a
    topology.kubernetes.io/region: eu-central-1
    topology.kubernetes.io/zone: eu-central-1a
  name: ip-192-168-54-192.eu-central-1.compute.internal
  resourceVersion: "171830741"
  uid: c8deca20-68d4-4452-bef3-e38babd85a82
spec:
  providerID: aws:///eu-central-1a/i-0a3e4018687b745ba
...
status:
...
   - address: ip-192-168-54-192.eu-central-1.compute.internal                                                                                                                   
     type: InternalDNS

What I did:
I scaled a StatefulSet from 0 to 3 replicas, and Karpenter created 4 instances instead of 3 (due to the error, I guess).

Steps to Reproduce the Problem

I don't know exactly, because this is the first time I've encountered this issue.

Resource Specs and Logs

Can we have a workaround for such cases? Karpenter should track instances that were actually created, even if they are not (yet) treated as nodes.

PS: I tried restarting Karpenter, but it didn't find this instance/node and couldn't delete it.
I think this is because the finalizer was not added by Karpenter due to the error.
Maybe Karpenter should add the finalizer anyway, despite the errors.
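
As a stopgap, a manual cleanup along these lines should work (a sketch; the node name and instance ID are the ones from the manifest and log above):

# Remove the unmanaged node object, then terminate the orphaned instance.
kubectl delete node ip-192-168-54-192.eu-central-1.compute.internal
aws ec2 terminate-instances --instance-ids i-0a3e4018687b745ba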

@Noksa Noksa added the bug Something isn't working label Jan 9, 2022
@Noksa Noksa changed the title A node was created with error and Karpenter doesn't delete it A node was created with error and Karpenter doesn't terminate it Jan 9, 2022
@olemarkus (Contributor)

Same as #1067, I think


Noksa commented Jan 9, 2022

Yeah, looks the same.

I think it may happen because PrivateDnsName is set only when an instance is in the running state:

privateDnsName
    (IPv4 only) The private DNS hostname name assigned to the instance. This DNS hostname can only be used inside the Amazon EC2 network. This name is not available until the instance enters the running state. 

Ref: https://docs.aws.amazon.com/AWSEC2/latest/APIReference/API_Instance.html

If for some reason an instance can't reach the running state, it won't have a PrivateDnsName.
The most common scenario is when EC2 doesn't have capacity for the desired instance type and we have to wait a bit until EC2 can provision it (I'm not sure whether Karpenter sees the same behavior, but when I used ASGs I saw such cases a lot). There may also be less common scenarios where provisioning a new instance simply takes a bit longer, and the instance receives its PrivateDnsName after Karpenter has already decided that it doesn't have one.
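
One way to observe this (a sketch, assuming the AWS CLI is available): DescribeInstances leaves PrivateDnsName empty while the instance is still pending and only fills it in once the state becomes running, e.g.:

# Check the instance state and PrivateDnsName together (instance ID from the error log above).
aws ec2 describe-instances \
  --instance-ids i-0a3e4018687b745ba \
  --query 'Reservations[].Instances[].[State.Name,PrivateDnsName]' \
  --output text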

@ellistarn ellistarn added the launch-templates Questions related to AWS launch templates label Jan 9, 2022
@ellistarn (Contributor)

EC2 can often take several seconds to assign a Private DNS Name. It's usually much quicker, but we've seen this happen. This will be resolved permanently when we move towards id-based naming in kubernetes 1.23.

Are you using a custom launch template? It must inject the label karpenter.sh/provisioner-name="storage-db-smb-eu-central-1". Karpenter does this automatically for launch templates it generates, but can't enforce it with custom launch templates. There is almost certainly a docs gap here (@geoffcline). This is one of the many pain points of custom launch templates that makes me wonder if we simply shouldn't support them (@suket22). Further, any of the labels karpenter would've applied (provisioner.spec.labels) will be missing.

This label allows karpenter to manage the node, even if there was a loss of availability when the node came online. Further, karpenter will inject the finalizer into the node, as long as the label exists.
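
For reference, a minimal sketch of what this could look like in a custom launch template's user data on EKS (the cluster name below is a placeholder; adjust for your AMI and bootstrap mechanism):

#!/bin/bash
# Sketch only: pass the provisioner label to kubelet via the EKS bootstrap script
# so karpenter can recognize and manage the node. "my-cluster" is a placeholder.
/etc/eks/bootstrap.sh my-cluster \
  --kubelet-extra-args '--node-labels=karpenter.sh/provisioner-name=storage-db-smb-eu-central-1'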

Some paths forward:

  • Include this in your custom LT (and update docs)
  • We could build an AWS-specific controller that describes instances, discovers nodes, and rebuilds labels from AWS tags.
  • I'd love to learn why you're using a custom LT and if there are other features we can provide to simplify this.


Noksa commented Jan 9, 2022

Yeah, I'm using a custom LT.
The reason is #1008.
In my case:

  • Some pods must use Ubuntu as the OS.
  • Some pods must use a specific AMI where we preinstalled some packages on the node (for example tcpdump), so we built that AMI and I have to use it instead of the default.

I think that once #1008 is implemented I will be able to get rid of the custom LT.


Noksa commented Jan 9, 2022

@ellistarn
Thanks for the clarification about the label and provisioners.
But could you say why other nodes that are created correctly do have this label, even though they are provisioned with the same custom LT?

As an example:

apiVersion: v1
kind: Node
metadata:
  annotations:
    node.alpha.kubernetes.io/ttl: "0"
  creationTimestamp: "2022-01-09T17:25:53Z"
  finalizers:
  - karpenter.sh/termination
  labels:
    karpenter.sh/capacity-type: on-demand
    karpenter.sh/provisioner-name: storage-db-smb-eu-central-1
    node.kubernetes.io/instance-type: c5a.xlarge
    only-db: "true"
    topology.kubernetes.io/zone: eu-central-1b
...

It looks like only the 'bad' node without a PrivateDnsName is missing the label and the finalizer.
Karpenter also applied a label from the Provisioner spec (only-db: "true" in my case), but, as I mentioned, only to the 'good' nodes.


olemarkus commented Jan 9, 2022

When Karpenter is able to create the instance, it also creates the node object itself with the correct label. When it fails, it is kubelet that creates the node object, but kubelet has not been configured to add the label, thus it is missing.


Noksa commented Jan 9, 2022

@olemarkus Thanks for the clarification.
When an instance has been created and has joined the k8s cluster (even if Karpenter failed to create the node object itself), Karpenter should still be able to add the required labels/finalizers, because it knows at least the instance ID from when it tried to create the node object.
Could we use such behavior to ensure that the node object has all the required labels/finalizers/etc. that Karpenter should have applied?

I think the core problem is that Karpenter relies on a node object created by itself. It would probably help if Karpenter could check/modify node objects that were created by kubelet when Karpenter failed to do it.

@olemarkus (Contributor)

It would take some state maintenance that karpenter doesn't do right now. The easier solution is to use the instance ID as the node name, which I added support for earlier and which kOps clusters already make use of.

@ellistarn ellistarn added the blocked Unable to make progress due to some dependency label Jan 28, 2022

ellistarn commented Jan 28, 2022

Hey @Noksa, going to close this since things should be working now that you have the correct labels in your custom launch template. Will track follow-ups in #1008.


Noksa commented Jan 29, 2022

Sure, I'm waiting for #1008; it covers almost all of my cases.

suket22 closed this as completed Feb 3, 2022