
Taint nodes with a NoSchedule for Consolidation before Validation begins #651

Open
njtran opened this issue Oct 31, 2023 · 31 comments · May be fixed by wmgroot/karpenter#1

@njtran
Contributor

njtran commented Oct 31, 2023

Description

What problem are you trying to solve?
Karpenter adds the karpenter.sh/disruption:NoSchedule=disrupting taint for Consolidation actions only after a 15s validation period. There is a narrow window in which pods that block eviction can schedule to a node before Karpenter taints it, resulting in those pods eventually being evicted during termination. This was validated to be a race of < 300ms here.

While #624 includes PreferNoSchedule as an option, it may make more sense to only use NoSchedule for Consolidation to completely remove this race condition.
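
For illustration, applying the taint up front amounts to something like the following client-go sketch (not Karpenter's actual implementation; the function name is made up, and only the taint key/value mirror the existing disruption taint):

```go
package disruption

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// taintBeforeValidation adds the disruption taint to a consolidation candidate
// so the kube-scheduler stops placing new pods (including do-not-disrupt pods)
// on it while the validation period runs.
func taintBeforeValidation(ctx context.Context, client kubernetes.Interface, nodeName string) error {
	node, err := client.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	for _, t := range node.Spec.Taints {
		if t.Key == "karpenter.sh/disruption" {
			return nil // already tainted
		}
	}
	node.Spec.Taints = append(node.Spec.Taints, corev1.Taint{
		Key:    "karpenter.sh/disruption",
		Value:  "disrupting",
		Effect: corev1.TaintEffectNoSchedule,
	})
	_, err = client.CoreV1().Nodes().Update(ctx, node, metav1.UpdateOptions{})
	return err
}
```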

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
@njtran njtran added deprovisioning Issues related to node deprovisioning kind/feature Categorizes issue or PR as related to a new feature. v1 Issues requiring resolution by the v1 milestone labels Oct 31, 2023
@njtran njtran changed the title Taint nodes with a NoSchedule for Consolidation once before Validation begins Taint nodes with a NoSchedule for Consolidation before Validation begins Oct 31, 2023
@billrayburn

/assign @njtran

@njtran njtran removed their assignment Jan 23, 2024
@ljosyula

/assign @ljosyula

@jmdeal
Member

jmdeal commented Feb 13, 2024

/assign jmdeal

@wmgroot
Contributor

wmgroot commented Feb 13, 2024

I'm working on implementing a patch to deal with this issue at my company. We're running into this constantly when using Karpenter for a cluster with almost nothing but ephemeral CI jobs, which seems like a recipe for constant disruption evaluation because the pod profile is highly volatile.

Can I ask that we prioritize an air-tight fix for the issue instead of another "reduce the chance of it occurring" option like the one proposed with the PreferNoSchedule implementation? Right now the do-not-disrupt annotation does not do what it says on the tin.

@njtran
Contributor Author

njtran commented Feb 14, 2024

@wmgroot Can you share how often you're hitting this? From my understanding the race is a very small window, so I'm curious exactly how often you're hitting it. Are you seeing a do-not-disrupt pod added/scheduled to a node, only for that node to be consolidated soon after?

@wmgroot
Contributor

wmgroot commented Feb 14, 2024

We've seen almost 2300 occurrences in the last week with Gitlab CI job pods. Every pod has the do-not-disrupt annotation. This cluster has about 25 NodeClaims which are primarily used for extremely volatile CI jobs.
[Screenshot: graph of occurrence counts over the past week]

Here is the patch I've applied to our clusters to address the issue, which verifies there are no do-not-disrupt pods after a simple 30s sleep. We've seen 0 occurrences of the issue since I applied the patch.
https://github.com/wmgroot/karpenter/pull/1/files
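
Conceptually, the added validation step boils down to something like this (a simplified client-go sketch rather than the actual diff; the function name and the hard-coded wait are illustrative):

```go
package disruption

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// validateNoBlockingPods waits out the eventually-consistent pod list and then
// fails the consolidation if any do-not-disrupt pod landed on the node after it
// was tainted.
func validateNoBlockingPods(ctx context.Context, client kubernetes.Interface, nodeName string, wait time.Duration) error {
	time.Sleep(wait) // e.g. 30s, to give newly bound pods time to show up

	pods, err := client.CoreV1().Pods("").List(ctx, metav1.ListOptions{
		FieldSelector: "spec.nodeName=" + nodeName,
	})
	if err != nil {
		return err
	}
	for _, p := range pods.Items {
		if p.Annotations["karpenter.sh/do-not-disrupt"] == "true" {
			return fmt.Errorf("pod %s/%s has karpenter.sh/do-not-disrupt", p.Namespace, p.Name)
		}
	}
	return nil
}
```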

@wmgroot
Contributor

wmgroot commented Feb 14, 2024

Here are some notes I've taken as I've explored this particular problem space. Hopefully this is helpful as background to anyone else trying to understand the problem.

I'd ultimately prefer we didn't use this annotation at all because it has node-level impact (a single pod can keep a node of arbitrary size around indefinitely). The limitations of Gitlab CI are currently making it necessary to use the annotation to complete a transition from cluster-autoscaler to Karpenter, which is still worth the tradeoff.

  1. Job-type workloads want a way to delay or avoid unnecessary disruption, since this results in lost work that must be performed again.
  2. There are multiple cases where pod shutdown is desired, but each situation might have a different preferred outcome. For example:
    a. Initial shutdown of a node due to efficiency consolidation or a regular maintenance update.
    b. Node lifetime timeout due to compliance requirements.
    c. User-initiated cancellation of a CI pipeline.
  3. The do-not-disrupt annotation is Karpenter-specific, making it not portable between different cluster architectures (which may or may not include cluster node autoscaling).
  4. PDBs are architecture agnostic, but are not a great fit for job-type workloads; they're designed for redundant pod patterns. In our clusters we have several OPA policies that forbid the creation of PDBs that do not tolerate voluntary disruption, blocking their use for single-pod cases.
  5. PreStop hooks are architecture agnostic, but make it difficult to support the desired outcomes for the cases described in point 2. They are not aware of "why" the pod is being requested to terminate.

Point 5 is the current reason why we can't just use a preStop hook to address the disruption of CI jobs. The hook doesn't know the difference between karpenter terminating the pod for node consolidation (we want to delay disruption) and a user in the Gitlab UI terminating the pod to cancel a CI pipeline (we want to terminate the pod immediately, but still run any post-job cleanup, which is skipped via forced pod termination). Adding the preStop hook as Gitlab CI is currently implemented would block termination of the job pod if a user attempted to cancel the job through the UI.

In summary I believe this issue is bigger than both Karpenter and Gitlab CI. The tooling we have in K8s is either not flexible enough to support the high disruption patterns we'd like to use in more efficient clusters, or it's not clear enough how to use the existing tooling.

@mikkeloscar

We are also still seeing this issue from time to time, though less often than before the mitigation in #583.

I observed the following events:

{"eventNamespace":"default","involvedObject":{"kind":"Node","name":"ip-172-31-6-136.eu-central-1.compute.internal","uid":"473d160c-783a-427e-a75f-5c57c57e29df","apiVersion":"v1"},"reason":"DisruptionBlocked","eventMessage":"Cannot disrupt Node: Pod \"default/cdp-default-2ekrbha7wakskum3uuw492ts4y\" has \"karpenter.sh/do-not-disrupt\" annotation","source":{"component":"karpenter"},"firstTimestamp":"2024-02-26T16:41:54Z","lastTimestamp":"2024-02-26T16:41:54Z","count":1,"type":"Normal","reportingComponent":"karpenter"}

{"eventNamespace":"default","involvedObject":{"kind":"NodeClaim","name":"default-karpenter-fh9zb","uid":"42c74ed6-77b9-41e9-ae6f-b27b4e29feab","apiVersion":"karpenter.sh/v1beta1"},"reason":"DisruptionBlocked","eventMessage":"Cannot disrupt NodeClaim: Pod \"default/cdp-default-2ekrbha7wakskum3uuw492ts4y\" has \"karpenter.sh/do-not-disrupt\" annotation","source":{"component":"karpenter"},"firstTimestamp":"2024-02-26T16:41:54Z","lastTimestamp":"2024-02-26T16:41:54Z","count":1,"type":"Normal","reportingComponent":"karpenter"}

{"eventNamespace":"default","involvedObject":{"kind":"Node","name":"ip-172-31-6-136.eu-central-1.compute.internal","uid":"473d160c-783a-427e-a75f-5c57c57e29df","apiVersion":"v1"},"reason":"DisruptionTerminating","eventMessage":"Disrupting Node: Consolidation/Delete","source":{"component":"karpenter"},"firstTimestamp":"2024-02-26T16:41:54Z","lastTimestamp":"2024-02-26T16:41:54Z","count":1,"type":"Normal","reportingComponent":"karpenter"}

{"eventNamespace":"default","involvedObject":{"kind":"NodeClaim","name":"default-karpenter-fh9zb","uid":"42c74ed6-77b9-41e9-ae6f-b27b4e29feab","apiVersion":"karpenter.sh/v1beta1"},"reason":"DisruptionTerminating","eventMessage":"Disrupting NodeClaim: Consolidation/Delete","source":{"component":"karpenter"},"firstTimestamp":"2024-02-26T16:41:54Z","lastTimestamp":"2024-02-26T16:41:54Z","count":1,"type":"Normal","reportingComponent":"karpenter"}

{"eventNamespace":"default","involvedObject":{"kind":"Node","name":"ip-172-31-6-136.eu-central-1.compute.internal","uid":"473d160c-783a-427e-a75f-5c57c57e29df","apiVersion":"v1"},"reason":"FailedDraining","eventMessage":"Failed to drain node, 10 pods are waiting to be evicted","source":{"component":"karpenter"},"firstTimestamp":"2024-02-26T16:41:54Z","lastTimestamp":"2024-02-26T16:41:54Z","count":1,"type":"Warning","reportingComponent":"karpenter"}

{"eventNamespace":"default","involvedObject":{"kind":"Pod","namespace":"default","name":"cdp-default-2ekrbha7wakskum3uuw492ts4y","apiVersion":"v1"},"reason":"Evicted","eventMessage":"Evicted pod","source":{"component":"karpenter"},"firstTimestamp":"2024-02-26T16:41:55Z","lastTimestamp":"2024-02-26T16:41:55Z","count":1,"type":"Normal","reportingComponent":"karpenter"}

The first two events show that the pod cdp-default-2ekrbha7wakskum3uuw492ts4y is blocking the disruption, but the very next events, 1 second later, show the node being disrupted.

This pod is used to run e2e tests, and evicting it basically breaks our e2e run, so it's a similar disruption to the one @wmgroot described.

@jonathan-innis
Member

@mikkeloscar That's interesting. I wouldn't expect to see the behavior that you're describing. I could definitely see this race occurring when there is a pod that is just coming up and we don't see the binding to the new node yet, but it's odd that we would have a node that is blocked by a pod and then one second later see that same pod no longer blocking (but no bindings have changed).

Did the state of that pod change in any way? Did it shift to a new node or did it change its phase or status conditions to indicate that it finished?

@mikkeloscar

Did the state of that pod change in any way? Did it shift to a new node or did it change its phase or status conditions to indicate that it finished?

I'm not 100% sure, as we rely on events that we ship to our long-term storage, but among all the events for the pod I can only find the one about Karpenter evicting it; before that it was in a running state.

I looked at earlier events and the pod had started only 6s before the first disruption attempt, so that may also be relevant:

{"eventNamespace":"default","involvedObject":{"kind":"Pod","namespace":"default","name":"cdp-default-2ekrbha7wakskum3uuw492ts4y","uid":"1e9e7393-68db-466b-9126-956d9c9830f7","apiVersion":"v1","fieldPath":"spec.containers{main}"},"reason":"Started","eventMessage":"Started container main","source":{"component":"kubelet","host":"ip-172-31-6-136.eu-central-1.compute.internal"},"firstTimestamp":"2024-02-26T16:41:48Z","lastTimestamp":"2024-02-26T16:41:48Z","count":1,"type":"Normal","reportingComponent":"kubelet","reportingInstance":"ip-172-31-6-136.eu-central-1.compute.internal"}

@DanielQ-CV

I have the same problem with CI Jobs/Pods.
The Pod has both the do-not-disrupt and do-not-evict annotations.
The Pod is started a few seconds before Karpenter tries to consolidate the node:
"message":"disrupting via consolidation delete, terminating 1 candidates "
The Pod then gets evicted.

@JacobHenner

I wouldn't expect to see the behavior that you're describing. I could definitely see this race occurring when there is a pod that is just coming up and we don't see the binding to the new node yet, but it's odd that we would have a node that is blocked by a pod and then one second later we see that that same pod is no longer blocking (but no bindings have changed).

We're also frequently seeing issues under almost the same circumstances as #651 (comment): GitLab CI executor pods being evicted even though they have the do-not-disrupt annotation specified at pod creation time. Anecdotally (I don't have before/after metrics), the frequency of this issue increased after upgrading from v0.33.x to v0.35.0.

@alex-berger

Same here, also with GitLab CI Job execution pods. Just chiming in to underline that this is a serious bug seriously affecting (karpenter) users in real life :-).

@ospiegel91

My team (at a large global company) is affected by this while using Argo Workflows (Job templates behind the scenes).
Pods get an eviction notice before even entering the init stage; the do-not-disrupt annotation doesn't help.

@jmdeal
Member

jmdeal commented Apr 5, 2024

I'm going to wager we're seeing this occur more frequently as users migrate to versions of Karpenter with parallel disruption. More rapid, parallel consolidation == more occurrences of this race condition. We've been punting on solving this issue until a full, airtight taint strategy redesign, but given the increased prevalence we should take a temporary measure that may or may not be amended when we do the taint redesign addressing #624.

Realistically, I think @wmgroot's solution is a good route to go within our existing tainting behavior. I am curious whether we really need a full 30-second waiting period; since the pod list is eventually consistent, we will need some sufficiently large waiting period. CAS implements a similar wait that defaults to 5 seconds but is configurable via the node-delete-delay-after-taint flag, though I haven't seen any guidance on when a user might need to tune it.

@wmgroot
Contributor

wmgroot commented Apr 5, 2024

Making it configurable just covers the case where someone might need to increase it in the future. I agree that 30s is likely much longer than it needs to be.

@gnuletik

gnuletik commented Apr 8, 2024

@wmgroot I tested your branch and it seems to have greatly improved the situation (I just made small changes: rebased onto the latest published tag, restored the PDB exclusion, and excluded failed/succeeded pods when comparing annotations).

However, I'm still seeing a few nodes being evicted when Karpenter is starting, here are the logs:

Starting workers
tainted node
deleted node
deleted nodeclaim

During the startup, I don't see the following message that was introduced in the PR:

waiting for 30 seconds to check for do-not-disrupt or do-not-consolidate pods that may have scheduled...

I can see it when the disruption controller is attempting to disrupt a node though.

Is there another place where nodes are being disrupted and we need to call the ValidateNoScheduleTaint() func?

Thanks!

@jmdeal
Member

jmdeal commented Apr 8, 2024

These log lines should be coming from the termination controller, which handles the eventual deletion of nodes, rather than the disruption controller. If the disruption controller was responsible for deleting the nodes (rather than a manual user action), they should have already been tainted and validated by the disruption controller. I imagine what happened is that a disruption decision had already been made, candidates were tainted and validated, and the nodes had their deletion timestamps set; Karpenter then restarted before the termination controller processed all of the nodes and continued after it came back up. Does this line up with what you saw @gnuletik? Or were there instances of nodes with do-not-disrupt pods being disrupted?

@gnuletik

gnuletik commented Apr 9, 2024

Thanks for the feedback @jmdeal. Thanks also for the explanation of the termination controller.

There was no manual deletion of the nodes.
There were instances of nodes with do-not-disrupt pods that were disrupted.
I checked the logs from before Karpenter restarted and didn't see any logs from the disruption controller deleting the node.

However, before Karpenter was restarted, there were multiple occurrences of:

  • created nodeclaim
  • launched nodeclaim
  • registered nodeclaim

Is it possible that Karpenter being restarted while it was in the process of creating a new node led to a pod eviction?

Also, Karpenter was restarted because there was a panic while trying to renew the lock (the apiserver was slow to answer because of heavy usage during this period):

panic: leader election lost

@JacobHenner

As a temporary workaround I've implemented a validating webhook that rejects eviction requests from Karpenter if the pod is appropriately annotated. This seems to work, but is detrimental because (as I understand it) eviction will now be blocked for pods that are interrupted (e.g. spot reclaim), which is undesirable (but acceptable short-term in one particular case I'm dealing with).
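
The shape of the webhook handler is roughly the following (a simplified sketch; the service-account name and the fail-open behavior are specific to my setup):

```go
package webhook

import (
	"context"

	admissionv1 "k8s.io/api/admission/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// reviewEviction denies pods/eviction requests coming from Karpenter when the
// target pod carries the do-not-disrupt annotation; everything else is allowed.
func reviewEviction(ctx context.Context, client kubernetes.Interface, req *admissionv1.AdmissionRequest) *admissionv1.AdmissionResponse {
	allow := &admissionv1.AdmissionResponse{UID: req.UID, Allowed: true}

	// Only gate evictions issued by Karpenter's controller identity
	// (adjust the service account to match your install).
	if req.UserInfo.Username != "system:serviceaccount:kube-system:karpenter" {
		return allow
	}
	pod, err := client.CoreV1().Pods(req.Namespace).Get(ctx, req.Name, metav1.GetOptions{})
	if err != nil {
		return allow // fail open if the pod can't be fetched
	}
	if pod.Annotations["karpenter.sh/do-not-disrupt"] == "true" {
		return &admissionv1.AdmissionResponse{
			UID:     req.UID,
			Allowed: false,
			Result:  &metav1.Status{Message: "pod has karpenter.sh/do-not-disrupt"},
		}
	}
	return allow
}
```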

@jmdeal do you have an estimate for when a short-term fix and a long-term fix will be available? We're looking to rapidly expand our use of Karpenter within our infrastructure and this issue is making it difficult to do so. Having an idea of when a short-term/long-term fix will be available will help us determine what our next steps should be.

@DanielQ-CV

Hello, thank you for providing the fix. We installed release 0.36.1 on Friday, and over the whole weekend there were no crashes in our CI/CD agents. The weekend before, we had an interruption almost every hour. The fix seems to be working well. Thank you very much!

@armondressler

We upgraded to 0.36.1 around 6 days ago, and the issue hasn't reoccurred as of yet. Appreciate your work, thanks guys.

@JacobHenner

I've also not received any reports of unexpected consolidation disruptions since upgrading to 0.36.1. However, I've seen drift disruptions on nodes with the do-not-disrupt annotation on one or more assigned pods, which I believe is unexpected. I've worked around this for now by disabling drift disruption.

@jmdeal
Member

jmdeal commented May 7, 2024

Definitely not expected; how frequently are you seeing this occur? Do you have logs you can share? It may be worth opening a separate issue.

@armondressler

We upgraded to 0.36.1 around 6 days ago, and the issue hasn't reoccurred as of yet. Appreciate your work, thanks guys.

Looks like we're still seeing pods getting evicted (despite the do-not-disrupt annotation) immediately after scheduling due to node consolidation, even when running v0.36.1.

Karpenter log entries for the node as follows:

{"level":"INFO","time":"2024-05-24T11:26:25.504Z","logger":"controller.nodeclaim.lifecycle","message":"registered nodeclaim","commit":"fb4d75f","nodeclaim":"application-amd64-bottlerocket-on-demand-v1-bzkbj","provider-id":"aws:///eu-central-1c/i-0c349508d7a941d34","node":"ip-10-48-37-168.eu-central-1.compute.internal"}
{"level":"INFO","time":"2024-05-24T11:26:55.171Z","logger":"controller.nodeclaim.lifecycle","message":"initialized nodeclaim","commit":"fb4d75f","nodeclaim":"application-amd64-bottlerocket-on-demand-v1-bzkbj","provider-id":"aws:///eu-central-1c/i-0c349508d7a941d34","node":"ip-10-48-37-168.eu-central-1.compute.internal","allocatable":{"cpu":"15890m","ephemeral-storage":"60711501722","hugepages-1Gi":"0","hugepages-2Mi":"0","memory":"28604942517","pods":"234"}}
{"level":"INFO","time":"2024-05-24T11:30:59.945Z","logger":"controller.disruption","message":"disrupting via consolidation delete, terminating 1 nodes (0 pods) ip-10-48-37-168.eu-central-1.compute.internal/c5a.4xlarge/on-demand","commit":"fb4d75f","command-id":"e2b9463c-d676-4c3e-9df8-1c00b5101946"}
{"level":"INFO","time":"2024-05-24T11:31:00.412Z","logger":"controller.node.termination","message":"tainted node","commit":"fb4d75f","node":"ip-10-48-37-168.eu-central-1.compute.internal"}
{"level":"INFO","time":"2024-05-24T11:31:06.694Z","logger":"controller.node.termination","message":"deleted node","commit":"fb4d75f","node":"ip-10-48-37-168.eu-central-1.compute.internal"}
{"level":"INFO","time":"2024-05-24T11:31:07.213Z","logger":"controller.nodeclaim.termination","message":"deleted nodeclaim","commit":"fb4d75f","nodeclaim":"application-amd64-bottlerocket-on-demand-v1-bzkbj","node":"ip-10-48-37-168.eu-central-1.compute.internal","provider-id":"aws:///eu-central-1c/i-0c349508d7a941d34"}

The k8s event log regarding the pod in question is rather straightforward:

10m         Normal    Scheduled                    pod/runner-svhlvxier-project-30216564-concurrent-0-r6wflsak                           Successfully assigned gitlab-runner/runner-svhlvxier-project-30216564-concurrent-0-r6wflsak to ip-10-48-37-168.eu-central-1.compute.internal
10m         Normal    Evicted                      pod/runner-svhlvxier-project-30216564-concurrent-0-r6wflsak   

I'd estimate this occurs once every ~50 CI jobs. Do you require any other logs or log levels to better analyze the issue?

@jmdeal
Member

jmdeal commented Jun 12, 2024

Hmm, that seems pretty clear cut. There are some in-flight changes that will mitigate the impact of these slipping through: #916 includes a change that will prevent pods with the do-not-disrupt annotation from being evicted during our drain process (with TerminationGracePeriod providing a mechanism for cluster operators to set an upper bound on node lifetime). Would this change be sufficient for your use case? The other option left for us is to have the overlapping taint / validation period, which will come at the cost of increased disruption latency since we would need to wait a few seconds to ensure the kube-scheduler has seen the taint.
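
To illustrate the intended semantics, the drain-time filtering would look roughly like this (a sketch only, not the actual change in #916; names are made up):

```go
package termination

import corev1 "k8s.io/api/core/v1"

// evictablePods returns the pods the drain loop may evict. Pods annotated
// do-not-disrupt are skipped until the operator-configured deadline derived
// from terminationGracePeriod has passed, after which they drain like any
// other pod.
func evictablePods(pods []corev1.Pod, deadlineExceeded bool) []corev1.Pod {
	var out []corev1.Pod
	for _, p := range pods {
		if p.Annotations["karpenter.sh/do-not-disrupt"] == "true" && !deadlineExceeded {
			continue
		}
		out = append(out, p)
	}
	return out
}
```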

@armondressler

#916 includes a change that will prevent pods with the do-not-disrupt annotation from being evicted during our drain process

The do-not-disrupt annotation being respected during drain should fix our issue. Let's see how it behaves when #916 gets merged. Thanks for your hard work :)

@Nuru

Nuru commented Jun 28, 2024

@jmdeal I'm wondering if we might be subject to this bug, or a similar bug, or if Karpenter properly handles this use case. It is difficult to catch happening, and I have not seen it yet (we only just started using this mechanism), but I think it is likely that Karpenter does not handle this properly, and I want to get it addressed. I don't see an easy way to monitor for it, and I do not want to put the effort into catching the bug if you can confirm from reviewing the code that the bug exists. Please let me know your thoughts.

Update: this is happening

I have now seen this series of events, with short enough intervals that I would call this a race condition:

  1. Pod deployed to Node X without annotation
  2. Pod adds annotation "karpenter.sh/do-not-disrupt": "true" to itself
  3. Karpenter (v0.37.0) decides Node X is underutilized, and logs "disrupting via consolidation delete" despite the annotated Pod
  4. Karpenter taints the Node with NoSchedule
  5. Something sends SIGTERM to the Pod while it is still running

This happens often enough that we cannot use WhenUnderutilized for NodePools running these Pods.

End of Update

We are running ephemeral GitHub Action Runners. These Pods take jobs off a queue, run the job, then exit. They take a while to start up (anywhere from 10 seconds to 3 minutes depending on what else is going on with the Node), so we keep a warm pool of idle runners ready to accept jobs, and then launch new Pods when those idle Pods start doing real work.

Idle Pods can be disrupted, so we schedule and deploy Pods with no annotations. We want the idle Pods to be disrupted if it will allow consolidation, because we can scale from a few to a few hundred and back, and do not want idle Pods scattered across big instances to prevent them from being consolidated to a single small instance.

The catch is, once a Pod takes a job, we want the job to run to completion, and do not want it disrupted. GitHub Action Runners do not have an automatic retry mechanism (in part because jobs are long and complex and not likely idempotent), so if they are killed, it requires manual intervention to decide what to do about the incomplete job. So to avoid having them interrupted, the first thing the Pod does after taking a job is add the "do-not-disrupt" annotation to itself.
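
For reference, the self-annotation step is essentially the following client-go sketch (it assumes the runner has RBAC to patch its own Pod and gets POD_NAME/POD_NAMESPACE from the downward API):

```go
package runner

import (
	"context"
	"os"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// markDoNotDisrupt is called as soon as the runner picks up a job, so Karpenter
// should stop treating this Pod's node as freely consolidatable.
func markDoNotDisrupt(ctx context.Context) error {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		return err
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		return err
	}
	patch := []byte(`{"metadata":{"annotations":{"karpenter.sh/do-not-disrupt":"true"}}}`)
	_, err = client.CoreV1().Pods(os.Getenv("POD_NAMESPACE")).Patch(
		ctx, os.Getenv("POD_NAME"), types.MergePatchType, patch, metav1.PatchOptions{})
	return err
}
```

If the patch fails (e.g. missing RBAC), the runner should probably refuse to start the job rather than run unprotected.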

Now in this case, the Pod is already on the Node, unaffected by any NoSchedule taints, and the Node looks like a consolidation candidate until the Pod adds the annotation. So my question is, in this case, will adding the annotation stop a consolidation in progress from going forward? If not, is that a bug?

If this use case is not already properly handled, how can I help get it handled (without writing code)?

@javidaslan

Confirming @Nuru's comment: we are seeing the same behaviour with our Airflow jobs on v0.37.0.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 26, 2024
@atsu85

atsu85 commented Nov 26, 2024

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 26, 2024