karpenter not provisioning new nodes with proper capacity-spread label #3397
Comments
I'm seeing two interesting things in your pod describe output. Specifically:
We improved logging on this very recently: https://github.com/aws/karpenter-core/pull/192/files
This should be available in v0.24: https://github.com/aws/karpenter-core/releases/tag/v0.24.0
I don't think it's this one, given the events that we're emitting.
Not sure what you mean with this one.
I think it's likely that you're running into well-known rollout issues with topology spread (more common at large scale). This is solved by this KEP: https://github.com/denkensk/enhancements/blob/master/keps/sig-scheduling/3243-respect-pod-topology-spread-after-rolling-upgrades/README.md
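For context, a rough sketch of what that KEP adds (the matchLabelKeys field on topologySpreadConstraints), assuming a Kubernetes version where the field is available; the label selector below is a placeholder, not taken from this issue:

topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: capacity-spread
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app.kubernetes.io/name: my-app   # placeholder selector
    matchLabelKeys:
      - pod-template-hash   # skew is evaluated per ReplicaSet, so old pods don't distort the spread during a rollout

With this, the spread is computed per rollout revision rather than across old and new pods together.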
Thank you for your reply. To explain my doubts:
After upgrading Karpenter to v0.24.0 we were able to reproduce the problem on another deployment. The analyzed pod:

namespace1teservicependingapp-main-6569859df4-bqz67   0/2   Pending   0   7m42s

In Karpenter's logs we noticed:

incompatible with provisioner "namespace1-karpenternamespace1teservice-dk5b2mvusd-spot-on-demand", unsatisfiable topology constraint for topology spread, key=capacity-spread (counts = map[1:10 2:20 3:0 4:0 5:0], podDomains = capacity-spread Exists, nodeDomains = capacity-spread In [1]);
incompatible with provisioner "namespace1-karpenternamespace1teservice-dk5b2mvusd-on-demand", unsatisfiable topology constraint for topology spread, key=capacity-spread (counts = map[1:10 2:20 3:0 4:0 5:0], podDomains = capacity-spread Exists, nodeDomains = capacity-spread In [2]);

Looking at this part of the log (if I understand it correctly), "key=capacity-spread (counts = map[1:10 2:20 3:0 4:0 5:0])", while the nodeSelector is defined as:

nodeSelector:
  tech.nodegroup/name: karpenternamespace1teservice
  tech.nodegroup/namespace: namespace1

it is interesting that the number of nodes filtered by the label tech.nodegroup/name=karpenternamespace1teservice:

kubectl get node -l tech.nodegroup/name=karpenternamespace1teservice

was 15, while

kubectl get node -l tech.nodegroup/namespace=namespace1,karpenter.sh/initialized=true

(karpenter.sh/initialized=true filters to Karpenter-managed nodes only) returned 30. Inside the namespace, nodes labeled with the key capacity-spread only have values from the list ["1", "2"]. I don't know if I interpret it correctly, but according to its logs Karpenter sees more nodes, and more possible values of the capacity-spread label, than it should after filtering by the nodeSelector.
We are now able to reproduce the issue in an isolated environment with two pairs of provisioners, each pair using a different list of values for the label 'capacity-spread'. Configuration below.

Two provisioners labeled tech.nodegroup/name: testbacground, with capacity-spread values from ['1', '2', '3', '4', '5']:

...
labels:
  tech.nodegroup/namespace: test
  tech.nodegroup/name: testbacground
...
requirements:
  - key: karpenter.sh/capacity-type
    operator: In
    values:
      - spot
      - on-demand
  - key: capacity-spread
    operator: In
    values:
      - '1'
      - '2'
      - '3'
...
---
...
labels:
  tech.nodegroup/namespace: test
  tech.nodegroup/name: testbacground
...
requirements:
  - key: karpenter.sh/capacity-type
    operator: In
    values:
      - on-demand
  - key: capacity-spread
    operator: In
    values:
      - '4'
      - '5'
...

Deployment for the first provisioner pair:

...
replicas: 5
...
nodeSelector:
  tech.nodegroup/namespace: test
  tech.nodegroup/name: testbacground
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: "capacity-spread"
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app.kubernetes.io/name: nginx-bacground

This properly created 5 pods/nodes with capacity-spread label values from ['1', '2', '3', '4', '5'].

The second provisioner pair, labeled tech.nodegroup/name: testpending, with capacity-spread values from ['1', '2']:

labels:
  rasp.nodegroup/name: testpending
  rasp.nodegroup/namespace: test
...
requirements:
  - key: karpenter.sh/capacity-type
    operator: In
    values:
      - spot
      - on-demand
  - key: capacity-spread
    operator: In
    values:
      - '1'
...
---
...
labels:
  rasp.nodegroup/name: testpending
  rasp.nodegroup/namespace: test
...
requirements:
  - key: karpenter.sh/capacity-type
    operator: In
    values:
      - on-demand
  - key: capacity-spread
    operator: In
    values:
      - '2'
...

Deployment for the second provisioner pair:

...
replicas: 5
...
nodeSelector:
  rasp.nodegroup/namespace: test
  rasp.nodegroup/name: testpending
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: "capacity-spread"
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app.kubernetes.io/name: nginx-pending

Pod status:

pod/nginx-pending-696998c96f-7gbcx   1/1   Running   0   19m
pod/nginx-pending-696998c96f-n7rnr   1/1   Running   0   19m
pod/nginx-pending-696998c96f-4wbbr   0/1   Pending   0   19m
pod/nginx-pending-696998c96f-pd7w6   0/1   Pending   0   19m
pod/nginx-pending-696998c96f-z665x   0/1   Pending   0   19m

Karpenter logs:

incompatible with provisioner "test-testpending-d2pv3ur8dm-on-demand", unsatisfiable topology constraint for topology spread, key=capacity-spread (counts = map[1:1 2:1 3:0 4:0 5:0], podDomains = capacity-spread Exists, nodeDomains = capacity-spread In [2]);
incompatible with provisioner "test-testpending-d2pv3ur8dm-spot-on-demand", unsatisfiable topology constraint for topology spread, key=capacity-spread (counts = map[1:1 2:1 3:0 4:0 5:0], podDomains = capacity-spread Exists, nodeDomains = capacity-spread In [1]);
Using a node selector to limit a pod to a particular provisioner won't limit the range of labels that Karpenter sees for that pod. That's why it's failing to schedule: it needs to spread across the domains evenly but is restricted to a provisioner that can't do it. We follow the standard K8s rules here (see https://kubernetes.io/docs/concepts/scheduling-eviction/topology-spread-constraints/#example-topologyspreadconstraints-with-nodeaffinity). To accomplish what you're trying to do, you'll need to instead add a required node affinity to the pod that limits it to only scheduling across the capacities that you want it to schedule against. The topology spread will then only be calculated against those capacities.
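For illustration, a minimal sketch of such a required node affinity for the testpending workload, restricting it to the capacity-spread values its two provisioners actually offer ('1' and '2'); treat this as an example, not a drop-in spec:

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: capacity-spread
              operator: In
              values:        # limits the topology domains the pod may use
                - "1"
                - "2"

With this restriction on the pod itself, both Karpenter and kube-scheduler compute the spread only across those two domains.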
Fair, but as far as I understand, the kube-scheduler implements a similar filtering mechanism. And yes, we do not set nodeAffinity, but we have configured taints/tolerations that limit the schedulable nodes (even before the scoring step), so in the end, IMHO, Karpenter should look at the label values only of schedulable nodes, as the kube-scheduler does.
Hi, when I look at this doc: https://kubernetes.io/docs/concepts/scheduling-eviction/topology-spread-constraints/#interaction-with-node-affinity-and-node-selectors, it says the scheduler skips non-matching nodes from the skew calculations when the pod has a nodeSelector or node affinity defined.
Karpenter does honor that for nodes that exist. This case is about nodes that don't exist yet but could be created, where custom topology domains are being provided by provisioners. You can submit a feature request, but I think it would be a complicated change, as we would need to identify the subsets of provisioners that pods can schedule against and treat them as distinct topology groups.
Does the same apply to all keys/constraints, for example if we had provisioners with different sets of AZs?
It's only required if you want to limit a pod to a particular subset of topology domains. The best way to do that is a required node affinity that specifically limits that topology key, since it works for both Karpenter and kube-scheduler, for existing and future nodes, and clearly states the intent.
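The same idea applied to zones might look like the sketch below, assuming the standard topology.kubernetes.io/zone label; the zone names are placeholders, not from this issue:

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: topology.kubernetes.io/zone
              operator: In
              values:
                - eu-west-1a   # placeholder zones the workload may spread across
                - eu-west-1b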
We've changed the pod configuration as below:

spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: tech.nodegroup/namespace
                operator: In
                values:
                  - test
              - key: tech.nodegroup/name
                operator: In
                values:
                  - testpending
  tolerations:
    - effect: NoSchedule
      key: dedicated
      value: test
    - effect: NoExecute
      key: node.kubernetes.io/not-ready
      operator: Exists
      tolerationSeconds: 300
    - effect: NoExecute
      key: node.kubernetes.io/unreachable
      operator: Exists
      tolerationSeconds: 300
  topologySpreadConstraints:
    - labelSelector:
        matchLabels:
          app.kubernetes.io/name: nginx-pending
      maxSkew: 1
      topologyKey: kubernetes.io/hostname
      whenUnsatisfiable: ScheduleAnyway
    - labelSelector:
        matchLabels:
          app.kubernetes.io/name: nginx-pending
      maxSkew: 2
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: ScheduleAnyway
    - labelSelector:
        matchLabels:
          app.kubernetes.io/name: nginx-pending
      maxSkew: 1
      topologyKey: capacity-spread
      whenUnsatisfiable: DoNotSchedule

but we still get:

incompatible with provisioner "test-testpending-d2pv3ur8dm-on-demand", unsatisfiable topology constraint for topology spread, key=capacity-spread (counts = map[1:1 2:1 3:0 4:0 5:0], podDomains = capacity-spread Exists, nodeDomains = capacity-spread In [2]);
incompatible with provisioner "test-testpending-d2pv3ur8dm-spot-on-demand", unsatisfiable topology constraint for topology spread, key=capacity-spread (counts = map[1:1 2:1 3:0 4:0 5:0], podDomains = capacity-spread Exists, nodeDomains = capacity-spread In [1]);
You need to also specify the allowed capacity-spread values in your required node affinity, not just the nodegroup labels.
Thanks for your help, it is working now. One thing: it is not optimal to have to define the list of capacity-spread values in two places (provisioner and deployment), because I need to know this list when I create a deployment.

- key: capacity-spread
  operator: In
  values:
    - "1"
    - "2"
You only need to put it on the workload if you are trying to further restrict the workload's scheduling. The set of all provisioners contains the "universe" of possible labels for capacity-spread, and in this case you are trying to restrict to a subset of those, so it must be on the workload.
I think it's dangerous to require an additional rule to convey a restriction to Karpenter that kube-scheduler understands without the additional rule. I've provided more detailed thoughts in kubernetes-sigs/karpenter#430 (comment) |
Version
Karpenter Version: v0.23.0
Kubernetes Version: v1.22.16
Expected Behavior
Karpenter will identify the unschedulable pods and provision new nodes with the proper capacity-spread label.
Actual Behavior
In our setup we have 2 provisioners:
spot only
on-demand
Both carry an additional capacity-spread requirement, and all deployments enforce that requirement. At some point during a rollout, many pods get stuck in Pending, with no indication of why Karpenter does not scale up new nodes.
Steps to Reproduce the Problem
Change the deployment namespace1controllerpendingapp-main:
in the topologySpreadConstraints entry with topologyKey: capacity-spread,
change whenUnsatisfiable from ScheduleAnyway
to DoNotSchedule (as sketched below).
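For reference, a sketch of the constraint after that change, assuming the deployment's existing label selector (the selector value here is a placeholder):

topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: capacity-spread
    whenUnsatisfiable: DoNotSchedule   # previously ScheduleAnyway
    labelSelector:
      matchLabels:
        app.kubernetes.io/name: app-main   # placeholder selector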
In our case we weren't able to reproduce the issue in an isolated environment; anonymized logs from the production environment are attached. Worth mentioning: we have ~1k nodes and about 50 provisioners, and the workload is deployed via fluxcd/helm-controller rollouts, so we presume that either 1) the cluster is too large, 2) flux intervals/timeouts introduce too much noise during rollouts and subsequent rollbacks, or both.
Resource Specs and Logs
awsnodetemplate.txt
provisioner_on-demand.yaml.txt
provisioner_spot_on-demand.yaml.txt
namespace1application_deployment.yaml.txt
nodes_namespace1_list_with_labels.txt
pod_namespace1application-main-7b6c8f6b9f-9fv6r_describe.txt
karpenter_2.23.0.log.txt.tar.gz