
karpenter not provisioning new nodes with proper capacity-spread label #3397

Closed
tarfik opened this issue Feb 14, 2023 · 17 comments
Labels
question Issues that are support related questions

Comments

@tarfik

tarfik commented Feb 14, 2023

Version

Karpenter Version: v0.23.0

Kubernetes Version: v1.22.16

Expected Behavior

Karpenter will identify the unschedulable pods and provision new nodes with the proper capacity-spread label.

Actual Behavior

In our setup we have 2 provisioners:
  • spot only
  • on-demand
both with an additional capacity-spread requirement, and all deployments enforce that requirement. At some point during a rollout, many pods got stuck in Pending with no indication of why Karpenter was not scaling up new nodes. A condensed sketch of the provisioner layout follows (the full specs are attached below).
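A minimal sketch of that layout, with illustrative capacity-spread values (the real lists are in the attached provisioner files):

# spot(-first) provisioner
requirements:
- key: karpenter.sh/capacity-type
  operator: In
  values: [spot, on-demand]
- key: capacity-spread
  operator: In
  values: ['1', '2', '3']
---
# on-demand provisioner
requirements:
- key: karpenter.sh/capacity-type
  operator: In
  values: [on-demand]
- key: capacity-spread
  operator: In
  values: ['4', '5']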

Steps to Reproduce the Problem

In the deployment namespace1controllerpendingapp-main, change the topologySpreadConstraints entry with topologyKey: capacity-spread from whenUnsatisfiable: ScheduleAnyway to whenUnsatisfiable: DoNotSchedule (see the sketch below).
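For illustration, a minimal sketch of that change (the rest of the deployment spec is elided; maxSkew: 1 assumed, matching the constraints shown later in this thread):

# before: pods may still be scheduled when the spread cannot be satisfied
topologySpreadConstraints:
- maxSkew: 1
  topologyKey: capacity-spread
  whenUnsatisfiable: ScheduleAnyway

# after: pods stay Pending until a placement satisfying the spread exists
topologySpreadConstraints:
- maxSkew: 1
  topologyKey: capacity-spread
  whenUnsatisfiable: DoNotSchedule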

In our case we weren't able to reproduce the issue in an isolated environment; anonymized logs from the production environment are attached. Worth mentioning: we have ~1k nodes and about 50 provisioners, and the workload is deployed via fluxcd/helm-controller rollouts, so we presume either 1) the cluster is too large, or 2) Flux intervals/timeouts introduce too much noise during rollouts and subsequent rollbacks, or both.

Resource Specs and Logs

awsnodetemplate.txt
provisioner_on-demand.yaml.txt
provisioner_spot_on-demand.yaml.txt
namespace1application_deployment.yaml.txt
nodes_namespace1_list_with_labels.txt
pod_namespace1application-main-7b6c8f6b9f-9fv6r_describe.txt
karpenter_2.23.0.log.txt.tar.gz

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
@tarfik tarfik added the bug Something isn't working label Feb 14, 2023
@tarfik tarfik changed the title karpenter is not provisioning new nodes with proper capacity-spread label karpenter not provisioning new nodes with proper capacity-spread label Feb 14, 2023
@ellistarn
Contributor

ellistarn commented Feb 14, 2023

I'm seeing two interesting things in your pod describe:

  Normal   NotTriggerScaleUp  86s (x44 over 10m)    cluster-autoscaler  (combined from similar events): pod didn't trigger scale-up: 1 node(s) had taint {dedicated: tracking}, that the pod didn't tolerate, 2 node(s) had taint {dedicated: email}, that the pod didn't tolerate, 1 node(s) had taint {dedicated: vikasa}, that the pod didn't tolerate, 2 node(s) had taint {dedicated: copamarata}, that the pod didn't tolerate, 1 node(s) had taint {dedicated: logging}, that the pod didn't tolerate, 1 node(s) had taint {dedicated: dlembedmodule}, that the pod didn't tolerate, 1 node(s) had taint {dedicated: externalcustomer}, that the pod didn't tolerate, 1 node(s) had taint {dedicated: medpikachu}, that the pod didn't tolerate, 1 node(s) had taint {dedicated: kiskegyed}, that the pod didn't tolerate, 1 node(s) had taint {dedicated: dpp}, that the pod didn't tolerate, 1 node(s) had taint {dedicated: pi}, that the pod didn't tolerate, 1 node(s) had taint {dedicated: rivysaur}, that the pod didn't tolerate, 3 node(s) had taint {dedicated: ocdn}, that the pod didn't tolerate, 8 node(s) didn't match Pod's node affinity/selector, 2 node(s) had taint {dedicated: lakasaabc}, that the pod didn't tolerate, 1 node(s) had taint {dedicated: magic}, that the pod didn't tolerate, 1 node(s) had taint {dedicated: rrzzaonlineapi}, that the pod didn't tolerate, 1 node(s) had taint {dedicated: storyscore}, that the pod didn't tolerate, 1 node(s) had taint {dedicated: uuts4}, that the pod didn't tolerate, 2 node(s) had taint {dedicated: authamadeo}, that the pod didn't tolerate, 1 node(s) had taint {dedicated: storybot}, that the pod didn't tolerate, 2 node(s) had taint {dedicated: zaspomer}, that the pod didn't tolerate, 1 node(s) had taint {dedicated: cmp}, that the pod didn't tolerate, 1 node(s) had taint {dedicated: photoimporter}, that the pod didn't tolerate, 2 node(s) had taint {dedicated: aturbomachine}, that the pod didn't tolerate, 1 node(s) had taint {dedicated: storylinker}, that the pod didn't tolerate, 2 node(s) had taint {dedicated: weather}, that the pod didn't tolerate, 1 node(s) had taint {dedicated: rrzzae2embed}, that the pod didn't tolerate, 1 node(s) had taint {dedicated: zenaapi}, that the pod didn't tolerate, 1 node(s) had taint {dedicated: landingzone}, that the pod didn't tolerate, 1 node(s) had taint {dedicated: maratone}, that the pod didn't tolerate, 2 node(s) had taint {dedicated: skapiec}, that the pod didn't tolerate, 2 node(s) had taint {dedicated: publishing-automation}, that the pod didn't tolerate, 1 node(s) had taint {dedicated: rrzzacms}, that the pod didn't tolerate, 1 node(s) had taint {dedicated: rrzzae2}, that the pod didn't tolerate, 1 node(s) had taint {dedicated: widgets-platform}, that the pod didn't tolerate, 1 node(s) had taint {dedicated: phraselinker}, that the pod didn't tolerate, 2 node(s) had taint {dedicated: zamenix}, that the pod didn't tolerate, 1 node(s) had taint {dedicated: newsletters}, that the pod didn't tolerate, 1 node(s) had taint {dedicated: fia}, that the pod didn't tolerate, 1 node(s) had taint {dedicated: mpikachueasy}, that the pod didn't tolerate, 1 node(s) had taint {dedicated: sese}, that the pod didn't tolerate, 1 node(s) had taint {dedicated: blastoiseapi}, that the pod didn't tolerate, 1 node(s) had taint {dedicated: wspak}, that the pod didn't tolerate, 1 node(s) had taint {dedicated: ivysaur}, that the pod didn't tolerate, 1 node(s) had taint {dedicated: fastoma}, that the pod didn't tolerate, 1 node(s) had taint {dedicated: business}, that 
the pod didn't tolerate, 3 max node group size reached, 1 node(s) had taint {dedicated: turboaudio}, that the pod didn't tolerate, 1 node(s) had taint {dedicated: monitoturbo}, that the pod didn't tolerate, 1 node(s) had taint {dedicated: scanner}, that the pod didn't tolerate, 2 node(s) had taint {dedicated: turbo}, that the pod didn't tolerate, 2 node(s) had taint {dedicated: buita22}, that the pod didn't tolerate, 2 node(s) had taint {dedicated: tms}, that the pod didn't tolerat
  Warning  FailedScheduling   104s (x66 over 10m)   karpenter           (combined from similar events): Failed to schedule pod, incompatible with provisioner "mega-venusaur-pikachuappspots-d4qmxfokwv-spot-on-demand", did not tolerate dedicated=mega-venusaur:NoSchedule; did not tolerate designation=spotapp:NoSchedule; incompatible with provisioner "charmander-falm-d5fv924aqa-spot", did not tolerate dedicated=charmander:NoSchedule; incompatible with provisioner "caterpie-karp-prod-dp57s51yvb-on-demand", did not tolerate dedicated=caterpie:NoSchedule; incompatible with provisioner "mega-venusaur-o3sfailover-d27641w2o8-spot-on-demand", did not tolerate dedicated=mega-venusaur:NoSchedule; did not tolerate designation=failover:NoSchedule; incompatible with provisioner "venusaur-platform-cp-karpenter-d4dmj25wbr-on-demand", did not tolerate dedicated=venusaur-platform:NoSchedule; incompatible with provisioner "mega-venusaur-pikachuappmixed-dudvritgs3-on-demand", did not tolerate dedicated=mega-venusaur:NoSchedule; did not tolerate designation=mixedapp:NoSchedule; incompatible with provisioner "ivysaur-importer-dv3dswe6q3-spot-on-demand", did not tolerate dedicated=ivysaur:NoSchedule; did not tolerate kind=memory:NoSchedule; incompatible with provisioner "namespace1-karpenternamespace1teservice-dk5b2mvusd-on-demand", incompatible requirements, key tech.nodegroup/name, tech.nodegroup/name In [karpenternamespace1controller] not in tech.nodegroup/name In [karpenternamespace1teservice]; incompatible with provisioner "raichu-raichuhu-d9df6ohc1i-spot-on-demand", did not tolerate dedicated=raichu:NoSchedule; incompatible with provisioner "namespace1-karpenternamespace1apps-d2cseecz6v-spot-on-demand", incompatible requirements, key tech.nodegroup/name, tech.nodegroup/name In [karpenternamespace1controller] not in tech.nodegroup/name In [karpenternamespace1apps]; incompatible with provisioner "namespace1-karpenternamespace1teservice-dk5b2mvusd-spot-on-demand", incompatible requirements, key tech.nodegroup/name, tech.nodegroup/name In [karpenternamespace1controller] not in tech.nodegroup/name In [karpenternamespace1teservice]; incompatible with provisioner "ivysaur-t3-de5ukgfkx1-spot-on-demand", did not tolerate dedicated=ivysaur:NoSchedule; incompatible with provisioner "nidoking-nidokingeksprovisioner-de8sa6wbhv-on-demand", did not tolerate dedicated=nidoking:NoSchedule; incompatible with provisioner "monitoturbo-firstspot-dqdwoaa33n-spot-on-demand", did not tolerate dedicated=monitoturbo:NoSchedule; incompatible with provisioner "psyduck-prodkarp-d8ow34mx51-spot-on-demand", did not tolerate dedicated=psyduck:NoSchedule; incompatible with provisioner "psyduck-prodkarp-d8ow34mx51-on-demand", did not tolerate dedicated=psyduck:NoSchedule; incompatible with provisioner "nidoking-nidokingeksprovisioner-de8sa6wbhv-spot-on-demand", did not tolerate dedicated=nidoking:NoSchedule; incompatible with provisioner "blastoiseapi-blastoise-karpent-dyzbsfr433-spot-on-demand", did not tolerate dedicated=blastoiseapi:NoSchedule; incompatible with provisioner "mankey-karpenter-d24jon35w3-on-demand", did not tolerate dedicated=mankey:NoSchedule; all available instance types exceed provisioner limits; incompatible with provisioner "primeape-primeape-testprovisioner-d259cfaxkz-spot-on-demand", did not tolerate dedicated=primeape:NoSchedule; incompatible with provisioner "mankey-karpenter-d24jon35w3-spot-on-demand", did not tolerate dedicated=mankey:NoSchedule; incompatible with provisioner 
"namespace1-karpenterdeliveryapi-dun4yc53ae-spot-on-demand", incompatible requirements, key tech.nodegroup/name, tech.nodegroup/name In [karpenternamespace1controller] not in tech.nodegroup/name In [karpenterdeliveryapi]; incompatible with provisioner "namespace1-karpenternamespace1controller-d2hy7xum8s-on-demand", unsatisfiable topology constraint for topology spread, key=capacity-spread; incompatible with provisioner "bulbasaur-main-d4qvnksp4u-on-demand", did not tolerate dedicated=bulbasaur:NoSchedule; incompatible with provisioner "arcanine-arcaninekpt-d6dmoqf2xx-spot-on-demand", did not tolerate dedicated=arcanine:NoSchedule; incompatible with provisioner "poliwag-karpenter1-dz8wkpbsx5-on-demand", did not tolerate dedicated=poliwag:NoSchedule; incompatible with provisioner "abra-karp1-dxiwuv66sg-on-demand", did not tolerate dedicated=abra:NoSchedule; incompatible with provisioner "charmander-falm-dfi57w2tvh-on-demand", did not tolerate dedicated=charmander:NoSchedule; incompatible with provisioner "charmander-falm-dfi57w2tvh-spot-on-demand", did not tolerate dedicated=charmander:NoSchedule; incompatible with provisioner "namespace1-karpenternamespace1apps-d2cseecz6v-on-demand", incompatible requirements, key tech.nodegroup/name, tech.nodegroup/name In [karpenternamespace1controller] not in tech.nodegroup/name In [karpenternamespace1apps]; incompatible with provisioner "ivysaur-core-d215i1ht7c-spot-on-demand", did not tolerate dedicated=ivysaur:NoSchedule; incompatible with provisioner "kadabra-t3-d2j1vya47p-spot-on-demand", did not tolerate dedicated=kadabra:NoSchedule; incompatible with provisioner "raichu-raichuhu-d9df6ohc1i-on-demand", did not tolerate dedicated=raichu:NoSchedule; incompatible with provisioner "ivysaur-events-dgn1eq8srb-spot-on-demand", did not tolerate dedicated=ivysaur:NoSchedule; incompatible with provisioner "mega-venusaur-pikachuappmixed-dudvritgs3-spot-on-demand", did not tolerate dedicated=mega-venusaur:NoSchedule; did not tolerate designation=mixedapp:NoSchedule; incompatible with provisioner "bulbasaur-main-d4qvnksp4u-spot", did not tolerate dedicated=bulbasaur:NoSchedule; incompatible with provisioner "ivysaur-memory-dauaxibq5d-spot-on-demand", did not tolerate dedicated=ivysaur:NoSchedule; incompatible with provisioner "notifications-center-small-dmph234swc-spot-on-demand", did not tolerate dedicated=notifications-center:NoSchedule; incompatible with provisioner "namespace1-karpenterdeliveryapi-dun4yc53ae-on-demand", incompatible requirements, key tech.nodegroup/name, tech.nodegroup/name In [karpenternamespace1controller] not in tech.nodegroup/name In [karpenterdeliveryapi]; incompatible with provisioner "venusaur-platform-cp-karpenter-d4dmj25wbr-spot-on-demand", did not tolerate dedicated=venusaur-platform:NoSchedule; incompatible with provisioner "charmander-falm-d5fv924aqa-on-demand", did not tolerate dedicated=charmander:NoSchedule; incompatible with provisioner "charmeleon-main-dgp99u8v48-spot-on-demand", did not tolerate dedicated=charmeleon:NoSchedule; incompatible with provisioner "squirtle-spot-first-d2eb56b5ze-spot-on-demand", did not tolerate dedicated=squirtle:NoSchedule; incompatible with provisioner "charmeleon-main-dgp99u8v48-on-demand", did not tolerate dedicated=charmeleon:NoSchedule; incompatible with provisioner "charizard-main-d2ic4i1vth-spot-on-demand", did not tolerate dedicated=charizard:NoSchedule; incompatible with provisioner "wartortle-wartortleprovisioner-dy687sn65k-spot-on-demand", did not tolerate 
dedicated=wartortle:NoSchedule; incompatible with provisioner "namespace1-karpenternamespace1controller-d2hy7xum8s-spot-on-demand", unsatisfiable topology constraint for topology spread, key=capacity-spread; incompatible with provisioner "blastoiseapi-blastoise-karpent-dyzbsfr433-on-demand", did not tolerate dedicated=blastoiseapi:NoSchedule; incompatible with provisioner "caterpie-karp-prod-dp57s51yvb-spot-on-demand", did not tolerate dedicated=caterpie:NoSchedule; incompatible with provisioner "wartortle-wartortleprovisioner-dy687sn65k-on-demand", did not tolerate dedicated=wartortle:NoSchedule

Specifically:

 incompatible with provisioner "namespace1-karpenternamespace1controller-d2hy7xum8s-spot-on-demand", unsatisfiable topology constraint for topology spread, key=capacity-spread;

We improved logging on this very recently: https://github.com/aws/karpenter-core/pull/192/files

@ellistarn
Contributor

This should be available in v0.24: https://github.com/aws/karpenter-core/releases/tag/v0.24.0

@ellistarn
Contributor

  1. cluster is too large,

I don't think it's this one, given the events that we're emitting.

  2. flux intervals/timeouts introduce too much noise during rollout and then rollbacks, or both.

Not sure what you mean with this one.

@ellistarn
Contributor

I think it's likely that you're running into well-known rollout issues with topology spread (more common at large scale). This is addressed by this KEP: https://github.com/denkensk/enhancements/blob/master/keps/sig-scheduling/3243-respect-pod-topology-spread-after-rolling-upgrades/README.md

@tarfik
Author

tarfik commented Feb 16, 2023

Thank you for your reply.

To clarify those points:

  1. We are migrating to Karpenter, but cluster-autoscaler is still running in the cluster because we still have nodegroups.

  2. "flux intervals/timeouts introduce too much noise...": maybe changes in the cluster are too frequent and the cluster never reaches a stable state.

After upgrading Karpenter to v0.24.0, we were able to reproduce the problem with another deployment.
Deployment files and provisioners:
awsnodetemplate.txt
provisioner_on-demand.yaml.txt
provisioner_spot_on-demand.yaml.txt
namespace1teservicependingapp-main-6569859df4-bqz67_yaml.txt
namespace1teservicependingapp-main-6569859df4-bqz67_describe.txt
namespace1teservicependingapp_nodes.txt
karpenter_2.24.0.log.txt.tar.gz

Analyzing the pod:

    namespace1teservicependingapp-main-6569859df4-bqz67    0/2     Pending    0   7m42s

In Karpenter's logs we noticed:

    incompatible with provisioner "namespace1-karpenternamespace1teservice-dk5b2mvusd-spot-on-demand", unsatisfiable topology constraint for topology spread, key=capacity-spread (counts = map[1:10 2:20 3:0 4:0 5:0], podDomains = capacity-spread Exists, nodeDomains = capacity-spread In [1]);
    incompatible with provisioner "namespace1-karpenternamespace1teservice-dk5b2mvusd-on-demand", unsatisfiable topology constraint for topology spread, key=capacity-spread (counts = map[1:10 2:20 3:0 4:0 5:0], podDomains = capacity-spread Exists, nodeDomains = capacity-spread In [2]);

Looking at this part of the log (if I understand it correctly):

    "key=capacity-spread (counts = map[1:10 2:20 3:0 4:0 5:0]"

and nodeSelector is defined as:

  nodeSelector:
    tech.nodegroup/name: karpenternamespace1teservice
    tech.nodegroup/namespace: namespace1

it is interesting that the count of nodes filtered by the label tech.nodegroup/name=karpenternamespace1teservice:

    kubectl get node -l tech.nodegroup/name=karpenternamespace1teservice

was equal to 15, while the count of all Karpenter nodes in the namespace:

    kubectl get node -l tech.nodegroup/namespace=namespace1,karpenter.sh/initialized=true

(karpenter.sh/initialized=true filters to Karpenter-managed nodes only) was equal to 30.

Inside the namespace, nodes labeled with the key capacity-spread have values only from the list ["1", "2"]. In the whole cluster, other values of this label exist, in ["1", "2", "3", "4", "5"].

I don't know if I'm interpreting this correctly, but according to Karpenter's logs, it sees more nodes, and more possible values of the capacity-spread label, than it should if it filtered by the nodeSelector.

@tarfik
Author

tarfik commented Feb 27, 2023

We are now able to reproduce the issue in an isolated environment with two pairs of provisioners that have different value lists for the capacity-spread label. Configuration below.

2 provisioners labeled tech.nodegroup/name: testbacground, together covering capacity-spread values ['1', '2', '3', '4', '5']:

...
  labels:
    tech.nodegroup/namespace: test
    tech.nodegroup/name: testbacground
...
  requirements:
  - key: karpenter.sh/capacity-type
    operator: In
    values:
    - spot
    - on-demand
  - key: capacity-spread
    operator: In
    values:
    - '1'
    - '2'
    - '3'
...

---

...
  labels:
    tech.nodegroup/namespace: test
    tech.nodegroup/name: testbacground
...
  requirements:
  - key: karpenter.sh/capacity-type
    operator: In
    values:
    - on-demand
  - key: capacity-spread
    operator: In
    values:
    - '4'
    - '5'
...

Deployment for the first provisioner pair:

...
replicas: 5
...
      nodeSelector:
        tech.nodegroup/namespace: test
        tech.nodegroup/name: testbacground
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: "capacity-spread"
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app.kubernetes.io/name: nginx-bacground

It correctly created 5 pods/nodes with capacity-spread label values from ['1', '2', '3', '4', '5'].

Second provisioner pair, labeled rasp.nodegroup/name: testpending, with capacity-spread values ['1', '2']:

  labels:
    rasp.nodegroup/name: testpending
    rasp.nodegroup/namespace: test
...
  requirements:
  - key: karpenter.sh/capacity-type
    operator: In
    values:
    - spot
    - on-demand
  - key: capacity-spread
    operator: In
    values:
    - '1'
...

---

...
  labels:
    rasp.nodegroup/name: testpending
    rasp.nodegroup/namespace: test
...
  requirements:
  - key: karpenter.sh/capacity-type
    operator: In
    values:
    - on-demand
  - key: capacity-spread
    operator: In
    values:
    - '2'
...

Deployment for the second provisioner pair:

...
replicas: 5
...
      nodeSelector:
        rasp.nodegroup/namespace: test
        rasp.nodegroup/name: testpending
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: "capacity-spread"
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app.kubernetes.io/name: nginx-pending

Pod status:

pod/nginx-pending-696998c96f-7gbcx           1/1     Running   0                19m
pod/nginx-pending-696998c96f-n7rnr           1/1     Running   0                19m
pod/nginx-pending-696998c96f-4wbbr           0/1     Pending   0                19m
pod/nginx-pending-696998c96f-pd7w6           0/1     Pending   0                19m
pod/nginx-pending-696998c96f-z665x           0/1     Pending   0                19m

Karpenter logs:

  incompatible with provisioner "test-testpending-d2pv3ur8dm-on-demand", unsatisfiable topology constraint for topology spread, key=capacity-spread (counts = map[1:1 2:1 3:0 4:0 5:0], podDomains = capacity-spread Exists, nodeDomains = capacity-spread In [2]); 
  incompatible with provisioner "test-testpending-d2pv3ur8dm-spot-on-demand", unsatisfiable topology constraint for topology spread, key=capacity-spread (counts = map[1:1 2:1 3:0 4:0 5:0], podDomains = capacity-spread Exists, nodeDomains = capacity-spread In [1]); 

@tzneal
Contributor

tzneal commented Feb 27, 2023

I don't know if I'm interpreting this correctly, but according to Karpenter's logs, it sees more nodes, and more possible values of the capacity-spread label, than it should if it filtered by the nodeSelector.

Using a node selector to limit a pod to a particular provisioner won't limit the range of labels that Karpenter sees for that pod. That's why it's failing to schedule: it needs to spread evenly across the domains but is restricted to provisioners that can't do it.

We follow the standard K8s rules here (see https://kubernetes.io/docs/concepts/scheduling-eviction/topology-spread-constraints/#example-topologyspreadconstraints-with-nodeaffinity). To accomplish what you're trying to do, you'll need to add a required node affinity to the pod that limits it to the capacities you want it to schedule against. The topology spread will then be calculated only against those capacities.
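A sketch of that shape, using the testpending labels and capacity-spread values from the reproduction above (substitute whatever keys and values your provisioners actually define):

spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          # pin the pod to the intended provisioner pair
          - key: rasp.nodegroup/name
            operator: In
            values:
            - testpending
          # enumerate the capacity-spread domains that pair offers, so the
          # skew is calculated only over "1" and "2"
          - key: capacity-spread
            operator: In
            values:
            - '1'
            - '2'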

@kwarunek

kwarunek commented Feb 27, 2023

Fair, but as far as I understand, kube-scheduler implements a similar filtering mechanism. And yes, we do not set nodeAffinity, but we have configured taints/tolerations that limit the schedulable nodes (even before the scoring phase), so in the end... IMHO Karpenter should look at the label values of schedulable nodes only, as kube-scheduler does.

@tarfik
Author

tarfik commented Feb 27, 2023

Hi,
Thanks for your answer.
I now know how to fix our configuration.

Looking at this doc: https://kubernetes.io/docs/concepts/scheduling-eviction/topology-spread-constraints/#interaction-with-node-affinity-and-node-selectors
Kubernetes skips non-matching nodes from the skew calculations for spec.nodeSelector as well. Are there plans for Karpenter to honor nodeSelector?

@tzneal
Contributor

tzneal commented Feb 27, 2023

Karpenter does honor that for nodes that exist. This case is for nodes that don't exist yet but could be created, where custom topology domains are provided by provisioners. You can submit a feature request, but I think it would be a complicated change, as we would need to identify the subsets of provisioners that pods can schedule against and treat them as distinct topology groups.

@tzneal tzneal added question Issues that are support related questions and removed bug Something isn't working labels Feb 27, 2023
@kwarunek

Does the same apply to all keys/constraints, e.g., if we had provisioners with different sets of AZs?

@tzneal
Contributor

tzneal commented Feb 27, 2023

It's only required if you want to limit a pod to a particular subset of topology domains. The best way to do that is a required node affinity that specifically limits that topology key, since it works for both Karpenter and kube-scheduler, covers existing and future nodes, and clearly states the intent.
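For example, a sketch of the zone variant (zone names are placeholders):

spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          # both kube-scheduler and Karpenter now compute the spread only
          # over these two zones
          - key: topology.kubernetes.io/zone
            operator: In
            values:
            - eu-west-1a
            - eu-west-1b
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app.kubernetes.io/name: example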

@tarfik
Author

tarfik commented Feb 28, 2023

We've changed the pod configuration as below:

spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: tech.nodegroup/namespace
            operator: In
            values:
            - test
          - key: tech.nodegroup/name
            operator: In
            values:
            - testpending
  tolerations:
  - effect: NoSchedule
    key: dedicated
    value: test
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  topologySpreadConstraints:
  - labelSelector:
      matchLabels:
        app.kubernetes.io/name: nginx-pending
    maxSkew: 1
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: ScheduleAnyway
  - labelSelector:
      matchLabels:
        app.kubernetes.io/name: nginx-pending
    maxSkew: 2
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: ScheduleAnyway
  - labelSelector:
      matchLabels:
        app.kubernetes.io/name: nginx-pending
    maxSkew: 1
    topologyKey: capacity-spread
    whenUnsatisfiable: DoNotSchedule

but still:

  incompatible with provisioner "test-testpending-d2pv3ur8dm-on-demand", unsatisfiable topology constraint for topology spread, key=capacity-spread (counts = map[1:1 2:1 3:0 4:0 5:0], podDomains = capacity-spread Exists, nodeDomains = capacity-spread In [2]); 
  incompatible with provisioner "test-testpending-d2pv3ur8dm-spot-on-demand", unsatisfiable topology constraint for topology spread, key=capacity-spread (counts = map[1:1 2:1 3:0 4:0 5:0], podDomains = capacity-spread Exists, nodeDomains = capacity-spread In [1]); 

@tzneal

@tzneal
Contributor

tzneal commented Feb 28, 2023

You need to add a capacity-spread requirement to your nodeAffinity, limited to the values you want the workload to schedule against.

@tarfik
Author

tarfik commented Feb 28, 2023

Thanks for your help, it is working now.

One thing: it is not optimal to define the list of capacity-spread values in two places, provisioner and deployment, because I need to know the list when creating the deployment. The "Exists" operator does not work; the values must be enumerated (see the sketch after this snippet):

          - key: capacity-spread
            operator: In
            values:
            - "1"
            - "2"

@tzneal
Contributor

tzneal commented Feb 28, 2023

You only need to put it on the workload if you are trying to further restrict the workload's scheduling. The set of all provisioners contains the "universe" of possible labels for capacity-spread, and in this case you are trying to restrict to a subset of those, so it must be on the workload.

@JacobHenner

You need to add a capacity-spread requirement to your nodeAffinity, limited to the values you want the workload to schedule against.

I think it's dangerous to require an additional rule to convey a restriction to Karpenter that kube-scheduler understands without the additional rule. I've provided more detailed thoughts in kubernetes-sigs/karpenter#430 (comment)
