Mega Issue: Preferences #666

Open
ellistarn opened this issue Oct 24, 2023 · 26 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature. lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness.

Comments

@ellistarn
Contributor

ellistarn commented Oct 24, 2023

Description

What problem are you trying to solve?

Karpenter attempts to treat preferences as requirements, but if the pods cannot be scheduled, the preferences are relaxed and scheduling is tried again without them. If we were to respect preferences at provisioning time but ignore them at consolidation time, you would regularly see consolidation occurring right after provisioning. These two control loops need to agree on the behavior, but the Kubernetes API doesn't define it.

Preferences are challenging in k8s.

There are three actors in play:

  • Kube Scheduler
  • Provisioning
  • Consolidation

And 4 types of preferences:

  • Node Affinity
  • Pod Affinity
  • Pod Antiaffinity
  • Pod Topology

Further complicated by:

  1. Node affinity preferences have "weights" (as do pod affinity/anti-affinity preferences), but topology spread preferences don't.
  2. NodePools have weights, and allow preferences at the node level.

We need to invest more deeply in defining Karpenter's interactions with Kubernetes preferences. While we do, it would be great if users could share their preference-based requirements, so we can build a coherent model that attempts to capture them all.
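
For concreteness, a hedged pod-spec sketch (labels, weights, and selectors are illustrative only) showing the "soft" form of each of these:

affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:    # node affinity preference (weight 1-100)
      - weight: 50
        preference:
          matchExpressions:
            - key: karpenter.sh/capacity-type
              operator: In
              values: ["spot"]
  podAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:    # pod affinity preference
      - weight: 100
        podAffinityTerm:
          topologyKey: topology.kubernetes.io/zone
          labelSelector:
            matchLabels:
              app: cache
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:    # pod anti-affinity preference
      - weight: 100
        podAffinityTerm:
          topologyKey: kubernetes.io/hostname
          labelSelector:
            matchLabels:
              app: web
topologySpreadConstraints:                              # pod topology spread preference
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: ScheduleAnyway                   # the "soft" setting: the scheduler may violate it
    labelSelector:
      matchLabels:
        app: web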

How important is this feature to you?

@brito-rafa

Arguably there will be a need to address another preference, the PreferNoSchedule taint:

PreferNoSchedule
PreferNoSchedule is a "preference" or "soft" version of NoSchedule. The control plane will try to avoid placing a Pod that does not tolerate the taint on the node, but it is not guaranteed.
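
A minimal sketch of what that looks like (the taint key and value are made up for illustration):

# On a Node (or in a Karpenter NodePool's spec.template.spec.taints):
spec:
  taints:
    - key: example.com/dedicated       # made-up key
      value: batch
      effect: PreferNoSchedule         # "soft" version of NoSchedule
---
# On a Pod that is allowed to land there anyway:
spec:
  tolerations:
    - key: example.com/dedicated
      operator: Equal
      value: batch
      effect: PreferNoSchedule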

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 31, 2024
@Bryce-Soghigian
Member

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 31, 2024
@jonathan-innis
Member

One thought here to float around: allowing Karpenter to just completely ignore preferences, similar to CAS. In general, my biggest worry here is that I could see this regressing back to what your cluster would look like if you didn't specify preferences at all, since consolidation would just try to consolidate nodes based on the required constraints, which may cause it to pack the cluster tighter than you intend.

I do think that this idea gets slightly more interesting when thinking about how ignoring preferences might work with a cost threshold mechanism for consolidation (#795). The main reason I think this "ignoring preferences" mechanism works in CAS is that CAS has the concept of utilization thresholding. That effectively gates consolidation of a node that would be eligible if we only looked at the required constraints, but whose continued presence might spread things out enough for the kube-scheduler to take advantage of the preferred constraints.

@jonathan-innis
Member

Realistically, the use-case that we hear most often from users is: I want Karpenter to spread my pods as much as possible when it's launching them, but then consolidate nodes down if they become underutilized enough that it's no longer worth it to me, from a raw cost perspective, to have a single pod (or a small number of pods) running on that node. Essentially, I'm defining a cost threshold for ignoring preferences.

I'm not sure that we quite want to do that exactly (i.e. have a configurable parameter for pricing that says, "this is the amount of money that I have to gain for my preferences to be no longer worth it") but I do think that there might be something in there that we should think about.

@garvinp-stripe
Contributor

garvinp-stripe commented Mar 15, 2024

Realistically, the use-case that we hear most often from users is: I want Karpenter to spread my pods as much as possible when it's launching them, but then consolidate nodes down if they become underutilized enough that it's no longer worth it to me, from a raw cost perspective, to have a single pod (or a small number of pods) running on that node. Essentially, I'm defining a cost threshold for ignoring preferences.

K8s has methods of doing "spread at launch" that make more sense to delegate to K8s rather than to Karpenter. I want my pods to spread at launch; however, the reality is that tightening preferred to strict in Karpenter won't spread my pods across nodes. Because nodes don't come up at the same time, the scheduler will just place the majority of pods onto the same node or the first few nodes, and we end up churning nodes (#517). This is made worse for certain domains like hostname: if the batch of Pending pods are from the same deployment, you can get N nodes for N pods even though you won't effectively utilize the majority of those nodes. In addition, you end up wasting far more cost waiting for those nodes to consolidate, which is extremely slow since they are new nodes.

Given current K8s limitations, spread at launch can only be done with strict constraints or minDomains, because Karpenter doesn't schedule the pods, which IMO means Karpenter shouldn't do anything but respect preferences at pod launch.

I think where things make sense is during consolidation. If Karpenter allowed configuring "IgnorePreference" or "RespectPreference" on consolidation, users could choose whether Karpenter tightens preferences when calculating whether a node can be consolidated. This allows either maximizing cost savings (ignore preferences) or maximizing safety (tighten preferences).

Note that the skew that happens because Karpenter launches nodes without considering preferences is okay IMO. Karpenter's goal should be best effort when a constraint is preferred; users should understand that when they set a preference they will get skew, and they can pick up other solutions like minDomains or the descheduler to fix the skew.
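
For reference, the minDomains route mentioned above only exists for hard constraints; a hedged sketch of a spread constraint using it (labels are illustrative):

topologySpreadConstraints:
  - maxSkew: 1
    minDomains: 3                      # while fewer than 3 hostnames have matching pods, the global minimum is treated as 0, forcing spread onto new nodes
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: DoNotSchedule   # minDomains can only be combined with DoNotSchedule, not ScheduleAnyway
    labelSelector:
      matchLabels:
        app: web                       # illustrative label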

@cebernardi

We are experiencing similar pain points with soft constraints, but in our use case, strictly honouring soft constraints is not only a cost problem; it's also creating stability issues.

When large deployments create hundreds of pods that express a soft spread constraint over hostname, Karpenter provisions hundreds of nodes. On clusters managed by a central platform team (like in our case), there are often several daemonsets running on every node, which stretches the pod limits of the cluster (for every "user pod" running alone on a node, there are multiple other "platform pods" running).

Maybe the k8s scheduler doesn't schedule exactly 1 pod per node (as mentioned before, it depends on when the nodes become available), but whatever topology the k8s scheduler determines is never consolidated by Karpenter.

We basically can't instruct the k8s scheduler to spread pods across the existing capacity without running into this issue.

The issue is compounded for us by the fact that we run both CAS and Karpenter, and the two components behave differently in this respect.

Also, our users operate on the basis that soft and hard constraints are honoured differently (as they source their information from Kubernetes documentation), and it's confusing for them (and for us!) to watch the nodes being provisioned with that cardinality.

I think where things make sense is during consolidation. If Karpenter allowed configuring "IgnorePreference" or "RespectPreference" on consolidation, users could choose whether Karpenter tightens preferences when calculating whether a node can be consolidated. This allows either maximizing cost savings (ignore preferences) or maximizing safety (tighten preferences).

For us it would make sense to be able to configure "IgnorePreference" also in the provisioning phase.

@ellistarn
Contributor Author

Warning: I haven't fully thought this through, but it was buzzing around in my head.

Does anyone have a use case that would not be met if Karpenter did:

  1. Treat pod topology preferences as required
  2. Treat pod affinity preferences as required
  3. Treat node affinity preferences as best-effort (current behavior)

@sftim

sftim commented Mar 24, 2024

I can imagine a use case; it's obviously fictional but here goes.

I would like to run a workload that uses a particular CPU instruction, but it can fall back to generic execution if needed (preferredDuringSchedulingIgnoredDuringExecution). As an aside, I also require the node OS to be Linux (requiredDuringSchedulingIgnoredDuringExecution). The performance gain from the instruction set preference is substantial, and I would prefer to launch a whole new node to benefit from it, even if nothing else needs to schedule there. (I might use a custom scheduler to help with this.)

However, if there's no capacity to provide a node with that particular CPU instruction, I still want the Pod to run somewhere.

We could perhaps add an annotation that gives Karpenter a hint about how to satisfy the node affinity preference. If we do, let's try to make it work with a future preferredDuringSchedulingPreferredDuringExecution node affinity term.
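
A hedged sketch of the node affinity in the use case above (the CPU-feature label key is hypothetical; kubernetes.io/os is a real well-known label):

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:     # hard: the node OS must be Linux
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/os
              operator: In
              values: ["linux"]
    preferredDuringSchedulingIgnoredDuringExecution:    # soft: prefer the CPU feature, fall back to generic nodes if unavailable
      - weight: 100
        preference:
          matchExpressions:
            - key: example.com/cpu-feature              # hypothetical label advertising the instruction set
              operator: In
              values: ["avx512"]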

@ellistarn
Contributor Author

This is case 3, node preferential fallback. We have use cases like arm -> amd, or newer generation vs older generation CPU.

The cases that are really under question are things like preferential pod spread across hostname, or zone.

@sftim

sftim commented Mar 25, 2024

Preferential pod spread across hostname? My use case would be “running Karpenter”: I'd prefer to run the controller on Karpenter-managed nodes if possible (yes, I know) and would only fall back to some other form of node if there's a problem (quite plausible given the scenario). I don't expect Karpenter to manage the fallback case.

However, I would also like Karpenter to prefer running on a node that doesn't already have Karpenter running on it, for resilience and for reduced recovery time. If Karpenter is able to make a new node to run the extra copy of Karpenter, I'd like that to happen.

@sftim

sftim commented Mar 25, 2024

Anyone reading #666 (comment) and thinking “well, that's not how I run Karpenter”: sure. It's not a recommendation, it's a use case.

@cebernardi

cebernardi commented Mar 25, 2024

The cases that are really under question are things like preferential pod spread across hostname, or zone.

We're definitely in this bucket.

All of our use cases revolve around spreading pods across the topology key "hostname", and the difference between

  • whenUnsatisfiable: ScheduleAnyway
  • whenUnsatisfiable: DoNotSchedule

(taking topologySpreadConstraints as an example, but the same applies to podAntiAffinity)

If we think of a manifest with the following spread constraints:

spec:
  replicas: 100
  selector:
    matchLabels:
      app: sleep
  template:
    metadata:
      labels:
        app: sleep
    spec:
      nodeSelector:
        karpenter.sh/nodepool: cbernardi
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels:
              app: sleep
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: sleep

We're expressing something along the lines of "spread this as much as possible across the existing capacity (optional), AND absolutely distribute evenly across availability zones, both existing and new capacity (mandatory)".

This interpretation is taken from Kubernetes documentation: https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/ (scheduling and not autoscaling, but it's the reference documentation for this API).

  • If the workload selects a capped instance type (which is surely the case for AWS; I guess other providers have capped node types as well?), we could unnecessarily hit capacity limits.
  • If the replicas attribute were set to > 5000 (or if we had other large workloads configured with similar spread constraints), we could hit Kubernetes node limits. This is a function of the batch size and how quickly the different nodes join the cluster and become available to the k8s scheduler, but we have observed that in this situation Karpenter spins up a node per replica, meaning each node is sized to fit only that pod plus the daemonsets; depending on the workload resource requests, it is possible that nothing else can schedule on that tiny node, bringing into the picture > 5000 very small, unconsolidatable nodes.
  • Whatever topology is decided by the k8s scheduler, Karpenter is not able to consolidate it

Ultimately it boils down to the fact that preferred and required pod spread constraints are interpreted by 3 different actors:

  • k8s scheduler
  • CAS
  • Karpenter

and it seems that Karpenter has diverged from the other two controllers by treating soft constraints as mandatory, in particular because, per the documentation and the scheduler code, soft constraints only apply to existing capacity.

@garvinp-stripe
Contributor

Treat pod topology preferences as required
Treat pod affinity preferences as required

I think these are still problematic, because the scheduler doesn't know better and they continue to lead to node churn, just to a lesser degree than spread on hostname. And I also think having inconsistent behavior is confusing to users. I don't think Karpenter should be trying to do things that the scheduler won't be able to take advantage of: the extra nodes from trying to bring up the right number of AZs won't necessarily make all the pods spread across those nodes, because the scheduler doesn't try to do that. Unless Karpenter prebinds, or is able to hint to the scheduler that a pod should land on a specific node, I am not sure the tightening does anything.

@ellistarn
Contributor Author

Unless Karpenter prebinds, or is able to hint to the scheduler that a pod should land on a specific node, I am not sure the tightening does anything

I'm not sure I agree. IIUC, my proposal above supports @cebernardi's use case, without any loss for existing use cases (which are undefined).

@cebernardi

@ellistarn sorry, what is your proposal? This one?

@Bryce-Soghigian
Member

/lifecycle frozen

@k8s-ci-robot k8s-ci-robot added the lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. label Apr 11, 2024
@jonathan-innis
Member

and it seems that Karpenter has diverged from the other two controllers by treating soft constraints as mandatory, in particular because, per the documentation and the scheduler code, soft constraints only apply to existing capacity

I think it's less about Karpenter diverging from the upstream concept of preferences and more about Karpenter having a different concept of "what's available". If we take the example of how preferences are treated from the kube-scheduler scenario, it can only see the existing nodes on the cluster, so satisfying preferences just looks like spreading across the existing nodes as much as possible and then falling back when this is no longer possible. From Karpenter's perspective, it can see all existing nodes on the cluster plus any additional "theoretical capacity" that is available to it in the cloud. This means that it can continue satisfying preferences so long as there's available theoretical capacity.

From my perspective, I've heard "I want my autoscaler to spread me a bit, but not too much, across hostname" quite a bit, but I honestly don't think there's a good representation of this semantic in the kube-scheduler today. Any way that we are achieving this in CAS is mostly luck of the draw.

As an example: let's assume that we want to schedule a set of 500 pod replicas, and that the capacity these pods are going to schedule to is completely isolated. Let's also assume that we only have a single instance type option for this capacity and can therefore only schedule 100 pods per node, given the resource requests that we have. When we ignore preferences, it's effectively the same as not having those preferences specified at all, even when the kube-scheduler gets involved: you are always going to get 100 pods per node and 5 nodes total, which may not have been our intention when we set topology spread preferences on hostname. I'll admit that the reverse behavior (treating preferences as required) is also not desirable here, but it's slightly better, at least for the initial scale-up. Since you'll get spread based on what the scheduler sees as the nodes start to join, and nodes will most likely come in at different times, you'll get spread in a way that should give you fewer pods on each node.

In reality, what we were trying to do was get some minimum amount of spread for our pods, say at most 5 pods of the application on each node, but there's currently no way to represent this in the scheduling semantics. This seems like a conversation that we should have across CAS, kube-scheduler, and Karpenter to really get a strong solve for this particular "spread me a bit" use case.

@cebernardi

cebernardi commented Apr 11, 2024

What are the downsides of making this behaviour configurable?

If none of the options on the table is fully satisfying, maybe users should choose the scenario that best suits their needs, until a more holistic approach is decided upstream.

@sftim

sftim commented Apr 11, 2024

upstream

This is the Kubernetes project (Karpenter was previously AWS open source).

@jonathan-innis
Member

What are the downsides of making this behaviour configurable

I think the worst-case scenario here is that we make this behavior user-configurable and the net result of it is that it doesn't do what the user intends e.g. it goes the other direction and acts as if the preferences don't exist at all. I'm not against opening up configuration surface for us to do something like this, but I'd like to thoroughly explore the options and think through them in a design before we decide to open up surface like this.

@cebernardi

I'm re-reading your message @jonathan-innis, and I think I'm missing some of the context about the original decision.

Would you mind expanding a bit on the problem that Karpenter is solving by factoring in soft constraints during the provisioning phase?
Thank you in advance

@duxing

duxing commented Jun 7, 2024

Should the "preferred" affinity / pod<->node placement configuration be consumed by more than 1 entity (kube-scheduler)?

"required" affinity is a strong constraint without room for inconsistency but this doesn't apply to "preferred" affinity.
multiple entities may not interpret the "preferred" affinity consistently and they have different contexts for interpretation.

Unlike Karpenter, kube-scheduler is not aware of NodeClaims (including pending nodes) and doesn't (and possibly won't) have the ability to create new nodes/NodeClaims.

I've thought about a few ideas, but they all come with possible side effects in certain use cases, and they can all be summarized into one of the two suggestions below:

  • ignore "preferred" and only honor "required"
  • honor "preferred" as "required"

@garvinp-stripe
Contributor

Would you mind to expand a bit about the problem that Karpenter is solving factoring in soft constraints in the provisioning phase?

At a high level (a highly simplified explanation), I think Karpenter wants to ensure that when you need new nodes for a set of pods, it gives you a set of nodes that satisfies all constraints, soft or hard.

For example, you have 3 pending pods from the same deployment and a soft constraint on zones with a max skew of 1. Karpenter will spin up 3 nodes to ensure all 3 pods get a chance to land on different nodes. However, the problem lies in what kube-scheduler does: because the nodes are unlikely to show up at the same time, the first node that appears will likely get all 3 pods, since the scheduler only sees a preferred constraint on zone.

This works slightly better on the consolidation side: say you have 3 pods on 3 nodes with a soft constraint. Karpenter treats it as a hard constraint, so it won't consolidate your pods onto 1 node; it leaves the 3 nodes around so you can maintain the spread.

IMO the main problem is that pod topology constraints don't have a well-defined behavior when it comes to node scaling. They have always been used to describe pods in steady state, where no new nodes are needed. However, because this is the only mechanism we have, we are now using it for node scaling.

When dealing with pending pods and new nodes, the behavior users want can deviate from what they want the pods to do in steady state. For example: I want my pods to spread across nodes, but when scaling up I want you to bring up just enough nodes to satisfy the resource requests; I will reshuffle afterwards, because that works in my cluster.
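
One way to do that "reshuffle afterwards" today is the descheduler; a hedged policy sketch that evicts pods violating topology spread constraints, including soft ones (the exact args fields vary by descheduler version, so treat this as an assumption):

apiVersion: descheduler/v1alpha2
kind: DeschedulerPolicy
profiles:
  - name: rebalance
    pluginConfig:
      - name: RemovePodsViolatingTopologySpreadConstraint
        args:
          constraints:            # assumption: recent descheduler releases accept soft (ScheduleAnyway) constraints here
            - DoNotSchedule
            - ScheduleAnyway
    plugins:
      balance:
        enabled:
          - RemovePodsViolatingTopologySpreadConstraint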

This is also the case for consolidation. It's possible I want consolidation to tighten preferences because I don't want to break up a good spread, but I don't want the hard requirement that all pods must spread across zones.

@garvinp-stripe
Contributor

@jonathan-innis Not sure where we are with chatting with CAS. Is there any opinion on creating a new CRD similar to KubeSchedulerConfiguration for Node scaling?

For example, we could have a new CRD called KubeProvisioningConfiguration where we define profiles for NodePools (or NodeGroups for CAS). Within a profile we could leverage the same constraint grammar to specify how we want provisioning and deprovisioning to happen.

For example, a hard constraint on consolidation only:

apiVersion: provisioning.k8s.io/v1beta1
kind: KubeProvisioningConfiguration
profiles:
  - provisionerName: NodePool1
    config:
      deprovisioningConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone

Or a hard constraint for both:

apiVersion: provisioning.k8s.io/v1beta1
kind: KubeProvisioningConfiguration
profiles:
  - provisionerName: NodePool1
    config:
      provisioningConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
      deprovisioningConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone

@sftim

sftim commented Sep 16, 2024

IMO the main problem is that pod topology constraints don't have a well-defined behavior when it comes to node scaling. They have always been used to describe pods in steady state, where no new nodes are needed. However, because this is the only mechanism we have, we are now using it for node scaling.

If NodeClaim, or something similar, were a formal Kubernetes API, then the scheduler could account for the NodeClaims representing pending nodes, and perhaps even plan based on a controller-provided ETA for the new Node becoming ready. Seeing a pending NodeClaim, the scheduler might opt to leave a Pod pending an extra 10s to see if a new node turns up per claim (and, if not, file an Event on the Pod to explain that it has given up waiting).

Alternatively, maybe you need a custom scheduler to get the benefit. That's more work but still feasible.

“Implementing the alternative scheduler is left as an exercise for the reader.”
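
For context, Karpenter's existing (non-core) NodeClaim already carries much of that information; a minimal hedged sketch, with illustrative values and a nodeClassRef that assumes the AWS provider:

apiVersion: karpenter.sh/v1
kind: NodeClaim
metadata:
  name: example-nodeclaim            # illustrative name
spec:
  nodeClassRef:
    group: karpenter.k8s.aws         # assumption: the AWS provider's node class API group
    kind: EC2NodeClass
    name: default
  requirements:
    - key: topology.kubernetes.io/zone
      operator: In
      values: ["us-west-2a"]
    - key: karpenter.sh/capacity-type
      operator: In
      values: ["on-demand"]
  resources:
    requests:
      cpu: "4"
      memory: 8Gi

A scheduler that understood an object like this could, in principle, account for the pending capacity instead of piling pods onto the first node that becomes Ready.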
