Mega Issue: Preferences #666
Comments
Arguably, there will be a need to address another kind of preference as well: taints.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
One thought here to float around: allowing Karpenter to just completely ignore preferences, similar to CAS. In general, my biggest worry here is that I could see this regressing back to what your cluster would look like if you didn't specify preferences at all, since consolidation would just try to consolidate nodes based on the required constraints, which may pack the cluster tighter than you intend. I do think that this idea gets slightly more interesting when it comes to thinking about how ignoring preferences might work with a cost threshold mechanism for consolidation (#795). The main reason I think this "ignoring preferences" mechanism works in CAS is that they have the concept of utilization thresholding. That threshold effectively gates consolidation of a node that would look eligible if we only considered the required constraints, but that may be spreading pods enough for the kube-scheduler to take advantage of the preferred constraints.
Realistically, the use-case we hear most often from users is: I want Karpenter to spread my pods as much as possible when it's launching, but then consolidate nodes down if they become underutilized enough that it's no longer worth it, from a raw cost perspective, to have a single pod (or a small number of pods) running on a node. Essentially, I'm defining a cost threshold for ignoring preferences. I'm not sure we want to do exactly that (i.e. have a configurable pricing parameter that says, "this is the amount of money I have to gain for my preferences to no longer be worth it"), but I do think there might be something in there that we should think about.
K8s has methods of doing "spread at launch" that make more sense to delegate to K8s rather than to Karpenter. I want my pods to spread at launch, but the reality is that tightening prefer to strict at the Karpenter level won't spread my pods across nodes. Because nodes don't come up at the same time, the scheduler will just place the majority of pods onto the same node or the first few nodes, and we end up churning nodes (#517). This is made worse for certain domains like hostname, where if the batch of Pending pods is from the same deployment, you can get N nodes for N pods even though you won't effectively utilize a majority of those nodes. In addition, you end up wasting far more cost waiting for nodes to consolidate, which is extremely slow since they are new nodes. Given current K8s limitations, spread at launch can only be done with strict constraints or minDomains, because Karpenter doesn't schedule the pods, which IMO means Karpenter shouldn't do anything but respect preferences at pod launch.

I think where things make sense is during consolidation. If Karpenter allowed configuring "IgnorePreference" or "RespectPreference" for consolidation (sketched below), users could choose whether Karpenter tightens preferences when calculating if a node can be consolidated. This allows either maximizing cost savings (ignore preferences) or maximizing safety (tighten preferences). Note that skew that happens because Karpenter launches nodes without considering preferences is okay IMO. The goal of Karpenter should be best effort when something is Preferred; users should understand that when setting prefer you will get skew, and they can pick up other solutions like minDomains or the descheduler to fix the skew.
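A minimal sketch of what such a knob could look like on the NodePool, assuming a hypothetical `preferencePolicy` field (nothing like it exists in Karpenter's API today; only `disruption.consolidationPolicy` is a real field):

```yaml
# Hypothetical sketch: "preferencePolicy" is NOT a real Karpenter field.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    # Hypothetical knob: whether consolidation simulations treat
    # preferred constraints as required ("Respect", maximizes spread
    # safety) or drop them entirely ("Ignore", maximizes cost savings).
    preferencePolicy: Ignore  # or "Respect"
```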
We are experiencing similar pain points with soft constraints, but in our use case, strictly honouring soft constraints is not only a cost problem: it's creating stability issues. When large deployments create hundreds of pods that express a soft spread constraint over hostname, Karpenter provisions close to one node per pod. Maybe the k8s scheduler doesn't schedule exactly 1 pod per node (as already mentioned, it depends on when the nodes become available), but whatever topology is determined by the k8s scheduler, it's never consolidated by Karpenter. We basically can't instruct the k8s scheduler to spread pods on the existing capacity without running into this issue. The issue for us is compounded by the fact that we run both CAS and Karpenter, and the 2 components behave differently in this respect. Also, our users operate on the basis that soft and hard constraints are honoured differently (as they source their information from the Kubernetes documentation), and it's confusing for them (and for us!) to watch the nodes being provisioned with that cardinality.
For us it would make sense to be able to configure "IgnorePreference" also in the provisioning phase.
Warning: I haven't fully thought this through, but it was buzzing around in my head. Does anyone have a use case that would not be met if Karpenter did:
I can imagine a use case; it's obviously fictional, but here goes. I would like to run a workload that benefits from a particular CPU instruction, but it can fall back to generic execution if needed. However, if there's no capacity to provide a node with that particular CPU instruction, I still want the Pod to run somewhere. We could perhaps add an annotation that gives Karpenter a hint about how to satisfy the node affinity preference. If we do, let's try to make it work with future APIs too.
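Today, the "prefer this CPU feature but run anywhere" half can be expressed with a preferred node affinity. A minimal sketch; the label here is an assumption in the style Node Feature Discovery publishes, not something the commenter specified:

```yaml
# Prefer nodes advertising AVX-512 support; schedule anywhere otherwise.
apiVersion: v1
kind: Pod
metadata:
  name: avx512-preferring-workload
spec:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          preference:
            matchExpressions:
              - key: feature.node.kubernetes.io/cpu-cpuid.AVX512F
                operator: In
                values: ["true"]
  containers:
    - name: app
      image: registry.k8s.io/pause:3.9
```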
This is case 3, node preferential fallback. We have use cases like arm -> amd, or newer generation vs. older generation CPU. The cases that are really in question are things like preferential pod spread across hostname or zone.
Preferential pod spread across hostname? My use case would be “running Karpenter”: I'd prefer to run the controller on Karpenter-managed nodes if possible (yes, I know) and would only fall back to some other form of node if there's a problem (quite plausible given the scenario). I don't expect Karpenter to manage the fallback case. However, I would also like Karpenter to prefer running on a node that doesn't already have Karpenter running on it, for resilience and for reduced recovery time. If Karpenter is able to make a new node to run the extra copy of Karpenter, I'd like that to happen.
Anyone reading #666 (comment) and thinking “well, that's not how I run Karpenter”: sure. It's not a recommendation, it's a use case.
We're definitely in this bucket. All of our use cases revolve around spreading pods across the topology key hostname, and around the difference between ScheduleAnyway and DoNotSchedule (taking topologySpreadConstraints as the example). If we consider a manifest with the following spread constraints:
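The manifest snippet itself didn't survive extraction; a representative reconstruction of the two constraints being described (soft spread on hostname, mandatory spread on zone; the `app: my-app` selector is illustrative):

```yaml
topologySpreadConstraints:
  # Soft: spread across existing nodes as much as possible.
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: ScheduleAnyway
    labelSelector:
      matchLabels:
        app: my-app
  # Hard: even distribution across availability zones is mandatory.
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: my-app
```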
We're expressing something along the lines of "spread this as much as possible across the existing capacity (optional), AND absolutely distribute evenly across availability zones, both existing and new capacity (mandatory)". This interpretation is taken from the Kubernetes documentation: https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/ (scheduling, not autoscaling, but it's the reference documentation for this API).
Ultimately it boils down to the fact that preferred and required pod spread constraints are interpreted by 3 different actors:
- the kube-scheduler
- the descheduler
- the autoscaler (CAS, Karpenter)
and it seems that Karpenter has diverged from the other 2 controllers by treating soft constraints as mandatory, in particular because, per the documentation and the scheduler code, soft constraints only apply to existing capacity.
I think these are still problematic, because the scheduler doesn't know better and this continues to lead to node churn, just to a lesser degree than spread on hostname. And I also think having inconsistent behavior is confusing to users. I don't think Karpenter should be trying to do things that the scheduler won't be able to take advantage of: the extra nodes from trying to bring up the right number of AZs won't necessarily make all the pods spread across those nodes, because the scheduler doesn't try to do that. Unless Karpenter prebinds, or is able to hint to the scheduler that a pod should land on a specific node, I am not sure the tightening does anything.
I'm not sure I agree. IIUC, my proposal above supports @cebernardi's use case, without any loss for existing use cases (which are undefined).
@ellistarn sorry, what is your proposal? This one? |
/lifecycle frozen |
I think it's less about Karpenter diverging from the upstream concept of preferences and more about Karpenter having a different concept of "what's available". If we take the example of how preferences are treated by the kube-scheduler, it can only see the existing nodes on the cluster, so satisfying preferences just looks like spreading across the existing nodes as much as possible and then falling back when this is no longer possible. From Karpenter's perspective, it can see all existing nodes on the cluster plus any additional "theoretical capacity" that is available to it in the cloud. This means that it can continue satisfying preferences so long as there's available theoretical capacity.

From my perspective, I've heard "I want my autoscaler to spread me a bit, but not too much, across hostname" quite a bit, but I honestly don't think there's a good representation of this semantic in the kube-scheduler today. Any way that we are achieving this in CAS is mostly by luck of the draw.

As an example: let's assume that we want to schedule a set of 500 pod replicas, that the capacity these pods are going to schedule to is completely isolated, and that we only have a single instance type option for this capacity, so we can only fit 100 pods per node given the resource requests that we have. When we ignore preferences, it's effectively the same as not having those preferences specified at all, even when the kube-scheduler gets involved: we will always get 100 pods per node and 5 nodes total, which may not have been our intention when we set topology spread preferences on hostname. I'll admit that the reverse behavior (treating preferences as required) is also not desirable here, but it's slightly better, at least for the initial scale-up: since spread is based on what the scheduler sees as nodes start to join, and nodes will most likely come up at different times, you'll get spread that puts fewer pods on each node.

In reality, what we were trying to do was get some minimum amount of spread for our pods, say at most 5 pods of the application on each node, but there's currently no way to represent this in the scheduling semantic. This seems like a conversation that we should have across CAS, kube-scheduler, and Karpenter to really get a strong solve for this particular "spread me a bit" use-case.
What are the downsides of making this behaviour configurable? If none of the options on the table are fully satisfying, maybe users should choose the scenario that best suits their needs, until a more holistic approach is decided upstream.
This is the Kubernetes project (Karpenter was previously AWS open source). |
I think the worst-case scenario here is that we make this behavior user-configurable and the net result is that it doesn't do what the user intends, e.g. it goes the other direction and acts as if the preferences don't exist at all. I'm not against opening up configuration surface to do something like this, but I'd like to thoroughly explore the options and think through them in a design before we decide to open up surface like this.
I'm re-reading your message @jonathan-innis, and I think I'm missing some of the context behind the original decision. Would you mind expanding a bit on the problem Karpenter is solving by factoring soft constraints into the provisioning phase?
Should the "preferred" affinity / pod<->node placement configuration be consumed by more than 1 entity (the kube-scheduler and Karpenter)? "required" affinity is a strong constraint with no room for inconsistency, but that doesn't apply to "preferred" affinity, which each component is free to interpret differently. I've thought about a few ideas, but all come with possible side effects in certain use cases, and they can all be summarized into one of the 2 suggestions below:
At a high level, the highly simplified explanation is: Karpenter wants to ensure that when you need new nodes for a set of pods, it gives you a set of nodes that satisfies every constraint, soft or hard. For example, say you have 3 pending pods from the same deployment and a soft constraint on zones with a max skew of 1 (sketched below). Karpenter will spin up 3 nodes to ensure all 3 pods get a chance to land on different nodes. However, the problem lies in what the kube-scheduler does. Because nodes are unlikely to show up at the same time, the first node that appears will likely get all 3 pods, since the scheduler only sees a preferred constraint on zone. This works slightly better on the consolidation side: say you have 3 pods on 3 nodes with a soft constraint. Karpenter treats it as a hard constraint, so it won't consolidate your pods onto 1 node; it leaves the 3 nodes around so you can maintain the spread.

IMO the main problem is that pod topology constraints don't have a well-defined behavior when it comes to node scaling. They have always been used to describe pods in steady state, when no new nodes are needed. However, because this is the only mechanism we have, we are now using it for node scaling. When dealing with pending pods and new nodes, the behavior users want can deviate from what they want the pods to do in steady state. For example: I want my pods to spread across nodes, but when scaling up I want you to scale up just enough nodes to satisfy the resource requests; I will reshuffle afterwards because that works in my cluster. The same goes for consolidation: it's possible I want consolidation to tighten prefer because I don't want to break up a good spread, but I don't want the hard requirement that all pods must spread across zones.
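A minimal sketch of the soft zonal constraint in that example (selector illustrative):

```yaml
# The soft zonal spread from the example: 3 pending pods carrying this
# constraint lead Karpenter to provision 3 nodes, yet kube-scheduler
# may still pack all 3 pods onto whichever node becomes Ready first.
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: ScheduleAnyway
    labelSelector:
      matchLabels:
        app: my-app
```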
@jonathan-innis Not sure where we are with chatting with CAS. Is there any opinion on creating a new CRD that lets users configure how preferences are treated in each phase? For example, for hard constraints only during consolidation:
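The CRD name and examples in the original comment were lost; here is a hypothetical sketch where the `PreferencePolicy` kind and every field are invented for illustration:

```yaml
# Hypothetical CRD: neither this kind nor these fields exist today.
apiVersion: karpenter.sh/v1alpha1
kind: PreferencePolicy
metadata:
  name: default
spec:
  provisioning: Ignore    # relax preferences when launching nodes
  consolidation: Require  # treat preferences as hard constraints
                          # when simulating consolidation
```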
Or hard constraints for both provisioning and consolidation:
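Continuing the same hypothetical sketch:

```yaml
# Hypothetical: preferences as hard constraints in both phases.
apiVersion: karpenter.sh/v1alpha1
kind: PreferencePolicy
metadata:
  name: default
spec:
  provisioning: Require
  consolidation: Require
```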
If NodeClaim, or something similar, were a formal Kubernetes API, then the scheduler could account for the NodeClaims representing pending nodes, and perhaps even plan based on a controller-provided ETA for the new Node becoming ready. Seeing a pending NodeClaim, the scheduler might opt to leave a Pod pending an extra 10s to see if a new node turns up per claim (and, if not, file an Event on the Pod to explain that it has given up waiting). Alternatively, maybe you need a custom scheduler to get the benefit. That's more work but still feasible. “Implementing the alternative scheduler is left as an exercise for the reader.”
Description
What problem are you trying to solve?
Karpenter attempts to treat preferences as requirements, but if the pods cannot be scheduled that way, the preferences are relaxed and scheduling is tried again without them. If we were to respect preferences at provisioning time but ignore them at consolidation time, you would regularly see consolidation undoing provisioning decisions shortly after they were made. These two control loops need to agree on the behavior, but the Kubernetes API doesn't define what that behavior should be.
Preferences are challenging in k8s.
There are three actors in play:
- the kube-scheduler, which applies preferences when binding pods to existing nodes
- the descheduler, which can evict pods that violate preferences at runtime
- the autoscaler (Karpenter, CAS), which decides what capacity to create or remove
And 4 types of preferences (all four are shown together in the sketch after the next list):
- nodeAffinity.preferredDuringSchedulingIgnoredDuringExecution
- podAffinity.preferredDuringSchedulingIgnoredDuringExecution
- podAntiAffinity.preferredDuringSchedulingIgnoredDuringExecution
- topologySpreadConstraints with whenUnsatisfiable: ScheduleAnyway
Further complicated by:
- per-term preference weights
- NodePool weights and their interaction with pod preferences (aws/karpenter-provider-aws#3714)
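For reference, all four preference types (plus a weight) can appear on a single pod; a minimal sketch using the standard APIs, with illustrative labels:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: preference-demo
  labels:
    app: demo
spec:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 50                       # per-term preference weight
          preference:
            matchExpressions:
              - key: karpenter.sh/capacity-type
                operator: In
                values: ["spot"]
    podAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 50
          podAffinityTerm:
            topologyKey: topology.kubernetes.io/zone
            labelSelector:
              matchLabels:
                app: cache                 # illustrative co-location target
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            topologyKey: kubernetes.io/hostname
            labelSelector:
              matchLabels:
                app: demo
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: ScheduleAnyway    # the soft spread variant
      labelSelector:
        matchLabels:
          app: demo
  containers:
    - name: app
      image: registry.k8s.io/pause:3.9
```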
We need to invest more deeply in defining Karpenter's interactions with Kubernetes preferences. While we do, it would be great if users could share their preference-based requirements, so we can build a coherent model that attempts to capture them all.
How important is this feature to you?
Consolidation inefficient when using topologySpreadConstraints with ScheduleAnyway #667
Define NodePool Weight/Pod Preference Interactions aws/karpenter-provider-aws#3714
Related KEP: https://github.com/kubernetes/enhancements/blob/5318bd550b34f3315f0877ac9c0252c0f93cba37/keps/sig-scheduling/3990-pod-topology-spread-fallback-mode/README.md
Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
If you are interested in working on this issue or have submitted a pull request, please leave a comment