Schedule and binpack #1561
Conversation
This is a larger PR, but 500+ of the new lines are unit tests; for non-test code only, it's 15 files changed, 353 insertions(+), 405 deletions(-).
✅ Deploy Preview for karpenter-docs-prod canceled.
Force-pushed from 754893a to 91a8512
Force-pushed from 5c31be9 to ca2fc1e
Force-pushed from b827465 to 705ae2a
pkg/controllers/provisioning/scheduling/scheduling_benchmark_test.go (review thread resolved, outdated)
logging.FromContext(ctx).Errorf("Could not pack pods, %s", err) | ||
return | ||
if err := p.launch(ctx, nodes[i]); err != nil { | ||
logging.FromContext(ctx).Errorf("Could not launch node, %s", err) |
nit:
- I think err prints poorly with the %s directive
- I'm a fan of thinking of the log level as part of the message, e.g. "ERROR Launching node, for reasons"
logging.FromContext(ctx).Errorf("Could not launch node, %s", err) | |
logging.FromContext(ctx).Errorf("Launching node, %s", err.Error()) |
- This is ok as far as I know; see https://pkg.go.dev/fmt, specifically:
"If the format (which is implicitly %v for Println etc.) is valid for a string (%s %q %v %x %X), the following two rules apply:
1. If an operand implements the error interface, the Error method will be invoked to convert the object to a string, which will then be formatted as required by the verb (if any).
2. If an operand implements method String() string, that method will be invoked to convert the object to a string, which will then be formatted as required by the verb (if any)."
The code is here: https://github.com/golang/go/blob/80a7504a13a5dccb60757d1fc66d71bcba359799/src/fmt/print.go#L611 ; %s and %v with the error as the only argument are the same as calling err.Error() (a short demonstration follows below).
- SGTM
In this rework I updated the log message for attempting to launch a node.
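A minimal, runnable demonstration of the fmt behavior quoted above (not karpenter code): for an error value, %s, %v, and err.Error() all print the same text.

package main

import (
	"errors"
	"fmt"
)

func main() {
	err := errors.New("could not launch node")
	// For a value implementing the error interface, %s and %v both invoke Error().
	fmt.Printf("%s\n", err)  // could not launch node
	fmt.Printf("%v\n", err)  // could not launch node
	fmt.Println(err.Error()) // could not launch node
}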
Force-pushed from 23e63e2 to 9434fab
return true
}

func (n *Node) RequiredResources() v1.ResourceList {
If this is only used by logging, WDYT about making node a Stringer and just log launching node %s
Modified Node to have a String() method and limited the instance types shown in the log to 5:
karpenter-69b966c4df-2jdh8 controller 2022-03-27T02:18:36.443Z INFO controller.provisioning Launched instance: i-0db3804e44367a4ed, hostname: ip-192-168-135-166.us-west-2.compute.internal, type: t3a.2xlarge, zone: us-west-2c, capacityType: on-demand {"commit": "c456cf5", "provisioner": "default"}
karpenter-69b966c4df-2jdh8 controller 2022-03-27T02:18:36.443Z INFO controller.provisioning Launched with 5 pods using resources cpu: 5315m memory: 1093Mi ephemeral-storage: 5Gi from types c1.xlarge, c3.2xlarge, c4.2xlarge, c5ad.2xlarge, c6i.2xlarge and 205 others {"commit": "c456cf5", "provisioner": "default"}
karpenter-69b966c4df-2jdh8 controller 2022-03-27T02:18:36.536Z INFO controller.provisioning Bound 5 pod(s) to node ip-192-168-135-166.us-west-2.compute.internal {"commit": "c456cf5", "provisioner": "default"}
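A rough, self-contained sketch of the Stringer approach; the field names (PodCount, InstanceTypeNames) and the 5-type cutoff are illustrative assumptions, not the actual karpenter Node:

package main

import (
	"fmt"
	"strings"
)

// Node is a simplified stand-in for the scheduling node.
type Node struct {
	PodCount          int
	InstanceTypeNames []string
}

// String implements fmt.Stringer, so callers can simply write
// logging.FromContext(ctx).Infof("Launched %s", node).
func (n *Node) String() string {
	shown := n.InstanceTypeNames
	suffix := ""
	if len(shown) > 5 {
		suffix = fmt.Sprintf(" and %d others", len(shown)-5)
		shown = shown[:5]
	}
	return fmt.Sprintf("with %d pods from types %s%s", n.PodCount, strings.Join(shown, ", "), suffix)
}

func main() {
	n := &Node{PodCount: 5, InstanceTypeNames: []string{"c1.xlarge", "c3.2xlarge", "c4.2xlarge", "c5ad.2xlarge", "c6i.2xlarge", "m5.large"}}
	fmt.Printf("Launched %s\n", n) // the %s verb invokes String()
}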
@@ -102,3 +170,15 @@ func (n Node) hasCompatibleResources(resourceList v1.ResourceList, it cloudprovi
}
return true
}

func (n *Node) recalculateDaemonResources() {
My understanding is that this code intends to remove daemons from the overhead calculation once constraints are tightened. I'm not sure we want to do this for a couple of reasons:
- I'm concerned about the performance impact of recomputing requirements on every loop
- It's rare in practice for daemons to have large resource requests or complicated scheduling requirements
- Even with all of this additional math, we can't get it right in all cases. Consider the case where a daemonset only runs on arm64, but pods are compatible with either arm64 or amd64. We don't know the architecture until after the node is launched, at which point we can't change our binpacking decision.
- In the worst case scenario, we're creating nodes with additional room if it turns out the daemon can't schedule.
Instead, does it make sense to simply check the requirements once during NewNode? This will apply provisioner-level requirements to the daemons, which is an easy-to-explain contract for customers. Further, if an edge case is causing undesirable behavior, the user has the ability to align the constraints of their provisioner and their daemonsets.
SGTM. I implemented it because the old code had this behavior: it didn't bin-pack until it had accumulated a set of compatible pods, so it only did it once. It's a separate commit on this PR, as it requires removing an old test (from the old hash-based scheduling + bin-packing) that verified this was performed.
I went one step further and computed the daemonset resources against the provisioner constraints up front, since that way we only need to do it once.
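A simplified, self-contained sketch (not the actual karpenter code) of computing the daemonset overhead once against provisioner constraints; the nodeSelector-only compatibility check and the helper names are assumptions:

package scheduling

import (
	corev1 "k8s.io/api/core/v1"
)

// daemonOverhead sums the requests of daemonset pods that could schedule
// against this provisioner, computed a single time up front rather than on
// every bin-packing loop.
func daemonOverhead(provisionerLabels map[string]string, daemons []*corev1.Pod) corev1.ResourceList {
	overhead := corev1.ResourceList{}
	for _, d := range daemons {
		if !selectorSatisfiable(provisionerLabels, d.Spec.NodeSelector) {
			continue
		}
		for _, c := range d.Spec.Containers {
			for name, q := range c.Resources.Requests {
				sum := overhead[name]
				sum.Add(q)
				overhead[name] = sum
			}
		}
	}
	return overhead
}

// selectorSatisfiable is a simplification: a daemon is considered compatible
// if every nodeSelector key it sets is either absent from, or equal to, the
// provisioner's fixed labels.
func selectorSatisfiable(provisionerLabels, nodeSelector map[string]string) bool {
	for k, v := range nodeSelector {
		if pv, ok := provisionerLabels[k]; ok && pv != v {
			return false
		}
	}
	return true
}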
@@ -45,11 +47,28 @@ func NewComplementSet(values ...string) Set {
}
}

// Hash provides a hash function so we can generate a good hash for Set which has no public fields.
func (s Set) Hash() (uint64, error) {
Is this still used?
No, but I left it in thinking we might hash something containing a Set later. Without this method, the hash library would just ignore the Set type (it has no public fields) and not report any error.
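For context, and if I'm reading the library right, hashstructure calls a type's own Hash() (uint64, error) method (its Hashable interface) instead of reflecting over fields, so keeping the method means a future struct containing a Set would still contribute the Set's contents to the hash. A small sketch with the Set internals simplified to a string slice:

package main

import (
	"fmt"

	"github.com/mitchellh/hashstructure/v2"
)

// set stands in for the Requirements Set: no exported fields.
type set struct {
	values []string
}

// Hash satisfies hashstructure's Hashable interface, so the library uses it
// rather than skipping the unexported fields.
func (s set) Hash() (uint64, error) {
	return hashstructure.Hash(s.values, hashstructure.FormatV2, nil)
}

type schedulingKey struct {
	Provisioner string
	Zones       set
}

func main() {
	key := schedulingKey{Provisioner: "default", Zones: set{values: []string{"us-west-2a", "us-west-2b"}}}
	h, err := hashstructure.Hash(key, hashstructure.FormatV2, nil)
	fmt.Println(h, err)
}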
}
return hashstructure.Hash(key, hashstructure.FormatV2, nil)
}

// DeepCopy creates a deep copy of the set object
// It is required by the Kubernetes CRDs code generation
func (s Set) DeepCopy() Set {
Remind me where we're deep copying the set?
The generated DeepCopy method for Requirements calls Set's DeepCopy method.
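For context, a minimal sketch of what such a hand-written DeepCopy can look like; the fields shown are assumptions, since it is precisely Set's unexported state that keeps deepcopy-gen from generating this and makes a manual method necessary for the generated Requirements.DeepCopy to call:

// Set is a simplified stand-in with only unexported fields.
type Set struct {
	members    map[string]struct{}
	complement bool
}

// DeepCopy creates a deep copy of the set object so the generated
// Requirements.DeepCopy can delegate to it.
func (s Set) DeepCopy() Set {
	members := make(map[string]struct{}, len(s.members))
	for k := range s.members {
		members[k] = struct{}{}
	}
	return Set{members: members, complement: s.complement}
}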
cpuCmp := resources.Cmp(lhs[v1.ResourceCPU], rhs[v1.ResourceCPU])
if cpuCmp < 0 {
// LHS has less CPU, so it should be sorted after
Why comment here and not on memory?
Both checks follow the same pattern (a lesser value sorts after the other), so I figured documenting it once was enough.
"github.com/aws/karpenter/pkg/apis/provisioning/v1alpha5" | ||
) | ||
|
||
type NodeSet struct { |
I'm not tracking what this abstraction gets us. Can't we just call getDaemons in the scheduler and then maintain a []Node?
I planned on hanging the topology information on the nodeset when implementing pod affinity/anti-affinity.
I'm a bit nervous about predicting the future too much due to #975. Squinting at it, I'm not sure that this abstraction will survive review, and would generally opt to not try to predict abstractions without a use case. Given that you have a stack of PRs, I'm happy to move forward with it to keep us moving, but I remain to be convinced.
@@ -59,34 +65,67 @@ func NewScheduler(kubeClient client.Client) *Scheduler {
func (s *Scheduler) Solve(ctx context.Context, provisioner *v1alpha5.Provisioner, instanceTypes []cloudprovider.InstanceType, pods []*v1.Pod) ([]*Node, error) {
defer metrics.Measure(schedulingDuration.WithLabelValues(injection.GetNamespacedName(ctx).Name))()
constraints := provisioner.Spec.Constraints.DeepCopy()
start := time.Now()
This time is already being measured by line 65.
That data isn't readily available without running prometheus though, right? It's measured, but I wanted to know how long scheduling was taking and normally just tail the controller logs.
I'd prefer to maintain the separation of concerns between metrics and logging. I'd hate to be in a habit where our logs contain a bunch of metrics data, which might encourage users to parse it rather than consume the metrics directly. I'd love to get to a point where we have canonical karpenter dashboards where we can view the breakdown of the entire algorithm.
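For background on the Measure pattern in the diff above, a rough sketch of how a defer-based timer feeds a Prometheus histogram rather than the logs; the metric and helper here are illustrative, not the actual karpenter metrics package:

package main

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

var schedulingDuration = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Namespace: "karpenter",
		Name:      "scheduling_duration_seconds",
		Help:      "Duration of the scheduling loop.",
	},
	[]string{"provisioner"},
)

// Measure returns a closure that records the elapsed time when invoked,
// enabling the defer Measure(...)() idiom from the diff.
func Measure(observer prometheus.Observer) func() {
	start := time.Now()
	return func() { observer.Observe(time.Since(start).Seconds()) }
}

func Solve(provisionerName string) {
	defer Measure(schedulingDuration.WithLabelValues(provisionerName))()
	// ... scheduling and bin-packing work ...
	time.Sleep(10 * time.Millisecond)
}

func main() {
	prometheus.MustRegister(schedulingDuration)
	Solve("default")
}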
nodes = append(nodes, NewNode(constraints, instanceTypes, pod))
n, err := NewNode(constraints, nodeSet.daemons, instanceTypes, pod)
if err != nil {
logging.FromContext(ctx).With("pod", client.ObjectKeyFromObject(pod)).Errorf("scheduling pod, %s", err)
Thoughts on putting this on line 89 so any logs can benefit?
ctx = logging.WithLogger(ctx, logging.FromContext(ctx).With("pod", client.ObjectKeyFromObject(pod)))
The only log is on line 98, and the ctx isn't passed anywhere that can use it.
This solves some issues with topologies and prepares for pod-affinity, in that a separate bin-packing pass would break apart pods that the scheduler intended to be packed together. By combining bin-packing with scheduling, the 'scheduling nodes' created correspond to actual K8s nodes that will be created via the cloud provider, eliminating any errors in counting hostname topologies.
Force-pushed from 2a4ab18 to 318c70c
Force-pushed from 318c70c to 3bd976e
Apologies for so many comments. I noticed a bunch of things that seem odd, duplicated, or unnecessarily complex. Happy to discuss on a call.
// will be compatible with this node
for _, p := range pods {
n.Add(p)
}
return n
}

func (n Node) Compatible(pod *v1.Pod) error {
I can't remember if I've mentioned this already (and can't find it), but it strikes me that Compatible and Add may be able to be collapsed into a single method to avoid some duplicated logic. Apologies if we've already gone over this.
We want Compatible to be separate for the next PR so we can implement preferred pod affinity and be a bit smarter about spreading pods across the nodes we already have to create anyway due to hostname topology spreads and pod anti-affinity (e.g. create 3x m5.8xlarge instead of an m5.24xlarge and 2x m5.large).
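A stripped-down sketch of the intended split, with made-up types and only resource bookkeeping (the real Node also tracks things like requirements and instance type options): Compatible answers "would this pod fit?" without mutating the node, while Add commits it, so a later pass can probe several candidate nodes before choosing one.

package scheduling

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

// node is a simplified stand-in for the scheduling Node.
type node struct {
	requests v1.ResourceList // accumulated requests of pods already added
	capacity v1.ResourceList // capacity of the cheapest viable instance type
}

// Compatible reports whether the pod could fit without changing node state.
func (n *node) Compatible(pod *v1.Pod) error {
	needed := sum(n.requests, podRequests(pod))
	for name, q := range needed {
		if limit, ok := n.capacity[name]; ok && q.Cmp(limit) > 0 {
			return fmt.Errorf("insufficient %s", name)
		}
	}
	return nil
}

// Add commits the pod to the node after re-checking compatibility.
func (n *node) Add(pod *v1.Pod) error {
	if err := n.Compatible(pod); err != nil {
		return err
	}
	n.requests = sum(n.requests, podRequests(pod))
	return nil
}

func podRequests(pod *v1.Pod) v1.ResourceList {
	out := v1.ResourceList{}
	for _, c := range pod.Spec.Containers {
		out = sum(out, c.Resources.Requests)
	}
	return out
}

func sum(lists ...v1.ResourceList) v1.ResourceList {
	out := v1.ResourceList{}
	for _, list := range lists {
		for name, q := range list {
			total := out[name]
			total.Add(q)
			out[name] = total
		}
	}
	return out
}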
"github.com/aws/karpenter/pkg/apis/provisioning/v1alpha5" | ||
) | ||
|
||
type NodeSet struct { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a bit nervous about predicting the future too much due to #975. Squinting at it, I'm not sure that this abstraction will survive review, and would generally opt to not try to predict abstractions without a use case. Given that you have a stack of PRs, I'm happy to move forward with it to keep us moving, but I remain to be convinced.
@@ -59,34 +65,67 @@ func NewScheduler(kubeClient client.Client) *Scheduler { | |||
func (s *Scheduler) Solve(ctx context.Context, provisioner *v1alpha5.Provisioner, instanceTypes []cloudprovider.InstanceType, pods []*v1.Pod) ([]*Node, error) { | |||
defer metrics.Measure(schedulingDuration.WithLabelValues(injection.GetNamespacedName(ctx).Name))() | |||
constraints := provisioner.Spec.Constraints.DeepCopy() | |||
start := time.Now() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd prefer to maintain the separation of concerns between metrics and logging. I'd hate to be in a habit where our logs contain a bunch of metrics data, which might encourage users to parse it rather than consume the metrics directly. I'd love to get to a point where we have canonical karpenter dashboards where we can view the breakdown of the entire algorithm.
return nil
}
}
return errors.New("no matching instance type found")
}

func (n Node) reservedResources(it cloudprovider.InstanceType) v1.ResourceList {
This is another helper that could be deduplicated if we combine Add/Compatible
See above regarding intentional separation.
func (n *Node) Add(pod *v1.Pod) error {
n.Requirements = n.Requirements.Add(v1alpha5.NewPodRequirements(pod).Requirements...)

podRequests := resources.RequestsForPods(pod)
Does it simplify our max-pods check if we include a resource request for 1 pod in podRequests? Then we can just treat it like all other requests.
I thought about that, but we can't mutate the pod itself. I added it in resources.RequestsForPods/LimitsForPods.
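A simplified sketch of the approach described above: fold a request for one "pods" resource into the computed list instead of mutating the Pod object, so max-pods is handled like any other resource during bin-packing. The function name matches the helper mentioned, but the body is an assumption and omits details like init containers.

package resources

import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// RequestsForPods sums container requests across pods and counts each pod
// against the node's "pods" capacity.
func RequestsForPods(pods ...*v1.Pod) v1.ResourceList {
	requests := v1.ResourceList{}
	for _, pod := range pods {
		for _, container := range pod.Spec.Containers {
			for name, q := range container.Resources.Requests {
				total := requests[name]
				total.Add(q)
				requests[name] = total
			}
		}
		// Count the pod itself without touching the Pod object.
		podCount := requests[v1.ResourcePods]
		podCount.Add(resource.MustParse("1"))
		requests[v1.ResourcePods] = podCount
	}
	return requests
}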
Force-pushed from aa8c3fd to c18f206
Force-pushed from c18f206 to 0a44b01
@@ -65,3 +74,24 @@ func Quantity(value string) *resource.Quantity {
func IsZero(r resource.Quantity) bool {
return r.IsZero()
}

func Cmp(lhs resource.Quantity, rhs resource.Quantity) int {
I realized that this helper is now only used in Fits, so we could potentially remove this.
Checked and the scheduler uses it too:
func byCPUAndMemoryDescending(pods []*v1.Pod) func(i int, j int) bool {
	return func(i, j int) bool {
		lhs := resources.RequestsForPods(pods[i])
		rhs := resources.RequestsForPods(pods[j])
		cpuCmp := resources.Cmp(lhs[v1.ResourceCPU], rhs[v1.ResourceCPU])
		if cpuCmp < 0 {
			// LHS has less CPU, so it should be sorted after
			return false
		} else if cpuCmp > 0 {
			return true
		}
		memCmp := resources.Cmp(lhs[v1.ResourceMemory], rhs[v1.ResourceMemory])
		if memCmp < 0 {
			return false
		} else if memCmp > 0 {
			return true
		}
		return false
	}
}
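For completeness, the Cmp helper referenced here can be a thin wrapper over resource.Quantity's own comparison (using k8s.io/apimachinery/pkg/api/resource); this is an assumed implementation, not necessarily the actual one:

// Cmp returns -1, 0, or 1 as lhs is less than, equal to, or greater than rhs.
func Cmp(lhs resource.Quantity, rhs resource.Quantity) int {
	return lhs.Cmp(rhs)
}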
1. Issue, if available:
N/A
2. Description of changes:
Merge bin-packing and scheduling to prepare for pod affinity work.
3. How was this change tested?
Unit test & live on EKS
4. Does this change impact docs?
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
Prices are for us-west-2 from https://aws.amazon.com/ec2/pricing/on-demand/. Instance specs are listed as: type, on-demand price/hr, vCPUs, memory, instance storage, network performance.
Inflate = 20 replicas
Karpenter (combined scheduling + bin-packing) - m5zn.6xlarge $1.982 24 96 GiB EBS Only 50 Gigabit
This first tried to allocate a c5a.8xlarge and a c5a.12xlarge, which failed due to availability, before creating an m5zn.6xlarge:
c5a.8xlarge $1.232 32 64 GiB EBS Only 10 Gigabit
c5a.12xlarge $1.848 48 96 GiB EBS Only 12 Gigabit
Karpenter v0.7.0 - m5zn.6xlarge $1.982 24 96 GiB EBS Only 50 Gigabit
Inflate = 8 replicas
Karpenter (combined scheduling + bin-packing) - c6a.4xlarge $0.612 16 32 GiB EBS Only Up to 12500 Megabit
Karpenter v0.7.0 - m5a.4xlarge $0.688 16 64 GiB EBS Only Up to 10 Gigabit
Inflate = 4 replicas
Karpenter (combined scheduling + bin-packing) - c5d.2xlarge $0.384 8 16 GiB 1 x 200 NVMe SSD Up to 10 Gigabit
This points out a flaw in our current pricing calculation: we thought this would be cheaper than the t3a.2xlarge because it has less memory.
Karpenter v0.7.0 - t3a.2xlarge $0.3008 8 32 GiB EBS Only Up to 5 Gigabit
Inflate = 1 replica
Karpenter (combined scheduling + bin-packing) - t3a.micro $0.0094 2 1 GiB EBS Only Up to 5 Gigabit
Karpenter v0.7.0 - t3a.micro $0.0094 2 1 GiB EBS Only Up to 5 Gigabit