
track volume mounts per node #1927

Merged: tzneal merged 1 commit into aws:main from respect-volume-limits on Jun 21, 2022
Conversation

@tzneal (Contributor) commented on Jun 13, 2022

1. Issue, if available:
Fixes #919 and #1888

2. Description of changes:

track volume mounts per node

Assume nodes can support an unlimited number of volumes until they launch. Once the CSI driver reports the actual number of mountable volumes, use that value, which may require launching additional nodes. (A sketch of the approach follows this description.)

3. How was this change tested?

Unit tests, plus a deployment to EKS with the EBS CSI driver configured with --volume-attach-limit=3. When the driver created the CSINode object, Karpenter launched another node for the remaining volumes.

4. Does this change impact docs?

  • Yes, PR includes docs updates
  • Yes, issue opened: link to issue
  • No

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
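The gist of the approach, as a minimal Go sketch (illustrative only; the names and data structures are hypothetical, not Karpenter's actual types): a node's volume limit is treated as unlimited until its CSI driver reports one, and once reported it bounds how many additional volumes can be placed on that node.

package main

import "fmt"

// volumeLimit returns the maximum number of volumes a node may mount for a
// given CSI driver, or -1 ("unlimited") if the driver has not reported a
// limit yet (i.e. the node hasn't launched or the driver hasn't registered).
func volumeLimit(reported map[string]int, driver string) int {
	if limit, ok := reported[driver]; ok {
		return limit
	}
	return -1
}

// fits reports whether adding newVolumes more volumes stays within the limit.
func fits(mounted, newVolumes, limit int) bool {
	if limit < 0 {
		return true // limit unknown until the CSI driver reports it
	}
	return mounted+newVolumes <= limit
}

func main() {
	// Before launch: no CSINode data, so the pod's volumes are assumed to fit.
	fmt.Println(fits(2, 3, volumeLimit(nil, "ebs.csi.aws.com"))) // true

	// After launch with --volume-attach-limit=3: the extra volume doesn't fit,
	// which is what forces Karpenter to launch another node.
	reported := map[string]int{"ebs.csi.aws.com": 3}
	fmt.Println(fits(3, 1, volumeLimit(reported, "ebs.csi.aws.com"))) // false
}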

@tzneal requested a review from a team as a code owner on June 13, 2022 at 20:03
@tzneal requested a review from bwagner5 on June 13, 2022 at 20:03
@netlify bot commented on Jun 13, 2022

Deploy Preview for karpenter-docs-prod canceled.

Latest commit: 3a77cca
Latest deploy log: https://app.netlify.com/sites/karpenter-docs-prod/deploys/62b128027561680009b904a0

@tzneal force-pushed the respect-volume-limits branch from d75a5a2 to c2388d4 on June 13, 2022 at 20:26
@@ -74,6 +74,8 @@ type InstanceType interface {
// Overhead is the amount of resource overhead expected to be used by kubelet and any other system daemons outside
// of Kubernetes.
Overhead() v1.ResourceList
// MaxVolumes returns the maximum number of volumes that can be mounted to the node.
MaxVolumes() int
Reviewer (Contributor):

Can this be expressed via Resources()?
From a real node:

    capacity:
      attachable-volumes-aws-ebs: "25"
      cpu: "8"
      ephemeral-storage: 83873772Ki
      hugepages-1Gi: "0"
      hugepages-2Mi: "0"
      memory: 32521312Ki
      pods: "58"

@tzneal (author):

See below, that number is incorrect.

node := &InFlightNode{
Node: n.Node,
maxVolumes: n.InstanceType.MaxVolumes(),
Reviewer (Contributor):

Does it not make sense to store the entire state.Node if we're using multiple pieces of data from it?

@tzneal (author), Jun 14, 2022:

A few of the pieces we use are intentionally deep copied so we can modify them.

@tzneal force-pushed the respect-volume-limits branch from 34cdf72 to 759d88f on June 14, 2022 at 12:31
@tzneal force-pushed the respect-volume-limits branch 2 times, most recently from 4bde825 to 9da3c3a, on June 14, 2022 at 20:45
@@ -91,7 +96,13 @@ func (n *Node) Add(pod *v1.Pod) error {

// Check instance type combinations
requests := resources.Merge(n.requests, resources.RequestsForPods(pod))
instanceTypes := filterInstanceTypes(n.InstanceTypeOptions, nodeRequirements, requests)
// determine the number of volumes that will be mounted if the pod schedules
mountedVolumeCount, err := n.volumeUsage.Validate(pod)
Reviewer (Contributor):

We call this variable mountedVolumeCount and volumesPerNode in different parts of the code. Can we simplify to a single concept to minimize cognitive load for the reader? Perhaps just volumeCount with "mounted" and "node" implied.

@tzneal (author), Jun 15, 2022:

The result of Validate() is called mountedVolumeCount both times it's used; it's the number of volumes that would be mounted on the node if the pod were added. I think renaming it to volumeCount is slightly less clear. I renamed volumesPerNode to mountedVolumeCount.
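For readers following along, here is a rough sketch of what that count means (hypothetical types, not the real volumeUsage implementation): take the volumes already mounted on the node, union in the pod's volumes, and count per CSI provisioner without mutating the node's state.

package main

import "fmt"

// volumeUsage maps a CSI provisioner to the set of volumes already mounted.
type volumeUsage map[string]map[string]struct{}

// validate returns, per provisioner, the number of volumes the node would have
// mounted if the pod's volumes were added. It does not mutate the usage.
func (u volumeUsage) validate(podVolumes map[string]string) map[string]int {
	counts := map[string]int{}
	for provisioner, volumes := range u {
		counts[provisioner] = len(volumes)
	}
	for volume, provisioner := range podVolumes { // volume name -> provisioner
		if _, alreadyMounted := u[provisioner][volume]; !alreadyMounted {
			counts[provisioner]++
		}
	}
	return counts
}

func main() {
	existing := volumeUsage{"ebs.csi.aws.com": {"pvc-a": {}, "pvc-b": {}}}
	pod := map[string]string{"pvc-b": "ebs.csi.aws.com", "pvc-c": "ebs.csi.aws.com"}
	fmt.Println(existing.validate(pod)) // map[ebs.csi.aws.com:3]
}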

@@ -196,7 +200,7 @@ func (s *Scheduler) add(pod *v1.Pod) error {
}
}

node := NewNode(nodeTemplate, s.topology, s.daemonOverhead[nodeTemplate], instanceTypes)
node := NewNode(s.ctx, s.kubeClient, nodeTemplate, s.topology, s.daemonOverhead[nodeTemplate], instanceTypes)
Reviewer (Contributor):

Why are we using a different context object here? How does this differ from ctx on line 103?

@tzneal (author):

It just wasn't passed down and was only used for logging. Modified to pass it through add().

}

// Add adds a port to the HostPortUsage, returning an error in the case of a conflict
func (u *HostPortUsage) Add(pod *v1.Pod) error {
func (u *HostPortUsage) Add(pod *v1.Pod) {
@ellistarn (Contributor), Jun 14, 2022:

Why not pass the context in through the function arguments like we do everywhere else? IIUC, Go style recommends passing context through call chains and avoiding setting it as an object property. There are a few cases (e.g. provisioner.go) where this pattern must be violated, since it's not triggered by anything, but that isn't the case here.

@tzneal (author):

Modified. I started this code in cluster state, which already stores a context for the same reason the provisioner does.
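To illustrate the convention being discussed (a generic sketch, not the PR's code): the context flows in as the first parameter, so whatever is attached to it travels down the call chain without being stored on the struct.

package main

import (
	"context"
	"fmt"
)

type requestIDKey struct{}

// hostPortUsage is a toy version of the real type: it tracks host ports in use.
type hostPortUsage struct{ ports map[int32]struct{} }

// Add takes the caller's context as its first argument, so anything
// context-scoped (loggers, deadlines, tracing) follows the call chain instead
// of living on the struct.
func (u *hostPortUsage) Add(ctx context.Context, port int32) {
	if _, taken := u.ports[port]; taken {
		// The real code logs this condition via the logger carried in ctx.
		fmt.Printf("port %d conflict (request %v)\n", port, ctx.Value(requestIDKey{}))
		return
	}
	u.ports[port] = struct{}{}
}

func main() {
	u := &hostPortUsage{ports: map[int32]struct{}{}}
	ctx := context.WithValue(context.Background(), requestIDKey{}, "provisioning-loop-1")
	u.Add(ctx, 8080)
	u.Add(ctx, 8080) // second add reports the conflict using the request's context
}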

}

func (u *volumeUsage) insert(c volumeUsage) {
for k, v := range c {
Reviewer (Contributor):

I'm having a hard time reading this. What's c stand for?

@tzneal (author):

It stood for collection; I've renamed it to volumes.

return cp
}

func NewVolumeLimits(ctx context.Context, kubeClient client.Client) *VolumeLimits {
Reviewer (Contributor):

Likewise.

Suggested change:

-func NewVolumeLimits(ctx context.Context, kubeClient client.Client) *VolumeLimits {
+func NewVolumeLimits(ctx context.Context, kubeClient client.Client) VolumeLimits {

@tzneal (author):

VolumeLimits isn't a map type, it's a struct.

}
// https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/volume_limits.html
switch i.Name() {
case "d3.8xlarge", "d3en.12xlarge":

Reviewer:

Am I right that AWS exposes no API for this?

@@ -293,6 +300,36 @@ func (i *InstanceType) eniLimitedPods() int64 {
return *i.NetworkInfo.MaximumNetworkInterfaces*(*i.NetworkInfo.Ipv4AddressesPerInterface-1) + 2
}

func (i *InstanceType) computeVolumeLimits(ebsVolumeAttachLimit int) map[string]int {
result := map[string]int{
ebsCsiProvisioner: 27,

Reviewer:

This default only makes sense for Nitro instances. Obviously that's the majority of instances now ... but this might behave funny on an older non-Nitro instance.
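For context, a hedged sketch of the defaulting being questioned here (illustrative only; the signature mirrors the diff, but the body is not the PR's actual code, "a negative value means not configured" is an assumption of the sketch, and 27 is the Nitro-derived default under discussion):

package main

import "fmt"

const ebsCSIDriver = "ebs.csi.aws.com"

// computeVolumeLimits returns the per-driver volume limit to assume for an
// instance type: the operator-supplied --volume-attach-limit if set (>= 0),
// otherwise the Nitro-style default of 27 discussed above. A real
// implementation would also special-case non-Nitro and unusual instance types.
func computeVolumeLimits(ebsVolumeAttachLimit int) map[string]int {
	limit := 27
	if ebsVolumeAttachLimit >= 0 {
		limit = ebsVolumeAttachLimit
	}
	return map[string]int{ebsCSIDriver: limit}
}

func main() {
	fmt.Println(computeVolumeLimits(-1)) // map[ebs.csi.aws.com:27]
	fmt.Println(computeVolumeLimits(3))  // map[ebs.csi.aws.com:3]
}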

@@ -40,6 +40,8 @@ const (
InstanceTypeZonesCacheKey = "zones"
InstanceTypesAndZonesCacheTTL = 5 * time.Minute
UnfulfillableCapacityErrorCacheTTL = 3 * time.Minute

ebsCsiProvisioner = "ebs.csi.aws.com"

Reviewer:

Am I missing where this is used?

@@ -293,6 +300,36 @@ func (i *InstanceType) eniLimitedPods() int64 {
return *i.NetworkInfo.MaximumNetworkInterfaces*(*i.NetworkInfo.Ipv4AddressesPerInterface-1) + 2
}

func (i *InstanceType) computeVolumeLimits(ebsVolumeAttachLimit int) map[string]int {

Reviewer:

Just a thought here... rather than hard-coding this list, what if Karpenter just used the CSINode data for each node in the cluster to decide whether or not an already-running node was really eligible to receive a new pod?

apiVersion: storage.k8s.io/v1
kind: CSINode
metadata:
  creationTimestamp: "2022-06-14T17:13:09Z"
  name: ip-100-64-1-228.us-west-2.compute.internal
  ownerReferences:
  - apiVersion: v1
    kind: Node
    name: ip-100-64-1-228.us-west-2.compute.internal
    uid: 8126d39d-afa1-4967-8b5d-7d181e103545
  resourceVersion: "693807159"
  uid: 95d3d0d2-f92a-47ec-b0df-4abfd33ea3da
spec:
  drivers:
  - name: efs.csi.aws.com
    nodeID: i-01b557d4a5dc348ab
    topologyKeys: null
  - allocatable:
      count: 15
    name: ebs.csi.aws.com
    nodeID: i-01b557d4a5dc348ab
    topologyKeys:
    - topology.ebs.csi.aws.com/zone

Right now, as I understand it, Karpenter tries to replicate much of the kube-scheduler's scheduling logic to determine whether or not a Pod is going to be schedulable against the current cluster hardware. When Karpenter sees a Pod in status: Pending, it could go through its normal logic loop... but as it's iterating over the nodes to determine whether or not the Pod should be schedulable, it could also take into account the allocatable.count values from the CSINodes API.

I guess the disadvantage of this model is that if you launch a large number of pods all at once, Karpenter might under- or over-launch nodes because it still needs some concept of a "default number of volumes that can be mounted". So perhaps you'd still need the ebsVolumeAttachLimit setting.

The advantage though is that you don't have to guess what the volume limit is for existing hosts, because it is exposed via the API.

@tzneal (author), Jun 15, 2022:

The downside is that we'll have slower provisioning in the event your pods hit the volume limit, but it probably doesn't matter much as it's somewhat unlikely to hit the volume limits. Looking into it.
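A sketch of the lookup the reviewer describes, assuming controller-runtime and the storage/v1 API (not the PR's exact code): fetch the node's CSINode object and read the allocatable count published by the EBS CSI driver, treating "not found" as "limit not reported yet".

package csivolumes

import (
	"context"

	storagev1 "k8s.io/api/storage/v1"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

const ebsCSIDriver = "ebs.csi.aws.com"

// ebsAttachLimit returns the allocatable volume count the EBS CSI driver has
// reported on the node's CSINode object, or (0, false) if nothing has been
// reported yet (node not launched, driver not registered, or no count set).
func ebsAttachLimit(ctx context.Context, kubeClient client.Client, nodeName string) (int32, bool, error) {
	csiNode := &storagev1.CSINode{}
	if err := kubeClient.Get(ctx, types.NamespacedName{Name: nodeName}, csiNode); err != nil {
		if client.IgnoreNotFound(err) == nil {
			return 0, false, nil // no CSINode yet: treat the limit as unknown
		}
		return 0, false, err
	}
	for _, driver := range csiNode.Spec.Drivers {
		if driver.Name == ebsCSIDriver && driver.Allocatable != nil && driver.Allocatable.Count != nil {
			return *driver.Allocatable.Count, true, nil
		}
	}
	return 0, false, nil
}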

@tzneal force-pushed the respect-volume-limits branch from af94c35 to fc34204 on June 15, 2022 at 19:13
@tzneal force-pushed the respect-volume-limits branch from 717185e to 8e5bec6 on June 20, 2022 at 13:42
err = multierr.Append(err, c.populateProvisioner(ctx, node, n))
err = multierr.Append(err, c.populateVolumeLimits(ctx, node, n))
err = multierr.Append(err, c.populateResourceRequests(ctx, node, n))
return n, err
Reviewer (Contributor):

Optional: if you're not familiar with multierr.Combine, it can help you achieve declarative, functional code, e.g.:

return &Node{
	Node:          node,
	HostPortUsage: scheduling.NewHostPortUsage(),
	VolumeUsage:   scheduling.NewVolumeLimits(c.kubeClient),
	VolumeLimits:  scheduling.VolumeCount{},
	podRequests:   map[types.NamespacedName]v1.ResourceList{},
}, multierr.Combine(
	c.populateProvisioner(ctx, node, n),
	c.populateVolumeLimits(ctx, node, n),
	c.populateResourceRequests(ctx, node, n),
)

func (c *Cluster) getNodeAllocatable(node *v1.Node, provisioner *v1alpha5.Provisioner) v1.ResourceList {
instanceType, err := c.getInstanceType(c.ctx, provisioner, node.Labels[v1.LabelInstanceTypeStable])
func (c *Cluster) getNodeAllocatable(ctx context.Context, node *v1.Node, provisioner *v1alpha5.Provisioner) v1.ResourceList {
instanceType, err := c.getInstanceType(ctx, provisioner, node.Labels[v1.LabelInstanceTypeStable])
Reviewer (Contributor):

Is there a reason we're not handling this error and bubbling it up?

@tzneal (author):

If we can't find the instance type, it only matters for not-ready nodes, which we then treat as having no capacity. Once kubelet is running, not finding an instance type means something odd has happened (e.g. the instance type is no longer reported by the EC2 API), but it doesn't affect anything, so we just log it.

c.mu.Lock()
defer c.mu.Unlock()
c.nodes[node.Name] = c.newNode(node)
var err error
c.nodes[node.Name], err = c.newNode(ctx, node)
Reviewer (Contributor):

If an error occurs, will we inadvertently set c.nodes[node.Name] to nil? This will cause it to show up when looping over the nodes.

@tzneal (author):

Previously newNode() always returned a node object, even if it failed to populate some portion of it. I could see it going both ways, since some semi-failure modes aren't real failures (e.g. being unable to read the CSINode), but for now I've modified it to return nil on error for consistency's sake.
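A self-contained illustration of the guard being discussed (toy types, not the cluster-state code): skip the map write when construction fails, so iterating over the map never hits a nil entry.

package main

import (
	"errors"
	"fmt"
)

type node struct{ name string }

// newNode fails for nodes we can't build state for yet.
func newNode(name string) (*node, error) {
	if name == "" {
		return nil, errors.New("node has no name yet")
	}
	return &node{name: name}, nil
}

func main() {
	nodes := map[string]*node{}
	for _, name := range []string{"ip-10-0-0-1", ""} {
		n, err := newNode(name)
		if err != nil {
			fmt.Println("skipping node:", err) // log and move on instead of storing nil
			continue
		}
		nodes[n.name] = n
	}
	fmt.Println(len(nodes)) // 1 — the failed construction never produced a nil map entry
}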

}

// node didn't exist, but creating it will pick up this newly bound pod as well
n = c.newNode(&node)
var err error
n, err = c.newNode(ctx, &node)
Reviewer (Contributor):

Same concern re: setting the map entry if err != nil.

func (u *HostPortUsage) Add(ctx context.Context, pod *v1.Pod) {
newUsage, err := u.validate(pod)
if err != nil {
logging.FromContext(ctx).Errorf("invariant violated registering host port usage, %s, please file an issue", err)
@ellistarn (Contributor), Jun 20, 2022:

I'm wondering if we should do a similar error propagation pattern here. This will bubble up to adding a pod to a node, which shouldn't block other pods.

@ellistarn previously approved these changes on Jun 20, 2022
Commit message:

Assume nodes can support infinite volumes until they launch. Once the CSI driver reports the current number of mountable volumes, use that value, which may require launching more nodes. Remove the PVC controller since we no longer bind pods to nodes.

Fixes aws#919 and aws#1888
@tzneal force-pushed the respect-volume-limits branch from a2a6e1d to 3a77cca on June 21, 2022 at 02:08
@dewjam (Contributor) commented on Jun 21, 2022:

This might be a silly question, but do we no longer need the PersistentVolumeClaim controller given we're no longer pre-binding pods?

@dewjam (Contributor) commented on Jun 21, 2022:

> This might be a silly question, but do we no longer need the PersistentVolumeClaim controller given we're no longer pre-binding pods?

Discussed with @tzneal offline and the short answer is "yes".

@dewjam (Contributor) left a review:

LGTM

func (c *Cluster) updatePod(pod *v1.Pod) {
c.updateNodeUsageFromPod(pod)
func (c *Cluster) updatePod(ctx context.Context, pod *v1.Pod) error {
err := c.updateNodeUsageFromPod(ctx, pod)
c.updatePodAntiAffinities(pod)
Reviewer (Contributor):

Should we update if there's an error?

@@ -92,10 +95,13 @@ func StorageClass(overrides ...StorageClassOptions) *storagev1.StorageClass {
if options.Zones != nil {
allowedTopologies = []v1.TopologySelectorTerm{{MatchLabelExpressions: []v1.TopologySelectorLabelRequirement{{Key: v1.LabelTopologyZone, Values: options.Zones}}}}
}
if options.Provisioner == nil {
Reviewer (Contributor):

Consider just checking "" and not making options.Provisioner a pointer to avoid type juggling.
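A tiny sketch of the alternative being suggested (illustrative option struct, not the actual test fixture; the default provisioner name is a placeholder): let the string's zero value stand in for "unset" so no pointer juggling is needed.

package main

import "fmt"

// StorageClassOptions uses the zero value "" to mean "not set", avoiding a
// *string pointer.
type StorageClassOptions struct {
	Provisioner string
}

func provisionerFor(options StorageClassOptions) string {
	if options.Provisioner == "" {
		options.Provisioner = "default.provisioner.example" // placeholder default for the sketch
	}
	return options.Provisioner
}

func main() {
	fmt.Println(provisionerFor(StorageClassOptions{}))                               // default.provisioner.example
	fmt.Println(provisionerFor(StorageClassOptions{Provisioner: "ebs.csi.aws.com"})) // ebs.csi.aws.com
}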

@tzneal merged commit 80f4114 into aws:main on Jun 21, 2022
@tzneal deleted the respect-volume-limits branch on June 21, 2022 at 17:10