
Implemented topology spread constraints for zone and hostname #619

Merged
4 commits merged into aws:main from the nodeaffinity branch on Aug 30, 2021

Conversation

@ellistarn (Contributor) commented on Aug 15, 2021

Issue, if available:
#481

Description of changes:

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
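
For context, this is the kind of pod-level constraint the PR teaches Karpenter to honor. Below is a minimal sketch using the standard Kubernetes API types; the pod name, app label, and container image are illustrative and not taken from this PR:

package example

import (
	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// examplePod returns a pod that asks to be spread across zones and hostnames,
// the two topology keys this PR adds support for.
func examplePod() *v1.Pod {
	selector := &metav1.LabelSelector{MatchLabels: map[string]string{"app": "inflate"}}
	return &v1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "inflate", Labels: map[string]string{"app": "inflate"}},
		Spec: v1.PodSpec{
			TopologySpreadConstraints: []v1.TopologySpreadConstraint{
				{MaxSkew: 1, TopologyKey: "topology.kubernetes.io/zone", WhenUnsatisfiable: v1.DoNotSchedule, LabelSelector: selector},
				{MaxSkew: 1, TopologyKey: "kubernetes.io/hostname", WhenUnsatisfiable: v1.DoNotSchedule, LabelSelector: selector},
			},
			Containers: []v1.Container{{Name: "app", Image: "registry.k8s.io/pause:3.9"}},
		},
	}
}

With MaxSkew: 1 and DoNotSchedule, the number of matching pods in any two zones (or on any two hosts) may differ by at most one, which is the skew the provisioner now has to respect when deciding where to create capacity.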

netlify bot commented on Aug 15, 2021

✔️ Deploy Preview for karpenter-docs-prod canceled.

🔨 Explore the source changes: 43216b8

🔍 Inspect the deploy log: https://app.netlify.com/sites/karpenter-docs-prod/deploys/612d198a40f8d60008313d63

@ellistarn force-pushed the nodeaffinity branch 3 times, most recently from ffec861 to cb564dd, on August 25, 2021 at 21:37
@ellistarn changed the title from "[WIP][Experimental] Implemented topology spread constraints for zone and hostname" to "Implemented topology spread constraints for zone and hostname" on Aug 25, 2021
@ellistarn force-pushed the nodeaffinity branch 2 times, most recently from cc85f59 to 6589255, on August 25, 2021 at 23:25
@ellistarn force-pushed the nodeaffinity branch 4 times, most recently from fe6f800 to 143a20f, on August 27, 2021 at 00:19
// getSchedules separates pods into a set of schedules. All pods in each group
// contain compatible scheduling constraints and can be deployed together on the
// same node, or multiple similar nodes if the pods exceed one node's capacity.
func (s *Scheduler) getSchedules(ctx context.Context, provisioner *v1alpha3.Provisioner, pods []*v1.Pod) ([]*Schedule, error) {
Contributor (Author):
Keep in mind, this code is untouched and a direct port of constraints.go (now removed)

}
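
As a rough illustration of the grouping described in the doc comment above (not the actual ported constraints.go logic), one can think of it as bucketing pods by a key derived from their scheduling requirements:

package scheduling

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

// groupBySchedulingKey is a simplified sketch: pods whose node selectors are
// identical land in the same bucket and could share a node (or a set of
// similar nodes). The real implementation also has to account for affinity,
// tolerations, and topology constraints.
func groupBySchedulingKey(pods []*v1.Pod) map[string][]*v1.Pod {
	groups := map[string][]*v1.Pod{}
	for _, pod := range pods {
		key := fmt.Sprintf("%v", pod.Spec.NodeSelector) // fmt prints maps in sorted key order
		groups[key] = append(groups[key], pod)
	}
	return groups
}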

// computeZonalTopology for the topology group. Zones include viable zones for
// the {cloudprovider, provisioner, pod }. If these zones change over time,
Contributor:
nit: extra space before }?

}
domain, ok := node.Labels[topologyGroup.Constraint.TopologyKey]
if !ok {
continue // Don't include pods if node doesn't contain domain
Contributor:
nit: Could you change the comment to explain why not including the domain matters? As it is now, the comment basically just restates what the code already says.

Contributor (Author):
Added a link to all the details

Contributor:
Just checking - I don't see the update but maybe I'm missing it?

type TopologyGroup struct {
Constraint v1.TopologySpreadConstraint
Pods []*v1.Pod
// spread is an internal field used to track current spread
Contributor:
but what is spread in the first place, conceptually?

Contributor (Author):
Contributor:
Ok, maybe just clarify you mean the upstream sense of spread?

}
}

// Increment increments a domain if known
Contributor:
doesn't it increment the "spread" of a domain?

Contributor (Author):
yep!

}
}

// Spread chooses a domain that minimizes skew and increments its count
Contributor:
Maybe call this NextSpread() or something? Right now the name of the function sounds kinda stateless.

Contributor (Author):
Haha, I toyed with the idea of calling it Next() or Spread(). What do you think about NextDomain()?
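
For readers following the thread, here is a minimal sketch of the state being discussed: a per-domain counter ("spread" in the upstream sense) plus a pick-the-least-loaded-domain step. The names echo the review comments (Increment, NextDomain); the actual implementation in this PR may differ:

package topology

// topologyGroupSketch tracks, for one topology spread constraint, how many
// pods have been assigned to each domain (e.g. each zone or each hostname).
type topologyGroupSketch struct {
	spread map[string]int // domain -> number of pods scheduled into it
}

// Increment records another pod in a domain, but only if the domain is known.
func (t *topologyGroupSketch) Increment(domain string) {
	if _, ok := t.spread[domain]; ok {
		t.spread[domain]++
	}
}

// NextDomain picks the domain with the smallest count, i.e. the choice that
// minimizes skew, and increments its counter.
func (t *topologyGroupSketch) NextDomain() string {
	if len(t.spread) == 0 {
		return ""
	}
	var minDomain string
	minCount := -1
	for domain, count := range t.spread {
		if minCount == -1 || count < minCount {
			minDomain, minCount = domain, count
		}
	}
	t.spread[minDomain]++
	return minDomain
}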

@JacobGabrielson (Contributor):
Overall I didn't see any problems, just a few comments about documentation/comments.

@@ -46,15 +47,15 @@ import (

 const (
 	maxBatchWindow   = 10 * time.Second
-	batchIdleTimeout = 2 * time.Second
+	batchIdleTimeout = 1 * time.Second
Contributor:
🔥
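
The pair of constants implies a familiar batching pattern: keep collecting pods while they keep arriving within the idle timeout, but never hold a batch open longer than the max window. A generic sketch of that pattern (not Karpenter's actual batcher):

package batch

import "time"

const (
	maxBatchWindow   = 10 * time.Second
	batchIdleTimeout = 1 * time.Second
)

// collect drains items until the stream is idle for batchIdleTimeout or the
// overall maxBatchWindow elapses, whichever happens first.
func collect(in <-chan interface{}) []interface{} {
	var batch []interface{}
	window := time.After(maxBatchWindow)
	for {
		select {
		case item, ok := <-in:
			if !ok {
				return batch // channel closed
			}
			batch = append(batch, item)
		case <-time.After(batchIdleTimeout):
			return batch // no new item arrived within the idle timeout
		case <-window:
			return batch // hard cap on total wait time
		}
	}
}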


// 2. Separate pods into schedules of compatible scheduling constraints.
schedules, err := s.getSchedules(ctx, provisioner, pods)
if err != nil {
Contributor:
If this errors for some reason, then the injected labels do not get removed. It might be fine if we're always retrying a reconciliation so we rebuild everything. If something was parallelized while solving, the pods would be getting additional labels added and removed. This just seems a bit leaky. Do you think copying the pods would be too much overhead? Or maybe deferring the label deletion?

Contributor (Author):
Agreed. It felt a little bit dirty doing this in-memory mutation in general, but it really, really simplified the code. The extra annoying bit is that I can leave the injected zone in place, because it'll end up being the same string, but I can't know the hostname (since I can't specify EC2 instance IDs), so I have to rip out the hostname after scheduling.

The key detail is that the injected NodeSelector is never deleted from the pod in this reconcile loop, but it is rebuilt on the next reconcile loop. I'm wary of the complexity it adds to remove these, but it might be tenable if I do it in this function and don't allow it to leak anywhere else. What do you think?

Contributor:
I think it's worthwhile to remove it in this function. I'm not entirely sure where a leak would cause an issue, but if it does, it might be difficult to spot.
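
A sketch of the deferred-cleanup idea from this exchange (the function and parameter names are illustrative, as is the simplifying assumption that the injected keys weren't already set by the user):

package scheduling

import v1 "k8s.io/api/core/v1"

// withInjectedSelectors adds temporary node selector entries to the pods, runs
// the provided scheduling step, and strips the injected entries again before
// returning, even if scheduling fails. For simplicity this assumes the
// injected keys were not already present on the pods.
func withInjectedSelectors(pods []*v1.Pod, injected map[string]string, schedule func([]*v1.Pod) error) error {
	for _, pod := range pods {
		if pod.Spec.NodeSelector == nil {
			pod.Spec.NodeSelector = map[string]string{}
		}
		for k, v := range injected {
			pod.Spec.NodeSelector[k] = v
		}
	}
	defer func() {
		for _, pod := range pods {
			for k := range injected {
				delete(pod.Spec.NodeSelector, k)
			}
		}
	}()
	return schedule(pods)
}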

@JacobGabrielson JacobGabrielson self-requested a review August 27, 2021 22:08
@bwagner5 (Contributor) left a comment:
Had a small comment on the leakiness of the label injection to make the topology scheduling simpler, but overall, feel free to merge, looks great!

@ellistarn ellistarn merged commit a3e2d8c into aws:main Aug 30, 2021
@ellistarn ellistarn deleted the nodeaffinity branch August 30, 2021 18:01
gfcroft pushed a commit to gfcroft/karpenter-provider-aws that referenced this pull request Nov 25, 2023