AWS Cloud Provider now uses accurate pod/memory values when constructing node objects #572
Conversation
Nice work! Just a couple of comments.
if err := retry.Do(
	func() (err error) { return p.getInstance(ctx, id, instance) },
	retry.Delay(1*time.Second),
	retry.Attempts(3),
Do you have a reason why you picked 3 attempts of 1 second each? It seems pretty likely for EC2 to throttle for 3 seconds if you're making a lot of requests.
Can we make this a larger interval with more retries? If an instance creation fails, we'll have to do binpacking all over again anyway and attempt to create the instance.
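For illustration, here's a rough sketch of what a longer retry window could look like, mirroring the retry.Do call in the hunk above and assuming the same avast/retry-go options; the attempt count and backoff choice are placeholders, not a concrete proposal:

```go
// Sketch only: same retry.Do call as above, with more attempts and
// exponential backoff between them. The numbers are illustrative.
err := retry.Do(
	func() (err error) { return p.getInstance(ctx, id, instance) },
	retry.Delay(1*time.Second),
	retry.Attempts(6),                   // e.g. double the attempts
	retry.DelayType(retry.BackOffDelay), // ~1s, 2s, 4s, ... between attempts
)
```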
I haven't seen any throttling failures here, even in some preliminary 1k-node scale tests; the describe limits are actually quite high. I'm open to increasing these, but I'm a bit wary of the retries themselves causing cascading effects on other threads. Regardless, we'll recover gracefully w/ the reallocator if these retries do fail.
Cool. We should definitely keep track of this somewhere, especially when we get to larger cluster sizes.
pkg/controllers/allocation/bind.go
Outdated
errs := make([]error, len(pods))
workqueue.ParallelizeUntil(ctx, len(pods), len(pods), func(index int) {
	errs[index] = b.bind(ctx, node, pods[index])
})
logging.FromContext(ctx).Infof("Bound %d pod(s) to node %s", len(pods), node.Name)
Let's say that we had to bind 10 pods, and only 5 succeeded. This would log that we bound 10 pods, right? Can we change this to be more accurate?
pkg/controllers/allocation/bind.go
Outdated
errs := make([]error, len(pods))
workqueue.ParallelizeUntil(ctx, len(pods), len(pods), func(index int) {
	errs[index] = b.bind(ctx, node, pods[index])
})
-	return multierr.Combine(errs...)
+	err := multierr.Combine(errs...)
+	logging.FromContext(ctx).Infof("Bound %d/%d pod(s) to node %s", len(pods)-len(multierr.Errors(err)), len(pods), node.Name)
I don't know if we need fractions here. I think it'd be sufficient to just say how many did bind.
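For example, dropping the total and logging only the successful count; a sketch against the same errs/multierr pattern as above:

```go
// Sketch: count only the pods that actually bound and log that number.
err := multierr.Combine(errs...)
bound := len(pods) - len(multierr.Errors(err))
logging.FromContext(ctx).Infof("Bound %d pod(s) to node %s", bound, node.Name)
return err
```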
Issue, if available:
Description of changes:
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.