
AWS Cloud Provider now uses accurate pod/memory values when constructing node objects #572

Merged: 1 commit into aws:main on Jul 29, 2021

Conversation

@ellistarn (Contributor) commented on Jul 29, 2021

Issue, if available:

Description of changes:

v1.ResourcePods:   *instanceType.Pods(),
v1.ResourceCPU:    *instanceType.CPU(),
v1.ResourceMemory: *instanceType.Memory(),
  1. Resolves an issue where the scheduler would schedule additional pods onto nodes before they came online, because the reported capacity values were far too high (a rough sketch of the corrected capacity construction follows below).
  2. Deleted the NodeAPI abstraction and folded it into the InstanceProvider.
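
As a rough sketch of what that capacity construction might look like (hedged: instanceTypeLike is a hypothetical stand-in for the provider's instance type abstraction, matching the Pods()/CPU()/Memory() accessors in the snippet above; the surrounding node fields are illustrative rather than the exact Karpenter code):

import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// instanceTypeLike is a hypothetical interface mirroring the accessors used above.
type instanceTypeLike interface {
	Pods() *resource.Quantity
	CPU() *resource.Quantity
	Memory() *resource.Quantity
}

// nodeFor builds a node whose capacity mirrors the instance type's real pod,
// CPU, and memory limits instead of inflated placeholder values, so the
// scheduler does not overcommit pods before the node comes online.
func nodeFor(name string, instanceType instanceTypeLike) *v1.Node {
	return &v1.Node{
		ObjectMeta: metav1.ObjectMeta{Name: name},
		Status: v1.NodeStatus{
			Capacity: v1.ResourceList{
				v1.ResourcePods:   *instanceType.Pods(),   // ENI-limited max pods
				v1.ResourceCPU:    *instanceType.CPU(),    // vCPUs
				v1.ResourceMemory: *instanceType.Memory(), // memory
			},
		},
	}
}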

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@netlify (bot) commented on Jul 29, 2021

✔️ Deploy Preview for karpenter-docs-prod canceled.

🔨 Explore the source changes: a8b75d1

🔍 Inspect the deploy log: https://app.netlify.com/sites/karpenter-docs-prod/deploys/610316ff33af5d0008aa79f7

@ellistarn changed the title from "AWS Cloud Provider now uses accurate pod/memory values when construct…" to "AWS Cloud Provider now uses accurate pod/memory values when constructing node objects" on Jul 29, 2021
@njtran (Contributor) left a comment:

Nice work! Just a couple of comments.

if err := retry.Do(
func() (err error) { return p.getInstance(ctx, id, instance) },
retry.Delay(1*time.Second),
retry.Attempts(3),
@njtran (Contributor):

Do you have a reason why you picked 3 attempts of 1 second each? It seems pretty likely for EC2 to throttle for 3 seconds if you're making a lot of requests.

Can we make this a larger interval with more retries? If an instance creation fails, we'll have to do binpacking all over again anyways and attempt to create the instance.
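
For illustration, a longer retry window with exponential backoff might look roughly like this (a sketch using the Delay, DelayType, and Attempts options from github.com/avast/retry-go; the attempt count is a placeholder, not a tested recommendation, and the call mirrors the p.getInstance snippet above):

if err := retry.Do(
	func() (err error) { return p.getInstance(ctx, id, instance) },
	retry.Delay(1*time.Second),          // initial delay between attempts
	retry.DelayType(retry.BackOffDelay), // back off exponentially between retries
	retry.Attempts(6),                   // placeholder count: more headroom than the current 3
); err != nil {
	// propagate the error as the surrounding function already does
}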

@ellistarn (Contributor, Author):

I haven't seen any throttling failures here, even with some preliminary 1k node scale tests. I've never seen this one fail due to rate limiting, as the describe limits are actually quite high. I'm open to increasing these, but I'm a bit wary of the retries themselves causing cascading effects on other threads. Regardless, we will recover gracefully w/ the reallocator if these retries do fail.

@njtran (Contributor):

Cool. We should definitely keep track of this somewhere, especially when we get to larger cluster sizes.

pkg/cloudprovider/aws/instance.go (outdated; resolved)
errs := make([]error, len(pods))
workqueue.ParallelizeUntil(ctx, len(pods), len(pods), func(index int) {
errs[index] = b.bind(ctx, node, pods[index])
})
logging.FromContext(ctx).Infof("Bound %d pod(s) to node %s", len(pods), node.Name)
@njtran (Contributor):

Let's say that we had to bind 10 pods, and only 5 succeeded. This would log that we bound 10 pods, right? Can we change this to be more accurate?

Suggested change:

errs := make([]error, len(pods))
workqueue.ParallelizeUntil(ctx, len(pods), len(pods), func(index int) {
errs[index] = b.bind(ctx, node, pods[index])
})
-return multierr.Combine(errs...)
+err := multierr.Combine(errs...)
+logging.FromContext(ctx).Infof("Bound %d/%d pod(s) to node %s", len(pods)-len(multierr.Errors(err)), len(pods), node.Name)
+return err
@njtran (Contributor):

I don't know if we need fractions here. I think it'd be sufficient to just say how many did bind.
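
A minimal sketch of that simpler variant, logging just the count of successful binds (errs, pods, node, and ctx come from the snippet above; multierr.Errors returns the individual failures contained in the combined error):

err := multierr.Combine(errs...)
bound := len(pods) - len(multierr.Errors(err)) // only the binds that succeeded
logging.FromContext(ctx).Infof("Bound %d pod(s) to node %s", bound, node.Name)
return err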

@ellistarn merged commit 7ac2ea6 into aws:main on Jul 29, 2021
@ellistarn deleted the leak branch on July 29, 2021 at 22:19
gfcroft pushed a commit to gfcroft/karpenter-provider-aws that referenced this pull request Nov 25, 2023