Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: create kubernetes events in case of insufficient capacity errors #3955

Closed
wants to merge 2 commits into from

Conversation

Fedosin
Copy link

@Fedosin Fedosin commented May 29, 2023

Fixes #

Description
We need to have improved visibility of ICE events when karpenter encounters them.

Does this change impact docs?

  • Yes, PR includes docs updates
  • Yes, issue opened: #
  • No

Release Note

Karpenter creates Kubernetes events in case of insufficient capacity errors.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@Fedosin Fedosin requested a review from a team as a code owner May 29, 2023 13:10
@Fedosin Fedosin requested a review from engedaam May 29, 2023 13:10
@netlify
Copy link

netlify bot commented May 29, 2023

Deploy Preview for karpenter-docs-prod canceled.

Name Link
🔨 Latest commit 784259d
🔍 Latest deploy log https://app.netlify.com/sites/karpenter-docs-prod/deploys/647758f8e9e8f200084b9f5a

We need to have improved visibility of ICE events when karpenter encounters
them.
Copy link
Contributor

@jonathan-innis jonathan-innis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the contribution! Just one thought on making this apply generally.

return nil, cloudprovider.NewInsufficientCapacityError(fmt.Errorf("all requested instance types were unavailable during launch"))
}
instance, err := c.instanceProvider.Create(ctx, nodeTemplate, machine, instanceTypes)
if err != nil {
if cloudprovider.IsInsufficientCapacityError(err) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible for us to surface this over in karpenter-core? I'm wondering if this might be a good place to surface the event when launch fails generally for this reason.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey, we were looking at this again today, a couple of things we noticed which we wanted to share with you if we are in the right direction.

What we would like to add

  • the ability to see the instance family, instance type to the event when karpenter is not able to provision for the same.

As from what we could see right now, in karpeneter-core here in the launchMachine step here https://github.com/aws/karpenter-core/blob/v0.28.0-rc.2/pkg/controllers/machine/lifecycle/launch.go#L107, we would not have enough information for the same to be planted inside the k8s event, as we just have access to the machine name there, which we have added as of now.

What we did notice was here in https://github.com/aws/karpenter/blob/main/pkg/cloudprovider/cloudprovider.go#L88, we could possibly make use of the types present there, what do you folks feel about the same? cc @Fedosin

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the ability to see the instance family, instance type to the event when karpenter

What's your ultimate goal here? Is it to track down all the instance types and offerings that are ICEd at a given time? If that's the case, then I suspect something like this might be a better mechanism to surface that.

I'm not opposed to surfacing those details also in the K8s event, I'm just curious what your ultimate goal is here.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jonathan-innis we are buildin a kubernetes controller around these things, so prometheus metrics are not really helpful for us. We'd like to have a kubernetes event object that we can react to.
Technically I don't have preferences where to create it - in karpenter or in karpenter-core. Both ways work for us :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, makes sense. I'm wondering if we should just create a generic event for each individual ICE error based on the instance type rather than creating the event directly on the Machine. It seems like what you are interested in here is getting the high-level eventing detail that there has been a generic ICE error on an instance-type/AZ combination and you aren't necessarily interested that a Machine launch failed generically due to ICE.

What are your thoughts on firing an event with no involved object that details the instance type and az of the instance type that received the ICE?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like what you are interested in here is getting the high-level eventing detail that there has been a generic ICE error on an instance-type/AZ combination and you aren't necessarily interested that a Machine launch failed generically due to ICE.

Hello there, @jonathan-innis . Thanks for the summary, yes, we are trying to have some more visibility added for the ICE event on the AZ/instance type combination.

What we are thinking in the direction further with this event is that, with that visibility, we would want our controller to stop recommending the instance family which we are trying to decide as a scheduling constraint for the application spec. To prevent a scenario where the application would stay in pending state due to non-availability of the instance-types in that instance family.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jonathan-innis Hello there, we tried addressing it here #3999, please have a look.

@Fedosin
Copy link
Author

Fedosin commented May 30, 2023

@jonathan-innis Thank you for your suggestion! I moved this code to karpenter-core: kubernetes-sigs/karpenter#356

@Fedosin Fedosin closed this May 31, 2023
@Fedosin Fedosin reopened this May 31, 2023
@bwagner5
Copy link
Contributor

bwagner5 commented Jun 5, 2023

It looks like other PRs are going to implement this change, so closing this one.

@bwagner5 bwagner5 closed this Jun 5, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants