Surface health conditions of Karpenter resources as status conditions #493

Closed · 4 tasks · Tracked by #1051
njtran opened this issue Aug 29, 2023 · 17 comments

Labels: help wanted · kind/feature · needs-design · operational-excellence · v1

@njtran (Contributor) commented Aug 29, 2023

Description

What problem are you trying to solve?
Karpenter uses logs and events to alert users to both transient and unresolvable errors in their configurations. Logs can be spammy because of the fast pace of re-queues, and events can deliver delayed information about when a health issue is actually resolved (waiting until the de-dupe interval completes).

Beyond this concern, it may be useful for a user to monitor their NodePools or NodeClasses to determine whether they are all in a healthy state, and to fire alarms when certain status conditions are not maintained or, more generally, when the Readiness condition for either the NodePools or the NodeClasses isn't satisfied.

At a high level, we think the following status conditions would be nice to have (a sketch of what setting them might look like follows the list):

  • NodePools
    • NodeClassReady - Rolled-up status condition from the nodeClassRef that checks the generic readiness of the referenced NodeClass to determine whether it is usable
    • InstanceTypesExist - Validates that there are provisionable instance types with available offerings to use
    • NodesDisrupting - Indicates whether any nodes on the cluster are actively being disrupted through replacement or deletion. This condition would not have an effect on NodePool readiness; it would be purely informational, for alerting and tracking
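
A minimal sketch of how these conditions might be represented and set, using the upstream apimachinery condition helpers. The package, helper, and constant names here are illustrative assumptions, not Karpenter's actual API:

```go
package status

import (
	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// Hypothetical condition types, taken from the list above.
const (
	ConditionNodeClassReady     = "NodeClassReady"
	ConditionInstanceTypesExist = "InstanceTypesExist"
	ConditionNodesDisrupting    = "NodesDisrupting"
)

// setCondition records a condition on a NodePool's status and lets the
// apimachinery helper manage LastTransitionTime.
func setCondition(conditions *[]metav1.Condition, condType string, ok bool, reason, message string) {
	status := metav1.ConditionFalse
	if ok {
		status = metav1.ConditionTrue
	}
	meta.SetStatusCondition(conditions, metav1.Condition{
		Type:    condType,
		Status:  status,
		Reason:  reason,
		Message: message,
	})
}
```

Because `meta.SetStatusCondition` only updates `LastTransitionTime` when the status actually changes, conditions written this way stay stable across the frequent re-queues mentioned above.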

We already have logic that ignores certain NodePools during scheduling, and logic for handling cases where the NodeClass isn't ready. Both could rely strictly on these new status conditions to ensure that we only attempt to schedule and provision against resources that are in a ready state.

Additional Info
We could then export metrics from these status conditions so that users can alarm on and track them (a sketch below).
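
For example, a sketch of mirroring a condition into a Prometheus gauge that users could alert on. The metric name and labels are hypothetical, not an existing Karpenter metric:

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

// nodePoolCondition exposes each NodePool condition as a 0/1 gauge, e.g.
// karpenter_nodepool_status_condition{nodepool="default",condition="NodeClassReady"}.
var nodePoolCondition = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "karpenter_nodepool_status_condition", // hypothetical metric name
		Help: "1 if the condition is True on the NodePool, 0 otherwise.",
	},
	[]string{"nodepool", "condition"},
)

func init() {
	prometheus.MustRegister(nodePoolCondition)
}

// recordCondition mirrors a status condition into the gauge so that users
// can alarm when, say, NodeClassReady stays at 0.
func recordCondition(nodePool, conditionType string, isTrue bool) {
	value := 0.0
	if isTrue {
		value = 1.0
	}
	nodePoolCondition.WithLabelValues(nodePool, conditionType).Set(value)
}
```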

How important is this feature to you?

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments; they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
@jackfrancis (Contributor)

Is the scope of this proposal limited to the Status (type NodePoolStatus) of the type NodePool struct?

Or are there other CRD resources that we want to add/augment aggregated status data for?

@jonathan-innis (Member)

Or are there other CRD resources that we want to add/augment aggregated status data for

We'd probably want to scope this proposal to the NodeClass and NodeClaim as well. We have some status conditions that currently sit on the NodeClaim, but I could see us adding more detail here (I know there was a proposal that the NodeClaim readiness should accurately reflect the Node readiness through its status conditions as well).

For the NodeClass, it has been difficult to talk about this in a cloud-neutral context, but I could see us using tangible examples from AWS and Azure to arrive at some agreed-upon tenets for what the readiness of these objects might look like, so that we can rely on that readiness condition inside of the NodePool.

@jackfrancis (Contributor)

NodeClass seems to just be a redirect to a NodeClaim's NodeClassRef. What concrete type definition is a NodeClassRef referring to?

@jonathan-innis (Member)

It's the set of CloudProvider-specific parameters for a NodeClaim. We should probably enforce readiness in the status as part of the contract for a NodeClaim, and then we can check the readiness of any arbitrary NodeClass.

@jonathan-innis (Member)

For instance, right now, the AWS-specific NodeClass is the EC2NodeClass
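
One way to check readiness generically, whether the NodeClass is an EC2NodeClass or any other provider's type, is to duck-type against the conventional status.conditions shape. A minimal sketch, assuming (as part of the contract) that every NodeClass exposes a Ready condition in the standard metav1.Condition layout:

```go
package readiness

import "k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"

// nodeClassReady reports whether an arbitrary NodeClass object carries a
// Ready=True condition in status.conditions. It relies only on the shape of
// the status, not on the concrete NodeClass type.
func nodeClassReady(nodeClass *unstructured.Unstructured) (bool, error) {
	conditions, found, err := unstructured.NestedSlice(nodeClass.Object, "status", "conditions")
	if err != nil || !found {
		return false, err
	}
	for _, c := range conditions {
		cond, ok := c.(map[string]interface{})
		if !ok {
			continue
		}
		if cond["type"] == "Ready" {
			return cond["status"] == "True", nil
		}
	}
	return false, nil
}
```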

@sadath-12 (Contributor) commented Oct 24, 2023

The InstanceTypesExist part is covered by #617, where we display the resolved instance types themselves.

@sadath-12 (Contributor)

Splitting this into separate PRs.

@jonathan-innis (Member)

@sadath-12 If you're going to take a stab at this, can you make sure to do a design write-up for it? I think we need to carefully think through all the status conditions we want to surface and make sure we cover all of the use cases we expect from them.

Ideally, we can surface enough information through our status conditions that we can form alerting, eventing, and logging logic based purely on the status condition state transitions (a sketch below).
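
As an illustration of driving eventing off of state transitions, a sketch that emits a Kubernetes event only when a condition's status actually changes. The helper is hypothetical, but the apimachinery and client-go calls are standard:

```go
package alerting

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/client-go/tools/record"
)

// emitOnTransition fires an event only on a status transition, so eventing
// follows condition state changes rather than every reconcile loop.
func emitOnTransition(rec record.EventRecorder, obj runtime.Object,
	old []metav1.Condition, updated metav1.Condition) {
	prev := meta.FindStatusCondition(old, updated.Type)
	if prev != nil && prev.Status == updated.Status {
		return // no transition, nothing to emit
	}
	eventType := corev1.EventTypeNormal
	if updated.Status == metav1.ConditionFalse {
		eventType = corev1.EventTypeWarning
	}
	rec.Event(obj, eventType, updated.Reason, updated.Message)
}
```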

@sadath-12 (Contributor)

Ya sure @jonathan-innis

billrayburn added the "v1" label Oct 25, 2023
@tallaxes

Makes a lot of sense to me. @charliedmcb, @mattchr - note that the plan is to use #786 as a stepping stone.

@sadath-12 (Contributor)

/assign

@sftim commented Jan 9, 2024

  • I'd definitely prefer to have a condition on a NodeClaim that reflects whether Karpenter thinks it can see a matching Node or not.
  • When a NodePool is close to its capacity limit and all plausible offerings would take the NodePool over that limit, it'd be nice to have that reflected in a condition, ideally also noting which limit(s) have been reached, even if there are no pending Pods to schedule and no demand for new nodes (a sketch of such a check follows).
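
A hedged sketch of the limit check that could sit behind such a condition, comparing a NodePool's current usage against its configured limits. The helper and a "LimitExceeded"-style condition type are assumptions, not existing Karpenter API:

```go
package limits

import corev1 "k8s.io/api/core/v1"

// limitsReached returns the names of any resources whose usage is at or over
// the configured limit; a non-empty result could drive a LimitExceeded
// condition whose message lists the affected resources.
func limitsReached(usage, limits corev1.ResourceList) []corev1.ResourceName {
	var reached []corev1.ResourceName
	for name, limit := range limits {
		used, ok := usage[name]
		if !ok {
			continue
		}
		if used.Cmp(limit) >= 0 {
			reached = append(reached, name)
		}
	}
	return reached
}
```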

@jonathan-innis (Member)

prefer to have a condition on a NodeClaim that reflects whether Karpenter thinks it can see a matching Node or not

Is this different from our current Registered condition, or are you thinking more of a live condition that constantly reflects the existence of the Node?

@sftim commented Jan 10, 2024

a live condition that constantly reflects the existence of the Node?

Yes, that. If I force-delete an associated Node, I'd expect the condition to change moments before, or simultaneously with, the moment the NodeClaim starts to get finalized.

@tvonhacht-apple (Contributor)

Showing a summary of the associated NodeClaims and whether they are current or drifted, with something like 5/6, would give an idea of whether something is currently happening in the cluster.

@tvonhacht-apple (Contributor)

To debug whether a NodeClaim will get removed, is currently starting up, or has expired, it would be great to show that information as part of -o wide output when debugging a cluster (one possible shape is sketched below).
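
One conventional way to get there is CRD additionalPrinterColumns, where priority=1 columns only appear with -o wide. A sketch using kubebuilder markers; the exact condition names (Drifted, Expired) are assumptions about what the NodeClaim would surface:

```go
package v1

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// +kubebuilder:printcolumn:name="Ready",type="string",JSONPath=".status.conditions[?(@.type==\"Ready\")].status"
// +kubebuilder:printcolumn:name="Drifted",type="string",priority=1,JSONPath=".status.conditions[?(@.type==\"Drifted\")].status"
// +kubebuilder:printcolumn:name="Expired",type="string",priority=1,JSONPath=".status.conditions[?(@.type==\"Expired\")].status"
type NodeClaim struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Status NodeClaimStatus `json:"status,omitempty"`
}

// NodeClaimStatus holds the conditions the columns above read from.
type NodeClaimStatus struct {
	Conditions []metav1.Condition `json:"conditions,omitempty"`
}
```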

@njtran (Contributor, Author) commented Aug 12, 2024

Fixed as part of #1385 and #1401

njtran closed this as completed Aug 12, 2024