Surface health conditions of Karpenter resources as status conditions #493

Closed · 4 tasks · Tracked by #1051
njtran opened this issue Aug 29, 2023 · 17 comments

Labels: help wanted · kind/feature · needs-design · operational-excellence · v1

@njtran (Contributor) commented Aug 29, 2023

Description

What problem are you trying to solve?
Karpenter uses logs and events to alert users to both transient and unresolvable errors in their configurations. Logs can be spammy because of the fast pace of re-queues, and events can deliver delayed information about when a health issue is actually resolved (waiting until the de-dupe interval completes).

Beyond this concern, it may be useful for a user to monitor their NodePools or NodeClasses to determine whether they are all in a healthy state, and to fire alarms when certain status conditions are not maintained or, more generally, when the Readiness condition for either the NodePools or the NodeClasses isn't satisfied.

At a high level, we think the following status conditions would be nice to have (a sketch of what setting them might look like follows the list):

  • NodePools
    • NodeClassReady - Rolled-up status condition from the nodeClassRef that checks the generic readiness of the referenced NodeClass to determine whether it is usable
    • InstanceTypesExist - Validates that there are provisionable instance types with available offerings to use
    • NodesDisrupting - Indicates whether any nodes on the cluster are actively being disrupted through replacement or deletion. This condition would not have an effect on NodePool readiness; it would be purely informational, for alerting and tracking
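
A minimal sketch of how these conditions might be represented and set, using the upstream apimachinery condition helpers. The package, helper, and constant names here are illustrative assumptions, not Karpenter's actual API:

```go
package status

import (
	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// Hypothetical condition types, taken from the list above.
const (
	ConditionNodeClassReady     = "NodeClassReady"
	ConditionInstanceTypesExist = "InstanceTypesExist"
	ConditionNodesDisrupting    = "NodesDisrupting"
)

// setCondition records a condition on a NodePool's status and lets the
// apimachinery helper manage LastTransitionTime.
func setCondition(conditions *[]metav1.Condition, condType string, ok bool, reason, message string) {
	status := metav1.ConditionFalse
	if ok {
		status = metav1.ConditionTrue
	}
	meta.SetStatusCondition(conditions, metav1.Condition{
		Type:    condType,
		Status:  status,
		Reason:  reason,
		Message: message,
	})
}
```

Because `meta.SetStatusCondition` only updates `LastTransitionTime` when the status actually changes, conditions written this way stay stable across the frequent re-queues mentioned above.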

We already have logic that ignores certain NodePools during scheduling, and logic for handling cases where the NodeClass isn't ready. Both could rely strictly on these new status conditions to ensure that we only attempt to schedule and provision against resources that are in a ready state.

Additional Info
We could then export metrics from these status conditions so that users can alarm on and track them (a sketch below).
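
For example, a sketch of mirroring a condition into a Prometheus gauge that users could alert on. The metric name and labels are hypothetical, not an existing Karpenter metric:

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

// nodePoolCondition exposes each NodePool condition as a 0/1 gauge, e.g.
// karpenter_nodepool_status_condition{nodepool="default",condition="NodeClassReady"}.
var nodePoolCondition = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "karpenter_nodepool_status_condition", // hypothetical metric name
		Help: "1 if the condition is True on the NodePool, 0 otherwise.",
	},
	[]string{"nodepool", "condition"},
)

func init() {
	prometheus.MustRegister(nodePoolCondition)
}

// recordCondition mirrors a status condition into the gauge so that users
// can alarm when, say, NodeClassReady stays at 0.
func recordCondition(nodePool, conditionType string, isTrue bool) {
	value := 0.0
	if isTrue {
		value = 1.0
	}
	nodePoolCondition.WithLabelValues(nodePool, conditionType).Set(value)
}
```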

How important is this feature to you?

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments; they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
@jackfrancis (Contributor)

Is the scope of this proposal limited to the Status (type NodePoolStatus) of the type NodePool struct?

Or are there other CRD resources that we want to add/augment aggregated status data for?

@jonathan-innis (Member)

Or are there other CRD resources that we want to add/augment aggregated status data for

We'd probably want to scope this proposal to the NodeClass and NodeClaim as well. We have some status conditions that currently sit on the NodeClaim, but I could see us adding more detail here (I know there was a proposal that the NodeClaim readiness should accurately reflect the Node readiness through its status conditions as well).

For the NodeClass, it has been difficult to talk about this in a cloud-neutral context, but I could see us using tangible examples from AWS and Azure to arrive at some agreed-upon tenets for what the readiness of these objects might look like, so that we can rely on that readiness condition inside of the NodePool.

@jackfrancis (Contributor)

NodeClass seems to just be a redirect to a NodeClaim's NodeClassRef. What concrete type definition is a NodeClassRef referring to?

@jonathan-innis (Member)

It's the set of CloudProvider-specific parameters for a NodeClaim. We should probably enforce readiness in the status as part of the contract for a NodeClaim, and then we can check the readiness of any arbitrary NodeClass.

@jonathan-innis (Member)

For instance, right now, the AWS-specific NodeClass is the EC2NodeClass
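
One way to check readiness generically, whether the NodeClass is an EC2NodeClass or any other provider's type, is to duck-type against the conventional status.conditions shape. A minimal sketch, assuming (as part of the contract) that every NodeClass exposes a Ready condition in the standard metav1.Condition layout:

```go
package readiness

import "k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"

// nodeClassReady reports whether an arbitrary NodeClass object carries a
// Ready=True condition in status.conditions. It relies only on the shape of
// the status, not on the concrete NodeClass type.
func nodeClassReady(nodeClass *unstructured.Unstructured) (bool, error) {
	conditions, found, err := unstructured.NestedSlice(nodeClass.Object, "status", "conditions")
	if err != nil || !found {
		return false, err
	}
	for _, c := range conditions {
		cond, ok := c.(map[string]interface{})
		if !ok {
			continue
		}
		if cond["type"] == "Ready" {
			return cond["status"] == "True", nil
		}
	}
	return false, nil
}
```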

@sadath-12 (Contributor) commented Oct 24, 2023

The InstanceTypesExist part is covered by #617, where we display the resolved instance types themselves.

@sadath-12 (Contributor)

Splitting this into separate PRs.

@jonathan-innis (Member)

@sadath-12 If you're going to take a stab at this, can you make sure to do a design write-up for it? I think we need to carefully think through all the status conditions we want to surface and make sure we cover all of the use cases we expect from them.

Ideally, we can surface enough information through our status conditions that we can form alerting, eventing, and logging logic based purely on the status condition state transitions (a sketch below).
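
As an illustration of driving eventing off of state transitions, a sketch that emits a Kubernetes event only when a condition's status actually changes. The helper is hypothetical, but the apimachinery and client-go calls are standard:

```go
package alerting

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/client-go/tools/record"
)

// emitOnTransition fires an event only on a status transition, so eventing
// follows condition state changes rather than every reconcile loop.
func emitOnTransition(rec record.EventRecorder, obj runtime.Object,
	old []metav1.Condition, updated metav1.Condition) {
	prev := meta.FindStatusCondition(old, updated.Type)
	if prev != nil && prev.Status == updated.Status {
		return // no transition, nothing to emit
	}
	eventType := corev1.EventTypeNormal
	if updated.Status == metav1.ConditionFalse {
		eventType = corev1.EventTypeWarning
	}
	rec.Event(obj, eventType, updated.Reason, updated.Message)
}
```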

@sadath-12 (Contributor)

Ya sure @jonathan-innis

billrayburn added the "v1" label Oct 25, 2023
@tallaxes

Makes a lot of sense to me. @charliedmcb, @mattchr - note that the plan is to use #786 as a stepping stone.

@sadath-12 (Contributor)

/assign

@sftim commented Jan 9, 2024

  • I'd definitely prefer to have a condition on a NodeClaim that reflects whether Karpenter thinks it can see a matching Node or not.
  • When a NodePool is close to its capacity limit and all plausible offerings would take the NodePool over that limit, it'd be nice to have that reflected in a condition, ideally also noting which limit(s) have been reached, even if there are no pending Pods to schedule and no demand for new nodes (a sketch of such a check follows).
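
A hedged sketch of the limit check that could sit behind such a condition, comparing a NodePool's current usage against its configured limits. The helper and a "LimitExceeded"-style condition type are assumptions, not existing Karpenter API:

```go
package limits

import corev1 "k8s.io/api/core/v1"

// limitsReached returns the names of any resources whose usage is at or over
// the configured limit; a non-empty result could drive a LimitExceeded
// condition whose message lists the affected resources.
func limitsReached(usage, limits corev1.ResourceList) []corev1.ResourceName {
	var reached []corev1.ResourceName
	for name, limit := range limits {
		used, ok := usage[name]
		if !ok {
			continue
		}
		if used.Cmp(limit) >= 0 {
			reached = append(reached, name)
		}
	}
	return reached
}
```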

@jonathan-innis (Member)

prefer to have a condition on a NodeClaim that reflects whether Karpenter thinks it can see a matching Node or not

Is this different from our current Registered condition, or are you thinking more of a live condition that constantly reflects the existence of the Node?

@sftim commented Jan 10, 2024

a live condition that constantly reflects the existence of the Node?

Yes, that. If I force-delete an associated Node, I'd expect the condition to change moments before, or simultaneously with, the moment the NodeClaim starts to get finalized.

@tvonhacht-apple (Contributor)

Showing a summary of the associated NodeClaims and whether they are current or drifted, with something like 5/6, would give an idea of whether something is currently happening in the cluster.

@tvonhacht-apple (Contributor)

To debug whether a NodeClaim will get removed, is currently starting up, or has expired, it would be great to show that information as part of -o wide output when debugging a cluster (one possible shape is sketched below).
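
One conventional way to get there is CRD additionalPrinterColumns, where priority=1 columns only appear with -o wide. A sketch using kubebuilder markers; the exact condition names (Drifted, Expired) are assumptions about what the NodeClaim would surface:

```go
package v1

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// +kubebuilder:printcolumn:name="Ready",type="string",JSONPath=".status.conditions[?(@.type==\"Ready\")].status"
// +kubebuilder:printcolumn:name="Drifted",type="string",priority=1,JSONPath=".status.conditions[?(@.type==\"Drifted\")].status"
// +kubebuilder:printcolumn:name="Expired",type="string",priority=1,JSONPath=".status.conditions[?(@.type==\"Expired\")].status"
type NodeClaim struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Status NodeClaimStatus `json:"status,omitempty"`
}

// NodeClaimStatus holds the conditions the columns above read from.
type NodeClaimStatus struct {
	Conditions []metav1.Condition `json:"conditions,omitempty"`
}
```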

@njtran (Contributor, Author) commented Aug 12, 2024

Fixed as part of #1385 and #1401

njtran closed this as completed Aug 12, 2024