-
Notifications
You must be signed in to change notification settings - Fork 225
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Surface health conditions of Karpenter resources as status conditions #493
Comments
Is the scope of this proposal limited to the Or are there other CRD resources that we want to add/augment aggregated status data for? |
We'd probably also want to scope this proposal to the For the |
|
It's the CloudProvider-specific parameters for a NodeClaim. We should probably enforce readiness in the status as part of the contract for a NodeClaim and then we can check the readiness of NodeClasses for any arbitrary NodeClass |
For instance, right now, the AWS-specific NodeClass is the |
|
splitting down the pr's |
@sadath-12 If you are going to take a stab at this, can you make sure to do a design write-up for this one? I think we need to carefully think through all the statusConditions that we want to surface and make sure that we are covering all of the use-cases that we expect off of it. Ideally, we can surface enough information through our status conditions that we can form alerting, eventing, and logging logic based purely of the status condition state transitions. |
Ya sure @jonathan-innis |
Makes a lot of sense to me. @charliedmcb, @mattchr - note that the plan is to use #786 as a stepping stone. |
/assign |
|
Is this different than our current Registered condition...or are you thinking more of a live condition that constantly reflects the existence of the Node? |
Yes, that. If I force delete an associated Node I'd expect the condition to change moments before, or simultaneous with, the moment NodeClaim starts to get finalized. |
Showing a summary of the node claims associated and if they are up to current state or currently in drift state by something like 5/6. Would give an idea if something is happening in the cluster currently. |
In order to debug if a nodeclaim will get removed, or currently is getting started, or has expired, it would be great to show that information as part of a |
Description
What problem are you trying to solve?
Karpenter uses logs and events to alert users about both fleeting and unresolvable errors about their configurations. Logs may be spammy due to the fast-nature of re-queues, and events can give delayed information on when a health issue is actually resolved (waiting until the de-dupe interval is complete).
On top of this concern, it may be useful for a user to monitor their NodePools or NodeClasses to determine if they are all in a healthy state and fire alarms when certain status conditions are not maintained or, more generally, when the Readiness condition for either the NodePools or the NodeClasses isn't satisfied.
High-level, I think we are thinking the following status conditions would be nice to have:
NodeClassReady
- Rolled-up status conditions from thenodeClassRef
which checks the generic readiness of this object to determine if the NodeClass is usableInstanceTypesExist
- Validates whether there are provisionable instance types with available offerings to useNodesDisrupting
- Describes whether there are any nodes on the cluster that are actively being disrupted through replacement or deletion. This condition would not have an affect on the NodePool readiness but would be purely informational for alerting and trackingWe have logic that is sitting in the code that ignores certain NodePools for scheduling and have some logic in the code for handling cases where the NodeClass isn't ready. This logic could strictly rely on these new status conditions to ensure that we are always attempting to schedule and provision nodes that are sitting in a ready state.
Additional Info
It's possible that we could then export metrics from these status conditions so that users can alarm/track these metrics.
How important is this feature to you?
The text was updated successfully, but these errors were encountered: