[Proposal] Enhance Resource Health Monitoring within App CR #717
---
title: "Enhancing Resource Health Monitoring within App CR"
authors: [ "Varsha Prasad Narsing <[email protected]>" ]
status: "in review"
approvers: [ "Carvel Maintainers" ]
---
# Enhancing Resource Health Monitoring within App CR

## Problem Statement

After applying the package contents fetched from the source to the cluster through the App CR, it is challenging to ascertain the real-time health of the individual resources in the cluster. This lack of visibility poses a significant challenge for SRE/ops teams tasked with overseeing hundreds or even thousands of clusters. Integrating this information into kapp-controller would establish a centralized and reliable source of truth for monitoring purposes.
## Terminology / Concepts

## Proposal

We intend to extend the existing App API by adding a new status condition that exposes the system's health. To do so, the following needs to be implemented:

1. The controller reconciling the App CR needs to dynamically set up watches for the resources deployed by the package.
2. Introduce a `Healthy` condition in the App CR's `status` [field][app_cr_status].
This field would be set to:
1. `True`: if all the installed resources are healthy.
2. `False`: if there is any error during the `fetch`, `install`, or `deploy` steps, or if any of the resources are unhealthy.
The proposed criteria for determining whether a resource is healthy are described below:
1. A Pod resource will be healthy if/when:
   - `.status.phase` is Running or Completed.
   - If `.status.phase` is Running, the `Ready` status condition is True.

2. A ReplicationController resource will be healthy if/when:
   - `.spec.replicas` matches `.status.availableReplicas`.
   - A `ReplicaFailure` type condition is not present in the status.

3. A ReplicaSet resource will be healthy if/when:
   - `.spec.replicas` matches `.status.availableReplicas`.
   - A `ReplicaFailure` type condition is not present in the status.

4. A Deployment resource will be healthy if/when:
   - `.spec.replicas` matches `.status.availableReplicas`.
   - The `Available` type condition is true.

5. A StatefulSet resource will be healthy if/when:
   - `.spec.replicas` matches `.status.availableReplicas`.

6. A DaemonSet resource will be healthy if/when:
   - `.status.desiredNumberScheduled` matches `.status.numberAvailable`.

7. An APIService resource will be healthy if/when:
   - The `Available` type condition in the status is true.

8. A CustomResourceDefinition resource will be healthy if/when:
   - `.status.storedVersions` contains the expected API version for the CRD.

9. All other unspecified resources will be considered healthy.
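As an illustration of the Deployment criteria above, a Deployment in the following state would be considered healthy, since `.spec.replicas` matches `.status.availableReplicas` and the `Available` condition is true (abridged manifest; the name is hypothetical):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example        # hypothetical name, for illustration only
spec:
  replicas: 3
status:
  availableReplicas: 3 # matches .spec.replicas -> healthy
  conditions:
  - type: Available
    status: "True"
```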
If any of the watched resources are unhealthy, the `message` field of the `Healthy` condition will contain the statuses of the unhealthy resources, ordered lexicographically.
Since informers are created for the resources deployed by the App reconciler, any change in a resource's state will trigger a reconcile, which in turn re-evaluates the health of all resources.
#### Use Case: Monitoring the state of resources

kapp currently has the `inspect` command, which lists the deployed resources and their current statuses. The output of this command is also printed as part of the App's status if enabled through the `inspect` option under `deploy.kapp` while creating the CR.
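For reference, enabling `inspect` in the App spec looks roughly like the following (abridged; only the `deploy` section is shown, and the App name is hypothetical):

```yaml
apiVersion: kappctrl.k14s.io/v1alpha1
kind: App
metadata:
  name: example-app       # hypothetical name
spec:
  # ...fetch and template sections omitted...
  deploy:
  - kapp:
      inspect: {}         # enables inspect output in the App status
```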
|
||
Though this command provides information about the resources created by the respective App CR, it does so by sending API requests during the reconciliation. Instead, using informers provides additional advantages of having real-time updates, efficient resource utilization and reduced load on API server. | ||
### Other Approaches:

## Open Questions:

1. Can using informers to watch resources increase the cache size, potentially impacting performance?
2. Can the output in the `inspect` status field be combined with that of the proposed `Healthy` condition?
[app_cr_status]: https://pkg.go.dev/github.com/vmware-tanzu/carvel-kapp-controller/pkg/apis/kappctrl/v1alpha1#AppStatus