Add component healthcheck api design #34
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,108 @@ | ||
# Components Healthcheck In Dapr | ||
|
||
* Author(s): Deepanshu Agarwal (@DeepanshuA) | ||
* State: Draft | ||
* Updated: 07/20/2023 | ||
|
||
## Overview | ||
As a user, I should be able to query the health of the Dapr system. This is possible today, but the health check should also optionally include the health of connected Dapr components. | ||
|
||
## Tenets | ||
1. It should be extensible, i.e. if a feature to check the health of Actors etc. is required in the future, it should not require a new endpoint. | ||
|
||
## Current Scenario | ||
There are many components in Dapr which don't yet implement Ping. | ||
Comment from Alessandro (@ItalyPaleAle): That said, we should review components that don't implement Ping, and see if adding it would be useful.
Agreed |
||
Implementing Ping is not mandatory for components, which is the correct behavior, as it could lead to false positives. | ||
For this components health check, components that do not implement Ping will be omitted. | ||
|
||
## Use-Case | ||
If a mandatory* component fails at start-up, Dapr will terminate or move to a non-workable state such as CrashLoopBackoff, so neither the `healthz` API nor any other API can be used. | ||
|
||
After Dapr has started, if any mandatory component fails, this healthcheck can be used to determine which component has failed, so that appropriate steps can be taken. | ||
|
||
If an optional component fails, either at start-up or afterwards, the healthcheck can indicate to the App or downstream user system which component has failed. | ||
|
||
![App Usecase](./resources/0010-R-components-healthcheck/comp_Healthcheck.jpg) | ||
|
||
Note *: A mandatory component is one which does NOT have the property `spec.ignoreErrors` set to `true`. | ||
|
||
## API Design | ||
### Endpoint: | ||
Instead of adding a new endpoint, the approach described below adds a query parameter to the existing `healthz` endpoint. | ||
|
||
The following are the possible responses of the `healthz` API: | ||
|
||
| HTTP response code | Meaning | | ||
| -------- | -------- | | ||
| 204 | dapr is healthy | | ||
| 500 | dapr is not healthy | | ||
|
||
Hence, if Dapr is healthy, the response code depends on the health of its components, as per the cases listed below. | ||
If Dapr is NOT healthy, the components health check is not performed at all. | ||
|
||
http://localhost:3500/v1.0/healthz?include_components=true | ||
|
||
### Approach: | ||
- Maintain a cache with the status of all successfully loaded components, and keep updating this cache in a background goroutine at a configurable `pingUpdateFrequency`. By default, `pingUpdateFrequency` is 30 seconds. | ||
If a component is currently marked unhealthy in the cache, it is re-checked at an interval of `pingUpdateFrequency` / 3, i.e. every 10 seconds in the default case. | ||
|
||
- This cache will not start being built right at the boot of the daprd sidecar. An internal flag (let's say `collectPings`) will be `false` when the daprd sidecar starts and will be turned `true` once all the mandatory components are ready. | ||
Once `collectPings` is `true`, the cache will start to be populated. | ||
|
||
- But what happens if a component fails to initialize? | ||
If a mandatory component fails to initialize, then daprd will not come up healthy and will eventually terminate or be in an inconsistent state like CrashLoopBackoff. | ||
If an optional component fails to initialize, or is not healthy afterwards, the outcome is governed by a query parameter `ignore_optional_component`, which is `true` by default. As the name suggests, unless this query parameter is set to `false`, an optional component's failure to initialize or to report healthy will not result in an unhealthy status. Instead, the HTTP status code 207 will be reported back. | ||
|
||
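The polling interval and status-code rules above can be sketched in Go as follows. This is only an illustration under stated assumptions, not the proposed implementation: the names `componentHealth`, `nextPingInterval` and `healthzStatus` are hypothetical, and the proposal does not say which code is returned when a mandatory component becomes unhealthy after start-up or when `ignore_optional_component=false`; the sketch assumes 500 in both cases.

```go
package main

import (
	"net/http"
	"time"
)

// defaultPingUpdateFrequency mirrors the 30 second default from the proposal.
const defaultPingUpdateFrequency = 30 * time.Second

// componentHealth is a hypothetical cache entry for one component.
type componentHealth struct {
	Optional bool // true when spec.ignoreErrors is set on the component
	Healthy  bool // result of the most recent Ping
}

// nextPingInterval returns how long to wait before pinging a component again.
// Components currently marked unhealthy are re-checked three times as often.
func nextPingInterval(pingUpdateFrequency time.Duration, healthy bool) time.Duration {
	if healthy {
		return pingUpdateFrequency
	}
	return pingUpdateFrequency / 3
}

// healthzStatus maps the cached component states to an HTTP status code.
func healthzStatus(daprHealthy, includeComponents, ignoreOptional bool, cache []componentHealth) int {
	if !daprHealthy {
		// Components are not checked when Dapr itself is unhealthy.
		return http.StatusInternalServerError
	}
	if !includeComponents {
		return http.StatusNoContent
	}
	optionalFailed := false
	for _, c := range cache {
		if c.Healthy {
			continue
		}
		if !c.Optional {
			// ASSUMPTION: an unhealthy mandatory component yields 500.
			return http.StatusInternalServerError
		}
		optionalFailed = true
	}
	if optionalFailed && ignoreOptional {
		// Optional failures are tolerated by default and reported as 207.
		return http.StatusMultiStatus
	}
	if optionalFailed {
		// ASSUMPTION: with ignore_optional_component=false this yields 500.
		return http.StatusInternalServerError
	}
	return http.StatusNoContent
}
```

With the defaults, `nextPingInterval(defaultPingUpdateFrequency, false)` evaluates to 10 seconds, and a cache holding one unhealthy optional component produces 207 unless `ignoreOptional` is `false`.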
Working: | ||
|
||
![Internal Working](./resources/0010-R-components-healthcheck/comp_hcheck_working.jpeg) | ||
|
||
|
||
- Components which don't yet implement Ping will be ignored for the health check. | ||
|
||
- The response of the healthz endpoint will always be only an HTTP status code. | ||
If the App/user wants a detailed, per-component JSON result, the metadata API can be used for this. | ||
This is because the Dapr healthz endpoint is publicly accessible, so for security reasons it is not appropriate to return component names etc. in the response of the `healthz` endpoint. | ||
The result, i.e. the `status` of the health check and the `errorCode`/`message`, will be provided as part of the `metadata` API. | ||
|
||
Currently, this is what the metadata response body looks like: | ||
![Metadata](./resources/0010-R-components-healthcheck/metadata_current.jpg) | ||
|
||
If the metadata endpoint is queried with the query parameter `components_health` set to `true`, it will include the following: | ||
- Example for a healthy component: | ||
``` | ||
{ | ||
"name": "txnstore", | ||
"type": "state.redis", | ||
"version": "v1", | ||
"capabilities": [ | ||
"ETAG", | ||
"TRANSACTIONAL", | ||
"QUERY_API", | ||
"ACTOR" | ||
], | ||
"status": "OK" | ||
} | ||
``` | ||
- Example for an un-healthy component: | ||
``` | ||
{ | ||
"name": "txnstore", | ||
"type": "state.redis", | ||
"version": "v1", | ||
"capabilities": [ | ||
"ETAG", | ||
"TRANSACTIONAL", | ||
"QUERY_API", | ||
"ACTOR" | ||
], | ||
"status": "NOT_OK", | ||
"errorMessage": "redis store: error connecting to redis at localhost:6379: dial tcp 127.0.0.1:6379: connect: connection refused" | ||
} | ||
``` | ||
- For a component not implementing `Ping`, `status` will not be included. | ||
|
||
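One way to get the three response shapes above from a single type is to mark the health fields as optional when encoding. The following Go sketch assumes a hypothetical `componentMetadata` struct and `encodeComponent` helper; only the JSON field names are taken from the examples in this proposal, and the actual Dapr metadata types may differ.

```go
package main

import "encoding/json"

// componentMetadata is a hypothetical shape for one entry in the metadata
// response. Status and ErrorMessage use omitempty, so a component that does
// not implement Ping (empty Status) is encoded without those fields.
type componentMetadata struct {
	Name         string   `json:"name"`
	Type         string   `json:"type"`
	Version      string   `json:"version"`
	Capabilities []string `json:"capabilities"`
	Status       string   `json:"status,omitempty"`
	ErrorMessage string   `json:"errorMessage,omitempty"`
}

// encodeComponent returns the compact JSON encoding of one component entry.
func encodeComponent(c componentMetadata) (string, error) {
	b, err := json.Marshal(c)
	if err != nil {
		return "", err
	}
	return string(b), nil
}
```

Leaving `Status` empty drops both `status` and `errorMessage` from the output, while setting `Status` to `"OK"` or `"NOT_OK"` reproduces the healthy and unhealthy examples.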
|
||
- Only the HTTP endpoint will be implemented, at least for the first version of this API. | ||
|
||
- This endpoint is not intended to support querying the health of a particular component. |
Comment from Alessandro (@ItalyPaleAle): I don't know if making Ping mandatory is needed. A lot of components are stateless (for example, they don't maintain persistent connections with a remote service). IMHO it's fine to include Ping in an optional interface.
Ref: https://hackmd.io/MaNpUYRyQqe-0eqCsSUiIg
Yeah, it is optional right now, and that is the correct state in my opinion too. The doc also doesn't recommend making it mandatory.