Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add component healthcheck api design #34

Open
wants to merge 11 commits into
base: main
Choose a base branch
from
51 changes: 51 additions & 0 deletions 0010-R-components-healthcheck.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,51 @@
# Components Healthcheck In Dapr

* Author(s): Deepanshu Agarwal (@DeepanshuA)
* State: Draft
* Updated: 07/20/2023

## Overview
As a user, I should be able to enquire the health of dapr system, which is possible currently - but this health check should also include health of connected dapr components as well.

## Tenets
1. It should be extensible i.e. tomorrow if any feature to check health of Actors etc. is required, it should not require a new endpoint.

## Current Scenario
There are many components in Dapr which don't yet implement Ping.
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment from Alessandro (@ItalyPaleAle ): I don't know if making Ping mandatory is needed. A lot of components are stateless (for example, they don't maintain persistent connections with a remove service). IMHO it's fine to include Ping in an optional interface.

Ref: https://hackmd.io/MaNpUYRyQqe-0eqCsSUiIg

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it is Optional only right now. And, that is the correct state in my opinion too. The doc also doesn't recommend it to make mandatory.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment from Alessandro (@ItalyPaleAle ): That said, we should review components that don't implement Ping, and see if adding it would be useful.

Ref: https://hackmd.io/MaNpUYRyQqe-0eqCsSUiIg

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed

Ping is not mandatory to be implemented by Components, which is the correct behavior, as it could lead to false positives.
For this components health-check, components not implementing Ping will be omitted out.

## API Design
### Endpoint:
Instead of an additional endpoint, the Approach underneath works with a query parameter, in addition to `healthz` endpoint.

Following are the possible responses for `healthz` API:

| HTTP | Response Codes |
| -------- | -------- |
| 204 | dapr is healthy |
| 500 | dapr is not healthy |

Hence, if dapr is healthy, then the response code changes, as per the enlisted cases below.
If dapr is NOT healthy, then the compoenents Health check should anyways NOT be checked.

http://localhost:3500/v1.0/healthz?include_components=true

### Approach:
- Maintain a cache with status of all components loaded successfully and keep updating this cache in a background go routine at a configurable `pingUpdateFrequency`. By default, `pingUpdateFrequency` to be 5 minutes.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some thoughts:

  • 5 mins seems a lot? That would mean the time to detect a failure can be as high as 5 mins. Maybe that's just my opinion.
  • Ping updates may be more frequent if the component is un-healthy, since we want to see if we can recover faster

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • I agree that 5 minutes seems higher. I have updated that to 30 seconds.
  • I like this point but do we need another configuration for this? One more config will be too much, so rather gave logic that in case of unhealthy, it will Ping every pingUpdateFrequency / 3.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If un-healthy we should use an exponential backoff (with a limit of pingUpdateFrequency) IMHO, to avoid overloading a service.


- This cache will not start to be built, right at the boot of daprd sidecar. There will be flag (let's say `collectPings`), which will be `false` at the beginning of the daprd sidecar and which will be turned `true`, once all the components are ready.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How are these options passed? Are they in the Configuration CRD?

Copy link
Author

@DeepanshuA DeepanshuA Aug 1, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

collectPings is an Internal flag.
For pingUpdateFrequency, yeah Configuration CRD would be the place.

Once, `collectPings` is `true`, the cache will start to be populated.

- But, what happens if a component fails to initialize?
If a mandatory component fails to initialize, then daprd will not come up healthy and thus, anyways health check will report unhealthy.
If an optional component fails to initialize OR if it is not healthy afterwards as well, then it is governed by a query parameter `ignoreOptionalComponent`, which is `true` by default. So, as the name suggests, if this query param is not set to `false`, then optional components failure to initialize OR failure to report healthy will not result in a un-healthy status. Rather, the http status code 207 will be reported back.
ItalyPaleAle marked this conversation as resolved.
Show resolved Hide resolved

- For components which don't yet implement Ping, they will be ignored for their health check.

- Response of healthz endpoint will be always only a http status code.
The result i.e. `healthCheckStatus` and `errorCode`/`message`, if required will be provided as part of `metadata` API. This is done, as dapr healthz endpoint is publicly accessible, so for security constraints, it doesn't deem good to provide component names etc. as a response of `healthz` endpoint.

- To implement only http endpoint, at least for the first version of this API.

- This endpoint is not supposed to support query for a particular component Health check.