Add component healthcheck api design #34

Open
wants to merge 11 commits into
base: main
21 changes: 16 additions & 5 deletions 0010-R-components-healthcheck.md
@@ -5,7 +5,7 @@
* Updated: 07/20/2023

## Overview
-As a user, I should be able to enquire the health of dapr system, which is possible currently - but this health check should also include health of connected dapr components as well.
+As a user, I should be able to enquire the health of dapr system, which is possible currently - but this health check should also optionally include health of connected dapr components as well.

## Tenets
1. It should be extensible i.e. tomorrow if any feature to check health of Actors etc. is required, it should not require a new endpoint.
@@ -15,6 +15,16 @@ There are many components in Dapr which don't yet implement Ping.
Ping is not mandatory to be implemented by Components, which is the correct behavior, as it could lead to false positives.
For this components health-check, components not implementing Ping will be omitted out.

## Use-Case
If a mandatory* component fails at the start-up, Dapr will terminate or will move to some non-workable state like CrashLoopBackoff etc., so `healthz` API or any other API can't be used.

After Dapr has started, if any Mandatory component fails, this healthcheck can be used to determine what component has failed and accordingly some steps acam be undertaken.

> After Dapr has started, if any Mandatory component fails, this healthcheck can be used to determine what component has failed and accordingly some steps can be undertaken.

During a period when the component health check is failing, what will be the effect on the sidecar's operation?

Will it be the same effect as with App Health Check? That is, as described in the App Healthcheck docs:

> When it detects a failure in the app’s health, Dapr stops accepting new work on behalf of the application by:
>
> - Unsubscribing from all pub/sub subscriptions
> - Stopping all input bindings
> - Short-circuiting all service-invocation requests, which terminate in the Dapr runtime and are not forwarded to the application
>
> These changes are meant to be temporary, and Dapr resumes normal operations once it detects that the application is responsive again.

@olitomlinson, Aug 1, 2023

Just my initial take

If a component health check fails, the sidecar should go into the same state as with a failed App Health Check.

This helps keep a consistent model, which makes it easier to reason about the behaviour of the sidecar while one or more probes/checks are failing.

@DeepanshuA (Author), Aug 1, 2023


In the case of App health, the Dapr sidecar needs to stop accepting events to process (that is why subscriptions, input bindings, and service-invocation requests need to be stopped), as the App itself is not healthy and can't process these events. So the Dapr sidecar doesn't know what to do with these events.

In the case of a mandatory component being reported as unhealthy, some or all features of this component would already be unusable. So the App can use this information to quickly report the status to either (1) some automated downstream system that fixes the issue or (2) DevOps, who may consider manual intervention.

But here, in the case of a mandatory component being unhealthy, Dapr itself will not perform any other operation. Rather, the App can decide to stop sending/receiving events via this component until it comes back.

@olitomlinson, Aug 1, 2023


From my perspective, if the Dapr sidecar knows that one or more mandatory components are not healthy, it makes little sense to allow the App to be invoked via PubSub/Service Invocation etc. (I see this as inviting a preventable failure.)

This is the same principle as App Health Checks: if you know the App (or some downstream dependency) is not healthy, don't invoke the App.

I absolutely see that some of those components may not always be needed, depending on the code path that is taken, so enacting the same behaviour as a failing App Health Check may seem heavy-handed / aggressive.

However, I prefer an aggressive health check strategy: "the sidecar is only as healthy as its weakest link" :)

The positive side effect of all of this is that it will encourage operators to scope Components to the Apps that depend on them, rather than leaving Components unscoped and applied to every App.


If an Optional component fails, either at start-up or afterwards, it can help indicate to App/down-stream user system that what component has failed.

![App Usecase](./resources/0010-R-components-healthcheck/comp_Healthcheck.jpg)

Note *: A mandatory component is one which does NOT have property `spec.ignoreErrors` set to True.
## API Design
### Endpoint:
Instead of an additional endpoint, the Approach underneath works with a query parameter, in addition to `healthz` endpoint.
@@ -27,19 +37,20 @@ Following are the possible responses for `healthz` API:
| 500 | dapr is not healthy |

Hence, if dapr is healthy, then the response code changes, as per the enlisted cases below.
-If dapr is NOT healthy, then the compoenents Health check should anyways NOT be checked.
+If dapr is NOT healthy, then the components Health check should anyways NOT be checked.

http://localhost:3500/v1.0/healthz?include_components=true
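
As a rough illustration of how a caller might use this query parameter, here is a minimal Go sketch that builds the URL above and issues a GET against it. The helper name `healthzURL` and the two-second timeout are assumptions made for the example; only the base address, the `/v1.0/healthz` path, and the `include_components` parameter come from the proposal.

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
	"time"
)

// healthzURL is a hypothetical helper: it builds the healthz URL and, when
// requested, adds the include_components query parameter from the proposal.
func healthzURL(baseURL string, includeComponents bool) (string, error) {
	u, err := url.Parse(baseURL)
	if err != nil {
		return "", err
	}
	u.Path = "/v1.0/healthz"
	if includeComponents {
		q := u.Query()
		q.Set("include_components", "true")
		u.RawQuery = q.Encode()
	}
	return u.String(), nil
}

func main() {
	target, err := healthzURL("http://localhost:3500", true)
	if err != nil {
		fmt.Println("bad base URL:", err)
		return
	}

	// The timeout is an arbitrary choice for this sketch.
	client := &http.Client{Timeout: 2 * time.Second}
	resp, err := client.Get(target)
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	defer resp.Body.Close()

	// Only the status code is inspected here.
	fmt.Println("healthz status:", resp.StatusCode)
}
```

Without the query parameter the helper returns the plain `healthz` URL, so existing callers would be unaffected.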

### Approach:
-- Maintain a cache with status of all components loaded successfully and keep updating this cache in a background go routine at a configurable `pingUpdateFrequency`. By default, `pingUpdateFrequency` to be 5 minutes.
+- Maintain a cache with status of all components loaded successfully and keep updating this cache in a background go routine at a configurable `pingUpdateFrequency`. By default, `pingUpdateFrequency` to be 30 seconds.
+If a component is marked un-healthy in cache currently, then `pingUpdateFrequency` to work as given `pingUpdateFrequency` / 3. i.e. In a default case, it would update every 10 seconds.

-- This cache will not start to be built, right at the boot of daprd sidecar. There will be flag (let's say `collectPings`), which will be `false` at the beginning of the daprd sidecar and which will be turned `true`, once all the components are ready.
+- This cache will not start to be built, right at the boot of daprd sidecar. There will be an internal flag (let's say `collectPings`), which will be `false` at the beginning of the daprd sidecar and which will be turned `true`, once all the components are ready.
Once, `collectPings` is `true`, the cache will start to be populated.

- But, what happens if a component fails to initialize?
If a mandatory component fails to initialize, then daprd will not come up healthy and thus, anyways health check will report unhealthy.
-If an optional component fails to initialize OR if it is not healthy afterwards as well, then it is governed by a query parameter `ignoreOptionalComponent`, which is `true` by default. So, as the name suggests, if this query param is not set to `false`, then optional components failure to initialize OR failure to report healthy will not result in a un-healthy status. Rather, the http status code 207 will be reported back.
+If an optional component fails to initialize OR if it is not healthy afterwards as well, then it is governed by a query parameter `ignore_optional_component`, which is `true` by default. So, as the name suggests, if this query param is not set to `false`, then optional components failure to initialize OR failure to report healthy will not result in a un-healthy status. Rather, the http status code 207 will be reported back.

- For components which don't yet implement Ping, they will be ignored for their health check.
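
The response rules in the list above could be sketched as a small decision function. This is only an illustration under stated assumptions, not the daprd implementation. `componentStatus` and `healthzStatus` are hypothetical names. The 204 returned for the fully healthy case is assumed from the existing `healthz` behaviour and is not fixed by the text shown here. Returning 500 when a mandatory component turns unhealthy after start-up is likewise an assumption. Components that do not implement Ping are expected to be left out of the slice before the function is called.

```go
package main

import (
	"fmt"
	"net/http"
)

// componentStatus is a hypothetical record of one cached component result.
type componentStatus struct {
	Name     string
	Optional bool // true when spec.ignoreErrors is set on the component
	Healthy  bool
}

// healthzStatus picks an HTTP status code from the cached component results.
// ignoreOptional mirrors the ignore_optional_component query parameter.
func healthzStatus(daprHealthy, ignoreOptional bool, comps []componentStatus) int {
	if !daprHealthy {
		// Per the proposal, components are not checked when dapr is unhealthy.
		return http.StatusInternalServerError
	}
	optionalFailed := false
	for _, c := range comps {
		if c.Healthy {
			continue
		}
		if !c.Optional || !ignoreOptional {
			// Assumption: any failure that is not ignored reports unhealthy (500).
			return http.StatusInternalServerError
		}
		optionalFailed = true
	}
	if optionalFailed {
		// Only ignored optional components failed: 207 per the proposal.
		return http.StatusMultiStatus
	}
	// Assumption: same success code as the existing healthz endpoint.
	return http.StatusNoContent
}

func main() {
	comps := []componentStatus{
		{Name: "statestore", Optional: false, Healthy: true},
		{Name: "cache", Optional: true, Healthy: false},
	}
	fmt.Println(healthzStatus(true, true, comps))
	fmt.Println(healthzStatus(true, false, comps))
}
```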
