♻️ Refactors webserver's healthcheck #2910

Merged

Conversation

pcrespov
Member

@pcrespov pcrespov commented Mar 21, 2022

What do these changes do?

Adds a healthcheck instance to the app state that allows every plugin to append a check function that helps determine the service health. This healthcheck instance is run in the handler of /health, which is added by the rest plugin.

With this new design:

  • health route is added during the rest core plugin setup and only there
  • the implementation of the app health-check can be extended by every plugin. For instance the db plugin could add a check about db responsiveness, the diagnostics plugin (when enabled) will include the result of the responsiveness analysis in the healthcheck and so on with other plugins.
  • reduces coupling between core and addon plugins, i.e. the rest plugin (a core plugin) knows nothing about the diagnostics plugin (an addon plugin), while the diagnostics plugin does know about the rest plugin.
  • 📝 SEE interesting article: How should I answer a health check? by J.Pallari
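
The design listed above can be sketched roughly as follows. This is a minimal illustration and not the code merged in this PR: the class name `HealthCheck`, its `append` and `run` methods, and the shape of the returned report are assumptions made for the example. Only the `HealthCheckFailed` exception name and the use of `asyncio.wait_for` with a timeout in seconds come from the discussion on this page.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, List


class HealthCheckFailed(Exception):
    """Raised by a check function when its part of the service is unhealthy."""


# Assumed signature: an async callable that raises HealthCheckFailed on failure
CheckFn = Callable[[], Awaitable[None]]


class HealthCheck:
    """Hypothetical registry kept in the app state; plugins append checks to it."""

    def __init__(self, timeout_s: float) -> None:
        self.timeout_s = timeout_s
        self._checks: List[CheckFn] = []

    def append(self, check: CheckFn) -> None:
        # Called by any plugin during its setup
        self._checks.append(check)

    async def run(self) -> Dict[str, Any]:
        async def _run_all() -> None:
            for check in self._checks:
                await check()

        try:
            # The whole set of checks must finish within the timeout (seconds)
            await asyncio.wait_for(_run_all(), timeout=self.timeout_s)
        except asyncio.TimeoutError as err:
            raise HealthCheckFailed("healthcheck timed out") from err
        return {"status": "ok", "checks": len(self._checks)}
```

In this sketch the /health handler would only call `run()` and map `HealthCheckFailed` to an error response, so the rest plugin never needs to import anything from an addon plugin.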

Related issue/s

This PR follows as an improvement based on this review comment from PR #2906.

How to test

pytest -vv tests/**/test_diag*.py

Checklist

@codecov

codecov bot commented Mar 21, 2022

Codecov Report

Merging #2910 (bb151f1) into master (5c5f2b7) will increase coverage by 4.7%.
The diff coverage is 92.0%.


@@           Coverage Diff            @@
##           master   #2910     +/-   ##
========================================
+ Coverage    74.8%   79.5%   +4.7%     
========================================
  Files         670     671      +1     
  Lines       27737   27787     +50     
  Branches     3220    3224      +4     
========================================
+ Hits        20764   22112   +1348     
+ Misses       6247    4922   -1325     
- Partials      726     753     +27     
Flag Coverage Δ
integrationtests 65.6% <64.0%> (-0.1%) ⬇️
unittests 75.2% <92.0%> (+5.1%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
.../simcore_service_webserver/application_settings.py 95.0% <76.9%> (+8.1%) ⬆️
.../src/simcore_service_webserver/rest_healthcheck.py 90.3% <90.3%> (ø)
...erver/src/simcore_service_webserver/diagnostics.py 100.0% <100.0%> (ø)
.../simcore_service_webserver/diagnostics_handlers.py 47.3% <100.0%> (+2.5%) ⬆️
...mcore_service_webserver/diagnostics_healthcheck.py 90.4% <100.0%> (ø)
...imcore_service_webserver/diagnostics_monitoring.py 100.0% <100.0%> (+9.5%) ⬆️
...s/web/server/src/simcore_service_webserver/rest.py 82.3% <100.0%> (+0.5%) ⬆️
...ver/src/simcore_service_webserver/rest_handlers.py 100.0% <100.0%> (+13.3%) ⬆️
...ector_v2/modules/comp_scheduler/background_task.py 83.3% <0.0%> (-8.4%) ⬇️
.../simcore_service_catalog/db/repositories/groups.py 72.9% <0.0%> (-5.5%) ⬇️
... and 67 more

@pcrespov pcrespov changed the title WIP: ♻️ Maintenance/webserver healthcheck ♻️ Maintenance/webserver healthcheck Mar 21, 2022
@pcrespov pcrespov changed the title ♻️ Maintenance/webserver healthcheck ♻️ Webserver's healthcheck Mar 21, 2022
@pcrespov pcrespov changed the title ♻️ Webserver's healthcheck ♻️ Refactors webserver's healthcheck Mar 21, 2022
@pcrespov pcrespov self-assigned this Mar 21, 2022
@pcrespov pcrespov added the a:webserver issue related to the webserver service label Mar 21, 2022
@pcrespov pcrespov added this to the E.Shackleton milestone Mar 21, 2022
@pcrespov pcrespov marked this pull request as ready for review March 21, 2022 22:15
Contributor

@GitHK GitHK left a comment

Please find some questions and comments below

Comment on lines +234 to +236
@validator("SC_HEALTHCHECK_TIMEOUT", pre=True)
@classmethod
def get_healthcheck_timeout_in_seconds(cls, v):
Contributor

While I think it's ok to give flexibility, I don't see the benefits here.

I'd rather go for the best unit here, which looks to be milliseconds. At that point I'd rename the env var to SC_HEALTHCHECK_TIMEOUT_MS (pattern already used) and add a note in the description.

Member Author

Note that the healthchecks we use are of the order of seconds or tens of seconds. I do not see the need to set this in ms.

Contributor

You are confusing me. Then why even support minutes and milliseconds below, since you would like to express everything in seconds?

Member

I agree that it would be wise to use the same timekeeping unit (ms) throughout the app. But large integer numbers are also not very clear. Difficult decision.

Member Author

You are confusing me. Then why even support minutes and milliseconds below, since you would like to express everything in seconds?

Let's see. Here are two facts that led me to this decision:

  1. the Docker HEALTHCHECK options (https://docs.docker.com/engine/reference/builder/#healthcheck) are expressed with units
  2. this timeout is used in an asyncio.wait_for(coro, timeout) where the parameter timeout is in seconds
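
To illustrate the two points above, here is a minimal sketch of a conversion from a duration with a unit to the seconds that `asyncio.wait_for` expects. It is not the validator from this PR: the function name `to_seconds`, the regular expression, and the supported units (only `ms`, `s` and `m`, the ones mentioned in this thread) are assumptions. Docker itself accepts a richer duration syntax that this sketch does not cover.

```python
import re
from typing import Union

# Assumed subset of units, taken from the units discussed in this thread
_UNIT_TO_SECONDS = {"ms": 1e-3, "s": 1.0, "m": 60.0}
_DURATION_RE = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*(ms|s|m)?\s*$")


def to_seconds(value: Union[str, int, float]) -> float:
    """Converts values such as "30s", "1m" or "500ms" to seconds.

    Plain numbers (or strings without a unit) are taken to be seconds.
    """
    if isinstance(value, (int, float)):
        return float(value)
    match = _DURATION_RE.match(str(value))
    if not match:
        raise ValueError(f"Invalid duration: {value!r}")
    number, unit = match.groups()
    return float(number) * _UNIT_TO_SECONDS[unit or "s"]
```

With a helper like this, the same string used for the Dockerfile option could be passed through the environment and the result handed directly to `asyncio.wait_for(coro, timeout)`.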

services/web/server/src/simcore_service_webserver/rest.py (outdated, resolved)
Member

@sanderegg sanderegg left a comment

Nice work!
So the Docker engine runs the healthcheck every X seconds (the interval) against the /health entrypoint (not the / entrypoint, which is the one that should respond at constant speed, right?). Is that correct?

Then that health entrypoint will run some internal code with a timeout that is 10% larger than the one in the Dockerfile. What is the point of the 10% here? If you hit the timeout, you return a 5xx or something similar anyway, right?

services/storage/tests/unit/test_rest.py (outdated, resolved)
services/web/Dockerfile (resolved)
try:
    heath_report: Dict[str, Any] = self.get_app_info(app)

    # TODO: every signal could return some info on the health on each part
Member

Not sure I get what this TODO is about.
Each healthcheck slot does append to the report, right?

Member Author

for the moment they do not append anything to the report, but yes, the idea is that in the future they can do so

Member

still unclear why a TODO is needed for this.

Member Author

What we append to _on_healthcheck are Callables that return None.
The TODO is an idea that we could (or not) implement in the future: instead of returning None, a check could return a Dict that is merged into the current health_report. This way every healthcheck can add its own section.
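
The idea behind the TODO could look roughly like the sketch below. It is only an illustration of the proposal and was not implemented in this PR: the function `build_health_report` and the `SectionFn` alias are hypothetical names, and the checks are assumed to be async callables.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, List, Optional

# Assumed signature: a check returns a dict with its own section, or None
SectionFn = Callable[[], Awaitable[Optional[Dict[str, Any]]]]


async def build_health_report(
    app_info: Dict[str, Any], checks: List[SectionFn]
) -> Dict[str, Any]:
    # Start from the general app info and let every check add a section
    report = dict(app_info)
    for check in checks:
        section = await check()
        if section:
            report.update(section)
    return report
```

Checks that still return None keep working unchanged, so plugins could adopt the richer return value one at a time.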

@pcrespov pcrespov force-pushed the maintenance/webserver-healthcheck branch from 6d98eac to 2a95541 Compare March 22, 2022 19:04
@pcrespov pcrespov requested a review from GitHK March 24, 2022 00:46
Contributor

@GitHK GitHK left a comment

Still have my doubts about the unit conversion. The rest is better now.


@@ -111,7 +108,7 @@ def assert_healthy_app(app: web.Application) -> None:
max_delay,
max_delay_allowed,
)
- raise HealthError(msg)
+ raise HealthCheckFailed(msg)

# CRITERIA 2: Mean latency of the last N request slower than 1 sec
Member

I'd prefer not to have the maximum allowed mean value, in particular, hardcoded as a magic number, but settable via config. If that is already the case, refer in the comment to the ENV-VAR that controls it instead of "1 sec".

Member Author

This was tuned after some experiments we did at the time. I am open to changing it when needed.
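
For reference, the suggestion in this thread could be sketched as below. This is not code from the PR: the environment variable name `EXAMPLE_MAX_MEAN_LATENCY_SECS` and both helper functions are hypothetical, and the default of 1 second is simply the value mentioned in the code comment under review.

```python
import os
from statistics import mean
from typing import Sequence


class HealthCheckFailed(Exception):
    """Raised when a health criterion is not met."""


# Hypothetical variable name, used only for this example
ENV_NAME = "EXAMPLE_MAX_MEAN_LATENCY_SECS"


def get_max_mean_latency(default: float = 1.0) -> float:
    # Falls back to the previously hardcoded value when the variable is unset
    return float(os.environ.get(ENV_NAME, default))


def assert_mean_latency_ok(latencies: Sequence[float], max_mean: float) -> None:
    # Fails when the mean latency of the last N requests exceeds the threshold
    if latencies and mean(latencies) > max_mean:
        raise HealthCheckFailed(
            f"Mean latency {mean(latencies):.2f}s exceeds {max_mean:.2f}s"
        )
```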

@pcrespov pcrespov merged commit a920852 into ITISFoundation:master Mar 24, 2022
@pcrespov pcrespov deleted the maintenance/webserver-healthcheck branch March 24, 2022 12:35