-
Notifications
You must be signed in to change notification settings - Fork 27
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
♻️ Refactors webserver's healthcheck #2910
♻️ Refactors webserver's healthcheck #2910
Conversation
Codecov Report
@@ Coverage Diff @@
## master #2910 +/- ##
========================================
+ Coverage 74.8% 79.5% +4.7%
========================================
Files 670 671 +1
Lines 27737 27787 +50
Branches 3220 3224 +4
========================================
+ Hits 20764 22112 +1348
+ Misses 6247 4922 -1325
- Partials 726 753 +27
Flags with carried forward coverage won't be shown. Click here to find out more.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please find some questions and comments below
@validator("SC_HEALTHCHECK_TIMEOUT", pre=True) | ||
@classmethod | ||
def get_healthcheck_timeout_in_seconds(cls, v): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While I think it's ok to give flexibility, I don't see the benefits here.
I'd rather go for the best unit here, which looks to be milliseconds. At that point I'd rename the env var to SC_HEALTHCHECK_TIMEOUT_MS
(pattern already used) and add a note in the description.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that heathchecks we use are of the order of seconds or tens of seconds. I do not see the need to set this in ms
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are confusing me. That why even support minutes and milliseconds below since you would like to express everything in seconds?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that it would be wise to use the same timekeeing unit (MS) throughout the app. But large integer numbers also are not very clear. Difficult decision.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are confusing me. That why even support minutes and milliseconds below since you would like to express everything in seconds?
Let's see. Here are two facts that let me take this decision:
- https://docs.docker.com/engine/reference/builder/#healthcheck options use units
- this timeout is used in an
asyncio.wait_for(coro, timeout)
where the parameter timeout is in secods
services/web/server/src/simcore_service_webserver/rest_handlers.py
Outdated
Show resolved
Hide resolved
services/web/server/src/simcore_service_webserver/rest_handlers.py
Outdated
Show resolved
Hide resolved
services/web/server/src/simcore_service_webserver/rest_handlers.py
Outdated
Show resolved
Hide resolved
services/web/server/src/simcore_service_webserver/rest_healthcheck.py
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work!
so the docker engine is healthchecking every X interval seconds on the /health entrypoint (not the / entrypoint which is the one that should be constant speed right?) is that correct?
Then that health entrypoint will run some internal code with a timeout+10% of the one in the Dockerfile. What is the point of the 10% here? if you have a timeout you anyway return a 5xx or something similar right?
services/web/server/src/simcore_service_webserver/application_settings.py
Show resolved
Hide resolved
services/web/server/src/simcore_service_webserver/diagnostics_healthcheck.py
Outdated
Show resolved
Hide resolved
services/web/server/src/simcore_service_webserver/rest_handlers.py
Outdated
Show resolved
Hide resolved
try: | ||
heath_report: Dict[str, Any] = self.get_app_info(app) | ||
|
||
# TODO: every signal could return some info on the health on each part |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not sure I get what this TODO is about.
each healtchcheck slot does append to the report right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for the moment they do not append anything to the report, but yes, the idea is that in the future they can do so
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
still unclear why a TODO is needed for this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what we append to _on_healthcheck
are Callable
s that return None
.
The TODO is an idea that we could (or not) implement in the future: instead of returning None
we can return a Dict that can append current health_report
. This way every healthcheck can add a section.
services/web/server/src/simcore_service_webserver/rest_handlers.py
Outdated
Show resolved
Hide resolved
6d98eac
to
2a95541
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still have my doubts about the unit conversion. The rest is better now.
@validator("SC_HEALTHCHECK_TIMEOUT", pre=True) | ||
@classmethod | ||
def get_healthcheck_timeout_in_seconds(cls, v): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are confusing me. That why even support minutes and milliseconds below since you would like to express everything in seconds?
@validator("SC_HEALTHCHECK_TIMEOUT", pre=True) | ||
@classmethod | ||
def get_healthcheck_timeout_in_seconds(cls, v): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that it would be wise to use the same timekeeing unit (MS) throughout the app. But large integer numbers also are not very clear. Difficult decision.
services/web/server/src/simcore_service_webserver/application_settings.py
Show resolved
Hide resolved
@@ -111,7 +108,7 @@ def assert_healthy_app(app: web.Application) -> None: | |||
max_delay, | |||
max_delay_allowed, | |||
) | |||
raise HealthError(msg) | |||
raise HealthCheckFailed(msg) | |||
|
|||
# CRITERIA 2: Mean latency of the last N request slower than 1 sec |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd prefer to have especially the maximum allowed mean value not hardcoded as a magic number but settable via config. If this is the case, refer to the ENV-VAR that controls this in the comment instead of "1 sec"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was tuned after some experiments we did at a time.. I am open to change it when needed
What do these changes do?
Adds a healthcheck app state instance that allows every plugin to append a check function that can determine the service health. This healthcheck instance is run at the handler of
/health
that is added by therest
pluginWith this new design:
rest
core plugin setup and only theredb
plugin could add a check about db responsiveness, thediagnostics
plugin (when enabled) will include the result of the responsiveness analysis in the healthcheck and so on with other plugins.rest
plugin (a core plugin) knows nothing about diagnostics plugin (an addon plugin) but the other way around.Related issue/s
This PR follows as an improvement based on this review comment from PR #2906.
How to test
Checklist