healthcheck status events have inconsistent states #19237
Comments
There is no `health_status` event after the first health check run, and the first `health_status` value is inconsistent with the actual state; see containers/podman#19237. Work around that by running the health check more often, and waiting long enough to catch the second one.
These are internal transient states which don't need to be reflected in the UI. They happen quickly in bursts, with a "permanent state" event such as "create", "died", or "remove" following. This helps to reduce the API calls and thus mitigates out-of-order results; see containers/podman#19124.

We are not really interested in `podman exec` events, so we would like to ignore `exec_died` along with `exec`. However, it is the only thing that saves us from inconsistent `health_status` events (see containers/podman#19237). So we cannot rely on the latter event, but instead have to do a full update after each `exec_died`, as some of them are the health checks.

Also fix the alphabetical sorting of the remaining events.
The current workaround for us is to listen to `exec_died` events and do a full update on each of them, as described in the commit message above.
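A very rough shell rendering of that strategy (the real implementation lives in cockpit-podman's JavaScript; the commands and names below are only illustrative, not our actual code):

```sh
# Ignore health_status events entirely; instead treat every exec_died event
# as a hint that a health check may have run, and re-read the real state
# from podman inspect.
podman events --filter event=exec_died --format '{{.Name}}' |
while read -r name; do
    podman inspect --format '{{.Name}}: {{.State.Health.Status}}' "$name"
done
```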
That workaround is still imperfect, though: We are in a state with one health check run. Then the user clicks "Run health check" (the equivalent of `podman healthcheck run`), and `podman inspect` shows:

```json
"State": {
    "OciVersion": "1.1.0-rc.1",
    "Status": "running",
    "Running": true,
    "Paused": false,
    "Restarting": false,
    "OOMKilled": false,
    "Dead": false,
    "Pid": 15520,
    "ConmonPid": 15518,
    "ExitCode": 0,
    "Error": "",
    "StartedAt": "2023-07-14T10:07:14.064090116Z",
    "FinishedAt": "0001-01-01T00:00:00Z",
    "Health": {
        "Status": "healthy",
        "FailingStreak": 0,
        "Log": [
            {
                "Start": "2023-07-14T10:07:14.151866216Z",
                "End": "2023-07-14T10:07:14.261819433Z",
                "ExitCode": 0,
                "Output": ""
            }
        ]
    },
```

(There's only one entry in `Log`, i.e. one health check run so far.)
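For reference, the assumed CLI equivalents involved here (container name is a placeholder):

```sh
podman healthcheck run CONTAINER                              # what the "Run health check" button triggers
podman inspect --format '{{json .State.Health}}' CONTAINER    # the Health structure quoted above
```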
podman upstream report: containers/podman#19237 known issue cockpit-project#5003
I'll look into this today.
This may be an independent issue. If you'd like me to file it separately, I'm happy to. Thanks!
It's a race condition: writing the actual health status logs to disk happens after the event is generated. I think it's probably part of the same issue.
@martinpitt Can you test #19245? I think I got most of your problems. Lingering issues:
I installed the packages from #19245 in a clean F38 VM and re-did the reproducer above. Indeed this works much better now. For the failing case, I get a health status event right after startup, and another one after 30s. For the succeeding case, I get one immediately after startup, and again after 30s.
Ack, broken out as issue #19272. Thanks!
I need to read the docs a bit more to see if the
A friendly reminder that this issue had no activity for 30 days. |
bump
HC events were firing as part of the `exec` call, before it had even been decided whether the HC succeeded or failed. As such, the status was not going to be correct any time there was a change (e.g. the first event after a container went healthy to unhealthy would still read healthy).

Move the event into the actual Healthcheck function and throw it in a defer to make sure it happens at the very end, after logs are written. Ignores several conditions that did not log previously (container in question does not have a healthcheck, or an internal failure that should not really happen).

Still not a perfect solution. This relies on the HC log being written, when instead we could just get the status straight from the function writing the event - so if we fail to write the log, we can still report a bad status. But if the log wasn't written, we're in bad shape regardless - `podman ps` would disagree with the event written, for example.

Fixes containers#19237

Signed-off-by: Matt Heon <[email protected]>
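A rough way to exercise the behavior this commit message describes, on a build containing the fix (image, names, and the health command are placeholders, not taken from the commit):

```sh
# Healthy until the marker file appears; --health-retries 1 so a single
# failed check flips the container to unhealthy.
podman run -d --name hc-flip --health-cmd 'test ! -e /tmp/unhealthy' \
    --health-interval 5s --health-retries 1 quay.io/libpod/alpine sleep 600

# In another terminal: podman events --filter event=health_status

# Make the health check start failing ...
podman exec hc-flip touch /tmp/unhealthy

# ... then compare the next health_status event against the actual state:
podman inspect --format '{{.State.Health.Status}}' hc-flip
```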
Issue Description
I am investigating cockpit-podman's healthcheck tests, which currently fail very often. The API sends out the `health_status` events, but their reported values seem to lag behind what the state should be. It also sometimes differs from `podman ps`, and that by itself is inconsistent.

Steps to reproduce the issue
Open three terminals, set up some watching in two of them:
Failing health check
Succeeding health check
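A minimal sketch of such a setup covering both cases (image, container names, and timings are placeholders, not the original commands from this report):

```sh
# Terminal 1: watch the container list
watch podman ps -a

# Terminal 2: watch the event stream
podman events

# Terminal 3, failing health check:
podman run -d --name hc-fail --health-cmd false --health-start-period 5s \
    quay.io/libpod/alpine sleep 600

# Terminal 3, succeeding health check:
podman run -d --name hc-ok --health-cmd true --health-start-period 5s \
    quay.io/libpod/alpine sleep 600
```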
Describe the results you received
Failing healthcheck
Right after start, `podman ps` shows the health status as "starting", which is okay. It should not have done anything yet, due to `--health-start-period=5s`:

Apart from the usual init/start/pull/etc. events (I'm leaving out most of them, just keeping `init` to get a time reference), it immediately sends out a `starting` health check event, which is fine:

After 5 seconds, there is no event, and `ps -a` remains at "Up 12 seconds (starting)".

After 30 seconds, there is a new `health_status` event, but with the wrong value, whereas `ps -a` shows it as "unhealthy" already (which is expected):

Only after another 30s the next event has the correct value:
Passing healthcheck
For this, `podman ps` immediately shows "healthy", although I'd expect "starting" for the first 5s (as the health check should only run after 5s):

From then on, it never changes any more, e.g. I got "Up 8 seconds (healthy)" or "Up 33 seconds (healthy)".

The events show an immediate `health_status starting` event right after startup:

This is inconsistent with `podman ps`, which already shows it as "healthy".

But there is no event after 5s which would indicate that the health check ran and turn the status into "healthy". The next event only happens after 30s and then has the expected state "healthy":
And then I get the same event every 30s.
Describe the results you expected
The previous paragraph already explained my expectations. One issue is that it's unclear what `--health-start-period=5s` actually does -- it has no observable effect. The first health check run seems to happen immediately. Its documentation certainly does not directly say that it will delay the first health check run by that amount, but that would be my expectation.
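One way to probe this (image and name are placeholders): start a container with a long start period and immediately check whether a health check run has already been logged.

```sh
podman run -d --name hc-probe --health-cmd true --health-start-period 30s \
    quay.io/libpod/alpine sleep 600
sleep 2
# If the start period delayed the first run, Log would still be empty and the
# status would be "starting"; the report above observes an immediate first run.
podman inspect --format '{{.State.Health.Status}} after {{len .State.Health.Log}} run(s)' hc-probe
```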
Also, the value in the `health_status` event should always agree with the actual value from the API and `podman ps`.

Finally, there is a missing event after the first health check runs (after 0s or health-start-period). I.e. after "starting" there should be a first run whose result gets reported, either as "healthy" or "unhealthy". Our tests set the `--health-startup-interval` to 5 minutes to get a stable UI state (we only check the first run), so we don't see the second event (which then has the correct value).
podman info output