healthcheck status events have inconsistent states #19237

Closed
martinpitt opened this issue Jul 14, 2023 · 10 comments · Fixed by #19245

@martinpitt (Contributor)

Issue Description

I am investigating cockpit-podman's healthcheck tests, which currently fail very often. The API sends out health_status events, but their reported values seem to lag behind what the state should be. They also sometimes differ from podman ps, which is itself inconsistent.

Steps to reproduce the issue

Open three terminals, set up some watching in two of them:

 watch -n1 podman ps -a
 clear; podman events

Failing health check

podman run --rm --health-cmd=false --health-start-period=5s --health-startup-timeout=1 --health-retries=1  quay.io/prometheus/busybox sleep infinity

Succeeding health check

podman run --rm --health-cmd=true --health-start-period=5s --health-startup-timeout=1 --health-retries=1  quay.io/prometheus/busybox sleep infinity

Describe the results you received

Failing healthcheck

Right after start, podman ps shows the health status as "starting", which is okay. It should not have done anything yet, due to --health-start-period=5s:

fdae48c5aa11  quay.io/prometheus/busybox:latest  sleep infinity  2 seconds ago  Up 2 seconds (starting)              kind_clarke

Apart from the usual init/start/pull/etc. events (I'm leaving out most of them and keeping only init as a time reference), it immediately sends out a "starting" health check event, which is fine:

2023-07-14 06:57:27.098931797 +0000 UTC container init fdae48c5aa11b9947bfbbdda36fc6c33e9cfffeba6f7dc5383cf6ecd9b93d3b0 (image=quay.io/prometheus/busybox:latest, name=kind_clarke)
2023-07-14 06:57:27.164906579 +0000 UTC container health_status fdae48c5aa11b9947bfbbdda36fc6c33e9cfffeba6f7dc5383cf6ecd9b93d3b0 (image=quay.io/prometheus/busybox:latest, name=kind_clarke, health_status=starting)

After 5 seconds, there is no event, and ps -a remains at "Up 12 seconds (starting)".

After 30 seconds, there is a new health_status event, but with the wrong value:

2023-07-14 06:57:57.504435576 +0000 UTC container health_status fdae48c5aa11b9947bfbbdda36fc6c33e9cfffeba6f7dc5383cf6ecd9b93d3b0 (image=quay.io/prometheus/busybox:latest, name=kind_clarke, health_status=starting)

whereas ps -a shows it as "unhealthy" already (which is expected):

fdae48c5aa11  quay.io/prometheus/busybox:latest  sleep infinity  40 seconds ago  Up 40 seconds (unhealthy)              kind_clarke

Only after another 30s does the next event have the correct value:

2023-07-14 06:58:27.773311668 +0000 UTC container health_status fdae48c5aa11b9947bfbbdda36fc6c33e9cfffeba6f7dc5383cf6ecd9b93d3b0 (image=quay.io/prometheus/busybox:latest, name=kind_clarke, health_status=unhealthy)

fdae48c5aa11  quay.io/prometheus/busybox:latest  sleep infinity  About a minute ago  Up About a minute (unhealthy)              kind_clarke

Passing healthcheck

For this, podman ps immediately shows "healthy", although I'd expect "starting" for the first 5s (as the health check should only run after 5s):

be79f8a88c0f  quay.io/prometheus/busybox:latest  sleep infinity  Less than a second ago  Up Less than a second (healthy)              nervous_chaplygin

From then on, it never changes any more, e.g. I got "Up 8 seconds (healthy)" or "Up 33 seconds (healthy)".

The events show an immediate health_status starting event right after startup:

2023-07-14 07:05:42.082377526 +0000 UTC container init be79f8a88c0f520606b9979be7dc17b45c5bf4b4ee47d079bdeb04d600aede9c (image=quay.io/prometheus/busybox:latest, name=nervous_chaplygin)
2023-07-14 07:05:42.147524232 +0000 UTC container health_status be79f8a88c0f520606b9979be7dc17b45c5bf4b4ee47d079bdeb04d600aede9c (image=quay.io/prometheus/busybox:latest, name=nervous_chaplygin, health_status=starting)

This is inconsistent with podman ps which already shows it as "healthy".

But there is no event after 5s which would indicate that the health check ran and turned the status to "healthy". The next event only happens after 30s and then has the expected state "healthy":

2023-07-14 07:06:12.776050715 +0000 UTC container health_status be79f8a88c0f520606b9979be7dc17b45c5bf4b4ee47d079bdeb04d600aede9c (image=quay.io/prometheus/busybox:latest, name=nervous_chaplygin, health_status=healthy)

And then I get the same event every 30s.

Describe the results you expected

The previous paragraph already explained my expectations. One issue is that it's unclear what --health-start-period=5s actually does -- it has no observable effect. The first health check run seems to happen immediately. It is documented as

the initialization time needed for a container to bootstrap (default "0s")

which certainly does not directly say that it will delay the first health check run by that amount, but that would be my expectation.

Also, the value in the health_status event should always agree with the actual value from the API and podman ps.

Finally, there is a missing event after the first health check runs (after 0s or health-start-period). I.e. after "starting" there should be a first run whose result gets reported as either "healthy" or "unhealthy". Our tests set --health-startup-interval to 5 minutes to get a stable UI state (we only check the first run), so we don't see the second event (which then has the correct value).
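
As a minimal cross-check sketch (assuming a hypothetical container named hc-test with a health check configured), every health_status event should already agree with what inspect reports at that moment:

    podman events --filter container=hc-test --filter event=health_status |
      while read -r event; do
          echo "event:   $event"
          echo "inspect: $(podman inspect --format '{{.State.Health.Status}}' hc-test)"
      done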

podman info output

host:
  arch: amd64
  buildahVersion: 1.30.0
  cgroupControllers:
  - cpu
  - memory
  - pids
  cgroupManager: systemd
  cgroupVersion: v2
  conmon:
    package: conmon-2.1.7-2.fc38.x86_64
    path: /usr/bin/conmon
    version: 'conmon version 2.1.7, commit: '
  cpuUtilization:
    idlePercent: 93.55
    systemPercent: 2.93
    userPercent: 3.53
  cpus: 1
  databaseBackend: boltdb
  distribution:
    distribution: fedora
    variant: cloud
    version: "38"
  eventLogger: journald
  hostname: fedora-38-127-0-0-2-2201
  idMappings:
    gidmap:
    - container_id: 0
      host_id: 1000
      size: 1
    - container_id: 1
      host_id: 524288
      size: 65536
    uidmap:
    - container_id: 0
      host_id: 1000
      size: 1
    - container_id: 1
      host_id: 524288
      size: 65536
  kernel: 6.3.12-200.fc38.x86_64
  linkmode: dynamic
  logDriver: journald
  memFree: 326119424
  memTotal: 1123942400
  networkBackend: netavark
  ociRuntime:
    name: crun
    package: crun-1.8.5-1.fc38.x86_64
    path: /usr/bin/crun
    version: |-
      crun version 1.8.5
      commit: b6f80f766c9a89eb7b1440c0a70ab287434b17ed
      rundir: /run/user/1000/crun
      spec: 1.0.0
      +SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +CRIU +LIBKRUN +WASM:wasmedge +YAJL
  os: linux
  remoteSocket:
    exists: true
    path: /run/user/1000/podman/podman.sock
  security:
    apparmorEnabled: false
    capabilities: CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_NET_BIND_SERVICE,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID,CAP_SYS_CHROOT
    rootless: true
    seccompEnabled: true
    seccompProfilePath: /usr/share/containers/seccomp.json
    selinuxEnabled: true
  serviceIsRemote: false
  slirp4netns:
    executable: /usr/bin/slirp4netns
    package: slirp4netns-1.2.0-12.fc38.x86_64
    version: |-
      slirp4netns version 1.2.0
      commit: 656041d45cfca7a4176f6b7eed9e4fe6c11e8383
      libslirp: 4.7.0
      SLIRP_CONFIG_VERSION_MAX: 4
      libseccomp: 2.5.3
  swapFree: 1078194176
  swapTotal: 1123020800
  uptime: 1h 46m 33.00s (Approximately 0.04 days)
plugins:
  authorization: null
  log:
  - k8s-file
  - none
  - passthrough
  - journald
  network:
  - bridge
  - macvlan
  - ipvlan
  volume:
  - local
registries:
  search:
  - registry.fedoraproject.org
  - registry.access.redhat.com
  - docker.io
  - quay.io
store:
  configFile: /home/admin/.config/containers/storage.conf
  containerStore:
    number: 0
    paused: 0
    running: 0
    stopped: 0
  graphDriverName: overlay
  graphOptions: {}
  graphRoot: /home/admin/.local/share/containers/storage
  graphRootAllocated: 12798898176
  graphRootUsed: 2206924800
  graphStatus:
    Backing Filesystem: btrfs
    Native Overlay Diff: "true"
    Supports d_type: "true"
    Using metacopy: "false"
  imageCopyTmpDir: /var/tmp
  imageStore:
    number: 3
  runRoot: /tmp/containers-user-1000/containers
  transientStore: false
  volumePath: /home/admin/.local/share/containers/storage/volumes
version:
  APIVersion: 4.5.1
  Built: 1685123928
  BuiltTime: Fri May 26 17:58:48 2023
  GitCommit: ""
  GoVersion: go1.20.4
  Os: linux
  OsArch: linux/amd64
  Version: 4.5.1


Podman in a container

No

Privileged Or Rootless

Rootless

Upstream Latest Release

No

Additional environment details

Happens on bare metal and in a VM, both as root and as a regular user. We also see it across operating systems, from Ubuntu 22.04 LTS over CentOS to the latest Fedora 38.

Additional information

No response
martinpitt added the kind/bug label Jul 14, 2023
martinpitt added a commit to martinpitt/cockpit-podman that referenced this issue Jul 14, 2023
There is no `health_state` event after the first health check run, and
the first `health_event` value is inconsistent with the actual state.
See containers/podman#19237

Work around that by running the health check more often, and waiting
long enough to catch the second one.
martinpitt added a commit to martinpitt/cockpit-podman that referenced this issue Jul 14, 2023
These are internal transient states which don't need to reflect in the
UI. They happen quickly in bursts, with a "permanent state" event
following such as "create", "died", or "remove".

This helps to reduce the API calls and thus mitigates out-of-order
results; see containers/podman#19124

We are not really interested in `podman exec` events, so
we would like to ignore `exec_died` along with `exec`. However, it is
the only thing that saves us from inconsistent `health_state` events
(see containers/podman#19237). So we cannot
rely on the latter event, but instead have to do a full update after
each `exec_died`, as some of them are the health checks.

Also fix the alphabetical sorting of the remaining events.
@martinpitt (Contributor, Author)

The current workaround for us is to listen to exec_died instead; health checks are implemented as podman exec, so each run triggers an exec/exec_died pair. After exec_died, we make a full /containers/ID/json inspection call, which picks up the correct health state. But this is expensive.
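
A rough CLI equivalent of that workaround (sketch only; cockpit-podman does this through the REST API, and the container ID comes from the event itself):

    podman events --filter event=exec_died --format '{{.ID}}' |
      while read -r id; do
          # some of these execs are health checks; a full inspect picks up the correct state
          podman inspect --format '{{.State.Health.Status}}' "$id"
      done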

martinpitt added a commit to cockpit-project/cockpit-podman that referenced this issue Jul 14, 2023
These are internal transient states which don't need to reflect in the
UI. They happen quickly in bursts, with a "permanent state" event
following such as "create", "died", or "remove".

This helps to reduce the API calls and thus mitigates out-of-order
results; see containers/podman#19124

We are not really interested in `podman exec` events, so
we would like to ignore `exec_died` along with `exec`. However, it is
the only thing that saves us from inconsistent `health_state` events
(see containers/podman#19237). So we cannot
rely on the latter event, but instead have to do a full update after
each `exec_died`, as some of them are the health checks.

Also fix the alphabetical sorting of the remaining events.
@martinpitt (Contributor, Author)

That workaround is still imperfect, though: We are in a state with one health check run. Then the user clicks "Run health check" (equivalent of podman healthcheck run). This triggers a health_status event (broken as above), followed by an exec_died event, but then the /containers/ID/json inspection still does not contain the log of the second run:

    "State": {    
        "OciVersion": "1.1.0-rc.1",    
        "Status": "running",    
        "Running": true,        
        "Paused": false,    
        "Restarting": false,    
        "OOMKilled": false,    
        "Dead": false,    
        "Pid": 15520,    
        "ConmonPid": 15518,    
        "ExitCode": 0,     
        "Error": "",    
        "StartedAt": "2023-07-14T10:07:14.064090116Z",    
        "FinishedAt": "0001-01-01T00:00:00Z",    
        "Health": {    
            "Status": "healthy",      
            "FailingStreak": 0,        
            "Log": [       
                {    
                    "Start": "2023-07-14T10:07:14.151866216Z",    
                    "End": "2023-07-14T10:07:14.261819433Z",    
                    "ExitCode": 0,    
                    "Output": ""    
                }    
            ]    
        },    

(There's only one Log entry, the old one)
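
The same sequence can be reproduced by hand (sketch, assuming a hypothetical container named hc-test):

    podman healthcheck run hc-test                                # what the "Run health check" button does
    podman inspect --format '{{json .State.Health.Log}}' hc-test  # may still show only the previous log entry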

@mheon (Member) commented Jul 14, 2023

I'll look into this today.

mheon self-assigned this Jul 14, 2023
@martinpitt (Contributor, Author)

This triggers a health_status event (broken as above), followed by an exec_died event, but then the /containers/ID/json inspection still does not contain the log of the second run

This may be an independent issue. If you'd like me to file it separately, I'm happy to. Thanks!

skobyda pushed a commit to cockpit-project/bots that referenced this issue Jul 14, 2023
@mheon (Member) commented Jul 14, 2023

It's a race condition - writing the actual health status logs to disk is happening after the event is generated. I think it's probably part of the same issue.

@mheon (Member) commented Jul 14, 2023

@martinpitt Can you test #19245? I think I got most of your problems.

Lingering issues: --health-start-period still won't work. I think we might want a separate issue just for that; it seems like a much bigger fix.

@martinpitt (Contributor, Author)

I ran

dnf copr enable rhcontainerbot/packit-builds
dnf install podman{,-gvproxy}-102:4.5.0.rc1-1.20230714164821967548.pr19245.1088.gd277a7e7e.x86_64

in a clean F38 VM to get the packages from #19245, and re-did the reproducer above. Indeed this works much better now. For the failing case, I get

2023-07-18 07:59:35.460444324 +0000 UTC container health_status d9e4c05818cfff6f71af8c4770d26015ab039bfa84bfa49869a9457ef4566c0b (image=quay.io/prometheus/busybox:latest, name=vibrant_curran, health_status=starting)

right after startup, and after 30s

2023-07-18 08:00:37.406262236 +0000 UTC container health_status d9e4c05818cfff6f71af8c4770d26015ab039bfa84bfa49869a9457ef4566c0b (image=quay.io/prometheus/busybox:latest, name=vibrant_curran, health_status=unhealthy)

For the succeeding case, I get

2023-07-18 08:01:29.789596689 +0000 UTC container health_status 054df70feec0e93b98f996a3c010b84b8f51fb871bb7e4f049e7b01fb9ba46fb (image=quay.io/prometheus/busybox:latest, name=amazing_stonebraker, health_status=healthy)

immediately after startup, and again after 30s.

Lingering issues: --health-start-period still won't work.

Ack, broken out as issue #19272

Thanks!

@mheon (Member) commented Jul 18, 2023

I need to read the docs a bit more to see if the starting bit is expected; I would have thought we defaulted that to 0, and as such we'd jump immediately to unhealthy.

@github-actions (bot)

A friendly reminder that this issue had no activity for 30 days.

@martinpitt (Contributor, Author)

bump

mheon added a commit to mheon/libpod that referenced this issue Sep 11, 2023
HC events were firing as part of the `exec` call, before it had
even been decided whether the HC succeeded or failed. As such,
the status was not going to be correct any time there was a
change (e.g. the first event after a container went healthy to
unhealthy would still read healthy). Move the event into the
actual Healthcheck function and throw it in a defer to make sure
it happens at the very end, after logs are written.

Ignores several conditions that did not log previously (container
in question does not have a healthcheck, or an internal failure
that should not really happen).

Still not a perfect solution. This relies on the HC log being
written, when instead we could just get the status straight from
the function writing the event - so if we fail to write the log,
we can still report a bad status. But if the log wasn't written,
we're in bad shape regardless - `podman ps` would disagree with
the event written, for example.

Fixes containers#19237

Signed-off-by: Matt Heon <[email protected]>
github-actions bot added the locked - please file new issue/PR label Dec 11, 2023
github-actions bot locked as resolved and limited conversation to collaborators Dec 11, 2023