healthcheck status events have inconsistent states #19237

Closed
martinpitt opened this issue Jul 14, 2023 · 10 comments · Fixed by #19245

@martinpitt (Contributor)

Issue Description

I am investigating cockpit-podman's healthcheck tests, which currently fail very often. The API sends out health_status events, but their reported values seem to lag behind what the state should be. They also sometimes differ from podman ps, which is itself inconsistent.

Steps to reproduce the issue

Open three terminals, set up some watching in two of them:

 watch -n1 podman ps -a
 clear; podman events

Failing health check

podman run --rm --health-cmd=false --health-start-period=5s --health-startup-timeout=1 --health-retries=1  quay.io/prometheus/busybox sleep infinity

Succeeding health check

podman run --rm --health-cmd=true --health-start-period=5s --health-startup-timeout=1 --health-retries=1  quay.io/prometheus/busybox sleep infinity

Describe the results you received

Failing healthcheck

Right after start, podman ps shows the health status as "starting", which is okay. It should not have done anything yet, due to --health-start-period=5s:

fdae48c5aa11  quay.io/prometheus/busybox:latest  sleep infinity  2 seconds ago  Up 2 seconds (starting)              kind_clarke

Apart from the usual init/start/pull/etc. events (I'm leaving out most of them and keeping only init as a time reference), it immediately sends out a "starting" health check event, which is fine:

2023-07-14 06:57:27.098931797 +0000 UTC container init fdae48c5aa11b9947bfbbdda36fc6c33e9cfffeba6f7dc5383cf6ecd9b93d3b0 (image=quay.io/prometheus/busybox:latest, name=kind_clarke)
2023-07-14 06:57:27.164906579 +0000 UTC container health_status fdae48c5aa11b9947bfbbdda36fc6c33e9cfffeba6f7dc5383cf6ecd9b93d3b0 (image=quay.io/prometheus/busybox:latest, name=kind_clarke, health_status=starting)

After 5 seconds, there is no event, and ps -a remains at "Up 12 seconds (starting)".

After 30 seconds, there is a new health_status event, but with the wrong value:

2023-07-14 06:57:57.504435576 +0000 UTC container health_status fdae48c5aa11b9947bfbbdda36fc6c33e9cfffeba6f7dc5383cf6ecd9b93d3b0 (image=quay.io/prometheus/busybox:latest, name=kind_clarke, health_status=starting)

whereas ps -a shows it as "unhealthy" already (which is expected):

fdae48c5aa11  quay.io/prometheus/busybox:latest  sleep infinity  40 seconds ago  Up 40 seconds (unhealthy)              kind_clarke

Only after another 30s does the next event have the correct value:

2023-07-14 06:58:27.773311668 +0000 UTC container health_status fdae48c5aa11b9947bfbbdda36fc6c33e9cfffeba6f7dc5383cf6ecd9b93d3b0 (image=quay.io/prometheus/busybox:latest, name=kind_clarke, health_status=unhealthy)

fdae48c5aa11  quay.io/prometheus/busybox:latest  sleep infinity  About a minute ago  Up About a minute (unhealthy)              kind_clarke

Passing healthcheck

For this, podman ps immediately shows "healthy", although I'd expect "starting" for the first 5s (as the health check should only run after 5s):

be79f8a88c0f  quay.io/prometheus/busybox:latest  sleep infinity  Less than a second ago  Up Less than a second (healthy)              nervous_chaplygin

From then on, it never changes any more, e.g. I got "Up 8 seconds (healthy)" or "Up 33 seconds (healthy)".

The events show an immediate health_status starting event right after startup:

2023-07-14 07:05:42.082377526 +0000 UTC container init be79f8a88c0f520606b9979be7dc17b45c5bf4b4ee47d079bdeb04d600aede9c (image=quay.io/prometheus/busybox:latest, name=nervous_chaplygin)
2023-07-14 07:05:42.147524232 +0000 UTC container health_status be79f8a88c0f520606b9979be7dc17b45c5bf4b4ee47d079bdeb04d600aede9c (image=quay.io/prometheus/busybox:latest, name=nervous_chaplygin, health_status=starting)

This is inconsistent with podman ps which already shows it as "healthy".

But there is no event after 5s which would indicate that the health check ran and turned the status to "healthy". The next event only happens after 30s and then has the expected state "healthy":

2023-07-14 07:06:12.776050715 +0000 UTC container health_status be79f8a88c0f520606b9979be7dc17b45c5bf4b4ee47d079bdeb04d600aede9c (image=quay.io/prometheus/busybox:latest, name=nervous_chaplygin, health_status=healthy)

And then I get the same event every 30s.

Describe the results you expected

The previous paragraph already explained my expectations. One issue is that it's unclear what --health-start-period=5s actually does -- it has no observable effect. The first health check run seems to happen immediately. It is documented as

the initialization time needed for a container to bootstrap (default "0s")

which certainly does not directly say that it will delay the first health check run by that amount, but that would be my expectation.

Also, the value in the health_status event should always agree with the actual value from the API and podman ps.

Finally, there is a missing event after the first health check runs (after 0s or health-start-period). I.e. after "starting" there should be a first run whose result gets reported as either "healthy" or "unhealthy". Our tests set --health-startup-interval to 5 minutes to get a stable UI state (we only check the first run), so we don't see the second event (which then has the correct value).
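
As a minimal cross-check sketch (assuming a hypothetical container named hc-test with a health check configured), every health_status event should already agree with what inspect reports at that moment:

    podman events --filter container=hc-test --filter event=health_status |
      while read -r event; do
          echo "event:   $event"
          echo "inspect: $(podman inspect --format '{{.State.Health.Status}}' hc-test)"
      done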

podman info output

host:
  arch: amd64
  buildahVersion: 1.30.0
  cgroupControllers:
  - cpu
  - memory
  - pids
  cgroupManager: systemd
  cgroupVersion: v2
  conmon:
    package: conmon-2.1.7-2.fc38.x86_64
    path: /usr/bin/conmon
    version: 'conmon version 2.1.7, commit: '
  cpuUtilization:
    idlePercent: 93.55
    systemPercent: 2.93
    userPercent: 3.53
  cpus: 1
  databaseBackend: boltdb
  distribution:
    distribution: fedora
    variant: cloud
    version: "38"
  eventLogger: journald
  hostname: fedora-38-127-0-0-2-2201
  idMappings:
    gidmap:
    - container_id: 0
      host_id: 1000
      size: 1
    - container_id: 1
      host_id: 524288
      size: 65536
    uidmap:
    - container_id: 0
      host_id: 1000
      size: 1
    - container_id: 1
      host_id: 524288
      size: 65536
  kernel: 6.3.12-200.fc38.x86_64
  linkmode: dynamic
  logDriver: journald
  memFree: 326119424
  memTotal: 1123942400
  networkBackend: netavark
  ociRuntime:
    name: crun
    package: crun-1.8.5-1.fc38.x86_64
    path: /usr/bin/crun
    version: |-
      crun version 1.8.5
      commit: b6f80f766c9a89eb7b1440c0a70ab287434b17ed
      rundir: /run/user/1000/crun
      spec: 1.0.0
      +SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +CRIU +LIBKRUN +WASM:wasmedge +YAJL
  os: linux
  remoteSocket:
    exists: true
    path: /run/user/1000/podman/podman.sock
  security:
    apparmorEnabled: false
    capabilities: CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_NET_BIND_SERVICE,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID,CAP_SYS_CHROOT
    rootless: true
    seccompEnabled: true
    seccompProfilePath: /usr/share/containers/seccomp.json
    selinuxEnabled: true
  serviceIsRemote: false
  slirp4netns:
    executable: /usr/bin/slirp4netns
    package: slirp4netns-1.2.0-12.fc38.x86_64
    version: |-
      slirp4netns version 1.2.0
      commit: 656041d45cfca7a4176f6b7eed9e4fe6c11e8383
      libslirp: 4.7.0
      SLIRP_CONFIG_VERSION_MAX: 4
      libseccomp: 2.5.3
  swapFree: 1078194176
  swapTotal: 1123020800
  uptime: 1h 46m 33.00s (Approximately 0.04 days)
plugins:
  authorization: null
  log:
  - k8s-file
  - none
  - passthrough
  - journald
  network:
  - bridge
  - macvlan
  - ipvlan
  volume:
  - local
registries:
  search:
  - registry.fedoraproject.org
  - registry.access.redhat.com
  - docker.io
  - quay.io
store:
  configFile: /home/admin/.config/containers/storage.conf
  containerStore:
    number: 0
    paused: 0
    running: 0
    stopped: 0
  graphDriverName: overlay
  graphOptions: {}
  graphRoot: /home/admin/.local/share/containers/storage
  graphRootAllocated: 12798898176
  graphRootUsed: 2206924800
  graphStatus:
    Backing Filesystem: btrfs
    Native Overlay Diff: "true"
    Supports d_type: "true"
    Using metacopy: "false"
  imageCopyTmpDir: /var/tmp
  imageStore:
    number: 3
  runRoot: /tmp/containers-user-1000/containers
  transientStore: false
  volumePath: /home/admin/.local/share/containers/storage/volumes
version:
  APIVersion: 4.5.1
  Built: 1685123928
  BuiltTime: Fri May 26 17:58:48 2023
  GitCommit: ""
  GoVersion: go1.20.4
  Os: linux
  OsArch: linux/amd64
  Version: 4.5.1


Podman in a container

No

Privileged Or Rootless

Rootless

Upstream Latest Release

No

Additional environment details

Happens on bare metal and in a VM, both as root and as a regular user. We also see it across operating systems, from Ubuntu 22.04 LTS over CentOS to the latest Fedora 38.

Additional information

No response
martinpitt added the kind/bug label Jul 14, 2023
martinpitt added a commit to martinpitt/cockpit-podman that referenced this issue Jul 14, 2023
There is no `health_state` event after the first health check run, and
the first `health_event` value is inconsistent with the actual state.
See containers/podman#19237

Work around that by running the health check more often, and waiting
long enough to catch the second one.
martinpitt added a commit to martinpitt/cockpit-podman that referenced this issue Jul 14, 2023
These are internal transient states which don't need to reflect in the
UI. They happen quickly in bursts, with a "permanent state" event
following such as "create", "died", or "remove".

This helps to reduce the API calls and thus mitigates out-of-order
results; see containers/podman#19124

We are not really interested in `podman exec` events, so
we would like to ignore `exec_died` along with `exec`. However, it is
the only thing that saves us from inconsistent `health_state` events
(see containers/podman#19237). So we cannot
rely on the latter event, but instead have to do a full update after
each `exec_died`, as some of them are the health checks.

Also fix the alphabetical sorting of the remaining events.
@martinpitt (Contributor, Author)

The current workaround for us is to listen to exec_died instead; health checks are implemented as podman exec, so each run triggers an exec/exec_died pair. After exec_died, we make a full /containers/ID/json inspection call, which picks up the correct health state. But this is expensive.
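
A rough CLI equivalent of that workaround (sketch only; cockpit-podman does this through the REST API, and the container ID comes from the event itself):

    podman events --filter event=exec_died --format '{{.ID}}' |
      while read -r id; do
          # some of these execs are health checks; a full inspect picks up the correct state
          podman inspect --format '{{.State.Health.Status}}' "$id"
      done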

martinpitt added a commit to cockpit-project/cockpit-podman that referenced this issue Jul 14, 2023
These are internal transient states which don't need to reflect in the
UI. They happen quickly in bursts, with a "permanent state" event
following such as "create", "died", or "remove".

This helps to reduce the API calls and thus mitigates out-of-order
results; see containers/podman#19124

We are not really interested in `podman exec` events, so
we would like to ignore `exec_died` along with `exec`. However, it is
the only thing that saves us from inconsistent `health_state` events
(see containers/podman#19237). So we cannot
rely on the latter event, but instead have to do a full update after
each `exec_died`, as some of them are the health checks.

Also fix the alphabetical sorting of the remaining events.
@martinpitt (Contributor, Author)

That workaround is still imperfect, though: We are in a state with one health check run. Then the user clicks "Run health check" (equivalent of podman healthcheck run). This triggers a health_status event (broken as above), followed by an exec_died event, but then the /containers/ID/json inspection still does not contain the log of the second run:

    "State": {    
        "OciVersion": "1.1.0-rc.1",    
        "Status": "running",    
        "Running": true,        
        "Paused": false,    
        "Restarting": false,    
        "OOMKilled": false,    
        "Dead": false,    
        "Pid": 15520,    
        "ConmonPid": 15518,    
        "ExitCode": 0,     
        "Error": "",    
        "StartedAt": "2023-07-14T10:07:14.064090116Z",    
        "FinishedAt": "0001-01-01T00:00:00Z",    
        "Health": {    
            "Status": "healthy",      
            "FailingStreak": 0,        
            "Log": [       
                {    
                    "Start": "2023-07-14T10:07:14.151866216Z",    
                    "End": "2023-07-14T10:07:14.261819433Z",    
                    "ExitCode": 0,    
                    "Output": ""    
                }    
            ]    
        },    

(There's only one Log entry, the old one)
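
The same sequence can be reproduced by hand (sketch, assuming a hypothetical container named hc-test):

    podman healthcheck run hc-test                                # what the "Run health check" button does
    podman inspect --format '{{json .State.Health.Log}}' hc-test  # may still show only the previous log entry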

@mheon (Member) commented Jul 14, 2023

I'll look into this today.

mheon self-assigned this Jul 14, 2023
@martinpitt (Contributor, Author)

This triggers a health_status event (broken as above), followed by an exec_died event, but then the /containers/ID/json inspection still does not contain the log of the second run

This may be an independent issue. If you'd like me to file it separately, I'm happy to. Thanks!

skobyda pushed a commit to cockpit-project/bots that referenced this issue Jul 14, 2023
@mheon (Member) commented Jul 14, 2023

It's a race condition - writing the actual health status logs to disk is happening after the event is generated. I think it's probably part of the same issue.

@mheon (Member) commented Jul 14, 2023

@martinpitt Can you test #19245? I think I got most of your problems.

Lingering issues: --health-start-period still won't work. I think we might want a separate issue just for that; it seems like a much bigger fix.

@martinpitt (Contributor, Author)

I ran

dnf copr enable rhcontainerbot/packit-builds
dnf install podman{,-gvproxy}-102:4.5.0.rc1-1.20230714164821967548.pr19245.1088.gd277a7e7e.x86_64

in a clean F38 VM to get the packages from #19245, and re-did the reproducer above. Indeed this works much better now. For the failing case, I get

2023-07-18 07:59:35.460444324 +0000 UTC container health_status d9e4c05818cfff6f71af8c4770d26015ab039bfa84bfa49869a9457ef4566c0b (image=quay.io/prometheus/busybox:latest, name=vibrant_curran, health_status=starting)

right after startup, and after 30s

2023-07-18 08:00:37.406262236 +0000 UTC container health_status d9e4c05818cfff6f71af8c4770d26015ab039bfa84bfa49869a9457ef4566c0b (image=quay.io/prometheus/busybox:latest, name=vibrant_curran, health_status=unhealthy)

For the succeeding case, I get

2023-07-18 08:01:29.789596689 +0000 UTC container health_status 054df70feec0e93b98f996a3c010b84b8f51fb871bb7e4f049e7b01fb9ba46fb (image=quay.io/prometheus/busybox:latest, name=amazing_stonebraker, health_status=healthy)

immediately after startup, and again after 30s.

Lingering issues: --health-start-period still won't work.

Ack, broken out as issue #19272

Thanks!

@mheon (Member) commented Jul 18, 2023

I need to read the docs a bit more to see if the starting bit is expected; I would have thought we defaulted that to 0, and as such we'd jump immediately to unhealthy.

@github-actions (bot)

A friendly reminder that this issue had no activity for 30 days.

@martinpitt (Contributor, Author)

bump

mheon added a commit to mheon/libpod that referenced this issue Sep 11, 2023
HC events were firing as part of the `exec` call, before it had
even been decided whether the HC succeeded or failed. As such,
the status was not going to be correct any time there was a
change (e.g. the first event after a container went healthy to
unhealthy would still read healthy). Move the event into the
actual Healthcheck function and throw it in a defer to make sure
it happens at the very end, after logs are written.

Ignores several conditions that did not log previously (container
in question does not have a healthcheck, or an internal failure
that should not really happen).

Still not a perfect solution. This relies on the HC log being
written, when instead we could just get the status straight from
the function writing the event - so if we fail to write the log,
we can still report a bad status. But if the log wasn't written,
we're in bad shape regardless - `podman ps` would disagree with
the event written, for example.

Fixes containers#19237

Signed-off-by: Matt Heon <[email protected]>
github-actions bot added the locked - please file new issue/PR label Dec 11, 2023
github-actions bot locked as resolved and limited conversation to collaborators Dec 11, 2023