
--health-on-failure=restart doesn't restart container? #17777

Closed
GaryRevell opened this issue Mar 14, 2023 · 7 comments · Fixed by #17830
Labels
kind/bug: Categorizes issue or PR as related to a bug.
locked - please file new issue/PR: Assist humans wanting to comment on an old issue or PR with locked comments.

Comments

@GaryRevell commented Mar 14, 2023

Issue Description

This is an RFI and potentially a bug report.

I've been working on setting up health checks for our podman containers and have followed the instructions on this page:

https://www.redhat.com/sysadmin/podman-edge-healthcheck

It's mentioned that one of the --health-on-failure= options is restart, so I tried that rather than the kill action given in the example.
However, the container never appears to be restarted once it is marked unhealthy. Is this a bug, or am I not using the option correctly?

$ podman run --replace -d --name test-container --health-cmd /healthcheck --health-on-failure=restart --health-retries=1 health-check-action

When I use the kill option this works, as does none from recollection. Some example commands are below:

[gary@myServer tmp.cvo0HuLSA9]# podman run --replace -d --name test-container --health-cmd /healthcheck --health-on-failure=kill --health-retries=1 health-check-actions
e0676d33a91dac37da670c3b45f2478e902186055ee4e7d0c02bc3b0843f3a95
d9a2e0114df37b32e5dab9fb8a258ecfbcf18263913acad2ef1a6e4592e6728c
[gary@myServer tmp.cvo0HuLSA9]# podman ps -a
CONTAINER ID  IMAGE                                  COMMAND      CREATED        STATUS                      PORTS       NAMES
d9a2e0114df3  localhost/health-check-actions:latest  /entrypoint  2 seconds ago  Up 2 seconds ago (healthy)              test-container
[gary@myServer tmp.cvo0HuLSA9]# podman exec test-container touch /uh-oh
[gary@myServer tmp.cvo0HuLSA9]# podman ps -a
CONTAINER ID  IMAGE                                  COMMAND      CREATED         STATUS                       PORTS       NAMES
d9a2e0114df3  localhost/health-check-actions:latest  /entrypoint  18 seconds ago  Up 19 seconds ago (healthy)              test-container
[gary@myServer tmp.cvo0HuLSA9]# podman ps -a
CONTAINER ID  IMAGE                                  COMMAND      CREATED         STATUS                       PORTS       NAMES
d9a2e0114df3  localhost/health-check-actions:latest  /entrypoint  21 seconds ago  Up 21 seconds ago (healthy)              test-container
[gary@myServer tmp.cvo0HuLSA9]# podman ps -a
CONTAINER ID  IMAGE                                  COMMAND      CREATED         STATUS                                  PORTS       NAMES
d9a2e0114df3  localhost/health-check-actions:latest  /entrypoint  34 seconds ago  Exited (137) 3 seconds ago (unhealthy)              test-container
[gary@myServer tmp.cvo0HuLSA9]# podman run --replace -d --name test-container --health-cmd /healthcheck --health-on-failure=none --health-retries=1 health-check-actions
d9a2e0114df37b32e5dab9fb8a258ecfbcf18263913acad2ef1a6e4592e6728c
c0bc0b2668b5dc6c2c269e965f013626854d18320949f549d78eb2a968f84339
[gary@myServer tmp.cvo0HuLSA9]# podman ps -a
CONTAINER ID  IMAGE                                  COMMAND      CREATED        STATUS                      PORTS       NAMES
c0bc0b2668b5  localhost/health-check-actions:latest  /entrypoint  2 seconds ago  Up 3 seconds ago (healthy)              test-container
[gary@myServer tmp.cvo0HuLSA9]# podman exec test-container touch /uh-oh
[gary@myServer tmp.cvo0HuLSA9]# podman ps -a
CONTAINER ID  IMAGE                                  COMMAND      CREATED         STATUS                       PORTS       NAMES
c0bc0b2668b5  localhost/health-check-actions:latest  /entrypoint  15 seconds ago  Up 16 seconds ago (healthy)              test-container
[gary@myServer tmp.cvo0HuLSA9]# podman healthcheck run test-container
unhealthy
[gary@myServer tmp.cvo0HuLSA9]# podman ps -a
CONTAINER ID  IMAGE                                  COMMAND      CREATED         STATUS                         PORTS       NAMES
c0bc0b2668b5  localhost/health-check-actions:latest  /entrypoint  36 seconds ago  Up 36 seconds ago (unhealthy)              test-container
[gary@myServer tmp.cvo0HuLSA9]#

Steps to reproduce the issue

  1. I followed the steps on the Red Hat web page given above.

Describe the results you received

The container wasn't restarted as expected.

Describe the results you expected

I expected the container to be restarted and in a healthy state.

podman info output

If you are unable to run podman info for any reason, please provide the podman version, operating system and its version and the architecture you are running.

O/S: Oracle Linux V8.7
podman version: 4.2.0
podman info:
host:
  arch: amd64
  buildahVersion: 1.27.3
  cgroupControllers:
  - cpuset
  - cpu
  - cpuacct
  - blkio
  - memory
  - devices
  - freezer
  - net_cls
  - perf_event
  - net_prio
  - hugetlb
  - pids
  - rdma
  cgroupManager: systemd
  cgroupVersion: v1
  conmon:
    package: conmon-2.1.4-1.module+el8.7.0+20930+90b24198.x86_64
    path: /usr/bin/conmon
    version: 'conmon version 2.1.4, commit: 3922bff22a9c3ddaae27e66d280941f60a8b2554'
  cpuUtilization:
    idlePercent: 99.78
    systemPercent: 0.08
    userPercent: 0.14
  cpus: 16
  distribution:
    distribution: '"ol"'
    variant: server
    version: "8.7"
  eventLogger: file
  hostname: myServer
  idMappings:
    gidmap: null
    uidmap: null
  kernel: 5.4.17-2136.316.7.el8uek.x86_64
  linkmode: dynamic
  logDriver: k8s-file
  memFree: 63231135744
  memTotal: 66397937664
  networkBackend: cni
  ociRuntime:
    name: runc
    package: runc-1.1.4-1.module+el8.7.0+20930+90b24198.x86_64
    path: /usr/bin/runc
    version: |-
      runc version 1.1.4
      spec: 1.0.2-dev
      go: go1.18.9
      libseccomp: 2.5.2
  os: linux
  remoteSocket:
    path: /run/podman/podman.sock
  security:
    apparmorEnabled: false
    capabilities: CAP_NET_RAW,CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_NET_BIND_SERVICE,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID,CAP_SYS_CHROOT
    rootless: false
    seccompEnabled: true
    seccompProfilePath: /usr/share/containers/seccomp.json
    selinuxEnabled: false
  serviceIsRemote: false
  slirp4netns:
    executable: /bin/slirp4netns
    package: slirp4netns-1.2.0-2.module+el8.7.0+20930+90b24198.x86_64
    version: |-
      slirp4netns version 1.2.0
      commit: 656041d45cfca7a4176f6b7eed9e4fe6c11e8383
      libslirp: 4.4.0
      SLIRP_CONFIG_VERSION_MAX: 3
      libseccomp: 2.5.2
  swapFree: 4294963200
  swapTotal: 4294963200
  uptime: 82h 1m 50.00s (Approximately 3.42 days)
plugins:
  authorization: null
  log:
  - k8s-file
  - none
  - passthrough
  - journald
  network:
  - bridge
  - macvlan
  - ipvlan
  volume:
  - local
registries:
  search:
  - container-registry.oracle.com
  - docker.io
store:
  configFile: /etc/containers/storage.conf
  containerStore:
    number: 1
    paused: 0
    running: 1
    stopped: 0
  graphDriverName: overlay
  graphOptions:
    overlay.mountopt: nodev,metacopy=on
  graphRoot: /var/lib/containers/storage
  graphRootAllocated: 212055355392
  graphRootUsed: 9143263232
  graphStatus:
    Backing Filesystem: xfs
    Native Overlay Diff: "false"
    Supports d_type: "true"
    Using metacopy: "true"
  imageCopyTmpDir: /var/tmp
  imageStore:
    number: 5
  runRoot: /run/containers/storage
  volumePath: /var/lib/containers/storage/volumes
version:
  APIVersion: 4.2.0
  Built: 1677014962
  BuiltTime: Tue Feb 21 13:29:22 2023
  GitCommit: ""
  GoVersion: go1.18.9
  Os: linux
  OsArch: linux/amd64
  Version: 4.2.0

Podman in a container

No

Privileged Or Rootless

Privileged

Upstream Latest Release

Yes

Additional environment details

Additional information

Happy to provide any extra information and screenshots needed.

I'm running these tests as root because I was getting a podman build error when using my own account.

GaryRevell added the kind/bug label on Mar 14, 2023
@vrothberg (Member)

Thanks for reaching out, @GaryRevell!

How do you determine whether the container got restarted?

With a simple example, podman run --replace -d --name test-container --health-cmd false --health-on-failure=restart --health-retries=2 alpine top, the container gets restarted after the health check has run twice.
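
One quick way to check whether the restart actually happened (a sketch, assuming podman inspect exposes State.StartedAt and RestartCount, as recent versions do) is to compare the container's start time and restart counter before and after the failing health checks have run:

podman inspect --format '{{.State.StartedAt}} restarts={{.RestartCount}}' test-container
podman healthcheck run test-container   # with --health-retries=2, run this twice to cross the failure threshold
podman healthcheck run test-container
sleep 2
podman inspect --format '{{.State.StartedAt}} restarts={{.RestartCount}}' test-container
# A restart shows up as StartedAt jumping forward (RestartCount may also increment).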

@GaryRevell (Author)

podman run --replace -d --name test-container --health-cmd false --health-on-failure=restart --health-retries=2 alpine top

Thanks Valentin, I'll try this but am having some problems (proxy?) running this example on my system(s). Once I get that sorted I'll confirm all is OK.

@vrothberg (Member)

Thanks, @GaryRevell! Looking forward to hearing back from you.

@GaryRevell (Author)

Hi @vrothberg,

OK, so I've been doing some more testing on my Mac using a simple bash script; the script and the output it generates are shown below.
Machine details are:
O/S: Darwin grevell-mac 22.2.0 Darwin Kernel Version 22.2.0: Fri Nov 11 02:03:51 PST 2022; root:xnu-8792.61.2~4/RELEASE_ARM64_T6000 arm64

podman version 4.3.1

Here's the script:

#!/bin/bash

podman ps -a
podman run --replace -d --name true_container  --health-cmd true  --health-on-failure=restart --health-interval 15s --health-retries=2 container-registry.oracle.com/os/oraclelinux:9-slim sleep 300

podman run --replace -d --name false_container --health-cmd false --health-on-failure=restart --health-interval 15s --health-retries=2 container-registry.oracle.com/os/oraclelinux:9-slim sleep 300
echo `date "+%H:%M:%S"` Created false_container 

#podman run --replace -d --name simple container-registry.oracle.com/os/oraclelinux:9-slim sleep 900

while true; do
    sleep 5
    date '+%H:%M:%S'
    podman ps -a
done

And here's the output it generates:

grevell@grevell-mac podman % ./hc.sh
CONTAINER ID  IMAGE                                                COMMAND     CREATED       STATUS                                 PORTS       NAMES
11df846eaeea  container-registry.oracle.com/os/oraclelinux:9-slim  sleep 300   11 hours ago  Up 11 hours ago (healthy)                          true_container
a5716d486192  container-registry.oracle.com/os/oraclelinux:9-slim  sleep 300   11 hours ago  Exited (137) 11 hours ago (unhealthy)              false_container
11df846eaeead107535a5784edea5dc6efa8b9c944844f0af6483e42bc15401b
935b4f2aab1526389866f9ece69ac4b4e2825e1a9b81780d588c684f4501249e
a5716d486192f2147b2e6681be20abda4985133face67f18776e506b25448418
e55aef8feccca4d9f9a0511669059822ed06b59cfc5a4aa2df9533bf4dd0d12a
17:42:21 Created false_container
17:42:26
CONTAINER ID  IMAGE                                                COMMAND     CREATED       STATUS                      PORTS       NAMES
935b4f2aab15  container-registry.oracle.com/os/oraclelinux:9-slim  sleep 300   11 hours ago  Up 11 hours ago (healthy)               true_container
e55aef8feccc  container-registry.oracle.com/os/oraclelinux:9-slim  sleep 300   11 hours ago  Up 11 hours ago (starting)              false_container
17:42:31
CONTAINER ID  IMAGE                                                COMMAND     CREATED       STATUS                      PORTS       NAMES
935b4f2aab15  container-registry.oracle.com/os/oraclelinux:9-slim  sleep 300   11 hours ago  Up 11 hours ago (healthy)               true_container
e55aef8feccc  container-registry.oracle.com/os/oraclelinux:9-slim  sleep 300   11 hours ago  Up 11 hours ago (starting)              false_container
17:42:36
CONTAINER ID  IMAGE                                                COMMAND     CREATED       STATUS                     PORTS       NAMES
935b4f2aab15  container-registry.oracle.com/os/oraclelinux:9-slim  sleep 300   11 hours ago  Up 11 hours ago (healthy)              true_container
e55aef8feccc  container-registry.oracle.com/os/oraclelinux:9-slim  sleep 300   11 hours ago  Stopping (unhealthy)                   false_container
17:42:41
CONTAINER ID  IMAGE                                                COMMAND     CREATED       STATUS                     PORTS       NAMES
935b4f2aab15  container-registry.oracle.com/os/oraclelinux:9-slim  sleep 300   11 hours ago  Up 11 hours ago (healthy)              true_container
e55aef8feccc  container-registry.oracle.com/os/oraclelinux:9-slim  sleep 300   11 hours ago  Stopping (unhealthy)                   false_container
17:42:47
CONTAINER ID  IMAGE                                                COMMAND     CREATED       STATUS                                 PORTS       NAMES
935b4f2aab15  container-registry.oracle.com/os/oraclelinux:9-slim  sleep 300   11 hours ago  Up 11 hours ago (healthy)                          true_container
e55aef8feccc  container-registry.oracle.com/os/oraclelinux:9-slim  sleep 300   11 hours ago  Exited (137) 11 hours ago (unhealthy)              false_container
^C

So, I create two containers, one with a true health command and one with false, a 15-second interval, on-failure set to restart, and retries=2.

However, after 15 seconds the false_container becomes unhealthy as expected, but it doesn't restart as requested. Can you tell me what I'm doing wrong so that it does restart?

I think we're mostly there; the podman documentation is vague, to say the least, and there aren't many working examples to crib from.
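
One more way to watch for a restart, as a sketch (assuming podman events supports the container filter and a relative --since, as recent versions do): a successful on-failure restart should show up in the event stream as a "died" event followed by a fresh "start" for the same container.

# Stream events for the failing container; leave this running alongside the script above.
podman events --filter container=false_container --since 10m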

Look forward to hearing your comments etc.

Thanks!

Gary

@vrothberg (Member)

Thanks, @GaryRevell!

I can reproduce the issue and will look into it.

@vrothberg (Member)

I opened #17830 to fix the issue.

@GaryRevell, since you are running Podman on RHEL, please open a Bugzilla in case you desire a backport to RHEL.

@GaryRevell (Author) commented Mar 17, 2023

I opened #17830 to fix the issue.

@GaryRevell, since you are running Podman on RHEL, please open a Bugzilla in case you desire a backport to RHEL.

Bugzilla created, thanks for your work on this, @vrothberg.

https://bugzilla.redhat.com/show_bug.cgi?id=2179369

vrothberg added a commit to vrothberg/libpod that referenced this issue Mar 20, 2023
As described in containers#17777, the `restart` on-failure action did not behave
correctly when the health check is being run by a transient systemd
unit.  It ran just fine when being executed outside such a unit, for
instance, manually or, as done in the system tests, in a scripted
fashion.

There were two issues causing the `restart` on-failure action to
misbehave:

1) The transient systemd units used the default `KillMode=cgroup` which
   will nuke all processes in the specific cgroup including the recently
   restarted container/conmon once the main `podman healthcheck run`
   process exits.

2) Podman attempted to remove the transient systemd unit and timer
   during restart.  That is perfectly fine when manually restarting the
   container but not when the restart itself is being executed inside
   such a transient unit.  Ultimately, Podman tried to shoot itself in
   the foot.

Fix both issues by moving the restart logic into the cleanup process.
Instead of restarting the container, the `healthcheck run` will just
stop the container and the cleanup process will restart the container
once it has turned unhealthy.

Fixes: containers#17777
Signed-off-by: Valentin Rothberg <[email protected]>
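
A rough way to observe the first issue described above on an affected system (a sketch, not part of the commit; it assumes root Podman on Linux and that the transient unit names contain the container ID, which varies slightly across Podman versions):

# Locate the transient systemd timer/service that runs the container's health checks,
# then look at the service's KillMode: with the default kill mode, stopping the unit
# kills every process in its cgroup, including a freshly restarted container/conmon.
CID=$(podman inspect --format '{{.Id}}' false_container)
systemctl list-timers --all | grep "${CID:0:12}"
systemctl list-units --all  | grep "${CID:0:12}"
systemctl show "${CID}.service" -p KillMode   # assumes the unit is named after the full ID; otherwise use the name from the grep above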
vrothberg added a commit to vrothberg/libpod that referenced this issue Mar 21, 2023
Backport of commit 9563415.

Fixes: containers#17777
Signed-off-by: Valentin Rothberg <[email protected]>
vrothberg added a commit to vrothberg/libpod that referenced this issue Mar 21, 2023
Backport of commit 9563415.

Fixes: containers#17777
Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=2180125
Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=2180126
Signed-off-by: Valentin Rothberg <[email protected]>
vrothberg added a commit to vrothberg/libpod that referenced this issue Mar 21, 2023
Backport of commit 9563415.

Fixes: containers#17777
Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=2180104
Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=2180108
Signed-off-by: Valentin Rothberg <[email protected]>
github-actions bot added the locked - please file new issue/PR label on Aug 29, 2023
github-actions bot locked the issue as resolved and limited conversation to collaborators on Aug 29, 2023