
--health-on-failure=restart doesn't restart container? #17777

Closed
GaryRevell opened this issue Mar 14, 2023 · 7 comments · Fixed by #17830
Labels
kind/bug: Categorizes issue or PR as related to a bug.
locked - please file new issue/PR: Assist humans wanting to comment on an old issue or PR with locked comments.

Comments

@GaryRevell commented Mar 14, 2023

Issue Description

This is an RFI and potentially a bug report.

I've been working on setting up health checks for our podman containers and have followed the instructions on this page:

https://www.redhat.com/sysadmin/podman-edge-healthcheck

It's mentioned that one of the --health-on-failure= options is restart, so I tried that rather than the kill action given in the example.
However, the container never appears to be restarted once it is marked unhealthy. Is this a bug, or am I not using the option correctly?

$ podman run --replace -d --name test-container --health-cmd /healthcheck --health-on-failure=restart --health-retries=1 health-check-action

When I use the kill option this works, as does none from recollection. Some example commands are below:

[gary@myServer tmp.cvo0HuLSA9]# podman run --replace -d --name test-container --health-cmd /healthcheck --health-on-failure=kill --health-retries=1 health-check-actions
e0676d33a91dac37da670c3b45f2478e902186055ee4e7d0c02bc3b0843f3a95
d9a2e0114df37b32e5dab9fb8a258ecfbcf18263913acad2ef1a6e4592e6728c
[gary@myServer tmp.cvo0HuLSA9]# podman ps -a
CONTAINER ID  IMAGE                                  COMMAND      CREATED        STATUS                      PORTS       NAMES
d9a2e0114df3  localhost/health-check-actions:latest  /entrypoint  2 seconds ago  Up 2 seconds ago (healthy)              test-container
[gary@myServer tmp.cvo0HuLSA9]# podman exec test-container touch /uh-oh
[gary@myServer tmp.cvo0HuLSA9]# podman ps -a
CONTAINER ID  IMAGE                                  COMMAND      CREATED         STATUS                       PORTS       NAMES
d9a2e0114df3  localhost/health-check-actions:latest  /entrypoint  18 seconds ago  Up 19 seconds ago (healthy)              test-container
[gary@myServer tmp.cvo0HuLSA9]# podman ps -a
CONTAINER ID  IMAGE                                  COMMAND      CREATED         STATUS                       PORTS       NAMES
d9a2e0114df3  localhost/health-check-actions:latest  /entrypoint  21 seconds ago  Up 21 seconds ago (healthy)              test-container
[gary@myServer tmp.cvo0HuLSA9]# podman ps -a
CONTAINER ID  IMAGE                                  COMMAND      CREATED         STATUS                                  PORTS       NAMES
d9a2e0114df3  localhost/health-check-actions:latest  /entrypoint  34 seconds ago  Exited (137) 3 seconds ago (unhealthy)              test-container
[gary@myServer tmp.cvo0HuLSA9]# podman run --replace -d --name test-container --health-cmd /healthcheck --health-on-failure=none --health-retries=1 health-check-actions
d9a2e0114df37b32e5dab9fb8a258ecfbcf18263913acad2ef1a6e4592e6728c
c0bc0b2668b5dc6c2c269e965f013626854d18320949f549d78eb2a968f84339
[gary@myServer tmp.cvo0HuLSA9]# podman ps -a
CONTAINER ID  IMAGE                                  COMMAND      CREATED        STATUS                      PORTS       NAMES
c0bc0b2668b5  localhost/health-check-actions:latest  /entrypoint  2 seconds ago  Up 3 seconds ago (healthy)              test-container
[gary@myServer tmp.cvo0HuLSA9]# podman exec test-container touch /uh-oh
[gary@myServer tmp.cvo0HuLSA9]# podman ps -a
CONTAINER ID  IMAGE                                  COMMAND      CREATED         STATUS                       PORTS       NAMES
c0bc0b2668b5  localhost/health-check-actions:latest  /entrypoint  15 seconds ago  Up 16 seconds ago (healthy)              test-container
[gary@myServer tmp.cvo0HuLSA9]# podman healthcheck run test-container
unhealthy
[gary@myServer tmp.cvo0HuLSA9]# podman ps -a
CONTAINER ID  IMAGE                                  COMMAND      CREATED         STATUS                         PORTS       NAMES
c0bc0b2668b5  localhost/health-check-actions:latest  /entrypoint  36 seconds ago  Up 36 seconds ago (unhealthy)              test-container
[gary@myServer tmp.cvo0HuLSA9]#

Steps to reproduce the issue

  1. I followed the steps on the Red Hat web page given above.

Describe the results you received

The container wasn't restarted as expected.

Describe the results you expected

I expected the container to be restarted and in a healthy state.

podman info output

If you are unable to run podman info for any reason, please provide the podman version, operating system and its version and the architecture you are running.

O/S: Oracle Linux V8.7
podman version: 4.2.0
podman info:
host:
  arch: amd64
  buildahVersion: 1.27.3
  cgroupControllers:
  - cpuset
  - cpu
  - cpuacct
  - blkio
  - memory
  - devices
  - freezer
  - net_cls
  - perf_event
  - net_prio
  - hugetlb
  - pids
  - rdma
  cgroupManager: systemd
  cgroupVersion: v1
  conmon:
    package: conmon-2.1.4-1.module+el8.7.0+20930+90b24198.x86_64
    path: /usr/bin/conmon
    version: 'conmon version 2.1.4, commit: 3922bff22a9c3ddaae27e66d280941f60a8b2554'
  cpuUtilization:
    idlePercent: 99.78
    systemPercent: 0.08
    userPercent: 0.14
  cpus: 16
  distribution:
    distribution: '"ol"'
    variant: server
    version: "8.7"
  eventLogger: file
  hostname: myServer
  idMappings:
    gidmap: null
    uidmap: null
  kernel: 5.4.17-2136.316.7.el8uek.x86_64
  linkmode: dynamic
  logDriver: k8s-file
  memFree: 63231135744
  memTotal: 66397937664
  networkBackend: cni
  ociRuntime:
    name: runc
    package: runc-1.1.4-1.module+el8.7.0+20930+90b24198.x86_64
    path: /usr/bin/runc
    version: |-
      runc version 1.1.4
      spec: 1.0.2-dev
      go: go1.18.9
      libseccomp: 2.5.2
  os: linux
  remoteSocket:
    path: /run/podman/podman.sock
  security:
    apparmorEnabled: false
    capabilities: CAP_NET_RAW,CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_NET_BIND_SERVICE,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID,CAP_SYS_CHROOT
    rootless: false
    seccompEnabled: true
    seccompProfilePath: /usr/share/containers/seccomp.json
    selinuxEnabled: false
  serviceIsRemote: false
  slirp4netns:
    executable: /bin/slirp4netns
    package: slirp4netns-1.2.0-2.module+el8.7.0+20930+90b24198.x86_64
    version: |-
      slirp4netns version 1.2.0
      commit: 656041d45cfca7a4176f6b7eed9e4fe6c11e8383
      libslirp: 4.4.0
      SLIRP_CONFIG_VERSION_MAX: 3
      libseccomp: 2.5.2
  swapFree: 4294963200
  swapTotal: 4294963200
  uptime: 82h 1m 50.00s (Approximately 3.42 days)
plugins:
  authorization: null
  log:
  - k8s-file
  - none
  - passthrough
  - journald
  network:
  - bridge
  - macvlan
  - ipvlan
  volume:
  - local
registries:
  search:
  - container-registry.oracle.com
  - docker.io
store:
  configFile: /etc/containers/storage.conf
  containerStore:
    number: 1
    paused: 0
    running: 1
    stopped: 0
  graphDriverName: overlay
  graphOptions:
    overlay.mountopt: nodev,metacopy=on
  graphRoot: /var/lib/containers/storage
  graphRootAllocated: 212055355392
  graphRootUsed: 9143263232
  graphStatus:
    Backing Filesystem: xfs
    Native Overlay Diff: "false"
    Supports d_type: "true"
    Using metacopy: "true"
  imageCopyTmpDir: /var/tmp
  imageStore:
    number: 5
  runRoot: /run/containers/storage
  volumePath: /var/lib/containers/storage/volumes
version:
  APIVersion: 4.2.0
  Built: 1677014962
  BuiltTime: Tue Feb 21 13:29:22 2023
  GitCommit: ""
  GoVersion: go1.18.9
  Os: linux
  OsArch: linux/amd64
  Version: 4.2.0

Podman in a container

No

Privileged Or Rootless

Privileged

Upstream Latest Release

Yes

Additional environment details

Additional information

Happy to provide any extra information and screenshots needed.

I'm running these tests as root because I was getting a podman build error when using my own account.

GaryRevell added the kind/bug label on Mar 14, 2023
@vrothberg (Member)

Thanks for reaching out, @GaryRevell!

How do you determine whether the container got restarted?

With a simple example, podman run --replace -d --name test-container --health-cmd false --health-on-failure=restart --health-retries=2 alpine top, the container gets restarted after the health check has run twice.
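
One quick way to check whether the restart actually happened (a sketch, assuming podman inspect exposes State.StartedAt and RestartCount, as recent versions do) is to compare the container's start time and restart counter before and after the failing health checks have run:

podman inspect --format '{{.State.StartedAt}} restarts={{.RestartCount}}' test-container
podman healthcheck run test-container   # with --health-retries=2, run this twice to cross the failure threshold
podman healthcheck run test-container
sleep 2
podman inspect --format '{{.State.StartedAt}} restarts={{.RestartCount}}' test-container
# A restart shows up as StartedAt jumping forward (RestartCount may also increment).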

@GaryRevell (Author)

podman run --replace -d --name test-container --health-cmd false --health-on-failure=restart --health-retries=2 alpine top

Thanks Valentin, I'll try this but am having some problems (proxy?) running this example on my system(s). Once I get that sorted I'll confirm all is OK.

@vrothberg (Member)

Thanks, @GaryRevell! Looking forward to hearing back from you.

@GaryRevell (Author)

Hi @vrothberg,

OK, so I've been doing some more testing on my Mac using a simple bash script; the script and the output it generates are shown below.
Machine details are:
O/S: Darwin grevell-mac 22.2.0 Darwin Kernel Version 22.2.0: Fri Nov 11 02:03:51 PST 2022; root:xnu-8792.61.2~4/RELEASE_ARM64_T6000 arm64

podman version 4.3.1

Here's the script:

#!/bin/bash

podman ps -a
podman run --replace -d --name true_container  --health-cmd true  --health-on-failure=restart --health-interval 15s --health-retries=2 container-registry.oracle.com/os/oraclelinux:9-slim sleep 300

podman run --replace -d --name false_container --health-cmd false --health-on-failure=restart --health-interval 15s --health-retries=2 container-registry.oracle.com/os/oraclelinux:9-slim sleep 300
echo `date "+%H:%M:%S"` Created false_container 

#podman run --replace -d --name simple container-registry.oracle.com/os/oraclelinux:9-slim sleep 900

while true; do
    sleep 5
    date '+%H:%M:%S'
    podman ps -a
done

And here's the output it generates:

grevell@grevell-mac podman % ./hc.sh
CONTAINER ID  IMAGE                                                COMMAND     CREATED       STATUS                                 PORTS       NAMES
11df846eaeea  container-registry.oracle.com/os/oraclelinux:9-slim  sleep 300   11 hours ago  Up 11 hours ago (healthy)                          true_container
a5716d486192  container-registry.oracle.com/os/oraclelinux:9-slim  sleep 300   11 hours ago  Exited (137) 11 hours ago (unhealthy)              false_container
11df846eaeead107535a5784edea5dc6efa8b9c944844f0af6483e42bc15401b
935b4f2aab1526389866f9ece69ac4b4e2825e1a9b81780d588c684f4501249e
a5716d486192f2147b2e6681be20abda4985133face67f18776e506b25448418
e55aef8feccca4d9f9a0511669059822ed06b59cfc5a4aa2df9533bf4dd0d12a
17:42:21 Created false_container
17:42:26
CONTAINER ID  IMAGE                                                COMMAND     CREATED       STATUS                      PORTS       NAMES
935b4f2aab15  container-registry.oracle.com/os/oraclelinux:9-slim  sleep 300   11 hours ago  Up 11 hours ago (healthy)               true_container
e55aef8feccc  container-registry.oracle.com/os/oraclelinux:9-slim  sleep 300   11 hours ago  Up 11 hours ago (starting)              false_container
17:42:31
CONTAINER ID  IMAGE                                                COMMAND     CREATED       STATUS                      PORTS       NAMES
935b4f2aab15  container-registry.oracle.com/os/oraclelinux:9-slim  sleep 300   11 hours ago  Up 11 hours ago (healthy)               true_container
e55aef8feccc  container-registry.oracle.com/os/oraclelinux:9-slim  sleep 300   11 hours ago  Up 11 hours ago (starting)              false_container
17:42:36
CONTAINER ID  IMAGE                                                COMMAND     CREATED       STATUS                     PORTS       NAMES
935b4f2aab15  container-registry.oracle.com/os/oraclelinux:9-slim  sleep 300   11 hours ago  Up 11 hours ago (healthy)              true_container
e55aef8feccc  container-registry.oracle.com/os/oraclelinux:9-slim  sleep 300   11 hours ago  Stopping (unhealthy)                   false_container
17:42:41
CONTAINER ID  IMAGE                                                COMMAND     CREATED       STATUS                     PORTS       NAMES
935b4f2aab15  container-registry.oracle.com/os/oraclelinux:9-slim  sleep 300   11 hours ago  Up 11 hours ago (healthy)              true_container
e55aef8feccc  container-registry.oracle.com/os/oraclelinux:9-slim  sleep 300   11 hours ago  Stopping (unhealthy)                   false_container
17:42:47
CONTAINER ID  IMAGE                                                COMMAND     CREATED       STATUS                                 PORTS       NAMES
935b4f2aab15  container-registry.oracle.com/os/oraclelinux:9-slim  sleep 300   11 hours ago  Up 11 hours ago (healthy)                          true_container
e55aef8feccc  container-registry.oracle.com/os/oraclelinux:9-slim  sleep 300   11 hours ago  Exited (137) 11 hours ago (unhealthy)              false_container
^C

So, I create two containers, one with a true health command and one with false, a 15-second interval, on-failure set to restart, and retries=2.

However, after 15 seconds the false_container becomes unhealthy as expected, but it doesn't restart as requested. Can you tell me what I'm doing wrong so that it does restart?

I think we're mostly there; the podman documentation is vague, to say the least, and there aren't many working examples to crib from.
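
One more way to watch for a restart, as a sketch (assuming podman events supports the container filter and a relative --since, as recent versions do): a successful on-failure restart should show up in the event stream as a "died" event followed by a fresh "start" for the same container.

# Stream events for the failing container; leave this running alongside the script above.
podman events --filter container=false_container --since 10m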

Look forward to hearing your comments etc.

Thanks!

Gary

@vrothberg (Member)

Thanks, @GaryRevell!

I can reproduce the issue and will look into it.

@vrothberg (Member)

I opened #17830 to fix the issue.

@GaryRevell, since you are running Podman on RHEL, please open a Bugzilla in case you desire a backport to RHEL.

@GaryRevell (Author) commented Mar 17, 2023

I opened #17830 to fix the issue.

@GaryRevell, since you are running Podman on RHEL, please open a Bugzilla in case you desire a backport to RHEL.

Bugzilla created, thanks for your work on this, @vrothberg.

https://bugzilla.redhat.com/show_bug.cgi?id=2179369

vrothberg added a commit to vrothberg/libpod that referenced this issue Mar 20, 2023
As described in containers#17777, the `restart` on-failure action did not behave
correctly when the health check is being run by a transient systemd
unit.  It ran just fine when being executed outside such a unit, for
instance, manually or, as done in the system tests, in a scripted
fashion.

There were two issues causing the `restart` on-failure action to
misbehave:

1) The transient systemd units used the default `KillMode=cgroup` which
   will nuke all processes in the specific cgroup including the recently
   restarted container/conmon once the main `podman healthcheck run`
   process exits.

2) Podman attempted to remove the transient systemd unit and timer
   during restart.  That is perfectly fine when manually restarting the
   container but not when the restart itself is being executed inside
   such a transient unit.  Ultimately, Podman tried to shoot itself in
   the foot.

Fix both issues by moving the restart logic into the cleanup process.
Instead of restarting the container, the `healthcheck run` will just
stop the container and the cleanup process will restart the container
once it has turned unhealthy.

Fixes: containers#17777
Signed-off-by: Valentin Rothberg <[email protected]>
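
A rough way to observe the first issue described above on an affected system (a sketch, not part of the commit; it assumes root Podman on Linux and that the transient unit names contain the container ID, which varies slightly across Podman versions):

# Locate the transient systemd timer/service that runs the container's health checks,
# then look at the service's KillMode: with the default kill mode, stopping the unit
# kills every process in its cgroup, including a freshly restarted container/conmon.
CID=$(podman inspect --format '{{.Id}}' false_container)
systemctl list-timers --all | grep "${CID:0:12}"
systemctl list-units --all  | grep "${CID:0:12}"
systemctl show "${CID}.service" -p KillMode   # assumes the unit is named after the full ID; otherwise use the name from the grep above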
vrothberg added a commit to vrothberg/libpod that referenced this issue Mar 21, 2023
Backport of commit 9563415.

Fixes: containers#17777
Signed-off-by: Valentin Rothberg <[email protected]>
vrothberg added a commit to vrothberg/libpod that referenced this issue Mar 21, 2023
Backport of commit 9563415.

Fixes: containers#17777
Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=2180125
Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=2180126
Signed-off-by: Valentin Rothberg <[email protected]>
vrothberg added a commit to vrothberg/libpod that referenced this issue Mar 21, 2023
Backport of commit 9563415.

Fixes: containers#17777
Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=2180104
Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=2180108
Signed-off-by: Valentin Rothberg <[email protected]>
github-actions bot added the locked - please file new issue/PR label on Aug 29, 2023
github-actions bot locked the issue as resolved and limited conversation to collaborators on Aug 29, 2023