Unhealthy pod (created via podman play kube) doesn't restart #14505

Closed
welderpb opened this issue Jun 7, 2022 · 13 comments
Labels
kind/bug, locked - please file new issue/PR, stale-issue

Comments

welderpb commented Jun 7, 2022

Is this a BUG REPORT or FEATURE REQUEST? (leave only one on its own line)

/kind bug

Description

When a pod is created via "podman play kube" with a liveness probe, it always runs with an "unhealthy" status.

spec:
  restartPolicy: OnFailure

I checked the code and found:

    // append `kill 1` to `cmd` if appropriate restart policy is configured.
    if restartPolicy == "always" || restartPolicy == "onfailure" {
        // container will be restarted so we can kill init.
        failureCmd = "kill 1"
    }

So I inspected my container (created with 'play kube') and it has a restart policy:

               "PortBindings": {},
               "RestartPolicy": {
                    "Name": "on-failure",
                    "MaximumRetryCount": 0
               },
               "AutoRemove": false,

and "exit 1" failureCmd instead "kill 1"

              "Healthcheck": {
                    "Test": [
                         "CMD-SHELL",
                         "curl",
                         "http://localhost:8080/",
                         "",
                         "||",
                         "exit",
                         "1"
                    ],

Steps to reproduce the issue:

  1. Create Kubernetes YAML file with liveness probe and restartPolicy: OnFailure

  2. Play kube

Describe the results you received:

The container is always running with an "unhealthy" status.

Describe the results you expected:

The unhealthy container should be killed (so it can be restarted by the restart policy).

Additional information you deem important (e.g. issue happens only occasionally):

Output of podman version:

Client:       Podman Engine
Version:      4.1.0
API Version:  4.1.0
Go Version:   go1.17.5
Built:        Wed May 25 07:40:16 2022
OS/Arch:      linux/amd64

Output of podman info --debug:

host:
  arch: amd64
  buildahVersion: 1.26.1
  cgroupControllers:
  - cpuset
  - cpu
  - io
  - memory
  - hugetlb
  - pids
  - rdma
  - misc
  cgroupManager: systemd
  cgroupVersion: v2
  conmon:
    package: conmon-2.1.1-1.el9.x86_64
    path: /usr/bin/conmon
    version: 'conmon version 2.1.1, commit: bb7e4979ae5b33d8c8793618965018f335af0827'
  cpuUtilization:
    idlePercent: 90.1
    systemPercent: 0.63
    userPercent: 9.27
  cpus: 4
  distribution:
    distribution: '"centos"'
    version: "9"
  eventLogger: journald
  hostname: ip-172-25-33-161.ec2.internal
  idMappings:
    gidmap: null
    uidmap: null
  kernel: 5.14.0-101.el9.x86_64
  linkmode: dynamic
  logDriver: journald
  memFree: 130281472
  memTotal: 7743361024
  networkBackend: netavark
  ociRuntime:
    name: crun
    package: crun-1.4.5-2.el9.x86_64
    path: /usr/bin/crun
    version: |-
      crun version 1.4.5
      commit: c381048530aa750495cf502ddb7181f2ded5b400
      spec: 1.0.0
      +SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +CRIU +YAJL
  os: linux
  remoteSocket:
    path: /run/podman/podman.sock
  security:
    apparmorEnabled: false
    capabilities: CAP_NET_RAW,CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_NET_BIND_SERVICE,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID,CAP_SYS_CHROOT
    rootless: false
    seccompEnabled: true
    seccompProfilePath: /usr/share/containers/seccomp.json
    selinuxEnabled: true
  serviceIsRemote: false
  slirp4netns:
    executable: /bin/slirp4netns
    package: slirp4netns-1.2.0-2.el9.x86_64
    version: |-
      slirp4netns version 1.2.0
      commit: 656041d45cfca7a4176f6b7eed9e4fe6c11e8383
      libslirp: 4.4.0
      SLIRP_CONFIG_VERSION_MAX: 3
      libseccomp: 2.5.2
  swapFree: 0
  swapTotal: 0
  uptime: 93h 43m 1.12s (Approximately 3.88 days)
plugins:
  log:
  - k8s-file
  - none
  - passthrough
  - journald
  network:
  - bridge
  - macvlan
  volume:
  - local
registries:
  search:
  - registry.fedoraproject.org
  - registry.access.redhat.com
  - registry.centos.org
  - quay.io
  - docker.io
store:
  configFile: /etc/containers/storage.conf
  containerStore:
    number: 4
    paused: 0
    running: 3
    stopped: 1
  graphDriverName: overlay
  graphOptions:
    overlay.mountopt: nodev,metacopy=on
  graphRoot: /var/lib/containers/storage
  graphRootAllocated: 10724814848
  graphRootUsed: 2292486144
  graphStatus:
    Backing Filesystem: xfs
    Native Overlay Diff: "false"
    Supports d_type: "true"
    Using metacopy: "true"
  imageCopyTmpDir: /var/tmp
  imageStore:
    number: 3
  runRoot: /run/containers/storage
  volumePath: /var/lib/containers/storage/volumes
version:
  APIVersion: 4.1.0
  Built: 1653464416
  BuiltTime: Wed May 25 07:40:16 2022
  GitCommit: ""
  GoVersion: go1.17.5
  Os: linux
  OsArch: linux/amd64
  Version: 4.1.0

Package info (e.g. output of rpm -q podman or apt list podman):

podman-4.1.0-4.el9.x86_64

Have you tested with the latest version of Podman and have you checked the Podman Troubleshooting Guide? (https://github.com/containers/podman/blob/main/troubleshooting.md)

Yes

Additional environment details (AWS, VirtualBox, physical, etc.):

openshift-ci bot added the kind/bug label Jun 7, 2022
mheon (Member) commented Jun 7, 2022

Restart policy doesn't include health checks - it only restarts on the container actually exiting. Adding integration between healthchecks and restart policy would be a new feature.

welderpb (Author) commented Jun 7, 2022

@mheon
I am not talking about integration between healthchecks and restart policy. I am talking about the healthcheck command that is constructed from the liveness probe definition and contains the "failureCmd" variable.

pkg/specgen/generate/kube/kube.go

    case probeHandler.Exec != nil:
        execString := strings.Join(probeHandler.Exec.Command, " ")
        commandString = fmt.Sprintf("%s || %s", execString, failureCmd)
    case probeHandler.HTTPGet != nil:
        commandString = fmt.Sprintf("curl %s://%s:%d/%s  || %s", probeHandler.HTTPGet.Scheme, probeHandler.HTTPGet.Host, probeHandler.HTTPGet.Port.IntValue(), probeHandler.HTTPGet.Path, failureCmd)
    case probeHandler.TCPSocket != nil:
        commandString = fmt.Sprintf("nc -z -v %s %d || %s", probeHandler.TCPSocket.Host, probeHandler.TCPSocket.Port.IntValue(), failureCmd)
    }
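
For illustration, here is a minimal standalone Go sketch (with sample values taken from my inspect output; this is not the podman source) that plugs those values into the quoted HTTPGet branch and reproduces the command seen in the Healthcheck above, including the trailing "exit 1":

    package main

    import "fmt"

    func main() {
        // Sample values mirroring the inspected container; with
        // RestartPolicy "on-failure" the failureCmd stays "exit 1",
        // so the generated probe never kills PID 1.
        scheme, host, port, path := "http", "localhost", 8080, ""
        failureCmd := "exit 1"
        commandString := fmt.Sprintf("curl %s://%s:%d/%s  || %s",
            scheme, host, port, path, failureCmd)
        fmt.Println(commandString) // curl http://localhost:8080/  || exit 1
    }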

If the restart policy equals "onfailure" or "always", it should be "kill 1" according to this code:

    // append `kill 1` to `cmd` if appropriate restart policy is configured.
    if restartPolicy == "always" || restartPolicy == "onfailure" {
        // container will be restarted so we can kill init.
        failureCmd = "kill 1"
    }

In fact, it is "exit 1" when RestartPolicy is OnFailure in the kube definition.
If RestartPolicy is Always, it is "kill 1" as expected.
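
To make the suspected mismatch concrete, here is a small runnable Go sketch; buildFailureCmd and normalizeRestartPolicy are hypothetical helpers that mirror the quoted logic, not functions in podman:

    package main

    import (
        "fmt"
        "strings"
    )

    // buildFailureCmd mirrors the quoted logic: "kill 1" is used only when
    // the restart policy will bring the container back, otherwise "exit 1".
    func buildFailureCmd(restartPolicy string) string {
        if restartPolicy == "always" || restartPolicy == "onfailure" {
            return "kill 1"
        }
        return "exit 1"
    }

    // normalizeRestartPolicy is a hypothetical fix: dropping the hyphen lets
    // the Docker-style "on-failure" value match the "onfailure" literal.
    func normalizeRestartPolicy(p string) string {
        return strings.ReplaceAll(strings.ToLower(p), "-", "")
    }

    func main() {
        fmt.Println(buildFailureCmd("on-failure"))                         // exit 1 (the bug)
        fmt.Println(buildFailureCmd("always"))                             // kill 1
        fmt.Println(buildFailureCmd(normalizeRestartPolicy("on-failure"))) // kill 1 (expected)
    }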

mheon (Member) commented Jun 7, 2022

This is also not supported yet. I'm working on a related feature (startup probes) and I can look into this once I'm done.

welderpb (Author) commented Jun 7, 2022

@mheon Why is it not supported if I can see this code in the v4.1.0 tag?
Could you check this condition:

if restartPolicy == "always" || restartPolicy == "onfailure" {

Maybe "onfailure" should be "on-failure"?
The "always" condition works as expected.

mheon (Member) commented Jun 7, 2022

There is no handling for failureCmd. You're seeing handling for the actual healthcheck, not for any command to run on failure. There is no association between healthchecks and restart policy at present.

welderpb (Author) commented Jun 8, 2022

@mheon
Could you explain then what this block of code means?

    // append `kill 1` to `cmd` if appropriate restart policy is configured.
    if restartPolicy == "always" || restartPolicy == "onfailure" {
        // container will be restarted so we can kill init.
        failureCmd = "kill 1"
    }

It is exactly what I need: append "kill 1" to cmd when restartPolicy is always or onfailure.
When restartPolicy is "always" it works; when it is "on-failure" it does not!

edsantiago added a commit to edsantiago/libpod that referenced this issue Jun 15, 2022
Changes:
 - use --timestamp option to produce 'created' stamps
   that can be reliably tested in the image-history test

 - podman now supports manifest & multiarch run, so we
   no longer need buildah

 - bump up base alpine & busybox images

This turned out to be WAY more complicated than it should've been,
because:

 - alpine 3.14 fixed 'date -Iseconds' to include a colon in
   the TZ offset ("-07:00", was "-0700"). This is now consistent
   with GNU date's --iso-8601 format, yay, so we can eliminate
   a minor workaround.

 - with --timestamp, all ADDed files are set to that timestamp,
   including the custom-reference-timestamp file that many tests
   rely on. So we need to split the build into two steps. But:

 - ...with a two-step build I need to use --squash-all, not --squash, but:

 - ... (deep sigh) --squash-all doesn't work with --timestamp (containers#14536)
   so we need to alter existing tests to deal with new image layers.

 - And, long and sordid story relating to --rootfs. TL;DR that option
   only worked by a miracle relating to something special in one
   specific test image; it doesn't work with any other images. Fix
   seems to be complicated, so we're bypassing with a FIXME (containers#14505).

And, unrelated:

 - remove obsolete skip and workaround in run-basic test (dating
   back to varlink days)
 - add a pause-image cleanup to avoid icky red warnings in logs

Fixes: containers#14456

Signed-off-by: Ed Santiago <[email protected]>
github-actions bot commented Jul 9, 2022

A friendly reminder that this issue had no activity for 30 days.

rhatdan (Member) commented Jul 9, 2022

@mheon this Issue is waiting on feedback from you?

github-actions bot commented Aug 9, 2022

A friendly reminder that this issue had no activity for 30 days.

rhatdan (Member) commented Aug 9, 2022

@mheon Reminder

mheon (Member) commented Aug 25, 2022

Trying to remember what was going on here...

It looks like play kube is hacking around our lack of proper support for taking action on healthchecks.

IMO - we should just add support for restarting on healthcheck failure and remove this. It doesn't seem to be working as advertised in this case, at least.

github-actions bot commented

A friendly reminder that this issue had no activity for 30 days.

rhatdan (Member) commented Sep 28, 2022

We now have healthcheck restart capability. I believe that solves the problem. Reopen if I am mistaken.

rhatdan closed this as completed Sep 28, 2022
github-actions bot added the locked - please file new issue/PR label Sep 14, 2023
github-actions bot locked as resolved and limited conversation to collaborators Sep 14, 2023