Unhealthy pod (created via podman play kube) doesn't restart #14505

Closed
welderpb opened this issue Jun 7, 2022 · 13 comments
Labels
kind/bug, locked - please file new issue/PR, stale-issue

Comments

welderpb commented Jun 7, 2022

Is this a BUG REPORT or FEATURE REQUEST? (leave only one on its own line)

/kind bug

Description

When a pod is created via "podman play kube" with a liveness probe, it always runs with an "unhealthy" status.

spec:
  restartPolicy: OnFailure

I checked the code and found:

    // append `kill 1` to `cmd` if appropriate restart policy is configured.
    if restartPolicy == "always" || restartPolicy == "onfailure" {
        // container will be restarted so we can kill init.
        failureCmd = "kill 1"
    }

So I inspected my container (created with 'play kube') and it has a restart policy:

               "PortBindings": {},
               "RestartPolicy": {
                    "Name": "on-failure",
                    "MaximumRetryCount": 0
               },
               "AutoRemove": false,

and "exit 1" failureCmd instead "kill 1"

              "Healthcheck": {
                    "Test": [
                         "CMD-SHELL",
                         "curl",
                         "http://localhost:8080/",
                         "",
                         "||",
                         "exit",
                         "1"
                    ],

Steps to reproduce the issue:

  1. Create Kubernetes YAML file with liveness probe and restartPolicy: OnFailure

  2. Play kube

Describe the results you received:

The container is always running with an "unhealthy" status.

Describe the results you expected:

The unhealthy container should be killed (so it can be restarted by the restart policy).

Additional information you deem important (e.g. issue happens only occasionally):

Output of podman version:

Client:       Podman Engine
Version:      4.1.0
API Version:  4.1.0
Go Version:   go1.17.5
Built:        Wed May 25 07:40:16 2022
OS/Arch:      linux/amd64

Output of podman info --debug:

host:
  arch: amd64
  buildahVersion: 1.26.1
  cgroupControllers:
  - cpuset
  - cpu
  - io
  - memory
  - hugetlb
  - pids
  - rdma
  - misc
  cgroupManager: systemd
  cgroupVersion: v2
  conmon:
    package: conmon-2.1.1-1.el9.x86_64
    path: /usr/bin/conmon
    version: 'conmon version 2.1.1, commit: bb7e4979ae5b33d8c8793618965018f335af0827'
  cpuUtilization:
    idlePercent: 90.1
    systemPercent: 0.63
    userPercent: 9.27
  cpus: 4
  distribution:
    distribution: '"centos"'
    version: "9"
  eventLogger: journald
  hostname: ip-172-25-33-161.ec2.internal
  idMappings:
    gidmap: null
    uidmap: null
  kernel: 5.14.0-101.el9.x86_64
  linkmode: dynamic
  logDriver: journald
  memFree: 130281472
  memTotal: 7743361024
  networkBackend: netavark
  ociRuntime:
    name: crun
    package: crun-1.4.5-2.el9.x86_64
    path: /usr/bin/crun
    version: |-
      crun version 1.4.5
      commit: c381048530aa750495cf502ddb7181f2ded5b400
      spec: 1.0.0
      +SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +CRIU +YAJL
  os: linux
  remoteSocket:
    path: /run/podman/podman.sock
  security:
    apparmorEnabled: false
    capabilities: CAP_NET_RAW,CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_NET_BIND_SERVICE,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID,CAP_SYS_CHROOT
    rootless: false
    seccompEnabled: true
    seccompProfilePath: /usr/share/containers/seccomp.json
    selinuxEnabled: true
  serviceIsRemote: false
  slirp4netns:
    executable: /bin/slirp4netns
    package: slirp4netns-1.2.0-2.el9.x86_64
    version: |-
      slirp4netns version 1.2.0
      commit: 656041d45cfca7a4176f6b7eed9e4fe6c11e8383
      libslirp: 4.4.0
      SLIRP_CONFIG_VERSION_MAX: 3
      libseccomp: 2.5.2
  swapFree: 0
  swapTotal: 0
  uptime: 93h 43m 1.12s (Approximately 3.88 days)
plugins:
  log:
  - k8s-file
  - none
  - passthrough
  - journald
  network:
  - bridge
  - macvlan
  volume:
  - local
registries:
  search:
  - registry.fedoraproject.org
  - registry.access.redhat.com
  - registry.centos.org
  - quay.io
  - docker.io
store:
  configFile: /etc/containers/storage.conf
  containerStore:
    number: 4
    paused: 0
    running: 3
    stopped: 1
  graphDriverName: overlay
  graphOptions:
    overlay.mountopt: nodev,metacopy=on
  graphRoot: /var/lib/containers/storage
  graphRootAllocated: 10724814848
  graphRootUsed: 2292486144
  graphStatus:
    Backing Filesystem: xfs
    Native Overlay Diff: "false"
    Supports d_type: "true"
    Using metacopy: "true"
  imageCopyTmpDir: /var/tmp
  imageStore:
    number: 3
  runRoot: /run/containers/storage
  volumePath: /var/lib/containers/storage/volumes
version:
  APIVersion: 4.1.0
  Built: 1653464416
  BuiltTime: Wed May 25 07:40:16 2022
  GitCommit: ""
  GoVersion: go1.17.5
  Os: linux
  OsArch: linux/amd64
  Version: 4.1.0

Package info (e.g. output of rpm -q podman or apt list podman):

podman-4.1.0-4.el9.x86_64

Have you tested with the latest version of Podman and have you checked the Podman Troubleshooting Guide? (https://github.com/containers/podman/blob/main/troubleshooting.md)

Yes

Additional environment details (AWS, VirtualBox, physical, etc.):

openshift-ci bot added the kind/bug label Jun 7, 2022
mheon (Member) commented Jun 7, 2022

Restart policy doesn't include health checks - it only restarts on the container actually exiting. Adding integration between healthchecks and restart policy would be a new feature.

welderpb (Author) commented Jun 7, 2022

@mheon
I am not talking about integration between healthchecks and restart policy. I am talking about the healthcheck command that is constructed from the liveness probe definition and contains the "failureCmd" variable.

pkg/specgen/generate/kube/kube.go

    case probeHandler.Exec != nil:
        execString := strings.Join(probeHandler.Exec.Command, " ")
        commandString = fmt.Sprintf("%s || %s", execString, failureCmd)
    case probeHandler.HTTPGet != nil:
        commandString = fmt.Sprintf("curl %s://%s:%d/%s  || %s", probeHandler.HTTPGet.Scheme, probeHandler.HTTPGet.Host, probeHandler.HTTPGet.Port.IntValue(), probeHandler.HTTPGet.Path, failureCmd)
    case probeHandler.TCPSocket != nil:
        commandString = fmt.Sprintf("nc -z -v %s %d || %s", probeHandler.TCPSocket.Host, probeHandler.TCPSocket.Port.IntValue(), failureCmd)
    }
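
For illustration, here is a minimal standalone Go sketch (with sample values taken from my inspect output; this is not the podman source) that plugs those values into the quoted HTTPGet branch and reproduces the command seen in the Healthcheck above, including the trailing "exit 1":

    package main

    import "fmt"

    func main() {
        // Sample values mirroring the inspected container; with
        // RestartPolicy "on-failure" the failureCmd stays "exit 1",
        // so the generated probe never kills PID 1.
        scheme, host, port, path := "http", "localhost", 8080, ""
        failureCmd := "exit 1"
        commandString := fmt.Sprintf("curl %s://%s:%d/%s  || %s",
            scheme, host, port, path, failureCmd)
        fmt.Println(commandString) // curl http://localhost:8080/  || exit 1
    }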

If the restart policy equals "onfailure" or "always", it should be "kill 1" according to this code:

    // append `kill 1` to `cmd` if appropriate restart policy is configured.
    if restartPolicy == "always" || restartPolicy == "onfailure" {
        // container will be restarted so we can kill init.
        failureCmd = "kill 1"
    }

In fact, it is "exit 1" when RestartPolicy is OnFailure in the kube definition.
If RestartPolicy is Always, it is "kill 1" as expected.
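
To make the suspected mismatch concrete, here is a small runnable Go sketch; buildFailureCmd and normalizeRestartPolicy are hypothetical helpers that mirror the quoted logic, not functions in podman:

    package main

    import (
        "fmt"
        "strings"
    )

    // buildFailureCmd mirrors the quoted logic: "kill 1" is used only when
    // the restart policy will bring the container back, otherwise "exit 1".
    func buildFailureCmd(restartPolicy string) string {
        if restartPolicy == "always" || restartPolicy == "onfailure" {
            return "kill 1"
        }
        return "exit 1"
    }

    // normalizeRestartPolicy is a hypothetical fix: dropping the hyphen lets
    // the Docker-style "on-failure" value match the "onfailure" literal.
    func normalizeRestartPolicy(p string) string {
        return strings.ReplaceAll(strings.ToLower(p), "-", "")
    }

    func main() {
        fmt.Println(buildFailureCmd("on-failure"))                         // exit 1 (the bug)
        fmt.Println(buildFailureCmd("always"))                             // kill 1
        fmt.Println(buildFailureCmd(normalizeRestartPolicy("on-failure"))) // kill 1 (expected)
    }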

mheon (Member) commented Jun 7, 2022

This is also not supported yet. I'm working on a related feature (startup probes) and I can look into this once I'm done.

welderpb (Author) commented Jun 7, 2022

@mheon Why is it not supported if I can see this code in the v4.1.0 tag?
Could you check this condition:

if restartPolicy == "always" || restartPolicy == "onfailure" {

Maybe "onfailure" should be "on-failure"?
The "always" condition works as expected.

mheon (Member) commented Jun 7, 2022

There is no handling for failureCmd. You're seeing handling for the actual healthcheck, not for any command to run on failure. There is no association between healthchecks and restart policy at present.

welderpb (Author) commented Jun 8, 2022

@mheon
Could you explain then what this block of code means?

    // append `kill 1` to `cmd` if appropriate restart policy is configured.
    if restartPolicy == "always" || restartPolicy == "onfailure" {
        // container will be restarted so we can kill init.
        failureCmd = "kill 1"
    }

It is exactly what I need: append "kill 1" to cmd when restartPolicy is always or onfailure.
When restartPolicy is "always" it works; when it is "on-failure" it does not!

edsantiago added a commit to edsantiago/libpod that referenced this issue Jun 15, 2022
Changes:
 - use --timestamp option to produce 'created' stamps
   that can be reliably tested in the image-history test

 - podman now supports manifest & multiarch run, so we
   no longer need buildah

 - bump up base alpine & busybox images

This turned out to be WAY more complicated than it should've been,
because:

 - alpine 3.14 fixed 'date -Iseconds' to include a colon in
   the TZ offset ("-07:00", was "-0700"). This is now consistent
   with GNU date's --iso-8601 format, yay, so we can eliminate
   a minor workaround.

 - with --timestamp, all ADDed files are set to that timestamp,
   including the custom-reference-timestamp file that many tests
   rely on. So we need to split the build into two steps. But:

 - ...with a two-step build I need to use --squash-all, not --squash, but:

 - ... (deep sigh) --squash-all doesn't work with --timestamp (containers#14536)
   so we need to alter existing tests to deal with new image layers.

 - And, long and sordid story relating to --rootfs. TL;DR that option
   only worked by a miracle relating to something special in one
   specific test image; it doesn't work with any other images. Fix
   seems to be complicated, so we're bypassing with a FIXME (containers#14505).

And, unrelated:

 - remove obsolete skip and workaround in run-basic test (dating
   back to varlink days)
 - add a pause-image cleanup to avoid icky red warnings in logs

Fixes: containers#14456

Signed-off-by: Ed Santiago <[email protected]>
github-actions bot commented Jul 9, 2022

A friendly reminder that this issue had no activity for 30 days.

rhatdan (Member) commented Jul 9, 2022

@mheon this Issue is waiting on feedback from you?

github-actions bot commented Aug 9, 2022

A friendly reminder that this issue had no activity for 30 days.

rhatdan (Member) commented Aug 9, 2022

@mheon Reminder

mheon (Member) commented Aug 25, 2022

Trying to remember what was going on here...

It looks like play kube is hacking around our lack of proper support for taking action on healthchecks.

IMO - we should just add support for restarting on healthcheck failure and remove this. It doesn't seem to be working as advertised in this case, at least.

github-actions bot commented

A friendly reminder that this issue had no activity for 30 days.

rhatdan (Member) commented Sep 28, 2022

We now have healthcheck restart capability. I believe that solves the problem. Reopen if I am mistaken.

rhatdan closed this as completed Sep 28, 2022
github-actions bot added the locked - please file new issue/PR label Sep 14, 2023
github-actions bot locked as resolved and limited conversation to collaborators Sep 14, 2023