HealthCmd interval in quadlet not being followed + transient timers not cleaned up #22884

lespea · 2024-06-03T14:55:49Z

Issue Description

With the latest update of podman (v5.1.0) I noticed that in my quadlet definitions the HealthInterval is not being followed but instead the HealthStartupInterval is. Moreover the transient .timer files are being left behind whenever the service is stop/restarted causing many error logs to fill be generated since the container is no longer running but the healthcmd continues to be retried (in my case every few seconds for every container).

Quadlet def:

[Unit]
Description=Podman test
After=network.target
Requires=network.target

[Container]

# General
Image=docker.io/nginx
AutoUpdate=registry
ContainerName=testing
Pull=Always
RunInit=true
Timezone=local

# Health
HealthCmd=sleep 2
HealthInterval=5m
HealthStartupCmd=sleep 5
HealthStartupInterval=2s
HealthStartupTimeout=30s
Notify=healthy

[Service]
Restart=always
RestartSec=10s
RestartSteps=5
RestartMaxDelaySec=5m

[Install]
WantedBy=multi-user.target default.target

Transient logs persisting:

> cd /run/systemd/transient

> ll
.rw-r--r-- 418 root root  3 Jun 09:39 2f839ea75c704dd74920b54f82101443c6791b460e10bd3468582fa919ec1bc6-startup-1b5d1787b55c20ba.service
.rw-r--r-- 274 root root  3 Jun 09:39 2f839ea75c704dd74920b54f82101443c6791b460e10bd3468582fa919ec1bc6-startup-1b5d1787b55c20ba.timer
.rw-r--r-- 440 root root  3 Jun 09:10 session-1.scope

> sudo systemctl restart test.service

> ll
.rw-r--r-- 418 root root  3 Jun 09:39 2f839ea75c704dd74920b54f82101443c6791b460e10bd3468582fa919ec1bc6-startup-1b5d1787b55c20ba.service
.rw-r--r-- 274 root root  3 Jun 09:39 2f839ea75c704dd74920b54f82101443c6791b460e10bd3468582fa919ec1bc6-startup-1b5d1787b55c20ba.timer
.rw-r--r-- 418 root root  3 Jun 09:39 21e23a6d475aec6aebb72ad62091d37b42ede0cf986cc98ef05543893d8c7111-startup-58efb645f187fff1.service
.rw-r--r-- 274 root root  3 Jun 09:39 21e23a6d475aec6aebb72ad62091d37b42ede0cf986cc98ef05543893d8c7111-startup-58efb645f187fff1.timer
.rw-r--r-- 440 root root  3 Jun 09:10 session-1.scope

> sudo systemctl stop test.service

> ll
.rw-r--r-- 418 root root  3 Jun 09:39 2f839ea75c704dd74920b54f82101443c6791b460e10bd3468582fa919ec1bc6-startup-1b5d1787b55c20ba.service
.rw-r--r-- 274 root root  3 Jun 09:39 2f839ea75c704dd74920b54f82101443c6791b460e10bd3468582fa919ec1bc6-startup-1b5d1787b55c20ba.timer
.rw-r--r-- 418 root root  3 Jun 09:39 21e23a6d475aec6aebb72ad62091d37b42ede0cf986cc98ef05543893d8c7111-startup-58efb645f187fff1.service
.rw-r--r-- 274 root root  3 Jun 09:39 21e23a6d475aec6aebb72ad62091d37b42ede0cf986cc98ef05543893d8c7111-startup-58efb645f187fff1.timer
.rw-r--r-- 440 root root  3 Jun 09:10 session-1.scope

Example of a transient service/timer

> bat 2f*
───────┬───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
       │ File: 2f839ea75c704dd74920b54f82101443c6791b460e10bd3468582fa919ec1bc6-startup-1b5d1787b55c20ba.service
───────┼───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1   │ # This is a transient unit file, created programmatically via the systemd API. Do not edit.
   2   │ [Unit]
   3   │ Description=/usr/bin/podman healthcheck run 2f839ea75c704dd74920b54f82101443c6791b460e10bd3468582fa919ec1bc6
   4   │ 
   5   │ [Service]
   6   │ LogLevelMax=5
   7   │ Environment="PATH=/usr/local/sbin:/usr/local/bin:/usr/bin"
   8   │ ExecStart=
   9   │ ExecStart="/usr/bin/podman" "healthcheck" "run" "2f839ea75c704dd74920b54f82101443c6791b460e10bd3468582fa919ec1bc6"
───────┴───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
───────┬───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
       │ File: 2f839ea75c704dd74920b54f82101443c6791b460e10bd3468582fa919ec1bc6-startup-1b5d1787b55c20ba.timer
───────┼───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1   │ # This is a transient unit file, created programmatically via the systemd API. Do not edit.
   2   │ [Unit]
   3   │ Description=/usr/bin/podman healthcheck run 2f839ea75c704dd74920b54f82101443c6791b460e10bd3468582fa919ec1bc6
   4   │ 
   5   │ [Timer]
   6   │ OnUnitInactiveSec=2s
   7   │ AccuracySec=1s
   8   │ RemainAfterElapse=no
───────┴───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

Steps to reproduce the issue

Setup the quadlet as described in the description; reload the systemd daemon and start the service
Notice that the HealthStartupCmd is run, sleep 5
Notice that every 2s the HealthCmd is being run, sleep 2
Stop the service, notice that in journalctl every 2 seconds there is an error log for the startup timer/service since those containers no longer exist
Look in /var/run/systemd/transient/ to see the old timers/services
- You can also see how OnUnitInactiveSec=2s is in the timers which is the interval for the startup health check not the normal one

Describe the results you received

Timers removed on service reset/stop

Describe the results you expected

Initial health cmd runs it's cmd/interval then once health the normal cmd runs its cmds/interval. Also the checks should be removed whenever the container is restarted/shutdown.

podman info output

host:
  arch: amd64
  buildahVersion: 1.36.0
  cgroupControllers:
  - cpuset
  - cpu
  - io
  - memory
  - hugetlb
  - pids
  - rdma
  - misc
  cgroupManager: systemd
  cgroupVersion: v2
  conmon:
    package: /usr/bin/conmon is owned by conmon 1:2.1.12-1
    path: /usr/bin/conmon
    version: 'conmon version 2.1.12, commit: e8896631295ccb0bfdda4284f1751be19b483264'
  cpuUtilization:
    idlePercent: 98.39
    systemPercent: 0.47
    userPercent: 1.15
  cpus: 16
  databaseBackend: sqlite
  distribution:
    distribution: arch
    version: unknown
  eventLogger: journald
  freeLocks: 2047
  hostname: hydra.lespea.org
  idMappings:
    gidmap: null
    uidmap: null
  kernel: 6.9.3-zen1-1-zen
  linkmode: dynamic
  logDriver: journald
  memFree: 45088784384
  memTotal: 67316936704
  networkBackend: netavark
  networkBackendInfo:
    backend: netavark
    dns:
      package: /usr/lib/podman/aardvark-dns is owned by aardvark-dns 1.11.0-1
      path: /usr/lib/podman/aardvark-dns
      version: aardvark-dns 1.11.0
    package: /usr/lib/podman/netavark is owned by netavark 1.11.0-1
    path: /usr/lib/podman/netavark
    version: netavark 1.11.0
  ociRuntime:
    name: crun
    package: /usr/bin/crun is owned by crun 1.15-1
    path: /usr/bin/crun
    version: |-
      crun version 1.15
      commit: e6eacaf4034e84185fd8780ac9262bbf57082278
      rundir: /run/crun
      spec: 1.0.0
      +SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +CRIU +YAJL
  os: linux
  pasta:
    executable: /usr/bin/pasta
    package: /usr/bin/pasta is owned by passt 2024_05_23.765eb0b-1
    version: |
      pasta 2024_05_23.765eb0b
      Copyright Red Hat
      GNU General Public License, version 2 or later
        <https://www.gnu.org/licenses/old-licenses/gpl-2.0.html>
      This is free software: you are free to change and redistribute it.
      There is NO WARRANTY, to the extent permitted by law.
  remoteSocket:
    exists: false
    path: /run/podman/podman.sock
  rootlessNetworkCmd: pasta
  security:
    apparmorEnabled: false
    capabilities: CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_NET_BIND_SERVICE,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID,CAP_SYS_CHROOT
    rootless: false
    seccompEnabled: true
    seccompProfilePath: /etc/containers/seccomp.json
    selinuxEnabled: false
  serviceIsRemote: false
  slirp4netns:
    executable: /usr/bin/slirp4netns
    package: /usr/bin/slirp4netns is owned by slirp4netns 1.3.1-1
    version: |-
      slirp4netns version 1.3.1
      commit: e5e368c4f5db6ae75c2fce786e31eef9da6bf236
      libslirp: 4.8.0
      SLIRP_CONFIG_VERSION_MAX: 5
      libseccomp: 2.5.5
  swapFree: 0
  swapTotal: 0
  uptime: 0h 31m 2.00s
  variant: ""
plugins:
  authorization: null
  log:
  - k8s-file
  - none
  - passthrough
  - journald
  network:
  - bridge
  - macvlan
  - ipvlan
  volume:
  - local
registries: {}
store:
  configFile: /etc/containers/storage.conf
  containerStore:
    number: 1
    paused: 0
    running: 1
    stopped: 0
  graphDriverName: overlay
  graphOptions:
    overlay.mountopt: nodev
  graphRoot: /var/lib/containers/storage
  graphRootAllocated: 412300083200
  graphRootUsed: 142776213504
  graphStatus:
    Backing Filesystem: btrfs
    Native Overlay Diff: "false"
    Supports d_type: "true"
    Supports shifting: "true"
    Supports volatile: "true"
    Using metacopy: "true"
  imageCopyTmpDir: /var/tmp
  imageStore:
    number: 2
  runRoot: /run/containers/storage
  transientStore: false
  volumePath: /var/lib/containers/storage/volumes
version:
  APIVersion: 5.1.0
  Built: 1717105841
  BuiltTime: Thu May 30 16:50:41 2024
  GitCommit: 4e9486dbc63c24bfe109066abbb54d5d8dc2489e-dirty
  GoVersion: go1.22.3
  Os: linux
  OsArch: linux/amd64
  Version: 5.1.0

Podman in a container

No

Privileged Or Rootless

Privileged

Upstream Latest Release

Yes

Additional environment details

No response

Additional information

No response

The text was updated successfully, but these errors were encountered:

Luap99 · 2024-06-03T15:16:38Z

@mheon PTAL

mheon · 2024-06-03T15:18:25Z

To be certain: did this work with Podman 5.0?

lespea · 2024-06-03T15:55:16Z

I've used a setup similar to this for a while, yes. I'm not 100% sure when it started occurring but I'm pretty sure it was right after 5.1.0. When I get home from work I can try rolling back to the previous version to double check that's when the errors started.

Luap99 · 2024-06-04T10:54:34Z

The only relevant change I remember going into 5.1 is 4fd8419

So maybe worth to try to revert this and test then

mheon · 2024-06-04T12:08:14Z

@lespea Does this reproduce outside of Quadlet? Something like podman run --health-cmd=... --health-internal=... using the same values as your previous script?

I'm hopeful it doesn't because we have CI that should catch such things

Luap99 · 2024-06-04T12:15:21Z

This reproduces podman run --name c1 --health-cmd true --health-interval 15s --health-start-period 30s --health-startup-cmd true --health-startup-interval 5s -d quay.io/libpod/testimage:20240123 sleep inf

Our CI doesn't check for leaked transient units, the healthchecks are running fine it is just the cleanup which is failing to remove the timer

Luap99 · 2024-06-04T12:18:44Z

Reverting your change makes it work again.

Luap99 · 2024-06-04T12:24:24Z

Problems is this code

podman/libpod/healthcheck.go

Lines 282 to 297 in c510959

    
           if err := c.createTimer(c.HealthCheckConfig().Interval.String(), false); err != nil { 
        
           	logrus.Errorf("Error recreating container %s healthcheck: %v", c.ID(), err) 
        
           	return 
        
           } 
        
           if err := c.startTimer(false); err != nil { 
        
           	logrus.Errorf("Error restarting container %s healthcheck timer: %v", c.ID(), err) 
        
           } 
        
           // This kills the process the healthcheck is running. 
        
           // Which happens to be us. 
        
           // So this has to be last - after this, systemd serves us a 
        
           // SIGTERM and we exit. 
        
           if err := c.removeTransientFiles(ctx, true); err != nil { 
        
           	logrus.Errorf("Error removing container %s healthcheck: %v", c.ID(), err) 
        
           	return 
        
           }

Your new code only uses one field for the unit name and createTimer overwrites the startup hc with the real hc name so removeTransientFiles then removes the real hc timer and thus leaks the startup hc timer.

This fixes a regression added in commit 4fd8419, because the name was overwritten by the createTimer() timer call the removeTransientFiles() call removed the new timer and not the startup healthcheck. And then when the container was stopped we leaked it as the wrong unit name was in the state. A new test has been added to ensure the logic works and we never leak the system timers. Fixes containers#22884 Signed-off-by: Paul Holzinger <[email protected]>

lespea · 2024-06-07T22:29:06Z

Ugh sorry this has been an absolute insane week. Really appreciate the fast fix/release and I can confirm that v5.1.1 works as expected!

This fixes a regression added in commit 4fd8419, because the name was overwritten by the createTimer() timer call the removeTransientFiles() call removed the new timer and not the startup healthcheck. And then when the container was stopped we leaked it as the wrong unit name was in the state. A new test has been added to ensure the logic works and we never leak the system timers. Fixes containers#22884 Signed-off-by: Paul Holzinger <[email protected]> Signed-off-by: tomsweeneyredhat <[email protected]>

lespea added the kind/bug Categorizes issue or PR as related to a bug. label Jun 3, 2024

Luap99 added regression 5.1 labels Jun 4, 2024

Luap99 self-assigned this Jun 4, 2024

Luap99 mentioned this issue Jun 4, 2024

libpod: do not leak systemd hc startup unit timer #22895

Merged

openshift-merge-bot bot closed this as completed in #22895 Jun 4, 2024

stale-locking-app bot added the locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. label Sep 6, 2024

stale-locking-app bot locked as resolved and limited conversation to collaborators Sep 6, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

HealthCmd interval in quadlet not being followed + transient timers not cleaned up #22884

HealthCmd interval in quadlet not being followed + transient timers not cleaned up #22884

lespea commented Jun 3, 2024

Luap99 commented Jun 3, 2024

mheon commented Jun 3, 2024

lespea commented Jun 3, 2024

Luap99 commented Jun 4, 2024

mheon commented Jun 4, 2024

Luap99 commented Jun 4, 2024

Luap99 commented Jun 4, 2024

Luap99 commented Jun 4, 2024

lespea commented Jun 7, 2024

HealthCmd interval in quadlet not being followed + transient timers not cleaned up #22884

HealthCmd interval in quadlet not being followed + transient timers not cleaned up #22884

Comments

lespea commented Jun 3, 2024

Issue Description

Steps to reproduce the issue

Describe the results you received

Describe the results you expected

podman info output

Podman in a container

Privileged Or Rootless

Upstream Latest Release

Additional environment details

Additional information

Luap99 commented Jun 3, 2024

mheon commented Jun 3, 2024

lespea commented Jun 3, 2024

Luap99 commented Jun 4, 2024

mheon commented Jun 4, 2024

Luap99 commented Jun 4, 2024

Luap99 commented Jun 4, 2024

Luap99 commented Jun 4, 2024

lespea commented Jun 7, 2024