Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HealthCmd interval in quadlet not being followed + transient timers not cleaned up #22884

Closed
lespea opened this issue Jun 3, 2024 · 9 comments · Fixed by #22895
Closed

HealthCmd interval in quadlet not being followed + transient timers not cleaned up #22884

lespea opened this issue Jun 3, 2024 · 9 comments · Fixed by #22895
Assignees
Labels
5.1 kind/bug Categorizes issue or PR as related to a bug. locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. regression

Comments

@lespea
Copy link

lespea commented Jun 3, 2024

Issue Description

With the latest update of podman (v5.1.0) I noticed that in my quadlet definitions the HealthInterval is not being followed but instead the HealthStartupInterval is. Moreover the transient .timer files are being left behind whenever the service is stop/restarted causing many error logs to fill be generated since the container is no longer running but the healthcmd continues to be retried (in my case every few seconds for every container).

Quadlet def:

[Unit]
Description=Podman test
After=network.target
Requires=network.target

[Container]

# General
Image=docker.io/nginx
AutoUpdate=registry
ContainerName=testing
Pull=Always
RunInit=true
Timezone=local

# Health
HealthCmd=sleep 2
HealthInterval=5m
HealthStartupCmd=sleep 5
HealthStartupInterval=2s
HealthStartupTimeout=30s
Notify=healthy

[Service]
Restart=always
RestartSec=10s
RestartSteps=5
RestartMaxDelaySec=5m

[Install]
WantedBy=multi-user.target default.target

Transient logs persisting:

> cd /run/systemd/transient

> ll
.rw-r--r-- 418 root root  3 Jun 09:39 2f839ea75c704dd74920b54f82101443c6791b460e10bd3468582fa919ec1bc6-startup-1b5d1787b55c20ba.service
.rw-r--r-- 274 root root  3 Jun 09:39 2f839ea75c704dd74920b54f82101443c6791b460e10bd3468582fa919ec1bc6-startup-1b5d1787b55c20ba.timer
.rw-r--r-- 440 root root  3 Jun 09:10 session-1.scope

> sudo systemctl restart test.service

> ll
.rw-r--r-- 418 root root  3 Jun 09:39 2f839ea75c704dd74920b54f82101443c6791b460e10bd3468582fa919ec1bc6-startup-1b5d1787b55c20ba.service
.rw-r--r-- 274 root root  3 Jun 09:39 2f839ea75c704dd74920b54f82101443c6791b460e10bd3468582fa919ec1bc6-startup-1b5d1787b55c20ba.timer
.rw-r--r-- 418 root root  3 Jun 09:39 21e23a6d475aec6aebb72ad62091d37b42ede0cf986cc98ef05543893d8c7111-startup-58efb645f187fff1.service
.rw-r--r-- 274 root root  3 Jun 09:39 21e23a6d475aec6aebb72ad62091d37b42ede0cf986cc98ef05543893d8c7111-startup-58efb645f187fff1.timer
.rw-r--r-- 440 root root  3 Jun 09:10 session-1.scope

> sudo systemctl stop test.service

> ll
.rw-r--r-- 418 root root  3 Jun 09:39 2f839ea75c704dd74920b54f82101443c6791b460e10bd3468582fa919ec1bc6-startup-1b5d1787b55c20ba.service
.rw-r--r-- 274 root root  3 Jun 09:39 2f839ea75c704dd74920b54f82101443c6791b460e10bd3468582fa919ec1bc6-startup-1b5d1787b55c20ba.timer
.rw-r--r-- 418 root root  3 Jun 09:39 21e23a6d475aec6aebb72ad62091d37b42ede0cf986cc98ef05543893d8c7111-startup-58efb645f187fff1.service
.rw-r--r-- 274 root root  3 Jun 09:39 21e23a6d475aec6aebb72ad62091d37b42ede0cf986cc98ef05543893d8c7111-startup-58efb645f187fff1.timer
.rw-r--r-- 440 root root  3 Jun 09:10 session-1.scope

Example of a transient service/timer

> bat 2f*
───────┬───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
       │ File: 2f839ea75c704dd74920b54f82101443c6791b460e10bd3468582fa919ec1bc6-startup-1b5d1787b55c20ba.service
───────┼───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1   │ # This is a transient unit file, created programmatically via the systemd API. Do not edit.
   2   │ [Unit]
   3   │ Description=/usr/bin/podman healthcheck run 2f839ea75c704dd74920b54f82101443c6791b460e10bd3468582fa919ec1bc6
   4   │ 
   5   │ [Service]
   6   │ LogLevelMax=5
   7   │ Environment="PATH=/usr/local/sbin:/usr/local/bin:/usr/bin"
   8   │ ExecStart=
   9   │ ExecStart="/usr/bin/podman" "healthcheck" "run" "2f839ea75c704dd74920b54f82101443c6791b460e10bd3468582fa919ec1bc6"
───────┴───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
───────┬───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
       │ File: 2f839ea75c704dd74920b54f82101443c6791b460e10bd3468582fa919ec1bc6-startup-1b5d1787b55c20ba.timer
───────┼───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1   │ # This is a transient unit file, created programmatically via the systemd API. Do not edit.
   2   │ [Unit]
   3   │ Description=/usr/bin/podman healthcheck run 2f839ea75c704dd74920b54f82101443c6791b460e10bd3468582fa919ec1bc6
   4   │ 
   5   │ [Timer]
   6   │ OnUnitInactiveSec=2s
   7   │ AccuracySec=1s
   8   │ RemainAfterElapse=no
───────┴───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

Steps to reproduce the issue

Steps to reproduce the issue

  1. Setup the quadlet as described in the description; reload the systemd daemon and start the service
  2. Notice that the HealthStartupCmd is run, sleep 5
  3. Notice that every 2s the HealthCmd is being run, sleep 2
  4. Stop the service, notice that in journalctl every 2 seconds there is an error log for the startup timer/service since those containers no longer exist
  5. Look in /var/run/systemd/transient/ to see the old timers/services
    • You can also see how OnUnitInactiveSec=2s is in the timers which is the interval for the startup health check not the normal one

Describe the results you received

Timers removed on service reset/stop

Describe the results you expected

Initial health cmd runs it's cmd/interval then once health the normal cmd runs its cmds/interval. Also the checks should be removed whenever the container is restarted/shutdown.

podman info output

host:
  arch: amd64
  buildahVersion: 1.36.0
  cgroupControllers:
  - cpuset
  - cpu
  - io
  - memory
  - hugetlb
  - pids
  - rdma
  - misc
  cgroupManager: systemd
  cgroupVersion: v2
  conmon:
    package: /usr/bin/conmon is owned by conmon 1:2.1.12-1
    path: /usr/bin/conmon
    version: 'conmon version 2.1.12, commit: e8896631295ccb0bfdda4284f1751be19b483264'
  cpuUtilization:
    idlePercent: 98.39
    systemPercent: 0.47
    userPercent: 1.15
  cpus: 16
  databaseBackend: sqlite
  distribution:
    distribution: arch
    version: unknown
  eventLogger: journald
  freeLocks: 2047
  hostname: hydra.lespea.org
  idMappings:
    gidmap: null
    uidmap: null
  kernel: 6.9.3-zen1-1-zen
  linkmode: dynamic
  logDriver: journald
  memFree: 45088784384
  memTotal: 67316936704
  networkBackend: netavark
  networkBackendInfo:
    backend: netavark
    dns:
      package: /usr/lib/podman/aardvark-dns is owned by aardvark-dns 1.11.0-1
      path: /usr/lib/podman/aardvark-dns
      version: aardvark-dns 1.11.0
    package: /usr/lib/podman/netavark is owned by netavark 1.11.0-1
    path: /usr/lib/podman/netavark
    version: netavark 1.11.0
  ociRuntime:
    name: crun
    package: /usr/bin/crun is owned by crun 1.15-1
    path: /usr/bin/crun
    version: |-
      crun version 1.15
      commit: e6eacaf4034e84185fd8780ac9262bbf57082278
      rundir: /run/crun
      spec: 1.0.0
      +SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +CRIU +YAJL
  os: linux
  pasta:
    executable: /usr/bin/pasta
    package: /usr/bin/pasta is owned by passt 2024_05_23.765eb0b-1
    version: |
      pasta 2024_05_23.765eb0b
      Copyright Red Hat
      GNU General Public License, version 2 or later
        <https://www.gnu.org/licenses/old-licenses/gpl-2.0.html>
      This is free software: you are free to change and redistribute it.
      There is NO WARRANTY, to the extent permitted by law.
  remoteSocket:
    exists: false
    path: /run/podman/podman.sock
  rootlessNetworkCmd: pasta
  security:
    apparmorEnabled: false
    capabilities: CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_NET_BIND_SERVICE,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID,CAP_SYS_CHROOT
    rootless: false
    seccompEnabled: true
    seccompProfilePath: /etc/containers/seccomp.json
    selinuxEnabled: false
  serviceIsRemote: false
  slirp4netns:
    executable: /usr/bin/slirp4netns
    package: /usr/bin/slirp4netns is owned by slirp4netns 1.3.1-1
    version: |-
      slirp4netns version 1.3.1
      commit: e5e368c4f5db6ae75c2fce786e31eef9da6bf236
      libslirp: 4.8.0
      SLIRP_CONFIG_VERSION_MAX: 5
      libseccomp: 2.5.5
  swapFree: 0
  swapTotal: 0
  uptime: 0h 31m 2.00s
  variant: ""
plugins:
  authorization: null
  log:
  - k8s-file
  - none
  - passthrough
  - journald
  network:
  - bridge
  - macvlan
  - ipvlan
  volume:
  - local
registries: {}
store:
  configFile: /etc/containers/storage.conf
  containerStore:
    number: 1
    paused: 0
    running: 1
    stopped: 0
  graphDriverName: overlay
  graphOptions:
    overlay.mountopt: nodev
  graphRoot: /var/lib/containers/storage
  graphRootAllocated: 412300083200
  graphRootUsed: 142776213504
  graphStatus:
    Backing Filesystem: btrfs
    Native Overlay Diff: "false"
    Supports d_type: "true"
    Supports shifting: "true"
    Supports volatile: "true"
    Using metacopy: "true"
  imageCopyTmpDir: /var/tmp
  imageStore:
    number: 2
  runRoot: /run/containers/storage
  transientStore: false
  volumePath: /var/lib/containers/storage/volumes
version:
  APIVersion: 5.1.0
  Built: 1717105841
  BuiltTime: Thu May 30 16:50:41 2024
  GitCommit: 4e9486dbc63c24bfe109066abbb54d5d8dc2489e-dirty
  GoVersion: go1.22.3
  Os: linux
  OsArch: linux/amd64
  Version: 5.1.0

Podman in a container

No

Privileged Or Rootless

Privileged

Upstream Latest Release

Yes

Additional environment details

No response

Additional information

No response

@lespea lespea added the kind/bug Categorizes issue or PR as related to a bug. label Jun 3, 2024
@Luap99
Copy link
Member

Luap99 commented Jun 3, 2024

@mheon PTAL

@mheon
Copy link
Member

mheon commented Jun 3, 2024

To be certain: did this work with Podman 5.0?

@lespea
Copy link
Author

lespea commented Jun 3, 2024

I've used a setup similar to this for a while, yes. I'm not 100% sure when it started occurring but I'm pretty sure it was right after 5.1.0. When I get home from work I can try rolling back to the previous version to double check that's when the errors started.

@Luap99
Copy link
Member

Luap99 commented Jun 4, 2024

The only relevant change I remember going into 5.1 is 4fd8419

So maybe worth to try to revert this and test then

@mheon
Copy link
Member

mheon commented Jun 4, 2024

@lespea Does this reproduce outside of Quadlet? Something like podman run --health-cmd=... --health-internal=... using the same values as your previous script?

I'm hopeful it doesn't because we have CI that should catch such things

@Luap99
Copy link
Member

Luap99 commented Jun 4, 2024

This reproduces podman run --name c1 --health-cmd true --health-interval 15s --health-start-period 30s --health-startup-cmd true --health-startup-interval 5s -d quay.io/libpod/testimage:20240123 sleep inf

Our CI doesn't check for leaked transient units, the healthchecks are running fine it is just the cleanup which is failing to remove the timer

@Luap99
Copy link
Member

Luap99 commented Jun 4, 2024

Reverting your change makes it work again.

@Luap99
Copy link
Member

Luap99 commented Jun 4, 2024

Problems is this code

if err := c.createTimer(c.HealthCheckConfig().Interval.String(), false); err != nil {
logrus.Errorf("Error recreating container %s healthcheck: %v", c.ID(), err)
return
}
if err := c.startTimer(false); err != nil {
logrus.Errorf("Error restarting container %s healthcheck timer: %v", c.ID(), err)
}
// This kills the process the healthcheck is running.
// Which happens to be us.
// So this has to be last - after this, systemd serves us a
// SIGTERM and we exit.
if err := c.removeTransientFiles(ctx, true); err != nil {
logrus.Errorf("Error removing container %s healthcheck: %v", c.ID(), err)
return
}

Your new code only uses one field for the unit name and createTimer overwrites the startup hc with the real hc name so removeTransientFiles then removes the real hc timer and thus leaks the startup hc timer.

@Luap99 Luap99 self-assigned this Jun 4, 2024
Luap99 added a commit to Luap99/libpod that referenced this issue Jun 4, 2024
This fixes a regression added in commit 4fd8419, because the name was
overwritten by the createTimer() timer call the removeTransientFiles()
call removed the new timer and not the startup healthcheck. And then
when the container was stopped we leaked it as the wrong unit name was
in the state.

A new test has been added to ensure the logic works and we never leak
the system timers.

Fixes containers#22884

Signed-off-by: Paul Holzinger <[email protected]>
mheon pushed a commit to mheon/libpod that referenced this issue Jun 4, 2024
This fixes a regression added in commit 4fd8419, because the name was
overwritten by the createTimer() timer call the removeTransientFiles()
call removed the new timer and not the startup healthcheck. And then
when the container was stopped we leaked it as the wrong unit name was
in the state.

A new test has been added to ensure the logic works and we never leak
the system timers.

Fixes containers#22884

Signed-off-by: Paul Holzinger <[email protected]>
@lespea
Copy link
Author

lespea commented Jun 7, 2024

Ugh sorry this has been an absolute insane week. Really appreciate the fast fix/release and I can confirm that v5.1.1 works as expected!

TomSweeneyRedHat pushed a commit to TomSweeneyRedHat/podman that referenced this issue Jul 26, 2024
This fixes a regression added in commit 4fd8419, because the name was
overwritten by the createTimer() timer call the removeTransientFiles()
call removed the new timer and not the startup healthcheck. And then
when the container was stopped we leaked it as the wrong unit name was
in the state.

A new test has been added to ensure the logic works and we never leak
the system timers.

Fixes containers#22884

Signed-off-by: Paul Holzinger <[email protected]>
Signed-off-by: tomsweeneyredhat <[email protected]>
@stale-locking-app stale-locking-app bot added the locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. label Sep 6, 2024
@stale-locking-app stale-locking-app bot locked as resolved and limited conversation to collaborators Sep 6, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
5.1 kind/bug Categorizes issue or PR as related to a bug. locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. regression
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants