[systests] quadlet hang, no details #18514

Closed

edsantiago opened this issue May 9, 2023 · 14 comments

Labels: flakes (Flakes from Continuous Integration), quadlet, locked (please file new issue/PR)

Comments

@edsantiago (Member)

Seen twice in system tests:

[+1320s] ok 348 quadlet kube - basic
[+1893s] ok 349 quadlet - rootfs
[+2494s] not ok 350 quadlet - selinux disable
         # (from function `run_podman' in file test/system/helpers.bash, line 218,
         #  from function `basic_setup' in file test/system/helpers.bash, line 49,
         #  from function `setup' in test file test/system/252-quadlet.bats, line 25)
         #   `basic_setup' failed
         # # podman rm -t 0 --all --force --ignore
         # timeout: sending signal TERM to command '/var/tmp/go/src/github.com/containers/podman/bin/podman'
         # [ rc=124 (** EXPECTED 0 **) ]
         # *** TIMED OUT ***
         # # [teardown]
         # # podman pod rm -t 0 --all --force --ignore
         # timeout: sending signal TERM to command '/var/tmp/go/src/github.com/containers/podman/bin/podman'
         # [ rc=124 ]
         # *** TIMED OUT ***

...and, game over. No more logs from there, Cirrus times out. Look at the timestamps, though: those are absurd. Nine minutes per test? This is probably the kind of thing that needs to be debugged by looking at journal logs. Or maybe by instrumenting.
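For reference, the `rc=124` in the log above is not podman's own exit code: it is what timeout(1) returns when it had to kill the wrapped command. A minimal sketch demonstrating this, using `sleep` as a stand-in for the hung podman process:

```shell
# timeout sends SIGTERM after the limit expires and then exits 124 --
# the same rc=124 the test harness reports above.
timeout 1 sleep 5
echo "rc=$?"   # prints: rc=124
```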

@edsantiago edsantiago added flakes Flakes from Continuous Integration quadlet labels May 9, 2023
@Luap99 (Member) commented May 9, 2023

Is the logged timestamp from when the test ends? If so, it seems to be caused by `quadlet - rootfs` in both cases.

@Luap99 (Member) commented May 9, 2023

I think I was able to reproduce locally.

$ time hack/bats --root 252:rootfs
# bats --filter rootfs test/system/252-quadlet.bats
252-quadlet.bats
 ✓ quadlet - rootfs

1 test, 0 failures


real    3m2.145s
user    0m0.566s
sys     0m0.219s

This was the fourth try. It didn't take as long as in CI, but still 3 minutes?! A normal run takes 2 seconds.

relevant process while it took that long:

root       18638  0.0  0.0   6692  2212 ?        Ss   14:06   0:00 /usr/bin/conmon --api-version 1 -c 27d6284a17a993a6fd86cadece44803137e94e3ab2cdd2391cd3a8b34bf0e1b9 -u 27d6284a17a993a6fd86cadece44803137e94e3ab2cdd2391cd3a8b34bf0e1b9 -r /usr/bin/crun -b /var/lib/containers/storage/overlay-containers/27d6284a17a993a6fd86cadece44803137e94e3ab2cdd2391cd3a8b34bf0e1b9/userdata -p /run/containers/storage/overlay-containers/27d6284a17a993a6fd86cadece44803137e94e3ab2cdd2391cd3a8b34bf0e1b9/userdata/pidfile -n systemd-basic_CFGaa7e6Qi --exit-dir /run/libpod/exits --full-attach -l journald --log-level warning --runtime-arg --log-format=json --runtime-arg --log --runtime-arg=/run/containers/storage/overlay-containers/27d6284a17a993a6fd86cadece44803137e94e3ab2cdd2391cd3a8b34bf0e1b9/userdata/oci-log --conmon-pidfile /run/containers/storage/overlay-containers/27d6284a17a993a6fd86cadece44803137e94e3ab2cdd2391cd3a8b34bf0e1b9/userdata/conmon.pid --exit-command /home/pholzing/go/src/github.com/containers/podman/bin/podman --exit-command-arg --root --exit-command-arg /var/lib/containers/storage --exit-command-arg --runroot --exit-command-arg /run/containers/storage --exit-command-arg --log-level --exit-command-arg warning --exit-command-arg --cgroup-manager --exit-command-arg systemd --exit-command-arg --tmpdir --exit-command-arg /run/libpod --exit-command-arg --network-config-dir --exit-command-arg  --exit-command-arg --network-backend --exit-command-arg netavark --exit-command-arg --volumepath --exit-command-arg /var/lib/containers/storage/volumes --exit-command-arg --db-backend --exit-command-arg boltdb --exit-command-arg --transient-store=false --exit-command-arg --runtime --exit-command-arg crun --exit-command-arg --storage-driver --exit-command-arg overlay --exit-command-arg --storage-opt --exit-command-arg overlay.mountopt=nodev,metacopy=on --exit-command-arg --events-backend --exit-command-arg journald --exit-command-arg container --exit-command-arg cleanup 
--exit-command-arg --rm --exit-command-arg 27d6284a17a993a6fd86cadece44803137e94e3ab2cdd2391cd3a8b34bf0e1b9
root       18646  195  0.1 2159424 45820 ?       Sl   14:06   1:33 /home/pholzing/go/src/github.com/containers/podman/bin/podman --root /var/lib/containers/storage --runroot /run/containers/storage --log-level warning --cgroup-manager systemd --tmpdir /run/libpod --network-config-dir  --network-backend netavark --volumepath /var/lib/containers/storage/volumes --db-backend boltdb --transient-store=false --runtime crun --storage-driver overlay --storage-opt overlay.mountopt=nodev,metacopy=on --events-backend journald container cleanup --rm 27d6284a17a993a6fd86cadece44803137e94e3ab2cdd2391cd3a8b34bf0e1b9
root       18673  0.0  0.0 224268  2876 pts/2    S+   14:06   0:00 /usr/bin/bash /usr/libexec/bats-core/bats-exec-test --dummy-flag -x /home/pholzing/go/src/github.com/containers/podman/test/system/252-quadlet.bats test_quadlet_-2d_rootfs 1 1 1
root       18674  0.0  0.0 235040  7936 pts/2    S+   14:06   0:00 systemctl stop basic_CFGaa7e6Qi.service
root       18675  0.0  0.0  15952  6528 pts/2    S+   14:06   0:00 /usr/bin/systemd-tty-ask-password-agent --watch
root       18676  0.1  0.1 1789996 38856 ?       Ssl  14:06   0:00 /home/pholzing/go/src/github.com/containers/podman/bin/podman rm -f -i --cidfile=/run/basic_CFGaa7e6Qi.cid

edsantiago added a commit to edsantiago/libpod that referenced this issue May 9, 2023
Run $QUADLET and all systemctl/journalctl commands using 'timeout'.
Nothing should ever, ever take more than the default 2 minutes.

Followup to containers#18514, in which quadlet tests are found to be
taking 9-10 minutes.

Signed-off-by: Ed Santiago <[email protected]>
@edsantiago (Member, Author)

Another one, f38 aarch64 root

@Luap99 (Member) commented May 17, 2023

I tried to get a reproducer, but it looks like I was only lucky once; I haven't been able to trigger it since.
The problem in my case, and very likely in CI as well, is that the cleanup process deadlocks somewhere while holding the container lock. All subsequent `--all` commands are then stuck because they cannot acquire that lock.
I assumed systemd would SIGKILL the cleanup process once it hits the stop timeout, but for some reason it apparently does not.
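One way to inspect such a hang next time it reproduces: podman is a Go binary, and the Go runtime dumps all goroutine stacks to stderr on SIGQUIT, which would show where the cleanup process is blocked. A hedged sketch of the idea; the snippet only demonstrates the 128+signal exit encoding with SIGTERM on a `sleep` stand-in, since the actual target pid here is hypothetical:

```shell
# For the real case you would send SIGQUIT to the stuck pid, e.g.:
#   kill -QUIT <pid-of-podman-container-cleanup>
# and read the goroutine dump from its stderr (the journal, for units).
# Below: a sleep stand-in killed with SIGTERM (15), so wait reports
# exit status 128 + 15 = 143.
sleep 60 &
pid=$!
kill -TERM "$pid"
wait "$pid"
echo "exit=$?"   # prints: exit=143
```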

@edsantiago (Member, Author)

Another one, and once again f38 aarch64 root (the usual pattern; there's one on rawhide-amd64, but all the rest are aarch64, FWIW). This time we have fine-grained timestamps in the logs, in case that helps anyone.

@edsantiago (Member, Author)

The stats so far: root only, on f38 and rawhide, aarch64 and regular amd64. And yes, it is still happening.

@Luap99 (Member) commented Jun 21, 2023

@vrothberg To me this looks like something in the quadlet unit causes a deadlock on a single container. I was able to reproduce it once, where the cleanup process caused the deadlock, but unfortunately I killed it before I got a stack trace.

@edsantiago if we hit a timeout in the cleanup commands, can you patch the tests to run the new `podman system locks` command from @mheon? That should at least show whether we are hanging on a container lock.
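A hedged sketch of what that teardown patch could look like (the helper name and messages are illustrative, not the actual test code; `podman system locks` is the real subcommand, which lists held locks and reports detected deadlocks):

```shell
# Illustrative helper for the system-test teardown: when a cleanup
# command times out (timeout exits 124), dump lock state before giving up.
dump_locks_on_timeout() {
    local rc=$1
    if [ "$rc" -eq 124 ]; then
        echo "*** cleanup timed out; lock state: ***"
        podman system locks || true
    fi
}
```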

edsantiago added a commit to edsantiago/libpod that referenced this issue Jun 21, 2023
for containers#18514: if we get a timeout in teardown(), run and show
the output of podman system locks

for containers#18831: if we hit unmount/EINVAL, nothing will ever work
again, so signal all future tests to skip.

Signed-off-by: Ed Santiago <[email protected]>
@github-actions bot commented Aug 1, 2023

A friendly reminder that this issue had no activity for 30 days.

@edsantiago (Member, Author)

Only one instance in my logs... but I'm >80% sure that I've seen it in other PRs in the past two weeks. (Quick reminder that my flake logger does not, and cannot, log every flake).

@vrothberg (Member)

Still happening?

@edsantiago (Member, Author)

Sigh. No new instances.

@mcepl commented Feb 7, 2024

Remco on IRC suggested that I hit this problem in openSUSE/Tumbleweed (podman 4.9.2) with this .container file:

[Unit]
Description=Podman container-transmission.service
Documentation=https://github.com/linuxserver/docker-transmission/blob/master/README.md
Wants=network-online.target
After=network-online.target

#   -e USER= `#optional` \
#   -e PASS= `#optional` \
#   -e WHITELIST= `#optional` \
#   -e PEERPORT= `#optional` \
#   -e HOST_WHITELIST= `#optional` \

[Container]
# Image=registry.opensuse.org/home/mcepl/containers/containers/opensuse/transmission:latest
# Image=lscr.io/linuxserver/transmission:latest
Image=docker.io/linuxserver/transmission
ContainerName=transmission
# HostName=my-syncthing
Label=io.containers.autoupdate=registry
Environment=PUID=1000 PGID=100 SEED_RATIO=1.3 TZ=Europe/Prague
Volume=/home/matej/.config/transmission:/config
Volume=/home/matej/Videa/2BSeen:/downloads
#   -v /path/to/watch/folder:/watch \
PublishPort=127.0.0.1:9091:9091
PublishPort=127.0.0.1:51413:51413
PublishPort=127.0.0.1:51413:51413/udp
UserNS=keep-id:uid=1000,gid=100
PodmanArgs=--hostname my-transmission --privileged

[Service]
Restart=on-failure

[Install]
WantedBy=default.target

On running quadlet I get no error message, only this strace output.

@ygalblum (Contributor) commented Feb 8, 2024

@mcepl thanks for reporting this one. I've created a new issue for it #21555

@edsantiago (Member, Author)

Haven't seen this one in many months.

@edsantiago edsantiago closed this as not planned Won't fix, can't repro, duplicate, stale Apr 4, 2024
@stale-locking-app stale-locking-app bot added the locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. label Jul 4, 2024
@stale-locking-app stale-locking-app bot locked as resolved and limited conversation to collaborators Jul 4, 2024