[systests] quadlet hang, no details #18514

Closed

edsantiago opened this issue May 9, 2023 · 14 comments

Labels: flakes (Flakes from Continuous Integration), quadlet, locked (please file new issue/PR)

Comments

@edsantiago (Member)

Seen twice in system tests:

[+1320s] ok 348 quadlet kube - basic
[+1893s] ok 349 quadlet - rootfs
[+2494s] not ok 350 quadlet - selinux disable
         # (from function `run_podman' in file test/system/helpers.bash, line 218,
         #  from function `basic_setup' in file test/system/helpers.bash, line 49,
         #  from function `setup' in test file test/system/252-quadlet.bats, line 25)
         #   `basic_setup' failed
         # # podman rm -t 0 --all --force --ignore
         # timeout: sending signal TERM to command '/var/tmp/go/src/github.com/containers/podman/bin/podman'
         # [ rc=124 (** EXPECTED 0 **) ]
         # *** TIMED OUT ***
         # # [teardown]
         # # podman pod rm -t 0 --all --force --ignore
         # timeout: sending signal TERM to command '/var/tmp/go/src/github.com/containers/podman/bin/podman'
         # [ rc=124 ]
         # *** TIMED OUT ***

...and, game over. No more logs from there, Cirrus times out. Look at the timestamps, though: those are absurd. Nine minutes per test? This is probably the kind of thing that needs to be debugged by looking at journal logs. Or maybe by instrumenting.
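For reference, the `rc=124` in the log above is not podman's own exit code: it is what timeout(1) returns when it had to kill the wrapped command. A minimal sketch demonstrating this, using `sleep` as a stand-in for the hung podman process:

```shell
# timeout sends SIGTERM after the limit expires and then exits 124 --
# the same rc=124 the test harness reports above.
timeout 1 sleep 5
echo "rc=$?"   # prints: rc=124
```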

@edsantiago edsantiago added flakes Flakes from Continuous Integration quadlet labels May 9, 2023
@Luap99 (Member) commented May 9, 2023

Is the logged timestamp from when the test ends? If so, it seems to be caused by `quadlet - rootfs` in both cases.

@Luap99 (Member) commented May 9, 2023

I think I was able to reproduce locally.

$ time hack/bats --root 252:rootfs
# bats --filter rootfs test/system/252-quadlet.bats
252-quadlet.bats
 ✓ quadlet - rootfs

1 test, 0 failures


real    3m2.145s
user    0m0.566s
sys     0m0.219s

This was the fourth try. It didn't take as long as in CI, but still 3 minutes?! A normal run takes 2 seconds.

relevant process while it took that long:

root       18638  0.0  0.0   6692  2212 ?        Ss   14:06   0:00 /usr/bin/conmon --api-version 1 -c 27d6284a17a993a6fd86cadece44803137e94e3ab2cdd2391cd3a8b34bf0e1b9 -u 27d6284a17a993a6fd86cadece44803137e94e3ab2cdd2391cd3a8b34bf0e1b9 -r /usr/bin/crun -b /var/lib/containers/storage/overlay-containers/27d6284a17a993a6fd86cadece44803137e94e3ab2cdd2391cd3a8b34bf0e1b9/userdata -p /run/containers/storage/overlay-containers/27d6284a17a993a6fd86cadece44803137e94e3ab2cdd2391cd3a8b34bf0e1b9/userdata/pidfile -n systemd-basic_CFGaa7e6Qi --exit-dir /run/libpod/exits --full-attach -l journald --log-level warning --runtime-arg --log-format=json --runtime-arg --log --runtime-arg=/run/containers/storage/overlay-containers/27d6284a17a993a6fd86cadece44803137e94e3ab2cdd2391cd3a8b34bf0e1b9/userdata/oci-log --conmon-pidfile /run/containers/storage/overlay-containers/27d6284a17a993a6fd86cadece44803137e94e3ab2cdd2391cd3a8b34bf0e1b9/userdata/conmon.pid --exit-command /home/pholzing/go/src/github.com/containers/podman/bin/podman --exit-command-arg --root --exit-command-arg /var/lib/containers/storage --exit-command-arg --runroot --exit-command-arg /run/containers/storage --exit-command-arg --log-level --exit-command-arg warning --exit-command-arg --cgroup-manager --exit-command-arg systemd --exit-command-arg --tmpdir --exit-command-arg /run/libpod --exit-command-arg --network-config-dir --exit-command-arg  --exit-command-arg --network-backend --exit-command-arg netavark --exit-command-arg --volumepath --exit-command-arg /var/lib/containers/storage/volumes --exit-command-arg --db-backend --exit-command-arg boltdb --exit-command-arg --transient-store=false --exit-command-arg --runtime --exit-command-arg crun --exit-command-arg --storage-driver --exit-command-arg overlay --exit-command-arg --storage-opt --exit-command-arg overlay.mountopt=nodev,metacopy=on --exit-command-arg --events-backend --exit-command-arg journald --exit-command-arg container --exit-command-arg cleanup 
--exit-command-arg --rm --exit-command-arg 27d6284a17a993a6fd86cadece44803137e94e3ab2cdd2391cd3a8b34bf0e1b9
root       18646  195  0.1 2159424 45820 ?       Sl   14:06   1:33 /home/pholzing/go/src/github.com/containers/podman/bin/podman --root /var/lib/containers/storage --runroot /run/containers/storage --log-level warning --cgroup-manager systemd --tmpdir /run/libpod --network-config-dir  --network-backend netavark --volumepath /var/lib/containers/storage/volumes --db-backend boltdb --transient-store=false --runtime crun --storage-driver overlay --storage-opt overlay.mountopt=nodev,metacopy=on --events-backend journald container cleanup --rm 27d6284a17a993a6fd86cadece44803137e94e3ab2cdd2391cd3a8b34bf0e1b9
root       18673  0.0  0.0 224268  2876 pts/2    S+   14:06   0:00 /usr/bin/bash /usr/libexec/bats-core/bats-exec-test --dummy-flag -x /home/pholzing/go/src/github.com/containers/podman/test/system/252-quadlet.bats test_quadlet_-2d_rootfs 1 1 1
root       18674  0.0  0.0 235040  7936 pts/2    S+   14:06   0:00 systemctl stop basic_CFGaa7e6Qi.service
root       18675  0.0  0.0  15952  6528 pts/2    S+   14:06   0:00 /usr/bin/systemd-tty-ask-password-agent --watch
root       18676  0.1  0.1 1789996 38856 ?       Ssl  14:06   0:00 /home/pholzing/go/src/github.com/containers/podman/bin/podman rm -f -i --cidfile=/run/basic_CFGaa7e6Qi.cid

edsantiago added a commit to edsantiago/libpod that referenced this issue May 9, 2023
Run $QUADLET and all systemctl/journalctl commands using 'timeout'.
Nothing should ever, ever take more than the default 2 minutes.

Followup to containers#18514, in which quadlet tests are found to be
taking 9-10 minutes.

Signed-off-by: Ed Santiago <[email protected]>
@edsantiago (Member, Author)

Another one, f38 aarch64 root

@Luap99 (Member) commented May 17, 2023

I tried to get a reproducer, but it looks like I was only lucky once; I haven't been able to trigger it since.
The problem in my case, and very likely in CI as well, is that the cleanup process deadlocks somewhere while holding the container lock. All subsequent `--all` commands are then stuck because they cannot acquire that lock.
I assumed systemd would SIGKILL the cleanup process once it hits the stop timeout, but for some reason it apparently does not.
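One way to inspect such a hang next time it reproduces: podman is a Go binary, and the Go runtime dumps all goroutine stacks to stderr on SIGQUIT, which would show where the cleanup process is blocked. A hedged sketch of the idea; the snippet only demonstrates the 128+signal exit encoding with SIGTERM on a `sleep` stand-in, since the actual target pid here is hypothetical:

```shell
# For the real case you would send SIGQUIT to the stuck pid, e.g.:
#   kill -QUIT <pid-of-podman-container-cleanup>
# and read the goroutine dump from its stderr (the journal, for units).
# Below: a sleep stand-in killed with SIGTERM (15), so wait reports
# exit status 128 + 15 = 143.
sleep 60 &
pid=$!
kill -TERM "$pid"
wait "$pid"
echo "exit=$?"   # prints: exit=143
```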

@edsantiago (Member, Author)

Another one, and once again f38 aarch64 root (the usual pattern; there's one on rawhide-amd64, but all the rest are aarch64, FWIW). This time we have fine-grained timestamps in the logs, in case that helps anyone.

@edsantiago (Member, Author)

The stats so far: root only, on f38 and rawhide, aarch64 and regular amd64. And yes, it is still happening.

@Luap99 (Member) commented Jun 21, 2023

@vrothberg To me this looks like something in the quadlet unit causes a deadlock on a single container. I was able to reproduce it once, where the cleanup process caused the deadlock, but unfortunately I killed it before I got a stack trace.

@edsantiago if we hit a timeout in the cleanup commands, can you patch the tests to run the new `podman system locks` command from @mheon? That should at least show whether we are hanging on a container lock.
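A hedged sketch of what that teardown patch could look like (the helper name and messages are illustrative, not the actual test code; `podman system locks` is the real subcommand, which lists held locks and reports detected deadlocks):

```shell
# Illustrative helper for the system-test teardown: when a cleanup
# command times out (timeout exits 124), dump lock state before giving up.
dump_locks_on_timeout() {
    local rc=$1
    if [ "$rc" -eq 124 ]; then
        echo "*** cleanup timed out; lock state: ***"
        podman system locks || true
    fi
}
```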

edsantiago added a commit to edsantiago/libpod that referenced this issue Jun 21, 2023
for containers#18514: if we get a timeout in teardown(), run and show
the output of podman system locks

for containers#18831: if we hit unmount/EINVAL, nothing will ever work
again, so signal all future tests to skip.

Signed-off-by: Ed Santiago <[email protected]>
@github-actions bot commented Aug 1, 2023

A friendly reminder that this issue had no activity for 30 days.

@edsantiago (Member, Author)

Only one instance in my logs... but I'm >80% sure that I've seen it in other PRs in the past two weeks. (Quick reminder that my flake logger does not, and cannot, log every flake).

@vrothberg (Member)

Still happening?

@edsantiago (Member, Author)

Sigh. No new instances.

@mcepl commented Feb 7, 2024

Remco on IRC suggested that I hit this problem in openSUSE/Tumbleweed (podman 4.9.2) with this .container file:

[Unit]
Description=Podman container-transmission.service
Documentation=https://github.com/linuxserver/docker-transmission/blob/master/README.md
Wants=network-online.target
After=network-online.target

#   -e USER= `#optional` \
#   -e PASS= `#optional` \
#   -e WHITELIST= `#optional` \
#   -e PEERPORT= `#optional` \
#   -e HOST_WHITELIST= `#optional` \

[Container]
# Image=registry.opensuse.org/home/mcepl/containers/containers/opensuse/transmission:latest
# Image=lscr.io/linuxserver/transmission:latest
Image=docker.io/linuxserver/transmission
ContainerName=transmission
# HostName=my-syncthing
Label=io.containers.autoupdate=registry
Environment=PUID=1000 PGID=100 SEED_RATIO=1.3 TZ=Europe/Prague
Volume=/home/matej/.config/transmission:/config
Volume=/home/matej/Videa/2BSeen:/downloads
#   -v /path/to/watch/folder:/watch \
PublishPort=127.0.0.1:9091:9091
PublishPort=127.0.0.1:51413:51413
PublishPort=127.0.0.1:51413:51413/udp
UserNS=keep-id:uid=1000,gid=100
PodmanArgs=--hostname my-transmission --privileged

[Service]
Restart=on-failure

[Install]
WantedBy=default.target

On running quadlet I get no error message, only this strace output.

@ygalblum (Contributor) commented Feb 8, 2024

@mcepl thanks for reporting this one. I've created a new issue for it #21555

@edsantiago (Member, Author)

Haven't seen this one in many months.

@edsantiago edsantiago closed this as not planned Won't fix, can't repro, duplicate, stale Apr 4, 2024
@stale-locking-app stale-locking-app bot added the locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. label Jul 4, 2024
@stale-locking-app stale-locking-app bot locked as resolved and limited conversation to collaborators Jul 4, 2024