Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

The unlinkat/EBUSY flake is back #11594

Closed
edsantiago opened this issue Sep 15, 2021 · 33 comments
Closed

The unlinkat/EBUSY flake is back #11594

edsantiago opened this issue Sep 15, 2021 · 33 comments
Assignees
Labels
flakes Flakes from Continuous Integration locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments.

Comments

@edsantiago
Copy link
Member

Background: we thought this had been fixed in containers/storage#926 . Further background: #7139, #10454.

Symptom [source]:

Running: podman [options] pod rm -fa
Error: error removing container SHA root filesystem: 2 errors occurred:
	* unlinkat /tmp/podman_test184319701/crio/vfs-containers/SHA/userdata/shm: device or resource busy
	* unlinkat /tmp/podman_test184319701/crio/vfs-containers/SHA/userdata/shm: device or resource busy

Most often seen in "podman play kube with image data" test, with "podman pod rm -fa", but also seen in "podman create container with --uidmap and conmon PidFile accessible" test (with podman rm, not podman pod rm, and with one unlinkat line, not two). Happening on ubuntu and fedora.

Flakes began on 2021-08-31 and continue through today. The multiarch folks are also seeing it.

These flakes do not appear in my flake logs because they seem to be happening in common_test.go:Cleanup(), which (sigh) does not do exit-status checks.

Log links available if absolutely necessary, but please don't ask unless absolutely necessary: my flake checker doesn't catch these (see immediately above), so finding logs requires tedious manual effort.

What happened on August 30/31?

@edsantiago edsantiago added the flakes Flakes from Continuous Integration label Sep 15, 2021
edsantiago added a commit to edsantiago/libpod that referenced this issue Sep 15, 2021
The e2e Cleanup handlers run various 'rm' commands, but don't
check the exit status of any. This masks real problems such
as the unlinkat-EBUSY flake (containers#11594) which only triggers on
container/pod rm.

Solution: check exit status: a failure in cleanup will now be
considered a failure in the test itself.

Signed-off-by: Ed Santiago <[email protected]>
@giuseppe
Copy link
Member

this looks like a new issue not related to the previous one we had in c/storage.

It looks like a race when cleaning up /dev/shm shared among the containers in the pod

@giuseppe
Copy link
Member

Flakes began on 2021-08-31 and continue through today. The multiarch folks are also seeing it.

this one instead looks like the one we had in storage :/ It could be an issue specific to ppc64le

@edsantiago
Copy link
Member Author

Here's something similar. The string "unlinkat" does not appear here (f33 root container) but it seems to have to do with pods and EBUSY, so I'll report here:

Running: podman [options] system prune -f
time="2021-10-12T16:12:29Z" level=error msg="Deleting pod be7634678daf6da8af56f5f0b4d25090cded05be478d6420c7bb63b42f2d5894 cgroup /libpod_parent/be7634678daf6da8af56f5f0b4d25090cded05be478d6420c7bb63b42f2d5894: remove /sys/fs/cgroup/misc/libpod_parent/be7634678daf6da8af56f5f0b4d25090cded05be478d6420c7bb63b42f2d5894: remove /sys/fs/cgroup/misc/libpod_parent/be7634678daf6da8af56f5f0b4d25090cded05be478d6420c7bb63b42f2d5894/conmon: remove /sys/fs/cgroup/misc/libpod_parent/be7634678daf6da8af56f5f0b4d25090cded05be478d6420c7bb63b42f2d5894/conmon: device or resource busy"
Deleted Containers
b31a202ad92adf0c0110efe9d93894a5e54c05189f684d5b5914ad121962faa4
Deleted Pods
Error: error removing pod be7634678daf6da8af56f5f0b4d25090cded05be478d6420c7bb63b42f2d5894 conmon cgroup: remove /sys/fs/cgroup/misc/libpod_parent/be7634678daf6da8af56f5f0b4d25090cded05be478d6420c7bb63b42f2d5894/conmon: remove /sys/fs/cgroup/misc/libpod_parent/be7634678daf6da8af56f5f0b4d25090cded05be478d6420c7bb63b42f2d5894/conmon: remove /sys/fs/cgroup/misc/libpod_parent/be7634678daf6da8af56f5f0b4d25090cded05be478d6420c7bb63b42f2d5894/conmon: device or resource busy
         [AfterEach] Podman prune

@mheon
Copy link
Member

mheon commented Oct 12, 2021

That's separate - it's a cgroups issue, Conmon is not exiting by the time we want to remove the pod's cgroup.

@edsantiago
Copy link
Member Author

Thanks. Filed #11946.

@edsantiago
Copy link
Member Author

And, back on topic, this is still happening. About 1-2 every three days. As I mentioned before, these are really tedious to look for and link, so yeah, I'll post links if desired, but I really hope someone just has an Aha moment and realizes how to fix it.

@edsantiago
Copy link
Member Author

Still happening:

So: both root and rootless. Fedora and Ubuntu. Local and remote. Always in podman play kube with image data test.

@edsantiago
Copy link
Member Author

Hmmm, I just managed to accidentally reproduce this on my laptop. I ^C'ed a while-forever loop that was doing podman run. This left a container in state Created, which I then tried to rm:

$ bin/podman ps -a
CONTAINER ID  IMAGE                            COMMAND     CREATED        STATUS      PORTS       NAMES
99cc76cd2fe8  docker.io/library/fedora:latest  true        2 seconds ago  Created                 charming_dirac
$ bin/podman rm -f -a      <--- This paused for about 10s.
Error: error removing container 99cc76cd2fe854a86529081a519a90510243ab4681e3a79e1b7899e53827fb96 root filesystem: 2 errors occurred:
        * unlinkat /home/esm/.local/share/containers/storage/overlay-containers/99cc76cd2fe854a86529081a519a90510243ab4681e3a79e1b7899e53827fb96/userdata/shm: device or resource busy
        * unlinkat /home/esm/.local/share/containers/storage/overlay-containers/99cc76cd2fe854a86529081a519a90510243ab4681e3a79e1b7899e53827fb96/userdata/shm: device or resource busy

(Yes, two blank lines afterward).

The pause surprised me: I did not expect to need -t 0 on a Created container.

@github-actions
Copy link

A friendly reminder that this issue had no activity for 30 days.

@edsantiago
Copy link
Member Author

Just reproduced it again on my laptop, again unintentionally via ^C:

...
^C
$ bin/podman rm foo
Error: error removing container f512e6e9274b8ccb8d4b2715c482308a00c404b553ae81611146f339df107ba7 root filesystem: 2 errors occurred:
        * unlinkat /home/esm/.local/share/containers/storage/overlay-containers/f512e6e9274b8ccb8d4b2715c482308a00c404b553ae81611146f339df107ba7/userdata/shm: device or resource busy
        * unlinkat /home/esm/.local/share/containers/storage/overlay-containers/f512e6e9274b8ccb8d4b2715c482308a00c404b553ae81611146f339df107ba7/userdata/shm: device or resource busy

Based on the output of the loop, I'm pretty sure the ^C interrupted this command:

bin/podman run -d --name foo $IMAGE sh -c 'printenv NOTIFY_SOCKET;echo READY;while ! test -f /stop;do sleep 0.1;done'

@jnovy
Copy link
Contributor

jnovy commented Jan 13, 2022

Reproduced this again in RHEL8 gating tests:

Error: error removing container 033f3b8a1eec95570e68aacc12035c56113d9e180eb3e83bbe6a48ee05f7ce71 root filesystem: 1 error occurred: * unlinkat /var/lib/containers/storage/overlay/240447b311711770c5d6a17a15f9cc6ec7284a98d05dccce841abec60bbdb7ab/merged: device or resource busy [ rc=125 (** EXPECTED 0 **) ]

giuseppe added a commit to giuseppe/storage that referenced this issue Jan 14, 2022
if the container deletion fails with the simple os.RemoveAll(path),
then attempt again using system.EnsureRemoveAll(path).

The difference is that system.EnsureRemoveAll(path) tries to umount
eventual mounts that are keeping the directory busy.

It might help with: containers/podman#11594

Signed-off-by: Giuseppe Scrivano <[email protected]>
@giuseppe
Copy link
Member

I was not able to reproduce locally even once, so this is just an attempt that hopefully can help dealing with the race we have seen here: containers/storage#1106

giuseppe added a commit to giuseppe/storage that referenced this issue Jan 14, 2022
if the container deletion fails with the simple os.RemoveAll(path),
then attempt again using system.EnsureRemoveAll(path).

The difference is that system.EnsureRemoveAll(path) tries to umount
eventual mounts that are keeping the directory busy.

It might help with: containers/podman#11594

Signed-off-by: Giuseppe Scrivano <[email protected]>
@github-actions
Copy link

A friendly reminder that this issue had no activity for 30 days.

@edsantiago
Copy link
Member Author

Still.

Please don't make me cite more examples: these are not easily captured by my flake logger, and it is super tedious for me to manually find and link to each one.

@github-actions
Copy link

A friendly reminder that this issue had no activity for 30 days.

@github-actions
Copy link

A friendly reminder that this issue had no activity for 30 days.

@edsantiago
Copy link
Member Author

Haven't seen this in, oh, five days. (Plus lots more in April and March and February; but like I said above, it's tedious to gather these).

@edsantiago
Copy link
Member Author

Seen today f36 remote rootless:

[+1790s] not ok 345 podman play --annotation
...
         # $ podman-remote --url unix:/tmp/podman_tmp_1ytS stop -a -t 0
         # 7a31b6a381e43f0ddb956072018fa33037657c471ff8a9250206fc5eb00b65dc
         # 35a84ff304a006de59c19941cad812fe5417345aa64ded5073a2837ed98b81c3
         # $ podman-remote --url unix:/tmp/podman_tmp_1ytS pod rm -t 0 -f test_pod
         # Error: error removing container 35a84ff304a006de59c19941cad812fe5417345aa64ded5073a2837ed98b81c3 root filesystem: 1 error occurred:
         # 	* unlinkat /home/some5835dude/.local/share/containers/storage/overlay/4c6ba48ec95c425623b4e8964b3fc07ee6f96d78dd9147e56535cdbff09ac39e/merged: device or resource busy

@Eugeny
Copy link

Eugeny commented Jun 30, 2022

still happening on root containers with podman 4.1.0 on EL7, happy to provide you with remote access:

WARN[0025] Found incomplete layer "831b9ae6149f440fe45423384f7ba5ca916a92ee23c8401e855a8a609127af24", deleting it 
Error: error mounting storage for container e84b32d982c7a4fb56ddf3db1a2d75351cc02934c1e7fda25f8a2d7888d47aa8: unlinkat /var/lib/containers/storage/overlay/831b9ae6149f440fe45423384f7ba5ca916a92ee23c8401e855a8a609127af24/merged: device or resource busy

@edsantiago
Copy link
Member Author

Seen just now, ubuntu 2110 remote root:

[+0654s] not ok 255 podman pod create --share
         # (from function `die' in file test/system/[helpers.bash, line 555](https://github.com/containers/podman/blob/cd16c6a6a7d153b18a7f25a5728a44d0acf64b3f/test/system/helpers.bash#L555),
         #  from function `run_podman' in file test/system/[helpers.bash, line 219](https://github.com/containers/podman/blob/cd16c6a6a7d153b18a7f25a5728a44d0acf64b3f/test/system/helpers.bash#L219),
         #  in test file test/system/[200-pod.bats, line 342](https://github.com/containers/podman/blob/cd16c6a6a7d153b18a7f25a5728a44d0acf64b3f/test/system/200-pod.bats#L342))
         #   `run_podman pod create --share +pid --replace --name $pod_name' failed with status 125
         # # podman-remote --url unix:/tmp/podman_tmp_orNk rm -t 0 --all --force --ignore
         # # podman-remote --url unix:/tmp/podman_tmp_orNk ps --all --external --format {{.ID}} {{.Names}}
         # # podman-remote --url unix:/tmp/podman_tmp_orNk images --all --format {{.Repository}}:{{.Tag}} {{.ID}}
         # quay.io/libpod/testimage:20220615 f26aa69bb3f3
         # # podman-remote --url unix:/tmp/podman_tmp_orNk pod create --share bogus --name dd1keqxlvx
         # Error: failed to make pod: Invalid kernel namespace to share: bogus. Options are: cgroup, ipc, net, pid, uts or none
         # [ rc=125 (expected) ]
         # # podman-remote --url unix:/tmp/podman_tmp_orNk pod create --share ipc --name dd1keqxlvx
         # 21ac2d3fe9939e699a6d1f4da0adf5cbed36a6e27c4b455b6f865ba832f7258c
         # # podman-remote --url unix:/tmp/podman_tmp_orNk pod inspect dd1keqxlvx --format {{.SharedNamespaces}}
         # [ipc]
         # # podman-remote --url unix:/tmp/podman_tmp_orNk run --rm --pod dd1keqxlvx --hostname foobar quay.io/libpod/testimage:20220615 hostname
         # foobar
         # # podman-remote --url unix:/tmp/podman_tmp_orNk pod create --share +pid --replace --name dd1keqxlvx
         # Error: error removing container 14a06e3b746c083e938bad3dfbbcd4869effcd5013df1e49ec3e39837edcfadf root filesystem: 1 error occurred:
         # 	* unlinkat /var/lib/containers/storage/overlay/08c4e1dad9a8889447368484b7ceadbb5551e171c5917a00d38c2f2db4049b40/merged: device or resource busy

This then caused a cascade of failures, where every test after that failed:

         # Error: error retrieving parent of image "f26aa69bb3f3de470d358b6c4e7f9de8ca165ef7f1689af1cdd446902dc27928": you may need to remove the image to resolve the error: unlinkat /var/lib/containers/storage/overlay/08c4e1dad9a8889447368484b7ceadbb5551e171c5917a00d38c2f2db4049b40/merged: device or resource busy

@Eugeny
Copy link

Eugeny commented Jul 7, 2022

Seems to be much easier to trigger when running on the 3.10 kernel

@edsantiago
Copy link
Member Author

Okay, we've got a problem. This is blowing up today. New one, f36 remote rootless

@edsantiago
Copy link
Member Author

And another one. Now f36 remote root.

@edsantiago
Copy link
Member Author

And yet another one, sys remote fedora-36 rootless.

@containers/podman-maintainers PTAL this one is getting ugly really fast.

@vrothberg
Copy link
Member

@giuseppe, if you find time, could you take a look?

@edsantiago
Copy link
Member Author

Another one: sys remote ubuntu root

@edsantiago
Copy link
Member Author

Another recent one: sys remote f36 rootless on August 8

@edsantiago
Copy link
Member Author

In this one remote ubuntu root, the unlinkat/EBUSY flake triggers the error-freeing error which renders the system unusable (#15367)

@vrothberg
Copy link
Member

@giuseppe if you find time, could you take a look? It seems related to #15367.

@github-actions
Copy link

A friendly reminder that this issue had no activity for 30 days.

@rhatdan
Copy link
Member

rhatdan commented Sep 30, 2022

Reminder @giuseppe @vrothberg

@edsantiago
Copy link
Member Author

This one could be the same one, or could be the rm-hose issue (#15367), I can't tell. podman ubuntu rootless, in kube down:

$ podman kube down /tmp/podman_bats.pH6C43/testpod.yaml
Pods stopped:
3cb812278cf782c71fb501a171658fac32c9fdbbba3055ec9d2a6ccbd003a2f8
Pods removed:
Error: 1 error occurred:
	* removing container 10413d40f879e63f9d5802f8dc26343b8b6267020014e1ae0491b26c8eea33cc from pod 3cb812278cf782c71fb501a171658fac32c9fdbbba3055ec9d2a6ccbd003a2f8: removing container 10413d40f879e63f9d5802f8dc26343b8b6267020014e1ae0491b26c8eea33cc root filesystem: 1 error occurred:
	* unlinkat /home/some25997dude/.local/share/containers/storage/overlay-containers/10413d40f879e63f9d5802f8dc26343b8b6267020014e1ae0491b26c8eea33cc/userdata/shm: device or resource busy
[ rc=125 (** EXPECTED 0 **) ]

vrothberg added a commit to vrothberg/storage that referenced this issue Oct 13, 2022
The function is being used in a number of places, notably container
removal and cleanup.  While container removal already loops over EBUSY,
cleanup does not.

To make sure that all callers of Unmount get a fair chance of unmounting
cleanly, also loop there.  I used the same values as containerd: 50
loops with 50ms sleeps.

Context: containers/podman/issues/11594
Signed-off-by: Valentin Rothberg <[email protected]>
vrothberg added a commit to vrothberg/storage that referenced this issue Oct 13, 2022
The function is being used in a number of places, notably container
removal and cleanup.  While container removal already loops over EBUSY,
cleanup does not.

To make sure that all callers of Unmount get a fair chance of unmounting
cleanly, also loop there.  I used the same values as containerd: 50
loops with 50ms sleeps.

Context: containers/podman/issues/11594
Signed-off-by: Valentin Rothberg <[email protected]>
vrothberg added a commit to vrothberg/storage that referenced this issue Oct 13, 2022
The function is being used in a number of places, notably container
removal and cleanup.  While container removal already loops over EBUSY,
cleanup does not.

To make sure that all callers of Unmount get a fair chance of unmounting
cleanly, also loop there.  I used the same values as containerd: 50
loops with 50ms sleeps.

Context: containers/podman/issues/11594
Signed-off-by: Valentin Rothberg <[email protected]>
@vrothberg
Copy link
Member

Since #16159 merged, I am going to close it. Please reopen if I am mistaken.

@github-actions github-actions bot added the locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. label Sep 7, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 7, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
flakes Flakes from Continuous Integration locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments.
Projects
None yet
Development

No branches or pull requests

7 participants