rootless: pods: unlinkat, EBUSY #7139

Closed
edsantiago opened this issue Jul 29, 2020 · 35 comments
Labels
flakes: Flakes from Continuous Integration
kind/test-flake: Categorizes issue or PR as related to test flakes.
locked - please file new issue/PR: Assist humans wanting to comment on an old issue or PR with locked comments.
rootless

Comments

@edsantiago
Member

Are pods supposed to work with podman-remote? There's a race condition somewhere in rootless pods. (There's one in rootful, too, but I'm having a really hard time getting a reproducer.)

In window 1:

$ ./bin/podman system service --timeout=0

In window 2:

$ cat >foo.sh <<'EOF'
#!/bin/bash

./bin/podman-remote pod create --infra=true --name=foo

cid=$(./bin/podman-remote run -d   --pod foo alpine sleep 1)
./bin/podman-remote run --rm --pod foo alpine true

sleep 1
/bin/time ./bin/podman-remote rm $cid

/bin/time ./bin/podman-remote pod rm foo
EOF

$ bash -x foo.sh
+ ./bin/podman-remote pod create --infra=true --name=foo
4e182754caf7e065917a258a9a5ebc4c29498df1f2381b165c10616c458ebedf
++ ./bin/podman-remote run -d --pod foo alpine sleep 1
+ cid=82822d1b6544692162118f553b857a10490a6a18ebd2ae6ecb1c088eec661c37
+ ./bin/podman-remote run --rm --pod foo alpine true
+ sleep 1
+ /bin/time ./bin/podman-remote rm 82822d1b6544692162118f553b857a10490a6a18ebd2ae6ecb1c088eec661c37
Error: error removing container 82822d1b6544692162118f553b857a10490a6a18ebd2ae6ecb1c088eec661c37 root filesystem: 1 error occurred:
        * unlinkat /home/esm/.local/share/containers/storage/overlay/3ba0c1f06abbda6296b7fd183b776c212bb5cf352100ea1ff6e93bd543204037/merged: device or resource busy


Command exited with non-zero status 125
0.03user 0.03system 0:10.12elapsed 0%CPU (0avgtext+0avgdata 29276maxresident)k   <--- note: 10s
0inputs+0outputs (0major+2127minor)pagefaults 0swaps
+ /bin/time ./bin/podman-remote pod rm foo
Error: error removing container 59f9f424ad65a88f4b55af3873a00259f74eeb8e4ab56e6bf1e01f041e822d84 root filesystem: 1 error occurred:
        * unlinkat /home/esm/.local/share/containers/storage/overlay/02b1322ceed80312346340312c1068d832727ac30a3683c6c6b2cb22073c40fa/merged: device or resource busy


Command exited with non-zero status 125
0.04user 0.03system 0:10.29elapsed 0%CPU (0avgtext+0avgdata 29292maxresident)k     <---- note: 10s
0inputs+0outputs (0major+2240minor)pagefaults 0swaps

This leaves droppings behind, the two directories listed above. I can remove them manually:

$ /bin/rm -rf ~/.local/share/containers/storage/overlay/3ba*
$ /bin/rm -rf ~/.local/share/containers/storage/overlay/02b*
(no errors)

FWIW I can't reproduce by running the commands manually in my shell; only by running the script above.

master @ 7f38774, rootless only. f32 with crun

@edsantiago edsantiago added kind/bug Categorizes issue or PR as related to a bug. remote Problem is in podman-remote labels Jul 29, 2020
@rhatdan
Member

rhatdan commented Jul 30, 2020

Yes, pods are supposed to work.

@rhatdan
Member

rhatdan commented Aug 4, 2020

I have not been able to get this to fail on my laptop with master.

@edsantiago
Member Author

Now I can't either. Worse, I can't even get the reproducer script to fail on 7f38774. I did get test/system/200-pod.bats to fail semi-consistently on that checkout, so that reassures me that the problem was real, but on 93d6320 I can't get either test to fail even with two dozen tries. Shall I try bisecting? Shall I submit a PR to reenable pod.bats tests in rootless remote, and watch closely for failures?

@edsantiago
Member Author

If I remove all the skips and the "if is_remote; then sleep" workarounds, it reproduces. Maybe this is a dup of #7119?

@github-actions

github-actions bot commented Sep 4, 2020

A friendly reminder that this issue had no activity for 30 days.

@edsantiago
Member Author

Update: this might not be remote-only: this failed today in #7556 on ubuntu-19

...'podman run --pull' test passes
...
# podman rm -a
# fae3bcbd349507832fd922a128ad538faa83414b731ff35c351b2a68eef4c24d
# 9258918e0a37501b9bc1e08cd06a29f8739db8f891d2eab9b11799bdeba9e7c7
# b4fdf2a9c2d89d8d6a7cdb3bd3d594970784c489e6ef349b4e12381387176d7f
# Error: error removing container 4e6ff23369e624e9bf13f7a637fa01238766556a8b359e36f998004654cd8944 root filesystem: 1 error occurred:
# 	* unlinkat /var/lib/containers/storage/overlay/cfc3184e47b43c91be4b13576da46cac3506bc7e019d85f10fa0c16f133a3b41/merged: device or resource busy
# [ rc=125 (** EXPECTED 0 **) ]

@edsantiago edsantiago added flakes Flakes from Continuous Integration and removed remote Problem is in podman-remote stale-issue labels Sep 8, 2020
@mheon
Member

mheon commented Sep 8, 2020

I have definitely seen this on non-remote Podman, but it's more of a symptom than a cause. Usually, something blocked Podman from unmounting the container's root filesystem, so we subsequently could not remove said root filesystem. Unfortunately, we only see the second bit, so we don't know what actually started the failure.
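One way to look for the "first bit" after a failure is to check whether the overlay "merged" directory is still listed as a mount point. The following is a minimal sketch, not anything Podman itself does: the function name mounts_under is made up for illustration, and it assumes the leftover mount is visible in the mountinfo file being read. For rootless containers the mount may only exist inside Podman's own namespaces, in which case the host's view will show nothing.

```shell
#!/bin/bash
# Hypothetical helper (not part of Podman): print every mount point listed in
# a mountinfo-format file that lies at or under the given directory.
# Field 5 of each mountinfo line is the mount point.
# Usage: mounts_under DIR [MOUNTINFO_FILE]
mounts_under() {
    local dir=${1%/}                              # drop one trailing slash
    local mountinfo=${2:-/proc/self/mountinfo}    # default: this process's view
    awk -v dir="$dir" '
        # match the directory itself, or anything below it
        $5 == dir || index($5, dir "/") == 1 { print $5 }
    ' "$mountinfo"
}
```

For example, "mounts_under ~/.local/share/containers/storage" run right after the failed rm would list any storage mounts that are still present. An empty result does not prove the mount is gone; it may just be in a different mount namespace.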

@edsantiago
Member Author

Another one: PR 7851, ubuntu root

@rhatdan rhatdan added the kind/test-flake Categorizes issue or PR as related to test flakes. label Oct 7, 2020
@edsantiago
Member Author

I think this might be the same error, although it's just plain podman-run, no pods: sys podman fedora-31 rootless host

$ podman rm --all --force
# 0daf46f689b44322eb963636a39154cdd92d4337e238dade51b844ebbb57d59d
# 2ccbc8b270c3e1ca148dc27dc81dbf5211eb22bd590cba794f9a6212d20fda43
# 69d78ab66a3d379eb9ac3b6a953cc5d8c20b98b71f484d95fb1c71d8e24a1677
# 729e15b19ea6109f1a1df4db183219246f1a5169369e76125c76c6248fc5536b
# cc527b1117d07d110d482120163f1b6a81dc448d9fe0ee5802f8e1d3bc277dda
# Error: error removing container ad3656475555979d84ebb98ac7e4dfbf6dca5dd39b4f9445bdb44604de851e34 root filesystem: 1 error occurred:
# 	* unlinkat /home/some28861dude/.local/share/containers/storage/overlay/1d454de9493f54d1633acb331d56276c8707483e9a0e68d39f02b6d0ff2345a7/merged: device or resource busy
# [ rc=125 ]

@edsantiago
Member Author

Another one in sys podman fedora-31 rootless host

@edsantiago
Member Author

One in sys podman fedora-32 rootless host (first one I've seen in f32). This and the above are in regular old 'run' tests, nothing to do with podman pod.

@edsantiago
Member Author

Another one in sys podman fedora-32 rootless host

@edsantiago
Member Author

Another one: sys podman fedora-33 rootless host

@github-actions

A friendly reminder that this issue had no activity for 30 days.

@rhatdan
Member

rhatdan commented Dec 24, 2020

@edsantiago still seeing this?

@rhatdan rhatdan removed the kind/bug Categorizes issue or PR as related to a bug. label Dec 24, 2020
@edsantiago
Member Author

It is really hard to answer that question, because the problem manifests in many different tests - so I have to examine individual logs to see if this is the cause.

Since my last report on Nov 18, I see the following:

@edsantiago
Member Author

FWIW I've tried the reproducer in comment 0, no luck. I do get a lot of these:

+ ./bin/podman-remote run --rm --pod foo alpine true
WARN[0000] Container 6c8da57da7d3df0e0051b789ade704a71aa7e1eba432aeaa320aea088cfa754c does not exist: container 6c8da57da7d3df0e0051b789ade704a71aa7e1eba432aeaa320aea088cfa754c does not exist in database: no such container

...and once in a while one of these:

+ ./bin/podman-remote run --rm --pod foo alpine true
ERRO[0000] Error removing container 8fa2ef9bb61c85733c6747672338c9a1c350ca6334744a666a9489be73484be1: error looking up container "8fa2ef9bb61c85733c6747672338c9a1c350ca6334744a666a9489be73484be1" mounts: layer not known

but in about fifteen minutes of retries, have never seen the unlinkat error.
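Retrying by hand gets tedious, so the loop can be scripted. This is a rough sketch under stated assumptions: retry_until_ebusy is a made-up name, the signature it looks for is copied from the error messages quoted in this issue, and the command being retried is expected to clean up after itself between runs.

```shell
#!/bin/bash
# Hypothetical helper: run a command up to MAX times and stop as soon as its
# combined stdout/stderr contains the unlinkat/EBUSY signature.
# Returns 0 if the signature was seen, 1 otherwise.
# Usage: retry_until_ebusy MAX COMMAND [ARGS...]
retry_until_ebusy() {
    local max=$1
    shift
    local i=1 out
    while [ "$i" -le "$max" ]; do
        out=$("$@" 2>&1)    # exit status ignored on purpose; only the text matters
        if printf '%s\n' "$out" | grep -q 'unlinkat .*: device or resource busy'; then
            printf 'reproduced on attempt %d\n%s\n' "$i" "$out"
            return 0
        fi
        i=$((i + 1))
    done
    echo "not reproduced in $max attempts"
    return 1
}
```

With the reproducer from comment 0 this would be something like "retry_until_ebusy 100 bash foo.sh". The function matches on output text only, so the WARN and ERRO lines above do not stop the loop.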

@github-actions

github-actions bot commented Mar 6, 2021

A friendly reminder that this issue had no activity for 30 days.

@edsantiago
Member Author

I can't tell if this is still an issue. I've done my Monday-morning pass over the flakes list and don't see any instances in the last two weeks. But again, the only way to know for sure is to click on and examine every single CI log, and I didn't actually do that; I only checked a sample that I hope was representative.
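If the CI logs were downloaded into a local directory, the per-log check could be reduced to a grep. This is only a sketch under that assumption: find_ebusy_logs is a made-up name, nothing here fetches logs, and the pattern is taken from the error lines quoted earlier in this issue.

```shell
#!/bin/bash
# Hypothetical helper: list every file under DIR that contains the
# unlinkat/EBUSY signature on an overlay "merged" directory.
# Prints nothing when no file matches.
# Usage: find_ebusy_logs DIR
find_ebusy_logs() {
    grep -rlE 'unlinkat .*/merged: device or resource busy' -- "$1" | sort
}
```

For example, "find_ebusy_logs ./ci-logs" prints one path per matching log. The EBUSY failures with a slightly different message, as mentioned for the pod rm and pod prune tests, would need their own pattern.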

@rhatdan
Member

rhatdan commented Mar 9, 2021

Ok I will close, and we can reopen if you see another instance.

@rhatdan rhatdan closed this as completed Mar 9, 2021
@edsantiago
Member Author

It's not dead yet:

sys: podman start --all - start all containers

# $ podman rm af1a43f994a4ea7c900a9b58107dc59286c1237678fe313da804fae8654f7079
# Error: error removing container af1a43f994a4ea7c900a9b58107dc59286c1237678fe313da804fae8654f7079 root filesystem: 1 error occurred:
# 	* unlinkat /home/some26267dude/.local/share/containers/storage/overlay/b500eadb34b65fd1f33b5676dbc39ff356e3d8229aff6cea38ea9c8bd80b5d59/merged: device or resource busy

Once again, this is a pernicious flake because it manifests in many different tests. I caught this one just now by doing a manual review of recent flakes. It's possible that there are other instances I haven't caught.

@edsantiago edsantiago reopened this May 20, 2021
@edsantiago
Member Author

Indeed, here's another:

sys: podman run - basic tests

@edsantiago edsantiago changed the title from "podman-remote: rootless: pods: unlinkat, EBUSY" to "rootless: pods: unlinkat, EBUSY" May 20, 2021
@edsantiago
Member Author

And another one (rootless)

@edsantiago
Member Author

sys: podman start --all - start all containers

This one looks similar (EBUSY) but the error message is slightly different:

Podman pod rm [It] podman pod rm removes a pod with a container

Podman pod prune [It] podman pod prune removes a pod with a stopped container

Please help, this one is getting bad.

@edsantiago
Member Author

Here are a few more from today, but all of them non-pod-related. Should I create a separate issue for the non-pod unlinkat-EBUSY flake?

sys: podman run - basic tests

sys: podman start --all - start all containers

rhatdan added a commit to rhatdan/podman that referenced this issue May 24, 2021
[NO TESTS NEEDED] This is an attempt to fix a Race condition
since it is a race it is difficult to fix.

Helps fix: containers#7139

Signed-off-by: Daniel J Walsh <[email protected]>
@vrothberg
Member

@edsantiago, have you seen this flake since commit c9609d8? I wonder if this issue was an early symptom of the recent flake of doom.

@edsantiago
Member Author

No unlinkat/EBUSY flakes since May 27. I'm going to close in hopes that it was fixed by containers/storage#926

@github-actions github-actions bot added the locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. label Sep 21, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 21, 2023

5 participants