unmounting/EINVAL flake #18831

Closed
edsantiago opened this issue Jun 8, 2023 · 25 comments · Fixed by containers/storage#1724
Labels
flakes (Flakes from Continuous Integration) · locked - please file new issue/PR · remote (Problem is in podman-remote)

Comments

@edsantiago (Member)

Two new flakes seen in #13808 (buildah treadmill):

<+009ms> # # podman-remote --url unix:/tmp/podman_tmp_9HtW pod rm c3cc31049b088d86ef22ad5b279ce5c1d01b88887116bc9f6486b9b0a28f2080
<+104ms> # Error: 2 errors occurred:
         # 	* removing container 0c7966924a988d4e00f2585da31627a0db0621e6fbd6531cea4d02779e3bcfe2 from pod c3cc31049b088d86ef22ad5b279ce5c1d01b88887116bc9f6486b9b0a28f2080: cleaning up storage: removing container 0c7966924a988d4e00f2585da31627a0db0621e6fbd6531cea4d02779e3bcfe2 root filesystem: unmounting "/var/lib/containers/storage/overlay/4e8aaf7a108d0bf91f873379b4558162620f9e4ffaca942d06ef4d40778456bc/merged": invalid argument
         # 	* removing container 55ade4c16aa0fffb758fc893be4d24c8fdaabcde14eec89df7885b42c4b03881 from pod c3cc31049b088d86ef22ad5b279ce5c1d01b88887116bc9f6486b9b0a28f2080: cleaning up storage: removing container 55ade4c16aa0fffb758fc893be4d24c8fdaabcde14eec89df7885b42c4b03881 root filesystem: unmounting "/var/lib/containers/storage/overlay/1ef8c38e52bd1d345801a0a4a7b2e0631b85965881be472dcb6db0afd6c3a65b/merged": invalid argument

Once they happen, everything is hosed: all subsequent tests fail.

This is sooooooo close to #17216 (unlinkat/EBUSY) that I'm 99.99% convinced it's the same bug, just a different manifestation (presumably because of new containers/whatever). If that turns out to be the case, I'll close this issue and merge these flakes there.

  • debian-12 : sys remote debian-12 root host boltdb [remote]
  • fedora-38-aarch64 : sys remote fedora-38-aarch64 root host boltdb [remote]
edsantiago added the flakes (Flakes from Continuous Integration) and remote (Problem is in podman-remote) labels on Jun 8, 2023
@Luap99 (Member) commented Jun 9, 2023

This is likely the result of: containers/storage#1607

cc @mtrmac

Also this change is causing the podman upgrade tests to fail.
Looking at the upgrade test setup, it is obvious why this is not compatible with the c/storage change. The upgrade tests start a container inside an old-podman-version container and then try to stop it from outside with the new podman. Inside the container the mounts are not shared, so when the outside podman tries to stop it, the umount call fails with EINVAL because the mount is not visible in the other namespace.
I am working on a fix for the tests, since the mounts should be shared, but it looks like the default c/storage behaviour is to make the container store private.
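For context, whether a mount would propagate across namespaces is visible in the optional fields of /proc/self/mountinfo: a "shared:N" tag means unmount events propagate, and the absence of such a tag means the mount is private, which is the failure mode described above. Below is a minimal sketch of checking that, assuming the usual /var/lib/containers/storage/overlay location; the helper is purely illustrative and not part of podman or its tests.

// checkprop.go: print the propagation mode of a mount point by scanning
// /proc/self/mountinfo. Illustrative only; not podman or c/storage code.
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

func main() {
	target := "/var/lib/containers/storage/overlay" // example path; adjust as needed
	if len(os.Args) > 1 {
		target = os.Args[1]
	}

	f, err := os.Open("/proc/self/mountinfo")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer f.Close()

	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		// mountinfo fields: id parent major:minor root mountpoint opts [optional fields...] - fstype source super-opts
		fields := strings.Fields(scanner.Text())
		if len(fields) < 7 || fields[4] != target {
			continue
		}
		prop := "private" // no shared:/master: tag means unmounts stay in this namespace
		for _, opt := range fields[6:] {
			if opt == "-" {
				break
			}
			if strings.HasPrefix(opt, "shared:") || strings.HasPrefix(opt, "master:") {
				prop = opt
			}
		}
		fmt.Printf("%s: %s\n", target, prop)
	}
}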

@rhatdan (Member) commented Jun 10, 2023

As I recall making the storage private was because of leaked mounts when we were using Devicemapper. Not sure if this still applies, but that was a huge problem back in the early Docker days.

@mtrmac (Collaborator) commented Jun 12, 2023

This is likely the result of: containers/storage#1607

Yes. As I said there, I don’t really know what I’m doing and I wouldn’t be too surprised if ignoring unmount failures were intentional.

But then again, there are multiple layers of metadata, reference counting, and extra checks of the tracked state vs. actual state (like https://github.com/mtrmac/storage/blob/da0df59a00e0a2cd2880c7cc82322989834e7bf4/drivers/counter.go#L54 ), so if we get here and are still trying to unmount something that isn't mounted at all, that seems like a failure worthy of some attention. By someone who understands things better than I do, if possible…
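For readers who have not looked at that code: the driver keeps a per-mountpoint reference count and is only supposed to issue the real unmount when the count drops to zero, re-checking the actual mount table to correct counts that have drifted. A much-simplified sketch of the concept (not the actual c/storage implementation, which also consults the live mount table):

// refcount.go: simplified sketch of per-mountpoint reference counting in the
// spirit of drivers/counter.go. Not the real code; for illustration only.
package refs

import "sync"

type refCounter struct {
	mu     sync.Mutex
	counts map[string]int
}

func newRefCounter() *refCounter {
	return &refCounter{counts: map[string]int{}}
}

// Increment returns the new count; the caller mounts only when it sees 1.
func (c *refCounter) Increment(path string) int {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.counts[path]++
	return c.counts[path]
}

// Decrement returns the new count; the caller unmounts only when it sees 0.
func (c *refCounter) Decrement(path string) int {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.counts[path] > 0 {
		c.counts[path]--
	}
	return c.counts[path]
}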

@mtrmac (Collaborator) commented Jun 12, 2023

it looks like the default c/storage behaviour is to make the container store use private by default.

(Just to be explicit, that should be possible to turn off with skip_mount_home=true.)
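For anyone trying that locally, the option goes in the [storage.options] table of storage.conf; a minimal excerpt, assuming the system-wide /etc/containers/storage.conf (see containers-storage.conf(5) for the authoritative syntax):

# /etc/containers/storage.conf (excerpt)
[storage.options]
# Skip the private bind mount c/storage normally creates on its home
# directory, so mounts underneath keep the host's propagation settings.
skip_mount_home = "true"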

@edsantiago (Member, Author)

This is now triggering in main (debian root).

@Luap99 (Member) commented Jun 13, 2023

it looks like the default c/storage behaviour is to make the container store private.

(Just to be explicit, that should be possible to turn off with skip_mount_home=true.)

Yes, I found that option: 68183b0

@edsantiago (Member, Author)

Is anyone working on fixing this?

  • debian-12 : sys remote debian-12 root host sqlite [remote]
    • 06-13 11:03 in [sys] podman run with slirp4ns adds correct dns address to resolv.conf
    • 06-12 15:26 in [sys] podman kube play read-only
  • fedora-37 : sys remote fedora-37 root host sqlite [remote]
    • 06-12 18:13 in [sys] podman --group-add keep-groups plus added groups

edsantiago added a commit to edsantiago/libpod that referenced this issue Jun 21, 2023
for containers#18514: if we get a timeout in teardown(), run and show
the output of podman system locks

for containers#18831: if we hit unmount/EINVAL, nothing will ever work
again, so signal all future tests to skip.

Signed-off-by: Ed Santiago <[email protected]>
@edsantiago (Member, Author)

Ping! Hi everyone, this is still hitting us in CI, and in real PRs. Most recent was last night, in a PR that (if I git correctly) includes #18999 (vendor c/*).

@github-actions (bot)

A friendly reminder that this issue had no activity for 30 days.

@edsantiago (Member, Author)

This is still hitting us hard.

@edsantiago (Member, Author)

WOW! Every single integration rootless test in a buildah bump-VM PR failed! So this has gone from a root-flake to a hard (rootless) failure. This is probably the best chance we have right now to track this down: could someone with get-ci-vm privs grab one of these new VMs and see what buildah is doing?

@edsantiago (Member, Author)

I just got this on my laptop, rootless non-remote, after ^C'ing from a test loop:

d6ca8864235b6329facf2cdbb48eef7cffe4c4c1669c7e7932cf95e6d88abeed
^C%                                                                                                                                                             
$ bin/podman rm -f -a -t0
ERRO[0000] cleaning up storage: removing container 4d1b2909ce4444604c3616e268b1be978cf58d091df4130cffbb84c1a81e4532 root filesystem: unmounting "/home/esm/.local/share/containers/storage/overlay/a7660b4fa10f5ec903b8025957747358a532a1e304fffb98dbc0347fef6fba60/merged": invalid argument 
138263879a6f1ae5b80d71ad7362d0928dbf2b45c7bc7050e80a5b357f87d579
Error: cleaning up container 4d1b2909ce4444604c3616e268b1be978cf58d091df4130cffbb84c1a81e4532: unmounting container 4d1b2909ce4444604c3616e268b1be978cf58d091df4130cffbb84c1a81e4532 storage: cleaning up container 4d1b2909ce4444604c3616e268b1be978cf58d091df4130cffbb84c1a81e4532 storage: unmounting container 4d1b2909ce4444604c3616e268b1be978cf58d091df4130cffbb84c1a81e4532 root filesystem: unmounting "/home/esm/.local/share/containers/storage/overlay/a7660b4fa10f5ec903b8025957747358a532a1e304fffb98dbc0347fef6fba60/merged": invalid argument

Unlike in this issue, a subsequent !! returned no output, no errors, and all is well. My system is not dead-hosed forever. Everything is back to normal. I don't know why it's OK here but hosed-forever when it happens in CI.

@edsantiago (Member, Author)

Haven't seen this one in the last 4-5 days, but now all of a sudden:

# setup(): removing stray external container 8d8e6fe55654 (a669ae9cac29-infra)
         # [setup] # podman rm -f 8d8e6fe55654
         # Error: removing storage for container "8d8e6fe55654": removing mount point "/var/lib/containers/storage/overlay/b14478b250486fcac32b2a4591d781a1aadb4903753c0d8023dee6339778f936/merged": device or resource busy
         # setup(): removing stray external container d0890420d690 (test_pod-test)
         # [setup] # podman rm -f d0890420d690
         # Error: removing storage for container "d0890420d690": removing mount point "/var/lib/containers/storage/overlay/b6747d4b2f72ce2a1df040c116b76fe5b0690bfa6d120e2b62b5d838765813d9/merged": device or resource busy
         # setup(): removing stray image localhost/podman-pause:4.7.0-dev-1693138408
         # setup(): removing stray image 5b53252ec597
[+1191s] not ok 591 podman kube play - default log driver
         # (from function `die' in file test/system/helpers.bash, line 722,
         #  from function `run_podman' in file test/system/helpers.bash, line 381,
         #  in test file test/system/700-play.bats, line 468)
         #   `run_podman kube play $PODMAN_TMPDIR/test.yaml' failed
<+     > # # podman-remote --url unix:/tmp/podman_tmp_GA3f rm -t 0 --all --force --ignore
<+048ms> # # podman-remote --url unix:/tmp/podman_tmp_GA3f ps --all --external --format {{.ID}} {{.Names}}
<+046ms> # 8d8e6fe55654 a669ae9cac29-infra
         # d0890420d690 test_pod-test
<+011ms> # # podman-remote --url unix:/tmp/podman_tmp_GA3f rm -f 8d8e6fe55654
<+046ms> # Error: removing storage for container "8d8e6fe55654": removing mount point "/var/lib/containers/storage/overlay/b14478b250486fcac32b2a4591d781a1aadb4903753c0d8023dee6339778f936/merged": device or resource busy
<+006ms> # [ rc=125 ]

(in f38 remote root).

To recap:

  • first it was unlinkat/EBUSY
  • then unmounting/EINVAL
  • now it's removing storage/EBUSY???!?!?!?

Still remote-only AFAICT.

@mtrmac (Collaborator) commented Aug 28, 2023

I think the original variant is pre-containers/storage#1607: Driver.Put silently ignored errors from both unmounting and deleting, and unlinkat failed in some higher layer.

The second is after containers/storage#1607: unmount failure is reported.

The third is after containers/storage#1687: unmount failure is ignored again, but deleting is now a hard failure in Driver.Put.

AFAICT none of this tinkering has substantially improved our understanding of the actual cause.

@edsantiago (Member, Author)

Seen just now in int f38 rootless non-remote:

[FAILED] DeferCleanup callback returned error: unlinkat /tmp/ginkgo892567680/root/vfs/dir/..../bin: permission denied

Not quite the same -- this is int tests, not sys; and podman, not remote; and unlinkat, not "removing storage"; and obviously it does not destroy all future test runs because int tests each happen with a different root. Maybe related, maybe not, am posting anyway.

@edsantiago (Member, Author)

Hi, I know everyone's busy and that this is a hard one, but please don't forget about it.

Seen in: int/sys podman/remote fedora-37/fedora-38/fedora-38-aarch64/rawhide root/rootless host boltdb/sqlite

@edsantiago (Member, Author)

Here's a variant that I don't recognize (f37 remote root):

<+027ms> # podman run --rm --pod new:xjbur0qugi --hostname foobar quay.io/libpod/testimage:20221018 hostname
<+374ms> Error: failed to make pod: building local pause image: deleting build container "e40fcfa654072d9434bebfed49e6b3724c1f04f569adc601e91bf39e83b3e8c0": removing mount point "/var/lib/containers/storage/overlay/d8935b35f51d27896f81816f58dd60efb8f6860f795dd5a6fddf0e3ab48c6daf/merged": device or resource busy

Curiously, this does not hose all subsequent tests. The error persists forevermore, but only in a place where it doesn't really matter. Subsequent tests actually pass.

@edsantiago (Member, Author)

Just seen in debian 13 but with yet another surprise symptom:

starting container <SHA>: runc: runc create failed: unable to start container process: exec: "touch": executable file not found in $PATH: OCI runtime attempted to invoke a command that was not found

That looks suspiciously like #17042 ("catatonit et al not found", another hard-to-track-down flake). And, blam, everything after that is hosed.

@edsantiago (Member, Author)

Followup: possibly fixed by @giuseppe in containers/storage#1722

@edsantiago (Member, Author)

Sigh..... no such luck.

@giuseppe (Member)

Was the change already vendored? I will take another look; maybe it is a different instance of the same problem and needs a similar fix.

@edsantiago (Member, Author)

Yes, this is #17831, my pet PR on which I churn CI. I believe I correctly vendored c/storage, but it's always possible I did something wrong.

PS isn't it really, really late where you are??

@vrothberg (Member)

Yes, this is #17831, my pet PR on which I churn CI. I believe I correctly vendored c/storage, but it's always possible I did something wrong.

I double-checked and think you vendored it correctly. But I am sure @giuseppe has picked up the right scent - there may be more such cases scattered in c/storage (or libpod?).

@giuseppe (Member)

The good news is that the occurrence I looked at yesterday doesn't appear in the log, and that one was easy since I managed to reproduce it every time in a VM. I am looking at the new error message right now.

@giuseppe (Member) commented Sep 28, 2023

I think the issue could be fixed with something like:

diff --git a/drivers/overlay/overlay.go b/drivers/overlay/overlay.go
index 0f6d74021..f6f0932e0 100644
--- a/drivers/overlay/overlay.go
+++ b/drivers/overlay/overlay.go
@@ -1891,12 +1891,7 @@ func (d *Driver) Put(id string) error {
                }
        }
 
-       if err := unix.Rmdir(mountpoint); err != nil && !os.IsNotExist(err) {
-               logrus.Debugf("Failed to remove mountpoint %s overlay: %s - %v", id, mountpoint, err)
-               return fmt.Errorf("removing mount point %q: %w", mountpoint, err)
-       }
-
-       return nil
+       return system.EnsureRemoveAll(mountpoint)
 }

but before I open a PR, I'd like to understand how we even get in that situation.

@edsantiago do you think we could collect some more information with:

diff --git a/test/system/helpers.bash b/test/system/helpers.bash
index 3fcd69a60..f7dad187e 100644
--- a/test/system/helpers.bash
+++ b/test/system/helpers.bash
@@ -150,10 +150,13 @@ function basic_setup() {
         echo "# setup(): removing stray external container $1 ($2)" >&3
         run_podman '?' rm -f $1
         if [[ $status -ne 0 ]]; then
+            run_podman '?' --log-level debug rm -f $1
             echo "# [setup] $_LOG_PROMPT podman rm -f $1" >&3
             for errline in "${lines[@]}"; do
                 echo "# $errline" >&3
             done
+            echo "# MOUNTINFO"
+            cat /proc/self/mountinfo | while read i; do echo "# $i"; done
             # FIXME FIXME FIXME: temporary hack for #18831. If we see the
             # unmount/EINVAL flake, nothing will ever work again.
             if [[ $output =~ unmounting.*invalid ]]; then

I've opened a test PR based on yours: #20183. I'll let it run for a while and see if I am lucky enough to hit it.

giuseppe added a commit to giuseppe/storage that referenced this issue Oct 4, 2023
Move the execution of RecordWrite() before the graphDriver Cleanup().
This addresses a longstanding issue that occurs when the Podman
cleanup process is forcibly terminated and, on some occasions, the
termination happens after the Cleanup() but before the change is
recorded.  As a result, the next user is not notified about the
change and will mount the container without the home directory
below it (the infamous /var/lib/containers/storage/overlay mount).
Then, the next time the graphDriver is initialized, the home
directory is mounted on top of the existing mounts, causing some
containers to fail with ENOENT since all their files are hidden, while
others cannot be cleaned up since their mount directory is covered by
the home directory mount.

Closes: containers/podman#18831
Closes: containers/podman#17216
Closes: containers/podman#17042

Signed-off-by: Giuseppe Scrivano <[email protected]>
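In short, the fix is an ordering change in the shutdown path; below is a rough sketch of the idea using placeholder interfaces, not the real c/storage types:

// shutdownorder.go: sketch of the re-ordering described in the commit message.
// layerStore and graphDriver are stand-ins for the real c/storage types.
package shutdownorder

type layerStore interface{ RecordWrite() error }
type graphDriver interface{ Cleanup() error }

func shutdown(layers layerStore, driver graphDriver) error {
	// Record the pending change first. If the cleanup process is killed
	// right after this point, the next user is still notified and re-mounts
	// the storage home instead of layering a fresh home mount over stale
	// per-container mounts.
	if err := layers.RecordWrite(); err != nil {
		return err
	}
	// Only then let the graph driver clean up, which unmounts the
	// /var/lib/containers/storage/overlay home mount.
	return driver.Cleanup()
}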
github-actions bot added the locked - please file new issue/PR label on Jan 4, 2024
github-actions bot locked as resolved and limited conversation to collaborators on Jan 4, 2024