pod stats: unknown FS magic on "/run/user/4902/netns/netns-etc-etc" #17903

Closed
edsantiago opened this issue Mar 23, 2023 · 25 comments · Fixed by #18083
Labels: flakes (Flakes from Continuous Integration), kind/bug (Categorizes issue or PR as related to a bug.), locked - please file new issue/PR

Comments

@edsantiago
Member

Seen in CI, f37 rootless sqlite:

podman pod stats on a specific running pod
...
$ podman [options] run --http-proxy=false --pod cec7af2e2f8e0479a91d94bff88cf8482c2907f7644eb7c90a9ea01b2f13ff22 -d quay.io/libpod/alpine:latest top
eed6053f6f0e8a2f883464f994bb30d09304274e1b1e1cdb2eb52a0f49ae3985
$ podman [options] pod stats --no-stream cec7af2e2f8e0479a91d94bff88cf8482c2907f7644eb7c90a9ea01b2f13ff22
Error: unknown FS magic on "/run/user/4902/netns/netns-a2804d97-802b-9e57-2e40-11c1206c102c": 1021994
[AfterEach] Podman pod stats
  /var/tmp/go/src/github.com/containers/podman/test/e2e/pod_stats_test.go:33
$ podman [options] pod rm -fa -t 0
time="2023-03-23T15:11:45-05:00" level=error msg="Unable to clean up network for container 652344ca5642edd24b93f810fcb9ebbb1c0969195a67454bc05b8a67e7ed3185: \"unmounting network namespace for container 652344ca5642edd24b93f810fcb9ebbb1c0969195a67454bc05b8a67e7ed3185: failed to unmount NS: at /run/user/4902/netns/netns-a2804d97-802b-9e57-2e40-11c1206c102c: invalid argument\""
         cec7af2e2f8e0479a91d94bff88cf8482c2907f7644eb7c90a9ea01b2f13ff22
edsantiago added the flakes (Flakes from Continuous Integration) label Mar 23, 2023
@Luap99
Member

Luap99 commented Mar 24, 2023

The only way I can see this happening is if /run/user/4902/netns/netns-a2804d97-802b-9e57-2e40-11c1206c102c was unmounted after the pod was started. The question is why.
My best guess is that we somehow used the same netns for two different containers. Reading
https://github.com/containers/common/blob/main/pkg/netns/netns_linux.go#L103-L104 I see zero duplicate detection, so if the RNG returns the same name twice it will cause issues. That is very unlikely, but in theory it could cause this.
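
For illustration, here is a minimal sketch (my own, not the c/common implementation) of the duplicate check described above: keep drawing random names until one is not already present in the netns run directory. The package and function names are invented for the example.

package netnsname

import (
    "crypto/rand"
    "errors"
    "fmt"
    "os"
    "path/filepath"
)

// newNSName returns a random netns name that does not already exist in runDir.
func newNSName(runDir string) (string, error) {
    for i := 0; i < 10000; i++ {
        b := make([]byte, 16)
        if _, err := rand.Read(b); err != nil {
            return "", err
        }
        name := fmt.Sprintf("netns-%x-%x-%x-%x-%x", b[0:4], b[4:6], b[6:8], b[8:10], b[10:16])
        // The crucial part that was missing: reject names that are already in use.
        if _, err := os.Stat(filepath.Join(runDir, name)); errors.Is(err, os.ErrNotExist) {
            return name, nil
        }
    }
    return "", errors.New("could not find a free netns name")
}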

Luap99 self-assigned this Mar 24, 2023
Luap99 added a commit to Luap99/common that referenced this issue Mar 24, 2023
The code picks a random name but never checked whether that name was already
in use. If the RNG generates the same name again we will end up with
containers sharing the same netns, which can be very bad. It is very
unlikely to happen, but it is the only explanation I can find for
containers/podman#17903.

Signed-off-by: Paul Holzinger <[email protected]>
@Luap99
Member

Luap99 commented Mar 24, 2023

I think https://github.com/containers/common/pull/1381 should prevent this, but I cannot be sure what the cause was here.

@edsantiago
Member Author

Could this be related, or should I file a new issue?

  podman run container with systemd PID1
...
$ podman [options] rm -fa -t 0
time="2023-03-27T20:34:09-05:00" level=error msg="Unable to clean up network 
    for container b717c2ad753aa77f65e18a8bd64de0e62f336d954411f361c0a427187bb80155: 
    \"unmounting network namespace for container b717c2ad753aa77f65e18a8bd64de0e62f336d954411f361c0a427187bb80155:
     failed to unmount NS: at /run/user/6098/netns/netns-5db6ed3c-beda-ab7d-9006-4442e676aa85: invalid argument\""

@Luap99
Member

Luap99 commented Mar 28, 2023

It is the same error, so yes, it should be the same problem. However, if we see it more often, then I no longer think my fix is enough.

It could still be that we try to unmount the netns twice in podman. This is also the first time you have reported this bug, and both logs say sqlite, so it could very well be a new sqlite bug.

@edsantiago
Member Author

both logs say sqlite so it can be very well a new sqlite bug.

Big grain of salt: all of these are being found in #17831 which, yes, is hammering on sqlite, but the big big big deal in that PR is disabling the ginkgo triple-flake-retry. It is possible, even likely, that these are all bugs that have always been there but that we never see because of the ginkgo triple-retry.

@edsantiago
Member Author

Oooh, here's an interesting one, f37 rootless boltdb, showing two failures very close together in time. That suggests a concurrency issue: something needs to be locked to prevent these tests from running at the same time.

  podman stats on a running container
...
Error: unknown FS magic on "/run/user/2657/netns/netns-a70ab2b5-2ef4-2369-5a47-29b423e10a55": 1021994

...
  podman stats on all containers
...
Error: unknown FS magic on "/run/user/2657/netns/netns-0c7f9246-ecb5-754a-dd54-3b2723c4609f": 1021994

edsantiago added the kind/bug (Categorizes issue or PR as related to a bug.) label Mar 28, 2023
@edsantiago
Member Author

Here's an interesting new variation on the theme. In the system tests (the usual FS-magic flake is in the integration tests), debian rootless boltdb:

[+1170s] not ok 421 podman network create
...
$ podman network create --subnet 172.23.105.0/24 testnet-8cBnbGlZxb
testnet-8cBnbGlZxb
...
$ podman run --rm --network testnet-8cBnbGlZxb quay.io/libpod/testimage:20221018 ip a
Error: creating rootless network namespace: open /run/user/3152/netns/rootless-netns-c24c98c9183e681a4c80: file exists

From that point on, pretty much all the rest of the tests fail. The word "magic" does not appear anywhere in the test log, but I still think this is related. If it isn't, lmk and I'll file a separate issue.

@Luap99
Member

Luap99 commented Apr 3, 2023

This is definitely caused by my change; "file exists" was not a problem before.
I thought the rootless netns was locked properly, but apparently there is a race somewhere.

@Luap99
Member

Luap99 commented Apr 3, 2023

Mhh, I am not really sure where it went wrong.
Looking at the code it is definitely locked:

lfile := filepath.Join(runDir, "rootless-netns.lock")
lock, err := lockfile.GetLockFile(lfile)
if err != nil {
    return nil, fmt.Errorf("failed to get rootless-netns lockfile: %w", err)
}
lock.Lock()
defer func() {
    // In case of an error (early exit) rootlessNetNS will be nil.
    // Make sure to unlock otherwise we could deadlock.
    if rootlessNetNS == nil {
        lock.Unlock()
    }
}()
rootlessNetNsDir := filepath.Join(runDir, rootlessNetNsName)
err = os.MkdirAll(rootlessNetNsDir, 0700)
if err != nil {
    return nil, fmt.Errorf("could not create rootless-netns directory: %w", err)
}
nsDir, err := netns.GetNSRunDir()
if err != nil {
    return nil, err
}
// create a hash from the static dir
// the cleanup will check if there are running containers
// if you run several libpod instances with different root/runroot directories this check will fail
// we want one netns for each libpod static dir so we use the hash to prevent name collisions
hash := sha256.Sum256([]byte(r.config.Engine.StaticDir))
netnsName := fmt.Sprintf("%s-%x", rootlessNetNsName, hash[:10])
path := filepath.Join(nsDir, netnsName)
ns, err := ns.GetNS(path)
if err != nil {
    if !new {
        // return an error if we could not get the namespace and should not create one
        return nil, fmt.Errorf("getting rootless network namespace: %w", err)
    }
    // create a new namespace
    logrus.Debugf("creating rootless network namespace with name %q", netnsName)
    ns, err = netns.NewNSWithName(netnsName)
    if err != nil {
        return nil, fmt.Errorf("creating rootless network namespace: %w", err)
    }

Somehow ns.GetNS(path) is failing, but netns.NewNSWithName(netnsName) is complaining about "file exists".
The only way this can happen is when the file exists but there is no netns bind-mounted on it. In that case ns.GetNS(path) fails with "unknown FS magic on ...", so it looks very relevant to this issue.

It is not quite clear what is going on but I think something must be unmounting the netns bind mounts.
Or maybe a podman process got killed between the unmount() and remove() call: https://github.com/containers/common/blob/e572bd0cf69426db7e94aaf2afa44742f3c118be/pkg/netns/netns_linux.go#L191-L197
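
For context, here is a minimal sketch (my own illustration, not the actual c/common code) of the kind of check that produces the "unknown FS magic" error: a valid netns path must have a namespace bind-mounted on it. The reported value 1021994 is TMPFS_MAGIC, i.e. only the bare file on the /run tmpfs is left.

package netnscheck

import (
    "fmt"

    "golang.org/x/sys/unix"
)

const (
    nsfsMagic   = 0x6e736673 // NSFS_MAGIC: a namespace is mounted here
    procfsMagic = 0x9fa0     // PROC_SUPER_MAGIC: pre-nsfs kernels
)

// validateNSPath fails unless path has a namespace bind-mounted on it.
func validateNSPath(path string) error {
    var st unix.Statfs_t
    if err := unix.Statfs(path, &st); err != nil {
        return err
    }
    switch st.Type {
    case nsfsMagic, procfsMagic:
        return nil
    default:
        // 0x1021994 is TMPFS_MAGIC: the bind mount is gone and we are
        // looking at the plain file on the /run tmpfs.
        return fmt.Errorf("unknown FS magic on %q: %x", path, st.Type)
    }
}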

@edsantiago
Member Author

Or maybe a podman process got killed

These are system tests. That's not impossible (there are a few explicit background podman jobs), but it seems unlikely.

@Luap99
Member

Luap99 commented Apr 3, 2023

Or maybe a podman process got killed

These are system tests. That's not impossible (there are a few explicit background podman jobs), but it seems unlikely.

Yes, I agree that is very unlikely, but I would also put "something unmounts the netns" in the very-unlikely category.

@Luap99
Member

Luap99 commented Apr 3, 2023

@edsantiago Could you confirm that this only happens rootless?

I think I understand now what may trigger this. In order for the netns to be valid it must be a child of the podman user namespace. This namespace is kept alive by the podman pause process.

I can reproduce this the following way:

$ podman unshare --rootless-netns ip a
$ pkill "podman pause"
$ podman unshare --rootless-netns ip a
Error: creating rootless network namespace: open /run/user/1000/netns/rootless-netns-294b745a758a274d54d9: file exists

and if I patch it to show the error from ns.GetNS() we see that this is the same issue:
Error: getting rootless network namespace: unknown FS magic on "/run/user/1000/netns/rootless-netns-294b745a758a274d54d9": 1021994

This is at least the only thing that can consistently trigger this, so the question then is: what killed the podman pause process? One command which does that is podman system migrate, but I cannot find any reference to it in either the e2e or system tests.

@giuseppe Any idea?

@edsantiago
Member Author

Nice detective work. Unfortunately I can't confirm anything because the log I linked above is the only instance I've found of that failure. But if we look at the sequence of tests:

ok 417 podman pod manages /etc/hosts correctly  <<<<<<<<<<<<<<<<<<<<<<<<<<<
ok 418 podman run with slirp4ns assigns correct addresses to /etc/hosts
ok 419 podman run with slirp4ns adds correct dns address to resolv.conf
ok 420 podman run with slirp4ns assigns correct ip address container
not ok 421 podman network create

...and we look at the source of the 417 test:

run_podman pod rm $pod_name
is "$output" "$pid" "Only ID in output (no extra errors)"
# Clean up
run_podman rmi $(pause_image)

...we see pod and pause-image cleanup, which is consistent with your findings (even if the timing doesn't appear to be). Is it worth poking into pod removal code to see if something is not being cleaned up properly?

@Luap99
Member

Luap99 commented Apr 3, 2023

The pod pause image is different and should not be related to the podman pause process.

The podman pause process is created by the first podman command you run: podman creates the user namespace once, and to keep it alive it needs a running process (podman pause). All following podman processes join this user and mount namespace via setns(). You can find the pause pid in $XDG_RUNTIME_DIR/libpod/tmp/pause.pid
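
As a hypothetical helper (not podman code), this is how one could locate the default pause pid and check whether that process is still alive; the package and function names are made up for the example.

package pausepid

import (
    "os"
    "path/filepath"
    "strconv"
    "strings"
    "syscall"
)

// Alive reads $XDG_RUNTIME_DIR/libpod/tmp/pause.pid and probes the process
// with signal 0, which checks for existence without delivering a signal.
func Alive() (pid int, alive bool, err error) {
    path := filepath.Join(os.Getenv("XDG_RUNTIME_DIR"), "libpod", "tmp", "pause.pid")
    data, err := os.ReadFile(path)
    if err != nil {
        return 0, false, err
    }
    pid, err = strconv.Atoi(strings.TrimSpace(string(data)))
    if err != nil {
        return 0, false, err
    }
    return pid, syscall.Kill(pid, 0) == nil, nil
}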

@Luap99
Member

Luap99 commented Apr 3, 2023

#18024 should at least let us recover so not all tests blow up when it happens. This should restore the previous behavior but it doesn't address the root cause.

@Luap99
Member

Luap99 commented Apr 4, 2023

IMO, we should either clean up that directory or not fail if a file already exists. I think your patch is correct.

OK, let's go with that for now.

However, we still need to figure out whether the pause process really got killed in the CI environment. One possibility is to patch the tests to cat $XDG_RUNTIME_DIR/libpod/tmp/pause.pid before and after each test and compare whether the pid changed.

@edsantiago
Member Author

I've been playing with variations of --this --that in my PR, and just noted this one, f37 rootless:

Execing [...]/podman --storage-opt vfs.imagestore=/tmp/imagecachedir \
    --root /tmp/podman_test986753573/server_root \
     --tmpdir /tmp/podman_test986753573 \
     --runtime crun \
     --cgroup-manager systemd \
     --db-backend sqlite \
     --storage-driver vfs \
     rm -f top_RPXePqrY
  time="2023-04-04T19:59:20-05:00" level=error \
     msg="Unable to clean up network for container <id>: 
      \"unmounting network namespace for container <id>:
      failed to unmount NS: at /run/user/5575/netns/netns-e4831cc7-2e54-668d-84c1-a34e39804cf9:
     invalid argument\""
         top_RPXePqrY

Granted, for rootless I should've included --net-something /private-dir but I was lazy.

Anyhow, this was a push from tonight, and it includes #18024, so I was surprised to see that message. Quitting for the day.

@Luap99
Member

Luap99 commented Apr 5, 2023

I've been playing with variations of --this --that in my PR, and just noted this one [...] Anyhow, this was a push from tonight, and it includes #18024, so I was surprised to see that message.

Note that #18024 made no attempt at fixing this issue; all it fixed was the regression I introduced with the c/common change, so only the error from #17903 (comment) was fixed.

@Luap99
Member

Luap99 commented Apr 5, 2023

OK, --this --that was a good hint. I now understand where it goes wrong; here is a reproducer.

First make sure you have no podman processes or containers running, then kill the pause process to start from a clean system:

$ pkill "podman pause"
$ dir=$(mktemp -d)
$ podman --root $dir/root --runroot $dir/runroot --tmpdir $dir/tmp  run -d --name test alpine top
# Now this process will create a pause.pid file but not under $XDG_RUNTIME_DIR/libpod/tmp/pause.pid,
# instead it uses the --tmpdir value so $dir/tmp/pause.pid
# If you then use a regular podman command without the custom options:
$ podman run -d --name test alpine top
$ pgrep "podman pause"
18786
19198
# The normal command had no clue about the pause process so it created its own, so far no problem with that.
# The problem happens when we now use the command with the options again:
$ podman --root $dir/root --runroot $dir/runroot --tmpdir $dir/tmp  stop test
ERRO[0000] Unable to clean up network for container 6fc14126834157223df6318300affcce613b0a3acc0cdeec5e3e8df55a1335c1: "unmounting network namespace for container 6fc14126834157223df6318300affcce613b0a3acc0cdeec5e3e8df55a1335c1: failed to unmount NS: at /run/user/1000/netns/netns-34ab0c13-c7e0-6c3e-2f98-32d2f67bb4e1: invalid argument"

The bug here is that there is a shortcut in pkg/rootless/rootless_linux.c which always runs before any Go code (including the option parsing). The C code just looks at $XDG_RUNTIME_DIR/libpod/tmp/pause.pid and immediately joins the namespace of that process.
The shortcut is only there to join: if the pid file does not exist, it does nothing and lets podman handle the namespace and pause-process creation.

That is why, to reproduce this, you have to run the --tmpdir command first, before a pause pid exists at $XDG_RUNTIME_DIR/libpod/tmp/pause.pid. If you do it the other way around, even the process with --tmpdir would have joined the namespace via the shortcut, so it would use the same one and not cause issues.
See the possibility for flakes here?

The fix is of course to not take the shortcut when we see --tmpdir, so the podman Go code can handle it. Note that we already special-case other commands:

if (strcmp (argv[argc], "mount") == 0
    || strcmp (argv[argc], "machine") == 0
    || strcmp (argv[argc], "version") == 0
    || strcmp (argv[argc], "context") == 0
    || strcmp (argv[argc], "search") == 0
    || (strcmp (argv[argc], "system") == 0 && argv[argc+1] && strcmp (argv[argc+1], "service") != 0))
  {
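
To make the mismatch concrete, here is a hypothetical Go illustration (not podman code): the Go side derives the pause.pid location from --tmpdir, while the C shortcut always assumes the default location, so with a custom --tmpdir the two can point at different pause processes.

package main

import (
    "fmt"
    "os"
    "path/filepath"
)

// goSidePausePidPath mirrors the Go behaviour: honour a custom tmpdir if set.
func goSidePausePidPath(tmpdir string) string {
    if tmpdir != "" {
        return filepath.Join(tmpdir, "pause.pid")
    }
    return defaultPausePidPath()
}

// defaultPausePidPath is the only location the C shortcut knows about.
func defaultPausePidPath() string {
    return filepath.Join(os.Getenv("XDG_RUNTIME_DIR"), "libpod", "tmp", "pause.pid")
}

func main() {
    withTmpdir := goSidePausePidPath("/tmp/dir/tmp")
    shortcut := defaultPausePidPath()
    // With --tmpdir the paths differ, so the shortcut may join the user
    // namespace of a different pause process than the Go code expects.
    fmt.Println(withTmpdir, shortcut, withTmpdir != shortcut)
}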

Luap99 added a commit to Luap99/libpod that referenced this issue Apr 5, 2023
When using --tmpdir for the podman cli we use this as location for the
pause.pid file. However the c shortcut code has no idea about this
option and always assumes XDG_RUNTIME_DIR/libpod/tmp. This can cause us
to join the wrong user namespace which leads to very weird issues as
mounts are missing there.

Fixes containers#17903

Signed-off-by: Paul Holzinger <[email protected]>
@edsantiago
Member Author

Slightly different symptom seen yesterday, f37 rootless:

  podman play kube --no-host
...
$ podman [options] exec <workerCID> cat /etc/hosts
Error: OCI runtime error: crun: fchown std stream 0: Bad file descriptor

That causes the test to fail. Then, in cleanup:

[AfterEach] Podman play kube
$ podman [options] pod rm -fa -t 0
time="2023-04-05T19:00:41-05:00" level=error msg="getting rootless network namespace:
          unknown FS magic on \"/run/user/6459/netns/rootless-netns-0e73abf2193b2cbf501b\": 1021994"
time="2023-04-05T19:00:41-05:00" level=error msg="Unable to clean up network for container <infraCID>:
          \"unmounting network namespace for container <infraCID>:
          failed to unmount NS: at /run/user/6459/netns/netns-b2c8e3d5-a67a-913a-7c99-0e32a0092781: invalid argument\""

I'm assigning that flake to this issue.

@Luap99
Member

Luap99 commented Apr 11, 2023

Still happening:
#18083 (comment)

Luap99 reopened this Apr 11, 2023
@edsantiago
Member Author

Locks are dangerous of course, but isn't there some way that podman system reset tests can be wrapped inside a lock? That command is too dangerous to run in parallel, even with every possible --root --tmpdir --networkdir --everything option.

@Luap99
Member

Luap99 commented Apr 12, 2023

I only see two callers of system reset in the e2e tests. I think we can convert them into system tests to fix this issue, and then make sure nobody ever adds new system migrate or reset tests to e2e.

@Luap99
Member

Luap99 commented Apr 12, 2023

OK, new plan: my goal is to update to ginkgo v2 (#18163) so we can use decorators to mark a single test as serial: https://onsi.github.io/ginkgo/#serial-specs
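
For reference, a small sketch of what that would look like with the ginkgo v2 Serial decorator (spec name and body are placeholders, assuming the standard ginkgo v2 and gomega APIs):

package e2e_test

import (
    . "github.com/onsi/ginkgo/v2"
    . "github.com/onsi/gomega"
)

// Specs marked Serial never run in parallel with any other spec.
var _ = Describe("podman system reset", Serial, func() {
    It("does not race with other tests", func() {
        Expect(true).To(BeTrue()) // placeholder body
    })
})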
