
preserve-fds: file descriptor 4 is not available #15943

Closed
edsantiago opened this issue Sep 26, 2022 · 16 comments
Labels
flakes (Flakes from Continuous Integration)
locked - please file new issue/PR (Assist humans wanting to comment on an old issue or PR with locked comments)

Comments

@edsantiago
Member

Been seeing this flake for the last week:

$ podman [options] run --preserve-fds 2 quay.io/libpod/alpine:latest
Error: file descriptor 4 is not available - the preserve-fds option requires that file descriptors must be passed

Podman run [It] podman run --preserve-fds invalid fd

Seems interesting that it's only Ubuntu rootless.
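
For context, here is a minimal sketch of what the flaking check boils down to, written against os/exec rather than the actual Ginkgo e2e test (the image name and expected error string are taken from the output above; everything else is illustrative):

package main

import (
	"fmt"
	"os/exec"
	"strings"
)

func main() {
	// Run podman with --preserve-fds 2 while fds 3 and 4 are not open;
	// the test expects the error to complain about fd 3.
	out, _ := exec.Command("podman", "run", "--preserve-fds", "2",
		"quay.io/libpod/alpine:latest").CombinedOutput()
	if strings.Contains(string(out), "file descriptor 3 is not available") {
		fmt.Println("PASS: got the expected error")
	} else {
		fmt.Printf("FAIL: unexpected output: %s\n", out)
	}
}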

@edsantiago added the flakes (Flakes from Continuous Integration) label Sep 26, 2022
@edsantiago
Member Author

@containers/podman-maintainers please help. This one is starting to trigger often.

ISTM that something somewhere is now doing an open() without a corresponding close(), leaving fd 3 stuck open forever.

The first instance of the flake was seen in PR #15833 (903f551). That doesn't mean this PR is at fault (I see nothing in it that leaks fds), it just means this is the upper bound of commits to search. Given how frequent this flake is, I'm guessing the lower bound is at most two weeks before that, so, let's say 54873c1 (Aug 31). That's impossible for me to review, even if I limit it to git diff 54873c1f5 903f55 -- test/e2e. (But I tried anyway).

What I need is for someone to think back to early September and say "oh, yeah, I added this-or-that code and might have forgotten to close somewhere". Pretty please, can you do this? The longer we wait, the harder it will be to remember.

If we don't get this resolved, I'm just going to remove the preserve-fds test entirely.
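
As a diagnostic aid for the leaked-fd hypothesis, here is a minimal Linux-only sketch (not part of the e2e suite) that lists the descriptors open in the current process via /proc/self/fd; anything beyond 0/1/2 that lingers across tests would be a suspect:

package main

import (
	"fmt"
	"os"
)

func main() {
	entries, err := os.ReadDir("/proc/self/fd")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	for _, e := range entries {
		// Each entry is a symlink to whatever the fd refers to.
		// Note: the fd used by ReadDir itself will also show up here.
		target, _ := os.Readlink("/proc/self/fd/" + e.Name())
		fmt.Printf("fd %s -> %s\n", e.Name(), target)
	}
}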

@mheon
Member

mheon commented Sep 27, 2022

@edsantiago What's leading you towards that conclusion? From the error message it seems like we have too few FDs open, not too many?

@edsantiago
Member Author

My belief came from the test failure message:

Expected
   <string>: Error: file descriptor 4 is not available - the preserve-fds option requires that file descriptors must be passed
to contain substring
   <string>: file descriptor 3 is not available

...but I may be reading it completely wrong.

@mheon
Member

mheon commented Sep 27, 2022

Ack, that sounds accurate. An extra FD is occasionally getting in, and that is altering the error message.

No chance it's leaking in from CI itself passing an extra FD in?

@edsantiago
Member Author

This suggests that my hypothesis is correct, though:

$  bin/podman run --preserve-fds 2 quay.io/libpod/testimage:20220615 date
Error: file descriptor 3 is not available - the preserve-fds option requires that file descriptors must be passed
$ bin/podman run --preserve-fds 2 quay.io/libpod/testimage:20220615 date 3>/dev/null
Error: file descriptor 4 is not available - the preserve-fds option requires that file descriptors must be passed

I have no idea what CI might be doing. I did mean to add that this happens only on Ubuntu, and (best I can tell) Ubuntu VMs were not touched in the 2 weeks before the flake appeared.
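
The reproduction above fits a check that walks fds 3 through 3+N-1 and reports the first one that is not open, so a stray open fd 3 shifts the reported number to 4. A hedged sketch of such a check (an illustration of the behavior, not necessarily podman's actual implementation):

package main

import (
	"fmt"

	"golang.org/x/sys/unix"
)

// checkPreserveFDs verifies that fds 3..3+n-1 are open, mimicking the
// error text seen above.
func checkPreserveFDs(n uint) error {
	for fd := 3; fd < 3+int(n); fd++ {
		// F_GETFD fails with EBADF if the descriptor is not open.
		if _, err := unix.FcntlInt(uintptr(fd), unix.F_GETFD, 0); err != nil {
			return fmt.Errorf("file descriptor %d is not available - the preserve-fds option requires that file descriptors must be passed", fd)
		}
	}
	return nil
}

func main() {
	if err := checkPreserveFDs(2); err != nil {
		fmt.Println(err)
	}
}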

@mheon
Member

mheon commented Sep 27, 2022

Sorry if I wasn't clear, but I concur with your assessment. It's an extra FD, not a missing FD as I originally assumed.

@mheon
Member

mheon commented Sep 27, 2022

I am, however, a bit doubtful that we missed a defer file.Close() somewhere - I seem to recall us having a linter that catches such things. Will verify that after lunch.

@edsantiago
Member Author

I found one through tedious skimming:

statsDirectory, err := os.Open(c.bundlePath())

(Two, actually: there are two instances in that same file.) But that's podman, not the e2e tests, so I can't imagine why that would affect fds in the e2e test process.
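
For reference, a sketch of the pattern being flagged and its fix, simplified around the quoted line (illustrative only, not the actual podman code):

package stats // hypothetical package name, for illustration

import "os"

func readStatsDir(bundlePath string) ([]os.DirEntry, error) {
	statsDirectory, err := os.Open(bundlePath)
	if err != nil {
		return nil, err
	}
	// Without this Close, the directory fd stays open until the GC
	// finalizer on *os.File eventually reclaims it.
	defer statsDirectory.Close()

	return statsDirectory.ReadDir(-1)
}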

@Luap99
Member

Luap99 commented Sep 27, 2022

I am, however, a bit doubtful that we missed a defer file.Close() somewhere - I seem to recall us having a linter that catches such things. Will verify that after lunch.

I don't think there is a linter for that.

We use os/exec to start podman in the tests; it should not matter how many fds we leak in the test process, since os/exec will always unset all fds.
https://cs.opensource.google/go/go/+/refs/tags/go1.19.1:src/os/exec/exec.go;l=506-517
That makes me think the leak or bug must be somewhere in podman itself.

Also, the Go garbage collector will automatically close files once the variable is out of scope; since the timing changes with every run, that would explain why it flakes only sometimes.
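
A minimal sketch of both points (the file path is just an arbitrary example): os/exec hands the child only stdin/stdout/stderr plus whatever is listed in ExtraFiles (remapped to fds 3, 4, ...), and an os.File that is merely dropped is closed later by a GC finalizer at a nondeterministic time:

package main

import (
	"fmt"
	"os"
	"os/exec"
)

func main() {
	// Deliberately "leak" an fd in the parent: no Close; the runtime
	// finalizer on *os.File will close it whenever the GC runs.
	leaked, err := os.Open("/etc/hostname")
	if err != nil {
		panic(err)
	}
	_ = leaked

	// The child sees only stdin/stdout/stderr (plus ExtraFiles, unused
	// here), so the leaked fd above does not propagate to it.
	cmd := exec.Command("ls", "-l", "/proc/self/fd")
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	if err := cmd.Run(); err != nil {
		fmt.Fprintln(os.Stderr, "run failed:", err)
	}
}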

@edsantiago
Member Author

Hmmmm. Here's a new flake I haven't seen before (it isn't even in my flake logs):

Systemd activate 
           stop podman.service
   ....
         Execing /var/tmp/go/src/github.com/containers/podman/bin/podman-remote --url tcp://127.0.0.1:40595 create --tty --name top_XVlBzgba --entrypoint top quay.io/libpod/alpine_labels:latest
         
         Listening on 127.0.0.1:40595 as 3.
         Communication attempt on fd 3.     <<<<<<<===================
         Execing /var/tmp/go/src/github.com/containers/podman/bin/podman (/var/tmp/go/src/github.com/containers/podman/bin/podman --root=/tmp/podman_test1509476081/server_root system service --time=0)
         Cannot connect to Podman. Please verify your connection to the Linux system using `podman system connection list`, or try `podman machine init` and `podman machine start` to manage a new Linux VM
         Error: unable to connect to Podman socket: Get "http://d/v4.3.0/libpod/_ping": dial tcp 127.0.0.1:40595: connect: connection refused
...
blah blah test fails

Triple failure, consistent with what we're seeing in this issue. Maybe a red herring, but I'll let y'all decide.

@edsantiago
Member Author

For posterity, because it looks like I'm abandoning this again: thinking that perhaps the root cause was some test that ran before the flaking one, I wrote a script to analyze my logs, looking for tests that precede the `invalid fd` one in all cases. No luck: there are 73 such matches. Too many to really evaluate.

Among those, in 11 of the 16 log files, podman run with restart-policy always restarts containers is the test run immediately before the flake. I looked at that test, and it doesn't strike me as having anything to do with file handles.

Moving on.

@eriksjolund
Contributor

eriksjolund commented Oct 15, 2022

Just an idea:

I found a bug in rootless_linux.c; this PR seems to fix #15927. Maybe #15943 is related to that bug as well?

I don't know how to reproduce this bug, so I don't know how to test it.

@edsantiago
Member Author

I've noticed that some of these flake logs -- not all, but an unexpected number -- are correlated with this one-time flake:

stop podman.service
...
Execing podman-remote --url tcp://127.0.0.1:40357 create --tty --name top_nJObCsNV --entrypoint top quay.io/libpod/alpine_labels:latest
Listening on 127.0.0.1:40357 as 3.
Communication attempt on fd 3.
Execing podman (podman --root=/tmp/podman_test926004581/server_root system service --time=0)
Error: unexpected fd received from systemd: cannot listen on it

Example. It's only a one-time flake, always (AFAICT) passing on the first ginkgo retry.

I don't think that's the cause, because in all the cases I see, the stop podman.service flake happens after all three preserve-fds failures. It's just odd.

Here's the latest list of Cirrus flakes.

Podman run [It] podman run --preserve-fds invalid fd
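
For reference, the log above follows the systemd socket-activation convention: activated sockets are passed starting at fd 3 (hence "Listening on ... as 3") and advertised through the LISTEN_FDS and LISTEN_PID environment variables. A minimal sketch of reading that convention (this shows the protocol itself, not podman's implementation):

package main

import (
	"fmt"
	"os"
	"strconv"
)

const listenFdsStart = 3 // SD_LISTEN_FDS_START

// activatedFds returns the fd numbers systemd passed to this process,
// or an error if the process was not socket-activated.
func activatedFds() ([]int, error) {
	pid, err := strconv.Atoi(os.Getenv("LISTEN_PID"))
	if err != nil || pid != os.Getpid() {
		return nil, fmt.Errorf("not socket-activated for this process")
	}
	n, err := strconv.Atoi(os.Getenv("LISTEN_FDS"))
	if err != nil {
		return nil, fmt.Errorf("bad LISTEN_FDS: %w", err)
	}
	fds := make([]int, 0, n)
	for i := 0; i < n; i++ {
		fds = append(fds, listenFdsStart+i)
	}
	return fds, nil
}

func main() {
	fds, err := activatedFds()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	fmt.Println("activated fds:", fds)
}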

@github-actions

A friendly reminder that this issue had no activity for 30 days.

@github-actions

A friendly reminder that this issue had no activity for 30 days.

@edsantiago
Member Author

We no longer test Ubuntu. If we ever do so again, and this resurfaces, then I'll reopen.

github-actions bot added the locked - please file new issue/PR label Sep 5, 2023
github-actions bot locked as resolved and limited conversation to collaborators Sep 5, 2023