open pidfd: no such process #18452
I am under the impression that's a ... @giuseppe, @flouthoc, does ...

```go
// Track lifetime of conmon precisely using pidfd_open + poll.
// There are many cases for this to fail, for instance conmon is dead
// or pidfd_open is not supported (pre linux 5.3), so fall back to the
// traditional loop with poll + sleep
if fd, err := unix.PidfdOpen(c.state.ConmonPID, 0); err == nil {
    return fd
} else if err != unix.ENOSYS && err != unix.ESRCH {
    logrus.Debugf("PidfdOpen(%d) failed: %v", c.state.ConmonPID, err)
}
```
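For context, the "traditional loop with poll + sleep" mentioned in that comment could look roughly like the sketch below. This is a hypothetical illustration, not Podman's actual fallback code; the `waitForConmonExit` helper and the 100ms interval are made up, and unlike a pidfd this approach is racy if the PID gets reused.

```go
package example

import (
    "time"

    "golang.org/x/sys/unix"
)

// waitForConmonExit polls the given PID with signal 0 (error checking only)
// until kill(2) reports ESRCH, i.e. the process no longer exists.
// Hypothetical sketch, not Podman code.
func waitForConmonExit(pid int) {
    for {
        if err := unix.Kill(pid, 0); err == unix.ESRCH {
            return
        }
        time.Sleep(100 * time.Millisecond)
    }
}
```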
crun must report when the kill command fails, as it does in this case because the process doesn't exist. Should Podman ignore it?
That sounds good to me. Could ...
Meh ... but ...
What I have observed in #18442 is that it looks like podman sometimes tries to cleanup() twice, so it could very well be that podman calls crun twice, resulting in the error. But generally speaking, there is always a race between podman asking the runtime to kill the container and the runtime actually killing it; the process can exit in the meantime. I am not sure how we can make podman ignore it; string matching is way too fragile, considering that we support more than one OCI runtime.
I have an idea. Let me wrap something up to discuss the details in the PR.
-> #18457
There is an inherent race when stopping/killing a container with other processes attempting to do the same and also with the container having exited in the meantime. In those cases, the OCI runtime may fail to kill the container as it has already exited. Handle those races by first checking if the container state has changed before returning the error. [NO NEW TESTS NEEDED] - as it's a hard-to-test race. Fixes: containers#18452 Signed-off-by: Valentin Rothberg <[email protected]>
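For illustration, the state-check-before-error pattern described in that commit message might look like the following Go sketch. The types and names here (`Runtime`, `Container`, `syncState`) are hypothetical stand-ins and do not match Podman's real code.

```go
package example

// Runtime and Container are minimal stand-ins for Podman's OCI runtime and
// container abstractions; names are hypothetical.
type Runtime interface {
    Kill(c *Container, signal uint) error
}

type Container struct {
    runtime Runtime
    exited  bool
}

// syncState would re-read the container's state from disk or the runtime.
func (c *Container) syncState() error { return nil }

// kill tolerates the race where the container exits on its own between the
// decision to kill it and the runtime acting on it: if Kill fails but a fresh
// state read shows the container has already exited, the error is dropped.
func (c *Container) kill(signal uint) error {
    err := c.runtime.Kill(c, signal)
    if err == nil {
        return nil
    }
    if serr := c.syncState(); serr == nil && c.exited {
        return nil
    }
    return err
}
```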
Maybe a different angle: all of the above tests use kube or pods directly, so it could very well be a bug in our pod handling.
Nice idea. Let's see if the logs pop up and then we can continue digging.
A friendly reminder that this issue had no activity for 30 days. |
Seen in my logs on July 25, but I need to reiterate my usual warning about this sort of flake: it does not cause a CI failure, even in my no-retries PR, so the only way for me to see it is if it happens in the same logfile as a flake that I do detect. IOW, this is likely happening often with no indication.
If you want to make this fatal, pick #18442 in your no-retry PR.
I considered it but didn't see the point? As I recall, this was (is?) such a frequent one that it would be exhausting to sift through; like the "no logs from conmon" one. But I'll give it a try, gather some data, and report back.
Which is the reason why we need this so badly: it is clear that the teardown logic is not as robust as it should be. The amount of logrus errors/warnings is very concerning.
Agreed! But, like my no-flake-retries work, it can only succeed if we make a concerted effort to squash all bugs and then enable the checks. And until that happens, new failures are going to creep in. FWIW the ...
Seen just now on my laptop, running simple interactive commands, not in int tests or anything:

```console
$ time bin/podman kube play --service-container /tmp/foo.yaml
...
$ bin/podman kube down /tmp/foo.yaml
2023-08-23T13:39:24.557276Z: open pidfd: No such process
Pods stopped:
...
```
@giuseppe This is an error returned from crun meaning the pid no longer exists. It is called in the step where we are killing the pidfd. Should we ignore this error?
I think we do ignore the error. At least, podman does not pass it on in the exit code. The problem is that the message is shown to the user. This (1) causes tests to fail if they're expecting empty output, and (2) causes concern in end users, probably also wasted time (what happened? Are my containers ok? I need to look; maybe I should check the web).
Well, if the message is being shown, isn't it coming from crun then? Podman is probably ignoring the error, but someone is writing it to stdout/stderr.
Yes, I think we should ignore it if we try to kill the process. It means the container already exited.
A race exists where a container engine tells crun to kill a container process, where the process already died. In this case do not return an error. Fixes: containers/podman#18452 Signed-off-by: Daniel J Walsh <[email protected]>
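crun itself is written in C, but the behavior that commit message describes boils down to treating ESRCH from kill(2) as success. A minimal Go illustration of that logic (not crun's actual code; `killIfPresent` is a made-up name):

```go
package example

import "golang.org/x/sys/unix"

// killIfPresent sends the signal and treats "no such process" (ESRCH) as
// success: the target already exited, so there is nothing left to kill.
func killIfPresent(pid int, sig unix.Signal) error {
    if err := unix.Kill(pid, sig); err != nil && err != unix.ESRCH {
        return err
    }
    return nil
}
```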
Tentative fix: #19760. Not 100% sure it solves the problem, as I was not able to reproduce it locally yet to confirm.
When the "kill" command fails, print the stderr from the OCI runtime only after we check the container state. This also simplifies the code, since we don't have to hard-code the error messages we want to ignore. Closes: containers#18452 [NO NEW TESTS NEEDED] - it fixes a flake. Signed-off-by: Giuseppe Scrivano <[email protected]>
Not a new flake -- this one started cropping up in 2020, but never often enough to even merit an issue. Now, it's happening constantly in #18442, a PR that adds error checking to e2e-test cleanup.
This may not be a complete list--it has been overwhelming: