
open pidfd: no such process #18452

Closed
edsantiago opened this issue May 4, 2023 · 24 comments · Fixed by #19760
Labels
flakes (Flakes from Continuous Integration) · locked - please file new issue/PR (Assist humans wanting to comment on an old issue or PR with locked comments)

Comments

@edsantiago (Member)

Not a new flake -- this one started cropping up in 2020, but never often enough to even merit an issue. Now, it's happening constantly in #18442, a PR that adds error checking to e2e-test cleanup.

$ podman [options] stop --all -t 0
open pidfd: No such process

This may not be a complete list--it has been overwhelming:

  • fedora-37 : int podman fedora-37 root container boltdb
    • 05-03 11:21 in podman pod container can override pod not sharing pid
  • fedora-38 : int podman fedora-38 root container boltdb
  • fedora-38 : int podman fedora-38 root host boltdb
    • 05-03 11:19 in podman generate kube sharing pid namespace
  • fedora-38 : int podman fedora-38 root host sqlite
    • 05-03 11:19 in podman play kube should share ipc,net,uts when shareProcessNamespace is set
    • 05-03 11:19 in podman play kube test with init containers and annotation set
    • 05-03 11:19 in podman play kube multi doc yaml with multiple services, pods and deployments
    • 05-03 11:19 in podman play kube with replicas limits the count to 1 and emits a warning
    • 05-03 11:19 in podman play kube override with udp should keep tcp from YAML file
  • fedora-38 : int podman fedora-38 rootless host sqlite
    • 05-03 11:18 in podman play kube test with init containers and annotation set
    • 05-03 11:18 in podman play kube deployment more than 1 replica test correct command
    • 05-03 11:18 in podman play kube multi doc yaml with multiple services, pods and deployments
  • rawhide : int podman rawhide root host sqlite
    • 05-03 11:18 in podman generate kube with volume
    • 05-03 11:18 in podman generate kube sharing pid namespace
    • 05-03 11:18 in podman play kube should share ipc,net,uts when shareProcessNamespace is set
    • 05-03 11:18 in podman play kube expose character device inside container
    • 05-03 11:18 in podman play kube override with udp should keep tcp from YAML file
    • 05-03 11:18 in podman pod correctly sets up PIDNS
  • rawhide : int podman rawhide rootless host sqlite
    • 05-03 11:17 in podman play kube support container liveness probe
    • 05-03 11:17 in podman play kube test with init containers and annotation set
    • 05-03 11:17 in podman play kube test with DirectoryOrCreate HostPath type volume and non-existent directory path
    • 05-03 11:17 in podman generate and reimport kube on pod
@edsantiago added the 'flakes' (Flakes from Continuous Integration) label May 4, 2023
@vrothberg (Member) commented May 4, 2023

I am under the impression that's a crun error (see https://github.com/containers/crun/blob/645b75b04613debb0a3bdf122fdb0b22d835db68/src/libcrun/linux.c#L5320).

@giuseppe, @flouthoc, does crun need some error handling as done in Podman?

// Track lifetime of conmon precisely using pidfd_open + poll.           
// There are many cases for this to fail, for instance conmon is dead    
// or pidfd_open is not supported (pre linux 5.3), so fall back to the   
// traditional loop with poll + sleep                                    
if fd, err := unix.PidfdOpen(c.state.ConmonPID, 0); err == nil {         
        return fd                                                        
} else if err != unix.ENOSYS && err != unix.ESRCH {                      
        logrus.Debugf("PidfdOpen(%d) failed: %v", c.state.ConmonPID, err)
}                                                                        
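For illustration, here is a minimal, self-contained sketch of the strategy that comment describes (pidfd_open + poll, falling back to a kill(pid, 0) + sleep loop). It assumes golang.org/x/sys/unix; the function name and the 100ms interval are invented for the example and are not Podman's actual code.

package conmonwait

import (
	"time"

	"golang.org/x/sys/unix"
)

// waitForConmonExit (hypothetical helper) blocks until the given conmon PID
// has exited.
func waitForConmonExit(conmonPID int) {
	// Preferred path: a pidfd becomes readable (POLLIN) once the process exits.
	if fd, err := unix.PidfdOpen(conmonPID, 0); err == nil {
		defer unix.Close(fd)
		fds := []unix.PollFd{{Fd: int32(fd), Events: unix.POLLIN}}
		_, _ = unix.Poll(fds, -1)
		return
	}
	// Fallback for old kernels or an already-dead conmon: signal 0 only
	// checks that the process exists, so loop until it stops existing.
	for unix.Kill(conmonPID, 0) == nil {
		time.Sleep(100 * time.Millisecond)
	}
}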

@giuseppe (Member) commented May 4, 2023

crun must report when the kill command fails, as it does in this case because the process doesn't exist. Should Podman ignore it?

@vrothberg (Member)

crun must report when the kill command fails, as it does in this case because the process doesn't exist. Should Podman ignore it?

That sounds good to me. Could crun return with a specific exit code such that Podman can discriminate a no-such-pid error from others? I'd like to avoid parsing error strings.

@vrothberg (Member)

Meh ... but runc and other runtimes wouldn't behave that way.

@Luap99 (Member) commented May 4, 2023

What I have observed in #18442 is that it looks like podman sometimes tries to cleanup() twice. So it could very well be that podman calls crun twice, resulting in the error.

But generally speaking there is always a race between podman calling the runtime to kill the container and the runtime actually trying to kill it: the process can exit in the meantime. I am not sure how we can make podman ignore it; string matching is way too fragile considering that we support more than one OCI runtime.

@vrothberg (Member)

I have an idea. Let me wrap something up to discuss the details in the PR.

@vrothberg (Member)

-> #18457

vrothberg added a commit to vrothberg/libpod that referenced this issue May 4, 2023
There is an inherent race when stopping/killing a container with other
processes attempting to do the same and also with the container having
exited in the meantime.  In those cases, the OCI runtime may fail to
kill the container as it has already exited.

Handle those races by first checking if the container state has changed
before returning the error.

[NO NEW TESTS NEEDED] - as it's a hard to test race.

Fixes: containers#18452
Signed-off-by: Valentin Rothberg <[email protected]>
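As a rough sketch of the approach in that commit (names invented here, not libpod's real API), the point is the ordering: when the runtime's kill fails, re-read the container state before deciding the error is real.

package stopsketch

// ctrState stands in for libpod's container state enum.
type ctrState int

const (
	stateRunning ctrState = iota
	stateStopped
	stateExited
)

// stopIgnoringRace (hypothetical) wraps an OCI-runtime kill call. If the kill
// fails but a fresh state lookup shows the container already stopped or
// exited, the failure was just the race described above and is swallowed.
func stopIgnoringRace(kill func() error, refreshState func() (ctrState, error)) error {
	if err := kill(); err != nil {
		if st, stErr := refreshState(); stErr != nil || (st != stateStopped && st != stateExited) {
			return err
		}
		// The container exited on its own in the meantime; nothing to report.
	}
	return nil
}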
@Luap99 (Member) commented May 4, 2023

Maybe a different angle: all of the above tests use kube or pods directly, so it could very well be a bug in our pod handling.

@vrothberg (Member)

Maybe a different angle: all of the above tests use kube or pods directly, so it could very well be a bug in our pod handling.

Nice idea. Let's see if the logs pop up and then we can continue digging.

@github-actions bot commented Jun 4, 2023

A friendly reminder that this issue had no activity for 30 days.

@github-actions bot commented Jul 7, 2023

A friendly reminder that this issue had no activity for 30 days.

@github-actions bot

A friendly reminder that this issue had no activity for 30 days.

@edsantiago (Member, Author)

Seen in my logs on July 25 but I need to reiterate my usual warning about this sort of flake: it does not cause a CI failure, even in my no-retries PR, so the only way for me to see it is if it happens in the same logfile as a flake that I do detect. IOW, this is likely happening often with no indication.

@Luap99 (Member) commented Aug 15, 2023

If you want to make this fatal, pick #18442 into your no-retry PR.

@edsantiago (Member, Author)

I considered it but didn't see the point? As I recall, this was (is?) such a frequent one that it would be exhausting to sift through; like the "no logs from conmon" one.

But I'll give it a try, gather some data, and report back.

@edsantiago (Member, Author)

Oh.... now I remember why I stopped doing this.

@Luap99 (Member) commented Aug 15, 2023

Which is the reason why we need this so badly: it is clear that the teardown logic is not as robust as it should be. The number of logrus errors/warnings is very concerning.

@edsantiago (Member, Author)

Agreed! But, like my no-flake-retries work, it can only succeed if we make a concerted effort to squash all bugs and then enable the checks. And until that happens, new failures are going to creep in.

FWIW the kube failure is a constant, not a flake, so it might be easy to fix. At first glance maybe kube down could fix it. I'll try to take a look today or tomorrow.

@edsantiago (Member, Author)

Seen just now on my laptop, running simple interactive commands, not in int tests or anything:

$ time bin/podman kube play --service-container /tmp/foo.yaml
...
$ bin/podman kube down /tmp/foo.yaml
2023-08-23T13:39:24.557276Z: open pidfd: No such process
Pods stopped:
...

@rhatdan (Member) commented Aug 23, 2023

@giuseppe This is an error returned from crun meaning the pid no longer exists. It happens in the step where we are killing via the pidfd. Should we ignore this error?

@edsantiago (Member, Author)

I think we do ignore the error. At least, podman does not pass it on in the exit code.

The problem is that the message is shown to the user. This (1) causes tests to fail if they're expecting empty output, and (2) causes concern for end users and probably also wastes their time (what happened? Are my containers OK? I need to look; maybe I should check the web).

@rhatdan (Member) commented Aug 23, 2023

Well if the message is being shown, isn't it coming from crun then? Podman is probably ignoring the error, but someone is writing it to stdout/stderr.

@giuseppe (Member)

@giuseppe This is an error returned from crun meaning the pid no longer exists. It happens in the step where we are killing via the pidfd. Should we ignore this error?

Yes, I think we should ignore it when we try to kill the process: it means the container already exited.

rhatdan added a commit to rhatdan/crun that referenced this issue Aug 25, 2023
A race exists where a container engine tells crun to kill a container
process where the process has already died. In this case, do not return
an error.

Fixes: containers/podman#18452

Signed-off-by: Daniel J Walsh <[email protected]>
rhatdan added a commit to rhatdan/crun that referenced this issue Aug 26, 2023
A race exists where a container engine tells crun to kill a container
process where the process has already died. In this case, do not return
an error.

Fixes: containers/podman#18452

Signed-off-by: Daniel J Walsh <[email protected]>
@giuseppe (Member)

tentative fix: #19760

not 100% sure it solves the problem, as I was not yet able to reproduce it locally to confirm

giuseppe added a commit to giuseppe/libpod that referenced this issue Aug 28, 2023
when the "kill" command fails, print the stderr from the OCI runtime
only after we check the container state.

It also simplifies the code since we don't have to hard code the error
messages we want to ignore.

Closes: containers#18452

[NO NEW TESTS NEEDED] it fixes a flake.

Signed-off-by: Giuseppe Scrivano <[email protected]>
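A short sketch of what that ordering looks like (names invented, not libpod's API): capture the runtime's stderr when "kill" fails, re-check the container state, and only surface the stderr if the container really is still around.

package killsketch

import "fmt"

// killAndMaybeReport (hypothetical) runs the OCI runtime's "kill" but defers
// judging its stderr until after a state check, mirroring the commit above.
func killAndMaybeReport(runKill func() (stderr string, err error), isStopped func() bool) error {
	stderr, err := runKill()
	if err == nil {
		return nil
	}
	if isStopped() {
		// The container already exited; drop messages such as
		// "open pidfd: No such process" instead of showing them to the user.
		return nil
	}
	// Only now is the runtime's stderr worth surfacing.
	return fmt.Errorf("OCI runtime kill failed: %s: %w", stderr, err)
}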
@github-actions bot added the 'locked - please file new issue/PR' label Nov 27, 2023
@github-actions bot locked as resolved and limited conversation to collaborators Nov 27, 2023