Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Linux-specific pidfd process extensions #77168

Closed
wants to merge 2 commits into from

Conversation

Aaron1011
Copy link
Member

Background:

Over the last year, pidfd support was added to the Linux kernel. This
allows interacting with other processes. In particular, this allows
waiting on a child process with a timeout in a race-free way, bypassing
all of the awful signal-handler tricks that are usually required.

Pidfds can be obtained for a child process (as well as any other
process) via the pidfd_open syscall. Unfortunately, this requires
several conditions to hold in order to be race-free (i.e. the pid is not
reused).
Per man pidfd_open:

· the disposition of SIGCHLD has not been explicitly set to SIG_IGN
 (see sigaction(2));

· the SA_NOCLDWAIT flag was not specified while establishing a han‐
 dler for SIGCHLD or while setting the disposition of that signal to
 SIG_DFL (see sigaction(2)); and

· the zombie process was not reaped elsewhere in the program (e.g.,
 either by an asynchronously executed signal handler or by wait(2)
 or similar in another thread).

If any of these conditions does not hold, then the child process
(along with a PID file descriptor that refers to it) should instead
be created using clone(2) with the CLONE_PIDFD flag.

Sadly, these conditions are impossible to guarantee once any libraries
are used. For example, C code runnng in a different thread could call
wait(), which is impossible to detect from Rust code trying to open a
pidfd.

While pid reuse issues should (hopefully) be rare in practice, we can do
better. By passing the CLONE_PIDFD flag to clone() or clone3(), we
can obtain a pidfd for the child process in a guaranteed race-free
manner.

This PR:

This PR adds Linux-specific process extension methods to allow obtaining
pidfds for processes spawned via the standard Command API. Other than
being made available to user code, the standard library does not make
use of these pidfds in any way. In particular, the implementation of
Child::wait is completely unchanged.

Two Linux-specific helper methods are added: CommandExt::create_pidfd
and ChildExt::pidfd. These methods are intended to serve as a building
block for libraries to build higher-level abstractions - in particular,
waiting on a process with a timeout.

I've included a basic test, which verifies that pidfds are created iff
the create_pidfd method is used. This test is somewhat special - it
should always succeed on systems with the clone3 system call
available, and always fail on systems without clone3 available. I'm
not sure how to best ensure this programatically.

This PR relies on the newer clone3 system call to pass the CLONE_FD,
rather than the older clone system call. clone3 was added to Linux
in the same release as pidfds, so this shouldn't unnecessarily limit the
kernel versions that this code supports.

Unresolved questions:

  • What should the name of the feature gate be for these newly added
    methods?
  • Should the pidfd method distinguish between an error occurring
    and create_pidfd not being called?

@rust-highfive
Copy link
Collaborator

r? @dtolnay

(rust_highfive has picked a reviewer for you, use r? to override)

@rust-highfive rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Sep 24, 2020
@Aaron1011
Copy link
Member Author

This currently uses my fork of libc. Once rust-lang/libc#1897 is merged and a new release of libc is released, I'll switch back to using libc from crates.io

/// A pidfd will only be created if it is possible to do so
/// in a guaranteed race-free manner (e.g. if the `clone3` system call is
/// supported). Otherwise, [`ChildExit::pidfd`] will return an error.
fn create_pidfd(&mut self, val: bool) -> &mut process::Command;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does the user need to request it explicitly? As long as clone3 is available is there any cost to always obtaining it?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Always opening a pidfd would cause it to be leaked, since we don't ever close it in the standard library. This would eventually exhaust all of the file descriptors for the process with leaked pidfds.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, it could be closed when the Child is dropped and one would have to take it out if one wants to keep using it, but yeah that would indeed be an extra cost.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think ownership and moving would be appropriate here since someone has to be responsible for closing it and if one can retrieve copies it becomes unclear who is responsible.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Without seeing this discussion, I would presume that Process owns the pidfd and will close it on drop.
  2. The pidfd should be closed when clone3 succeeds but exec fails, since a user has no ability to do so on their own, right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without seeing this discussion, I would presume that Process owns the pidfd and will close it on drop.

If we do that, then I think we should add a way to prevent this behavior (other then calling mem::forget on the Child). Otherwise, users who just need the pidfd will be forced to potentially leak memory (if Child ever heap allocates internally), or keep a Child around when it's not actually needed.

The pidfd should be closed when clone3 succeeds but exec fails, since a user has no ability to do so on their own, right?

The pidfd only exists in the parent process. Once clone3 succeeds, the pidfd is perfectly useable, regardless of what happens in the child process from that point on.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The pidfd only exists in the parent process. Once clone3 succeeds, the pidfd is perfectly useable, regardless of what happens in the child process from that point on.

My point was that spawn returns an error in that case, so the user cannot access pidfd.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we do that, then I think we should add a way to prevent this behavior (other then calling mem::forget on the Child).

To keep things RAII we should introduce a new wrapper type akin to other fd-wrapping IO types such as File, TcpStream and so on. E.g. struct ProcessDescriptor(RawFd) which could implement Drop.

Then Child could have

  • take_pidfd(&mut self) -> ProcessDescriptor
  • pidfd(&self) -> &ProcessDescriptor

ProcessDescriptor then could implement IntoRawFd and AsRawFd

Copy link
Member Author

@Aaron1011 Aaron1011 Sep 25, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My point was that spawn returns an error in that case, so the user cannot access pidfd.

Oh, that's a good point. I'll fix that.

/// some other error occured.
///
/// See `man pidfd_open` for more details about pidfds.
fn pidfd(&self) -> Result<i32>;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not Result<RawFd>?

// or if the pidfd could not be created for some reason
// (e.g. the `clone3` syscall was not available).
#[cfg(target_os = "linux")]
pidfd: libc::c_int,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider using RawFd for this.

@joshtriplett
Copy link
Member

I'd love to have this! Thanks for implementing it.

let args_ptr = &mut args as *mut clone_args;
let args_size = crate::mem::size_of::<clone_args>();

let res = cvt(unsafe { clone3(args_ptr, args_size) });
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This won't run atfork callbacks. This is for example used by rand to force all ReseedingRng instances to reseed themselves. (https://github.com/rust-random/rand/blob/42247d8616b0f8fb62d8a5a6e8dd8f6a17e1b3eb/src/rngs/adapter/reseeding.rs#L287) The rand usage is not relevant here, but there may be other cases.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had thought that posix_spawnnp didn't run atfork handlers, but it looks like I was mistaken.
This will require some further thought.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason that I used __clone3 is that glibc's clone() implementation requires the caller to allocate a stack for the new process (even though the kernel allows passing a null stack pointer to the raw clone syscall). We could allocate a stack ourself, but I think it would be better to avoid that if at all possible.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, the posix_spawn man page states:

       According to POSIX, it is unspecified whether fork handlers
       established with pthread_atfork(3) are called when posix_spawn() is
       invoked.  Since glibc 2.24, the fork handlers are not executed in any
       case.  On older implementations, fork handlers are called only if the
       child is created using fork(2).

so I think this should be fine

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is some extensive discussion going on in glibc upstream about handling and wrapping clone3. See:

@joshtriplett
Copy link
Member

joshtriplett commented Sep 29, 2020 via email

@@ -267,7 +345,6 @@ impl Command {
#[cfg(any(
target_os = "macos",
target_os = "freebsd",
all(target_os = "linux", target_env = "gnu")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we continue using posix_spawn on GNU/Linux, while adding pidfd to a list of exceptions? It should end up using CLONE_VFORK & CLONE_VM, and looking at various benchmarks it can make substantial performance difference for some use-cases.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The clone3 implementation could use CLONE_VFORK to achieve the same.
For reference, Go seems to pull it off while only having to jump though one hoop to deal with the shared memory.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using posix_spawn would require us to create the pidfd in the child process, and then use a Unix domain socket to transfer it to the parent. Using clone3 with the proper flags seems much simpler.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When pidfd is not used, so there should be no impact on the implementation. See #77455 for an additional motivation.

@bors
Copy link
Contributor

bors commented Oct 2, 2020

☔ The latest upstream changes (presumably #77462) made this pull request unmergeable. Please resolve the merge conflicts.

Note that reviewers usually do not review pull requests until merge conflicts are resolved! Once you resolve the conflicts, you should change the labels applied by bors to indicate that your PR is ready for review. Post this as a comment to change the labels:

@rustbot modify labels: +S-waiting-on-review -S-waiting-on-author

Background:

Over the last year, pidfd support was added to the Linux kernel. This
allows interacting with other processes. In particular, this allows
waiting on a child process with a timeout in a race-free way, bypassing
all of the awful signal-handler tricks that are usually required.

Pidfds can be obtained for a child process (as well as any other
process) via the `pidfd_open` syscall. Unfortunately, this requires
several conditions to hold in order to be race-free (i.e. the pid is not
reused).
Per `man pidfd_open`:

```
· the disposition of SIGCHLD has not been explicitly set to SIG_IGN
 (see sigaction(2));

· the SA_NOCLDWAIT flag was not specified while establishing a han‐
 dler for SIGCHLD or while setting the disposition of that signal to
 SIG_DFL (see sigaction(2)); and

· the zombie process was not reaped elsewhere in the program (e.g.,
 either by an asynchronously executed signal handler or by wait(2)
 or similar in another thread).

If any of these conditions does not hold, then the child process
(along with a PID file descriptor that refers to it) should instead
be created using clone(2) with the CLONE_PIDFD flag.
```

Sadly, these conditions are impossible to guarantee once any libraries
are used. For example, C code runnng in a different thread could call
`wait()`, which is impossible to detect from Rust code trying to open a
pidfd.

While pid reuse issues should (hopefully) be rare in practice, we can do
better. By passing the `CLONE_PIDFD` flag to `clone()` or `clone3()`, we
can obtain a pidfd for the child process in a guaranteed race-free
manner.

This PR:

This PR adds Linux-specific process extension methods to allow obtaining
pidfds for processes spawned via the standard `Command` API. Other than
being made available to user code, the standard library does not make
use of these pidfds in any way. In particular, the implementation of
`Child::wait` is completely unchanged.

Two Linux-specific helper methods are added: `CommandExt::create_pidfd`
and `ChildExt::pidfd`. These methods are intended to serve as a building
block for libraries to build higher-level abstractions - in particular,
waiting on a process with a timeout.

I've included a basic test, which verifies that pidfds are created iff
the `create_pidfd` method is used. This test is somewhat special - it
should always succeed on systems with the `clone3` system call
available, and always fail on systems without `clone3` available. I'm
not sure how to best ensure this programatically.

This PR relies on the newer `clone3` system call to pass the `CLONE_FD`,
rather than the older `clone` system call. `clone3` was added to Linux
in the same release as pidfds, so this shouldn't unnecessarily limit the
kernel versions that this code supports.

Unresolved questions:
* What should the name of the feature gate be for these newly added
  methods?
* Should the `pidfd` method distinguish between an error occurring
  and `create_pidfd` not being called?
/// Os-specific extensions to [`process::Child`]
///
/// [`process::Child`]: crate::process::Child
pub trait ChildExt {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please make this a sealed trait (like https://doc.rust-lang.org/std/slice/trait.SliceIndex.html) so it cannot be implemented outside of the standard library and we are free to add methods.

/// Os-specific extensions to [`process::Command`]
///
/// [`process::Command`]: crate::process::Command
pub trait CommandExt {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please make this a sealed trait.

@@ -0,0 +1,47 @@
//! Linux-specific extensions to primitives in the `std::process` module.

#![unstable(feature = "linux_pidfd", issue = "none")]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Open a tracking issue and update this attribute before landing.
Putting a review comment as a reminder; doesn't need to be done now if the PR is not in shape to land yet.

@dtolnay
Copy link
Member

dtolnay commented Oct 15, 2020

The public API of std::os::linux::process::{CommandExt, ChildExt} looks good to me to land unstable.

Reassigning to r? @joshtriplett to help review the implementation.

Co-authored-by: bjorn3 <[email protected]>
if HAS_CLONE3.load(Ordering::Relaxed) {
let mut flags = 0;
if self.make_pidfd {
flags |= CLONE_PIDFD;
Copy link
Member

@joshtriplett joshtriplett Oct 18, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the error path for "the caller asked for a pidfd but the kernel doesn't support clone3"?

(In theory we could have a fallback that tries pidfd_open, but that's less safe for a variety of reasons, so let's just not.)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See the Some(libc::ENOSYS) => branch below

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did see that; I just missed that you'd initialized pidfd to -1 at the top, so that it's safe to always return it. That seems fine.

OK, this LGTM. r=me when you're ready.

@joshtriplett
Copy link
Member

@Aaron1011 Can you please make the sealed-trait changes @dtolnay requested, as well as opening the tracking issue?

@Aaron1011
Copy link
Member Author

@joshtriplett Sorry for the delay - I'll do that today or tomorrow.

@joshtriplett
Copy link
Member

joshtriplett commented Oct 23, 2020 via email

@the8472
Copy link
Member

the8472 commented Oct 23, 2020

Additional open issues:

  • closing the fd on drop or error
  • better types than i32
  • unclear ownership of the file descriptors
  • keep the posix_spawn code path when pid_fds are not requested

@JohnCSimon JohnCSimon added S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Nov 9, 2020
@m-ou-se m-ou-se added T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. O-linux Operating system: Linux labels Nov 9, 2020
@crlf0710
Copy link
Member

Triage: What's the current status of this?

@crlf0710
Copy link
Member

@Aaron1011 Ping from traige. I'm closing this due to inactivity. Feel free to reopen or create a new PR when you have time to work on this again, thanks!

@crlf0710 crlf0710 closed this Dec 18, 2020
@voidc voidc mentioned this pull request Mar 10, 2021
9 tasks
Dylan-DPC-zz pushed a commit to Dylan-DPC-zz/rust that referenced this pull request Mar 14, 2021
Add Linux-specific pidfd process extensions (take 2)

Continuation of rust-lang#77168.
I addressed the following concerns from the original PR:

- make `CommandExt` and `ChildExt` sealed traits
- wrap file descriptors in `PidFd` struct representing ownership over the fd
- add `take_pidfd` to take the fd out of `Child`
- close fd when dropped

Tracking Issue: rust-lang#82971
Dylan-DPC-zz pushed a commit to Dylan-DPC-zz/rust that referenced this pull request Mar 18, 2021
Add Linux-specific pidfd process extensions (take 2)

Continuation of rust-lang#77168.
I addressed the following concerns from the original PR:

- make `CommandExt` and `ChildExt` sealed traits
- wrap file descriptors in `PidFd` struct representing ownership over the fd
- add `take_pidfd` to take the fd out of `Child`
- close fd when dropped

Tracking Issue: rust-lang#82971
Dylan-DPC-zz pushed a commit to Dylan-DPC-zz/rust that referenced this pull request Mar 18, 2021
Add Linux-specific pidfd process extensions (take 2)

Continuation of rust-lang#77168.
I addressed the following concerns from the original PR:

- make `CommandExt` and `ChildExt` sealed traits
- wrap file descriptors in `PidFd` struct representing ownership over the fd
- add `take_pidfd` to take the fd out of `Child`
- close fd when dropped

Tracking Issue: rust-lang#82971
Dylan-DPC-zz pushed a commit to Dylan-DPC-zz/rust that referenced this pull request Mar 19, 2021
Add Linux-specific pidfd process extensions (take 2)

Continuation of rust-lang#77168.
I addressed the following concerns from the original PR:

- make `CommandExt` and `ChildExt` sealed traits
- wrap file descriptors in `PidFd` struct representing ownership over the fd
- add `take_pidfd` to take the fd out of `Child`
- close fd when dropped

Tracking Issue: rust-lang#82971
Dylan-DPC-zz pushed a commit to Dylan-DPC-zz/rust that referenced this pull request Mar 26, 2021
Add Linux-specific pidfd process extensions (take 2)

Continuation of rust-lang#77168.
I addressed the following concerns from the original PR:

- make `CommandExt` and `ChildExt` sealed traits
- wrap file descriptors in `PidFd` struct representing ownership over the fd
- add `take_pidfd` to take the fd out of `Child`
- close fd when dropped

Tracking Issue: rust-lang#82971
bors added a commit to rust-lang-ci/rust that referenced this pull request Aug 1, 2021
Add Linux-specific pidfd process extensions (take 2)

Continuation of rust-lang#77168.
I addressed the following concerns from the original PR:

- make `CommandExt` and `ChildExt` sealed traits
- wrap file descriptors in `PidFd` struct representing ownership over the fd
- add `take_pidfd` to take the fd out of `Child`
- close fd when dropped

Tracking Issue: rust-lang#82971
@dtolnay dtolnay self-assigned this Mar 24, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
O-linux Operating system: Linux S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. T-libs-api Relevant to the library API team, which will review and decide on the PR/issue.
Projects
None yet
Development

Successfully merging this pull request may close these issues.