waitpid/waitid race caused by change to Popen.send_signal in Python 3.9 #86724

oconnor663 · 2020-12-03T17:05:29Z

BPO	42558
Nosy	@gpshead, @vstinner, @oconnor663

^{Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.}

Show more details

GitHub fields:

assignee = None
closed_at = None
created_at = <Date 2020-12-03.17:05:29.254>
labels = ['3.9', 'type-crash']
title = 'waitpid/waitid race caused by change to Popen.send_signal in Python 3.9'
updated_at = <Date 2020-12-04.20:16:42.254>
user = 'https://github.com/oconnor663'

bugs.python.org fields:

activity = <Date 2020-12-04.20:16:42.254>
actor = 'gregory.p.smith'
assignee = 'none'
closed = False
closed_date = None
closer = None
components = []
creation = <Date 2020-12-03.17:05:29.254>
creator = 'oconnor663'
dependencies = []
files = []
hgrepos = []
issue_num = 42558
keywords = []
message_count = 5.0
messages = ['382429', '382475', '382498', '382506', '382524']
nosy_count = 3.0
nosy_names = ['gregory.p.smith', 'vstinner', 'oconnor663']
pr_nums = []
priority = 'normal'
resolution = None
stage = None
status = 'open'
superseder = None
type = 'crash'
url = 'https://bugs.python.org/issue42558'
versions = ['Python 3.9']

oconnor663 · 2020-12-03T17:05:29Z

In Python 3.9, Popen.send_signal() was changed to call Popen.poll() internally before signaling. (Tracking bug: https://bugs.python.org/issue38630.) This is a best-effort check for the famous kill/wait race condition. However, because this can now reap an already-exited child process as a side effect, it can cause previously working programs to crash. Here's a simple example:

import os
import subprocess
import time

child = subprocess.Popen(["true"])
time.sleep(1)
child.kill()
os.waitpid(child.pid, 0)

The program above exits cleanly in Python 3.8 but crashes with ChildProcessError in Python 3.9. It's a race against child process exit, so in practice (without the sleep) it's a heisenbug.

There's a deeper race here that's harder to demonstrate in an example, but parallel to the original wait/kill issue: If the child PID happens to be reused by another parent thread in this same process, the call to waitpid might *succeed* and reap the unrelated child process. That would export the crash to that thread, and possibly create more kill/wait races.

In short, the question of when exactly a child process is reaped is important for correct signaling on Unix, and changing that behavior can break programs in confusing ways. This change affected the Duct library, and I might not've caught it if not for a lucky failing doctest: oconnor663/duct.py@5dfae70

I haven't searched for more instances of this bug in the wild, but one way to find them would be to look for code that calls both os.waitpid/waitid and also Popen.send_signal/kill/terminate. Duct found itself in this position because it was using waitid(WNOWAIT) on Unix only, to solve this same race condition, and also using Popen.kill on both Unix and Windows.

vstinner · 2020-12-04T11:08:55Z

The script fails with "ChildProcessError: [Errno 10] No child processes" on line:

os.waitpid(child.pid, 0)

The line before, you call child.kill() which sends SIGKILL signal to the process. So it's likely that the process will complete (killed by SIGKILL) soon.

I don't think that it's a good idea to attempt reading the process status directly (by calling os.waitpid), since subprocess already tracks the process status.

There is the public child.poll() API which calls os.waitpid(child.pid, 0) in a safe way for you. Why not using this API?

IMO this issue is not a bug.

Note: Python 3.8 behaves differently, but it doesn't mean that calling directly os.waitpid() was a good idea in Python 3.8 and older ;-)

oconnor663 · 2020-12-04T16:11:52Z

Right, the example above is contrived to demonstrate the race and the crash.

In real life code, the good reason I know of to write code like this is to use os.waidid(WNOWAIT) to solve the wait/kill race properly. This is what Duct has been doing, and Nathaniel Smith also described this strategy in https://bugs.python.org/issue38630. The idea is that a waiting thread follows these steps:

waitid() with WNOWAIT set, without locking the child
after waitid returns, indicating the child has exited, lock the child
waitid() without WNOWAIT, or just waitpid(), to reap the zombie child
stash the exit status and unlock

Meanwhile a killing thread follows these steps:

lock the child
check the stashed exit status, and unlock and exit early if it's set
otherwise, signal the child and unlock

This strategy solves the race. The killing thread is free to signal while the waiting thread is blocked in step 1. If the killing thread happens to race in between when waitid() returns and when the waiting thread acquires the child lock, the child is a zombie and the kill signal has no effect. This is safe even if other threads (or e.g. the OOM killer) can randomly kill our child: *they* might have to worry about PID reuse, but their signals can never cause *us* to kill an unrelated process. What breaks this scheme is if some thread calls waitpid() and reaps the child outside of the lock, but normally that'd be a pretty unreasonable thing to do, especially since it can only be done by other threads in the parent process. (There's also some complexity around whether multiple threads are allowed to call waitid(WNOWAIT) on the same PID at the same time. I've just had one thread call it, and had other blocking waiters block on a second lock, but maybe that's overcautious.)

So anyway, if you use the strategy above -- precisely because you care about the PID reuse race and want to solve it properly -- and you also happen to use Popen.kill(), then changing Popen.send_signal to reap the child can break you.

I don't think this is a bug per se, but it's a behavior change, which matters to a small set of (abnormally) correct programs. But then again, if Duct is the only project that hits this in practice, maybe I'm making a mountain out of a molehill :)

vstinner · 2020-12-04T16:46:34Z

If you care about race conditions in send_signal(), I suggest you to write a PR to use the newly added os.pidfd_open() in subprocess:
https://docs.python.org/dev/library/os.html#os.pidfd_open

gpshead · 2020-12-04T20:16:41Z

I agree with Victor. When code launches a process with subprocess APIs, it is expected that the subprocess module manages the process.

Calling os.waitpid directly instead of using subprocess's APIs breaks that expectation.

Also, WNOWAIT and waitid() (not waitpid()) are not available on all platforms. Nor is pidfd_open(). Code using any of those for such a PR needs to be conditional on their runtime availability.

kumaraditya303 · 2022-06-19T10:22:41Z

Closed as manually calling waitpid breaks subprocess API's and shouldn't be done.

oconnor663 mannequin added 3.9 only security fixes type-crash A hard crash of the interpreter, possibly with a core dump labels Dec 3, 2020

ezio-melotti transferred this issue from another repository Apr 10, 2022

kumaraditya303 closed this as completed Jun 19, 2022

kumaraditya303 closed this as not planned Won't fix, can't repro, duplicate, stale Jun 19, 2022

graingert mentioned this issue Jul 30, 2022

subprocess.Popen.send_signal() should poll the process #82811

Closed

This was referenced Nov 20, 2024

subprocess.Popen.poll race condition returns without polling the child #127050

Open

gh-127049: Fix asyncio.subprocess.Process race condition killing an unrelated process on Unix #127051

Draft

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

waitpid/waitid race caused by change to Popen.send_signal in Python 3.9 #86724

waitpid/waitid race caused by change to Popen.send_signal in Python 3.9 #86724

oconnor663 mannequin commented Dec 3, 2020

oconnor663 mannequin commented Dec 3, 2020

vstinner commented Dec 4, 2020

oconnor663 mannequin commented Dec 4, 2020

vstinner commented Dec 4, 2020

gpshead commented Dec 4, 2020

kumaraditya303 commented Jun 19, 2022

waitpid/waitid race caused by change to Popen.send_signal in Python 3.9 #86724

waitpid/waitid race caused by change to Popen.send_signal in Python 3.9 #86724

Comments

oconnor663 mannequin commented Dec 3, 2020

oconnor663 mannequin commented Dec 3, 2020

vstinner commented Dec 4, 2020

oconnor663 mannequin commented Dec 4, 2020

vstinner commented Dec 4, 2020

gpshead commented Dec 4, 2020

kumaraditya303 commented Jun 19, 2022