-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[BUG]: Python API calls in destructors are unsafe outside the main thread #3274
Comments
I think it sounds like a.) is the appropriate choice given this info. Unfortunate we need to add a noninline function on such a hot code path, but there doesn't seem to be a elegant way around it. |
I have been exploring an alternative workaround: introduce a type of lock that can be acquired that blocks Python from exiting. This is implemented by registering an Then any time you have a C++ thread that wants to acquire the GIL in order to call Python API functions, it should first acquire this exit block. If successful, it can then acquire the GIL and proceed. If it cannot acquire the exit block because Python is exiting, it will have to skip calling the Python APIs. There is a possibility that some code running in the main Python thread is blocked, directly or indirectly, on whatever Python API the C++ thread would have liked to call. To avoid deadlock, an This prevents C++ code from itself causing crashes. It is still possible for Python code running in a daemon thread to be calling a pybind11-bound function (which has directly or indirectly released the GIL) while Python exits. But that is at least more under the user's control. In general, |
See here for the Python pull request that attempts to address this problem by avoiding |
That's exactly what we're doing in some other contexts (e.g. https://reviews.llvm.org/D114722).
We need performance data. More often than not, overhead from such things is practically undetectable. That's my default assumption here, with conclusion: let's add the |
There seems to be at least some agreement towards fixing the problem in Python itself, but my pull request has stalled: I ended up implementing the "exit block" mechanism In TensorStore as described in my previous comment, similar to the one implemented in the above Python pull request: Even if we solve the specific issue of |
@jbms do you think we should keep this bug open anyway? |
Maybe it is fine to close it given there is a python bug filed. Or we could close it once there is a resolution to the Python bug. |
Closing because it is not actionable here. It will still show up in searches. |
Required prerequisites
Problem description
I observed some random crashes while the Python interpreter is exiting normally, when there are threads created from C++ that invoke Python code immediately before the exit (this thread "correctly" uses gil_scoped_acquire while invoking the Python code).
After some debugging, I believe the root cause is the following:
Py_FinalizeEx
function is called. After doing some initial work with the Python interpreter still in a valid state, it initiates the real shutdown by calling_PyRuntimeState_SetFinalizing
:https://github.com/python/cpython/blob/09b4ad11f323f8702cde795e345b75e0fbb1a9a5/Python/pylifecycle.c#L1748
At this point, the main thread calling
Py_FinalizeEx
holds the GIL and therefore no other threads can be performing any Python API calls. However, once_PyRuntimeState_SetFinalizing
is called, other threads are are no longer blocked waiting for the GIL. Instead, when they attempt to acquire the GIL,pthread_exit
is called.pthread_exit
appears to behave as if a C++ exception of type__forced_unwind
is thrown.noexcept
functions in the call stack,std::terminate
will be called, which terminates the program with an error.This issue is mentioned in this Python bug report as well:
https://bugs.python.org/issue42969
I'm not sure what happens on platforms other than Linux with glibc and GCC, but I suspect the behavior may be similar.
As I see it, this is really a bug in Python itself, but due to pybind11's use of C++ destructors to call Py_DECREF, the problem is much more apparent when using pybind11 (or any similar C++ RAII wrapper). When using pure C code, the thread will likely exit without doing any cleanup, which may be harmless in most cases.
Unfortunately I don't see an easy workaround. We need to ensure Py_DECREF or any other Python APIs are not called from a destructor while Python is finalizing. There are two ways we could go about that:
a. Checking
_Py_IsFinalizing()
from the pybind1::object destructor (and just skip callingPy_DECREF
in that case) would work I think. Unfortunately_Py_IsFinalizing
is not inline, so the cost would not be negligible.b. Alternatively, we could attempt to block the unwinding itself, and just make the thread hang. We could wrap the call to Python API functions with a
try { } catch (...) {sleep_forever();}
block. However, there are a very large number of Python C API functions that can potentially lead to a call topthread_exit
. In addition to the API calls that directly acquire the GIL, any API call that can trigger user Python code (including any call to Py_DECREF), could also triggerpthread_exit
, because that user Python code may release and then re-acquire the GIL.I think (a) would definitely be the most practical option.
We would also need to document the fact that users should not invoke any Python API functions from destructors without checking
_Py_IsFinalizing
.Reproducible example code
No response
The text was updated successfully, but these errors were encountered: