Possible deadlock at shutdown while recursively acquiring head lock #102126
Comments
I'm the downstream bug reporter, and I have verified on my M1 Pro laptop running macOS 12.6.2 that:
I have not had time to find a more minimal reproduction. |
I've created a smaller reproducible example here: https://github.com/sergei-maertens/threading-deadlock The advantage is that it's possible to reproduce on a linux/docker context and there's no actual application code involved except for two libraries that happen to trigger this when combined (coverage + hypothesis). @CharString remarked that the hypothesis bug report requires two calls to the I looked into the coverage code a bit yesterday and they do some funky stuff to eventually pass it to |
It appears that Python 3.10.10 introduced a regression that can cause deadlocks on threading.local which manifests in CI at the moment for us. For "better safe than sorry" reasons the Docker image version is also pinned as we do NOT want a risk of deadlocks in production. For more information, see python/cpython#102126
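For projects that cannot pin the interpreter right away, a guard in CI can at least flag the release named in this thread. The following is a minimal sketch, not an official check: the `AFFECTED_RELEASES` set and the `is_affected` helper are hypothetical names, and the set lists only 3.10.10 because that is the only release this thread identifies as affected.

```python
import sys

# Assumption: only 3.10.10 is listed, because it is the only release
# this thread names as affected. Extend the set if you confirm others.
AFFECTED_RELEASES = {(3, 10, 10)}


def is_affected(version=None):
    """Return True if the (major, minor, micro) version is a known-affected release."""
    if version is None:
        version = sys.version_info
    return tuple(version[:3]) in AFFECTED_RELEASES


if is_affected():
    print("Warning: this interpreter release is reported to hang at shutdown")
```

A CI job could call `is_affected()` at startup and fail fast instead of waiting for a hung process to time out.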
I was able to repro this issue on my Mac using Docker and the instructions at https://github.com/sergei-maertens/threading-deadlock I also attempted to repro using the same
|
Hey, I've run into the same problem and went to #python on Libera.Chat, where I got some help. I have a Django project that uses django-haystack, and the same phenomenon occurs. Because of this bug, the hanging processes made my entire CI and production system start acting weird. After debugging the hanging process with gdb with the Python extension enabled, I got the following debug log:
Hope this can help somebody. |
@sergei-maertens I can reproduce the bug with your test on my CI, FreeBSD jail with python3.10.10, and not with python3.10.9. Python is clearly the culprit. Haven't tried on my Linux box, but probably it would result in the same phenomenon. |
Compiling python from the main branch gives a python that doesn't have this issue. I tried to revert to 762745a in there to see if the repo state there results in a |
@karolyi if you try compiling Python from a different commit, say the same commit that was released as |
@carljm the tag v3.10.10 refers to a commit in the 3.10 branch, and as of the referenced merge the entire 3.10 branch is broken with respect to this bug. Reverting the referenced (cherry-picked) commit in the 3.10 branch fixed the issue, so the culprit is indeed 762745a. However, checking out the main branch at that commit results in a working 3.12, so git bisect isn't helping here. Something in the main branch fixes this issue, probably architectural changes that are much harder for me to figure out. |
The reason I ask is because on Ubuntu Linux, as mentioned above, I couldn't reproduce the issue even on 3.10 branch, when compiling Python myself. So are you saying that when compiling Python yourself, you are able to repro the issue from 3.10 branch, but not from main? I didn't see that clearly mentioned above. (When you say you reproed with 3.10.10 and not 3.10.9, it's not clear where those Pythons came from.) |
Interesting. I'm testing on my FreeBSD box now, since I don't use Docker and I don't want to pollute my Linux box (Manjaro) with a self-compiled Python. I might just try it with an actual Linux VM. Also, the production server, which runs Ubuntu 20.04 (Focal Fossa) with http://ppa.launchpad.net/deadsnakes/ppa/ubuntu/, suffers from this issue on Python 3.10.10 as well.
Those were the official tarballs from the python.org site, until I started compiling from the github git repo. |
Working on a fix... |
Marking release blocker, cc @pablogsal @ambv |
threading.local()
changes
@karolyi Can you try applying the following patch on 3.10 and test it? It is #102222 backported to 3.10 head.
diff --git a/Python/pystate.c b/Python/pystate.c
index df98eb11bb..c7a6af5da8 100644
--- a/Python/pystate.c
+++ b/Python/pystate.c
@@ -293,11 +293,19 @@ interpreter_clear(PyInterpreterState *interp, PyThreadState *tstate)
         _PyErr_Clear(tstate);
     }
+    // Clear the current/main thread state last.
     HEAD_LOCK(runtime);
-    for (PyThreadState *p = interp->tstate_head; p != NULL; p = p->next) {
+    PyThreadState *p = interp->tstate_head;
+    HEAD_UNLOCK(runtime);
+    while (p != NULL) {
+        // See https://github.com/python/cpython/issues/102126
+        // Must be called without HEAD_LOCK held as it can deadlock
+        // if any finalizer tries to acquire that lock.
         PyThreadState_Clear(p);
+        HEAD_LOCK(runtime);
+        p = p->next;
+        HEAD_UNLOCK(runtime);
     }
-    HEAD_UNLOCK(runtime);
     Py_CLEAR(interp->audit_hooks);
|
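The locking change in the patch can be modeled in plain Python. This is only an illustrative sketch, not CPython's implementation: `head_lock`, `Node`, and both `clear_all_*` functions are hypothetical stand-ins for the head lock, a thread state, and the two loop shapes in the diff. The modeled finalizer uses a timeout so the script reports the deadlock instead of hanging.

```python
import threading

# Hypothetical stand-in for the runtime head lock. threading.Lock is
# non-reentrant, so a second acquire from the same thread blocks.
head_lock = threading.Lock()


class Node:
    """Hypothetical stand-in for one thread state in a linked list."""

    def __init__(self, name, next_node=None):
        self.name = name
        self.next = next_node

    def clear(self):
        # Models a finalizer that needs the head lock while it runs.
        # The timeout turns a would-be deadlock into a False result.
        acquired = head_lock.acquire(timeout=0.1)
        if acquired:
            head_lock.release()
        return acquired


def clear_all_holding_lock(head):
    """Old shape: the lock is held across every clear() call."""
    results = []
    with head_lock:
        node = head
        while node is not None:
            results.append(node.clear())
            node = node.next
    return results


def clear_all_releasing_lock(head):
    """Patched shape: the lock is held only while reading list pointers."""
    results = []
    with head_lock:
        node = head
    while node is not None:
        results.append(node.clear())
        with head_lock:
            node = node.next
    return results


chain = Node("main", Node("worker"))
print(clear_all_holding_lock(chain))    # [False, False]: each finalizer would deadlock
print(clear_all_releasing_lock(chain))  # [True, True]: each finalizer gets the lock
```

The second shape trades one long critical section for several short ones, which is what lets a finalizer take the lock during clearing.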
Yes, will do, hold on. In the meantime, you can look into c314198, it's the commit with which the main branch doesn't hang, reverting it causes the main branch to hang. |
This is a GC timing issue, which is unpredictable by nature. Any unrelated change can alter the way objects are deallocated or when finalizers are called, and possibly delay them. |
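As a rough illustration of why finalizer timing shifts so easily, the sketch below (assuming CPython's reference counting plus cyclic collector; the `Resource` class and `events` list are made up for the example) shows that merely placing an object in a reference cycle postpones its finalizer until the collector runs.

```python
import gc

events = []  # records the order in which finalizers run


class Resource:
    def __init__(self, name):
        self.name = name

    def __del__(self):
        events.append(self.name)


gc.disable()  # keep the cyclic collector from running on its own during the demo

plain = Resource("plain")
del plain  # refcount drops to zero, so the finalizer runs immediately
after_plain = list(events)

cyclic = Resource("cyclic")
cyclic.self_ref = cyclic  # the reference cycle keeps the object alive
del cyclic  # the finalizer does not run yet
after_del = list(events)

gc.collect()  # the collector finds the cycle and runs the finalizer
after_collect = list(events)

gc.enable()

print(after_plain)    # ['plain']
print(after_del)      # ['plain']
print(after_collect)  # ['plain', 'cyclic']
```

In a real program the collection point is not an explicit `gc.collect()` call but whatever allocation or shutdown step happens to trigger it, which is why unrelated changes can make the hang appear or disappear.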
There is an error while compiling with your patch:
|
If you meant |
Yeah, sorry for the typo. Thanks for checking. |
I've also checked my use case with django-haystack and django, the process hanging disappeared there as well. This seems to be the right solution. |
Not sure if this is the right place to ask, but can we get a hotfix version until 3.10.11 arrives? |
…states (pythonGH-102222). (cherry picked from commit 5f11478) Co-authored-by: Kumar Aditya <[email protected]>
Thanks for checking with us, but unfortunately this doesn't qualify for a hotfix release of the interpreter. The release process is very involved, and we cannot do it as often as we would like. This fix will be available in the next regular bugfix release of the interpreter, which is in a month or two. |
…GH-102222) (#102236) (cherry picked from commit 5f11478)
… state… (python#102235) [3.10] pythonGH-102126: fix deadlock at shutdown when clearing thread states (pythonGH-102222). (cherry picked from commit 5f11478)
Starting with the upgrade to Hypothesis 6.103.4 we got hangs when pytest exits. This is caused by HypothesisWorks/hypothesis#4013 combined with python/cpython#102126, which was fixed in Python 3.10.11, but the latest 3.10 packaged by Arch Linux was 3.10.10. Thus, we instead build a newer 3.10 from the AUR. This bumps the build time up to about 20 minutes on my machine, which is probably acceptable since those are nightly builds only anyway. We could probably halve that by disabling --enable-optimizations, but that would come at the cost of making the actual test runs (which run more often) slower. Closes #8247
HypothesisWorks/hypothesis#3585 is reproducible on Python 3.10.10 but not 3.10.9, and so we suspect that #100922 may have introduced a (rare?) deadlock while fixing the data race in #100892.
The downstream bug report on Hypothesis includes a reliable (but not minimal) reproducer on OSX, though it's unclear whether this might be an artifact of the different patch versions of CPython on the various machines that have been checked so far.