-
Notifications
You must be signed in to change notification settings - Fork 206
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Editor : Fix deadlock caused by garbage collection of Settings #5893
Editor : Fix deadlock caused by garbage collection of Settings #5893
Conversation
I don't feel I 100% understand the specifics of this, but the code all looks good to me. As for your question on the second commit: I see what you mean about the message not being that useful compared to a stack trace. Since you've already written the code to detect it though, one option would be: print the message, but don't return - that would mean that you probably will get a hang, but the message would make it a lot clearer why you're getting the hang? The message could even contain instructions, or point to a webpage about how to collect a stack trace? Oh, I just noticed the crash on linux-debug though, that's kinda concerning. |
I'm pretty sure that's an unrelated problem - one we've seen sporadically for a while. I have a feeling its more related to the testing than the reality, and haven't found time to look into it yet. |
Pushed one more commit, which - as discussed offline - keeps the detection of deadlocks but removes the mitigation, so we have a greater chance of catching one should it ever occur again. Did it as a separate commit as I'm still in two minds... |
86b65ed
to
3d8dd6f
Compare
The problematic sequence of operations was this : 1. Destroy Editor. But Settings node lives on, because it is a wrapped RefCounted object and hence requires garbage collection. 2. Start unrelated BackgroundTask, which inadvertently triggers `IECore.RefCounted.collectGarbage()` on a background thread. 3. Settings node is destroyed on background thread by the garbage collection. All plugs are disconnected before destruction, including the `__scriptNode` plug. 4. Disconnections cause cancellation of background tasks associated with the ScriptNode, via `BackgroundTask::cancelAffectedTasks()`. Although the Settings node has no parent, the ScriptNode is still found due to the (about to be removed) connection to the `__scriptNode` plug. 5. `BackgroundTask::cancelAndWait()` never returns, because it is being called from the task's own thread. 6. The UI thread then waits for the task to finish, and we have complete deadlock. This is worked around by removing the `__scriptNode` plug connection on the main thread at the time the Editor is destroyed. Why is this only happening now? Because we only introduced the Settings node and the `__scriptNode` plug mechanism recently in 830de76. But we have always had lots of other Python-derived nodes that require garbage collection, so why weren't _they_ causing problems? Because when they are collected, they will have no parent, and the standard way of finding the ScriptNode for cancellation is to look for a ScriptNode ancestor. The special case using the `__scriptNode` plug only applies to the Settings node. Longer term it would be good to come up with a better mechanism than the `__scriptNode` plug, but I think this is a sufficient workaround in the meantime. Fixes #5877
This would also have avoided the deadlock fixed in the previous commit. I'm not sure if its worth keeping or not. On the one hand, it would identify similar problems in future without exposing the user to hangs. But on the other hand, the warning makes it harder to trace the exact source of the problem, whereas a hung process makes that relatively easy with `eu-stack`.
But keep detection, so in the unlikely event this happens we can prompt the user to provide a useful bug report.
3d8dd6f
to
51a4f99
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! We'll go with the detection of deadlocks but no mitigation. I've tweaked Changes.md to reflect that as an Improvement
as we still had the mitigation listed under API
, then rebased to fix the inevitable Changes.md merge conflict...
I believe this fixes #5877. Either of the two commits is sufficient on its own to avoid the hang. I think we definitely want to keep the first one as it is dealing most directly with the cause. I'm in two minds about keeping the second one - have a read of the commit message and see what you think.
Would be good to get this into a release before anybody else gets hit by this problem...