Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Editor : Fix deadlock caused by garbage collection of Settings #5893

Merged
merged 3 commits into from
Jun 14, 2024

Conversation

johnhaddon
Copy link
Member

I believe this fixes #5877. Either of the two commits is sufficient on its own to avoid the hang. I think we definitely want to keep the first one as it is dealing most directly with the cause. I'm in two minds about keeping the second one - have a read of the commit message and see what you think.

Would be good to get this into a release before anybody else gets hit by this problem...

@danieldresser-ie
Copy link
Contributor

I don't feel I 100% understand the specifics of this, but the code all looks good to me.

As for your question on the second commit: I see what you mean about the message not being that useful compared to a stack trace. Since you've already written the code to detect it though, one option would be: print the message, but don't return - that would mean that you probably will get a hang, but the message would make it a lot clearer why you're getting the hang? The message could even contain instructions, or point to a webpage about how to collect a stack trace?

Oh, I just noticed the crash on linux-debug though, that's kinda concerning.

@johnhaddon
Copy link
Member Author

I just noticed the crash on linux-debug though, that's kinda concerning.

I'm pretty sure that's an unrelated problem - one we've seen sporadically for a while. I have a feeling its more related to the testing than the reality, and haven't found time to look into it yet.

@johnhaddon
Copy link
Member Author

Pushed one more commit, which - as discussed offline - keeps the detection of deadlocks but removes the mitigation, so we have a greater chance of catching one should it ever occur again. Did it as a separate commit as I'm still in two minds...

@johnhaddon johnhaddon force-pushed the garbageCollectionDeadlockWorkaround branch from 86b65ed to 3d8dd6f Compare June 12, 2024 11:42
The problematic sequence of operations was this :

1. Destroy Editor. But Settings node lives on, because it is a wrapped   RefCounted object and hence requires garbage collection.
2. Start unrelated BackgroundTask, which inadvertently triggers `IECore.RefCounted.collectGarbage()` on a background thread.
3. Settings node is destroyed on background thread by the garbage collection. All plugs are disconnected before destruction, including the `__scriptNode` plug.
4. Disconnections cause cancellation of background tasks associated with
the ScriptNode, via `BackgroundTask::cancelAffectedTasks()`. Although the Settings node has no parent, the ScriptNode
is still found due to the (about to be removed) connection to the
`__scriptNode` plug.
5. `BackgroundTask::cancelAndWait()` never returns, because it is being called from the task's own thread.
6. The UI thread then waits for the task to finish, and we have complete deadlock.

This is worked around by removing the `__scriptNode` plug connection on the main thread at the time the Editor is destroyed.

Why is this only happening now? Because we only introduced the Settings node and the `__scriptNode` plug mechanism recently in 830de76.

But we have always had lots of other Python-derived nodes that require garbage collection, so why weren't _they_ causing problems? Because when they are collected, they will have no parent, and the standard way of finding the ScriptNode for cancellation is to look for a ScriptNode ancestor. The special case using the `__scriptNode` plug only applies to the Settings node.

Longer term it would be good to come up with a better mechanism than the `__scriptNode` plug, but I think this is a sufficient workaround in the meantime.

Fixes #5877
This would also have avoided the deadlock fixed in the previous commit. I'm not sure if its worth keeping or not. On the one hand, it would identify similar problems in future without exposing the user to hangs. But on the other hand, the warning makes it harder to trace the exact source of the problem, whereas a hung process makes that relatively easy with `eu-stack`.
But keep detection, so in the unlikely event this happens we can prompt the user to provide a useful bug report.
@murraystevenson murraystevenson force-pushed the garbageCollectionDeadlockWorkaround branch from 3d8dd6f to 51a4f99 Compare June 14, 2024 18:17
Copy link
Contributor

@murraystevenson murraystevenson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! We'll go with the detection of deadlocks but no mitigation. I've tweaked Changes.md to reflect that as an Improvement as we still had the mitigation listed under API, then rebased to fix the inevitable Changes.md merge conflict...

@murraystevenson murraystevenson merged commit ff5358d into 1.4_maintenance Jun 14, 2024
5 checks passed
@johnhaddon johnhaddon deleted the garbageCollectionDeadlockWorkaround branch June 20, 2024 08:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

3 participants