-
Notifications
You must be signed in to change notification settings - Fork 94
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make proxy tests with LocalCUDACluster
asynchronous
#1084
Make proxy tests with LocalCUDACluster
asynchronous
#1084
Conversation
Codecov ReportBase: 0.00% // Head: 87.04% // Increases project coverage by
Additional details and impacted files@@ Coverage Diff @@
## branch-23.02 #1084 +/- ##
=================================================
+ Coverage 0.00% 87.04% +87.04%
=================================================
Files 26 18 -8
Lines 3439 2300 -1139
=================================================
+ Hits 0 2002 +2002
+ Misses 3439 298 -3141
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. ☔ View full report at Codecov. |
LocalCUDACluster
asynchronous
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome, thanks @pentschev !
/merge |
Thanks @madsbk for the review/approval! |
After dask/distributed#7429 was merged, some of those tests started hanging and I could confirm there were two threads concurrently attempting to take the UCX spinlock and the GIL, which led to such deadlock. UCX-Py is currently not thread-safe, and indeed can cause problems like this should two or more threads attempt to call communication routines that will required the UCX spinlock. My theory is that the synchronous cluster will indeed cause communication on the main thread (in this case, the
pytest
thread) upon attempting to shutdown the cluster, instead of only within the Distributed communication thread, likely being the reason behind the test hanging.Asynchronous Distributed clusters seem not to cause any communication from the main thread, but only in the communication thread as expected, thus making the tests asynchronous suffice to resolve such issues. In practice, it's unlikely that people will use sync Distributed clusters from the same process (as pytest does), and thus it's improbable to happen in real use-cases.