-
Notifications
You must be signed in to change notification settings - Fork 94
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Increase close timeout of Nanny
in LocalCUDACluster
#1260
Increase close timeout of Nanny
in LocalCUDACluster
#1260
Conversation
Tests in CI have been failing more often, but those errors can't be reproduced locally. This is possibly related to `Nanny`'s internal mechanism to establish timeouts to kill processes, perhaps due to higher load on the servers, tasks take longer and killing processes takes into account the overall time taken to establish a timeout, which is then drastically reduced leaving little time to actually shutdown processes. It is also not possible to programatically set a different timeout given existing Distributed's API, which currently calls `close()` without arguments in `SpecCluster._correct_state_internal()`. Given the limitations described above, a new class is added by this change with the sole purpose of rewriting the timeout for `Nanny.close()` method with an increased value, and then use the new class when launching `LocalCUDACluster` via the `worker_class` argument.
/merge |
Thanks @quasiben ! |
I am probably misunderstanding the underlying issue, but is there a risk that with this patch the tests are no longer adequately capturing in how users would actually use the software and thus reduce the utility of the tests? |
It is possible that this may occur in real use cases, but I've tried locally running the tests that have been failing in Dask-CUDA thousands of times over many a period of many weeks/months and I couldn't reproduce this issue once, which leads me to believe this is due to high-load in CI. In any case, those issues only occur when the cluster is shutting down, so even if it manifests itself for users it could be problematic only in the event that the user relies on the process' exit code, which is indeed not great but unfortunately shutdown is a difficult issue to properly resolve in Distributed. |
Tests in CI have been failing more often, but those errors can't be reproduced locally. This is possibly related to
Nanny
's internal mechanism to establish timeouts to kill processes, perhaps due to higher load on the servers, tasks take longer and killing processes takes into account the overall time taken to establish a timeout, which is then drastically reduced leaving little time to actually shutdown processes. It is also not possible to programatically set a different timeout given existing Distributed's API, which currently callsclose()
without arguments inSpecCluster._correct_state_internal()
.Given the limitations described above, a new class is added by this change with the sole purpose of rewriting the timeout for
Nanny.close()
method with an increased value, and then use the new class when launchingLocalCUDACluster
via theworker_class
argument.