Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Increase close timeout of Nanny in LocalCUDACluster #1260

Merged
merged 2 commits into from
Oct 12, 2023

Conversation

pentschev
Copy link
Member

Tests in CI have been failing more often, but those errors can't be reproduced locally. This is possibly related to Nanny's internal mechanism to establish timeouts to kill processes, perhaps due to higher load on the servers, tasks take longer and killing processes takes into account the overall time taken to establish a timeout, which is then drastically reduced leaving little time to actually shutdown processes. It is also not possible to programatically set a different timeout given existing Distributed's API, which currently calls close() without arguments in SpecCluster._correct_state_internal().

Given the limitations described above, a new class is added by this change with the sole purpose of rewriting the timeout for Nanny.close() method with an increased value, and then use the new class when launching LocalCUDACluster via the worker_class argument.

Tests in CI have been failing more often, but those errors can't be
reproduced locally. This is possibly related to `Nanny`'s internal
mechanism to establish timeouts to kill processes, perhaps due to higher
load on the servers, tasks take longer and killing processes takes into
account the overall time taken to establish a timeout, which is then
drastically reduced leaving little time to actually shutdown processes.
It is also not possible to programatically set a different timeout given
existing Distributed's API, which currently calls `close()` without
arguments in `SpecCluster._correct_state_internal()`.

Given the limitations described above, a new class is added by this
change with the sole purpose of rewriting the timeout for
`Nanny.close()` method with an increased value, and then use the new
class when launching `LocalCUDACluster` via the `worker_class` argument.
@pentschev pentschev added bug Something isn't working 3 - Ready for Review Ready for review by team non-breaking Non-breaking change labels Oct 12, 2023
@pentschev pentschev requested a review from a team as a code owner October 12, 2023 13:01
@github-actions github-actions bot added the python python code needed label Oct 12, 2023
@quasiben
Copy link
Member

/merge

@rapids-bot rapids-bot bot merged commit 48de0c5 into rapidsai:branch-23.12 Oct 12, 2023
21 checks passed
@pentschev
Copy link
Member Author

Thanks @quasiben !

@csadorf
Copy link

csadorf commented Oct 30, 2023

I am probably misunderstanding the underlying issue, but is there a risk that with this patch the tests are no longer adequately capturing in how users would actually use the software and thus reduce the utility of the tests?

@pentschev
Copy link
Member Author

I am probably misunderstanding the underlying issue, but is there a risk that with this patch the tests are no longer adequately capturing in how users would actually use the software and thus reduce the utility of the tests?

It is possible that this may occur in real use cases, but I've tried locally running the tests that have been failing in Dask-CUDA thousands of times over many a period of many weeks/months and I couldn't reproduce this issue once, which leads me to believe this is due to high-load in CI. In any case, those issues only occur when the cluster is shutting down, so even if it manifests itself for users it could be problematic only in the event that the user relies on the process' exit code, which is indeed not great but unfortunately shutdown is a difficult issue to properly resolve in Distributed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team bug Something isn't working non-breaking Non-breaking change python python code needed
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants