This repository has been archived by the owner on Feb 10, 2021. It is now read-only.

Fix DRMAACluster.scale_down and drop Adaptive._retire_workers #85

Open

jakirkham wants to merge 5 commits into master from fix_scale_down_workers
Conversation

jakirkham
Member

Fixes #65
Replaces #81

Rewrites DRMAACluster.scale_down to use DRMAA to terminate the workers. Also makes use of DRMAACluster.scale_down in DRMAACluster.stop_workers (where this code was pulled from). With this change, it should now be possible to drop Adaptive._retire_workers, which is already present in the Adaptive parent class.

cc @azjps
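To illustrate the shape of the change described above, here is a minimal, hypothetical sketch (the class and attribute names are illustrative stand-ins, not the actual dask-drmaa implementation): `scale_down` terminates workers through the job backend, and `stop_workers` simply delegates to it so both paths share one code path.

```python
class MockSession:
    """Stand-in for a DRMAA session; terminate() just records the job id."""
    def __init__(self):
        self.terminated = []

    def terminate(self, job_id):
        self.terminated.append(job_id)


class Cluster:
    """Illustrative cluster: scale_down owns worker termination."""
    def __init__(self):
        self.session = MockSession()
        self.workers = {"job-1", "job-2"}

    def scale_down(self, worker_ids):
        # Deduplicate ids before acting on them (see the set/list commit).
        for wid in set(worker_ids):
            self.session.terminate(wid)
            self.workers.discard(wid)

    def stop_workers(self, worker_ids):
        # Reuse scale_down so stopping workers and scaling down stay in sync.
        self.scale_down(worker_ids)


c = Cluster()
c.stop_workers(["job-1", "job-1"])  # duplicates collapse to one termination
assert c.session.terminated == ["job-1"]
assert c.workers == {"job-2"}
```

The point of the delegation is that there is a single place where workers are actually terminated, which is what lets the redundant `Adaptive._retire_workers` override be dropped.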

jakirkham added 3 commits May 20, 2018 19:25
Pass all of `worker_ids` through a `set` before converting them to a
`list` to ensure there are no duplicates.
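The dedup step in that commit amounts to the following (a trivial sketch, with made-up worker ids):

```python
worker_ids = ["w1", "w2", "w1"]

# Pass the ids through a set before converting back to a list,
# so each worker is only acted on once. Note set() does not
# preserve the original ordering.
worker_ids = list(set(worker_ids))

assert sorted(worker_ids) == ["w1", "w2"]
```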
As the purpose of `scale_down` is to shut down workers through the
scheduler directly, pull code from `stop_workers` to do exactly this
with DRMAA. Update `stop_workers` to use `scale_down` for this
functionality as well. This should help ensure `DRMAACluster` matches
the expected `Cluster` specification more closely.
The `_retire_workers` method seems to largely duplicate the same in
`distributed`'s `Adaptive`. So just go ahead and drop our
implementation.

This was tried before, but didn't work due to duplicate `retire_workers`
calls to the `Scheduler` in both `Adaptive._retire_workers` and
`DRMAACluster.scale_down`. However, as the behavior of
`DRMAACluster.scale_down` has now been corrected, it should be
possible to drop our implementation of `Adaptive._retire_workers`.
Hence we drop it here.
@jakirkham jakirkham force-pushed the fix_scale_down_workers branch from d6712d5 to 279c3e5 Compare May 20, 2018 23:26
@jakirkham jakirkham requested a review from azjps May 21, 2018 00:20
@jakirkham
Member Author

Have tested this on our cluster and in a couple of test Docker containers. It seems to work well.

The only minor thing is that we are not seeing logging from this line in distributed. Since we do see logging messages about scaling down (added in this PR in the `scale_down` method), that line does get run somehow, but maybe not correctly. Since we did see the old retiring logging messages before, and the only difference is whether `yield` is used, I think it may have something to do with our use of a coroutine for `scale_down`, which is not explicitly included in the `Cluster` spec. Raised an issue ( dask/distributed#2004 ) to see if the `Cluster` spec could make `scale_up` and `scale_down` coroutines.

@jakirkham
Member Author

Should add that `stop_workers` does behave correctly on this front, and it does use `yield` with `scale_down`.

@jakirkham jakirkham force-pushed the fix_scale_down_workers branch from 81955bd to d004a3c Compare May 24, 2018 06:00
Instead of using coroutines for `scale_up` and `scale_down`, use
regular methods in their place and move the coroutines into internal
methods. This breaks the API of dask-drmaa; however, it is technically
more correct given the `Cluster` API's current expectations.
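The pattern that commit describes can be sketched roughly as follows. This is a hypothetical illustration using asyncio (the project itself used tornado-style coroutines at the time); the class and method names are made up for the example:

```python
import asyncio


class Cluster:
    """Illustrative pattern: public plain methods wrap internal coroutines."""

    def __init__(self):
        self.loop = asyncio.new_event_loop()
        self.workers = set()

    async def _scale_up(self, n):
        # The internal coroutine does the real (asynchronous) work.
        self.workers.update(f"w{i}" for i in range(n))

    def scale_up(self, n):
        # The public API stays a regular, synchronous method, matching
        # what the Cluster spec currently expects.
        self.loop.run_until_complete(self._scale_up(n))


c = Cluster()
c.scale_up(3)
assert c.workers == {"w0", "w1", "w2"}
```

Keeping the coroutine internal means callers never need to know whether the method is asynchronous, at the cost of breaking anyone who was already `yield`ing the old coroutine directly.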
@jakirkham jakirkham force-pushed the fix_scale_down_workers branch from d004a3c to 750cd0c Compare May 24, 2018 06:02
@azjps
Collaborator

azjps commented May 26, 2018

Looks good to me, code changes seem sensible and looked fine when I lightly tested with our cluster. Nice work cleaning it up 😃

@jakirkham
Member Author

Thanks for testing @azjps. Glad to hear it worked.

Generally I am happy with this as well. Just on the fence about including the last commit. Any thoughts?

@jakirkham jakirkham mentioned this pull request Jun 22, 2018