This repository has been archived by the owner on Feb 10, 2021. It is now read-only.

Fix DRMAACluster.scale_down and drop Adaptive._retire_workers #85

Open

jakirkham wants to merge 5 commits into master from fix_scale_down_workers
Conversation

jakirkham
Member

Fixes #65
Replaces #81

Rewrites DRMAACluster.scale_down to use DRMAA to terminate the workers. Also makes use of DRMAACluster.scale_down in DRMAACluster.stop_workers (where this code was pulled from). With this change, it should now be possible to drop Adaptive._retire_workers, which is already present in the Adaptive parent class.

cc @azjps
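To illustrate the shape of the change described above, here is a minimal, hypothetical sketch (the class and attribute names are illustrative stand-ins, not the actual dask-drmaa implementation): `scale_down` terminates workers through the job backend, and `stop_workers` simply delegates to it so both paths share one code path.

```python
class MockSession:
    """Stand-in for a DRMAA session; terminate() just records the job id."""
    def __init__(self):
        self.terminated = []

    def terminate(self, job_id):
        self.terminated.append(job_id)


class Cluster:
    """Illustrative cluster: scale_down owns worker termination."""
    def __init__(self):
        self.session = MockSession()
        self.workers = {"job-1", "job-2"}

    def scale_down(self, worker_ids):
        # Deduplicate ids before acting on them (see the set/list commit).
        for wid in set(worker_ids):
            self.session.terminate(wid)
            self.workers.discard(wid)

    def stop_workers(self, worker_ids):
        # Reuse scale_down so stopping workers and scaling down stay in sync.
        self.scale_down(worker_ids)


c = Cluster()
c.stop_workers(["job-1", "job-1"])  # duplicates collapse to one termination
assert c.session.terminated == ["job-1"]
assert c.workers == {"job-2"}
```

The point of the delegation is that there is a single place where workers are actually terminated, which is what lets the redundant `Adaptive._retire_workers` override be dropped.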

jakirkham added 3 commits May 20, 2018 19:25
Pass all of `worker_ids` through a `set` before converting them to a
`list` to ensure there are no duplicates.
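The dedup step in that commit amounts to the following (a trivial sketch, with made-up worker ids):

```python
worker_ids = ["w1", "w2", "w1"]

# Pass the ids through a set before converting back to a list,
# so each worker is only acted on once. Note set() does not
# preserve the original ordering.
worker_ids = list(set(worker_ids))

assert sorted(worker_ids) == ["w1", "w2"]
```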
As the purpose of `scale_down` is to shut down workers through the
scheduler directly, pull code from `stop_workers` to do exactly this
with DRMAA. Update `stop_workers` to use `scale_down` for this
functionality as well. This should help ensure `DRMAACluster` matches
the expected `Cluster` specification more closely.
The `_retire_workers` method seems to largely duplicate the same in
`distributed`'s `Adaptive`. So just go ahead and drop our
implementation.

This was tried before, but didn't work due to duplicate `retire_workers`
calls to the `Scheduler` in both `Adaptive._retire_workers` and
`DRMAACluster.scale_down`. However, as the behavior of
`DRMAACluster.scale_down` has now been corrected, it should be
possible to drop our implementation of `Adaptive._retire_workers`.
Hence we drop it here.
@jakirkham jakirkham force-pushed the fix_scale_down_workers branch from d6712d5 to 279c3e5 Compare May 20, 2018 23:26
@jakirkham jakirkham requested a review from azjps May 21, 2018 00:20
@jakirkham
Member Author

Have tested this on our cluster and in a couple of test Docker containers. It seems to work well.

The only minor thing is that we are not seeing logging from this line in distributed. Since we do see logging messages about scaling down (added in this PR in the `scale_down` method), that line does get run somehow, but maybe not correctly. Since we did see the old retiring logging messages before, and the only difference is whether `yield` is used, I think it may have something to do with our use of a coroutine for `scale_down`, which is not explicitly included in the `Cluster` spec. Raised an issue ( dask/distributed#2004 ) to see if the `Cluster` spec could make `scale_up` and `scale_down` coroutines.

@jakirkham
Member Author

Should add that `stop_workers` does behave correctly on this front, and it does use `yield` with `scale_down`.

@jakirkham jakirkham force-pushed the fix_scale_down_workers branch from 81955bd to d004a3c Compare May 24, 2018 06:00
Instead of using coroutines for `scale_up` and `scale_down`, use
regular methods in their place and move the coroutines into internal
methods. This breaks the API of dask-drmaa; however, it is technically
more correct given the `Cluster` API's current expectations.
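The pattern that commit describes can be sketched roughly as follows. This is a hypothetical illustration using asyncio (the project itself used tornado-style coroutines at the time); the class and method names are made up for the example:

```python
import asyncio


class Cluster:
    """Illustrative pattern: public plain methods wrap internal coroutines."""

    def __init__(self):
        self.loop = asyncio.new_event_loop()
        self.workers = set()

    async def _scale_up(self, n):
        # The internal coroutine does the real (asynchronous) work.
        self.workers.update(f"w{i}" for i in range(n))

    def scale_up(self, n):
        # The public API stays a regular, synchronous method, matching
        # what the Cluster spec currently expects.
        self.loop.run_until_complete(self._scale_up(n))


c = Cluster()
c.scale_up(3)
assert c.workers == {"w0", "w1", "w2"}
```

Keeping the coroutine internal means callers never need to know whether the method is asynchronous, at the cost of breaking anyone who was already `yield`ing the old coroutine directly.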
@jakirkham jakirkham force-pushed the fix_scale_down_workers branch from d004a3c to 750cd0c Compare May 24, 2018 06:02
@azjps
Collaborator

azjps commented May 26, 2018

Looks good to me, code changes seem sensible and looked fine when I lightly tested with our cluster. Nice work cleaning it up 😃

@jakirkham
Member Author

Thanks for testing @azjps. Glad to hear it worked.

Generally I am happy with this as well. Just on the fence about including the last commit. Any thoughts?

@jakirkham jakirkham mentioned this pull request Jun 22, 2018