
Ensure adaptive scaling is properly awaited and closed #4720

Merged: 5 commits into dask:main on May 28, 2021

Conversation

fjetter (Member)

@fjetter fjetter commented Apr 20, 2021

  • Primarily, this ensures that scale_up/scale_down is properly awaited when the sync method is used, which is more common than loop.add_callback.
  • It also ensures the adaptive PeriodicCallback is stopped properly once a cluster shuts down.
  • A bit of cleanup: there was a test which raised error logs, and I had trouble figuring out what it was actually doing. It had also been refactored a few times since its original inception.
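The first bullet can be sketched in isolation. This is a minimal illustration, not the actual dask API: `_call_maybe_async`, `scale_sync`, and `scale_async` are hypothetical names showing how a scale method can be awaited correctly whether it is a plain function or a coroutine function.

```python
import asyncio
import inspect

async def _call_maybe_async(fn, *args):
    # Call fn and, if the result is awaitable (a coroutine), await it.
    # This way both sync and async scale implementations finish before
    # the caller proceeds, instead of leaving a coroutine un-awaited.
    result = fn(*args)
    if inspect.isawaitable(result):
        result = await result
    return result

def scale_sync(n):
    # A plain synchronous scale method.
    return n

async def scale_async(n):
    # An asynchronous scale method.
    await asyncio.sleep(0)
    return n

async def main():
    a = await _call_maybe_async(scale_sync, 2)
    b = await _call_maybe_async(scale_async, 3)
    return a, b

print(asyncio.run(main()))  # (2, 3)
```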

@@ -19,43 +20,6 @@
)


@pytest.mark.asyncio
async def test_simultaneous_scale_up_and_down(cleanup):
fjetter (Member Author):

This test was added in #1608 to disallow simultaneous up-/down-scaling, but the current implementation doesn't work at all since the API changed. Instead, it only raises error logs.

fjetter (Member Author) commented Apr 21, 2021:

Test failure is #4651 / #4719

jrbourbeau (Member) left a comment:

Thanks @fjetter! I left a few small comments, but overall the changes here look good to me

distributed/deploy/tests/test_adaptive.py (outdated, resolved)
distributed/tests/test_scheduler.py (outdated, resolved)


@pytest.mark.asyncio
async def test_adaptive_stopped():
jrbourbeau (Member):

Given that this uses an async cluster, do we need the async_wait_for + timeout or can we just assert the various attributes directly? For example, when I make the following changes locally this test still passes:

diff --git a/distributed/deploy/tests/test_adaptive.py b/distributed/deploy/tests/test_adaptive.py
index 39284302..8cf4f381 100644
--- a/distributed/deploy/tests/test_adaptive.py
+++ b/distributed/deploy/tests/test_adaptive.py
@@ -466,16 +466,8 @@ async def test_adaptive_stopped():
     """
     async with LocalCluster(n_workers=0, asynchronous=True) as cluster:
         async with Client(cluster, asynchronous=True) as client:
-            instance = cluster.adapt(interval="10ms")
+            pc = cluster.adapt(interval="10ms").periodic_callback
+            assert pc is not None
+            assert pc.is_running() is not None

-            await async_wait_for(
-                lambda: instance.periodic_callback is not None, timeout=5
-            )
-
-            await async_wait_for(
-                lambda: instance.periodic_callback.is_running() is not None, timeout=5
-            )
-
-            pc = instance.periodic_callback
-
-    await async_wait_for(lambda: pc.is_running() is not None, timeout=5)
+    assert pc.is_running() is not None

fjetter (Member Author):

Well, at the very least we'll need one wait in between which actually waits for the whole thing to start. If the PC was never started, the conditions are trivially true. I'll add a comment and see what I can remove for it to still work.
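The kind of wait described here can be sketched with a minimal polling helper. `async_wait_for` below is a hypothetical stand-in for the test utility of the same name, not its actual implementation in distributed.

```python
import asyncio
import time

async def async_wait_for(predicate, timeout):
    # Poll predicate until it returns True or the timeout elapses,
    # yielding to the event loop between checks so background tasks
    # (like a starting PeriodicCallback) can make progress.
    deadline = time.monotonic() + timeout
    while not predicate():
        if time.monotonic() > deadline:
            raise TimeoutError("condition not met in time")
        await asyncio.sleep(0.01)

started = {"flag": False}

async def demo():
    async def start_later():
        # Simulates something that only becomes true after startup.
        await asyncio.sleep(0.05)
        started["flag"] = True

    asyncio.get_running_loop().create_task(start_later())
    # Without this wait, asserting the flag immediately would be racy.
    await async_wait_for(lambda: started["flag"], timeout=5)
    return started["flag"]

print(asyncio.run(demo()))  # True
```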

fjetter (Member Author):

All of the is not None checks were useless, which is why the test did not fail. I properly assert on the bool now, and now I do need to wait.
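Why those checks were trivially true: `is_running()` returns a bool, and a bool is never `None`, so `pc.is_running() is not None` passes whether or not the callback ever started. A minimal sketch with a hypothetical `FakePeriodicCallback` stand-in (mimicking the shape of Tornado's PeriodicCallback, not its implementation):

```python
class FakePeriodicCallback:
    # Minimal stand-in: is_running() always returns a bool, never None.
    def __init__(self):
        self._running = False

    def start(self):
        self._running = True

    def stop(self):
        self._running = False

    def is_running(self):
        return self._running

pc = FakePeriodicCallback()
# Trivially true whether started or not, so this assertion can never fail:
assert pc.is_running() is not None
# The meaningful assertions check the boolean value itself:
assert pc.is_running() is False
pc.start()
assert pc.is_running() is True
pc.stop()
assert pc.is_running() is False
```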

Comment on lines +418 to +421
# Need to call stop here before we close all servers to avoid having
# dangling tasks in the ioloop
with suppress(AttributeError):
self._adaptive.stop()
fjetter (Member Author):

This is a bit unfortunate since I need to call it up here and not down in Cluster. If I do this only in Cluster, the event loop seems to close too soon and we still have pending tasks from AdaptiveCore.adapt.

I'm wondering if we ever considered adding PYTHONASYNCIODEBUG=1 to our test suite, which would surface these instances. Not sure how much would break, or if this is a bad idea in general.

@fjetter fjetter force-pushed the cleanup/adaptive branch from bc0b646 to 0ac3f5c Compare May 26, 2021 12:15
@fjetter fjetter force-pushed the cleanup/adaptive branch from 9c6137c to 733fe63 Compare May 26, 2021 15:15
@fjetter fjetter force-pushed the cleanup/adaptive branch from 974bc3b to edfa024 Compare May 27, 2021 16:35
@fjetter fjetter merged commit 93e2869 into dask:main May 28, 2021
@fjetter fjetter deleted the cleanup/adaptive branch May 28, 2021 07:50
douglasdavis pushed a commit to douglasdavis/distributed that referenced this pull request May 28, 2021
* Ensure adaptive scaling is properly awaited and closed

* review comments

* Ensure no tasks are pending when closing adaptive cluster

* remove assert in stop

* break cyclic ref in adaptive core