Add no worker timeout for scheduler #8371
Conversation
Can one of the admins verify this patch? Admins can comment
Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

27 files +1, 27 suites +1, 9h 54m 41s ⏱️ +45m 34s

For more details on these failures, see this check.

Results for commit 50c7024. ± Comparison against base commit b95cf96.

♻️ This comment has been updated with latest results.
Can you please add a unit test for this? If you add the timeout as a config value it should be much easier to write one.
@fjetter: Do you have time for another review on this PR?
Looks fine, but there are a couple of test failures that appear to be related. There is a test failure in
@FTang21 do you have time to look into the outstanding issues?
I can envision how users may get confused by the cluster shutting down unexpectedly. Could you please add logging that explains which timeout tripped?
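For illustration, a minimal sketch of such a log line (the wording and placement are assumptions, not the final implementation):

logger.info(
    "Scheduler closing: no-workers-timeout of %s expired with tasks "
    "pending but no workers able to run them",
    self.no_workers_timeout,
)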
@@ -3860,9 +3863,12 @@ async def post(self):
     pc = PeriodicCallback(self.check_worker_ttl, self.worker_ttl * 1000)
     self.periodic_callbacks["worker-ttl"] = pc

-    pc = PeriodicCallback(self.check_idle, (self.idle_timeout or 1) * 1000 / 4)
+    pc = PeriodicCallback(self.check_idle, 250)
These methods are tiny and running them 4 times per second is inconsequential.
I would be annoyed if I had set a timeout of 2h and the shutdown was 30min later than expected.
Also, some third-party SchedulerPlugin could read from self.idle_since or self.no_workers_since and break when the user sets a timeout.
distributed/scheduler.py (Outdated)
worker_ttl: float | None
idle_since: float | None
idle_timeout: float | None
no_workers_since: float | None  # Note: not None iff there are pending tasks
no_workers_timeout: float | None
In the spirit of #8190, is there a reason to make these public?
As a rule of thumb it makes sense to me to keep the variables that hold a public setting as public too. I don't have a strong opinion on the two _since variables or the two check_ methods, so I'll change them to private.
I'll change the pre-existing attributes in a follow-up PR though.
> As a rule of thumb it makes sense to me to keep the variables that hold a public setting as public too.

👍, makes sense to me.
distributed/scheduler.py (Outdated)
if self.status in (Status.closing, Status.closed):
    return  # pragma: nocover

if (not self.queued and not self.unrunnable) or (self.queued and self.workers):
Should we also check whether we have tasks in processing (regardless of queued tasks) and return if that's the case?
I don't think so. The intent is to shut down the cluster if there are tasks to run but nowhere to run them. "There's a task to run" translates to queued, unrunnable, or processing:

- unrunnable tasks can exist if there are no workers whatsoever and queueing is disabled, or if there are some workers but they've all been excluded by task restrictions. In both cases their existence should trip the timeout.
- queued tasks can exist either if there are no workers whatsoever or all workers are busy with other tasks. So if there are any workers on the cluster, we can just assume they're already busy with other rootish tasks and reset the timeout.
- processing tasks exist if there's somewhere to run them, which should cause the timeout to reset.
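To make that concrete, here is a minimal sketch of what the periodic check could look like under this reasoning (attribute names follow the diff above; the clock choice and the shutdown call are assumptions, not the actual implementation):

def _check_no_workers(self) -> None:
    # Sketch of a Scheduler method, run from a PeriodicCallback.
    # Assumes: from time import monotonic
    if self.status in (Status.closing, Status.closed):
        return

    # Nothing is stuck: either there are no pending tasks at all, or the
    # queued tasks have workers that will eventually pick them up.
    if (not self.queued and not self.unrunnable) or (self.queued and self.workers):
        self.no_workers_since = None
        return

    if self.no_workers_since is None:
        self.no_workers_since = monotonic()  # start the timer
        return

    if (
        self.no_workers_timeout is not None
        and monotonic() - self.no_workers_since > self.no_workers_timeout
    ):
        # Assumption: log and shut down the same way the idle timeout does
        logger.info(
            "Shutting down scheduler: tasks are pending but no workers can run them"
        )
        self._ongoing_background_tasks.call_soon(self.close)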
Doesn't that translate to:

-if (not self.queued and not self.unrunnable) or (self.queued and self.workers):
+if self.processing or (not self.queued and not self.unrunnable) or (self.queued and self.workers):
Also, let me rephrase: should we also check whether we have tasks in processing (regardless of queued tasks) and reset no_workers_since if that's the case?
SchedulerState.processing sadly doesn't exist; you'd need any(ws.processing for ws in self.workers.values()) (like check_idle does). Do you see a use case that the current logic doesn't cover? I can't think of any...
@gen_cluster(
    client=True,
    nthreads=[("", 1)],
    config={"distributed.scheduler.no-workers-timeout": "100ms"},
)
async def test_no_workers_timeout_with_worker(c, s, a):
    """Do not trip no-workers-timeout when there are tasks processing"""
    import time

    s._check_no_workers()
    await asyncio.sleep(0.2)
    assert s.status == Status.running

    f1 = c.submit(time.sleep, 2)
    f2 = c.submit(inc, 1, key="x", workers=["127.0.0.2:1234"])
    await f1
    assert s.status == Status.running

This would kill the scheduler before we're able to complete f1. From what I understand, we only ever want to kill the cluster if there's nothing that could possibly be processed with the current set of workers.
👀 I had not seen that use case. Thank you. Fixed.
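For reference, folding that feedback into the condition could look like this (a sketch combining the suggestions from this thread; not necessarily the exact change that landed):

# Reset the no-workers timer if any worker is processing something,
# or if nothing is stuck waiting for a worker.
if (
    any(ws.processing for ws in self.workers.values())
    or (not self.queued and not self.unrunnable)
    or (self.queued and self.workers)
):
    self.no_workers_since = None
    return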
Co-authored-by: Hendrik Makait <[email protected]>
@hendrikmakait all comments have been addressed
distributed/distributed.yaml (Outdated)

idle-timeout: null        # Shut down after this duration, like "1h" or "30 minutes"
no-workers-timeout: 20m   # Shut down if there are tasks but no workers to process them
Given that idle-timeout is currently null, I'd also default to null for no-workers-timeout. If I don't want to shut down my cluster when there's nothing at all to be done, I probably also don't want to shut it down if I have something to be done but lack the means to do so.
Well, there's the use case of adaptive clusters that scale down to zero or almost zero. There you likely want to keep the scheduler always running, but if the cluster hangs, e.g. while 100 CPU workers are up because a single GPU worker failed to start, you want to tear it down quickly.

However, I agree that None is generally a more desirable default, particularly for non-adaptive situations.
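For example, an adaptive deployment that scales to zero might combine the two settings like this (values are illustrative):

distributed:
  scheduler:
    idle-timeout: null       # keep the scheduler alive while the cluster is idle
    no-workers-timeout: 5m   # but tear down if pending tasks have no worker that can run them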
> Well, there's the use case of adaptive clusters that scale down to zero or almost zero. There you likely want to keep the scheduler always running, but if the cluster hangs, e.g. while 100 CPU workers are up because a single GPU worker failed to start, you want to tear it down quickly.

Fair point.
@hendrikmakait all comments addressed
Thanks! LGTM assuming CI ends up green(ish).
Closes #8126
pre-commit run --all-files
Add a new timeout, no_workers_timeout, that trips when there are tasks still waiting to be processed but no worker is processing them. Update the scheduler logic to include this behavior without changing the old logic too much.
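For reference, the new timeout can also be set programmatically before starting the scheduler (the config key comes from the distributed.yaml diff above; the value is illustrative):

import dask

# Equivalent to setting distributed.scheduler.no-workers-timeout in distributed.yaml
dask.config.set({"distributed.scheduler.no-workers-timeout": "10 minutes"})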