Don't stop Adaptive on error #8871

hendrikmakait · 2024-09-10T17:05:55Z

Tests added / passed
Passes pre-commit run --all-files

github-actions · 2024-09-10T17:14:17Z

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

25 files ±0, 25 suites ±0, 10h 22m 58s ⏱️ +8m 24s
4 128 tests +2: 4 014 ✅ +3, 110 💤 -1, 4 ❌ ±0
47 686 runs +24: 45 586 ✅ +35, 2 095 💤 -12, 5 ❌ +1

For more details on these failures, see this check.

Results for commit b42358a. ± Comparison against base commit 80b3af5.

This pull request removes 6 and adds 8 tests. Note that renamed tests count towards both.

Removed tests:

distributed.deploy.tests.test_adaptive ‑ test_adaptive_scale_down_override
distributed.deploy.tests.test_adaptive_core ‑ test_adapt_oserror_safe_target
distributed.deploy.tests.test_adaptive_core ‑ test_adapt_oserror_scale
distributed.deploy.tests.test_adaptive_core ‑ test_adapt_stop_del
distributed.deploy.tests.test_adaptive_core ‑ test_adaptive_logs_stopping_once
distributed.deploy.tests.test_adaptive_core ‑ test_interval

Added tests:

distributed.deploy.tests.test_adaptive ‑ test_adapt_callback_logs_error_in_scale_down
distributed.deploy.tests.test_adaptive ‑ test_adapt_gets_stopped_on_cluster_close
distributed.deploy.tests.test_adaptive ‑ test_adapt_logs_error_in_safe_target
distributed.deploy.tests.test_adaptive ‑ test_adapt_stop_del
distributed.deploy.tests.test_adaptive ‑ test_adaptive_logs_stopping_once[False]
distributed.deploy.tests.test_adaptive ‑ test_adaptive_logs_stopping_once[True]
distributed.deploy.tests.test_adaptive ‑ test_adaptive_stops_on_cluster_status_change
distributed.deploy.tests.test_adaptive ‑ test_interval

♻️ This comment has been updated with latest results.

jacobtomlinson

This seems fine to me. It's not clear from reading #2904 why that stop was introduced in the first place.

hendrikmakait · 2024-09-10T17:41:33Z

> This seems fine to me. It's not clear from reading #2904 why that stop was introduced in the first place.

My gut feeling is that it was introduced to cover the case where the cluster closes, and the scheduler becomes unavailable. I think this should be covered by the Cluster object (and is indeed handled by Cluster). The open question is how to enforce this from subclasses or duck-typed clusters. (This is why I moved this to draft.)

jacobtomlinson · 2024-09-11T09:00:33Z

Yeah I feel like the cluster object should stop adapting if the scheduler connection closes. I think we want to implement a protocol for cluster objects that sets out expectations and formalises the API a little. But for now I think implementing it in SpecCluster should be enough?

hendrikmakait · 2024-09-13T09:19:01Z

@jacobtomlinson: Conceptually, I agree with your approach, but for now I'd like to go the more conservative route of having Adaptive itself check whether its cluster is still running. This has caused the PR to blow up a bit because some functionality and tests needed to move between adaptive and adaptive_core.
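For readers skimming the thread, here is a minimal sketch of the behaviour described above: an adaptive loop that logs a failed step and keeps going, and that stops only once its cluster reports it is no longer running. This is not the code in this PR. `Status`, `FakeCluster`, `AdaptiveSketch`, and `flaky_step` are hypothetical names invented for the illustration.

```python
import asyncio
import logging
from enum import Enum

logger = logging.getLogger("adaptive_sketch")


class Status(Enum):
    RUNNING = "running"
    CLOSED = "closed"


class FakeCluster:
    """Hypothetical stand-in for a cluster object exposing a ``status``."""

    def __init__(self):
        self.status = Status.RUNNING


class AdaptiveSketch:
    """Hypothetical adaptive loop: errors are logged, not fatal."""

    def __init__(self, cluster, adapt_once, interval=0.01):
        self.cluster = cluster
        self.adapt_once = adapt_once  # async callable doing one scaling step
        self.interval = interval
        self.iterations = 0
        self.errors = 0

    async def run(self):
        # Stop only when the cluster itself is no longer running.
        while self.cluster.status is Status.RUNNING:
            self.iterations += 1
            try:
                await self.adapt_once()
            except Exception:
                # Keep adapting after a failed step instead of stopping.
                self.errors += 1
                logger.exception("Adaptive step failed; will retry")
            await asyncio.sleep(self.interval)


async def demo():
    cluster = FakeCluster()
    calls = 0

    async def flaky_step():
        nonlocal calls
        calls += 1
        if calls == 1:
            # Simulate a transient failure on the first step.
            raise OSError("scheduler temporarily unreachable")
        if calls == 3:
            # Simulate the cluster shutting down.
            cluster.status = Status.CLOSED

    adaptive = AdaptiveSketch(cluster, flaky_step, interval=0)
    await adaptive.run()
    return adaptive.iterations, adaptive.errors
```

Calling `asyncio.run(demo())` runs three steps: the first raises and is logged, the second succeeds, and the third closes the fake cluster, which ends the loop.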

jacobtomlinson

This looks great, thanks for taking the time here Hendrik.

Have you tried these changes out with any of the projects that use adaptive scaling? AFAIK that's only dask-jobqueue and dask-cloudprovider today.

ntabris · 2024-09-16T21:38:38Z

distributed/deploy/adaptive.py

+        cluster: Cluster,
+        interval: str | float | timedelta | None = None,
+        minimum: int | None = None,
+        maximum: float | None = None,


Hm, this is a little confusing given that we actually either want int or math.inf, per

distributed/distributed/deploy/adaptive_core.py

Lines 112 to 113 in 8bafad5

    if not isinstance(maximum, int) and not math.isinf(maximum):
        raise TypeError(f"maximum must be int or inf; got {maximum}")

hendrikmakait · Sep 17, 2024

I don't have strong feelings here; this just follows the numeric tower in PEP 484, i.e., int | float can be simplified to float. Unfortunately, there isn't a canonical INF literal that I'm aware of; instead we have math.inf, float("inf"), np.inf, and more. Happy to change it back if people find it confusing.

jacobtomlinson · Sep 17, 2024

Intuitively I would say that None is equivalent to math.inf in this case. I expect most users would assume that setting the value to None would not set an upper bound. They might even assume that setting it to 0 does the same.

Should we attempt to define an INF literal in dask.typing that we can use here?

hendrikmakait · Sep 23, 2024

Sounds reasonable, but let's do that in a different PR. I'll roll my changes to the type back for now.

hendrikmakait · Sep 23, 2024

Note that None is not equivalent to math.inf in this case. Instead, we read the config value if None is provided.
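To make the distinction in this thread concrete, the sketch below combines the validation check quoted above with the "None means read the config" behaviour. It is an illustration under stated assumptions, not the library's implementation: `CONFIG_DEFAULTS` and `resolve_maximum` are hypothetical names, and the default of `math.inf` is an assumed value standing in for whatever the real configuration holds.

```python
from __future__ import annotations

import math

# Hypothetical stand-in for the configuration mentioned in the thread;
# the default value here is an assumption made for the example.
CONFIG_DEFAULTS = {"maximum": math.inf}


def resolve_maximum(maximum: int | float | None = None) -> int | float:
    """Return the effective maximum, falling back to config when None."""
    if maximum is None:
        # None does not mean "unbounded"; it means "use the configured value".
        maximum = CONFIG_DEFAULTS["maximum"]
    # Same check as the snippet quoted above: only ints or infinity pass.
    if not isinstance(maximum, int) and not math.isinf(maximum):
        raise TypeError(f"maximum must be int or inf; got {maximum}")
    return maximum
```

With these assumptions, `resolve_maximum(0)` stays `0` (a real upper bound of zero), `resolve_maximum(None)` yields the configured default, and a finite float such as `2.5` raises `TypeError` even though the annotation `float` alone would allow it.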

hendrikmakait · 2024-09-23T12:23:18Z

> Have you tried these changes out with any of the projects that use adaptive scaling? AFAIK that's only dask-jobqueue and dask-cloudprovider today.

I've executed both test suites locally once without failures. There were quite a few skipped tests, so YMMV.

jacobtomlinson · 2024-09-27T14:00:17Z

🎉

Don't stop Adaptive on error

2b2a699

hendrikmakait requested review from jacobtomlinson and fjetter as code owners September 10, 2024 17:05

jacobtomlinson approved these changes Sep 10, 2024

hendrikmakait marked this pull request as draft September 10, 2024 17:37

Fix import

4852bd2

Refactor and add bi-directional checking for adaptive stopping

831675f

hendrikmakait requested a review from jacobtomlinson September 13, 2024 09:17

hendrikmakait marked this pull request as ready for review September 13, 2024 09:19

jacobtomlinson reviewed Sep 16, 2024

ntabris reviewed Sep 16, 2024

hendrikmakait added 3 commits September 23, 2024 13:58

int | float

1a92d24

int | float

488480b

Merge branch 'main' into dont-stop-adaptive-on-oserror

b42358a

jacobtomlinson approved these changes Sep 23, 2024

hendrikmakait merged commit e52da46 into dask:main Sep 27, 2024
23 of 30 checks passed

hendrikmakait deleted the dont-stop-adaptive-on-oserror branch September 27, 2024 11:00
