-
-
Notifications
You must be signed in to change notification settings - Fork 720
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Don't stop Adaptive on error #8871
Don't stop Adaptive on error #8871
Conversation
Unit Test ResultsSee test report for an extended history of previous test failures. This is useful for diagnosing flaky tests. 25 files ± 0 25 suites ±0 10h 22m 58s ⏱️ + 8m 24s For more details on these failures, see this check. Results for commit b42358a. ± Comparison against base commit 80b3af5. This pull request removes 6 and adds 8 tests. Note that renamed tests count towards both.
♻️ This comment has been updated with latest results. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems fine to me. It's not clear from reading #2904 why that stop was introduced in the first place.
My gut feeling is that it was introduced to cover the case where the cluster closes, and the scheduler becomes unavailable. I think this should be covered by the |
Yeah I feel like the cluster object should stop adapting if the scheduler connection closes. I think we want to implement a protocol for cluster objects that sets out expectations and formalises the API a little. But for now I think implementing it in |
@jacobtomlinson: Conceptually, I agree with your approach, but for now I'd like to go the more conversative route of having Adaptive itself check whether its cluster is still running. This has caused the PR to blow up a bit because some functionality and tests needed to move between |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks great, thanks for taking the time here Hendrik.
Have you tried these changes out with any of the projects that use adaptive scaling? AFAIK that's only dask-jobqueue
and dask-cloudprovider
today.
distributed/deploy/adaptive.py
Outdated
cluster: Cluster, | ||
interval: str | float | timedelta | None = None, | ||
minimum: int | None = None, | ||
maximum: float | None = None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, this is a little confusing given that we actually either want int
or math.inf
, per
distributed/distributed/deploy/adaptive_core.py
Lines 112 to 113 in 8bafad5
if not isinstance(maximum, int) and not math.isinf(maximum): | |
raise TypeError(f"maximum must be int or inf; got {maximum}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't have strong feelings here, this just follow the numeric tower in PEP 484, i.e., int | float
can be simplified to float
. Unfortunately there isn't a canonical INF
literal that I'm aware of, instead we have math.inf
, float("inf")
, np.inf
, and more. Happy to change it back if people find it confusing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Intuitively I would say that None
is equivalent to math.inf
in this case. I expect most users would assume that setting the value to None
would not set an upper bound. They might even assume that setting it to 0
does the same.
Should we attempt to define an INF
literal in dask.typing
that we can use here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds reasonable but let's do that in a different PR. I'll roll my changes to the type back for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that None
is not equivalent to math.inf
in this case. Instead, we read the config value if None
is provided.
I've executed both test suites locally once without failures. There were quite a few skipped tests, so YMMV. |
🎉 |
Closes #8808
pre-commit run --all-files