Errors when scaling up cluster no longer propagate to client side #8309
Comments
How was that error propagated to the client before?
So, the code snippet you are posting shows what I would expect to happen. I'm surprised to hear that this worked differently before.
I think it propagated through the `cluster.sync(cluster._correct_state)` call. At least without that, I don't see any errors raised (even before #8233).
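For context, a rough sketch of the call pattern referred to here (note that `_correct_state` is private `SpecCluster` machinery, so this is an implementation detail rather than supported usage):

```python
# Rough sketch of the pattern referred to above: force the cluster to reconcile
# its workers with the spec immediately, so exceptions raised while starting
# workers (e.g. a failing WorkerPlugin.setup) are re-raised in the caller.
# `_correct_state` is a private SpecCluster method, not public API.
cluster.scale(1)
cluster.sync(cluster._correct_state)
```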
That makes sense. What exactly is your expectation? The public API did not change, so I assume you are internally using …
I expect the reproducer (see the MCVE below) to raise a RuntimeError. With current main it does not.
I think I have a fix. The cluster's …
I would like us to find a way for the public API to make sense for you. I wouldn't even want to guarantee that … It does sound like you want a public version of this.
Ah, thanks. That issue does capture the kind of thing I was observing.
Yes, I think so. The rationale is that in dask-cuda, when we boot the cluster, we have to adjust the worker spec to ensure that each worker binds to the appropriate GPU (if there is more than one GPU on a node). To do this, … In this scenario, we'd like to be able to report failures back to the user immediately. I think I agree that it makes sense that …
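A sketch of the kind of per-worker spec adjustment described here (illustrative only; the spec keys, GPU count, and worker options are assumptions, not dask-cuda's actual code):

```python
# Illustrative sketch (not dask-cuda's implementation): pin each worker in a
# SpecCluster-style worker spec to one GPU via its environment.
from distributed import Nanny, Scheduler, SpecCluster

n_gpus = 2
worker_spec = {
    f"gpu-{i}": {
        "cls": Nanny,
        "options": {
            "env": {"CUDA_VISIBLE_DEVICES": str(i)},  # bind worker i to GPU i
            "nthreads": 1,
        },
    }
    for i in range(n_gpus)
}

cluster = SpecCluster(
    workers=worker_spec,
    scheduler={"cls": Scheduler, "options": {"dashboard_address": ":0"}},
)
```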
I believe the "wait for scaleup" is typically done with … I don't have a strong preference about which API to use for this, but I suggest not using a method that is marked as internal with an underscore. Are you interested in looking into a public version of this?
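Presumably the elided call above is something along the lines of `Client.wait_for_workers` (my assumption). A sketch of that public route, noting that it waits for workers to arrive rather than re-raising their startup errors:

```python
# Sketch, assuming Client.wait_for_workers is the "wait for scaleup" call being
# referred to: it blocks until the requested number of workers have connected,
# but a failed worker start eventually surfaces as a timeout rather than as the
# plugin's own error.
from distributed import Client

client = Client(cluster)                # `cluster` as in the earlier snippets
cluster.scale(2)
client.wait_for_workers(2, timeout=60)
```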
I had a first minimal go at restoring the old behaviour (in #8314) and will endeavour to do something more sensible (after some annual leave, so in about two weeks). IIUC, an appropriate fix would be to address (at least some of) the problems raised in #5919. Since just moving …
Describe the issue:
Since #8233, scaling up a cluster with a worker plugin that produces an error no longer propagates that error to the client side. Creating such a cluster with a non-zero initial number of workers does still produce an error.
Is this intended behaviour? And if so, is there a new blessed way of propagating errors when scaling up a cluster?
Minimal Complete Verifiable Example:
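The original snippet is not preserved here; below is a minimal sketch of the kind of reproducer described above (the plugin class, its name, and the cluster parameters are assumptions, not the issue author's code):

```python
# Minimal sketch of a reproducer for the behaviour described above (assumed
# details, not the original MCVE): a worker plugin whose setup always raises,
# passed as a worker kwarg so it runs while each new worker starts.
from distributed import LocalCluster, WorkerPlugin


class FailingPlugin(WorkerPlugin):
    def setup(self, worker):
        raise RuntimeError("plugin setup failed")


cluster = LocalCluster(n_workers=0, processes=False, plugins={FailingPlugin()})

# Before #8233 the RuntimeError reportedly surfaced on the client side when
# scaling up; on current main the failure is no longer re-raised for the caller.
cluster.scale(1)
```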
Running with distributed 2023.10.0-21-gb62b7005