[Ray Azure Autoscaler] new unique_id generation process leads to Azure killing & relaunching running VMs #31538
Comments
good point, looks like the `unique_id` was over-applied here. we should revert this to the old approach
Thanks @gramhagen . Do you have time to file a PR for the suggested fix?
yeah, let me create a pr, will need a little time to test it though
looks like this is a problem with the deployment name. I just updated my changes to use the unique vm name for the deployment too (and included the cluster name) to make it a little clearer if people have nodes from multiple clusters in the same rg.
yeah, the change here should resolve that: https://github.com/ray-project/ray/pull/31645/files (line 258)
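Roughly the idea behind the fix, as a minimal sketch; the names and formula below are illustrative only, not the actual code in #31645:

```python
# Illustrative sketch only, not the actual code from ray-project/ray#31645.
import uuid


def vm_and_deployment_name(cluster_name: str, node_type: str) -> str:
    # A fresh random suffix per launch keeps VM names unique, and reusing the
    # same name for the ARM deployment (with the cluster name included) makes
    # it clear which cluster a deployment belongs to when several clusters
    # share one resource group.
    unique_id = uuid.uuid4().hex[:8]
    return f"{cluster_name}-{node_type}-{unique_id}"


print(vm_and_deployment_name("my-ray-cluster", "worker"))
```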
Thank you!
Marking P2 because the Ray Azure launcher is community maintained (we should still try to fix it!)
Since we're on this potential bug fix release, another thing I've been seeing on my end is that, even though the cluster is shut down, some Azure resources are left behind. Do you know why this is happening?
afaik, there is no way to destroy resources that are not nodes when running `ray down`. it's best to deploy all resources into a resource group and delete that after shutting down the cluster. but you can also use the unique key to identify resources corresponding to a specific cluster and delete those manually. I think it will be a significant amount of work to automate that because there are dependencies that need to be accounted for in the order of deleting network resources.
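For reference, a minimal sketch of the resource-group approach mentioned above, using the Azure Python SDK; the subscription id and group name are placeholders:

```python
# Sketch: delete the whole resource group after `ray down`, assuming every
# cluster resource (VMs, NICs, disks, vnet, ...) was deployed into that group.
from azure.identity import DefaultAzureCredential
from azure.mgmt.resource import ResourceManagementClient

subscription_id = "<subscription-id>"  # placeholder
resource_group = "<ray-cluster-rg>"    # placeholder: the group from your cluster YAML

client = ResourceManagementClient(DefaultAzureCredential(), subscription_id)
# begin_delete returns a poller for a long-running operation; wait() blocks
# until the group and everything in it is gone.
client.resource_groups.begin_delete(resource_group).wait()
```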
Hi @gramhagen, thanks for working on a fix! What's the state of affairs with this issue? The problem still exists for me. Is there a workaround with current ray?
waiting for #31645 to close, you could grab the changes from that branch to build, but it will be easier to wait for the ci tools to build the wheel for you once the pr is merged.
…ays (#31645)

This reverts prior changes to node naming which led to non-unique names, causing constant node refreshing. Currently the Azure autoscaler blocks on node destruction, so that was removed in this change.

Related issue number: Closes #31538, Closes #25971

Signed-off-by: Scott Graham <[email protected]>
Co-authored-by: Scott Graham <[email protected]>
What happened + What you expected to happen
@gramhagen @richardliaw
Ray 2.1 and 2.2 introduce a bug where `unique_id` is derived from a fixed formula instead of the old random `uuid4()` value. Because Ray launches 5 instances by default and gradually scales up, the new `unique_id` scheme keeps re-using a pool of 5 ids, and as a result Azure keeps killing and relaunching existing healthy running VMs. You cannot scale beyond 5 nodes, and your existing running VMs are constantly killed and then relaunched.
This is old (Ray 2.0):
ray/python/ray/autoscaler/_private/_azure/node_provider.py, line 224 at 2947e23

This is new (Ray 2.1 or 2.2):
ray/python/ray/autoscaler/_private/_azure/node_provider.py, line 226 at f1b8bfd
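To illustrate the failure mode, here is a sketch only; the actual formula in `node_provider.py` differs, but the shape of the problem is the same:

```python
# Sketch only: not the actual Ray code, just the shape of the problem.
import hashlib
import uuid


def old_unique_id() -> str:
    # Ray <= 2.0 behavior: a fresh random id per VM, so names never collide.
    return uuid.uuid4().hex[:8]


def new_unique_id(cluster_name: str, slot: int) -> str:
    # Ray 2.1/2.2-style fixed formula (illustrative): the id depends only on
    # static inputs, so a default pool of 5 slots keeps yielding the same 5 names.
    return hashlib.md5(f"{cluster_name}-{slot}".encode()).hexdigest()[:8]


# Reusing a VM/deployment name in an Azure template redeploys over the existing
# VM, which is why healthy nodes get killed and relaunched instead of new ones
# being added.
names = {new_unique_id("my-cluster", i % 5) for i in range(20)}
assert len(names) == 5  # only 5 distinct names are ever produced
```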
This is a pretty serious bug IMO.
Versions / Dependencies
Ray = 2.1/2.2 (<= 2.0 is fine)
Reproduction script
Just launch a cluster and you will see it.
Issue Severity
High: It blocks me from completing my task.