[Ray Azure Autoscaler] new unique_id generation process leads to Azure killing & relaunching running VMs #31538
Comments
good point, looks like the `unique_id` was over-applied here. we should revert this to the old approach
Thanks @gramhagen . Do you have time to file a PR for the suggested fix?
yeah, let me create a pr, will need a little time to test it though
looks like this is a problem with the deployment name. I just updated my changes to use the unique vm name for the deployment too (and included the cluster name) to make it a little clearer if people have nodes from multiple clusters in the same rg.
yeah, the change here should resolve that: https://github.com/ray-project/ray/pull/31645/files (line 258)
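Roughly the idea behind the fix, as a minimal sketch; the names and formula below are illustrative only, not the actual code in #31645:

```python
# Illustrative sketch only, not the actual code from ray-project/ray#31645.
import uuid


def vm_and_deployment_name(cluster_name: str, node_type: str) -> str:
    # A fresh random suffix per launch keeps VM names unique, and reusing the
    # same name for the ARM deployment (with the cluster name included) makes
    # it clear which cluster a deployment belongs to when several clusters
    # share one resource group.
    unique_id = uuid.uuid4().hex[:8]
    return f"{cluster_name}-{node_type}-{unique_id}"


print(vm_and_deployment_name("my-ray-cluster", "worker"))
```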
Thank you!
Marking P2 because the Ray Azure launcher is community maintained (we should still try to fix it!)
Since we're on this potential bug fix release, another thing I've been seeing on my end is that, even though the cluster is shut down, some Azure resources are left behind. Do you know why this is happening?
afaik, there is no way to destroy resources that are not nodes when running `ray down`. it's best to deploy all resources into a resource group and delete that after shutting down the cluster. but you can also use the unique key to identify resources corresponding to a specific cluster and delete those manually. I think it will be a significant amount of work to automate that because there are dependencies that need to be accounted for in the order of deleting network resources.
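For reference, a minimal sketch of the resource-group approach mentioned above, using the Azure Python SDK; the subscription id and group name are placeholders:

```python
# Sketch: delete the whole resource group after `ray down`, assuming every
# cluster resource (VMs, NICs, disks, vnet, ...) was deployed into that group.
from azure.identity import DefaultAzureCredential
from azure.mgmt.resource import ResourceManagementClient

subscription_id = "<subscription-id>"  # placeholder
resource_group = "<ray-cluster-rg>"    # placeholder: the group from your cluster YAML

client = ResourceManagementClient(DefaultAzureCredential(), subscription_id)
# begin_delete returns a poller for a long-running operation; wait() blocks
# until the group and everything in it is gone.
client.resource_groups.begin_delete(resource_group).wait()
```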
Hi @gramhagen, thanks for working on a fix! What's the state of affairs with this issue? The problem still exists for me. Is there a workaround with current ray?
waiting for #31645 to close, you could grab the changes from that branch to build, but it will be easier to wait for the ci tools to build the wheel for you once the pr is merged.
…ays (#31645)

This reverts prior changes to node naming which led to non-unique names, causing constant node refreshing. Currently the Azure autoscaler blocks on node destruction, so that was removed in this change.

Related issue number: Closes #31538, Closes #25971

Signed-off-by: Scott Graham <[email protected]>
Co-authored-by: Scott Graham <[email protected]>
What happened + What you expected to happen
@gramhagen @richardliaw
Ray 2.1 and 2.2 introduce a bug where `unique_id` is derived from a fixed formula instead of the old random `uuid4()` value. Because Ray launches 5 instances by default and gradually scales up, the new `unique_id` scheme keeps re-using a pool of 5 ids, and as a result Azure keeps killing and relaunching existing healthy running VMs. You cannot scale beyond 5 nodes, and your existing running VMs are constantly killed and then relaunched.
This is old (Ray 2.0):
ray/python/ray/autoscaler/_private/_azure/node_provider.py, line 224 at 2947e23

This is new (Ray 2.1 or 2.2):
ray/python/ray/autoscaler/_private/_azure/node_provider.py, line 226 at f1b8bfd
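To illustrate the failure mode, here is a sketch only; the actual formula in `node_provider.py` differs, but the shape of the problem is the same:

```python
# Sketch only: not the actual Ray code, just the shape of the problem.
import hashlib
import uuid


def old_unique_id() -> str:
    # Ray <= 2.0 behavior: a fresh random id per VM, so names never collide.
    return uuid.uuid4().hex[:8]


def new_unique_id(cluster_name: str, slot: int) -> str:
    # Ray 2.1/2.2-style fixed formula (illustrative): the id depends only on
    # static inputs, so a default pool of 5 slots keeps yielding the same 5 names.
    return hashlib.md5(f"{cluster_name}-{slot}".encode()).hexdigest()[:8]


# Reusing a VM/deployment name in an Azure template redeploys over the existing
# VM, which is why healthy nodes get killed and relaunched instead of new ones
# being added.
names = {new_unique_id("my-cluster", i % 5) for i in range(20)}
assert len(names) == 5  # only 5 distinct names are ever produced
```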
This is a pretty serious bug IMO.
Versions / Dependencies
Ray = 2.1/2.2 (<= 2.0 is fine)
Reproduction script
Just launch a cluster and you will see it.
Issue Severity
High: It blocks me from completing my task.