[Backward Compatibility][Spot] Avoid cluster leakage by ray yaml overwritten and reduce spot controller cost on AWS #1235

Merged: 10 commits merged into master from reduce-iops-aws on Oct 14, 2022

@Michaelvll Michaelvll commented Oct 12, 2022

This PR resubmits #1221 and adds backward compatibility for it.

Update (Oct 13th):

Our current default instance on AWS is created with a volume set to the maximum IOPS, i.e. 16000. An instance created from the AWS console, however, gets only 3000 IOPS by default.
The extra IOPS on our volume cost (16000 - 3000) * 0.005 / 30 / 24 = $0.09 / hour, which is roughly 20% of the cost of our default CPU machine on AWS, m6i.2xlarge.
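
The arithmetic above can be reproduced with a short sketch. The rate of $0.005 per IOPS-month and the 30-day month are taken from the formula in this description; the function and constant names are hypothetical and are not part of SkyPilot.

```python
# Rate and month length as used in the PR description (assumptions, not
# values read from any AWS API).
PRICE_PER_IOPS_MONTH = 0.005  # USD per provisioned IOPS above the baseline
HOURS_PER_MONTH = 30 * 24


def extra_iops_cost_per_hour(provisioned_iops: int,
                             baseline_iops: int = 3000) -> float:
    """Hourly cost of the IOPS provisioned above the baseline."""
    extra = max(provisioned_iops - baseline_iops, 0)
    return extra * PRICE_PER_IOPS_MONTH / HOURS_PER_MONTH


print(f'${extra_iops_cost_per_hour(16000):.2f} / hour')  # $0.09 / hour
```

With the console default of 3000 IOPS the extra cost is zero, which is the saving this PR is after.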

Todo item: make the IOPS configurable across clouds (probably simplified to three tiers: low, medium, high).

Tested:

  • sky cpunode -c test-iops
  • sky launch --disk-size 1024
  • sky launch --disk-size 50
  • sky gpunode -c test-t4 --gpus t4
  • sky launch -c bk-iops --num-nodes 2 (before this PR); sky stop bk-iops; sky start bk-iops: it starts the same 2 instances.

@Michaelvll Michaelvll changed the title Reduce iops aws with backward compatibility [AWS] Reduce iops aws with backward compatibility Oct 12, 2022
Member

@concretevitamin concretevitamin left a comment

Nice @Michaelvll! Questions:

@Michaelvll
Collaborator Author

Does the mechanism here also fix #640? E.g., put "everything below and including file_mounts can and should be overwritten" into the key set.

This is a great point! I added the keys used for calculating the launch_hash to the key set, which should fix the issue.

Would https://github.com/skypilot-org/skypilot/blob/master/tests/backward_comaptibility_tests.sh catch the back compat issue? Thinking whether next time we can just enforce running that test to catch such issues more easily.

I just updated the script so that it catches the problem of unexpectedly launching another instance. The script needs some refactoring; perhaps we can do that in another PR?
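
To illustrate why a changed default such as the IOPS value can cause a second instance to be launched, here is a minimal sketch of a launch hash. It assumes the hash is a digest over the node config and auth settings; it is a simplified illustration, not Ray's actual function, and the config fields shown are made up for the example.

```python
import hashlib
import json


def launch_hash(node_config: dict, auth: dict) -> str:
    """Digest of the launch-relevant settings (hypothetical helper)."""
    payload = json.dumps([node_config, auth], sort_keys=True).encode('utf-8')
    return hashlib.sha1(payload).hexdigest()


# Illustrative configs: only the IOPS value differs.
old_node = {'BlockDeviceMappings': [{'Ebs': {'Iops': 16000}}]}
new_node = {'BlockDeviceMappings': [{'Ebs': {'Iops': 3000}}]}
auth = {'ssh_user': 'ubuntu'}

# A different hash means existing instances no longer match the config,
# so restarting the cluster could launch new ones.
print(launch_hash(old_node, auth) == launch_hash(new_node, auth))  # False
```

Keeping the hashed fields of an existing cluster unchanged is what lets sky start find the same instances again.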

Member

@concretevitamin concretevitamin left a comment

Awesome! We can update the PR title/desc to reflect

# - keeping the auth is not enough because the content of the key file will be used
# for calculating the hash.
# TODO(zhwu): Keep in sync with the fields used in https://github.com/ray-project/ray/blob/e4ce38d001dbbe09cd21c497fedd03d692b2be3e/python/ray/autoscaler/_private/commands.py#L687-L701
_RAY_YAML_KEYS_TO_RESTORE_FOR_BACK_COMPATIBILITY = {
Member

Is it true that none of the following contribute to the launch hash?

  • resources under ray.head.default:
  • any of
cluster_name: {{cluster_name}}

# The maximum number of workers nodes to launch in addition to the head node.
max_workers: {{num_nodes - 1}}
upscaling_speed: {{num_nodes - 1}}
idle_timeout_minutes: 60
  • head_node_type: ray.head.default

Just figuring out whether we should preserve these fields.

Collaborator Author

I just added cluster_name to the key set. For resources and head_node_type, I think it is better to add them only when we actually want to modify those fields and find that they affect backward compatibility.
As far as I understand, max_workers, upscaling_speed and idle_timeout_minutes do not affect the launch hash.
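
The key-set mechanism discussed in this thread can be sketched as follows. This is an illustration under assumptions, not the code in this PR: the function name, the example key set, and the config fields are hypothetical, and plain dicts stand in for the parsed YAML.

```python
import copy

# Hypothetical key set; the real one lives in sky/backends/backend_utils.py.
KEYS_TO_RESTORE = {'cluster_name', 'auth', 'node_config'}


def restore_keys(new_config: dict, old_config: dict, keys: set) -> dict:
    """Copy of new_config with entries named in `keys` taken from old_config."""
    restored = copy.deepcopy(new_config)

    def _walk(new: dict, old: dict) -> None:
        for key, value in new.items():
            if key not in old:
                continue
            if key in keys:
                # Keep the value the cluster was originally launched with.
                new[key] = copy.deepcopy(old[key])
            elif isinstance(value, dict) and isinstance(old[key], dict):
                _walk(value, old[key])

    _walk(restored, old_config)
    return restored


old = {'cluster_name': 'bk-iops', 'max_workers': 1,
       'available_node_types': {
           'ray.head.default': {'node_config': {'Iops': 16000}}}}
new = {'cluster_name': 'bk-iops', 'max_workers': 3,
       'available_node_types': {
           'ray.head.default': {'node_config': {'Iops': 3000}}}}

merged = restore_keys(new, old, KEYS_TO_RESTORE)
print(merged['max_workers'])  # 3
print(merged['available_node_types']['ray.head.default']['node_config'])
# {'Iops': 16000}
```

Fields outside the key set, such as max_workers here, still pick up the newly generated values.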

@@ -733,8 +750,11 @@ def write_cluster_config(to_provision: 'resources.Resources',
    yaml_path = _get_yaml_path_from_cluster_name(cluster_name)
    old_yaml_content = None
    if os.path.exists(yaml_path):
        with open(yaml_path, 'r') as f:
            old_yaml_content = f.read()
        if force_overwrite:
Member

When would this file exist while the cluster does not? If this is an exceptional case to guard against, how about renaming it to keep_launch_fields_in_existing_config: bool?

Collaborator Author

The cluster config yaml files may not be deleted correctly if an error happens during sky.down. I did find multiple yaml files in the folder ~/.sky/generated that do not belong to any existing cluster.
Good point! Let me rename the variable.
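
A minimal sketch of the guard discussed here, assuming the renamed flag decides whether a leftover yaml file is reused or discarded. The helper name read_old_yaml is hypothetical and the logic is an illustration, not the code merged in this PR.

```python
import os
import tempfile
from typing import Optional


def read_old_yaml(yaml_path: str,
                  keep_launch_fields_in_existing_config: bool
                  ) -> Optional[str]:
    """Return the previous yaml content, or None if it must not be reused."""
    if not os.path.exists(yaml_path):
        return None
    if not keep_launch_fields_in_existing_config:
        # Assumed to be a stale file left by a cluster that no longer exists.
        os.remove(yaml_path)
        return None
    with open(yaml_path, 'r') as f:
        return f.read()


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'stale-cluster.yml')
    with open(path, 'w') as f:
        f.write('cluster_name: stale-cluster\n')
    kept = read_old_yaml(path, keep_launch_fields_in_existing_config=True)
    dropped = read_old_yaml(path, keep_launch_fields_in_existing_config=False)
    print(repr(kept), dropped, os.path.exists(path))
```

In this sketch the caller is expected to pass False when it knows the cluster does not exist, so a leftover file never leaks old launch fields into a new cluster.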

@Michaelvll Michaelvll changed the title [AWS] Reduce iops aws with backward compatibility [Backward Compatibility][Spot] Avoid cluster leakage by ray yaml overwritten and reduce spot controller cost on AWS Oct 13, 2022
Member

@concretevitamin concretevitamin left a comment

LGTM - this is awesome to have @Michaelvll.

@Michaelvll
Collaborator Author

Michaelvll commented Oct 13, 2022

Tested:

  • ./tests/backward_compatibility_tests.sh
  • ./tests/run_smoke_tests.sh

@Michaelvll Michaelvll merged commit ed4b3f2 into master Oct 14, 2022
@Michaelvll Michaelvll deleted the reduce-iops-aws branch October 14, 2022 00:11
ewzeng pushed a commit to ewzeng/skypilot that referenced this pull request Oct 24, 2022
…written and reduce spot controller cost on AWS (skypilot-org#1235)

* set the default iops to be same as console for AWS

* fix

* add backward compatibility

* Address comments

* fix backward_compatibility_test

* Add backward test for discarding old cluster

* update backward

* less output

* address comments
Successfully merging this pull request may close these issues.

Back-compat: don't overwrite certain cluster configs in yaml if cluster exists
2 participants