Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stop/down clusters with new provisioner #2121

Merged
merged 6 commits into from
Jul 3, 2023
Merged

Conversation

suquark
Copy link
Collaborator

@suquark suquark commented Jun 22, 2023

This PR is part of the new provisioner PR, which handles stopping/terminating of clusters. This PR only includes the implementation for AWS, but it can serve as a starting point for other clouds.

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: bash tests/backward_comaptibility_tests.sh

@suquark suquark changed the title [WIP] Stop/down clusters with new provisioner Stop/down clusters with new provisioner Jun 22, 2023
@suquark suquark requested a review from Michaelvll June 22, 2023 12:04
Copy link
Collaborator

@Michaelvll Michaelvll left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for separating this PR out from the new provisioner PR @suquark! I really like how tidy these codes look. Left several questions and thought, mainly about how we decouple things. We can jump on a call when more discussion is needed.

sky/provision/common.py Outdated Show resolved Hide resolved
sky/backends/cloud_vm_ray_backend.py Outdated Show resolved Hide resolved
sky/provision/aws/utils.py Outdated Show resolved Hide resolved
sky/provision/aws/instance.py Outdated Show resolved Hide resolved
sky/backends/cloud_vm_ray_backend.py Show resolved Hide resolved
sky/provision/aws/instance.py Show resolved Hide resolved
@suquark suquark requested a review from Michaelvll June 24, 2023 12:09
Copy link
Collaborator

@Michaelvll Michaelvll left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the updates @suquark! The code mostly looks good to me. While I met the following error during the test with pytest tests/test_smoke.py --aws, i.e. the self.run_on_head does not work well when the cluster does not exist.

  File "/home/gcpuser/skypilot/sky/backends/cloud_vm_ray_backend.py", line 3295, in teardown_no_lock
    if not self.run_on_head(handle, 'ray stop --force'):
  File "/home/gcpuser/skypilot/sky/utils/common_utils.py", line 256, in _record
    return f(*args, **kwargs)
  File "/home/gcpuser/skypilot/sky/backends/cloud_vm_ray_backend.py", line 3632, in run_on_head
    head_ip = backend_utils.get_head_ip(handle, _FETCH_IP_MAX_ATTEMPTS)
  File "/home/gcpuser/skypilot/sky/utils/common_utils.py", line 256, in _record
    return f(*args, **kwargs)
  File "/home/gcpuser/skypilot/sky/backends/backend_utils.py", line 1575, in get_head_ip
    head_ip = _query_head_ip_with_retries(handle.cluster_yaml, max_attempts)
  File "/home/gcpuser/skypilot/sky/backends/backend_utils.py", line 1375, in _query_head_ip_with_retries
    raise RuntimeError('Failed to get head ip') from e
RuntimeError: Failed to get head ip

sky/backends/cloud_vm_ray_backend.py Outdated Show resolved Hide resolved
sky/backends/cloud_vm_ray_backend.py Outdated Show resolved Hide resolved
sky/backends/cloud_vm_ray_backend.py Outdated Show resolved Hide resolved
sky/backends/provision_lib.py Outdated Show resolved Hide resolved
sky/backends/provision_lib.py Outdated Show resolved Hide resolved
sky/provision/aws/instance.py Outdated Show resolved Hide resolved
sky/provision/aws/instance.py Outdated Show resolved Hide resolved
sky/provision/aws/instance.py Outdated Show resolved Hide resolved
@suquark suquark requested a review from Michaelvll June 26, 2023 10:14
@suquark
Copy link
Collaborator Author

suquark commented Jun 26, 2023

Thanks for your review! could you also tell me the failed tests? I can test if by myself

@suquark suquark force-pushed the new_provisioner_stop_down branch from 900dd3a to 6469dd5 Compare June 26, 2023 15:13
Copy link
Collaborator

@Michaelvll Michaelvll left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fix @suquark! Just passed the test smoothly pytest tests/test_smoke.py. LGTM.

sky/backends/cloud_vm_ray_backend.py Outdated Show resolved Hide resolved
@suquark suquark merged commit 3c77d0e into master Jul 3, 2023
@suquark suquark deleted the new_provisioner_stop_down branch July 3, 2023 06:24
@concretevitamin
Copy link
Member

Woohoo!! Very excited to see this @suquark!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants