Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor failover when resources have some unsupported features on certain cloud #2245

Merged
merged 2 commits into from
Jul 17, 2023

Conversation

cblmemo
Copy link
Collaborator

@cblmemo cblmemo commented Jul 15, 2023

The failover function in the optimizer helps us get rid of invalid resources for a certain cloud. Say if we specify --disk-tier high when sky launch, we wanted the optimizer only pick clouds that support the customized disk_tier feature.

Previously, we have to manually modify _get_feasible_launchable_resources function in all cloud to handle the special case: return an empty list to the optimizer when one unsupported feature is used, so that the optimizer will skip this cloud since no valid resources are found. This PR refactors the implementation of this function, by describing unsupported features in resources with clouds.CloudImplementationFeatures, and adding a check before we step in cloud-specific _get_feasible_launchable_resources functions.

After this PR is merged, #1910 and #2210 could have similar implementations when docker/ports are specified.

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (NOTICE: I only have a subscription for AWS & GCP & Azure & Lambda)
    • sky launch --disk-tier high and only AWS & GCP are listed
    • sky launch --disk-tier medium and only AWS & GCP & Azure are listed
    • sky launch --disk-tier low and only AWS & GCP & Azure are listed
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: bash tests/backward_comaptibility_tests.sh

cblmemo added a commit to cblmemo/skypilot that referenced this pull request Jul 16, 2023
@Michaelvll Michaelvll self-requested a review July 16, 2023 23:29
Copy link
Collaborator

@Michaelvll Michaelvll left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is awesome @cblmemo! We have been thinking of having this kind of refactoring for a while. LGTM.

@cblmemo cblmemo merged commit 3a8ae21 into skypilot-org:master Jul 17, 2023
@cblmemo cblmemo deleted the refactor-feasible-resources branch July 17, 2023 00:17
@Michaelvll Michaelvll mentioned this pull request Jul 17, 2023
2 tasks
cblmemo added a commit that referenced this pull request Aug 1, 2023
* finished AWS & GCP, test passed on GCP

* GCP now create rules with cluster-specified tags

* Creating new SGs on AWS and configuring user specified rules

* new SG name

* move to resources & allow tcp only

* disable request of newer ports in task.yaml

* lint

* delete firewall rules & aws SGs after teardown

* add smoke test & cluster name length limit

* format

* add doc

* add error when not using aws & gcp

* add failover

* remove redundant length checking

* temporary remove failover and wait for #2245

* move to new provisioner api

* add unsupported feature

* add ports to resources.get_required_cloud_features

* use CloudImplementationFeatures to validate port

* remove assert for interrupted launching process

* renaming provision package

* AWS: only create new SG when port is specified

* nit: variable name

* remove redundant error message since check is done in resources validation

* support port range

* refactor GCP implementation

* remove unused function

* rename gcp rule name to avoid dependency and name too long

* remove redundant tags

* remove redundant silent argument

* change cluster name hash to truncate + hash

* format

* remove create firewall wait function

* rename hash_cluster_name to truncate_and_hash_cluster_name

* fix error when launching TPU node

* disable TPU with ports for now

* format

* enable tpu nodes

* enable tpu vm

* raise error instead of assert

* Update sky/skylet/providers/gcp/config.py

Co-authored-by: Zhanghao Wu <[email protected]>

---------

Co-authored-by: Zhanghao Wu <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants