skypilot-org · concretevitamin · Sep 7, 2023 · Aug 16, 2023 · Aug 18, 2023 · Aug 24, 2023
diff --git a/README.md b/README.md
@@ -1,4 +1,57 @@
 # new-cloud-skeleton
+
+*ℹ️ NOTE: This guide will change as SkyPilot develops. Please check back often to make sure you have the most up-to-date version. This README is derived from the [Google Doc](https://docs.google.com/document/d/1iuPyQ47HloKuHfOYjcRNz7HAlUxFHs2WjMqedYmcPVQ/edit#heading=h.nby65cfuzxoq).*
+
 Skeleton repo for what's needed to add a new cloud.
 
 Detailed instructions upcoming! Contact the dev team on [SkyPilot Slack](https://slack.skypilot.co/) to get access.
+
+## Introduction to SkyPilot
+
+[SkyPilot](https://github.com/skypilot-org/skypilot) is an intercloud broker -- a framework for running workloads on any cloud. Here are some useful links to learn more:
+
+1. [Introductory Blogpost](https://medium.com/@zongheng_yang/skypilot-ml-and-data-science-on-any-cloud-with-massive-cost-savings-244189cc7c0f) [Start here if you are new]
+2. [Documentation](https://skypilot.readthedocs.io/en/latest/)
+3. [The Sky Above the Clouds](https://arxiv.org/abs/2205.07147)
+4. [GitHub](https://github.com/skypilot-org/skypilot)
+
+## How does SkyPilot work?
+
+Here's a simplified overview of SkyPilot's architecture.
+
+TODO: diagram in google doc
+
+In this diagram, the user has two clouds enabled (AWS and GCP). This is what happens when a user launches a job with `sky launch`:
+
+1. The optimizer reads AWS Catalog and GCP Catalog and runs an algorithm to decide which cloud to run the job on. (Let's suppose the optimizer chooses AWS.) This information is then sent to the provisioner+executor.
+   - A catalog is a list of instance types and their prices.
+2. The provisioner+executor executes ray commands to launch a cluster on AWS.
+   - AWS Node Provider is the interface between ray and AWS, translating ray function calls to AWS API calls.
+3. Once the cluster is launched, the provisioner+executor connects to the cluster over SSH and runs the AWS setup commands, which download the packages the cluster needs.
+4. The provisioner+executor submits the job to the cluster and the cluster runs the job.
+
+When all is done, the user can run `sky down`, and the provisioner+executor will tear down the cluster by executing more Ray commands.
+
+## Getting Started
+
+Now let's say you have a new cloud, called FluffyCloud, that you want SkyPilot to support. What do you need to do?
+
+You need to:
+
+1. Write a NodeProvider for FluffyCloud. This is the most important part.
+2. Add the FluffyCloud catalog to SkyPilot and write functions that read this catalog.
+3. Write FluffyCloud setup code.
+4. Add FluffyCloud credential check to verify locally stored credentials. This is needed for a user to enable FluffyCloud.
+
+For reference, here is an actual merged PR for adding a new cloud to help you estimate what is required:
+
+- [Lambda Cloud](https://github.com/skypilot-org/skypilot/pull/1557)
+
+By completing the following steps, you will be able to run SkyPilot on FluffyCloud.
+
+- [Step 0](/docs/integration_steps/step_0-api-library.md)
+- [Step 1](/docs/integration_steps/step_1-node-provider.md)
+- [Step 2](/docs/integration_steps/step_2-catalog.md)
+- [Step 3](/docs/integration_steps/step_3-cloud-class.md)
+- [Step 4](/docs/integration_steps/step_4-setup-code.md)
+- [Step 5](/docs/integration_steps/step_5-e2e-failover.md)
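
The optimizer step described under "How does SkyPilot work?" in the README can be sketched as follows. This is a simplified illustration, not SkyPilot's actual optimizer: the `CatalogEntry` type, the `choose_cheapest` helper, the instance names, and the prices are all invented for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class CatalogEntry:
    """One row of a hypothetical cloud catalog."""
    instance_type: str
    accelerator_name: Optional[str]
    price: float  # Hourly price; the numbers below are invented.


def choose_cheapest(
    catalogs: Dict[str, List[CatalogEntry]],
    accelerator: Optional[str],
) -> Optional[Tuple[str, CatalogEntry]]:
    """Returns (cloud, entry) for the cheapest matching instance, or None."""
    candidates = [(cloud, entry)
                  for cloud, entries in catalogs.items()
                  for entry in entries
                  if entry.accelerator_name == accelerator]
    if not candidates:
        return None
    return min(candidates, key=lambda pair: pair[1].price)


# Two enabled clouds, each with its own (made-up) catalog.
catalogs = {
    'aws': [CatalogEntry('aws-gpu-small', 'V100', 2.5)],
    'gcp': [CatalogEntry('gcp-gpu-small', 'V100', 3.0),
            CatalogEntry('gcp-cpu-small', None, 0.1)],
}
best = choose_cheapest(catalogs, 'V100')
```

Here `best` holds the cloud name and catalog entry that would be handed to the provisioner+executor.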
diff --git a/docs/integration_steps/step_0-api-library.md b/docs/integration_steps/step_0-api-library.md
@@ -0,0 +1,14 @@
+# Cloud Python Library Coverage
+
+This document is primarily for the maintainers of Python libraries for their respective clouds. It lists the functions that SkyPilot needs. By integrating these functions into your Python library, you can drastically reduce the amount of work and complexity needed to add your cloud to SkyPilot.
+
+## Function List
+
+1. Launch an instance
+2. Remove an instance
+3. Set instance tags
+4. List instances
+
+## API Wrapper/Middleware
+
+You will likely need a wrapper that converts the output of your Python library into the format required by SkyPilot.
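
A minimal sketch of such a wrapper, assuming a hypothetical cloud library whose `list_instances` returns records with vendor-specific field names. The `FakeFluffyClient` class, its field names, and the normalized output shape are assumptions made for illustration; the format SkyPilot actually expects is determined by the node provider and cloud class templates.

```python
from typing import Any, Dict, List


class FakeFluffyClient:
    """Stand-in for a cloud's Python library (hypothetical)."""

    def __init__(self) -> None:
        self._instances: Dict[str, Dict[str, Any]] = {}
        self._next_id = 0

    def launch(self, name: str, instance_type: str, region: str) -> str:
        instance_id = f'fc-{self._next_id}'
        self._next_id += 1
        self._instances[instance_id] = {
            'InstanceName': name,
            'Type': instance_type,
            'Region': region,
            'State': 'RUNNING',
            'Tags': {},
        }
        return instance_id

    def remove(self, instance_id: str) -> None:
        self._instances.pop(instance_id, None)

    def set_tags(self, instance_id: str, tags: Dict[str, str]) -> None:
        self._instances[instance_id]['Tags'].update(tags)

    def list_instances(self) -> List[Dict[str, Any]]:
        return [{'Id': i, **rec} for i, rec in self._instances.items()]


def list_instances_normalized(
        client: FakeFluffyClient) -> Dict[str, Dict[str, Any]]:
    """Reshapes vendor records into {instance_id: {name, status, tags}}."""
    return {
        rec['Id']: {
            'name': rec['InstanceName'],
            'status': rec['State'],
            'tags': dict(rec['Tags']),
        } for rec in client.list_instances()
    }
```

Keeping the reshaping in one function means the rest of the integration only ever sees one record layout, even if the underlying library changes its field names.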
diff --git a/docs/integration_steps/step_1-node-provider.md b/docs/integration_steps/step_1-node-provider.md
@@ -0,0 +1,15 @@
+# Node Provider
+
+NodeProvider is the interface that ray uses to interact with cloud providers. First, you should read through the NodeProvider class definition [here](https://github.com/ray-project/ray/blob/master/python/ray/autoscaler/node_provider.py). The docstrings give a good idea of what the NodeProvider class is.
+
+## Implementing a Node Provider
+
+1. Create the directory `sky/skylet/providers/{cloud_name}`
+2. Add `__init__.py` to the directory and add the following code:
+
+    ```python
+    from sky.skylet.providers.{cloud_name}.node_provider import {CloudName}NodeProvider
+    ```
+
+3. Copy the `node_provider.py` template into the directory.
+4. Complete the template. The template has comments to guide you through the process.
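
To give a feel for what completing the template involves, here is a minimal in-memory sketch that mirrors a subset of the method names on Ray's `NodeProvider` class. It is not the template itself: a real implementation would subclass `ray.autoscaler.node_provider.NodeProvider` and call the cloud's API, whereas this sketch keeps nodes in a dictionary so that it runs without Ray or a cloud account. The `InMemoryNodeProvider` name and the tag keys in the test data are made up.

```python
import uuid
from typing import Any, Dict, List


class InMemoryNodeProvider:
    """Mirrors a subset of Ray's NodeProvider interface (illustrative)."""

    def __init__(self, provider_config: Dict[str, Any],
                 cluster_name: str) -> None:
        self.provider_config = provider_config
        self.cluster_name = cluster_name
        # node_id -> {'config': ..., 'tags': ...}; replaces real API calls.
        self._nodes: Dict[str, Dict[str, Any]] = {}

    def create_node(self, node_config: Dict[str, Any], tags: Dict[str, str],
                    count: int) -> None:
        for _ in range(count):
            node_id = uuid.uuid4().hex
            self._nodes[node_id] = {
                'config': dict(node_config),
                'tags': dict(tags),
            }

    def non_terminated_nodes(self, tag_filters: Dict[str, str]) -> List[str]:
        # A node matches when every filter key has the requested value.
        return [
            node_id for node_id, node in self._nodes.items()
            if all(node['tags'].get(k) == v for k, v in tag_filters.items())
        ]

    def is_running(self, node_id: str) -> bool:
        return node_id in self._nodes

    def is_terminated(self, node_id: str) -> bool:
        return node_id not in self._nodes

    def node_tags(self, node_id: str) -> Dict[str, str]:
        return dict(self._nodes[node_id]['tags'])

    def set_node_tags(self, node_id: str, tags: Dict[str, str]) -> None:
        self._nodes[node_id]['tags'].update(tags)

    def terminate_node(self, node_id: str) -> None:
        self._nodes.pop(node_id, None)
```

Each method here corresponds to one of the cloud library functions from step 0: launching, removing, tagging, and listing instances.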
diff --git a/docs/integration_steps/step_2-catalog.md b/docs/integration_steps/step_2-catalog.md
@@ -0,0 +1,21 @@
+# Cloud Catalog
+
+A catalog is a CSV file under [SkyPilot Catalogs](https://github.com/skypilot-org/skypilot-catalog). Each row has the following fields:
+
+| Field              | Type   | Description                                                                         |
+|--------------------|--------|-------------------------------------------------------------------------------------|
+| `InstanceType`     | string | The type of instance.                                                               |
+| `vCPUs`            | float  | The number of virtual CPUs.                                                         |
+| `MemoryGiB`        | float  | The amount of memory in GiB.                                                        |
+| `AcceleratorName`  | string | The name of accelerators (GPU/TPU).                                                 |
+| `AcceleratorCount` | float  | The number of accelerators (GPU/TPU).                                               |
+| `GPUInfo`          | string | The human readable information of the GPU (not used in code).                       |
+| `Region`           | string | The region of the resource.                                                         |
+| `AvailabilityZone` | string | The availability zone of the resource (can be empty if not supported in the cloud). |
+| `Price`            | float  | The price of the resource.                                                          |
+| `SpotPrice`        | float  | The spot price of the resource.                                                     |
+
+
+## Parsing Catalog
+
+Create a copy of `fluffycloud_catalog.py` and place it at `sky/clouds/service_catalog/{cloudname}_catalog.py`. Aside from renaming FluffyCloud to your cloud name, you do not need to make additional changes.
diff --git a/docs/integration_steps/step_3-cloud-class.md b/docs/integration_steps/step_3-cloud-class.md
@@ -0,0 +1,9 @@
+# Cloud Class
+
+This class calls some service catalog functions and contains code that checks FluffyCloud credentials. Many of the functions in this class are also straightforward to implement.
+
+Start by creating a copy of `fluffycloud/fluffycloud.py` and placing it at `sky/clouds/fluffycloud.py`.
+
+## Credentials file
+
+The credentials file contains the user's credentials required to access your cloud. You can specify the location of files to check for credentials by adding them to the `_CREDENTIAL_FILES` list.
diff --git a/docs/integration_steps/step_4-setup-code.md b/docs/integration_steps/step_4-setup-code.md
@@ -0,0 +1,16 @@
+# Setup Code
+
+This code is executed after a cluster is launched (via ssh). Most of this code is very similar to the existing setup code for other clouds, and may almost be a copy-paste.
+
+Create a copy of `fluffycloud-ray.yml.j2` and place it at `sky/templates/<cloudname>-ray.yml.j2`.
+
+## Ray Backend
+
+Open `sky/backends/cloud_vm_ray_backend.py` and edit the `_get_cluster_config_template` function to include the new cloud.
+
+Open `sky/backends/cloud_vm_ray_backend.py` and edit the `_add_auth_to_cluster_config` function to include the new cloud.
+
+
+### Authentication
+
+Cloud authentication is handled by `sky/authentication.py`. The `setup_<cloud>_authentication` functions will be called on every cluster provisioning request.
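
The edit to `_get_cluster_config_template` amounts to mapping the new cloud to its template file. The sketch below shows that idea in isolation; the placeholder classes, the `_CLOUD_TO_TEMPLATE` name, and the lookup logic are assumptions, and the real function in `cloud_vm_ray_backend.py` may be structured differently.

```python
from typing import Dict, Type


class Cloud:
    """Placeholder for sky.clouds.Cloud (illustrative only)."""


class AWS(Cloud):
    pass


class FluffyCloud(Cloud):
    pass


# Each cloud class points at its Jinja template under sky/templates/.
_CLOUD_TO_TEMPLATE: Dict[Type[Cloud], str] = {
    AWS: 'aws-ray.yml.j2',
    FluffyCloud: 'fluffycloud-ray.yml.j2',  # The new entry.
}


def get_cluster_config_template(cloud: Cloud) -> str:
    for cloud_cls, template in _CLOUD_TO_TEMPLATE.items():
        if isinstance(cloud, cloud_cls):
            return template
    raise ValueError(f'No cluster template registered for {cloud!r}')
```

A cloud without an entry fails loudly at lookup time, which makes a forgotten registration easy to spot.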
diff --git a/docs/integration_steps/step_5-e2e-failover.md b/docs/integration_steps/step_5-e2e-failover.md
@@ -0,0 +1,52 @@
+# Cloud E2E Failover
+
+Copy the following function into `sky/backends/cloud_vm_ray_backend.py`. Be sure to replace `<cloudname>` and `<CloudName>` with the name of your cloud.
+
+```python
+    def _update_blocklist_on_<cloudname>_error(
+            self, launchable_resources: 'resources_lib.Resources',
+            region: 'clouds.Region', zones: Optional[List['clouds.Zone']],
+            stdout: str, stderr: str):
+        del zones  # Unused.
+        style = colorama.Style
+        stdout_splits = stdout.split('\n')
+        stderr_splits = stderr.split('\n')
+        errors = [
+            s.strip()
+            for s in stdout_splits + stderr_splits
+            if '<CloudName>Error:' in s.strip()
+        ]
+        if not errors:
+            logger.info('====== stdout ======')
+            for s in stdout_splits:
+                print(s)
+            logger.info('====== stderr ======')
+            for s in stderr_splits:
+                print(s)
+            with ux_utils.print_exception_no_traceback():
+                raise RuntimeError('Errors occurred during provision; '
+                                   'check logs above.')
+
+        logger.warning(f'Got error(s) in {region.name}:')
+        messages = '\n\t'.join(errors)
+        logger.warning(f'{style.DIM}\t{messages}{style.RESET_ALL}')
+        # NOTE: you can check out other clouds' implementations of this function,
+        # which may intelligently block a whole zone / whole region depending on
+        # the errors thrown.
+        self._blocked_resources.add(launchable_resources.copy(zone=None))
+```
+
+Within the `_update_blocklist_on_error` function, add your cloud to the `handlers` dictionary:
+
+```python
+def _update_blocklist_on_error(
+    ...
+    handlers = {
+        ...
+        # TODO Add this
+        clouds.FluffyCloud: self._update_blocklist_on_fluffycloud_error,
+        ...
+    }
+    ...
+...
+```
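
The core of the handler above is the filtering of provisioning output for lines that carry the cloud's error marker, followed by blocking the failed resource. The standalone sketch below isolates those two steps so they can be run without SkyPilot. The `extract_cloud_errors` and `update_blocklist` helpers and the `(instance_type, region)` tuples are hypothetical simplifications; the real handler blocks a `Resources` object.

```python
from typing import List, Set, Tuple


def extract_cloud_errors(stdout: str, stderr: str, marker: str) -> List[str]:
    """Returns stripped lines from either stream that contain the marker."""
    lines = stdout.split('\n') + stderr.split('\n')
    return [line.strip() for line in lines if marker in line]


def update_blocklist(
    blocked: Set[Tuple[str, str]],
    instance_type: str,
    region: str,
    stdout: str,
    stderr: str,
    marker: str = 'FluffyCloudError:',
) -> List[str]:
    """Blocks (instance_type, region) if marked errors are present."""
    errors = extract_cloud_errors(stdout, stderr, marker)
    if not errors:
        # Mirrors the handler above: unknown failures are surfaced as-is.
        raise RuntimeError('Errors occurred during provision; '
                           'check logs above.')
    blocked.add((instance_type, region))
    return errors
```

Once a pair is in the blocklist, a failover loop can skip it and try the next candidate region or cloud.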
diff --git a/fluffycloud/fluffycloud.py b/fluffycloud/fluffycloud.py
@@ -3,12 +3,15 @@
 from typing import Dict, Iterator, List, Optional, Tuple
 
 from sky import clouds
+from sky import status_lib
 from sky.clouds import service_catalog
 
 if typing.TYPE_CHECKING:
     # Renaming to avoid shadowing variables.
     from sky import resources as resources_lib
 
+import fluffycloud_api as fc_api
+
 _CREDENTIAL_FILES = [
     # credential files for FluffyCloud,
 ]
@@ -21,11 +24,12 @@ class FluffyCloud(clouds.Cloud):
     _CLOUD_UNSUPPORTED_FEATURES = {
         clouds.CloudImplementationFeatures.STOP: 'FluffyCloud does not support stopping VMs.',
         clouds.CloudImplementationFeatures.AUTOSTOP: 'FluffyCloud does not support stopping VMs.',
-        clouds.CloudImplementationFeatures.MULTI_NODE: 'Multi-node is not supported by the FluffyCloud implementation yet.',
+        clouds.CloudImplementationFeatures.MULTI_NODE: 'Multi-node is not supported by the FluffyCloud implementation yet.'
+    }
     ########
     # TODO #
     ########
-    _MAX_CLUSTER_NAME_LEN_LIMIT = # TODO
+    _MAX_CLUSTER_NAME_LEN_LIMIT =  # TODO
 
     _regions: List[clouds.Region] = []
 
@@ -38,7 +42,6 @@ def _cloud_unsupported_features(
     def _max_cluster_name_length(cls) -> Optional[int]:
         return cls._MAX_CLUSTER_NAME_LEN_LIMIT
 
-
     @classmethod
     def regions(cls) -> List[clouds.Region]:
         if not cls._regions:
@@ -71,6 +74,14 @@ def regions_with_offering(cls, instance_type: Optional[str],
             regions = [r for r in regions if r.name == region]
         return regions
 
+    @classmethod
+    def get_vcpus_mem_from_instance_type(
+        cls,
+        instance_type: str,
+    ) -> Tuple[Optional[float], Optional[float]]:
+        # FILL_IN: cloudname
+        return service_catalog.get_vcpus_mem_from_instance_type(instance_type, clouds='<cloudname>')
+
     @classmethod
     def zones_provision_loop(
         cls,
@@ -108,12 +119,9 @@ def accelerators_to_hourly_cost(self,
                                     region: Optional[str] = None,
                                     zone: Optional[str] = None) -> float:
         del accelerators, use_spot, region, zone  # unused
-        ########
-        # TODO #
-        ########
-        # This function assumes accelerators are included as part of instance
-        # type. If not, you will need to change this. (However, you can do
-        # this later; `return 0.0` is a good placeholder.)
+        # FILL_IN: If accelerator costs are not included in instance_type cost,
+        # return the cost of the accelerators here. If accelerators are
+        # included in instance_type cost, return 0.0.
         return 0.0
 
     def get_egress_cost(self, num_gigabytes: float) -> float:
@@ -132,10 +140,8 @@ def is_same_cloud(self, other: clouds.Cloud) -> bool:
         return isinstance(other, FluffyCloud)
 
     @classmethod
-    def get_default_instance_type(cls,
-                                  cpus: Optional[str] = None) -> Optional[str]:
-        return service_catalog.get_default_instance_type(cpus=cpus,
-                                                         clouds='fluffycloud')
+    def get_default_instance_type(cls, cpus: Optional[str] = None) -> Optional[str]:
+        return service_catalog.get_default_instance_type(cpus=cpus, clouds='fluffycloud')
 
     @classmethod
     def get_accelerators_from_instance_type(
@@ -178,8 +184,8 @@ def make_deploy_resources_variables(
             'region': region.name,
         }
 
-    def get_feasible_launchable_resources(self,
-                                          resources: 'resources_lib.Resources'):
+    def _get_feasible_launchable_resources(self,
+                                           resources: 'resources_lib.Resources'):
         if resources.use_spot:
             return ([], [])
         if resources.instance_type is not None:
@@ -218,7 +224,7 @@ def _make(instance_list):
         assert len(accelerators) == 1, resources
         acc, acc_count = list(accelerators.items())[0]
         (instance_list, fuzzy_candidate_list
-        ) = service_catalog.get_instance_type_for_accelerator(
+         ) = service_catalog.get_instance_type_for_accelerator(
             acc,
             acc_count,
             use_spot=resources.use_spot,
@@ -267,3 +273,33 @@ def accelerator_in_region_or_zone(self,
                                       zone: Optional[str] = None) -> bool:
         return service_catalog.accelerator_in_region_or_zone(
             accelerator, acc_count, region, zone, 'fluffycloud')
+
+    @classmethod
+    def query_status(cls, name: str, tag_filters: Dict[str, str],
+                     region: Optional[str], zone: Optional[str],
+                     **kwargs) -> List[status_lib.ClusterStatus]:
+        del tag_filters, region, zone, kwargs  # Unused.
+
+        # FILL_IN: For the status map, map the FluffyCloud status to the SkyPilot status.
+        # SkyPilot status is defined in sky/status_lib.py
+        # Example: status_map = {'CREATING': status_lib.ClusterStatus.INIT, ...}
+        # The keys are the FluffyCloud status, and the values are the SkyPilot status.
+        status_map = {
+            'CREATING': status_lib.ClusterStatus.INIT,
+            'EDITING': status_lib.ClusterStatus.INIT,
+            'RUNNING': status_lib.ClusterStatus.UP,
+            'STARTING': status_lib.ClusterStatus.INIT,
+            'RESTARTING': status_lib.ClusterStatus.INIT,
+            'STOPPING': status_lib.ClusterStatus.STOPPED,
+            'STOPPED': status_lib.ClusterStatus.STOPPED,
+            'TERMINATING': None,
+            'TERMINATED': None,
+        }
+        status_list = []
+        vms = fc_api.list_instances()
+        for node in vms:
+            if node['name'] == name:
+                node_status = status_map[node['status']]
+                if node_status is not None:
+                    status_list.append(node_status)
+        return status_list
diff --git a/fluffycloud/fluffycloud_api.py b/fluffycloud/fluffycloud_api.py
@@ -1,27 +1,35 @@
-def launch(name:str,
-           instance_type:str,
-           region:str,
-           api_key:str,
-           ssh_key_name:str):
+from typing import Dict
+
+
+def launch(name: str,
+           instance_type: str,
+           region: str,
+           api_key: str,
+           ssh_key_name: str):
     """Launches an INSTANCE_TYPE instance in region REGION with given NAME.
-
+    The instance_type refers to the type found in the catalog.
+
+
     API_KEY is a secret registered with FluffyCloud. It is per-user.
-    
+
     SSH_KEY_NAME corresponds to a ssh key registered with FluffyCloud.
     After launching, the user can ssh into INSTANCE_TYPE with that ssh key.
-    
+
     Returns INSTANCE_ID if successful, otherwise returns None.
     """
 
-def remove(instance_id:str, api_key:str):
+
+def remove(instance_id: str, api_key: str):
     """Removes instance with given INSTANCE_ID."""
-
-def set_tags(instance_id:str, tags:Dict, api_key:str)
+
+
+def set_tags(instance_id: str, tags: Dict, api_key: str):
     """Set tags for instance with given INSTANCE_ID."""
-
-def list_instances(api_key:str):
+
+
+def list_instances(api_key: str):
     """Lists instances associated with API_KEY.
-    
+
     Returns a dictionary:
     {
         instance_id_1: