wip: looking for feedback #1

Merged
merged 20 commits into from
Sep 7, 2023
53 changes: 53 additions & 0 deletions README.md
Original file line number Diff line number Diff line change
@@ -1,4 +1,57 @@
# new-cloud-skeleton

*ℹ️ NOTE: This guide will change as SkyPilot develops. Please check back often to make sure you have the most up-to-date version. This README is derived from the [Google Doc](https://docs.google.com/document/d/1iuPyQ47HloKuHfOYjcRNz7HAlUxFHs2WjMqedYmcPVQ/edit#heading=h.nby65cfuzxoq).*

Skeleton repo for what's needed to add a new cloud.

Detailed instructions upcoming! Contact the dev team on [SkyPilot Slack](https://slack.skypilot.co/) to get access.

## Introduction to SkyPilot

[SkyPilot](https://github.com/skypilot-org/skypilot) is an intercloud broker -- a framework for running workloads on any cloud. Here are some useful links to learn more:

1. [Introductory Blogpost](https://medium.com/@zongheng_yang/skypilot-ml-and-data-science-on-any-cloud-with-massive-cost-savings-244189cc7c0f) [Start here if you are new]
2. [Documentation](https://skypilot.readthedocs.io/en/latest/)
3. [The Sky Above the Clouds](https://arxiv.org/abs/2205.07147)
4. [GitHub](https://github.com/skypilot-org/skypilot)

## How does SkyPilot work?

Here's a simplified overview of SkyPilot's architecture.

TODO: diagram in google doc

In this diagram, the user has two clouds enabled (AWS and GCP). This is what happens when a user launches a job with `sky launch`:

1. The optimizer reads AWS Catalog and GCP Catalog and runs an algorithm to decide which cloud to run the job on. (Let's suppose the optimizer chooses AWS.) This decision is then sent to the provisioner+executor.
    - A catalog is a list of instance types and their prices.
2. The provisioner+executor executes Ray commands to launch a cluster on AWS.
    - AWS Node Provider is the interface between Ray and AWS, translating Ray function calls into AWS API calls.
3. Once the cluster is launched, the provisioner+executor connects to the cluster over SSH and executes the AWS Setup commands, which download important packages onto the cluster.
4. The provisioner+executor submits the job to the cluster and the cluster runs the job.

When all is done, the user can run `sky down` and the provisioner+executor will tear down the cluster by executing more Ray commands.
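The optimizer step above can be pictured with a small sketch. This is not SkyPilot's actual optimizer: the `CatalogEntry` shape, the `choose_cloud` function, the instance names, and the prices are all made up for illustration, and the only rule applied is "cheapest entry that has enough vCPUs".

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class CatalogEntry:
    """One catalog row (hypothetical shape): an instance type and its price."""
    instance_type: str
    vcpus: float
    price: float


def choose_cloud(catalogs: Dict[str, List[CatalogEntry]],
                 min_vcpus: float) -> Optional[Tuple[str, CatalogEntry]]:
    """Returns the cheapest (cloud, entry) pair that satisfies min_vcpus."""
    candidates = [(cloud, entry)
                  for cloud, entries in catalogs.items()
                  for entry in entries
                  if entry.vcpus >= min_vcpus]
    if not candidates:
        return None  # No enabled cloud can run the job.
    return min(candidates, key=lambda pair: pair[1].price)


# Made-up catalogs for the two enabled clouds.
catalogs = {
    'aws': [CatalogEntry('aws-small', 4, 0.20),
            CatalogEntry('aws-large', 16, 0.70)],
    'gcp': [CatalogEntry('gcp-small', 4, 0.25),
            CatalogEntry('gcp-large', 16, 0.80)],
}
choice = choose_cloud(catalogs, min_vcpus=8)
```

With these made-up numbers the sketch picks the AWS entry, matching the scenario in the walkthrough; the real optimizer considers more than price and vCPU count.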

## Getting Started

Now let's say you have a new cloud, called FluffyCloud, that you want SkyPilot to support. What do you need to do?

You need to:

1. Write a NodeProvider for FluffyCloud. This is the most important part.
2. Add the FluffyCloud catalog to SkyPilot and write functions that read this catalog.
3. Write FluffyCloud setup code.
4. Add FluffyCloud credential check to verify locally stored credentials. This is needed for a user to enable FluffyCloud.

For reference, here is an actual merged PR for adding a new cloud to help you estimate what is required:

- [Lambda Cloud](https://github.com/skypilot-org/skypilot/pull/1557)

By completing the following steps, you will be able to run SkyPilot on FluffyCloud.

- [Step 0](/docs/integration_steps/step_0-api-library.md)
- [Step 1](/docs/integration_steps/step_1-node-provider.md)
- [Step 2](/docs/integration_steps/step_2-catalog.md)
- [Step 3](/docs/integration_steps/step_3-cloud-class.md)
- [Step 4](/docs/integration_steps/step_4-setup-code.md)
- [Step 5](/docs/integration_steps/step_5-e2e-failover.md)
14 changes: 14 additions & 0 deletions docs/integration_steps/step_0-api-library.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,14 @@
# Cloud Python Library Coverage

This document is primarily for the maintainers of the Python libraries for their respective clouds. It lists the functions that SkyPilot needs. By integrating these functions into your Python library, you can drastically reduce the work and complexity needed to add your cloud to SkyPilot.

## Function List

1. Launch an instance
2. Remove an instance
3. Set instance tags
4. List instances

## API Wrapper/Middleware

You will likely need a wrapper that converts the output of your Python library into the format required by SkyPilot.
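As a rough illustration of such a wrapper, the sketch below flattens a response keyed by instance ID into a list of uniform records. The response shape, the `normalize_instances` name, and the `id`/`name`/`status` fields are assumptions made for this example; adapt them to whatever your library actually returns.

```python
from typing import Any, Dict, List


def normalize_instances(raw: Dict[str, Dict[str, Any]]) -> List[Dict[str, str]]:
    """Flattens {instance_id: {...}} into a list of uniform records."""
    records = []
    for instance_id, info in raw.items():
        records.append({
            'id': instance_id,
            'name': str(info.get('name', '')),
            # Upper-case the status so later lookups use one spelling.
            'status': str(info.get('status', 'UNKNOWN')).upper(),
        })
    return records


# Hypothetical raw response from the cloud's Python library.
raw_response = {
    'i-001': {'name': 'my-cluster', 'status': 'running'},
    'i-002': {'name': 'other-cluster', 'status': 'stopped'},
}
instances = normalize_instances(raw_response)
```

Keeping this translation in one place means the rest of the integration only ever sees one record shape.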
15 changes: 15 additions & 0 deletions docs/integration_steps/step_1-node-provider.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,15 @@
# Node Provider

NodeProvider is the interface that Ray uses to interact with cloud providers. First, read through the NodeProvider class definition [here](https://github.com/ray-project/ray/blob/master/python/ray/autoscaler/node_provider.py). The docstrings give a good idea of what the NodeProvider class does.

## Implementing a Node Provider

1. Create the directory `sky/skylet/providers/{cloud_name}`
2. Add `__init__.py` to the directory and add the following code:

```python
from sky.skylet.providers.{cloud_name}.node_provider import {CloudName}NodeProvider
```

3. Copy the `node_provider.py` template into the directory.
4. Complete the template. The template has comments to guide you through the process.
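To give a feel for what the completed template has to provide, here is a minimal in-memory stand-in. It is only a sketch: it does not import Ray or call any cloud API, and the class name `InMemoryNodeProvider` is hypothetical. The method names (`create_node`, `non_terminated_nodes`, `node_tags`, `set_node_tags`, `terminate_node`) follow Ray's `NodeProvider` interface; a real implementation would subclass that class and replace the dictionary with calls to your cloud's library.

```python
from typing import Any, Dict, List


class InMemoryNodeProvider:
    """Stand-in that mirrors a few NodeProvider methods without a real cloud."""

    def __init__(self, provider_config: Dict[str, Any], cluster_name: str):
        self.provider_config = provider_config
        self.cluster_name = cluster_name
        self._nodes: Dict[str, Dict[str, str]] = {}  # node_id -> tags
        self._next_id = 0

    def create_node(self, node_config: Dict[str, Any], tags: Dict[str, str],
                    count: int) -> None:
        # A real provider would launch `count` instances here.
        for _ in range(count):
            node_id = f'node-{self._next_id}'
            self._next_id += 1
            self._nodes[node_id] = dict(tags)

    def non_terminated_nodes(self, tag_filters: Dict[str, str]) -> List[str]:
        # A real provider would list instances and filter by tags.
        return [
            node_id for node_id, tags in self._nodes.items()
            if all(tags.get(k) == v for k, v in tag_filters.items())
        ]

    def node_tags(self, node_id: str) -> Dict[str, str]:
        return dict(self._nodes[node_id])

    def set_node_tags(self, node_id: str, tags: Dict[str, str]) -> None:
        self._nodes[node_id].update(tags)

    def terminate_node(self, node_id: str) -> None:
        # A real provider would remove the instance here.
        self._nodes.pop(node_id, None)


provider = InMemoryNodeProvider({}, 'demo-cluster')
provider.create_node({}, {'ray-node-type': 'head'}, count=1)
provider.create_node({}, {'ray-node-type': 'worker'}, count=2)
heads = provider.non_terminated_nodes({'ray-node-type': 'head'})
```

The four operations used here (launch, remove, set tags, list) are the same ones a cloud's Python library needs to expose.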
21 changes: 21 additions & 0 deletions docs/integration_steps/step_2-catalog.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,21 @@
# Cloud Catalog

A catalog is a CSV file stored in the [SkyPilot Catalogs](https://github.com/skypilot-org/skypilot-catalog) repository. Each row has the following fields:

| Field | Type | Description |
|--------------------|--------|-------------------------------------------------------------------------------------|
| `InstanceType` | string | The type of instance. |
| `vCPUs` | float | The number of virtual CPUs. |
| `MemoryGiB` | float | The amount of memory in GiB. |
| `AcceleratorName` | string | The name of accelerators (GPU/TPU). |
| `AcceleratorCount` | float | The number of accelerators (GPU/TPU). |
| `GPUInfo` | string | The human readable information of the GPU (not used in code). |
| `Region` | string | The region of the resource. |
| `AvailabilityZone` | string | The availability zone of the resource (can be empty if not supported in the cloud). |
| `Price` | float | The price of the resource. |
| `SpotPrice` | float | The spot price of the resource. |


## Parsing Catalog

Create a copy of `fluffycloud_catalog.py` and place it at `sky/clouds/service_catalog/{cloudname}_catalog.py`. Aside from renaming FluffyCloud to your cloud name, you do not need to make additional changes.
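To make the catalog format concrete, the sketch below parses a tiny CSV with the fields from the table, using only the standard library. The rows, instance names, region name, and prices are invented, and `load_catalog` is a hypothetical helper, not the code in `fluffycloud_catalog.py`.

```python
import csv
import io
from typing import Dict, List, Optional

# Made-up catalog rows; empty cells mean "not applicable".
SAMPLE_CATALOG = """\
InstanceType,vCPUs,MemoryGiB,AcceleratorName,AcceleratorCount,GPUInfo,Region,AvailabilityZone,Price,SpotPrice
fluffy-cpu-4,4,16,,,,fluffy-east,,0.2,
fluffy-gpu-1,8,32,A100,1,A100 40GB,fluffy-east,,1.5,
"""

_FLOAT_FIELDS = ('vCPUs', 'MemoryGiB', 'AcceleratorCount', 'Price', 'SpotPrice')


def _to_float(value: str) -> Optional[float]:
    """Converts a CSV cell to float, treating an empty cell as missing."""
    return float(value) if value else None


def load_catalog(text: str) -> List[Dict[str, object]]:
    """Parses catalog CSV text into rows with typed numeric fields."""
    rows = []
    for raw_row in csv.DictReader(io.StringIO(text)):
        row: Dict[str, object] = dict(raw_row)
        for field in _FLOAT_FIELDS:
            row[field] = _to_float(raw_row[field])
        rows.append(row)
    return rows


catalog = load_catalog(SAMPLE_CATALOG)
gpu_rows = [row for row in catalog if row['AcceleratorName'] == 'A100']
```

Note that `AvailabilityZone` and `SpotPrice` are left empty here, as a cloud without zones or spot instances might do.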
9 changes: 9 additions & 0 deletions docs/integration_steps/step_3-cloud-class.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,9 @@
# Cloud Class

This class calls some service catalog functions and contains the code that checks FluffyCloud credentials. Most of the functions in this class are straightforward to implement.

Start by creating a copy of `fluffycloud/fluffycloud.py` and placing it at `sky/clouds/fluffycloud.py`.

## Credentials file

The credentials file contains the user's credentials required to access your cloud. You can specify the location of files to check for credentials by adding them to the `_CREDENTIAL_FILES` list.
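A credential check over such a list could look like the sketch below. The directory `~/.fluffycloud`, the file name `api_key`, and the `(ok, reason)` return shape are assumptions for illustration; check the template and the existing clouds for the exact signature SkyPilot expects.

```python
import os
from typing import List, Optional, Tuple

_CREDENTIAL_DIR = '~/.fluffycloud'  # Hypothetical location.
_CREDENTIAL_FILES = ['api_key']  # Hypothetical file name.


def check_credentials(
        credential_dir: str = _CREDENTIAL_DIR,
        credential_files: Optional[List[str]] = None
) -> Tuple[bool, Optional[str]]:
    """Returns (True, None) if every credential file exists locally.

    Otherwise returns (False, reason) naming the missing files.
    """
    files = _CREDENTIAL_FILES if credential_files is None else credential_files
    base = os.path.expanduser(credential_dir)
    missing = [
        name for name in files
        if not os.path.isfile(os.path.join(base, name))
    ]
    if missing:
        return False, (f'Missing credential file(s) in {credential_dir}: '
                       f'{", ".join(missing)}')
    return True, None
```

This only verifies that the files exist; a fuller check might also make a cheap authenticated API call to confirm the credentials are valid.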
16 changes: 16 additions & 0 deletions docs/integration_steps/step_4-setup-code.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,16 @@
# Setup Code

This code is executed over SSH after a cluster is launched. Most of it is very similar to the existing setup code for other clouds, and can largely be copied from them.

Create a copy of `fluffycloud-ray.yml.j2` and place it at `sky/templates/<cloudname>-ray.yml.j2`.

## Ray Backend

Open `sky/backends/cloud_vm_ray_backend.py` and edit both the `_get_cluster_config_template` function and the `_add_auth_to_cluster_config` function to include the new cloud.


### Authentication

Cloud authentication is handled by `sky/authentication.py`. The `setup_<cloud>_authentication` functions will be called on every cluster provisioning request.
52 changes: 52 additions & 0 deletions docs/integration_steps/step_5-e2e-failover.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,52 @@
# Cloud E2E Failover

Copy the following function into `sky/backends/cloud_vm_ray_backend.py`, and be sure to replace `<cloudname>` and `<CloudName>` with the name of your cloud.

```python
def _update_blocklist_on_<cloudname>_error(
        self, launchable_resources: 'resources_lib.Resources',
        region: 'clouds.Region', zones: Optional[List['clouds.Zone']],
        stdout: str, stderr: str):
    del zones  # Unused.
    style = colorama.Style
    stdout_splits = stdout.split('\n')
    stderr_splits = stderr.split('\n')
    errors = [
        s.strip()
        for s in stdout_splits + stderr_splits
        if '<CloudName>Error:' in s.strip()
    ]
    if not errors:
        logger.info('====== stdout ======')
        for s in stdout_splits:
            print(s)
        logger.info('====== stderr ======')
        for s in stderr_splits:
            print(s)
        with ux_utils.print_exception_no_traceback():
            raise RuntimeError('Errors occurred during provision; '
                               'check logs above.')

    logger.warning(f'Got error(s) in {region.name}:')
    messages = '\n\t'.join(errors)
    logger.warning(f'{style.DIM}\t{messages}{style.RESET_ALL}')
    # NOTE: you can check out other clouds' implementations of this function,
    # which may intelligently block a whole zone / whole region depending on
    # the errors thrown.
    self._blocked_resources.add(launchable_resources.copy(zone=None))
```
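The core of the function above is the filter that pulls cloud-specific error lines out of the provisioning output. The standalone sketch below isolates that step so it can be tried without SkyPilot; the helper name `extract_cloud_errors`, the marker `FluffyCloudError:`, and the sample log text are made up for illustration.

```python
from typing import List


def extract_cloud_errors(stdout: str,
                         stderr: str,
                         marker: str = 'FluffyCloudError:') -> List[str]:
    """Returns stripped output lines that contain the cloud's error marker."""
    lines = stdout.split('\n') + stderr.split('\n')
    return [line.strip() for line in lines if marker in line.strip()]


# Invented provisioning output.
sample_stdout = 'Launching instance...\n'
sample_stderr = ('Traceback (most recent call last):\n'
                 '  FluffyCloudError: Not enough capacity in region\n')
errors = extract_cloud_errors(sample_stdout, sample_stderr)
no_errors = extract_cloud_errors('all good\n', '')
```

When the filter finds nothing, as in `no_errors`, the function above treats the failure as unrecognized, prints the raw output, and raises instead of blocking any resources.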

Within the `_update_blocklist_on_error` function, add your cloud to the `handlers` dictionary:

```python
def _update_blocklist_on_error(
    ...
    handlers = {
        ...
        # TODO Add this
        clouds.FluffyCloud: self._update_blocklist_on_fluffycloud_error,
        ...
    }
    ...
    ...
```
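The handlers dictionary is a dispatch table keyed by cloud class. The sketch below shows the pattern in isolation with stand-in classes; `Cloud`, `AWS`, `FluffyCloud`, and `Backend` here are toy definitions for this example, and the lookup by `type(cloud)` is one plausible way to select a handler, not necessarily how SkyPilot does it.

```python
from typing import Callable, Dict, List, Type


class Cloud:
    """Toy base class standing in for a cloud."""


class AWS(Cloud):
    pass


class FluffyCloud(Cloud):
    pass


class Backend:
    """Toy backend that records which handler ran."""

    def __init__(self) -> None:
        self.handled: List[str] = []

    def _on_aws_error(self, stderr: str) -> None:
        self.handled.append('aws')

    def _on_fluffycloud_error(self, stderr: str) -> None:
        self.handled.append('fluffycloud')

    def update_blocklist_on_error(self, cloud: Cloud, stderr: str) -> None:
        # One entry per supported cloud; a new cloud adds one line here.
        handlers: Dict[Type[Cloud], Callable[[str], None]] = {
            AWS: self._on_aws_error,
            FluffyCloud: self._on_fluffycloud_error,
        }
        handler = handlers.get(type(cloud))
        if handler is None:
            raise NotImplementedError(
                f'No failover handler for {type(cloud).__name__}')
        handler(stderr)


backend = Backend()
backend.update_blocklist_on_error(FluffyCloud(), 'FluffyCloudError: ...')
```

If the new cloud is missing from the table, failover for that cloud has no handler to call, which is why this one-line registration matters.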
68 changes: 52 additions & 16 deletions fluffycloud/fluffycloud.py
Original file line number Diff line number Diff line change
Expand Up @@ -3,12 +3,15 @@
from typing import Dict, Iterator, List, Optional, Tuple

from sky import clouds
from sky import status_lib
from sky.clouds import service_catalog

if typing.TYPE_CHECKING:
# Renaming to avoid shadowing variables.
from sky import resources as resources_lib

import fluffycloud_api as fc_api

_CREDENTIAL_FILES = [
# credential files for FluffyCloud,
]
Expand All @@ -21,11 +24,12 @@ class FluffyCloud(clouds.Cloud):
_CLOUD_UNSUPPORTED_FEATURES = {
clouds.CloudImplementationFeatures.STOP: 'FluffyCloud does not support stopping VMs.',
clouds.CloudImplementationFeatures.AUTOSTOP: 'FluffyCloud does not support stopping VMs.',
clouds.CloudImplementationFeatures.MULTI_NODE: 'Multi-node is not supported by the FluffyCloud implementation yet.',
clouds.CloudImplementationFeatures.MULTI_NODE: 'Multi-node is not supported by the FluffyCloud implementation yet.'
}
########
# TODO #
########
_MAX_CLUSTER_NAME_LEN_LIMIT = # TODO
_MAX_CLUSTER_NAME_LEN_LIMIT = # TODO

_regions: List[clouds.Region] = []

Expand All @@ -38,7 +42,6 @@ def _cloud_unsupported_features(
def _max_cluster_name_length(cls) -> Optional[int]:
return cls._MAX_CLUSTER_NAME_LEN_LIMIT


@classmethod
def regions(cls) -> List[clouds.Region]:
if not cls._regions:
Expand Down Expand Up @@ -71,6 +74,14 @@ def regions_with_offering(cls, instance_type: Optional[str],
regions = [r for r in regions if r.name == region]
return regions

@classmethod
def get_vcpus_mem_from_instance_type(
cls,
instance_type: str,
) -> Tuple[Optional[float], Optional[float]]:
# FILL_IN: cloudname
return service_catalog.get_vcpus_mem_from_instance_type(instance_type, clouds='<cloudname>')

@classmethod
def zones_provision_loop(
cls,
Expand Down Expand Up @@ -108,12 +119,9 @@ def accelerators_to_hourly_cost(self,
region: Optional[str] = None,
zone: Optional[str] = None) -> float:
del accelerators, use_spot, region, zone # unused
########
# TODO #
########
# This function assumes accelerators are included as part of instance
# type. If not, you will need to change this. (However, you can do
# this later; `return 0.0` is a good placeholder.)
# FILL_IN: If accelerator costs are not included in instance_type cost,
# return the cost of the accelerators here. If accelerators are
# included in instance_type cost, return 0.0.
return 0.0

def get_egress_cost(self, num_gigabytes: float) -> float:
Expand All @@ -132,10 +140,8 @@ def is_same_cloud(self, other: clouds.Cloud) -> bool:
return isinstance(other, FluffyCloud)

@classmethod
def get_default_instance_type(cls,
cpus: Optional[str] = None) -> Optional[str]:
return service_catalog.get_default_instance_type(cpus=cpus,
clouds='fluffycloud')
def get_default_instance_type(cls, cpus: Optional[str] = None) -> Optional[str]:
return service_catalog.get_default_instance_type(cpus=cpus, clouds='fluffycloud')

@classmethod
def get_accelerators_from_instance_type(
Expand Down Expand Up @@ -178,8 +184,8 @@ def make_deploy_resources_variables(
'region': region.name,
}

def get_feasible_launchable_resources(self,
resources: 'resources_lib.Resources'):
def _get_feasible_launchable_resources(self,
resources: 'resources_lib.Resources'):
if resources.use_spot:
return ([], [])
if resources.instance_type is not None:
Expand Down Expand Up @@ -218,7 +224,7 @@ def _make(instance_list):
assert len(accelerators) == 1, resources
acc, acc_count = list(accelerators.items())[0]
(instance_list, fuzzy_candidate_list
) = service_catalog.get_instance_type_for_accelerator(
) = service_catalog.get_instance_type_for_accelerator(
acc,
acc_count,
use_spot=resources.use_spot,
Expand Down Expand Up @@ -267,3 +273,33 @@ def accelerator_in_region_or_zone(self,
zone: Optional[str] = None) -> bool:
return service_catalog.accelerator_in_region_or_zone(
accelerator, acc_count, region, zone, 'fluffycloud')

@classmethod
def query_status(cls, name: str, tag_filters: Dict[str, str],
region: Optional[str], zone: Optional[str],
**kwargs) -> List[status_lib.ClusterStatus]:
del tag_filters, region, zone, kwargs # Unused.

# FILL_IN: For the status map, map the FluffyCloud status to the SkyPilot status.
# SkyPilot status is defined in sky/status_lib.py
# Example: status_map = {'CREATING': status_lib.ClusterStatus.INIT, ...}
# The keys are the FluffyCloud status, and the values are the SkyPilot status.
status_map = {
'CREATING': status_lib.ClusterStatus.INIT,
'EDITING': status_lib.ClusterStatus.INIT,
'RUNNING': status_lib.ClusterStatus.UP,
'STARTING': status_lib.ClusterStatus.INIT,
'RESTARTING': status_lib.ClusterStatus.INIT,
'STOPPING': status_lib.ClusterStatus.STOPPED,
'STOPPED': status_lib.ClusterStatus.STOPPED,
'TERMINATING': None,
'TERMINATED': None,
}
status_list = []
vms = fc_api.list_instances()
for node in vms:
if node['name'] == name:
node_status = status_map[node['status']]
if node_status is not None:
status_list.append(node_status)
return status_list
36 changes: 22 additions & 14 deletions fluffycloud/fluffycloud_api.py
Original file line number Diff line number Diff line change
@@ -1,27 +1,35 @@
def launch(name:str,
instance_type:str,
region:str,
api_key:str,
ssh_key_name:str):
from typing import Dict


def launch(name: str,
instance_type: str,
region: str,
api_key: str,
ssh_key_name: str):
"""Launches an INSTANCE_TYPE instance in region REGION with given NAME.

The instance_type refers to the type found in the catalog.


API_KEY is a secret registered with FluffyCloud. It is per-user.

SSH_KEY_NAME corresponds to a ssh key registered with FluffyCloud.
After launching, the user can ssh into INSTANCE_TYPE with that ssh key.

Returns INSTANCE_ID if successful, otherwise returns None.
"""

def remove(instance_id:str, api_key:str):

def remove(instance_id: str, api_key: str):
"""Removes instance with given INSTANCE_ID."""

def set_tags(instance_id:str, tags:Dict, api_key:str)


def set_tags(instance_id: str, tags: Dict, api_key: str):
"""Set tags for instance with given INSTANCE_ID."""

def list_instances(api_key:str):


def list_instances(api_key: str):
"""Lists instances associated with API_KEY.

Returns a dictionary:
{
instance_id_1:
Expand Down