[GCP] Support private IPs for GCP #2819

Merged: 28 commits, Dec 1, 2023
Changes from 12 commits
49 changes: 49 additions & 0 deletions docs/source/cloud-setup/cloud-permissions/gcp.rst
@@ -267,3 +267,52 @@ See details in :ref:`config-yaml`. Example use cases include using a private VP
VPC with fine-grained constraints, typically created via Terraform or manually.

The custom VPC should contain the :ref:`required firewall rules <gcp-minimum-firewall-rules>`.


.. _gcp-use-internal-ips:


Using Internal IPs
-----------------------
For security reasons, you may want SkyPilot instances to use internal IPs only.
To do so, specify the ``gcp.use_internal_ips`` and ``gcp.ssh_proxy_command`` fields in SkyPilot's global config file ``~/.sky/config.yaml`` (for the detailed syntax, see :ref:`config-yaml`):

.. code-block:: yaml

gcp:
use_internal_ips: true
# VPC with NAT setup, see below
vpc_name: my-vpc-name
ssh_proxy_command: ssh -W %h:%p -o StrictHostKeyChecking=no [email protected]

The ``gcp.ssh_proxy_command`` field is optional. It can be omitted if SkyPilot runs on a machine that can directly reach the instances' internal IPs. Otherwise, it should be set to a command that proxies SSH connections to the instances' internal IPs.
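
Roughly speaking, the proxy command is applied as an OpenSSH ``ProxyCommand`` option when connecting to an instance's internal IP. The sketch below is for illustration only (it is not SkyPilot's actual implementation); the internal IP, SSH user, and key path are placeholders:

.. code-block:: python

   import os
   import subprocess

   # Illustrative only: reach an instance's internal IP through the configured
   # jump host. The IP, user, and key path below are placeholders.
   proxy_command = 'ssh -W %h:%p -o StrictHostKeyChecking=no [email protected]'
   subprocess.run(
       [
           'ssh',
           '-i', os.path.expanduser('~/.ssh/sky-key'),
           '-o', f'ProxyCommand={proxy_command}',
           '[email protected]',
           'echo ok',
       ],
       check=True,
   )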


Cloud NAT Setup
~~~~~~~~~~~~~~~~

GCP instances created with internal IPs only cannot access the public internet by default. To make sure SkyPilot can install its dependencies on the instances,
Cloud NAT needs to be set up for the VPC (see `GCP's documentation <https://cloud.google.com/nat/docs/overview>`__ for details).


Cloud NAT is a regional resource, so it needs to be created in every region in which SkyPilot will be used. To limit SkyPilot to specific regions only, you can set ``gcp.ssh_proxy_command`` to a dict mapping each region to the SSH proxy command for that region (see :ref:`config-yaml` for details):

.. code-block:: yaml

gcp:
use_internal_ips: true
vpc_name: my-vpc-name
ssh_proxy_command:
us-west1: ssh -W %h:%p -o StrictHostKeyChecking=no [email protected]
us-east1: ssh -W %h:%p -o StrictHostKeyChecking=no [email protected]

If a proxy is not needed but the regions should still be limited, you can set ``gcp.ssh_proxy_command`` to a dict mapping each region to ``null``:

.. code-block:: yaml

gcp:
use_internal_ips: true
vpc_name: my-vpc-name
ssh_proxy_command:
us-west1: null
us-east1: null
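
As a rough sketch (not SkyPilot's actual resolution code; the helper name below is hypothetical), the per-region mapping can be thought of as both selecting the proxy command and restricting the allowed regions:

.. code-block:: python

   from typing import Dict, Optional, Union

   # A plain string applies to every region; a dict restricts SkyPilot to its keys.
   ProxyCommandConfig = Union[None, str, Dict[str, Optional[str]]]

   def resolve_proxy_command(config: ProxyCommandConfig,
                             region: str) -> Optional[str]:
       """Hypothetical helper: pick the proxy command for a region."""
       if config is None or isinstance(config, str):
           return config
       if region not in config:
           raise ValueError(f'Region {region!r} is not enabled by '
                            'ssh_proxy_command.')
       return config[region]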
25 changes: 25 additions & 0 deletions docs/source/reference/config.rst
@@ -121,6 +121,31 @@ Available fields and semantics:
# will be added.
vpc_name: skypilot-vpc

# Should instances be assigned private IPs only? (optional)
#
# Set to true to use private IPs to communicate between the local client and
# any SkyPilot nodes. This requires the networking stack be properly set up.
#
# This flag is typically set together with 'vpc_name' above and
# 'ssh_proxy_command' below.
#
# Default: false.
use_internal_ips: true
# SSH proxy command (optional).
#
# Please refer to the aws.ssh_proxy_command section above for more details.
### Format 1 ###
# A string; the same proxy command is used for all regions.
ssh_proxy_command: ssh -W %h:%p -i ~/.ssh/sky-key -o StrictHostKeyChecking=no ec2-user@<jump server public ip>
### Format 2 ###
# A dict mapping region names to region-specific proxy commands.
# NOTE: This restricts SkyPilot's search space for this cloud to only use
# the specified regions and not any other regions in this cloud.
ssh_proxy_command:
us-east-1: ssh -W %h:%p -p 1234 -o StrictHostKeyChecking=no [email protected]
us-east-2: ssh -W %h:%p -i ~/.ssh/sky-key -o StrictHostKeyChecking=no ec2-user@<jump server public ip>


# Reserved capacity (optional).
#
# The specific reservation to be considered when provisioning clusters on GCP.
18 changes: 8 additions & 10 deletions sky/backends/backend_utils.py
@@ -44,6 +44,7 @@
from sky.skylet import constants
from sky.skylet import log_lib
from sky.usage import usage_lib
from sky.utils import cluster_yaml_utils
from sky.utils import command_runner
from sky.utils import common_utils
from sky.utils import controller_utils
@@ -64,7 +65,6 @@

# NOTE: keep in sync with the cluster template 'file_mounts'.
SKY_REMOTE_APP_DIR = '~/.sky/sky_app'
SKY_RAY_YAML_REMOTE_PATH = '~/.sky/sky_ray.yml'
# Exclude subnet mask from IP address regex.
IP_ADDR_REGEX = r'\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}(?!/\d{1,2})\b'
SKY_REMOTE_PATH = '~/.sky/wheels'
@@ -1017,14 +1017,13 @@ def write_cluster_config(
'user': get_cleaned_username(
os.environ.get(constants.USER_ENV_VAR, '')),

# AWS only:
'aws_vpc_name': skypilot_config.get_nested(('aws', 'vpc_name'),
None),
# Private IPs
'use_internal_ips': skypilot_config.get_nested(
('aws', 'use_internal_ips'), False),
# Not exactly AWS only, but we only test it's supported on AWS
# for now:
(str(cloud).lower(), 'use_internal_ips'), False),
'ssh_proxy_command': ssh_proxy_command,
'vpc_name': skypilot_config.get_nested(
(str(cloud).lower(), 'vpc_name'), None),

# User-supplied instance tags.
'instance_tags': instance_tags,

@@ -1033,8 +1032,6 @@
'resource_group': f'{cluster_name}-{region_name}',

# GCP only:
'gcp_vpc_name': skypilot_config.get_nested(('gcp', 'vpc_name'),
None),
'gcp_project_id': gcp_project_id,
'specific_reservations': filtered_specific_reservations,
'num_specific_reserved_workers': num_specific_reserved_workers,
@@ -1057,7 +1054,8 @@
'sky_remote_path': SKY_REMOTE_PATH,
'sky_local_path': str(local_wheel_path),
# Add yaml file path to the template variables.
'sky_ray_yaml_remote_path': SKY_RAY_YAML_REMOTE_PATH,
'sky_ray_yaml_remote_path':
cluster_yaml_utils.SKY_CLUSTER_YAML_REMOTE_PATH,
'sky_ray_yaml_local_path':
tmp_yaml_path
if not isinstance(cloud, clouds.Local) else yaml_path,
19 changes: 12 additions & 7 deletions sky/backends/cloud_vm_ray_backend.py
@@ -201,6 +201,7 @@ def __init__(self):
# For n nodes gang scheduling.
self._has_gang_scheduling = False
self._num_nodes = 0
self._provider_name = None

self._has_register_run_fn = False

@@ -312,6 +313,9 @@ def add_gang_scheduling_placement_group_and_setup(
self._has_gang_scheduling = True
self._num_nodes = num_nodes

if envs is None:
envs = {}

bundles = [copy.copy(resources_dict) for _ in range(num_nodes)]
# Set CPU to avoid ray hanging the resources allocation
# for remote functions, since the task will request 1 CPU
@@ -503,11 +507,12 @@ def add_ray_task(self,
f'placement_group_bundle_index={gang_scheduling_id})')

sky_env_vars_dict_str = [
textwrap.dedent("""\
sky_env_vars_dict = {}
textwrap.dedent(f"""\
sky_env_vars_dict = {{}}
sky_env_vars_dict['SKYPILOT_NODE_IPS'] = job_ip_list_str
# Environment starting with `SKY_` is deprecated.
sky_env_vars_dict['SKY_NODE_IPS'] = job_ip_list_str
sky_env_vars_dict['SKYPILOT_PROVIDER_NAME'] = {self._provider_name}
""")
]

@@ -2481,6 +2486,9 @@ def is_provided_ips_valid(ips: Optional[List[Optional[str]]]) -> bool:
return (ips is not None and len(ips) == self.num_node_ips and
all(ip is not None for ip in ips))

use_internal_ips = skypilot_config.get_nested(keys=(str(
self.launched_resources.cloud).lower(), 'use_internal_ips'),
default_value=False)
if is_provided_ips_valid(external_ips):
logger.debug(f'Using provided external IPs: {external_ips}')
cluster_external_ips = typing.cast(List[str], external_ips)
@@ -2491,7 +2499,7 @@ def is_provided_ips_valid(ips: Optional[List[Optional[str]]]) -> bool:
handle=self,
head_ip_max_attempts=max_attempts,
worker_ip_max_attempts=max_attempts,
get_internal_ips=False)
get_internal_ips=use_internal_ips)

if self.cached_external_ips == cluster_external_ips:
logger.debug('Skipping the fetching of internal IPs as the cached '
@@ -2505,10 +2513,7 @@ def is_provided_ips_valid(ips: Optional[List[Optional[str]]]) -> bool:
f'cached ({self.cached_external_ips}), new ({cluster_external_ips})'
)

is_cluster_aws = (self.launched_resources is not None and
isinstance(self.launched_resources.cloud, clouds.AWS))
if is_cluster_aws and skypilot_config.get_nested(
keys=('aws', 'use_internal_ips'), default_value=False):
if (self.launched_resources is not None and use_internal_ips):
# Optimization: if we know use_internal_ips is True (currently
# only exposed for AWS), then our AWS NodeProvider is
# guaranteed to pick subnets that will not assign public IPs,
2 changes: 2 additions & 0 deletions sky/serve/service.py
@@ -28,6 +28,7 @@
from sky.serve import serve_state
from sky.serve import serve_utils
from sky.utils import common_utils
from sky.utils import controller_utils
from sky.utils import subprocess_utils
from sky.utils import ux_utils

@@ -252,4 +253,5 @@ def _start(service_name: str, tmp_task_yaml: str, job_id: int):
# We start process with 'spawn', because 'fork' could result in weird
# behaviors; 'spawn' is also cross-platform.
multiprocessing.set_start_method('spawn', force=True)
controller_utils.setup_proxy_command_on_controller()

Member: See comment in skypilot_config. Feels safer to avoid overwriting.

_start(args.service_name, args.task_yaml, args.job_id)
16 changes: 4 additions & 12 deletions sky/skylet/events.py
@@ -11,12 +11,12 @@
import yaml

from sky import sky_logging
from sky.backends import backend_utils
from sky.backends import cloud_vm_ray_backend
from sky.serve import serve_utils
from sky.skylet import autostop_lib
from sky.skylet import job_lib
from sky.spot import spot_utils
from sky.utils import cluster_yaml_utils
from sky.utils import common_utils
from sky.utils import ux_utils

@@ -104,8 +104,8 @@ class AutostopEvent(SkyletEvent):
def __init__(self):
super().__init__()
autostop_lib.set_last_active_time_to_now()
self._ray_yaml_path = os.path.abspath(
os.path.expanduser(backend_utils.SKY_RAY_YAML_REMOTE_PATH))
self._ray_yaml_path = cluster_yaml_utils.get_cluster_yaml_absolute_path(
)

def _run(self):
autostop_config = autostop_lib.get_autostop_config()
@@ -139,16 +139,8 @@ def _stop_cluster(self, autostop_config):
cloud_vm_ray_backend.CloudVmRayBackend.NAME):
autostop_lib.set_autostopping_started()

provider_name = cluster_yaml_utils.get_provider_name()
config = common_utils.read_yaml(self._ray_yaml_path)

provider_module = config['provider']['module']
# Examples:
# 'sky.skylet.providers.aws.AWSNodeProviderV2' -> 'aws'
# 'sky.provision.aws' -> 'aws'
provider_search = re.search(r'(?:providers|provision)\.(\w+)\.?',
provider_module)
assert provider_search is not None, config
provider_name = provider_search.group(1).lower()
if provider_name in ('aws', 'gcp'):
logger.info('Using new provisioner to stop the cluster.')
self._stop_cluster_with_new_provisioner(autostop_config, config,
7 changes: 6 additions & 1 deletion sky/skylet/providers/gcp/config.py
@@ -910,6 +910,8 @@ def _configure_subnet(config, compute):
],
}
]
if config["provider"].get("use_internal_ips", False):
default_interfaces[0].pop("accessConfigs")

for node_config in node_configs:
# The not applicable key will be removed during node creation
@@ -920,7 +922,10 @@
# TPU
if "networkConfig" not in node_config:
node_config["networkConfig"] = copy.deepcopy(default_interfaces)[0]
node_config["networkConfig"].pop("accessConfigs")
# TPU doesn't have accessConfigs
node_config["networkConfig"].pop("accessConfigs", None)
if config["provider"].get("use_internal_ips", False):
node_config["networkConfig"]["enableExternalIps"] = False

return config

39 changes: 16 additions & 23 deletions sky/skypilot_config.py
@@ -49,7 +49,6 @@
import yaml

from sky import sky_logging
from sky.clouds import cloud_registry
from sky.utils import common_utils
from sky.utils import schemas

@@ -74,6 +73,7 @@

# The loaded config.
_dict = None
_loaded_config_path = None


def get_nested(keys: Iterable[str], default_value: Any) -> Any:
@@ -119,6 +119,19 @@ def set_nested(keys: Iterable[str], value: Any) -> Dict[str, Any]:
return to_return


def overwrite_config_file(config: dict) -> None:

Member: Is it safe? It feels hard to reason about. Any ways to avoid this?

Collaborator Author: When the controller is launching a task, it should use the task-specific skypilot_config decided by the SKYPILOT_CONFIG env var, and we have to update the proxy command setting in that config. Added a docstring advising caution when calling this.

If needed, we can save the new config in a new config file, reset os.environ['SKYPILOT_CONFIG'] to the new config file, and set _dict to the new config. I don't think that would make much difference compared with directly overwriting the original config file. Wdyt?

Member (@concretevitamin, Nov 28, 2023):

1. What's the reason for replacing the previous way, i.e., directly writing out the properly modified config to a local temp file and uploading it? It seems like the local client does know where the controller is/is going to be launched.
2. If there's a strong reason to do it this way, how about renaming this to unsafe_overwrite_config_file_on_controller or something like that?

Collaborator Author: The previous way was done before the controller is actually UP (failover may happen), which means we don't know which cloud the controller will use when we run sky.launch for the controller. The current way replaces the config on the controller, so it has the latest information about the controller itself. As for the renaming: yes, renaming it sounds good to me.

"""Overwrites the config file with the current config."""
global _dict, _loaded_config_path
if _loaded_config_path is None:
raise RuntimeError('No config file loaded.')
common_utils.validate_schema(config,
schemas.get_config_schema(),
f'Invalid config YAML: {config!r}',
skip_none=False)
common_utils.dump_yaml(_loaded_config_path, config)
_dict = config
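
For illustration, the flow described in the review thread above roughly amounts to the sketch below. It is hypothetical (the helper name and logic are assumptions, not the actual controller_utils.setup_proxy_command_on_controller implementation) and assumes the task-specific config pointed to by SKYPILOT_CONFIG has already been loaded:

    # Hypothetical sketch: the controller sits inside the cloud's network, so the
    # client-side jump-host proxy command is no longer needed there and can be
    # dropped from the task-specific config before tasks are launched.
    from sky import skypilot_config

    def _setup_proxy_command_on_controller_sketch(controller_cloud: str) -> None:
        config = skypilot_config.to_dict()
        if not config:
            return  # No config file was loaded; nothing to rewrite.
        cloud_config = config.get(controller_cloud, {})
        cloud_config.pop('ssh_proxy_command', None)
        config[controller_cloud] = cloud_config
        skypilot_config.overwrite_config_file(config)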


def to_dict() -> Dict[str, Any]:
"""Returns a deep-copied version of the current config."""
global _dict
@@ -127,27 +140,8 @@ def to_dict() -> Dict[str, Any]:
return {}


def _syntax_check_for_ssh_proxy_command(cloud: str) -> None:
ssh_proxy_command_config = get_nested((cloud.lower(), 'ssh_proxy_command'),
None)
if ssh_proxy_command_config is None or isinstance(ssh_proxy_command_config,
str):
return

if isinstance(ssh_proxy_command_config, dict):
for region, cmd in ssh_proxy_command_config.items():
if cmd and not isinstance(cmd, str):
raise ValueError(
f'Invalid ssh_proxy_command config for region {region!r} '
f'(expected a str): {cmd!r}')
return
raise ValueError(
'Invalid ssh_proxy_command config (expected a str or a dict with '
f'region names as keys): {ssh_proxy_command_config!r}')


def _try_load_config() -> None:
global _dict
global _dict, _loaded_config_path
config_path_via_env_var = os.environ.get(ENV_VAR_SKYPILOT_CONFIG)
if config_path_via_env_var is not None:
config_path = config_path_via_env_var
@@ -156,6 +150,7 @@ def _try_load_config() -> None:
config_path = os.path.expanduser(config_path)
if os.path.exists(config_path):
logger.debug(f'Using config path: {config_path}')
_loaded_config_path = config_path
try:
_dict = common_utils.read_yaml(config_path)
logger.debug(f'Config loaded:\n{pprint.pformat(_dict)}')
Expand All @@ -168,8 +163,6 @@ def _try_load_config() -> None:
f'Invalid config YAML ({config_path}): ',
skip_none=False)

for cloud in cloud_registry.CLOUD_REGISTRY:
_syntax_check_for_ssh_proxy_command(cloud)
logger.debug('Config syntax check passed.')


1 change: 1 addition & 0 deletions sky/spot/controller.py
@@ -523,4 +523,5 @@ def start(job_id, dag_yaml, retry_until_up):
# We start process with 'spawn', because 'fork' could result in weird
# behaviors; 'spawn' is also cross-platform.
multiprocessing.set_start_method('spawn', force=True)
controller_utils.setup_proxy_command_on_controller()

Member: See comment in skypilot_config. Feels safer to avoid overwriting.

start(args.job_id, args.dag_yaml, args.retry_until_up)
4 changes: 2 additions & 2 deletions sky/templates/aws-ray.yml.j2
@@ -36,9 +36,9 @@ provider:
security_group:
# AWS config file must include security group name
GroupName: {{security_group}}
{% if aws_vpc_name is not none %}
{% if vpc_name is not none %}
# NOTE: This is a new field added by SkyPilot to force use a specific VPC.
vpc_name: {{aws_vpc_name}}
vpc_name: {{vpc_name}}
{% endif %}
{%- if docker_login_config is not none %}
# We put docker login config in provider section because ray's schema disabled