[Lambda] Fix termination for lambda #3410

Merged 2 commits on Apr 3, 2024
1 change: 1 addition & 0 deletions sky/clouds/lambda_cloud.py
@@ -280,6 +280,7 @@ def query_status(cls, name: str, tag_filters: Dict[str, str],
'booting': status_lib.ClusterStatus.INIT,
'active': status_lib.ClusterStatus.UP,
'unhealthy': status_lib.ClusterStatus.INIT,
+ 'terminating': None,
'terminated': None,
}
# TODO(ewzeng): filter by hash_filter_string to be safe
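
A minimal sketch of why the new `'terminating': None` entry matters, assuming (this diff does not show it) that `query_status` looks each raw API status up in this map and skips entries that map to `None`. The `ClusterStatus` enum and the `map_statuses` helper below are hypothetical stand-ins, not SkyPilot's actual code.

```python
import enum
from typing import Dict, List, Optional


class ClusterStatus(enum.Enum):
    """Hypothetical stand-in for status_lib.ClusterStatus."""
    INIT = 'INIT'
    UP = 'UP'


# Mirrors the map in the diff; None means "do not report this instance".
STATUS_MAP: Dict[str, Optional[ClusterStatus]] = {
    'booting': ClusterStatus.INIT,
    'active': ClusterStatus.UP,
    'unhealthy': ClusterStatus.INIT,
    'terminating': None,
    'terminated': None,
}


def map_statuses(raw_statuses: List[str]) -> List[ClusterStatus]:
    """Translate raw API statuses, dropping those mapped to None.

    Assumption: an unknown status raises KeyError, as a plain dict
    lookup would. Without the 'terminating' entry, an instance in
    that state would hit this error path.
    """
    mapped = []
    for raw in raw_statuses:
        status = STATUS_MAP[raw]
        if status is not None:
            mapped.append(status)
    return mapped
```

Under these assumptions, an instance that is still shutting down is treated the same as one that is already gone, so it no longer counts toward the cluster's reported status.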
6 changes: 5 additions & 1 deletion sky/skylet/providers/lambda_cloud/node_provider.py
@@ -152,7 +152,7 @@ def _match_tags(vm: Dict[str, Any]):
def _get_internal_ip(node: Dict[str, Any]):
# TODO(ewzeng): cache internal ips in metadata file to reduce
# ssh overhead.
- if node['external_ip'] is None:
+ if node['external_ip'] is None or node['status'] != 'active':
node['internal_ip'] = None
return
runner = command_runner.SSHCommandRunner(node['external_ip'],
@@ -172,6 +172,10 @@ def _get_internal_ip(node: Dict[str, Any]):
self.metadata.refresh([node['id'] for node in vms])
self._guess_and_add_missing_tags(vms)
nodes = [_extract_metadata(vm) for vm in filter(_match_tags, vms)]
+ nodes = [
+ node for node in nodes
+ if node['status'] not in ['terminating', 'terminated']
Collaborator

curious - is terminating a new state added by lambda?

Collaborator Author

@Michaelvll Michaelvll Apr 3, 2024

Not sure if it is newly added, but I saw the terminating state in the response from the Lambda API during the status query. It should be added now. : )

{'_content': b'{"data": [{"id": "9acba6b6ea0a4f01a832d0d6936c2d4f", "name": "test-lambda-084e-head", "ip": "129.158.54.150", "region": {"name": "us-east-1", "description": "Virginia, USA"}, "instance_type": {"name": "gpu_1x_a10", "price_cents_per_hour": 75, "description": "1x A10 (24 GB PCIe)", "specs": {"vcpus": 30, "memory_gib": 200, "storage_gib": 1400, "gpus": 1}}, 
"status": "terminating", 
"ssh_key_names": ["sky-key-084e3d6c"], "file_system_names": [], "hostname": "129-158-54-150.cloud.lambdalabs.com", "jupyter_token": "4a2aa85687fc4ff3bfe4e870c99b38c0", "jupyter_url": "https://jupyter-743363eeca7c4968b525ea825951ea2b.lambdaspaces.com/?token=4a2aa85687fc4ff3bfe4e870c99b38c0"}]}', '_content_consumed': True, '_next': None, 'status_code': 200, 'headers': {'Date': 'Tue, 02 Apr 2024 21:11:47 GMT', 'Content-Type': 'application/json', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'vary': 'Accept-Encoding, Cookie', 'x-frame-options': 'SAMEORIGIN', 'strict-transport-security': 'max-age=300; includeSubDomains', 'x-content-type-options': 'nosniff', 'referrer-policy': 'same-origin', 'Content-Encoding': 'gzip', 'CF-Cache-Status': 'DYNAMIC', 'Server': 'cloudflare', 'CF-RAY': '86e3d037aa330622-IAD'}, 'raw': <urllib3.response.HTTPResponse object at 0x14dae6beb2e0>, 'url': 'https://cloud.lambdalabs.com/api/v1/instances', 'encoding': 'utf-8', 'history': [], 'reason': 'OK', 'cookies': <RequestsCookieJar[]>, 'elapsed': datetime.timedelta(microseconds=609443), 'request': <PreparedRequest [GET]>, 'connection': <requests.adapters.HTTPAdapter object at 0x14dae6bea2f0>}

+ ]
subprocess_utils.run_in_parallel(_get_internal_ip, nodes)
self.cached_nodes = {node['id']: node for node in nodes}
return self.cached_nodes
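
The two changes in this file can be sketched in isolation as follows. This is an illustration under stated assumptions, not the provider's actual code: `filter_live_nodes` and `can_ssh` are hypothetical helper names, and the sample node dictionaries are made up (only the `status` and `external_ip` keys come from the diff).

```python
from typing import Any, Dict, List

# Statuses the diff treats as already gone.
GONE_STATUSES = ('terminating', 'terminated')


def filter_live_nodes(nodes: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Drop nodes that are terminating or terminated."""
    return [node for node in nodes if node['status'] not in GONE_STATUSES]


def can_ssh(node: Dict[str, Any]) -> bool:
    """Only 'active' nodes with an external IP are worth an SSH attempt."""
    return node['external_ip'] is not None and node['status'] == 'active'


# Hypothetical sample data, not taken from a real API response.
vms = [
    {'id': 'a', 'status': 'active', 'external_ip': '203.0.113.10'},
    {'id': 'b', 'status': 'booting', 'external_ip': '203.0.113.11'},
    {'id': 'c', 'status': 'terminating', 'external_ip': '203.0.113.12'},
]

live = filter_live_nodes(vms)
print([node['id'] for node in live])                    # ['a', 'b']
print([node['id'] for node in live if can_ssh(node)])   # ['a']
```

Filtering before the parallel SSH step means a node that is shutting down is never contacted, and the status check in `can_ssh` also skips nodes that are still booting.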