Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Lambda] Fix termination for lambda #3410

Merged
merged 2 commits into from
Apr 3, 2024
Merged

[Lambda] Fix termination for lambda #3410

merged 2 commits into from
Apr 3, 2024

Conversation

Michaelvll
Copy link
Collaborator

This includes the terminating status for Lambda Cloud, and non_terminated_nodes should filter out the terminating and terminated nodes.

sky launch -c test-lambda --cloud lambda --gpus A10; sky down test-lambda
Error out with:

  File "/home/azureuser/skypilot/sky/clouds/utils/lambda_utils.py", line 183, in remove_instances
    response = _try_request_with_backoff(
  File "/home/azureuser/skypilot/sky/clouds/utils/lambda_utils.py", line 118, in _try_request_with_backoff
    raise_lambda_error(response)
  File "/home/azureuser/skypilot/sky/clouds/utils/lambda_utils.py", line 95, in raise_lambda_error
    raise LambdaCloudError(f'{code}: {message}')
sky.clouds.utils.lambda_utils.LambdaCloudError: global/unknown: An unknown error occurred.

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
    • sky launch -c test-lambda --cloud lambda --gpus A10; sky down test-lambda
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: bash tests/backward_comaptibility_tests.sh

@Michaelvll Michaelvll changed the title [Provisioner] Fix termination for lambda [Lambda] Fix termination for lambda Apr 2, 2024
@Michaelvll Michaelvll requested a review from romilbhardwaj April 2, 2024 21:15
Copy link
Collaborator

@romilbhardwaj romilbhardwaj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for catching this @Michaelvll - I tested with sky launch -c myclus --cloud lambda --gpus A10 and sky down myclus and it works.

@@ -172,6 +172,10 @@ def _get_internal_ip(node: Dict[str, Any]):
self.metadata.refresh([node['id'] for node in vms])
self._guess_and_add_missing_tags(vms)
nodes = [_extract_metadata(vm) for vm in filter(_match_tags, vms)]
nodes = [
node for node in nodes
if node['status'] not in ['terminating', 'terminated']
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

curious - is terminating a new state added by lambda?

Copy link
Collaborator Author

@Michaelvll Michaelvll Apr 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if it is newly added, but I saw the terminating state in the response from the lambda API during the state query. It should be added now. : )

{'_content': b'{"data": [{"id": "9acba6b6ea0a4f01a832d0d6936c2d4f", "name": "test-lambda-084e-head", "ip": "129.158.54.150", "region": {"name": "us-east-1", "description": "Virginia, USA"}, "instance_type": {"name": "gpu_1x_a10", "price_cents_per_hour": 75, "description": "1x A10 (24 GB PCIe)", "specs": {"vcpus": 30, "memory_gib": 200, "storage_gib": 1400, "gpus": 1}}, 
"status": "terminating", 
"ssh_key_names": ["sky-key-084e3d6c"], "file_system_names": [], "hostname": "129-158-54-150.cloud.lambdalabs.com", "jupyter_token": "4a2aa85687fc4ff3bfe4e870c99b38c0", "jupyter_url": "https://jupyter-743363eeca7c4968b525ea825951ea2b.lambdaspaces.com/?token=4a2aa85687fc4ff3bfe4e870c99b38c0"}]}', '_content_consumed': True, '_next': None, 'status_code': 200, 'headers': {'Date': 'Tue, 02 Apr 2024 21:11:47 GMT', 'Content-Type': 'application/json', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'vary': 'Accept-Encoding, Cookie', 'x-frame-options': 'SAMEORIGIN', 'strict-transport-security': 'max-age=300; includeSubDomains', 'x-content-type-options': 'nosniff', 'referrer-policy': 'same-origin', 'Content-Encoding': 'gzip', 'CF-Cache-Status': 'DYNAMIC', 'Server': 'cloudflare', 'CF-RAY': '86e3d037aa330622-IAD'}, 'raw': <urllib3.response.HTTPResponse object at 0x14dae6beb2e0>, 'url': 'https://cloud.lambdalabs.com/api/v1/instances', 'encoding': 'utf-8', 'history': [], 'reason': 'OK', 'cookies': <RequestsCookieJar[]>, 'elapsed': datetime.timedelta(microseconds=609443), 'request': <PreparedRequest [GET]>, 'connection': <requests.adapters.HTTPAdapter object at 0x14dae6bea2f0>}

@Michaelvll Michaelvll merged commit ffc8618 into master Apr 3, 2024
20 checks passed
@Michaelvll Michaelvll deleted the lambda-termination branch April 3, 2024 03:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants