Add support for Lambda Labs #1557
This is awesome @ewzeng. Some observations while trying it out. To stress test, I didn't exactly follow the steps ;)
- Switched to this PR, immediately ran sky check, saw GCP and Lambda disabled (expected). Then:
» sky launch --cloud lambda
Enabling Compute Engine API (free of charge; this may take a minute)...
Failed. Detailed output:
ERROR: (gcloud) The project property must be set to a valid project ID, not the project name [None]
To set your project, run:
$ gcloud config set project PROJECT_ID
or to unset it, run:
$ gcloud config unset project
sky.exceptions.ResourcesUnavailableError: Task sky-cmd requires Lambda which is not enabled. To enable access, run sky check , or change the cloud requirement
The GCP output is unexpected, while the last line is expected. Is this reproducible on your end?
- RE the step api_key=[YOUR_API_KEY] to ~/.lambda/lambda_keys: is it possible to make it so that users can simply place [YOUR_API_KEY] in the file?
I'd also propose changing it to ~/.lambda_labs/api_key (former = to be more precise; latter = to use their terminology).
- (For discussion) I feel ambivalent about the code name Lambda (which is less precise and can be confused with AWS Lambda, but is easier to type) vs. the longer name Lambda Labs, in --cloud and in the catalog folder name. May be worth polling the dev team once the PR settles. (Personally I think the shorter name is OK.)
- (Still not following the steps)
» sky launch 1 ↵
⠋ Updating Lambda catalog: lambda/vms.csv
E 12-27 10:34:07 common.py:120] Failed to fetch Lambda catalog lambda/vms.csv. Please check your internet connection.
requests.exceptions.HTTPError: 404 Client Error: Not Found for url: https://raw.githubusercontent.com/skypilot-org/skypilot-catalog/master/catalogs/v5/lambda/vms.csv
Since I have not placed the catalog file manually, this aborts the launch even for other clouds. I think the desired behavior is to keep going with the clouds that are OK.
Is this behavior also on master (e.g., if we manually remove + change AWS catalog's URL)? If so, it's okay and perhaps add a TODO.
- (For discussion) Now I have the catalog file. Slightly surprised by other clouds' default VM type being CPU-based, while Lambda's is GPU-based:
» sky cpunode 1 ↵
I 12-27 10:37:09 optimizer.py:606] == Optimizer ==
I 12-27 10:37:09 optimizer.py:618] Target: minimizing cost
I 12-27 10:37:09 optimizer.py:629] Estimated cost: $0.4 / hour
I 12-27 10:37:09 optimizer.py:629]
I 12-27 10:37:09 optimizer.py:686] Considered resources (1 node):
I 12-27 10:37:09 optimizer.py:714] ------------------------------------------------------------------------
I 12-27 10:37:09 optimizer.py:714]  CLOUD    INSTANCE           vCPUs   ACCELERATORS   COST ($)   CHOSEN
I 12-27 10:37:09 optimizer.py:714] ------------------------------------------------------------------------
I 12-27 10:37:09 optimizer.py:714]  AWS      m6i.2xlarge        8       -              0.38       ✔
I 12-27 10:37:09 optimizer.py:714]  Azure    Standard_D8_v4     8       -              0.38
I 12-27 10:37:09 optimizer.py:714]  Lambda   gpu_1x_a100_sxm4   30      A100:1         1.10
I 12-27 10:37:09 optimizer.py:714] ------------------------------------------------------------------------
Similar surprise when I typed sky launch and saw the table.
I think we can discuss / poll whether this output is okay or too surprising.
- With sky launch --cloud lambda -i1 there's a long stack trace. Maybe use:
with ux_utils.print_exception_no_traceback():
raise ...
- Tried V100:8:
I 12-27 10:47:06 cloud_vm_ray_backend.py:1311] Launching on Lambda europe-central-1 ()
W 12-27 10:47:09 cloud_vm_ray_backend.py:762] Got error(s) in europe-central-1:
W 12-27 10:47:09 cloud_vm_ray_backend.py:764] LambdaError: instance-operations/launch/insufficient-capacity: Not enough capacity to fulfill launch request.
Nits
- Can we remove the () after the region name?
- LambdaError -> LambdaLabsError?
- sky launch --cloud lambda -i1 --down --num-nodes 2 seems to proceed without an error saying that >1 node is currently not supported.
I 12-27 11:36:07 cloud_vm_ray_backend.py:1311] Launching on Lambda us-east-1 ()
I 12-27 11:38:43 log_utils.py:45] Head node is up.
I 12-27 11:39:50 cloud_vm_ray_backend.py:1421] Successfully provisioned or found existing head VM. Waiting for workers.
E 12-27 11:42:52 backend_utils.py:1015] Timed out: waited for more than 90 seconds for new workers to be provisioned, but no progress.
E 12-27 11:42:52 cloud_vm_ray_backend.py:1181] *** Failed provisioning the cluster. ***
E 12-27 11:42:52 cloud_vm_ray_backend.py:1183] *** Terminating the failed cluster. ***
I 12-27 11:43:18 cloud_vm_ray_backend.py:1311] Launching on Lambda us-west-2 ()
I 12-27 11:45:50 log_utils.py:45] Head node is up.
...
At some point I ctrl-c'd this and saw 2 instances in the console, one in us-east-1 (Virginia) and one in India (asia-south-1). Shouldn't the former have been terminated, according to the log above?
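The traceback-suppression suggestion above (with ux_utils.print_exception_no_traceback(): raise ...) can be illustrated with a minimal, self-contained sketch. This is not SkyPilot's code: the helper name no_traceback is hypothetical, and the approach (setting sys.tracebacklimit to 0) is an assumption about how such a helper might work.

```python
import contextlib
import sys


@contextlib.contextmanager
def no_traceback():
    """Hypothetical helper: hide the stack trace of errors raised inside.

    With sys.tracebacklimit set to 0, the interpreter prints only the
    exception type and message for an uncaught exception.
    """
    previous = getattr(sys, 'tracebacklimit', 1000)
    sys.tracebacklimit = 0
    yield
    # Deliberately not in a `finally` block: when an exception escapes the
    # `with` body, this line is skipped, so the limit is still 0 when the
    # interpreter prints the error. It is restored only on a normal exit.
    sys.tracebacklimit = previous


def check_num_nodes(num_nodes: int) -> None:
    """Illustrative user-facing check that raises a short error."""
    if num_nodes > 1:
        with no_traceback():
            raise ValueError('Multi-node clusters are not supported here.')
```

Calling check_num_nodes(2) at the top level of a script would then end with a single ValueError line rather than a long stack trace, which is the user experience the review comment asks for.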
Thanks for the detailed comments @concretevitamin!
Thanks for the fantastic work @ewzeng! I tried out the PR following the instructions given above:
Thanks for the comments @Michaelvll!
Is a regular Lambda user able to use the non-sxm4 A100? I think it would be better that a Lambda Labs user can use SkyPilot for A100 out of box by specifying
Sounds good! Let's do it in a future PR.
Yes, a regular Lambda user should be able to use the non-sxm4 A100 (if there is availability).
Thanks for picking this up @ewzeng! The progress is super exciting. I just tried launching/tearing down a gpunode. The VM got spun up and removed successfully; however, the autogenerated SSH key is still there in the dashboard. Shouldn't the SSH key that was generated in the Lambda console also be removed?
Thanks for the review @gmittal! I made the SSH key per-user (not per-cluster), so we don't need to remove it from the console. (The ssh key is actually just
@Michaelvll I pushed a fix for the
I reordered the Lambda catalog so that
Great, thanks for the excellent work @ewzeng! Just did a quick pass (will do a more thorough one later). The --use-spot and --gpus A100 options work well now.
Please remember to submit a PR for the catalog to this repo https://github.com/skypilot-org/skypilot-catalog, so that the user can automatically download the catalog from our repo.
I hit the following issue: sky launch -c test-lambda --gpus A100 echo hi fails:
> sky launch -c test-lambda --gpus A100 echo hi
Task from command: echo hi
I 01-04 00:30:29 optimizer.py:606] == Optimizer ==
I 01-04 00:30:29 optimizer.py:617] Target: minimizing cost
I 01-04 00:30:29 optimizer.py:629] Estimated cost: $1.1 / hour
I 01-04 00:30:29 optimizer.py:629]
I 01-04 00:30:29 optimizer.py:685] Considered resources (1 node):
I 01-04 00:30:29 optimizer.py:714] ------------------------------------------------------------------------
I 01-04 00:30:29 optimizer.py:714]  CLOUD    INSTANCE           vCPUs   ACCELERATORS   COST ($)   CHOSEN
I 01-04 00:30:29 optimizer.py:714] ------------------------------------------------------------------------
I 01-04 00:30:29 optimizer.py:714]  Lambda   gpu_1x_a100_sxm4   30      A100:1         1.10       ✔
I 01-04 00:30:29 optimizer.py:714]  GCP      a2-highgpu-1g      12      A100:1         3.67
I 01-04 00:30:29 optimizer.py:714] ------------------------------------------------------------------------
I 01-04 00:30:29 optimizer.py:714]
I 01-04 00:30:29 optimizer.py:729] Multiple Lambda instances satisfy A100:1. The cheapest Lambda(gpu_1x_a100_sxm4, {'A100': 1}) is considered among:
I 01-04 00:30:29 optimizer.py:729] ['gpu_1x_a100_sxm4', 'gpu_1x_a100'].
I 01-04 00:30:29 optimizer.py:729]
I 01-04 00:30:29 optimizer.py:735] To list more details, run 'sky show-gpus A100'.
Launching a new cluster 'test-lambda'. Proceed? [Y/n]:
Traceback (most recent call last):
  File "/Users/zhwu/miniconda3/envs/sky-dev/bin/sky", line 33, in <module>
    sys.exit(load_entry_point('skypilot', 'console_scripts', 'sky')())
  File "/Users/zhwu/miniconda3/envs/sky-dev/lib/python3.8/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/Users/zhwu/miniconda3/envs/sky-dev/lib/python3.8/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/Users/zhwu/Library/CloudStorage/OneDrive-Personal/AResource/PhD/Research/sky-computing/code/skypilot-lambda/sky/utils/common_utils.py", line 214, in _record
    return f(*args, **kwargs)
  File "/Users/zhwu/Library/CloudStorage/OneDrive-Personal/AResource/PhD/Research/sky-computing/code/skypilot-lambda/sky/cli.py", line 1009, in invoke
    return super().invoke(ctx)
  File "/Users/zhwu/miniconda3/envs/sky-dev/lib/python3.8/site-packages/click/core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/zhwu/miniconda3/envs/sky-dev/lib/python3.8/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/zhwu/miniconda3/envs/sky-dev/lib/python3.8/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/Users/zhwu/Library/CloudStorage/OneDrive-Personal/AResource/PhD/Research/sky-computing/code/skypilot-lambda/sky/utils/common_utils.py", line 235, in _record
    return f(*args, **kwargs)
  File "/Users/zhwu/Library/CloudStorage/OneDrive-Personal/AResource/PhD/Research/sky-computing/code/skypilot-lambda/sky/cli.py", line 1223, in launch
    _launch_with_confirm(
  File "/Users/zhwu/Library/CloudStorage/OneDrive-Personal/AResource/PhD/Research/sky-computing/code/skypilot-lambda/sky/cli.py", line 717, in _launch_with_confirm
    if resource.cloud.is_same_cloud(sky.Lambda()):
AttributeError: 'NoneType' object has no attribute 'is_same_cloud'
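The last frame of the traceback shows resource.cloud being None when is_same_cloud is called, presumably because the command did not pass --cloud. One way to guard against that is sketched below. The classes here are simplified stand-ins written for illustration, not SkyPilot's real Cloud, Lambda, or Resources classes, and targets_lambda is a hypothetical helper; the actual fix in the PR may look different.

```python
from typing import Optional


class Cloud:
    """Stand-in for a cloud class (illustrative only)."""

    def is_same_cloud(self, other: 'Cloud') -> bool:
        return type(self) is type(other)


class Lambda(Cloud):
    pass


class GCP(Cloud):
    pass


class Resources:
    """Stand-in for a resources object; cloud is None if unspecified."""

    def __init__(self, cloud: Optional[Cloud] = None) -> None:
        self.cloud = cloud


def targets_lambda(resource: Resources) -> bool:
    """Return True only when a cloud is set and it is Lambda."""
    # Checking for None first avoids the AttributeError in the traceback.
    return (resource.cloud is not None and
            resource.cloud.is_same_cloud(Lambda()))
```

With this guard, a resource without an explicit cloud simply skips the Lambda-specific branch instead of crashing.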
@Michaelvll I am once again asking for your review :)
Thanks for the quick fix @ewzeng! The combined smoke_tests look excellent. Left several comments, mostly for readability. : )
Thanks for the comments @Michaelvll @concretevitamin! I tried to address them all. Important updates:
In particular, please rename Potentially unfinished:
Thanks for the quick fix @ewzeng! Several final comments.
An issue:
- It seems autodown does not work for me:
sky launch -c test-lambda -i 1 --down
Thanks for the review (once again) @Michaelvll! Hmm, I ran autodown a few times and it worked each time. Are you sure you are launching on Lambda Cloud? Can you reproduce this bug? I pushed some updates. There are two things I wasn't sure about, so I left the conversations unresolved (internal ip and task.num_nodes).
Thanks again for the excellent PR for Lambda Labs @ewzeng! It looks good to me now. Just tried the autodown again and it works. It is probably because of some issues with my environment.
Thanks for the review @Michaelvll (sorry for giving you so much work). I just merged from master to make things work with commit 76eed62. Once I finish running the Lambda tests one final time, I will squash & merge this PR.
* Apply gmittal's lambda lab PR (skypilot-org#1136) on top of commit ad37a47
* Basic working Lambda Labs support
* Add error handling for Lambda Labs API and small lambda-ray.yml bugfix
* Add automatic key generation, improve sky check, and resolve import bug
* Improve Lambda Labs launch code and error handling
* Remove bootstrap_config, change metadata file design, and resolve provisioning bug
* Make autodown work on Lambda Labs
* Add basic tests and improve lambda-ray.yml.j2 bugfix
* Add sky cancel test and do not allow Lambda nodes to stop
* Polish provider code and change local metadata path to avoid clutter
* Update and move catalog out of repo
* Clean up code
* Cleanup and add CLI logs test
* Disallow --num-nodes > 1 and rename some variables
* Do not let optimizer consider Lambda Labs when launching spot
* Fix issues arising from merge
* Address Michaelvll comments: nits, improve error handling for autostop and --num-nodes > 1, regions_with_offering bugfix
* Address infwinston comments: nits, lambda_keys format, improve error handling for autostop and --use-spot
* Update Lambda Labs help string
* Move Lambda Lab tests into smoke tests and change local tag file location
* Improve remote node detection
* Change tag file scheme
* Add comments and change region_zone lookup
* Use same tag file path for local and remote
* Remove is_remote file
* Clean up imports in Lambda Labs node_provider
* Make optimizer skip clouds that do not implement requested_features
* Rename Lambda Labs client functions, nits
* Improve requested_features implementation, nits
* Add type annotations, nits
* Improve pytest serialization logic
* Improve requested_features, introduce CloudImplementationFeatures enums
* Update lambda_utils.Metadata, address nits
* Fix conftest.py bug introduced in previous commit
* Update test comment
* Rename Lambda Labs -> Lambda Cloud
* Fix tag file reuse bug
* Testing nit
* Fix auth bug and address nits
* Address final nits
* Fix typing issues from merge
* Provide basic support for cpus in resource specification
* Improve 'cpu' resource specification for Lambda Cloud
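Several commits above mention having the optimizer skip clouds that do not implement the requested features. A rough sketch of that idea follows. Everything in it is hypothetical: the Feature enum, the UNSUPPORTED table, and feasible_clouds are illustrative names rather than SkyPilot's actual CloudImplementationFeatures API, and the Lambda entries are only loosely inferred from the commit subjects (no stop, no multi-node, no spot), so they may not match the merged code.

```python
import enum
from typing import Dict, List, Set


class Feature(enum.Enum):
    """Hypothetical feature flags a task may request."""
    STOP = 'stop'
    MULTI_NODE = 'multi_node'
    SPOT = 'spot'


# Assumed for illustration only: features each cloud does NOT implement.
UNSUPPORTED: Dict[str, Set[Feature]] = {
    'AWS': set(),
    'Lambda': {Feature.STOP, Feature.MULTI_NODE, Feature.SPOT},
}


def feasible_clouds(requested: Set[Feature],
                    unsupported: Dict[str, Set[Feature]]) -> List[str]:
    """Return the clouds that implement every requested feature."""
    return sorted(cloud for cloud, missing in unsupported.items()
                  if not (requested & missing))
```

A request for Feature.MULTI_NODE would then drop Lambda from the candidate list up front, instead of failing later during provisioning.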
This PR adds support for Lambda Labs GPU Cloud.
How to try out [updated 1/27/2023]:
Setup
- Add api_key = [YOUR_API_KEY] to ~/.lambda_cloud/lambda_keys.
- Run sky check.
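The key file format used in the setup steps can be illustrated with a small parser. This is only a sketch: parse_key_file and read_api_key are hypothetical helpers, not SkyPilot functions, and the only details taken from the steps above are the api_key = ... line format and the ~/.lambda_cloud/lambda_keys path.

```python
from pathlib import Path
from typing import Dict, Optional


def parse_key_file(text: str) -> Dict[str, str]:
    """Parse 'name = value' lines, ignoring blanks and '#' comments."""
    entries: Dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith('#') or '=' not in line:
            continue
        name, _, value = line.partition('=')
        entries[name.strip()] = value.strip()
    return entries


def read_api_key(path: str = '~/.lambda_cloud/lambda_keys') -> Optional[str]:
    """Return the api_key entry from the key file, or None if unavailable."""
    key_path = Path(path).expanduser()
    if not key_path.is_file():
        return None
    return parse_key_file(key_path.read_text()).get('api_key')
```

Splitting on the first = and stripping whitespace means both api_key = [YOUR_API_KEY] and api_key=[YOUR_API_KEY] are accepted, which covers the two spellings that appear in this thread.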
Launch
Some things you can run:
sky gpunode --instance-type gpu_1x_a100_sxm4
sky launch --cloud lambda examples/minimal.yaml --down
sky launch --cloud lambda --gpus A100 examples/huggingface_glue_imdb_app.yaml
Features and Limitations
Some limitations are:
- --image-id is not supported (Lambda Cloud does not provide this feature)
Everything else should work. If you find a feature that is not supported, please let me know!
Acknowledgements
A large part of this PR is based on @gmittal's earlier work (#1136).
All suggestions and feedback are welcome! @concretevitamin @Michaelvll @infwinston @romilbhardwaj @gmittal