Add support for Lambda Labs #1136

Closed
wants to merge 8 commits into from

gmittal (Collaborator) commented Aug 30, 2022

This is a (very rough) first pass at implementing a node provider for Lambda Labs, which has some of the cheapest cloud GPUs available.

TODO

  • Single-node support (e.g. sky gpunode works e2e with failover)
  • sky check setup
  • Clean up clusters correctly (remove SSH keys, etc.)
  • Implement multi-node
  • Lots of tests... (YAML, partial cluster provisioning failure, etc.)

All suggestions and feedback welcome!

gmittal marked this pull request as draft August 30, 2022 05:38
ewzeng added a commit to ewzeng/skypilot that referenced this pull request Dec 9, 2022
Michaelvll linked an issue Dec 13, 2022 that may be closed by this pull request
ewzeng added a commit to ewzeng/skypilot that referenced this pull request Dec 22, 2022
ewzeng added a commit that referenced this pull request Jan 30, 2023
* Apply gmittal's lambda lab PR (#1136) on top of commit ad37a47

* Basic working Lambda Labs support

* Add error handling for Lambda Labs API and small lambda-ray.yml bugfix

* Add automatic key generation, improve sky check, and resolve import bug

* Improve Lambda Labs launch code and error handling

* Remove bootstrap_config, change metadata file design, and resolve
provisioning bug

* Make autodown work on Lambda Labs

* Add basic tests and improve lambda-ray.yml.j2 bugfix

* Add sky cancel test and do not allow Lambda nodes to stop

* Polish provider code and change local metadata path to avoid clutter

* Update and move catalog out of repo

* Clean up code

* Cleanup and add CLI logs test

* Disallow --num-nodes > 1 and rename some variables

* Do not let optimizer consider Lambda Labs when launching spot

* Fix issues arising from merge

* Address Michaelvll comments

Nits, improve error handling for autostop and --num-nodes > 1,
regions_with_offering bugfix

* Address infwinston comments

Nits, lambda_keys format, improve error handling for autostop and
--use-spot

* Update Lambda Labs help string

* Move Lambda Lab tests into smoke tests and change local tag file
location

* Improve remote node detection

* Change tag file scheme

* Add comments and change region_zone lookup

* Use same tag file path for local and remote

* Remove is_remote file

* Clean up imports in Lambda Labs node_provider

* Make optimizer skip clouds that do not implement requested_features

* Rename Lambda Labs client functions, nits

* Improve requested_features implementation, nits

* Add type annotations, nits

* Improve pytest serialization logic

* Improve requested_features, introduce CloudImplementationFeatures enums

* Update lambda_utils.Metadata, address nits

* Fix conftest.py bug introduced in previous commit

* Update test comment

* Rename Lambda Labs -> Lambda Cloud

* Fix tag file reuse bug

* Testing nit

* Fix auth bug and address nits

* Address final nits

* Fix typing issues from merge

* Provide basic support for cpus in resource specification

* Improve 'cpu' resource specification for Lambda Cloud
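
For readers unfamiliar with the "requested_features" commits above, here is a minimal sketch of the idea: each cloud declares which features it implements, and the optimizer skips any cloud that lacks a requested one. Only the enum name `CloudImplementationFeatures` comes from the commit message. The member names, the `SUPPORTED_FEATURES` table, the cloud names, and `feasible_clouds` are assumptions for illustration and are not taken from the SkyPilot code. The empty feature set for the Lambda entry reflects the commits that disallow stopping, `--num-nodes > 1`, and spot.

```python
import enum
from typing import Dict, List, Set


class CloudImplementationFeatures(enum.Enum):
    """Features a cloud may implement (member names are assumed)."""
    STOP = 'stop'
    MULTI_NODE = 'multi_node'
    SPOT = 'spot'


# Hypothetical capability table, for illustration only.
SUPPORTED_FEATURES: Dict[str, Set[CloudImplementationFeatures]] = {
    # A made-up cloud that implements every feature.
    'full_featured_cloud': set(CloudImplementationFeatures),
    # Per the commits above: no stop, no multi-node, no spot.
    'lambda': set(),
}


def feasible_clouds(
    requested_features: Set[CloudImplementationFeatures],
    supported: Dict[str, Set[CloudImplementationFeatures]] = SUPPORTED_FEATURES,
) -> List[str]:
    """Return the clouds that implement every requested feature."""
    return sorted(
        cloud for cloud, features in supported.items()
        if requested_features <= features  # subset check
    )
```

With this table, a plain single-node on-demand request (no features requested) keeps both clouds as candidates, while a request that needs `STOP` or `SPOT` leaves only the full-featured one.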
sumanthgenz pushed a commit to sumanthgenz/skypilot that referenced this pull request Feb 22, 2023
concretevitamin (Member) commented

Morphed into #1557.

sumanthgenz pushed a commit to sumanthgenz/skypilot that referenced this pull request Mar 15, 2023
Michaelvll deleted the lambda-labs branch December 18, 2024 18:26
Linked issue that may be closed by this pull request: [Feature Request] Compatibility with Lambda Cloud