-
Notifications
You must be signed in to change notification settings - Fork 539
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add sky show-gpus support for Kubernetes #2638
Add sky show-gpus support for Kubernetes #2638
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome, thanks @hemildesai! Left some comments about the functionality, code looks good otherwise.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @hemildesai!
if method_name == "list_accelerators": | ||
clouds.append("kubernetes") | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is ok for this PR, but we should add a comment here suggesting to remove this once the common service catalog functions are refactored from clouds/kubernetes.py
to kubernetes_catalog.py
(see todo).
Once that's done, we can safely add 'kubernetes'
to _ALL_CLOUDS
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed in 6b1691e
6b1691e
to
839ea03
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @hemildesai! Tried it on a cluster without GPUs and had some suggestions on how to handle that case better. Otherwise looks great!
sky/utils/kubernetes_utils.py
Outdated
@@ -148,6 +161,10 @@ def get_label_key(cls) -> str: | |||
def get_label_value(cls, accelerator: str) -> str: | |||
return get_gke_accelerator_name(accelerator) | |||
|
|||
@classmethod | |||
def get_accelerator_from_label_value(cls, value: str) -> str: | |||
return value.split('-')[-1].upper() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
accelerators may have hyphens in their names (e.g.,nvidia-a100-80gb
would end up being parsed as 80GB
here).
Can we do parsing by lstripping nvidia-
and/or nvidia-tesla-
? (see possible names here).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch, will update.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in fa0df6a
if Kubernetes not in global_user_state.get_enabled_clouds( | ||
) or not kubernetes_utils.check_credentials()[0]: | ||
return {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, there seems to be a bug - my GKE cluster has V100 yet this shows the error message:
# sky show-gpus does not find GPUs:
(base) ➜ ~ sky show-gpus --cloud kubernetes
No GPUs found in Kubernetes cluster. If your cluster contains GPUs, make sure nvidia.com/gpu resource is available on the nodes and the node labels for identifying GPUs (e.g., skypilot.co/accelerators) are setup correctly. To further debug, run: sky check.
# Confirming sky launch detects GPUs:
(base) ➜ ~ sky launch -c test --gpus V100:1
I 10-05 14:31:23 optimizer.py:682] == Optimizer ==
I 10-05 14:31:23 optimizer.py:693] Target: minimizing cost
I 10-05 14:31:23 optimizer.py:705] Estimated cost: $0.0 / hour
I 10-05 14:31:23 optimizer.py:705]
I 10-05 14:31:23 optimizer.py:777] Considered resources (1 node):
I 10-05 14:31:23 optimizer.py:826] ------------------------------------------------------------------------------------------------------
I 10-05 14:31:23 optimizer.py:826] CLOUD INSTANCE vCPUs Mem(GB) ACCELERATORS REGION/ZONE COST ($) CHOSEN
I 10-05 14:31:23 optimizer.py:826] ------------------------------------------------------------------------------------------------------
I 10-05 14:31:23 optimizer.py:826] Kubernetes 2CPU--8GB--1V100 2 8 V100:1 kubernetes 0.00 ✔
I 10-05 14:31:23 optimizer.py:826] IBM gx2-8x64x1v100 8 64 V100:1 us-east 2.50
I 10-05 14:31:23 optimizer.py:826] GCP n1-highmem-8 8 52 V100:1 us-central1-a 2.95
I 10-05 14:31:23 optimizer.py:826] AWS p3.2xlarge 8 61 V100:1 us-east-1 3.06
I 10-05 14:31:23 optimizer.py:826] Azure Standard_NC6s_v3 6 112 V100:1 eastus 3.06
I 10-05 14:31:23 optimizer.py:826] ------------------------------------------------------------------------------------------------------
I 10-05 14:31:23 optimizer.py:826]
Launching a new cluster 'test'. Proceed? [Y/n]: ^CAborted!
# Confirming the underlying code can detect GPUs:
(base) ➜ ~ SKYPILOT_DEBUG=1 python -c "import sky;print(sky.utils.kubernetes_utils.get_gpu_label_key_value('v100'))"
D 10-05 14:32:58 skypilot_config.py:157] Using config path: /Users/romilb/.sky/config.yaml
D 10-05 14:32:58 skypilot_config.py:160] Config loaded:
D 10-05 14:32:58 skypilot_config.py:160] None
D 10-05 14:32:58 skypilot_config.py:166] Config syntax check passed.
('cloud.google.com/gke-accelerator', 'nvidia-tesla-v100')
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comparison is failing - Kubernetes not in global_user_state.get_enabled_clouds()
. You may need to use Kubernetes()
and the is_same_cloud comparator.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed in 2134873.
Checked both clusters with/without GPUs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @hemildesai! This should be good to go after linting (./format.sh
) is fixed.
Tested on GKE multi-node cluster:
sky show-gpus
sky show-gpus --cloud kubernetes
sky show-gpus --cloud kubernetes -a
sky show-gpus
with a bad kubeconfigsky show-gpus
on a fresh environment without k8s configured
It looks like |
85c092d
to
6b0b40d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @hemildesai! Merging now 🎉
* Add sky show-gpus support for Kubernetes * Update sky/clouds/service_catalog/kubernetes_catalog.py Co-authored-by: Romil Bhardwaj <[email protected]> * PR feedback * PR feedback part 2 * Format fix * PR feedback part 3 * Fix bug with checking enabled clouds in k8s list_accelerators * Pylint fixes * Pylint fixes part 2 * Pylint fixes part 3 * Pylint fixes part 4 --------- Co-authored-by: Romil Bhardwaj <[email protected]>
Implements
list_accelerators
function for Kubernetes catalog. The function is dynamic and iterates through all nodes (checking for GPUs) currently, but the output can also be cached similar to other catalogs. Right now, it only works withsky show-gpus --cloud kubernetes
since_ALL_CLOUDS
in service_catalog doesn't includekubernetes
due to the lack of implementation of other catalog functions.Example output:
Addresses #2431.
Tested (run the relevant ones):
bash format.sh
pytest tests/test_smoke.py
pytest tests/test_smoke.py::test_fill_in_the_name
bash tests/backward_comaptibility_tests.sh