Add sky show-gpus support for Kubernetes #2638

Merged: 11 commits, Oct 7, 2023

Contributor

@hemildesai hemildesai commented Oct 2, 2023

Implements the list_accelerators function for the Kubernetes catalog. The function is currently dynamic: it iterates through all nodes and checks each one for GPUs, though the output could also be cached, as in other catalogs. Right now it only works with sky show-gpus --cloud kubernetes, because _ALL_CLOUDS in service_catalog does not include kubernetes; the other catalog functions are not yet implemented for it.

Example output:

sky show-gpus --cloud kubernetes          
   
COMMON_GPU  AVAILABLE_QUANTITIES  
T4          1                     
V100        1                     

Hint: use -a/--all to see all accelerators (including non-common ones) and pricing.
sky show-gpus --cloud kubernetes -a

COMMON_GPU  AVAILABLE_QUANTITIES  
T4          1                     
V100        1                     



GPU  QTY  CLOUD       INSTANCE_TYPE  DEVICE_MEM  vCPUs  HOST_MEM  HOURLY_PRICE  HOURLY_SPOT_PRICE  
T4   1    Kubernetes  (attachable)   -           -      -         $ 0.000       $ 0.000            

GPU   QTY  CLOUD       INSTANCE_TYPE  DEVICE_MEM  vCPUs  HOST_MEM  HOURLY_PRICE  HOURLY_SPOT_PRICE  
V100  1    Kubernetes  (attachable)   -           -      -         $ 0.000       $ 0.000      

Addresses #2431.
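
The node scan described above can be sketched roughly as follows. This is an illustration only, not the code in this PR: nodes are modeled as plain dicts rather than Kubernetes client objects, the helper name available_gpu_quantities is hypothetical, and the label key is taken from the error message quoted later in this thread.

```python
from collections import defaultdict
from typing import Dict, List, Set

# Resource name quoted in the error message in this thread.
GPU_RESOURCE = "nvidia.com/gpu"


def available_gpu_quantities(nodes: List[dict],
                             label_key: str) -> Dict[str, List[int]]:
    """Map accelerator name -> sorted per-node GPU counts seen in the cluster.

    Each node is a hypothetical simplified dict of the form
    {"labels": {...}, "allocatable": {...}}.
    """
    quantities: Dict[str, Set[int]] = defaultdict(set)
    for node in nodes:
        name = node.get("labels", {}).get(label_key)
        count = int(node.get("allocatable", {}).get(GPU_RESOURCE, 0))
        # Skip nodes with no accelerator label or no allocatable GPUs.
        if name and count > 0:
            quantities[name].add(count)
    return {name: sorted(counts) for name, counts in quantities.items()}


nodes = [
    {"labels": {"skypilot.co/accelerators": "T4"},
     "allocatable": {"nvidia.com/gpu": "1"}},
    {"labels": {"skypilot.co/accelerators": "V100"},
     "allocatable": {"nvidia.com/gpu": "1"}},
    {"labels": {}, "allocatable": {"cpu": "4"}},  # CPU-only node, skipped
]
result = available_gpu_quantities(nodes, "skypilot.co/accelerators")
```

Because the result is derived from the live node list, a cluster with no labeled GPU nodes yields an empty mapping, which is the case the "No GPUs found" message later in this thread has to handle.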

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: bash tests/backward_comaptibility_tests.sh

Collaborator

@romilbhardwaj romilbhardwaj left a comment

Awesome, thanks @hemildesai! Left some comments about the functionality, code looks good otherwise.

sky/clouds/service_catalog/kubernetes_catalog.py (outdated, resolved)
Collaborator

@romilbhardwaj romilbhardwaj left a comment

Thanks @hemildesai!

Comment on lines 23 to 28
    if method_name == "list_accelerators":
        clouds.append("kubernetes")

Collaborator

This is OK for this PR, but we should add a comment here noting that this special case should be removed once the common service catalog functions are refactored from clouds/kubernetes.py to kubernetes_catalog.py (see todo).

Once that's done, we can safely add 'kubernetes' to _ALL_CLOUDS.

Contributor Author

Addressed in 6b1691e

Collaborator

@romilbhardwaj romilbhardwaj left a comment

Thanks @hemildesai! Tried it on a cluster without GPUs and had some suggestions on how to handle that case better. Otherwise looks great!

@@ -148,6 +161,10 @@ def get_label_key(cls) -> str:
    def get_label_value(cls, accelerator: str) -> str:
        return get_gke_accelerator_name(accelerator)

    @classmethod
    def get_accelerator_from_label_value(cls, value: str) -> str:
        return value.split('-')[-1].upper()
Collaborator

Accelerators may have hyphens in their names (e.g., nvidia-a100-80gb would end up being parsed as 80GB here).

Can we parse by stripping the nvidia- and/or nvidia-tesla- prefix instead? (See possible names here.)
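
One way the suggested prefix stripping could look is sketched below. The function name accelerator_from_gke_label is hypothetical and this is not necessarily what the follow-up commit does. Note that Python's str.lstrip removes a set of characters rather than a prefix, so 'nvidia-a100-80gb'.lstrip('nvidia-') would also eat the leading 'a' of a100; the sketch uses an explicit startswith check instead.

```python
def accelerator_from_gke_label(value: str) -> str:
    """Strip a known vendor prefix from a GKE accelerator label value."""
    # Check the longer prefix first so 'nvidia-tesla-' is removed in full.
    for prefix in ("nvidia-tesla-", "nvidia-"):
        if value.startswith(prefix):
            return value[len(prefix):].upper()
    # Unknown naming scheme: fall back to the upper-cased raw value.
    return value.upper()


v100 = accelerator_from_gke_label("nvidia-tesla-v100")
a100 = accelerator_from_gke_label("nvidia-a100-80gb")
# The split-based approach under review keeps only the last segment.
old_a100 = "nvidia-a100-80gb".split("-")[-1].upper()
```

With this approach nvidia-tesla-v100 maps to V100 and nvidia-a100-80gb keeps its full A100-80GB name, whereas the split-based version returns 80GB for the latter.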

Contributor Author

Good catch, will update.

Contributor Author

Done in fa0df6a

sky/cli.py (resolved)
Comment on lines 43 to 45
    if Kubernetes not in global_user_state.get_enabled_clouds(
    ) or not kubernetes_utils.check_credentials()[0]:
        return {}
Collaborator

Hmm, there seems to be a bug: my GKE cluster has a V100, yet this shows the error message:

# sky show-gpus does not find GPUs:
(base) ➜  ~ sky show-gpus --cloud kubernetes
No GPUs found in Kubernetes cluster. If your cluster contains GPUs, make sure nvidia.com/gpu resource is available on the nodes and the node labels for identifying GPUs (e.g., skypilot.co/accelerators) are setup correctly. To further debug, run: sky check.


# Confirming sky launch detects GPUs:
(base) ➜  ~ sky launch -c test --gpus V100:1
I 10-05 14:31:23 optimizer.py:682] == Optimizer ==
I 10-05 14:31:23 optimizer.py:693] Target: minimizing cost
I 10-05 14:31:23 optimizer.py:705] Estimated cost: $0.0 / hour
I 10-05 14:31:23 optimizer.py:705]
I 10-05 14:31:23 optimizer.py:777] Considered resources (1 node):
I 10-05 14:31:23 optimizer.py:826] ------------------------------------------------------------------------------------------------------
I 10-05 14:31:23 optimizer.py:826]  CLOUD        INSTANCE           vCPUs   Mem(GB)   ACCELERATORS   REGION/ZONE     COST ($)   CHOSEN
I 10-05 14:31:23 optimizer.py:826] ------------------------------------------------------------------------------------------------------
I 10-05 14:31:23 optimizer.py:826]  Kubernetes   2CPU--8GB--1V100   2       8         V100:1         kubernetes      0.00          ✔
I 10-05 14:31:23 optimizer.py:826]  IBM          gx2-8x64x1v100     8       64        V100:1         us-east         2.50
I 10-05 14:31:23 optimizer.py:826]  GCP          n1-highmem-8       8       52        V100:1         us-central1-a   2.95
I 10-05 14:31:23 optimizer.py:826]  AWS          p3.2xlarge         8       61        V100:1         us-east-1       3.06
I 10-05 14:31:23 optimizer.py:826]  Azure        Standard_NC6s_v3   6       112       V100:1         eastus          3.06
I 10-05 14:31:23 optimizer.py:826] ------------------------------------------------------------------------------------------------------
I 10-05 14:31:23 optimizer.py:826]
Launching a new cluster 'test'. Proceed? [Y/n]: ^CAborted!


# Confirming the underlying code can detect GPUs:
(base) ➜  ~ SKYPILOT_DEBUG=1 python -c "import sky;print(sky.utils.kubernetes_utils.get_gpu_label_key_value('v100'))"
D 10-05 14:32:58 skypilot_config.py:157] Using config path: /Users/romilb/.sky/config.yaml
D 10-05 14:32:58 skypilot_config.py:160] Config loaded:
D 10-05 14:32:58 skypilot_config.py:160] None
D 10-05 14:32:58 skypilot_config.py:166] Config syntax check passed.
('cloud.google.com/gke-accelerator', 'nvidia-tesla-v100')

Collaborator

This comparison is failing: Kubernetes not in global_user_state.get_enabled_clouds(). You may need to use Kubernetes() and the is_same_cloud comparator.
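
A small self-contained sketch of why a check like this can misfire, assuming (as the suggested fix implies) that the enabled-clouds list holds cloud instances rather than classes. The Cloud, Kubernetes and GCP classes and the is_enabled helper below are simplified stand-ins, not SkyPilot's real classes.

```python
from typing import List


class Cloud:
    """Hypothetical minimal cloud base class for illustration."""

    def is_same_cloud(self, other: "Cloud") -> bool:
        return isinstance(other, type(self))


class Kubernetes(Cloud):
    pass


class GCP(Cloud):
    pass


def is_enabled(cloud: Cloud, enabled: List[Cloud]) -> bool:
    """Instance-based membership test using the comparator."""
    return any(cloud.is_same_cloud(c) for c in enabled)


# Instances, as an enabled-clouds list might hold them.
enabled_clouds = [GCP(), Kubernetes()]

# The class object never equals an instance, so this is True even though
# Kubernetes is enabled, and the early `return {}` would always trigger.
class_check = Kubernetes not in enabled_clouds

# Comparing an instance through is_same_cloud finds the match.
instance_check = is_enabled(Kubernetes(), enabled_clouds)
```

Under that assumption, the class-based test reports Kubernetes as missing on every cluster, which matches the "No GPUs found" output above on a cluster that does have a V100.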

Contributor Author

Fixed in 2134873.
Checked on clusters both with and without GPUs.

Collaborator

@romilbhardwaj romilbhardwaj left a comment

Thanks @hemildesai! This should be good to go after linting (./format.sh) is fixed.

Tested on GKE multi-node cluster:

  • sky show-gpus
  • sky show-gpus --cloud kubernetes
  • sky show-gpus --cloud kubernetes -a
  • sky show-gpus with a bad kubeconfig
  • sky show-gpus on a fresh environment without k8s configured

@hemildesai
Contributor Author

It looks like pylint in ./format.sh doesn't work properly on Python 3.11. Manually fixed the errors for now.

Collaborator

@romilbhardwaj romilbhardwaj left a comment

Thanks @hemildesai! Merging now 🎉

@romilbhardwaj romilbhardwaj merged commit 9ff1927 into skypilot-org:master Oct 7, 2023
18 checks passed
jc9123 pushed a commit to jc9123/skypilot that referenced this pull request Oct 11, 2023
* Add sky show-gpus support for Kubernetes

* Update sky/clouds/service_catalog/kubernetes_catalog.py

Co-authored-by: Romil Bhardwaj <[email protected]>

* PR feedback

* PR feedback part 2

* Format fix

* PR feedback part 3

* Fix bug with checking enabled clouds in k8s list_accelerators

* Pylint fixes

* Pylint fixes part 2

* Pylint fixes part 3

* Pylint fixes part 4

---------

Co-authored-by: Romil Bhardwaj <[email protected]>