[k8s] Add sky status flag to query global Kubernetes status #4040

Merged
merged 24 commits into master from k8s_global_status
Oct 11, 2024

Conversation

romilbhardwaj
Collaborator

@romilbhardwaj romilbhardwaj commented Oct 5, 2024

Adds a --kubernetes flag to sky status that shows the global state of the Kubernetes cluster, including SkyPilot clusters created by other users, so that users can see what is currently running on a shared cluster.

Example:

(base) ➜  sky-experiments git:(k8s_global_status) ✗ sky status --kubernetes
SkyPilot Clusters on Kubernetes
NAME        USER    LAUNCHED     RESOURCES                                      STATUS  
test-2ea4   romilb  45 secs ago  2x Kubernetes(cpus=2.0, mem=2.0)               UP      
train-j49a  test    1 min ago    2x Kubernetes(cpus=2.0, mem=8.0, {'H100': 1})  UP     

TODO:

  • Handle multiple contexts
  • Decide whether to poll current namespace or all namespaces
  • Remove user hash from cluster name
  • Cleaner code
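
For the "Remove user hash from cluster name" item, one possible approach is sketched below. It assumes the hash is a trailing dash followed by four lowercase alphanumeric characters, as in the example names above (test-2ea4, train-j49a). The helper name strip_user_hash is hypothetical and is not taken from the SkyPilot codebase.

```python
import re

# Assumption: the user hash is the final "-xxxx" token (4 lowercase
# alphanumeric characters), as seen in names like "test-2ea4".
_USER_HASH_SUFFIX = re.compile(r'-[0-9a-z]{4}$')


def strip_user_hash(cluster_name_on_cloud: str) -> str:
    """Returns the name without its trailing user-hash suffix, if any."""
    return _USER_HASH_SUFFIX.sub('', cluster_name_on_cloud, count=1)


for name in ('test-2ea4', 'train-j49a', 'sky-cmd-3-2ea4', 'mycluster'):
    print(name, '->', strip_user_hash(name))
```

A pattern-based strip like this would also trim a user-chosen name that happens to end in a four-character token (for example "my-test"), so a real implementation would probably need to compare against the known user hash rather than rely on the pattern alone.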

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: conda deactivate; bash -i tests/backward_compatibility_tests.sh

@romilbhardwaj romilbhardwaj marked this pull request as draft October 5, 2024 23:59
@romilbhardwaj
Collaborator Author

Updated to query job status from all running controllers:

(base) ➜  sky-experiments git:(k8s_global_status) ✗ sky status --kubernetes
SkyPilot Clusters on Kubernetes
USER     NAME                               LAUNCHED        RESOURCES                          STATUS  
gcpuser  sky-cmd-2-80b5                     9 mins ago      1x Kubernetes(cpus=2.0, mem=2.0)   UP      
romilb   sky-cmd-3-2ea4                     19 mins ago     1x Kubernetes(cpus=2.0, mem=2.0)   UP      
romilb   sky-jobs-controller-2ea485ea-2ea4  4 hrs ago       1x Kubernetes(cpus=4.0, mem=4.0)   UP      
gcpuser  sky-jobs-controller-80b50983-80b5  3 hrs ago       1x Kubernetes(cpus=8.0, mem=24.0)  UP      
romilb   train-2ea4                         a few secs ago  1x Kubernetes(cpus=2.0, mem=2.0)   UP      
Managed jobs from 2 users on Kubernetes
In progress tasks: 2 RUNNING, 1 STARTING
USER     ID  TASK  NAME     RESOURCES   SUBMITTED    TOT. DURATION  JOB DURATION  #RECOVERIES  STATUS     
romilb   4   -     sky-cmd  1x[CPU:16]  8 mins ago   8m 14s         -             0            STARTING   
romilb   3   -     sky-cmd  1x[CPU:1+]  19 mins ago  19m 18s        18m 31s       0            RUNNING    
romilb   2   -     sky-cmd  1x[CPU:1+]  1 hr ago     1h 1m 2s       1h 16s        0            SUCCEEDED  
romilb   1   -     sky-cmd  1x[CPU:1+]  4 hrs ago    1m 32s         8s            0            SUCCEEDED  
gcpuser  2   -     sky-cmd  1x[CPU:1+]  9 mins ago   9m 57s         9m 14s        0            RUNNING    
gcpuser  1   -     sky-cmd  1x[CPU:1+]  3 hrs ago    49s            8s            0            SUCCEEDED  

Still needs clean up and edge case handling.
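
As an illustration of the "In progress tasks" line in the output above, here is a minimal sketch that derives it from a list of job records. The record shape (dicts with "user" and "status" keys) and the set of statuses treated as in progress are assumptions made for this example, not the actual SkyPilot data model.

```python
from collections import Counter
from typing import Dict, List

# Assumption: statuses that count as "in progress"; the real set in
# SkyPilot may include more states.
IN_PROGRESS = ('RUNNING', 'STARTING')


def summarize_in_progress(jobs: List[Dict[str, str]]) -> str:
    """Builds a summary line such as 'In progress tasks: 2 RUNNING'."""
    counts = Counter(job['status'] for job in jobs)
    # Skip statuses with a zero count so they do not appear in the line.
    parts = [f'{counts[s]} {s}' for s in IN_PROGRESS if counts[s]]
    return 'In progress tasks: ' + ', '.join(parts)


# Statuses copied from the managed jobs table above.
jobs = [
    {'user': 'romilb', 'status': 'STARTING'},
    {'user': 'romilb', 'status': 'RUNNING'},
    {'user': 'romilb', 'status': 'SUCCEEDED'},
    {'user': 'romilb', 'status': 'SUCCEEDED'},
    {'user': 'gcpuser', 'status': 'RUNNING'},
    {'user': 'gcpuser', 'status': 'SUCCEEDED'},
]
print(summarize_in_progress(jobs))
# -> In progress tasks: 2 RUNNING, 1 STARTING
```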

@romilbhardwaj romilbhardwaj marked this pull request as ready for review October 9, 2024 06:35
@romilbhardwaj
Collaborator Author

romilbhardwaj commented Oct 9, 2024

This should be ready for review now.
[Image: skystatus]

  • Handle sky serve clusters?

@romilbhardwaj
Collaborator Author

Fixed a few nits, and added a hint for SkyServe replicas and a --k8s shorthand:

[Image]

@concretevitamin
Member

UX LGTM; quick nits:

  • I'd expect the spinner not to refresh for every user name, especially when there are a lot of users.
  • show-gpus context is None while status --k8s shows non-None context
    • also: former has Context, latter is lower case
  • -h: [Experimental] Show SkyPilot clusters from all users on Kubernetes. --> [Experimental] Show all SkyPilot resources (including from other users) of the current Kubernetes context.?
  • --k8s: discussed offline, this seems like a name we're ok with for now
  • Hint: SkyServe controllers detected in the cluster. SkyServe service replicas will be shown as SkyPilot clusters.
    • Hint: Currently, SkyServe replica pods are shown in the "SkyPilot clusters" section.
    • Maybe dim it since it doesn't warrant a lot of attention.
  • Managed jobs from 2 users -> Managed jobs

I tried launching a managed job on the same shared k8s cluster, and the job loops forever in STARTING. Controller logs:

...
(sky-cmd, pid=4467) Creating a new cluster: 'sky-cmd-1' [1x Kubernetes(2CPU--2GB, cpus=2)].
(sky-cmd, pid=4467) Tip: to reuse an existing cluster, specify --cluster (-c). Run `sky status` to see existing clusters.
(sky-cmd, pid=4467) To view detailed progress: tail -n100 -f /home/gcpuser/sky_logs/sky-2024-10-09-23-21-21-787148/provision.log
(sky-cmd, pid=4467) Launching on Kubernetes 'sky-cmd-1'.
(sky-cmd, pid=4467) run_instances: Error occurred when creating pods: Failed to check user privilege for pod sky-cmd-1-8a39-head with return code 1: 'if [ $(id -u) -eq 0 ]; then  echo \'alias sudo=""\' >> ~/.bashrc; echo succeed;else   if command -v sudo >/dev/null 2>&1; then     timeout 2 sudo -l >/dev/null 2>&1 && echo succeed ||     ( echo 52; );   else     ( echo 52; );   fi; fi'
(sky-cmd, pid=4467) Output: error: The gcp auth plugin has been removed.
(sky-cmd, pid=4467) Please use the "gke-gcloud-auth-plugin" kubectl/client-go credential plugin instead.
(sky-cmd, pid=4467) See https://cloud.google.com/blog/products/containers-kubernetes/kubectl-auth-changes-in-gke for further details
(sky-cmd, pid=4467) .
(sky-cmd, pid=4467) sky.exceptions.ResourcesUnavailableError: Failed to acquire resources in all zones in gke_skypilot-375900_us-central1-c_gkeusc6. Try changing resource requirements or use another region.
...
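
The log above reports that the legacy gcp auth plugin has been removed in favor of gke-gcloud-auth-plugin. A minimal sketch for spotting kubeconfig users that still reference the legacy provider follows. It assumes the kubeconfig has already been parsed into a dict (for example with a YAML loader); the function name and the sample user names are hypothetical.

```python
from typing import Any, Dict, List


def users_with_legacy_gcp_auth(kubeconfig: Dict[str, Any]) -> List[str]:
    """Returns names of kubeconfig users that use 'auth-provider: gcp'."""
    legacy = []
    for entry in kubeconfig.get('users') or []:
        user = entry.get('user') or {}
        provider = user.get('auth-provider') or {}
        if provider.get('name') == 'gcp':
            legacy.append(entry.get('name', '<unnamed>'))
    return legacy


# Hypothetical parsed kubeconfig: one legacy entry, one exec-plugin entry.
sample = {
    'users': [
        {'name': 'legacy-user',
         'user': {'auth-provider': {'name': 'gcp'}}},
        {'name': 'plugin-user',
         'user': {'exec': {'command': 'gke-gcloud-auth-plugin'}}},
    ]
}
print(users_with_legacy_gcp_auth(sample))
```

Entries flagged by such a check would need to be regenerated with the exec-based plugin before the controller can talk to the GKE cluster.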

@romilbhardwaj
Collaborator Author

Fixed UX comments and added error handling.

Member

@concretevitamin concretevitamin left a comment

LGTM, thanks @romilbhardwaj.

sky/cli.py (resolved)
sky/utils/cli_utils/status_utils.py (outdated, resolved)
sky/data/storage_utils.py (outdated, resolved)
sky/jobs/core.py Outdated
Comment on lines 143 to 146
def queue_kubernetes(pod_name: str,
                     context: Optional[str] = None,
                     skip_finished: bool = False) -> List[Dict[str, Any]]:
    """Gets the jobs queue from a specific controller pod.
Member

This naming is surprising. From the name I thought it's "gets the queue info for an entire k8s cluster". Maybe rename?

Collaborator Author

Updated to queue_from_kubernetes_pod
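
To illustrate why the per-pod name fits, here is a rough sketch of how a caller might fan out over several controller pods and merge the results. The per-pod query is injected as queue_fn, with the same parameters as the signature quoted above, so the sketch runs without a cluster. collect_jobs, fake_queue, and the pod-to-user mapping are hypothetical and do not reflect the code merged in this PR.

```python
from typing import Any, Callable, Dict, List, Optional

QueueFn = Callable[..., List[Dict[str, Any]]]


def collect_jobs(controller_pods: Dict[str, str],
                 queue_fn: QueueFn,
                 context: Optional[str] = None,
                 skip_finished: bool = False) -> List[Dict[str, Any]]:
    """Queries each controller pod and tags every job with its user.

    controller_pods maps pod name -> user name (hypothetical input shape).
    """
    all_jobs = []
    for pod_name, user in controller_pods.items():
        pod_jobs = queue_fn(pod_name,
                            context=context,
                            skip_finished=skip_finished)
        for job in pod_jobs:
            all_jobs.append({**job, 'user': user})
    return all_jobs


def fake_queue(pod_name, context=None, skip_finished=False):
    # Stand-in for the per-pod query; returns one made-up job per pod.
    return [{'job_id': 1, 'status': 'RUNNING', 'pod': pod_name}]


# Pod and user names taken from the status output earlier in this thread.
pods = {
    'sky-jobs-controller-2ea485ea-2ea4': 'romilb',
    'sky-jobs-controller-80b50983-80b5': 'gcpuser',
}
merged = collect_jobs(pods, fake_queue)
print([(job['user'], job['job_id']) for job in merged])
```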

sky/utils/cli_utils/status_utils.py (resolved)
sky/utils/common_utils.py (resolved)
@romilbhardwaj romilbhardwaj added this pull request to the merge queue Oct 11, 2024
Merged via the queue into master with commit f63850b Oct 11, 2024
20 checks passed
@romilbhardwaj romilbhardwaj deleted the k8s_global_status branch October 11, 2024 17:11
2 participants