[k8s] Ensure Jump Pod is in "Running" Status Before Proceeding #2589

Conversation

@landscapepainter (Collaborator) commented Sep 21, 2023

Within node_provider.py/create_node(), we previously waited for the node pod to reach the "Running" status. This ensured the handling of potential errors, such as image pull failures, and allowed for the proper setting of environment variables. With the introduction of the jump pod creation process during provisioning, it's imperative to also ensure the jump pod is operational and in the "Running" status. This step is crucial to prevent scenarios where the node pod becomes active while the jump pod encounters errors and fails to launch.
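
For context, the "Running" check described above amounts to polling the pod phase until it is "Running". A minimal sketch using the upstream Kubernetes Python client (the actual provider code goes through SkyPilot's own kubernetes adaptor and richer error surfacing; the helper name here is illustrative):

import time

from kubernetes import client, config


def wait_for_pod_running(pod_name: str, namespace: str, timeout: int = 300) -> None:
    """Block until the pod reaches the 'Running' phase, or raise on timeout."""
    config.load_kube_config()
    core_api = client.CoreV1Api()
    deadline = time.time() + timeout
    while time.time() < deadline:
        pod = core_api.read_namespaced_pod(pod_name, namespace)
        if pod.status.phase == 'Running':
            return
        # The provider code described above additionally surfaces errors such
        # as image pull failures at this point instead of waiting for the
        # full timeout.
        time.sleep(1)
    raise TimeoutError(f'Pod {pod_name!r} did not reach Running status '
                       f'within {timeout} seconds.')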

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Manually running sky launch with kind/GKE
  • pytest tests/test_smoke.py --kubernetes -k "not TestStorageWithCredentials"

@romilbhardwaj (Collaborator) left a comment

Thanks @landscapepainter - left a comment.

@landscapepainter (Collaborator, Author)

@romilbhardwaj This is ready for another look

@romilbhardwaj (Collaborator) left a comment

Thanks @landscapepainter! This works, but there seem to be some bugs in the error handling/surfacing. See comments.

ssh_jump_pod_name = conf['metadata']['labels']['skypilot-ssh-jump']
jump_pod = kubernetes.core_api().read_namespaced_pod(
    ssh_jump_pod_name, self.namespace)
new_nodes.append(jump_pod)
@romilbhardwaj (Collaborator)

I tried to test this by editing kubernetes-ssh-jump.yml.j2 to request a large amount of CPU for the SSH jump pod (so that it fails to get scheduled):

# containers:
  ...
     resources:
       requests:
         cpu: 20

When I ran sky launch -c test -- echo hi, I got this error (failover continued as expected, but the error message was confusing):

D 10-26 11:32:53 cloud_vm_ray_backend.py:2006] `ray up` takes 11.3 seconds with 1 retries.
W 10-26 11:32:53 cloud_vm_ray_backend.py:997] Got error(s) in kubernetes:
W 10-26 11:32:53 cloud_vm_ray_backend.py:999] 	sky.skylet.providers.kubernetes.config.KubernetesError: An error occurred while trying to fetch the reason for pod scheduling failure. Error: UnboundLocalError: local variable 'event_message' referenced before assignment
...

On the second run, I got:

W 10-26 11:36:14 cloud_vm_ray_backend.py:999] 	sky.skylet.providers.kubernetes.config.KubernetesError: Timed out while waiting for nodes to start. Cluster may be out of resources or may be too slow to autoscale. Pod status: PendingDetails: 'skip schedule deleting pod: default/test-2ea4-ray-head'

Both of these messages appeared while kubectl describe pod was showing a clear failed-scheduling message:

Events:
  Type     Reason            Age   From               Message
  ----     ------            ----  ----               -------
  Warning  FailedScheduling  8s    default-scheduler  0/1 nodes are available: 1 Insufficient cpu. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod.

Compare this to our regular message when a task requests too many CPUs (e.g., sky launch --cpus 8), which clearly states that the cluster is out of CPU and suggests debugging steps. Can we have the same message for the jump pod too?

W 10-26 11:34:57 cloud_vm_ray_backend.py:997] Got error(s) in kubernetes:
W 10-26 11:34:57 cloud_vm_ray_backend.py:999] 	sky.skylet.providers.kubernetes.config.KubernetesError: Insufficient CPU capacity on the cluster. Other SkyPilot tasks or pods may be using resources. Check resource usage by running `kubectl describe nodes`
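
For reference, an UnboundLocalError like the one above typically comes from a variable that is only assigned inside a branch that never runs; a hypothetical illustration of the pattern (not the actual provider code):

def describe_scheduling_failure(events):
    # Hypothetical illustration of the bug pattern, not SkyPilot code.
    for event in events:
        if event.reason == 'FailedScheduling':
            event_message = event.message
    # If the pod is already scheduled (not Pending), no FailedScheduling
    # event exists, event_message is never assigned, and the line below
    # raises UnboundLocalError.
    return event_message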

@landscapepainter (Collaborator, Author) commented Oct 30, 2023

Had this error due to two reasons (a minimal sketch of the fix follows this list):

  1. Was not updating the pod status: the pods passed to _raise_pod_scheduling_errors via the new_nodes argument were not being re-read.
  2. Was attempting to read events of a pod that is already scheduled and no longer in Pending status.
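
A minimal sketch of the two fixes, assuming the provider's kubernetes adaptor (core_api(), as in the snippet earlier in this thread) and its KubernetesError; the function body is illustrative rather than the actual implementation:

from sky.adaptors import kubernetes  # assumed adaptor import, matching core_api() usage above
from sky.skylet.providers.kubernetes import config  # KubernetesError, per the logs above


def _raise_pod_scheduling_errors(namespace, new_nodes):
    # Illustrative sketch only; the real implementation differs.
    for node in new_nodes:
        # Fix 1: re-read each pod so its current status is inspected, rather
        # than the possibly stale objects passed in via new_nodes.
        pod = kubernetes.core_api().read_namespaced_pod(
            node.metadata.name, namespace)
        # Fix 2: only look for scheduling events on pods that are still
        # Pending; already-scheduled pods have no FailedScheduling event.
        if pod.status.phase != 'Pending':
            continue
        events = kubernetes.core_api().list_namespaced_event(
            namespace,
            field_selector=f'involvedObject.name={pod.metadata.name}')
        for event in events.items:
            if event.reason == 'FailedScheduling':
                raise config.KubernetesError(event.message)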

Fixed the two issues above. Setting excessive resources for either the jump pod or the node pod now fails with the correct error message. The following is displayed when an excessive amount of CPUs is requested for the jump pod:

$ sky launch --cloud kubernetes -y
I 10-30 05:57:34 optimizer.py:674] == Optimizer ==
I 10-30 05:57:34 optimizer.py:697] Estimated cost: $0.0 / hour
I 10-30 05:57:34 optimizer.py:697] 
I 10-30 05:57:34 optimizer.py:769] Considered resources (1 node):
I 10-30 05:57:34 optimizer.py:818] ---------------------------------------------------------------------------------------------
I 10-30 05:57:34 optimizer.py:818]  CLOUD        INSTANCE    vCPUs   Mem(GB)   ACCELERATORS   REGION/ZONE   COST ($)   CHOSEN   
I 10-30 05:57:34 optimizer.py:818] ---------------------------------------------------------------------------------------------
I 10-30 05:57:34 optimizer.py:818]  Kubernetes   2CPU--2GB   2       2         -              kubernetes    0.00          ✔     
I 10-30 05:57:34 optimizer.py:818] ---------------------------------------------------------------------------------------------
I 10-30 05:57:34 optimizer.py:818] 
Running task on cluster sky-xxxx-gcpuser...
I 10-30 05:57:34 cloud_vm_ray_backend.py:4382] Creating a new cluster: 'sky-xxxx-gcpuser' [1x Kubernetes(2CPU--2GB)].
I 10-30 05:57:34 cloud_vm_ray_backend.py:4382] Tip: to reuse an existing cluster, specify --cluster (-c). Run `sky status` to see existing clusters.
I 10-30 05:57:35 cloud_vm_ray_backend.py:1449] To view detailed progress: tail -n100 -f /home/gcpuser/sky_logs/sky-2023-10-30-05-57-33-829528/provision.log
I 10-30 05:57:36 cloud_vm_ray_backend.py:1884] Launching on Kubernetes 
W 10-30 05:57:48 cloud_vm_ray_backend.py:997] Got error(s) in kubernetes:
W 10-30 05:57:48 cloud_vm_ray_backend.py:999] 	sky.skylet.providers.kubernetes.config.KubernetesError: Insufficient CPU capacity on the cluster. Other SkyPilot tasks or pods may be using resources. Check resource usage by running `kubectl describe nodes`.
W 10-30 05:57:54 cloud_vm_ray_backend.py:2221] sky.exceptions.ResourcesUnavailableError: Failed to acquire resources in all zones in kubernetes. Try changing resource requirements or use another region.
W 10-30 05:57:54 cloud_vm_ray_backend.py:2230] 
W 10-30 05:57:54 cloud_vm_ray_backend.py:2230] Provision failed for 1x Kubernetes(2CPU--2GB) in kubernetes. Trying other locations (if any).
Clusters
No existing clusters.

sky.exceptions.ResourcesUnavailableError: Failed to provision all possible launchable resources. Relax the task's resource requirements: 1x Kubernetes()
To keep retrying until the cluster is up, use the `--retry-until-up` flag.

@romilbhardwaj (Collaborator)

Thanks for fixing this! We may also need to update kubernetes_utils.clean_zombie_ssh_jump_pod() to delete a pending SSH jump pod if it's out of resources. Currently, if the jump pod's CPU request cannot be satisfied, it is not cleaned up and stays in a Pending state indefinitely:

# Right after sky launch exits, the cluster pod is terminated
(base) ➜ k get pods
NAME                    READY   STATUS        RESTARTS   AGE
sky-ssh-jump-2ea485ef   0/1     Pending       0          13s
test-2ea4-ray-head      1/1     Terminating   0          13s


# But the ssh pod remains pending forever. We should clean this up in clean_zombie_ssh_jump_pod()
(base) ➜  k get pods
NAME                    READY   STATUS    RESTARTS   AGE
sky-ssh-jump-2ea485ef   0/1     Pending   0          15s

@landscapepainter (Collaborator, Author) left a comment

@romilbhardwaj Resolved the issue. This is ready for another look :)

@romilbhardwaj (Collaborator) left a comment

Thanks @landscapepainter! Ran into an issue with the cleanup of the SSH jump pod if provisioning fails; otherwise this PR is ready!

sky/skylet/providers/kubernetes/node_provider.py (outdated review thread, resolved)

@landscapepainter (Collaborator, Author) left a comment

@romilbhardwaj Thanks for the catch. I updated kubernetes_utils.clean_zombie_ssh_jump_pod() so that it deletes both the pod and the service when the jump pod is left dangling in Pending status. The jump pod is now removed when provisioning fails due to an error raised while launching it.
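
For illustration, the cleanup described here amounts to something like the sketch below, using the upstream Kubernetes client (the real logic lives in kubernetes_utils.clean_zombie_ssh_jump_pod() and covers more cases; the service is assumed here to share the jump pod's name):

from kubernetes import client, config


def _cleanup_pending_ssh_jump_pod(pod_name: str, namespace: str) -> None:
    # Hypothetical sketch of the behavior described above.
    config.load_kube_config()
    core_api = client.CoreV1Api()
    pod = core_api.read_namespaced_pod(pod_name, namespace)
    # A jump pod stuck in 'Pending' (e.g., an unsatisfiable CPU request)
    # after provisioning fails is a zombie: delete both the pod and its
    # companion service so they do not linger indefinitely.
    if pod.status.phase == 'Pending':
        core_api.delete_namespaced_pod(pod_name, namespace)
        core_api.delete_namespaced_service(pod_name, namespace)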

@romilbhardwaj (Collaborator) left a comment

Thanks @landscapepainter, this is good to go!

@landscapepainter merged commit 3fbde39 into skypilot-org:master Nov 7, 2023
18 checks passed
2 participants