[k8s] Ensure Jump Pod is in "Running" Status Before Proceeding #2589

Conversation

@landscapepainter (Collaborator) commented Sep 21, 2023

Within node_provider.py/create_node(), we previously waited for the node pod to reach the "Running" status. This ensured the handling of potential errors, such as image pull failures, and allowed for the proper setting of environment variables. With the introduction of the jump pod creation process during provisioning, it's imperative to also ensure the jump pod is operational and in the "Running" status. This step is crucial to prevent scenarios where the node pod becomes active while the jump pod encounters errors and fails to launch.
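
For context, the "Running" check described above amounts to polling the pod phase until it is "Running". A minimal sketch using the upstream Kubernetes Python client (the actual provider code goes through SkyPilot's own kubernetes adaptor and richer error surfacing; the helper name here is illustrative):

import time

from kubernetes import client, config


def wait_for_pod_running(pod_name: str, namespace: str, timeout: int = 300) -> None:
    """Block until the pod reaches the 'Running' phase, or raise on timeout."""
    config.load_kube_config()
    core_api = client.CoreV1Api()
    deadline = time.time() + timeout
    while time.time() < deadline:
        pod = core_api.read_namespaced_pod(pod_name, namespace)
        if pod.status.phase == 'Running':
            return
        # The provider code described above additionally surfaces errors such
        # as image pull failures at this point instead of waiting for the
        # full timeout.
        time.sleep(1)
    raise TimeoutError(f'Pod {pod_name!r} did not reach Running status '
                       f'within {timeout} seconds.')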

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Manually running sky launch with kind/GKE
  • pytest tests/test_smoke.py --kubernetes -k "not TestStorageWithCredentials"

@romilbhardwaj (Collaborator) left a comment

Thanks @landscapepainter - left a comment.

@landscapepainter (Collaborator, Author)

@romilbhardwaj This is ready for another look

@romilbhardwaj (Collaborator) left a comment

Thanks @landscapepainter! This works, but there seem to be some bugs in the error handling/surfacing. See comments.

ssh_jump_pod_name = conf['metadata']['labels']['skypilot-ssh-jump']
jump_pod = kubernetes.core_api().read_namespaced_pod(
    ssh_jump_pod_name, self.namespace)
new_nodes.append(jump_pod)
@romilbhardwaj (Collaborator)

I tried to test this by editing kubernetes-ssh-jump.yml.j2 to request a large amount of CPU for the SSH jump pod (so that it fails to get scheduled):

# containers:
  ...
     resources:
       requests:
         cpu: 20

When I ran sky launch -c test -- echo hi, I got this error (failover continued as expected, but the error message was confusing):

D 10-26 11:32:53 cloud_vm_ray_backend.py:2006] `ray up` takes 11.3 seconds with 1 retries.
W 10-26 11:32:53 cloud_vm_ray_backend.py:997] Got error(s) in kubernetes:
W 10-26 11:32:53 cloud_vm_ray_backend.py:999] 	sky.skylet.providers.kubernetes.config.KubernetesError: An error occurred while trying to fetch the reason for pod scheduling failure. Error: UnboundLocalError: local variable 'event_message' referenced before assignment
...

On the second run, I got:

W 10-26 11:36:14 cloud_vm_ray_backend.py:999] 	sky.skylet.providers.kubernetes.config.KubernetesError: Timed out while waiting for nodes to start. Cluster may be out of resources or may be too slow to autoscale. Pod status: PendingDetails: 'skip schedule deleting pod: default/test-2ea4-ray-head'

Both of these messages appeared while kubectl describe pod was showing a clear failed-scheduling message:

Events:
  Type     Reason            Age   From               Message
  ----     ------            ----  ----               -------
  Warning  FailedScheduling  8s    default-scheduler  0/1 nodes are available: 1 Insufficient cpu. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod.

Compare this to our regular message when a task requests too many CPUs (e.g., sky launch --cpus 8), which clearly states that the cluster is out of CPU and suggests debugging steps. Can we have the same message for the jump pod too?

W 10-26 11:34:57 cloud_vm_ray_backend.py:997] Got error(s) in kubernetes:
W 10-26 11:34:57 cloud_vm_ray_backend.py:999] 	sky.skylet.providers.kubernetes.config.KubernetesError: Insufficient CPU capacity on the cluster. Other SkyPilot tasks or pods may be using resources. Check resource usage by running `kubectl describe nodes`
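
For reference, an UnboundLocalError like the one above typically comes from a variable that is only assigned inside a branch that never runs; a hypothetical illustration of the pattern (not the actual provider code):

def describe_scheduling_failure(events):
    # Hypothetical illustration of the bug pattern, not SkyPilot code.
    for event in events:
        if event.reason == 'FailedScheduling':
            event_message = event.message
    # If the pod is already scheduled (not Pending), no FailedScheduling
    # event exists, event_message is never assigned, and the line below
    # raises UnboundLocalError.
    return event_message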

@landscapepainter (Collaborator, Author) commented Oct 30, 2023

Had this error due to two reasons (a minimal sketch of the fix follows this list):

  1. Was not updating the pod status: the pods passed to _raise_pod_scheduling_errors via the new_nodes argument were not being re-read.
  2. Was attempting to read events of a pod that is already scheduled and no longer in Pending status.
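
A minimal sketch of the two fixes, assuming the provider's kubernetes adaptor (core_api(), as in the snippet earlier in this thread) and its KubernetesError; the function body is illustrative rather than the actual implementation:

from sky.adaptors import kubernetes  # assumed adaptor import, matching core_api() usage above
from sky.skylet.providers.kubernetes import config  # KubernetesError, per the logs above


def _raise_pod_scheduling_errors(namespace, new_nodes):
    # Illustrative sketch only; the real implementation differs.
    for node in new_nodes:
        # Fix 1: re-read each pod so its current status is inspected, rather
        # than the possibly stale objects passed in via new_nodes.
        pod = kubernetes.core_api().read_namespaced_pod(
            node.metadata.name, namespace)
        # Fix 2: only look for scheduling events on pods that are still
        # Pending; already-scheduled pods have no FailedScheduling event.
        if pod.status.phase != 'Pending':
            continue
        events = kubernetes.core_api().list_namespaced_event(
            namespace,
            field_selector=f'involvedObject.name={pod.metadata.name}')
        for event in events.items:
            if event.reason == 'FailedScheduling':
                raise config.KubernetesError(event.message)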

Fixed the two issues above. Setting excessive resources for either the jump pod or the node pod now fails with the correct error message. The following is displayed when an excessive amount of CPUs is requested for the jump pod:

$ sky launch --cloud kubernetes -y
I 10-30 05:57:34 optimizer.py:674] == Optimizer ==
I 10-30 05:57:34 optimizer.py:697] Estimated cost: $0.0 / hour
I 10-30 05:57:34 optimizer.py:697] 
I 10-30 05:57:34 optimizer.py:769] Considered resources (1 node):
I 10-30 05:57:34 optimizer.py:818] ---------------------------------------------------------------------------------------------
I 10-30 05:57:34 optimizer.py:818]  CLOUD        INSTANCE    vCPUs   Mem(GB)   ACCELERATORS   REGION/ZONE   COST ($)   CHOSEN   
I 10-30 05:57:34 optimizer.py:818] ---------------------------------------------------------------------------------------------
I 10-30 05:57:34 optimizer.py:818]  Kubernetes   2CPU--2GB   2       2         -              kubernetes    0.00          ✔     
I 10-30 05:57:34 optimizer.py:818] ---------------------------------------------------------------------------------------------
I 10-30 05:57:34 optimizer.py:818] 
Running task on cluster sky-xxxx-gcpuser...
I 10-30 05:57:34 cloud_vm_ray_backend.py:4382] Creating a new cluster: 'sky-xxxx-gcpuser' [1x Kubernetes(2CPU--2GB)].
I 10-30 05:57:34 cloud_vm_ray_backend.py:4382] Tip: to reuse an existing cluster, specify --cluster (-c). Run `sky status` to see existing clusters.
I 10-30 05:57:35 cloud_vm_ray_backend.py:1449] To view detailed progress: tail -n100 -f /home/gcpuser/sky_logs/sky-2023-10-30-05-57-33-829528/provision.log
I 10-30 05:57:36 cloud_vm_ray_backend.py:1884] Launching on Kubernetes 
W 10-30 05:57:48 cloud_vm_ray_backend.py:997] Got error(s) in kubernetes:
W 10-30 05:57:48 cloud_vm_ray_backend.py:999] 	sky.skylet.providers.kubernetes.config.KubernetesError: Insufficient CPU capacity on the cluster. Other SkyPilot tasks or pods may be using resources. Check resource usage by running `kubectl describe nodes`.
W 10-30 05:57:54 cloud_vm_ray_backend.py:2221] sky.exceptions.ResourcesUnavailableError: Failed to acquire resources in all zones in kubernetes. Try changing resource requirements or use another region.
W 10-30 05:57:54 cloud_vm_ray_backend.py:2230] 
W 10-30 05:57:54 cloud_vm_ray_backend.py:2230] Provision failed for 1x Kubernetes(2CPU--2GB) in kubernetes. Trying other locations (if any).
Clusters
No existing clusters.

sky.exceptions.ResourcesUnavailableError: Failed to provision all possible launchable resources. Relax the task's resource requirements: 1x Kubernetes()
To keep retrying until the cluster is up, use the `--retry-until-up` flag.

@romilbhardwaj (Collaborator)

Thanks for fixing this! We may also need to update kubernetes_utils.clean_zombie_ssh_jump_pod() to delete a pending SSH jump pod if it's out of resources. Currently, if the jump pod's CPU request cannot be satisfied, it is not cleaned up and stays in a Pending state indefinitely:

# Right after sky launch exits, the cluster pod is terminated
(base) ➜ k get pods
NAME                    READY   STATUS        RESTARTS   AGE
sky-ssh-jump-2ea485ef   0/1     Pending       0          13s
test-2ea4-ray-head      1/1     Terminating   0          13s


# But the ssh pod remains pending forever. We should clean this up in clean_zombie_ssh_jump_pod()
(base) ➜  k get pods
NAME                    READY   STATUS    RESTARTS   AGE
sky-ssh-jump-2ea485ef   0/1     Pending   0          15s

@landscapepainter (Collaborator, Author) left a comment

@romilbhardwaj Resolved the issue. This is ready for another look :)

@romilbhardwaj (Collaborator) left a comment

Thanks @landscapepainter! Ran into an issue with the cleanup of the SSH jump pod if provisioning fails; otherwise this PR is ready!

sky/skylet/providers/kubernetes/node_provider.py (outdated review thread, resolved)

@landscapepainter (Collaborator, Author) left a comment

@romilbhardwaj Thanks for the catch. I updated kubernetes_utils.clean_zombie_ssh_jump_pod() so that it deletes both the pod and the service when the jump pod is left dangling in Pending status. The jump pod is now removed when provisioning fails due to an error raised while launching it.
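
For illustration, the cleanup described here amounts to something like the sketch below, using the upstream Kubernetes client (the real logic lives in kubernetes_utils.clean_zombie_ssh_jump_pod() and covers more cases; the service is assumed here to share the jump pod's name):

from kubernetes import client, config


def _cleanup_pending_ssh_jump_pod(pod_name: str, namespace: str) -> None:
    # Hypothetical sketch of the behavior described above.
    config.load_kube_config()
    core_api = client.CoreV1Api()
    pod = core_api.read_namespaced_pod(pod_name, namespace)
    # A jump pod stuck in 'Pending' (e.g., an unsatisfiable CPU request)
    # after provisioning fails is a zombie: delete both the pod and its
    # companion service so they do not linger indefinitely.
    if pod.status.phase == 'Pending':
        core_api.delete_namespaced_pod(pod_name, namespace)
        core_api.delete_namespaced_service(pod_name, namespace)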

@romilbhardwaj (Collaborator) left a comment

Thanks @landscapepainter, this is good to go!

@landscapepainter merged commit 3fbde39 into skypilot-org:master Nov 7, 2023
18 checks passed
2 participants