
[k8s] Ensure Jump Pod is in "Running" Status Before Proceeding #2589

sky/skylet/providers/kubernetes/node_provider.py (33 changes: 23 additions, 10 deletions)
@@ -164,17 +164,25 @@ def _set_node_tags(self, node_id, tags):

     def _raise_pod_scheduling_errors(self, new_nodes):
         for new_node in new_nodes:
-            pod_status = new_node.status.phase
-            pod_name = new_node._metadata._name
+            pod = kubernetes.core_api().read_namespaced_pod(
+                new_node.metadata.name, self.namespace)
+            pod_status = pod.status.phase
+            # When there are multiple pods involved while launching instance,
+            # there may be a single pod causing issue while others are
+            # scheduled. In this case, we make sure to not surface the error
+            # message from the pod that is already scheduled.
+            if pod_status != 'Pending':
+                continue
+            pod_name = pod._metadata._name
             events = kubernetes.core_api().list_namespaced_event(
                 self.namespace,
                 field_selector=(f'involvedObject.name={pod_name},'
                                 'involvedObject.kind=Pod'))
             # Events created in the past hours are kept by
             # Kubernetes python client and we want to surface
             # the latest event message
-            events_desc_by_time = \
-                sorted(events.items,
+            events_desc_by_time = sorted(
+                events.items,
                 key=lambda e: e.metadata.creation_timestamp,
                 reverse=True)
             for event in events_desc_by_time:
@@ -200,8 +208,8 @@ def _raise_pod_scheduling_errors(self, new_nodes):
                         lf.get_label_key()
                         for lf in kubernetes_utils.LABEL_FORMATTER_REGISTRY
                     ]
-                    if new_node.spec.node_selector:
-                        for label_key in new_node.spec.node_selector.keys():
+                    if pod.spec.node_selector:
+                        for label_key in pod.spec.node_selector.keys():
                             if label_key in gpu_lf_keys:
                                 # TODO(romilb): We may have additional node
                                 #  affinity selectors in the future - in that
@@ -210,7 +218,7 @@ def _raise_pod_scheduling_errors(self, new_nodes):
                                     'didn\'t match Pod\'s node affinity/selector' in event_message:
                                     raise config.KubernetesError(
                                         f'{lack_resource_msg.format(resource="GPU")} '
-                                        f'Verify if {new_node.spec.node_selector[label_key]}'
+                                        f'Verify if {pod.spec.node_selector[label_key]}'
                                         ' is available in the cluster.')
                 raise config.KubernetesError(f'{timeout_err_msg} '
                                              f'Pod status: {pod_status}'
@@ -257,9 +265,14 @@ def create_node(self, node_config, tags, count):
                 self.namespace, service_spec)
             new_svcs.append(svc)

-        # Wait for all pods to be ready, and if it exceeds the timeout, raise an
-        # exception. If pod's container is ContainerCreating, then we can assume
-        # that resources have been allocated and we can exit.
+        # Wait for all pods including jump pod to be ready, and if it
+        # exceeds the timeout, raise an exception. If pod's container
+        # is ContainerCreating, then we can assume that resources have been
+        # allocated and we can exit.
+        ssh_jump_pod_name = conf['metadata']['labels']['skypilot-ssh-jump']
+        jump_pod = kubernetes.core_api().read_namespaced_pod(
+            ssh_jump_pod_name, self.namespace)
+        new_nodes.append(jump_pod)
A collaborator (reviewer) commented:

I tried to test this by editing kubernetes-ssh-jump.yml.j2 to request a large amount of CPU for the SSH jump pod (so that it fails to get scheduled):

containers:
  - ...
    resources:
      requests:
        cpu: 20

When I ran `sky launch -c test -- echo hi`, I got this error (failover continued as expected, but the error message was confusing):

D 10-26 11:32:53 cloud_vm_ray_backend.py:2006] `ray up` takes 11.3 seconds with 1 retries.
W 10-26 11:32:53 cloud_vm_ray_backend.py:997] Got error(s) in kubernetes:
W 10-26 11:32:53 cloud_vm_ray_backend.py:999] 	sky.skylet.providers.kubernetes.config.KubernetesError: An error occurred while trying to fetch the reason for pod scheduling failure. Error: UnboundLocalError: local variable 'event_message' referenced before assignment
...

On the second run, I got:

W 10-26 11:36:14 cloud_vm_ray_backend.py:999] 	sky.skylet.providers.kubernetes.config.KubernetesError: Timed out while waiting for nodes to start. Cluster may be out of resources or may be too slow to autoscale. Pod status: PendingDetails: 'skip schedule deleting pod: default/test-2ea4-ray-head'

Both of these messages appeared while `kubectl describe pod` was showing a clear failed-scheduling message:

Events:
  Type     Reason            Age   From               Message
  ----     ------            ----  ----               -------
  Warning  FailedScheduling  8s    default-scheduler  0/1 nodes are available: 1 Insufficient cpu. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod.

Compare this to our regular message when a task requests too many CPUs (e.g., `sky launch --cpus 8`), which clearly states that the cluster is out of CPU and suggests debugging steps. Can we have the same message for the jump pod too?

W 10-26 11:34:57 cloud_vm_ray_backend.py:997] Got error(s) in kubernetes:
W 10-26 11:34:57 cloud_vm_ray_backend.py:999] 	sky.skylet.providers.kubernetes.config.KubernetesError: Insufficient CPU capacity on the cluster. Other SkyPilot tasks or pods may be using resources. Check resource usage by running `kubectl describe nodes`
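
For reference, here is a minimal, illustrative sketch of how a pending jump pod's FailedScheduling events could be translated into that friendlier message. The helper name is hypothetical and it uses the plain kubernetes Python client rather than the provider's wrapper; the real handling would live in _raise_pod_scheduling_errors.

from kubernetes import client, config as k8s_config

def friendly_scheduling_error(pod_name: str, namespace: str = 'default') -> str:
    """Return a human-readable reason why a pod is stuck in Pending."""
    k8s_config.load_kube_config()
    core = client.CoreV1Api()
    events = core.list_namespaced_event(
        namespace,
        field_selector=(f'involvedObject.name={pod_name},'
                        'involvedObject.kind=Pod'))
    # Newest events first, mirroring the ordering used in node_provider.py.
    for event in sorted(events.items,
                        key=lambda e: e.metadata.creation_timestamp,
                        reverse=True):
        message = event.message or ''
        if event.reason == 'FailedScheduling' and 'Insufficient cpu' in message:
            return ('Insufficient CPU capacity on the cluster. Other SkyPilot '
                    'tasks or pods may be using resources. Check resource '
                    'usage by running `kubectl describe nodes`.')
    return 'Pod is pending; run `kubectl describe pod` for details.'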

@landscapepainter (Collaborator, Author) commented on Oct 30, 2023:

This error had two causes:

  1. The pod status was not being refreshed by re-reading the pods passed in via the new_nodes argument of _raise_pod_scheduling_errors.
  2. Events were being read for a pod that was already scheduled and no longer in Pending status.

Both are fixed now, and requesting excessive resources for either the jump pod or the task pod fails with the correct error message (a condensed sketch of the fix follows the log below). The following is displayed when an excessive number of CPUs is requested for the jump pod:

$ sky launch --cloud kubernetes -y
I 10-30 05:57:34 optimizer.py:674] == Optimizer ==
I 10-30 05:57:34 optimizer.py:697] Estimated cost: $0.0 / hour
I 10-30 05:57:34 optimizer.py:697] 
I 10-30 05:57:34 optimizer.py:769] Considered resources (1 node):
I 10-30 05:57:34 optimizer.py:818] ---------------------------------------------------------------------------------------------
I 10-30 05:57:34 optimizer.py:818]  CLOUD        INSTANCE    vCPUs   Mem(GB)   ACCELERATORS   REGION/ZONE   COST ($)   CHOSEN   
I 10-30 05:57:34 optimizer.py:818] ---------------------------------------------------------------------------------------------
I 10-30 05:57:34 optimizer.py:818]  Kubernetes   2CPU--2GB   2       2         -              kubernetes    0.00          ✔     
I 10-30 05:57:34 optimizer.py:818] ---------------------------------------------------------------------------------------------
I 10-30 05:57:34 optimizer.py:818] 
Running task on cluster sky-xxxx-gcpuser...
I 10-30 05:57:34 cloud_vm_ray_backend.py:4382] Creating a new cluster: 'sky-xxxx-gcpuser' [1x Kubernetes(2CPU--2GB)].
I 10-30 05:57:34 cloud_vm_ray_backend.py:4382] Tip: to reuse an existing cluster, specify --cluster (-c). Run `sky status` to see existing clusters.
I 10-30 05:57:35 cloud_vm_ray_backend.py:1449] To view detailed progress: tail -n100 -f /home/gcpuser/sky_logs/sky-2023-10-30-05-57-33-829528/provision.log
I 10-30 05:57:36 cloud_vm_ray_backend.py:1884] Launching on Kubernetes 
W 10-30 05:57:48 cloud_vm_ray_backend.py:997] Got error(s) in kubernetes:
W 10-30 05:57:48 cloud_vm_ray_backend.py:999] 	sky.skylet.providers.kubernetes.config.KubernetesError: Insufficient CPU capacity on the cluster. Other SkyPilot tasks or pods may be using resources. Check resource usage by running `kubectl describe nodes`.
W 10-30 05:57:54 cloud_vm_ray_backend.py:2221] sky.exceptions.ResourcesUnavailableError: Failed to acquire resources in all zones in kubernetes. Try changing resource requirements or use another region.
W 10-30 05:57:54 cloud_vm_ray_backend.py:2230] 
W 10-30 05:57:54 cloud_vm_ray_backend.py:2230] Provision failed for 1x Kubernetes(2CPU--2GB) in kubernetes. Trying other locations (if any).
Clusters
No existing clusters.

sky.exceptions.ResourcesUnavailableError: Failed to provision all possible launchable resources. Relax the task's resource requirements: 1x Kubernetes()
To keep retrying until the cluster is up, use the `--retry-until-up` flag.
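
A condensed, standalone sketch of the two fixes listed above, using the plain kubernetes Python client instead of the provider's kubernetes.core_api() wrapper (the function name is illustrative):

from kubernetes import client, config as k8s_config

def surface_scheduling_errors(pod_names, namespace='default'):
    """Raise a descriptive error for any pod that is still Pending."""
    k8s_config.load_kube_config()
    core = client.CoreV1Api()
    for name in pod_names:
        # Fix 1: re-read the pod so we see its current status rather than
        # the possibly stale object captured at creation time.
        pod = core.read_namespaced_pod(name, namespace)
        # Fix 2: only inspect events for pods that are still Pending;
        # already-scheduled pods would surface misleading messages.
        if pod.status.phase != 'Pending':
            continue
        events = core.list_namespaced_event(
            namespace,
            field_selector=(f'involvedObject.name={name},'
                            'involvedObject.kind=Pod'))
        latest_first = sorted(events.items,
                              key=lambda e: e.metadata.creation_timestamp,
                              reverse=True)
        if latest_first:
            raise RuntimeError(
                f'Pod {name} failed to schedule: {latest_first[0].message}')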

A collaborator (reviewer) replied:

Thanks for fixing this! We may also need to update kubernetes_utils.clean_zombie_ssh_jump_pod() to delete a pending SSH jump pod if it's out of resources. Currently, if the jump pod's CPU request cannot be satisfied, it is not cleaned up and stays in a pending state indefinitely:

# Right after sky launch exits, the cluster pod is terminated
(base) ➜ k get pods
NAME                    READY   STATUS        RESTARTS   AGE
sky-ssh-jump-2ea485ef   0/1     Pending       0          13s
test-2ea4-ray-head      1/1     Terminating   0          13s


# But the ssh pod remains pending forever. We should clean this up in clean_zombie_ssh_jump_pod()
(base) ➜  k get pods
NAME                    READY   STATUS    RESTARTS   AGE
sky-ssh-jump-2ea485ef   0/1     Pending   0          15s
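
For illustration, a minimal standalone sketch of the cleanup check being requested here, essentially what the kubernetes_utils.py diff below adds to clean_zombie_ssh_jump_pod(). It uses the plain kubernetes Python client and assumes the jump service shares the pod's name; the function name is illustrative.

from kubernetes import client, config as k8s_config

def cleanup_pending_ssh_jump_pod(ssh_jump_name, namespace='default'):
    """Delete the SSH jump pod and its service if the pod never became ready."""
    k8s_config.load_kube_config()
    core = client.CoreV1Api()
    pod = core.read_namespaced_pod(ssh_jump_name, namespace)
    conditions = pod.status.conditions or []
    cont_ready = next(
        (c for c in conditions if c.type == 'ContainersReady'), None)
    not_ready = cont_ready is not None and cont_ready.status == 'False'
    # A Pending pod never got scheduled (e.g. insufficient CPU) and would
    # otherwise linger forever, so remove it along with its service.
    if not_ready or pod.status.phase == 'Pending':
        core.delete_namespaced_pod(ssh_jump_name, namespace)
        core.delete_namespaced_service(ssh_jump_name, namespace)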

         start = time.time()
         while True:
             if time.time() - start > self.timeout:
sky/utils/kubernetes_utils.py (13 changes: 7 additions, 6 deletions)
@@ -945,12 +945,13 @@ def find(l, predicate):
             ssh_jump_name, namespace)
         cont_ready_cond = find(ssh_jump_pod.status.conditions,
                                lambda c: c.type == 'ContainersReady')
-        if cont_ready_cond and \
-            cont_ready_cond.status == 'False':
-            # The main container is not ready. To be on the safe side
-            # and prevent a dangling ssh jump pod, lets remove it and
-            # the service. Otherwise main container is ready and its lifecycle
-            # management script takes care of the cleaning.
+        if (cont_ready_cond and cont_ready_cond.status
+                == 'False') or ssh_jump_pod.status.phase == 'Pending':
+            # Either the main container is not ready or the pod failed
+            # to schedule. To be on the safe side and prevent a dangling
+            # ssh jump pod, lets remove it and the service. Otherwise, main
+            # container is ready and its lifecycle management script takes
+            # care of the cleaning.
             kubernetes.core_api().delete_namespaced_pod(ssh_jump_name,
                                                         namespace)
             kubernetes.core_api().delete_namespaced_service(