[k8s] Ensure Jump Pod is in "Running" Status Before Proceeding #2589
Conversation
Thanks @landscapepainter - left a comment.
@romilbhardwaj This is ready for another look
Thanks @landscapepainter! This works, but there seem to be some bugs in the error handling/surfacing. See comments.
ssh_jump_pod_name = conf['metadata']['labels']['skypilot-ssh-jump']
jump_pod = kubernetes.core_api().read_namespaced_pod(
    ssh_jump_pod_name, self.namespace)
new_nodes.append(jump_pod)
I tried to test this by editing kubernetes-ssh-jump.yml.j2 to request a large amount of CPU for the SSH jump pod (so that it fails to get scheduled):
# containers:
...
  resources:
    requests:
      cpu: 20
When I ran sky launch -c test -- echo hi, I got this error (failover continued as expected, but the error message was confusing):
D 10-26 11:32:53 cloud_vm_ray_backend.py:2006] `ray up` takes 11.3 seconds with 1 retries.
W 10-26 11:32:53 cloud_vm_ray_backend.py:997] Got error(s) in kubernetes:
W 10-26 11:32:53 cloud_vm_ray_backend.py:999] sky.skylet.providers.kubernetes.config.KubernetesError: An error occurred while trying to fetch the reason for pod scheduling failure. Error: UnboundLocalError: local variable 'event_message' referenced before assignment
...
On the second run, I got:
W 10-26 11:36:14 cloud_vm_ray_backend.py:999] sky.skylet.providers.kubernetes.config.KubernetesError: Timed out while waiting for nodes to start. Cluster may be out of resources or may be too slow to autoscale. Pod status: PendingDetails: 'skip schedule deleting pod: default/test-2ea4-ray-head'
Both of these messages appeared while kubectl describe pod was giving a clear failed scheduling message:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 8s default-scheduler 0/1 nodes are available: 1 Insufficient cpu. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod.
Compare this to our regular message when a task requests too many CPUs (e.g., sky launch --cpus 8), which clearly states that the cluster is out of CPU and suggests debugging steps. Can we have the same message for the jump pod too?
W 10-26 11:34:57 cloud_vm_ray_backend.py:997] Got error(s) in kubernetes:
W 10-26 11:34:57 cloud_vm_ray_backend.py:999] sky.skylet.providers.kubernetes.config.KubernetesError: Insufficient CPU capacity on the cluster. Other SkyPilot tasks or pods may be using resources. Check resource usage by running `kubectl describe nodes`
Had this error due to two reasons:
- Was not updating the pod status by re-reading the new_nodes argument passed to _raise_pod_scheduling_errors (see the sketch after this list).
- Was attempting to read an event of a pod that is already scheduled and is not in Pending status.
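A minimal sketch of the corrected flow, assuming the kubernetes.core_api() wrapper shown in the diff above and the provider's config.KubernetesError; the helper body is illustrative, not the PR's exact implementation:

```python
# Illustrative only: refresh each pod before inspecting it, and only look up
# scheduling events for pods that are still Pending.
def _raise_pod_scheduling_errors(namespace, new_nodes):
    for pod in new_nodes:
        # Re-read the pod so we see its current phase, not the (possibly
        # stale) phase it had when it was appended to new_nodes.
        pod = kubernetes.core_api().read_namespaced_pod(
            pod.metadata.name, namespace)
        if pod.status.phase != 'Pending':
            # Already-scheduled pods (e.g. Running) have no FailedScheduling
            # event to surface, so skip them.
            continue
        events = kubernetes.core_api().list_namespaced_event(
            namespace,
            field_selector=f'involvedObject.name={pod.metadata.name}')
        for event in events.items:
            if event.reason == 'FailedScheduling':
                raise config.KubernetesError(
                    f'Pod scheduling failed: {event.message}')
```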
Fixed both of the above; setting excessive resources for either the jump pod or the task pod now fails with the correct error message. The following is displayed when an excessive number of CPUs is requested for the jump pod:
$ sky launch --cloud kubernetes -y
I 10-30 05:57:34 optimizer.py:674] == Optimizer ==
I 10-30 05:57:34 optimizer.py:697] Estimated cost: $0.0 / hour
I 10-30 05:57:34 optimizer.py:697]
I 10-30 05:57:34 optimizer.py:769] Considered resources (1 node):
I 10-30 05:57:34 optimizer.py:818] ---------------------------------------------------------------------------------------------
I 10-30 05:57:34 optimizer.py:818] CLOUD INSTANCE vCPUs Mem(GB) ACCELERATORS REGION/ZONE COST ($) CHOSEN
I 10-30 05:57:34 optimizer.py:818] ---------------------------------------------------------------------------------------------
I 10-30 05:57:34 optimizer.py:818] Kubernetes 2CPU--2GB 2 2 - kubernetes 0.00 ✔
I 10-30 05:57:34 optimizer.py:818] ---------------------------------------------------------------------------------------------
I 10-30 05:57:34 optimizer.py:818]
Running task on cluster sky-xxxx-gcpuser...
I 10-30 05:57:34 cloud_vm_ray_backend.py:4382] Creating a new cluster: 'sky-xxxx-gcpuser' [1x Kubernetes(2CPU--2GB)].
I 10-30 05:57:34 cloud_vm_ray_backend.py:4382] Tip: to reuse an existing cluster, specify --cluster (-c). Run `sky status` to see existing clusters.
I 10-30 05:57:35 cloud_vm_ray_backend.py:1449] To view detailed progress: tail -n100 -f /home/gcpuser/sky_logs/sky-2023-10-30-05-57-33-829528/provision.log
I 10-30 05:57:36 cloud_vm_ray_backend.py:1884] Launching on Kubernetes
W 10-30 05:57:48 cloud_vm_ray_backend.py:997] Got error(s) in kubernetes:
W 10-30 05:57:48 cloud_vm_ray_backend.py:999] sky.skylet.providers.kubernetes.config.KubernetesError: Insufficient CPU capacity on the cluster. Other SkyPilot tasks or pods may be using resources. Check resource usage by running `kubectl describe nodes`.
W 10-30 05:57:54 cloud_vm_ray_backend.py:2221] sky.exceptions.ResourcesUnavailableError: Failed to acquire resources in all zones in kubernetes. Try changing resource requirements or use another region.
W 10-30 05:57:54 cloud_vm_ray_backend.py:2230]
W 10-30 05:57:54 cloud_vm_ray_backend.py:2230] Provision failed for 1x Kubernetes(2CPU--2GB) in kubernetes. Trying other locations (if any).
Clusters
No existing clusters.
sky.exceptions.ResourcesUnavailableError: Failed to provision all possible launchable resources. Relax the task's resource requirements: 1x Kubernetes()
To keep retrying until the cluster is up, use the `--retry-until-up` flag.
Thanks for fixing this! We may also need to update kubernetes_utils.clean_zombie_ssh_jump_pod() to delete a pending SSH jump pod if it's out of resources. Currently, if the jump pod's CPU resources are not satisfied, then it is not cleaned up and stays indefinitely in a pending state:
# Right after sky launch exits, the cluster pod is terminated
(base) ➜ k get pods
NAME READY STATUS RESTARTS AGE
sky-ssh-jump-2ea485ef 0/1 Pending 0 13s
test-2ea4-ray-head 1/1 Terminating 0 13s
# But the ssh pod remains pending forever. We should clean this up in clean_zombie_ssh_jump_pod()
(base) ➜ k get pods
NAME READY STATUS RESTARTS AGE
sky-ssh-jump-2ea485ef 0/1 Pending 0 15s
@romilbhardwaj Resolved the issue. This is ready for another look :)
Thanks @landscapepainter! Ran into an issue with the cleanup of ssh jump pod if provisioning fails, otherwise this PR is ready!
@romilbhardwaj Thanks for the catch. I updated kubernetes_utils.clean_zombie_ssh_jump_pod() so that it deletes the pod and the service when the jump pod is in a dangling Pending status. The jump pod is now removed when provisioning fails due to an error raised while launching the jump pod.
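For reference, a rough sketch of the added check, assuming the kubernetes.core_api() wrapper and that the jump pod's service shares its name; names and structure are illustrative rather than the exact implementation:

```python
# Illustrative only: if the SSH jump pod never left Pending (e.g. it could not
# be scheduled), delete both the pod and its companion service so they do not
# linger after a failed launch.
def clean_zombie_ssh_jump_pod(namespace, ssh_jump_name):
    pod = kubernetes.core_api().read_namespaced_pod(ssh_jump_name, namespace)
    if pod.status.phase == 'Pending':
        kubernetes.core_api().delete_namespaced_pod(ssh_jump_name, namespace)
        kubernetes.core_api().delete_namespaced_service(ssh_jump_name,
                                                        namespace)
```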
Thanks @landscapepainter, this is good to go!
Within node_provider.py/create_node(), we previously waited for the node pod to reach the "Running" status. This ensured the handling of potential errors, such as image pull failures, and allowed for the proper setting of environment variables. With the introduction of the jump pod creation process during provisioning, we also need to ensure the jump pod is operational and in the "Running" status. This step prevents scenarios where the node pod becomes active while the jump pod encounters errors and fails to launch.
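A minimal sketch of the kind of wait this adds for the jump pod, assuming the kubernetes.core_api() wrapper and a hypothetical helper name and timeout; the actual logic lives in node_provider.py/create_node():

```python
import time

# Illustrative only: poll a pod until it reaches Running, surfacing an error
# instead of proceeding while the pod is stuck (e.g. unschedulable or failing
# to pull its image).
def _wait_for_pod_running(namespace, pod_name, timeout=300, poll_interval=5):
    start = time.time()
    while time.time() - start < timeout:
        pod = kubernetes.core_api().read_namespaced_pod(pod_name, namespace)
        if pod.status.phase == 'Running':
            return
        time.sleep(poll_interval)
    raise config.KubernetesError(
        f'Pod {pod_name} did not reach Running status within {timeout}s.')
```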
Tested (run the relevant ones):
- bash format.sh
- sky launch with kind/GKE