
[k8s][ux] Auto-exclude stale Kubernetes cloud #2807

Open
romilbhardwaj opened this issue Nov 21, 2023 · 8 comments
Labels
k8s Kubernetes related items

Comments

@romilbhardwaj
Collaborator

I often terminate a Kubernetes cluster externally using the cloud console/CLI (e.g., gcloud container clusters delete <cluster-name> --zone us-central1-c), but I forget to run sky check to update the list of enabled clouds.

As a result, the next sky launch fails:

sky.exceptions.ResourcesUnavailableError: Timed out when trying to get node info from Kubernetes cluster. Please check if the cluster is healthy and retry.

We should consider printing a warning and continuing by either:

  1. Excluding Kubernetes from the list of clouds considered by the optimizer
  2. Removing Kubernetes from the list of enabled clouds stored in global user state.

Option 1 is less aggressive and doesn't require the user to re-run sky check in case the failure was transient.
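Option 1 could be implemented as a pre-optimization filter: probe each enabled cloud, warn about unreachable ones, and hand the optimizer only the healthy subset. A minimal sketch follows; the names (`check_cloud_healthy`, `clouds_for_optimizer`) are hypothetical illustrations, not SkyPilot's actual API.

```python
# Sketch of option 1: skip unreachable clouds for this invocation only,
# without mutating the persisted enabled-clouds list in global user state.
# All names below are hypothetical, not SkyPilot's real API.
import warnings


def check_cloud_healthy(cloud: str) -> bool:
    # Placeholder health probe. A real implementation would, e.g., try to
    # list nodes on the Kubernetes cluster with a short timeout.
    return cloud != "kubernetes-stale"


def clouds_for_optimizer(enabled_clouds: list[str]) -> list[str]:
    """Return only the clouds that pass a health check, warning on the rest."""
    usable = []
    for cloud in enabled_clouds:
        if check_cloud_healthy(cloud):
            usable.append(cloud)
        else:
            # Warn and continue instead of failing the whole launch.
            warnings.warn(
                f"{cloud} is unreachable; excluding it from optimization. "
                "Run `sky check` to refresh enabled clouds.")
    return usable
```

Because the filter never touches the stored enabled-clouds list, a transient outage recovers on the next launch without any user action, which is the advantage of option 1 over option 2.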

@Michaelvll
Collaborator

Michaelvll commented Feb 5, 2024

This is also related to #3013

@kbrgl
Contributor

kbrgl commented Feb 24, 2024

Going to self-assign and work on this!

This issue is stale because it has been open 120 days with no activity. Remove stale label or comment or this will be closed in 10 days.


@github-actions github-actions bot added the Stale label Oct 23, 2024

github-actions bot commented Nov 3, 2024

This issue was closed because it has been stalled for 10 days with no activity.

@github-actions github-actions bot closed this as not planned (won't fix, can't repro, duplicate, stale) Nov 3, 2024
@chris-aeviator

chris-aeviator commented Dec 29, 2024

I'm running into this after renewing my k8s cert in my kubeconfig. I can see the pods as unhealthy; there might have been another issue.

However, I'm unable to start new clusters on that k8s cluster due to this error.

Update

It seems like the error is actually swallowing a real error (in my case, BAD_BASE64_DECODE), which I can only see when executing the purge command.
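One way to surface this class of error directly is to decode the base64 `*-data` fields in the kubeconfig yourself and see which entry fails. The sketch below assumes the standard kubeconfig layout (`clusters`/`users` lists with `certificate-authority-data`, `client-certificate-data`, etc.); the helper itself is illustrative, not part of SkyPilot.

```python
# Sketch: find kubeconfig entries whose base64 certificate data is corrupt,
# which would otherwise surface as a BAD_BASE64_DECODE error deep in the
# TLS stack. The helper name is hypothetical; the field layout follows the
# standard kubeconfig schema.
import base64
import binascii


def find_bad_base64(kubeconfig: dict) -> list[str]:
    """Return names of clusters/users whose *-data fields fail to decode."""
    bad = []
    for section, key in (("clusters", "cluster"), ("users", "user")):
        for entry in kubeconfig.get(section, []):
            body = entry.get(key, {})
            for field, value in body.items():
                if field.endswith("-data"):
                    try:
                        # validate=True rejects non-alphabet characters
                        # instead of silently discarding them.
                        base64.b64decode(value, validate=True)
                    except binascii.Error:
                        bad.append(f"{section}/{entry.get('name')}: {field}")
    return bad
```

Running this over a parsed kubeconfig (e.g. loaded with PyYAML) pinpoints the corrupt field, which supports the suggestion above that the real underlying error should be propagated rather than swallowed by the generic timeout message.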

@romilbhardwaj
Collaborator Author

Thanks for the report @chris-aeviator - that sounds bad. Can you share the full output log and the commands you ran so I can reproduce it?

@romilbhardwaj
Collaborator Author

Nvm @chris-aeviator, I can reproduce this. Looks like a recent regression from #4443. Being fixed in #4514 - can you give that branch a try and see if it fixes your issue too?
