Even though #4158 significantly improves multi-node provisioning time on k8s by parallelizing SSH setup, large jobs (50+ nodes) can still take a long time (~10 min, depending on the degree of parallelism and the number of CPU cores) to get SSH up and running on all pods.
From user:
Is it possible to make it (SSH setup) "on demand"? For example, sky ssh host_name that sets up the SSH connection and then connects to it?
My 2 cents is that an SSH connection is usually not necessary for these long-running training jobs, or at least not at launch time, since it is mostly there for user convenience. Additionally, we could also get a shell using tools like k9s. So it's desirable to cut the setup time as much as possible by making SSH setup optional. This also reduces the chance of timeouts, etc.
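To make the "on demand" idea concrete, here is a minimal sketch of lazy per-pod setup: SSH setup runs the first time a connection to a pod is requested and is skipped afterwards. The names `OnDemandSSH`, `ensure_ssh`, and `setup_fn` are hypothetical and are not part of SkyPilot's API; `setup_fn` stands in for whatever the per-pod SSH setup step actually is.

```python
from typing import Callable, Set


class OnDemandSSH:
    """Runs SSH setup for a pod the first time a connection is requested.

    Hypothetical illustration only; not SkyPilot's implementation.
    """

    def __init__(self, setup_fn: Callable[[str], None]) -> None:
        # setup_fn is an assumed stand-in for the per-pod SSH setup step.
        self._setup_fn = setup_fn
        # Names of pods whose SSH setup has already completed.
        self._ready: Set[str] = set()

    def ensure_ssh(self, pod_name: str) -> bool:
        """Set up SSH on pod_name if needed; return True if setup ran now."""
        if pod_name in self._ready:
            return False
        self._setup_fn(pod_name)
        # Mark the pod ready only after setup returns without raising.
        self._ready.add(pod_name)
        return True


if __name__ == "__main__":
    calls = []
    ssh = OnDemandSSH(setup_fn=calls.append)
    ssh.ensure_ssh("pod-3")  # first request: setup runs
    ssh.ensure_ssh("pod-3")  # second request: setup is skipped
    print(calls)
```

With this shape, launching a 50-node job would pay no SSH setup cost up front, and only the pods a user actually connects to would be set up.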