[Provisioner] Parallelize wait for SSH in provisioner #4156

Closed
Michaelvll opened this issue Oct 23, 2024 · 0 comments · Fixed by #4158

Michaelvll commented Oct 23, 2024

A user reported that the provisioner waits for SSH access on each node sequentially, which becomes inefficient as the number of nodes increases. We should parallelize it.

def wait_for_ssh(cluster_info: provision_common.ClusterInfo,
                 ssh_credentials: Dict[str, str]):
    """Wait until SSH is ready.

    Raises:
        RuntimeError: If the SSH connection is not ready after timeout.
    """
    if (cluster_info.has_external_ips() and
            ssh_credentials.get('ssh_proxy_command') is None):
        # If we can access public IPs, then it is more efficient to test SSH
        # connection with raw sockets.
        waiter = _wait_ssh_connection_direct
    else:
        # See https://github.com/skypilot-org/skypilot/pull/1512
        waiter = _wait_ssh_connection_indirect
    ip_list = cluster_info.get_feasible_ips()
    port_list = cluster_info.get_ssh_ports()

    timeout = 60 * 10  # 10-min maximum timeout
    start = time.time()
    # use a queue for SSH querying
    ips = collections.deque(ip_list)
    ssh_ports = collections.deque(port_list)
    while ips:
        ip = ips.popleft()
        ssh_port = ssh_ports.popleft()
        success, stderr = waiter(ip, ssh_port, **ssh_credentials)
        if not success:
            ips.append(ip)
            ssh_ports.append(ssh_port)
            if time.time() - start > timeout:
                with ux_utils.print_exception_no_traceback():
                    raise RuntimeError(
                        f'Failed to SSH to {ip} after timeout {timeout}s, with '
                        f'{stderr}')
            logger.debug('Retrying in 1 second...')
            time.sleep(1)
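
One possible way to parallelize the loop above is to give each node its own polling task in a thread pool, with all tasks sharing a single deadline. The sketch below is only an illustration of that idea and is not necessarily what the fix in #4158 does. The names `wait_for_one`, `wait_for_ssh_parallel`, `retry_interval`, and `max_workers` are hypothetical, and the `waiter` callable is assumed to take `(ip, port)` and return `(success, stderr)`, which simplifies the real waiters that also receive the SSH credentials.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Tuple

# Assumed simplified signature: (ip, port) -> (success, stderr).
Waiter = Callable[[str, int], Tuple[bool, str]]


def wait_for_one(waiter: Waiter, ip: str, port: int, deadline: float,
                 timeout: float, retry_interval: float) -> None:
    """Poll one node until the waiter succeeds or the shared deadline passes."""
    while True:
        success, stderr = waiter(ip, port)
        if success:
            return
        if time.monotonic() > deadline:
            raise RuntimeError(
                f'Failed to SSH to {ip} after timeout {timeout}s, with '
                f'{stderr}')
        time.sleep(retry_interval)


def wait_for_ssh_parallel(waiter: Waiter,
                          ips: List[str],
                          ports: List[int],
                          timeout: float = 60 * 10,
                          retry_interval: float = 1.0,
                          max_workers: int = 32) -> None:
    """Wait for SSH on all nodes concurrently (hypothetical helper)."""
    if not ips:
        return
    # One deadline for the whole cluster, like the sequential version.
    deadline = time.monotonic() + timeout
    workers = min(max_workers, len(ips))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [
            pool.submit(wait_for_one, waiter, ip, port, deadline, timeout,
                        retry_interval)
            for ip, port in zip(ips, ports)
        ]
        for future in futures:
            # result() re-raises any RuntimeError from the worker thread.
            future.result()
```

With this shape, a slow node delays only its own task rather than every node queued behind it, so the total wait is roughly that of the slowest node as long as `max_workers` is at least the number of nodes. Threads are a reasonable fit here because the work is I/O-bound waiting on the network.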

Michaelvll added the P0 label Oct 23, 2024
asaiacai mentioned this issue Oct 23, 2024
Michaelvll added the OSS label Dec 19, 2024 (with Linear)
Michaelvll removed the OSS label Dec 19, 2024
Michaelvll added the OSS label Dec 19, 2024 (with Linear)