[Spot] Fix race condition for spot logs #1329

Merged on Oct 31, 2022 (5 commits)
Changes from 4 commits
16 changes: 11 additions & 5 deletions sky/spot/spot_utils.py
@@ -230,15 +230,21 @@ def stream_logs_by_id(job_id: int, follow: bool = True) -> str:
     cluster_name = generate_spot_cluster_name(task_name, job_id)
     backend = backends.CloudVmRayBackend()
     spot_status = spot_state.get_status(job_id)
-    while not spot_status.is_terminal():
-        if spot_status != spot_state.SpotStatus.RUNNING:
-            logger.info(f'INFO: The log is not ready yet, as the spot job '
-                        f'is {spot_status.value}. '
+    while spot_status is None or not spot_status.is_terminal():
+        handle = global_user_state.get_handle_from_cluster_name(cluster_name)
+        # Check the handle: The cluster can be removed from the table before the
+        # spot state is updated by the controller. In this case, we should skip
+        # the logging, and wait for the next round of status check.
+        if handle is None or spot_status != spot_state.SpotStatus.RUNNING:
+            status_help_str = ''
+            if (spot_status is not None and
+                    spot_status != spot_state.SpotStatus.RUNNING):
+                status_help_str = f', as the spot job is {spot_status.value}'
+            logger.info(f'INFO: The log is not ready yet{status_help_str}. '
                         f'Waiting for {JOB_STATUS_CHECK_GAP_SECONDS} seconds.')
             time.sleep(JOB_STATUS_CHECK_GAP_SECONDS)
             spot_status = spot_state.get_status(job_id)
             continue
-        handle = global_user_state.get_handle_from_cluster_name(cluster_name)
         returncode = backend.tail_logs(handle,
                                        job_id=None,
                                        spot_job_id=job_id,
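
The race that this change guards against can be reproduced in isolation. The sketch below is not SkyPilot code: `Status`, `wait_and_tail`, `run_demo`, and the scripted status and handle sequences are hypothetical stand-ins, and re-reading the status after each tail is an assumption about the part of the loop not shown in the diff. It only illustrates the pattern of fetching the handle at the start of every round and skipping the round when either the status or the handle is missing.

```python
import time
from enum import Enum
from typing import Callable, List, Optional, Tuple


class Status(Enum):
    """Hypothetical stand-in for spot_state.SpotStatus."""
    PENDING = 'PENDING'
    RUNNING = 'RUNNING'
    SUCCEEDED = 'SUCCEEDED'

    def is_terminal(self) -> bool:
        return self is Status.SUCCEEDED


def wait_and_tail(get_status: Callable[[], Optional[Status]],
                  get_handle: Callable[[], Optional[str]],
                  tail: Callable[[str], None],
                  gap_seconds: float = 0.0) -> int:
    """Poll until the job is terminal; tail only when a handle exists.

    Returns the number of rounds skipped because the log was not ready.
    """
    skipped = 0
    status = get_status()
    # A missing status (None) is treated as "not terminal yet".
    while status is None or not status.is_terminal():
        # Fetch the handle first, so a vanished cluster is seen before tailing.
        handle = get_handle()
        if handle is None or status != Status.RUNNING:
            skipped += 1
            time.sleep(gap_seconds)
            status = get_status()
            continue
        tail(handle)
        # Assumption: the status is re-read after each tail attempt.
        status = get_status()
    return skipped


def run_demo() -> Tuple[int, List[str]]:
    """Replay a scripted race: the handle disappears while status is RUNNING."""
    statuses = iter([None, Status.PENDING, Status.RUNNING, Status.RUNNING,
                     Status.SUCCEEDED])
    handles = iter([None, None, 'cluster-handle', None])
    tailed: List[str] = []
    skipped = wait_and_tail(lambda: next(statuses), lambda: next(handles),
                            tailed.append)
    return skipped, tailed


if __name__ == '__main__':
    print(run_demo())
```

In this scripted run the fourth round sees a `RUNNING` status with no handle, which is the window described in the code comment above. The loop waits for the next status check instead of passing `None` to the tailing function.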