GCP/provisioner: Handle the RESOURCE_NOT_FOUND error. #1842

Merged
merged 2 commits into from
Apr 8, 2023
21 changes: 21 additions & 0 deletions sky/backends/cloud_vm_ray_backend.py
@@ -657,6 +657,15 @@ def _update_blocklist_on_gcp_error(
             # {'code': 8, 'message': 'There is no more capacity in the zone "europe-west4-a"; you can try in another zone where Cloud TPU Nodes are offered (see https://cloud.google.com/tpu/docs/regions) [EID: 0x1bc8f9d790be9142]'} # pylint: disable=line-too-long
             self._blocked_resources.add(
                 launchable_resources.copy(zone=zone.name))
+        elif code == 'RESOURCE_NOT_FOUND':
+            # https://github.com/skypilot-org/skypilot/issues/1797
+            # In the inner provision loop we have used retries to
+            # recover but failed. This indicates this zone is most
+            # likely out of capacity. The provision loop will terminate
+            # any potentially live VMs before moving onto the next
+            # zone.
+            self._blocked_resources.add(
+                launchable_resources.copy(zone=zone.name))
         else:
             assert False, error
 elif len(httperror_str) >= 1:
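
For readers skimming this hunk, the new branch amounts to blocking a zone-specific copy of the requested resources. Below is a minimal, self-contained sketch of that idea. `Resources`, `update_blocklist`, and the instance type and zone values are hypothetical stand-ins for illustration, not SkyPilot's actual classes or API, and only the `RESOURCE_NOT_FOUND` branch is modeled.

```python
from dataclasses import dataclass, replace
from typing import Optional, Set


@dataclass(frozen=True)
class Resources:
    """Hypothetical stand-in for the real launchable resources object."""
    instance_type: str
    zone: Optional[str] = None

    def copy(self, **overrides) -> 'Resources':
        # Return a new object with the given fields overridden.
        return replace(self, **overrides)


def update_blocklist(blocked: Set[Resources], resources: Resources,
                     zone: str, code: str) -> None:
    """Block `resources` in `zone` when the error code suggests no capacity."""
    if code == 'RESOURCE_NOT_FOUND':
        # Retries in the inner loop already failed, so treat the zone as
        # likely out of capacity and skip it on later attempts.
        blocked.add(resources.copy(zone=zone))
    else:
        # Other codes are outside the scope of this sketch.
        raise ValueError(f'Unhandled error code: {code!r}')


blocked: Set[Resources] = set()
requested = Resources(instance_type='n1-standard-8')
update_blocklist(blocked, requested, 'us-central1-b', 'RESOURCE_NOT_FOUND')
```

Because the blocked entry carries the zone, the same request stays eligible in other zones; only the zone that returned the error is skipped.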
@@ -1524,6 +1533,18 @@ def need_ray_up(
             'Retrying due to list request rate limit exceeded.')
         return True

+    # https://github.com/skypilot-org/skypilot/issues/1797
+    # "The resource 'projects/xxx/zones/us-central1-b/instances/ray-yyy-head-<hash>-compute' was not found" # pylint: disable=line-too-long
+    pattern = (r'\'code\': \'RESOURCE_NOT_FOUND\'.*The resource'
+               r'.*instances\/.*-compute\' was not found')
+    result = re.search(pattern, stderr)
+    if result is not None:
+        # Retry. Unlikely will succeed if it's due to no capacity.
+        logger.info(
+            'Retrying due to the possibly flaky RESOURCE_NOT_FOUND '
+            'error.')
+        return True
+
 if ('Processing file mounts' in stdout and
         'Running setup commands' not in stdout and
         'Failed to setup head node.' in stderr):
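
To sanity-check what the new pattern in `need_ray_up` matches, here is a small standalone sketch. The pattern is copied from the hunk above; `is_resource_not_found` and the `SAMPLE` string are hypothetical, with the sample assembled from the message quoted in the code comment rather than taken from a real log.

```python
import re

# Pattern copied verbatim from the diff above.
PATTERN = (r'\'code\': \'RESOURCE_NOT_FOUND\'.*The resource'
           r'.*instances\/.*-compute\' was not found')


def is_resource_not_found(stderr: str) -> bool:
    """Return True if stderr contains the RESOURCE_NOT_FOUND signature."""
    return re.search(PATTERN, stderr) is not None


# Hypothetical stderr line; the dict-style wrapping is an assumption.
SAMPLE = ("{'code': 'RESOURCE_NOT_FOUND', 'message': \"The resource "
          "'projects/xxx/zones/us-central1-b/instances/"
          "ray-yyy-head-<hash>-compute' was not found\"}")

# Same content, but with the code and the message on separate lines.
SAMPLE_MULTILINE = SAMPLE.replace(", 'message'", ",\n'message'")

print(is_resource_not_found(SAMPLE))
print(is_resource_not_found(SAMPLE_MULTILINE))
print(is_resource_not_found('Quota exceeded for quota metric'))
```

One thing the sketch makes visible: `.` does not match newlines unless `re.DOTALL` is passed, so the pattern only fires when the error code and the message appear on the same line of stderr. Whether that matters depends on how the provisioner formats the error in practice.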