A user reported the following failure, which caused a spot job to fail with FAILED_CONTROLLER (which is bad, since no further attempts were made to launch the job):
Process Process-1:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/opt/conda/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/conda/lib/python3.10/site-packages/sky/spot/controller.py", line 330, in _run_controller
    spot_controller = SpotController(job_id, dag_yaml, retry_until_up)
  File "/opt/conda/lib/python3.10/site-packages/sky/spot/controller.py", line 49, in __init__
    self._dag, self._dag_name = _get_dag_and_name(dag_yaml)
  File "/opt/conda/lib/python3.10/site-packages/sky/spot/controller.py", line 37, in _get_dag_and_name
    dag = dag_utils.load_chain_dag_from_yaml(dag_yaml)
  File "/opt/conda/lib/python3.10/site-packages/sky/utils/dag_utils.py", line 50, in load_chain_dag_from_yaml
    task = task_lib.Task.from_yaml_config(task_config, env_overrides)
  File "/opt/conda/lib/python3.10/site-packages/sky/task.py", line 336, in from_yaml_config
    storage_obj = storage_lib.Storage.from_yaml_config(storage[1])
  File "/opt/conda/lib/python3.10/site-packages/sky/data/storage.py", line 862, in from_yaml_config
    storage_obj.add_store(StoreType(store.upper()))
  File "/opt/conda/lib/python3.10/site-packages/sky/data/storage.py", line 717, in add_store
    store = store_cls(
  File "/opt/conda/lib/python3.10/site-packages/sky/data/storage.py", line 909, in __init__
    super().__init__(name, source, region, is_sky_managed,
  File "/opt/conda/lib/python3.10/site-packages/sky/data/storage.py", line 218, in __init__
    self.initialize()
  File "/opt/conda/lib/python3.10/site-packages/sky/data/storage.py", line 1016, in initialize
    self.bucket, is_new_bucket = self._get_bucket()
  File "/opt/conda/lib/python3.10/site-packages/sky/data/storage.py", line 1148, in _get_bucket
    self.client.head_bucket(Bucket=self.name)
  File "/opt/conda/lib/python3.10/site-packages/botocore/client.py", line 534, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/opt/conda/lib/python3.10/site-packages/botocore/client.py", line 959, in _make_api_call
    http, parsed_response = self._make_request(
  File "/opt/conda/lib/python3.10/site-packages/botocore/client.py", line 982, in _make_request
    return self._endpoint.make_request(operation_model, request_dict)
  File "/opt/conda/lib/python3.10/site-packages/botocore/endpoint.py", line 119, in make_request
    return self._send_request(request_dict, operation_model)
  File "/opt/conda/lib/python3.10/site-packages/botocore/endpoint.py", line 198, in _send_request
    request = self.create_request(request_dict, operation_model)
  File "/opt/conda/lib/python3.10/site-packages/botocore/endpoint.py", line 134, in create_request
    self._event_emitter.emit(
  File "/opt/conda/lib/python3.10/site-packages/botocore/hooks.py", line 412, in emit
    return self._emitter.emit(aliased_event_name, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/botocore/hooks.py", line 256, in emit
    return self._emit(event_name, kwargs)
  File "/opt/conda/lib/python3.10/site-packages/botocore/hooks.py", line 239, in _emit
    response = handler(**kwargs)
  File "/opt/conda/lib/python3.10/site-packages/botocore/signers.py", line 105, in handler
    return self.sign(operation_name, request)
  File "/opt/conda/lib/python3.10/site-packages/botocore/signers.py", line 189, in sign
    auth.add_auth(request)
  File "/opt/conda/lib/python3.10/site-packages/botocore/auth.py", line 418, in add_auth
    raise NoCredentialsError()
botocore.exceptions.NoCredentialsError: Unable to locate credentials
I 07-22 02:01:28 controller.py:412] Killing controller process 869158.
I 07-22 02:01:28 controller.py:420] Controller process 869158 killed.
I 07-22 02:01:28 controller.py:422] Cleaning up any spot cluster for job 7.
I 07-22 02:01:28 storage.py:700] Storage type StoreType.S3 already exists.
I 07-22 02:01:30 storage.py:1058] Deleted S3 bucket skypilot-filemounts-files-xxxxxxxxxxxxxxx
I 07-22 02:01:30 controller.py:431] Spot cluster of job 7 has been cleaned up.
I 07-22 02:01:30 controller.py:443] Previous spot job status: PENDING
I 07-22 02:01:30 spot_state.py:404] Unexpected error occurred. For details, run: sky spot logs --controller 7
Note the lines
  File "/opt/conda/lib/python3.10/site-packages/sky/data/storage.py", line 1148, in _get_bucket
    self.client.head_bucket(Bucket=self.name)
  ...
  File "/opt/conda/lib/python3.10/site-packages/botocore/auth.py", line 418, in add_auth
    raise NoCredentialsError()
botocore.exceptions.NoCredentialsError: Unable to locate credentials
The current hypothesis is that the user is using SSO, and that the AWS metadata service that provides the SSO credentials can be temporarily flaky. See that ticket for more details.
@landscapepainter Can we adopt the solution in https://github.com/skypilot-org/skypilot/pull/1988/files for storage.py's S3 client, i.e. retry a few times with randomized backoff?
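
For illustration, here is a minimal sketch of what "retry a few times with randomized backoff" could look like. It is not taken from the linked PR and may differ from what that PR does. The helper name `retry_with_backoff` and its parameters (`max_attempts`, `base_delay`, `sleep`) are hypothetical; the only names from the traceback above are `head_bucket` and `NoCredentialsError`.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar('T')


def retry_with_backoff(
    fn: Callable[[], T],
    retryable: Tuple[Type[BaseException], ...],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn(), retrying on `retryable` errors with randomized backoff.

    Hypothetical helper for illustration; not part of SkyPilot or botocore.
    """
    if max_attempts < 1:
        raise ValueError('max_attempts must be at least 1')
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retryable:
            if attempt == max_attempts:
                # Out of attempts: surface the original error.
                raise
            # Exponential backoff with jitter: the nominal delay doubles
            # each attempt and is scaled by a random factor in [0.5, 1.5].
            nominal = base_delay * (2 ** (attempt - 1))
            sleep(nominal * random.uniform(0.5, 1.5))
    raise AssertionError('unreachable')


# Assumed usage inside storage.py's _get_bucket (not the actual code):
#
#   retry_with_backoff(
#       lambda: self.client.head_bucket(Bucket=self.name),
#       retryable=(botocore.exceptions.NoCredentialsError,),
#   )
```

The `sleep` parameter is injectable only so that the helper can be exercised without real waiting. Whether `NoCredentialsError` is the right (or only) exception to retry on depends on confirming the SSO hypothesis.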