[storage] S3 client should auto-retry NoCredentialsError #2301

Closed
concretevitamin opened this issue Jul 25, 2023 · 3 comments
Labels: bug (Something isn't working), Stale
Milestone: Storage

Comments

@concretevitamin
Member

A user reported the following failure, which caused a spot job to fail with FAILED_CONTROLLER. This is bad because no further attempts were made to launch the job:

Process Process-1:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/opt/conda/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/conda/lib/python3.10/site-packages/sky/spot/controller.py", line 330, in _run_controller
    spot_controller = SpotController(job_id, dag_yaml, retry_until_up)
  File "/opt/conda/lib/python3.10/site-packages/sky/spot/controller.py", line 49, in __init__
    self._dag, self._dag_name = _get_dag_and_name(dag_yaml)
  File "/opt/conda/lib/python3.10/site-packages/sky/spot/controller.py", line 37, in _get_dag_and_name
    dag = dag_utils.load_chain_dag_from_yaml(dag_yaml)
  File "/opt/conda/lib/python3.10/site-packages/sky/utils/dag_utils.py", line 50, in load_chain_dag_from_yaml
    task = task_lib.Task.from_yaml_config(task_config, env_overrides)
  File "/opt/conda/lib/python3.10/site-packages/sky/task.py", line 336, in from_yaml_config
    storage_obj = storage_lib.Storage.from_yaml_config(storage[1])
  File "/opt/conda/lib/python3.10/site-packages/sky/data/storage.py", line 862, in from_yaml_config
    storage_obj.add_store(StoreType(store.upper()))
  File "/opt/conda/lib/python3.10/site-packages/sky/data/storage.py", line 717, in add_store
    store = store_cls(
  File "/opt/conda/lib/python3.10/site-packages/sky/data/storage.py", line 909, in __init__
    super().__init__(name, source, region, is_sky_managed,
  File "/opt/conda/lib/python3.10/site-packages/sky/data/storage.py", line 218, in __init__
    self.initialize()
  File "/opt/conda/lib/python3.10/site-packages/sky/data/storage.py", line 1016, in initialize
    self.bucket, is_new_bucket = self._get_bucket()
  File "/opt/conda/lib/python3.10/site-packages/sky/data/storage.py", line 1148, in _get_bucket
    self.client.head_bucket(Bucket=self.name)
  File "/opt/conda/lib/python3.10/site-packages/botocore/client.py", line 534, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/opt/conda/lib/python3.10/site-packages/botocore/client.py", line 959, in _make_api_call
    http, parsed_response = self._make_request(
  File "/opt/conda/lib/python3.10/site-packages/botocore/client.py", line 982, in _make_request
    return self._endpoint.make_request(operation_model, request_dict)
  File "/opt/conda/lib/python3.10/site-packages/botocore/endpoint.py", line 119, in make_request
    return self._send_request(request_dict, operation_model)
  File "/opt/conda/lib/python3.10/site-packages/botocore/endpoint.py", line 198, in _send_request
    request = self.create_request(request_dict, operation_model)
  File "/opt/conda/lib/python3.10/site-packages/botocore/endpoint.py", line 134, in create_request
    self._event_emitter.emit(
  File "/opt/conda/lib/python3.10/site-packages/botocore/hooks.py", line 412, in emit
    return self._emitter.emit(aliased_event_name, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/botocore/hooks.py", line 256, in emit
    return self._emit(event_name, kwargs)
  File "/opt/conda/lib/python3.10/site-packages/botocore/hooks.py", line 239, in _emit
    response = handler(**kwargs)
  File "/opt/conda/lib/python3.10/site-packages/botocore/signers.py", line 105, in handler
    return self.sign(operation_name, request)
  File "/opt/conda/lib/python3.10/site-packages/botocore/signers.py", line 189, in sign
    auth.add_auth(request)
  File "/opt/conda/lib/python3.10/site-packages/botocore/auth.py", line 418, in add_auth
    raise NoCredentialsError()
botocore.exceptions.NoCredentialsError: Unable to locate credentials
I 07-22 02:01:28 controller.py:412] Killing controller process 869158.
I 07-22 02:01:28 controller.py:420] Controller process 869158 killed.
I 07-22 02:01:28 controller.py:422] Cleaning up any spot cluster for job 7.
I 07-22 02:01:28 storage.py:700] Storage type StoreType.S3 already exists.
I 07-22 02:01:30 storage.py:1058] Deleted S3 bucket skypilot-filemounts-files-xxxxxxxxxxxxxxx
I 07-22 02:01:30 controller.py:431] Spot cluster of job 7 has been cleaned up.
I 07-22 02:01:30 controller.py:443] Previous spot job status: PENDING
I 07-22 02:01:30 spot_state.py:404] Unexpected error occurred. For details, run: sky spot logs --controller 7

Note these lines in the traceback:

  File "/opt/conda/lib/python3.10/site-packages/sky/data/storage.py", line 1148, in _get_bucket
    self.client.head_bucket(Bucket=self.name)
...

  File "/opt/conda/lib/python3.10/site-packages/botocore/auth.py", line 418, in add_auth
    raise NoCredentialsError()
botocore.exceptions.NoCredentialsError: Unable to locate credentials

@landscapepainter Can we apply the solution from https://github.com/skypilot-org/skypilot/pull/1988/files to the S3 client in storage.py? It should retry a few times with randomized backoff.

The current hypothesis is that the user is using SSO, and that the AWS metadata service that provides the SSO credentials can be temporarily flaky. See that PR for more details.
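
For illustration, here is a minimal sketch of the "retry a few times with randomized backoff" idea. It is not the code from the PR linked above or from the eventual fix. The helper name `retry_with_random_backoff`, its parameters, and the default attempt count and delay bounds are all assumptions made for this example.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar('T')


def retry_with_random_backoff(
    fn: Callable[[], T],
    retry_on: Tuple[Type[BaseException], ...],
    max_attempts: int = 3,
    min_backoff_seconds: float = 1.0,
    max_backoff_seconds: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn(), retrying on the given exception types after a random delay.

    Hypothetical helper: the name, signature, and defaults are illustrative.
    """
    if max_attempts < 1:
        raise ValueError('max_attempts must be at least 1')
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts:
                # Out of attempts: surface the original error to the caller.
                raise
            # Pick a random delay so concurrent callers do not retry in lockstep.
            sleep(random.uniform(min_backoff_seconds, max_backoff_seconds))
    raise AssertionError('unreachable')
```

Under these assumptions, the failing call in `_get_bucket` could be wrapped as `retry_with_random_backoff(lambda: self.client.head_bucket(Bucket=self.name), retry_on=(botocore.exceptions.NoCredentialsError,))`. Only `NoCredentialsError` is retried, so other errors still fail fast.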

@concretevitamin concretevitamin added the bug Something isn't working label Jul 25, 2023
@concretevitamin concretevitamin added this to the Storage milestone Jul 25, 2023
@landscapepainter
Collaborator

@concretevitamin Thanks for sharing the issue. I will resolve it shortly by adding retries.

Contributor

This issue is stale because it has been open 120 days with no activity. Remove stale label or comment or this will be closed in 10 days.

@github-actions github-actions bot added the Stale label Nov 23, 2023
@landscapepainter
Collaborator

landscapepainter commented Nov 27, 2023

Resolved by #2321
