-
Notifications
You must be signed in to change notification settings - Fork 541
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix AWS credential creation error handling #2321
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Quick look. Thanks for the in-depth investigation @suquark. Some questions inlined.
Just to make sure my understanding of our plan is right:
- We need to make session()/resource() thread-local since they are not thread-safe. We also want to cache it once for each thread's lifetime.
- We can in principle reuse client() across threads. However, if it's created by calling session() first, that func is already made thread-local, so client() is going to be made thread-local too?
try: | ||
with _session_creation_lock: | ||
return boto3.session.Session() | ||
except (botocore_exceptions().CredentialRetrievalError, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, I'm not too sure whether the error reported by the user in #1988 was thrown at session create time. There was no stacktrace. Spot controller log:
I 05-29 16:23:00 spot_utils.py:69] === Checking the job status... ===
I 05-29 16:23:01 spot_utils.py:75] Job status: JobStatus.RUNNING
I 05-29 16:23:01 spot_utils.py:78] ==================================
sky.exceptions.ClusterStatusFetchingError: Failed to query AWS cluster 'train-xxxxxxx-16' status: Unable to locate credentials. You can configure credentials by running "aws configure".
Unexpected error occurred: [sky.exceptions.ClusterStatusFetchingError] Failed to query AWS cluster 'train-xxxxxxx-16' status: Unable to locate credentials. You can configure credentials by running "aws configure".
I 05-29 16:23:06 spot_state.py:292] Unexpected error occurred: [sky.exceptions.ClusterStatusFetchingError] Failed to query AWS cluster 'train-xxxxxxx-16' status: Unable to locate credentials. You can configure credentials by running "aws configure".
Guessing from the newer stacktrace in #2301, is it possible that NoCredentialsError is thrown not at creation time by at request time?
Another possibility: #2301 happened because the current session/resource handling are not thread-safe.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmmm, if it happens at request time, then I do not have good approaches to handle this (seems impossible to distinguish between a real error or flaky env issue). But this may not be an issue, since we add default retries here:
skypilot/sky/provision/aws/instance.py
Lines 16 to 21 in ca2a092
def _default_ec2_resource(region: str) -> Any: | |
return aws.resource( | |
'ec2', | |
region_name=region, | |
config=config.Config(retries={'max_attempts': BOTO_MAX_RETRIES})) | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice, thanks a bunch @suquark! Should I run smoke tests?
_local.resource = {} | ||
|
||
# Using service name and kwargs as key | ||
sorted_kwargs = tuple(sorted(kwargs.items(), key=lambda x: x[0])) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
key=lambda x: x[0] seems omittable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
well, kwargs.items()
returns a list of pair of (key: str, value: Any)
, where we only want to compare against the key, since values may not be comparable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_local.client = {} | ||
|
||
# Using service name and kwargs as key | ||
sorted_kwargs = tuple(sorted(kwargs.items(), key=lambda x: x[0])) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(same above)
I have run the smoke tests. Only these tests fail due to lack of quotas with my account:
|
Thanks for the suggestions! I updated the code and comments. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM @suquark. All these tests passed for me:
FAILED tests/test_smoke.py::test_aws_image_id_dict - Exception: test failed: less /var/folders/4m/37zr5qys6yxbhs49_gwr1vb80000gn/T/aws_image_id_dict-s5sbybpz.log
FAILED tests/test_smoke.py::test_image_no_conda - Exception: test failed: less /var/folders/4m/37zr5qys6yxbhs49_gwr1vb80000gn/T/image_no_conda-1kklvlpb.log
FAILED tests/test_smoke.py::test_aws_image_id_dict_zone - Exception: test failed: less /var/folders/4m/37zr5qys6yxbhs49_gwr1vb80000gn/T/aws_image_id_dict_zone-8qfi2j7n.log
FAILED tests/test_smoke.py::test_aws_image_id_dict_region - Exception: test failed: less /var/folders/4m/37zr5qys6yxbhs49_gwr1vb80000gn/T/aws_image_id_dict_region-px_098q0.log
FAILED tests/test_smoke.py::test_spot_recovery_aws - Exception: test failed: less /var/folders/4m/37zr5qys6yxbhs49_gwr1vb80000gn/T/spot_recovery_aws-en7297qe.log
FAILED tests/test_smoke.py::test_custom_image - Exception: test failed: less /var/folders/4m/37zr5qys6yxbhs49_gwr1vb80000gn/T/test-custom-image-_8fus_yg.log
Address the credential fetching issue mentioned in #2288
Tested (run the relevant ones):
bash format.sh
pytest tests/test_smoke.py
pytest tests/test_smoke.py --aws
bash tests/backward_comaptibility_tests.sh