Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix AWS credential creation error handling #2321

Merged
merged 4 commits into from
Aug 6, 2023
Merged

Conversation

suquark
Copy link
Collaborator

@suquark suquark commented Jul 28, 2023

Address the credential fetching issue mentioned in #2288

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py --aws
  • Backward compatibility tests: bash tests/backward_comaptibility_tests.sh

Copy link
Member

@concretevitamin concretevitamin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Quick look. Thanks for the in-depth investigation @suquark. Some questions inlined.

Just to make sure my understanding of our plan is right:

  • We need to make session()/resource() thread-local since they are not thread-safe. We also want to cache it once for each thread's lifetime.
  • We can in principle reuse client() across threads. However, if it's created by calling session() first, that func is already made thread-local, so client() is going to be made thread-local too?

sky/adaptors/aws.py Outdated Show resolved Hide resolved
sky/adaptors/aws.py Outdated Show resolved Hide resolved
try:
with _session_creation_lock:
return boto3.session.Session()
except (botocore_exceptions().CredentialRetrievalError,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I'm not too sure whether the error reported by the user in #1988 was thrown at session create time. There was no stacktrace. Spot controller log:

I 05-29 16:23:00 spot_utils.py:69] === Checking the job status... ===
I 05-29 16:23:01 spot_utils.py:75] Job status: JobStatus.RUNNING
I 05-29 16:23:01 spot_utils.py:78] ==================================
sky.exceptions.ClusterStatusFetchingError: Failed to query AWS cluster 'train-xxxxxxx-16' status: Unable to locate credentials. You can configure credentials by running "aws configure".


Unexpected error occurred: [sky.exceptions.ClusterStatusFetchingError] Failed to query AWS cluster 'train-xxxxxxx-16' status: Unable to locate credentials. You can configure credentials by running "aws configure".

I 05-29 16:23:06 spot_state.py:292] Unexpected error occurred: [sky.exceptions.ClusterStatusFetchingError] Failed to query AWS cluster 'train-xxxxxxx-16' status: Unable to locate credentials. You can configure credentials by running "aws configure".

Guessing from the newer stacktrace in #2301, is it possible that NoCredentialsError is thrown not at creation time by at request time?

Another possibility: #2301 happened because the current session/resource handling are not thread-safe.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmmm, if it happens at request time, then I do not have good approaches to handle this (seems impossible to distinguish between a real error or flaky env issue). But this may not be an issue, since we add default retries here:

def _default_ec2_resource(region: str) -> Any:
return aws.resource(
'ec2',
region_name=region,
config=config.Config(retries={'max_attempts': BOTO_MAX_RETRIES}))

@suquark suquark requested a review from concretevitamin August 4, 2023 22:37
Copy link
Member

@concretevitamin concretevitamin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, thanks a bunch @suquark! Should I run smoke tests?

sky/adaptors/aws.py Show resolved Hide resolved
sky/adaptors/aws.py Show resolved Hide resolved
sky/adaptors/aws.py Outdated Show resolved Hide resolved
_local.resource = {}

# Using service name and kwargs as key
sorted_kwargs = tuple(sorted(kwargs.items(), key=lambda x: x[0]))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

key=lambda x: x[0] seems omittable?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

well, kwargs.items() returns a list of pair of (key: str, value: Any), where we only want to compare against the key, since values may not be comparable.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_local.client = {}

# Using service name and kwargs as key
sorted_kwargs = tuple(sorted(kwargs.items(), key=lambda x: x[0]))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(same above)

@suquark
Copy link
Collaborator Author

suquark commented Aug 5, 2023

I have run the smoke tests. Only these tests fail due to lack of quotas with my account:

FAILED tests/test_smoke.py::test_aws_image_id_dict - Exception: test failed: less /var/folders/4m/37zr5qys6yxbhs49_gwr1vb80000gn/T/aws_image_id_dict-s5sbybpz.log
FAILED tests/test_smoke.py::test_image_no_conda - Exception: test failed: less /var/folders/4m/37zr5qys6yxbhs49_gwr1vb80000gn/T/image_no_conda-1kklvlpb.log
FAILED tests/test_smoke.py::test_aws_image_id_dict_zone - Exception: test failed: less /var/folders/4m/37zr5qys6yxbhs49_gwr1vb80000gn/T/aws_image_id_dict_zone-8qfi2j7n.log
FAILED tests/test_smoke.py::test_aws_image_id_dict_region - Exception: test failed: less /var/folders/4m/37zr5qys6yxbhs49_gwr1vb80000gn/T/aws_image_id_dict_region-px_098q0.log
FAILED tests/test_smoke.py::test_spot_recovery_aws - Exception: test failed: less /var/folders/4m/37zr5qys6yxbhs49_gwr1vb80000gn/T/spot_recovery_aws-en7297qe.log
FAILED tests/test_smoke.py::test_custom_image - Exception: test failed: less /var/folders/4m/37zr5qys6yxbhs49_gwr1vb80000gn/T/test-custom-image-_8fus_yg.log

@suquark
Copy link
Collaborator Author

suquark commented Aug 5, 2023

Thanks for the suggestions! I updated the code and comments.

@suquark suquark requested a review from concretevitamin August 5, 2023 07:32
Copy link
Member

@concretevitamin concretevitamin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM @suquark. All these tests passed for me:

FAILED tests/test_smoke.py::test_aws_image_id_dict - Exception: test failed: less /var/folders/4m/37zr5qys6yxbhs49_gwr1vb80000gn/T/aws_image_id_dict-s5sbybpz.log
FAILED tests/test_smoke.py::test_image_no_conda - Exception: test failed: less /var/folders/4m/37zr5qys6yxbhs49_gwr1vb80000gn/T/image_no_conda-1kklvlpb.log
FAILED tests/test_smoke.py::test_aws_image_id_dict_zone - Exception: test failed: less /var/folders/4m/37zr5qys6yxbhs49_gwr1vb80000gn/T/aws_image_id_dict_zone-8qfi2j7n.log
FAILED tests/test_smoke.py::test_aws_image_id_dict_region - Exception: test failed: less /var/folders/4m/37zr5qys6yxbhs49_gwr1vb80000gn/T/aws_image_id_dict_region-px_098q0.log
FAILED tests/test_smoke.py::test_spot_recovery_aws - Exception: test failed: less /var/folders/4m/37zr5qys6yxbhs49_gwr1vb80000gn/T/spot_recovery_aws-en7297qe.log
FAILED tests/test_smoke.py::test_custom_image - Exception: test failed: less /var/folders/4m/37zr5qys6yxbhs49_gwr1vb80000gn/T/test-custom-image-_8fus_yg.log

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants