
[AWS] Adopt new provisioner to query clusters #2288

Merged · 7 commits · Jul 26, 2023

Conversation

@suquark (Collaborator) commented Jul 21, 2023

Adopt the new provisioner to query cluster status, i.e., the instances and their statuses.

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: bash tests/backward_compatibility_tests.sh

@concretevitamin (Member) left a comment

Thanks @suquark!

Resolved review threads on sky/provision/__init__.py and sky/provision/aws/instance.py (one outdated).
        node_status_dict = provision_lib.query_instances(
            cloud_name, cluster_name, provider_config)
        node_statuses = list(node_status_dict.values())
    except Exception as e:  # pylint: disable=broad-except
Member:

What exceptions can be thrown by provision_lib.query_instances()? Should we document that?

Also, how would the caller of _query_cluster_status_via_cloud_api() handle them?

Does the previous codepath allow throwing any exceptions?

@suquark (Collaborator, Author):

The previous codepath only checks the return code of the AWS CLI, so it catches general exceptions. We inherit this behavior here.

Member:

Can we add a

    Raises:
        exceptions.ClusterStatusFetchingError: the cluster status cannot be
          fetched from the cloud provider.

to

  • _query_cluster_status_via_cloud_api
  • _update_cluster_status_no_lock

Just some code gardening.
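For illustration, a rough sketch of what the suggested docstring and a typed re-raise could look like; the function signature and the error message are simplified assumptions, with only provision_lib.query_instances and exceptions.ClusterStatusFetchingError taken from the diff above:

    def _query_cluster_status_via_cloud_api(cloud_name, cluster_name,
                                            provider_config):
        """Queries the cluster's node statuses from the cloud provider.

        Raises:
            exceptions.ClusterStatusFetchingError: the cluster status cannot be
              fetched from the cloud provider.
        """
        try:
            node_status_dict = provision_lib.query_instances(
                cloud_name, cluster_name, provider_config)
            node_statuses = list(node_status_dict.values())
        except Exception as e:  # pylint: disable=broad-except
            # Wrap any provider error in a typed exception so callers such as
            # _update_cluster_status_no_lock can document and handle it.
            raise exceptions.ClusterStatusFetchingError(
                f'Failed to fetch status for cluster {cluster_name!r}: {e}'
            ) from e
        return node_statuses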

@suquark (Collaborator, Author) commented Jul 25, 2023

@Michaelvll could you help run the smoke tests? Most of the tests pass in my environment; the failed one may be related to my resource limits. Thanks!

@suquark suquark requested a review from concretevitamin July 25, 2023 09:08
@concretevitamin (Member) left a comment

LGTM. I'm running pytest tests/test_smoke.py --aws.

@concretevitamin (Member) commented:

These smoke tests failed for the same reason:

FAILED tests/test_smoke.py::test_spot_pipeline_recovery_aws - Exception: test failed: less /var/folders/8f/56gzvwkd3n3293xjlrztr6600000gp/T/spot_pipeline_recovery_aws-...
FAILED tests/test_smoke.py::test_spot_recovery_aws - Exception: test failed: less /var/folders/8f/56gzvwkd3n3293xjlrztr6600000gp/T/spot_recovery_aws-ki9xi753.log
FAILED tests/test_smoke.py::test_spot_recovery_multi_node_aws - Exception: test failed: less /var/folders/8f/56gzvwkd3n3293xjlrztr6600000gp/T/spot_recovery_multi_node_...

They all grepped for RUNNING but found FAILED_CONTROLLER at some point. Their controller logs show a bunch of

(t-spot-pipeline-07a-2b28-1e, pid=21554)     raise exceptions.ClusterStatusFetchingError(
(t-spot-pipeline-07a-2b28-1e, pid=21554) sky.exceptions.ClusterStatusFetchingError: Found 2 node(s) with the same cluster name tag in the cloud provider for cluster 'a-53', which should have 1 nodes. This normally should not happen. Please check the cloud console and fix any possible resources leakage (e.g., if there are any stopped nodes and they do not have data or are unhealthy, terminate them).

  (t-spot-recovery-784-2b28-54, pid=19100) I 07-25 17:06:01 recovery_strategy.py:286] Spot cluster launched.
  (t-spot-recovery-784-2b28-54, pid=19100) I 07-25 17:06:02 recovery_strategy.py:192] Unexpected exception: Found 2 node(s) with the same cluster name tag in the cloud provider for cluster 't-spot-recovery-784-2b28-54-57', which should have 1 nodes. This normally should not happen. Please check the cloud console and fix any possible resources leakage (e.g., if there are any stopped nodes and they do not have data or are unhealthy, terminate them).

(t-spot-recovery-351-2b28-a8, pid=11463)     raise exceptions.ClusterStatusFetchingError(
(t-spot-recovery-351-2b28-a8, pid=11463) sky.exceptions.ClusterStatusFetchingError: Found 3 node(s) with the same cluster name tag in the cloud provider for cluster 't-spot-recovery-351-2b28-a8-52', which should have 2 nodes. This normally should not happen. Please check the cloud console and fix any possible resources leakage (e.g., if there are any stopped nodes and they do not have data or are unhealthy, terminate them).

Checked the console: the running node count is correct. However, the current PR may have included some terminated node(s) with the cluster name tag, hence the count mismatch errors. Could you reproduce this? Should we do something like get_nonterminated_nodes()?
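For context, a minimal sketch of the kind of filtering get_nonterminated_nodes() implies (a hypothetical helper, not the actual fix that landed): only non-terminated instances should count toward the cluster's expected node number.

    # Hypothetical helper: EC2 state names that should not count toward the
    # cluster's expected node count.
    _TERMINATED_STATES = {'shutting-down', 'terminated'}

    def count_live_nodes(instance_states: dict) -> int:
        """Counts instances that are not terminated or being terminated.

        `instance_states` maps instance IDs to EC2 state names, e.g.
        {'i-0abc': 'running', 'i-0def': 'terminated'} counts as 1.
        """
        return sum(1 for state in instance_states.values()
                   if state not in _TERMINATED_STATES)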

@concretevitamin (Member) commented:

Actually, another test was failing:

FAILED tests/test_smoke.py::test_autodown - Exception: test failed: less /var/folders/8f/56gzvwkd3n3293xjlrztr6600000gp/T/autodown-8_75gbgu.log

with log

...
+ s=$(SKYPILOT_DEBUG=0 sky status t-autodown-2b28-c7 --refresh) && echo "$s" && { echo "$s" | grep t-autodown-2b28-c7 | grep "Autodowned cluster\|terminated on the cloud"
; } || { echo "$s" | grep t-autodown-2b28-c7 && exit 1 || exit 0; }
Clusters

W 07-25 10:28:59 backend_utils.py:1992] Attempted to cancel autodown on the cluster 't-autodown-2b28-c7' with best effort, since it is found to be in an abnormal state. To fix, try running: sky start -f -i 1 --down t-autodown-2b28-c7

NAME                LAUNCHED    RESOURCES            STATUS  AUTOSTOP  COMMAND
t-autodown-2b28-c7  5 mins ago  2x AWS(m6i.2xlarge)  INIT    -         sky launch -y -d -c t-aut...
...

@suquark (Collaborator, Author) commented Jul 26, 2023

I have fixed the failing tests mentioned above (test_spot_recovery_aws failed initially but passed in multiple later runs, so I think it is just a flaky test).

@suquark suquark requested a review from concretevitamin July 26, 2023 06:51
@concretevitamin (Member) left a comment

LGTM, thanks a bunch @suquark! All smoke tests passed on my end as well.

@@ -43,6 +43,7 @@ def query_instances(
     provider_name: str,
     cluster_name: str,
     provider_config: Optional[Dict[str, Any]] = None,
+    non_terminated_only: bool = True,
Member:

For discussion: Does it make sense to not expose this and always assume non_terminated_only=True? Will there be callers who would want this to be False?

stop() and terminate(), for example, already implicitly assume non-terminated instances, e.g.,

    filters = [
        {
            'Name': 'instance-state-name',
            # exclude 'shutting-down' or 'terminated' states
            'Values': ['pending', 'running', 'stopping', 'stopped'],
        },
        *_cluster_name_filter(cluster_name),
    ]

Also similar to node providers' design of get_nonterminated_nodes().

We can certainly leave this for the future.
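For discussion purposes, a hedged sketch of how the flag could gate the instance-state filter on the AWS side; it is simplified (the real sky.provision.aws.query_instances takes a provider_config and uses a cached per-thread session), and the tag key below is an assumption:

    import boto3

    def query_instances(cluster_name: str, region: str,
                        non_terminated_only: bool = True) -> dict:
        """Returns {instance_id: state_name} for instances tagged with the
        cluster name, optionally excluding terminated ones."""
        filters = [{
            'Name': 'tag:ray-cluster-name',  # assumed cluster-name tag key
            'Values': [cluster_name],
        }]
        if non_terminated_only:
            filters.append({
                'Name': 'instance-state-name',
                # exclude 'shutting-down' or 'terminated' states
                'Values': ['pending', 'running', 'stopping', 'stopped'],
            })
        ec2 = boto3.client('ec2', region_name=region)
        states = {}
        paginator = ec2.get_paginator('describe_instances')
        for page in paginator.paginate(Filters=filters):
            for reservation in page['Reservations']:
                for inst in reservation['Instances']:
                    states[inst['InstanceId']] = inst['State']['Name']
        return states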

@suquark (Collaborator, Author):

Let me leave a comment about it.

@suquark suquark merged commit fe3360d into master Jul 26, 2023
@suquark suquark deleted the migrate_query branch July 26, 2023 23:19
@concretevitamin (Member) left a comment

Hey @suquark - sorry for not spotting a potential issue. PTAL.

Comment on lines -758 to -761

    retry_stderrs=[
        'Unable to locate credentials. You can configure credentials by '
        'running "aws configure"'
    ])
@concretevitamin (Member) commented Jul 27, 2023

When reviewing #2314, I realized this code (originally added in #1988) was accidentally left out from this PR/master branch.

Could we add it back? #1988 has context. TL;DR: previously, users encountered "ec2 describe-instances" throwing NoCredentialsError with this message (the programmatic client may only include the "Unable to locate credentials" part in the exception message).
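A rough sketch of what re-adding the retry could look like on the programmatic path; the retry count and backoff are illustrative, not taken from #1988:

    import time

    from botocore import exceptions as botocore_exceptions

    def describe_instances_with_retry(ec2_client, filters, max_retries: int = 3):
        """Retries describe_instances on the credential error seen in #1988."""
        for attempt in range(max_retries):
            try:
                return ec2_client.describe_instances(Filters=filters)
            except botocore_exceptions.NoCredentialsError:
                # The programmatic equivalent of the 'Unable to locate
                # credentials' stderr that the old CLI path retried on.
                if attempt == max_retries - 1:
                    raise
                time.sleep(2 ** attempt)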

Member:

Actually, this reminds me of another problem.

Previously, we retried by issuing another "ec2 describe-instances" CLI command in a subprocess. This probably means a new underlying boto3 client was created for each retry, which could be why retrying mitigated this problem, e.g., by abandoning a malfunctioning client.

With this PR, even if we add retry back, it'll access

    @functools.lru_cache()
    def client(service_name: str, **kwargs):
        """Create an AWS client of a certain service.

        Args:
            service_name: AWS service name (e.g., 's3', 'ec2').
            kwargs: Other options.
        """
        # Need to use the client retrieved from the per-thread session
        # to avoid thread-safety issues (Directly creating the client
        # with boto3.client() is not thread-safe).
        # Reference: https://stackoverflow.com/a/59635814
        return session().client(service_name, **kwargs)
which is LRU-cached per thread(?). So if we retry using the same thread, it may not have the same effect.

This is all speculation since we don't have a reliable repro. That said, could we somehow force create a new boto3 client when we retry?
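One possible direction, sketched under the assumption that the lru_cache-ed client() helper above is the only place the client is created: drop the cached entry between attempts so each retry builds a fresh boto3 client and session.

    import functools

    import boto3

    @functools.lru_cache()
    def client(service_name: str, **kwargs):
        # Stand-in for the cached helper quoted above.
        return boto3.session.Session().client(service_name, **kwargs)

    def describe_instances_with_fresh_client(filters, max_retries: int = 3):
        """Retries the query, discarding the cached client between attempts."""
        for attempt in range(max_retries):
            try:
                return client('ec2').describe_instances(Filters=filters)
            except Exception:  # in practice, narrow this to the observed errors
                if attempt == max_retries - 1:
                    raise
                # Force the next client() call to construct a new client, which
                # mimics what the old subprocess-per-retry behavior achieved.
                client.cache_clear()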
