
[AWS] Adopt new provisioner to query clusters #2288

Merged · 7 commits · Jul 26, 2023

Conversation

@suquark (Collaborator) commented Jul 21, 2023

Adopt the new provisioner to query cluster status, i.e., the instances and their statuses.

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: bash tests/backward_compatibility_tests.sh

@concretevitamin (Member) left a comment

Thanks @suquark!

Resolved review threads on sky/provision/__init__.py and sky/provision/aws/instance.py (one outdated).
        node_status_dict = provision_lib.query_instances(
            cloud_name, cluster_name, provider_config)
        node_statuses = list(node_status_dict.values())
    except Exception as e:  # pylint: disable=broad-except
Member:

What exceptions can be thrown by provision_lib.query_instances()? Should we document that?

Also, how would the caller of _query_cluster_status_via_cloud_api() handle them?

Does the previous codepath allow throwing any exceptions?

@suquark (Collaborator, Author):

The previous codepath only checks the return code of the AWS CLI, so it catches general exceptions. We inherit this behavior here.

Member:

Can we add a

    Raises:
        exceptions.ClusterStatusFetchingError: the cluster status cannot be
          fetched from the cloud provider.

to

  • _query_cluster_status_via_cloud_api
  • _update_cluster_status_no_lock

Just some code gardening.
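For illustration, a rough sketch of what the suggested docstring and a typed re-raise could look like; the function signature and the error message are simplified assumptions, with only provision_lib.query_instances and exceptions.ClusterStatusFetchingError taken from the diff above:

    def _query_cluster_status_via_cloud_api(cloud_name, cluster_name,
                                            provider_config):
        """Queries the cluster's node statuses from the cloud provider.

        Raises:
            exceptions.ClusterStatusFetchingError: the cluster status cannot be
              fetched from the cloud provider.
        """
        try:
            node_status_dict = provision_lib.query_instances(
                cloud_name, cluster_name, provider_config)
            node_statuses = list(node_status_dict.values())
        except Exception as e:  # pylint: disable=broad-except
            # Wrap any provider error in a typed exception so callers such as
            # _update_cluster_status_no_lock can document and handle it.
            raise exceptions.ClusterStatusFetchingError(
                f'Failed to fetch status for cluster {cluster_name!r}: {e}'
            ) from e
        return node_statuses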

@suquark (Collaborator, Author) commented Jul 25, 2023

@Michaelvll could you help run the smoke tests? Most of the tests pass in my environment; the failed one may be related to my resource limits. Thanks!

@suquark suquark requested a review from concretevitamin July 25, 2023 09:08
@concretevitamin (Member) left a comment

LGTM. I'm running pytest tests/test_smoke.py --aws.

@concretevitamin (Member) commented:

These smoke tests failed for the same reason:

FAILED tests/test_smoke.py::test_spot_pipeline_recovery_aws - Exception: test failed: less /var/folders/8f/56gzvwkd3n3293xjlrztr6600000gp/T/spot_pipeline_recovery_aws-...
FAILED tests/test_smoke.py::test_spot_recovery_aws - Exception: test failed: less /var/folders/8f/56gzvwkd3n3293xjlrztr6600000gp/T/spot_recovery_aws-ki9xi753.log
FAILED tests/test_smoke.py::test_spot_recovery_multi_node_aws - Exception: test failed: less /var/folders/8f/56gzvwkd3n3293xjlrztr6600000gp/T/spot_recovery_multi_node_...

They all grepped for RUNNING but found FAILED_CONTROLLER at some point. Their controller logs show a bunch of

(t-spot-pipeline-07a-2b28-1e, pid=21554)     raise exceptions.ClusterStatusFetchingError(
(t-spot-pipeline-07a-2b28-1e, pid=21554) sky.exceptions.ClusterStatusFetchingError: Found 2 node(s) with the same cluster name tag in the cloud provider for cluster 'a-53', which should have 1 nodes. This normally should not happen. Please check the cloud console and fix any possible resources leakage (e.g., if there are any stopped nodes and they do not have data or are unhealthy, terminate them).

  (t-spot-recovery-784-2b28-54, pid=19100) I 07-25 17:06:01 recovery_strategy.py:286] Spot cluster launched.
  (t-spot-recovery-784-2b28-54, pid=19100) I 07-25 17:06:02 recovery_strategy.py:192] Unexpected exception: Found 2 node(s) with the same cluster name tag in the cloud provider for cluster 't-spot-recovery-784-2b28-54-57', which should have 1 nodes. This normally should not happen. Please check the cloud console and fix any possible resources leakage (e.g., if there are any stopped nodes and they do not have data or are unhealthy, terminate them).

(t-spot-recovery-351-2b28-a8, pid=11463)     raise exceptions.ClusterStatusFetchingError(
(t-spot-recovery-351-2b28-a8, pid=11463) sky.exceptions.ClusterStatusFetchingError: Found 3 node(s) with the same cluster name tag in the cloud provider for cluster 't-spot-recovery-351-2b28-a8-52', which should have 2 nodes. This normally should not happen. Please check the cloud console and fix any possible resources leakage (e.g., if there are any stopped nodes and they do not have data or are unhealthy, terminate them).

Checked the console: the running node count is correct. However, the current PR may have included some terminated node(s) with the cluster name tag, hence the count mismatch errors. Could you reproduce this? Should we do something like get_nonterminated_nodes()?
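For context, a minimal sketch of the kind of filtering get_nonterminated_nodes() implies (a hypothetical helper, not the actual fix that landed): only non-terminated instances should count toward the cluster's expected node number.

    # Hypothetical helper: EC2 state names that should not count toward the
    # cluster's expected node count.
    _TERMINATED_STATES = {'shutting-down', 'terminated'}

    def count_live_nodes(instance_states: dict) -> int:
        """Counts instances that are not terminated or being terminated.

        `instance_states` maps instance IDs to EC2 state names, e.g.
        {'i-0abc': 'running', 'i-0def': 'terminated'} counts as 1.
        """
        return sum(1 for state in instance_states.values()
                   if state not in _TERMINATED_STATES)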

@concretevitamin (Member) commented:

Actually, another test was failing:

FAILED tests/test_smoke.py::test_autodown - Exception: test failed: less /var/folders/8f/56gzvwkd3n3293xjlrztr6600000gp/T/autodown-8_75gbgu.log

with log

...
+ s=$(SKYPILOT_DEBUG=0 sky status t-autodown-2b28-c7 --refresh) && echo "$s" && { echo "$s" | grep t-autodown-2b28-c7 | grep "Autodowned cluster\|terminated on the cloud"
; } || { echo "$s" | grep t-autodown-2b28-c7 && exit 1 || exit 0; }
Clusters

W 07-25 10:28:59 backend_utils.py:1992] Attempted to cancel autodown on the cluster 't-autodown-2b28-c7' with best effort, since it is found to be in an abnormal state. To fix, try running: sky start -f -i 1 --down t-autodown-2b28-c7

NAME                LAUNCHED    RESOURCES            STATUS  AUTOSTOP  COMMAND
t-autodown-2b28-c7  5 mins ago  2x AWS(m6i.2xlarge)  INIT    -         sky launch -y -d -c t-aut...
...

@suquark (Collaborator, Author) commented Jul 26, 2023

I have fixed the failing tests mentioned above (test_spot_recovery_aws failed initially but passed in multiple later runs, so I think it is just a flaky test).

@suquark suquark requested a review from concretevitamin July 26, 2023 06:51
@concretevitamin (Member) left a comment

LGTM, thanks a bunch @suquark! All smoke tests passed on my end as well.

@@ -43,6 +43,7 @@ def query_instances(
     provider_name: str,
     cluster_name: str,
     provider_config: Optional[Dict[str, Any]] = None,
+    non_terminated_only: bool = True,
Member:

For discussion: Does it make sense to not expose this and always assume non_terminated_only=True? Will there be callers who would want this to be False?

stop() and terminate(), for example, already implicitly assume non-terminated instances, e.g.,

    filters = [
        {
            'Name': 'instance-state-name',
            # exclude 'shutting-down' or 'terminated' states
            'Values': ['pending', 'running', 'stopping', 'stopped'],
        },
        *_cluster_name_filter(cluster_name),
    ]

Also similar to node providers' design of get_nonterminated_nodes().

We can certainly leave this for the future.
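For discussion purposes, a hedged sketch of how the flag could gate the instance-state filter on the AWS side; it is simplified (the real sky.provision.aws.query_instances takes a provider_config and uses a cached per-thread session), and the tag key below is an assumption:

    import boto3

    def query_instances(cluster_name: str, region: str,
                        non_terminated_only: bool = True) -> dict:
        """Returns {instance_id: state_name} for instances tagged with the
        cluster name, optionally excluding terminated ones."""
        filters = [{
            'Name': 'tag:ray-cluster-name',  # assumed cluster-name tag key
            'Values': [cluster_name],
        }]
        if non_terminated_only:
            filters.append({
                'Name': 'instance-state-name',
                # exclude 'shutting-down' or 'terminated' states
                'Values': ['pending', 'running', 'stopping', 'stopped'],
            })
        ec2 = boto3.client('ec2', region_name=region)
        states = {}
        paginator = ec2.get_paginator('describe_instances')
        for page in paginator.paginate(Filters=filters):
            for reservation in page['Reservations']:
                for inst in reservation['Instances']:
                    states[inst['InstanceId']] = inst['State']['Name']
        return states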

@suquark (Collaborator, Author):

Let me leave a comment about it.

@suquark suquark merged commit fe3360d into master Jul 26, 2023
@suquark suquark deleted the migrate_query branch July 26, 2023 23:19
@concretevitamin (Member) left a comment

Hey @suquark - sorry for not spotting a potential issue. PTAL.

Comment on lines -758 to -761

    retry_stderrs=[
        'Unable to locate credentials. You can configure credentials by '
        'running "aws configure"'
    ])
@concretevitamin (Member) commented Jul 27, 2023

When reviewing #2314, I realized this code (originally added in #1988) was accidentally left out from this PR/master branch.

Could we add it back? #1988 has context. TL;DR: previously, users encountered "ec2 describe-instances" throwing NoCredentialsError with this message (the programmatic client may only include the "Unable to locate credentials" part in the exception message).
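A rough sketch of what re-adding the retry could look like on the programmatic path; the retry count and backoff are illustrative, not taken from #1988:

    import time

    from botocore import exceptions as botocore_exceptions

    def describe_instances_with_retry(ec2_client, filters, max_retries: int = 3):
        """Retries describe_instances on the credential error seen in #1988."""
        for attempt in range(max_retries):
            try:
                return ec2_client.describe_instances(Filters=filters)
            except botocore_exceptions.NoCredentialsError:
                # The programmatic equivalent of the 'Unable to locate
                # credentials' stderr that the old CLI path retried on.
                if attempt == max_retries - 1:
                    raise
                time.sleep(2 ** attempt)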

Member:

Actually, this reminds me of another problem.

Previously, we retried by issuing another "ec2 describe-instances" CLI command in a subprocess. This probably means a new underlying boto3 client was created for each retry, which could be why retrying mitigated this problem, e.g., by abandoning a malfunctioning client.

With this PR, even if we add retry back, it'll access

    @functools.lru_cache()
    def client(service_name: str, **kwargs):
        """Create an AWS client of a certain service.

        Args:
            service_name: AWS service name (e.g., 's3', 'ec2').
            kwargs: Other options.
        """
        # Need to use the client retrieved from the per-thread session
        # to avoid thread-safety issues (Directly creating the client
        # with boto3.client() is not thread-safe).
        # Reference: https://stackoverflow.com/a/59635814
        return session().client(service_name, **kwargs)
which is LRU-cached per thread(?). So if we retry using the same thread, it may not have the same effect.

This is all speculation since we don't have a reliable repro. That said, could we somehow force create a new boto3 client when we retry?
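One possible direction, sketched under the assumption that the lru_cache-ed client() helper above is the only place the client is created: drop the cached entry between attempts so each retry builds a fresh boto3 client and session.

    import functools

    import boto3

    @functools.lru_cache()
    def client(service_name: str, **kwargs):
        # Stand-in for the cached helper quoted above.
        return boto3.session.Session().client(service_name, **kwargs)

    def describe_instances_with_fresh_client(filters, max_retries: int = 3):
        """Retries the query, discarding the cached client between attempts."""
        for attempt in range(max_retries):
            try:
                return client('ec2').describe_instances(Filters=filters)
            except Exception:  # in practice, narrow this to the observed errors
                if attempt == max_retries - 1:
                    raise
                # Force the next client() call to construct a new client, which
                # mimics what the old subprocess-per-retry behavior achieved.
                client.cache_clear()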
