[AWS] Adopt new provisioner to query clusters #2288
Conversation
Thanks @suquark!
    node_status_dict = provision_lib.query_instances(
        cloud_name, cluster_name, provider_config)
    node_statuses = list(node_status_dict.values())
except Exception as e:  # pylint: disable=broad-except
What exceptions can be thrown by provision_lib.query_instances()? Should we document that?
Also, how would the caller of _query_cluster_status_via_cloud_api() handle them?
Does the previous codepath allow throwing any exceptions?
The previous code path only checks the return code of the AWS CLI, so it catches general exceptions. We inherit this behavior here.
Can we add a
Raises:
    exceptions.ClusterStatusFetchingError: the cluster status cannot be
        fetched from the cloud provider.
to
- _query_cluster_status_via_cloud_api
- _update_cluster_status_no_lock
Just some code gardening.
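Not the PR's actual implementation, but a minimal sketch of where the suggested Raises: section and the exception wrapping could go; the signature, return type, import paths, and error message below are assumptions for illustration only:

from typing import Any, Dict, List, Optional

# Import paths are assumptions for illustration.
from sky import exceptions
from sky import provision as provision_lib


def _query_cluster_status_via_cloud_api(
        cloud_name: str,
        cluster_name: str,
        provider_config: Optional[Dict[str, Any]] = None) -> List[Any]:
    """Queries the cluster status via the cloud provider's API.

    Raises:
        exceptions.ClusterStatusFetchingError: the cluster status cannot be
            fetched from the cloud provider.
    """
    try:
        node_status_dict = provision_lib.query_instances(
            cloud_name, cluster_name, provider_config)
        node_statuses = list(node_status_dict.values())
    except Exception as e:  # pylint: disable=broad-except
        # Mirror the previous CLI-based behavior: any failure while talking
        # to the cloud provider is surfaced as a single, documented error.
        raise exceptions.ClusterStatusFetchingError(
            f'Failed to query status of cluster {cluster_name!r}: {e}') from e
    return node_statuses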
@Michaelvll could you help run the smoke tests? Most of the tests pass in my env; the failed one may be related to my resource limit. Thanks!
LGTM. I'm running pytest tests/test_smoke.py --aws.
These smoke tests failed for the same reason:
They all grepped for RUNNING but found FAILED_CONTROLLER at some point. Their controller logs show a bunch of
Checked the console: the running node count is correct. However, the current PR may have included some terminated node(s) with the cluster name tag, hence the count-mismatch errors. Could you reproduce this? Should we do something like get_nonterminated_nodes()?
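To make the get_nonterminated_nodes() idea concrete, here is a hedged sketch of counting only non-terminated instances with boto3; the helper name and the ray-cluster-name tag key are assumptions for illustration, not the PR's code:

import boto3


def count_nonterminated_instances(cluster_name: str, region: str) -> int:
    """Count instances tagged with the cluster name, excluding terminated
    and shutting-down ones, so recently terminated nodes do not inflate
    the node count."""
    ec2 = boto3.client('ec2', region_name=region)
    paginator = ec2.get_paginator('describe_instances')
    count = 0
    for page in paginator.paginate(Filters=[
        {
            'Name': 'instance-state-name',
            'Values': ['pending', 'running', 'stopping', 'stopped'],
        },
        {
            'Name': 'tag:ray-cluster-name',  # assumed tag key
            'Values': [cluster_name],
        },
    ]):
        for reservation in page['Reservations']:
            count += len(reservation['Instances'])
    return count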
Actually, another test was failing:
with log
I have fixed the failed tests mentioned (
LGTM, thanks a bunch @suquark! All smoke tests passed on my end as well.
@@ -43,6 +43,7 @@ def query_instances(
     provider_name: str,
     cluster_name: str,
     provider_config: Optional[Dict[str, Any]] = None,
+    non_terminated_only: bool = True,
For discussion: Does it make sense to not expose this and always assume non_terminated_only=True? Will there be callers who would want this to be False?
stop() and terminate(), for example, already implicitly assume non-terminated instances, e.g.,
filters = [
{
'Name': 'instance-state-name',
# exclude 'shutting-down' or 'terminated' states
'Values': ['pending', 'running', 'stopping', 'stopped'],
},
*_cluster_name_filter(cluster_name),
]
This is also similar to the node providers' design of get_nonterminated_nodes().
We can certainly leave this for the future.
Let me leave a comment about it.
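For reference, a hedged sketch of how the non_terminated_only flag could toggle the EC2 state filter quoted above; the helper name and tag key are made up for illustration:

from typing import Any, Dict, List


def _build_filters(cluster_name: str,
                   non_terminated_only: bool = True) -> List[Dict[str, Any]]:
    """Build EC2 describe-instances filters for a cluster."""
    if non_terminated_only:
        # Same states that stop()/terminate() implicitly assume.
        states = ['pending', 'running', 'stopping', 'stopped']
    else:
        # Also include instances that are being, or have been, torn down.
        states = ['pending', 'running', 'shutting-down', 'terminated',
                  'stopping', 'stopped']
    return [
        {'Name': 'instance-state-name', 'Values': states},
        {'Name': 'tag:ray-cluster-name', 'Values': [cluster_name]},  # assumed tag key
    ]

Keeping the flag with a True default would preserve the node providers' non-terminated semantics while leaving room for a future caller that also needs terminated instances.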
Hey @suquark - sorry for not spotting a potential issue. PTAL.
retry_stderrs=[
    'Unable to locate credentials. You can configure credentials by '
    'running "aws configure"'
])
When reviewing #2314, I realized this code (originally added in #1988) was accidentally left out of this PR/master branch.
Could we add it back? #1988 has context. TL;DR: previously, users encountered "ec2 describe-instances" throwing NoCredentialsError with this message (the programmatic client may only throw the "Unable to locate credentials" part as the exception message).
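For reference, a minimal sketch of retrying on the programmatic exception (instead of grepping stderr, as the old CLI-based path did); the helper name and retry parameters are assumptions:

import time

from botocore import exceptions as botocore_exceptions


def _retry_on_missing_credentials(query_fn, max_retries: int = 3,
                                  backoff_seconds: float = 1.0):
    """Retry query_fn when boto3 transiently fails to locate credentials,
    mirroring the retry_stderrs behavior of the old CLI-based path."""
    for attempt in range(max_retries):
        try:
            return query_fn()
        except botocore_exceptions.NoCredentialsError:
            if attempt == max_retries - 1:
                raise
            time.sleep(backoff_seconds * (attempt + 1))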
Actually, this reminds me of another problem.
Previously, we retried by issuing another CLI command ("ec2 describe-instances") via a subprocess, which probably means a new underlying boto3 client was created for each retry. That could be why the retry mitigated this problem, e.g., by abandoning a malfunctioning client.
With this PR, even if we add the retry back, it'll access
Lines 64 to 76 in 0bdfc31
@functools.lru_cache()
def client(service_name: str, **kwargs):
    """Create an AWS client of a certain service.

    Args:
        service_name: AWS service name (e.g., 's3', 'ec2').
        kwargs: Other options.
    """
    # Need to use the client retrieved from the per-thread session
    # to avoid thread-safety issues (Directly creating the client
    # with boto3.client() is not thread-safe).
    # Reference: https://stackoverflow.com/a/59635814
    return session().client(service_name, **kwargs)
This is all speculation, since we don't have a reliable repro. That said, could we somehow force-create a new boto3 client when we retry?
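One possible way to force a fresh client is to clear the lru_cache on the cached helper before retrying; a hedged sketch, assuming the module path sky.adaptors.aws for the client() helper quoted above:

from sky.adaptors import aws  # assumed module path of the cached client() helper


def reset_cached_aws_client() -> None:
    """Drop the cached client entries so the next aws.client(...) call goes
    through session().client(...) again and builds a fresh boto3 client.

    Note: if the per-thread session() helper is also lru_cached, it would
    need its own cache_clear() to be fully recreated.
    """
    aws.client.cache_clear()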
Adopt the new provisioner to query cluster status, i.e., instances and their statuses.
Tested (run the relevant ones):
bash format.sh
pytest tests/test_smoke.py
pytest tests/test_smoke.py::test_fill_in_the_name
bash tests/backward_comaptibility_tests.sh