-
Notifications
You must be signed in to change notification settings - Fork 539
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add user identity to cluster status to avoid leakage when switching account #1513
Conversation
176b3a7
to
b164b73
Compare
fdbd895
to
aff8e61
Compare
6a193be
to
e6c7b0c
Compare
a3c6295
to
b0578e2
Compare
b0578e2
to
5446452
Compare
0b9429f
to
f07629a
Compare
Tested (d0f1888):
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks; did a pass over the new changes.
Minor: Can we make the following errors for various CLI calls return exit code 1?
» AWS_PROFILE=AdministratorAccess-1234 sky start jump3
Restarting 1 cluster: jump3. Proceed? [Y/n]:
Cluster 'jump3' (AWS) is owned by account 'yyyyy', but the currently activated account is 'xxxxx'.
sky/core.py
Outdated
except exceptions.ClusterNotUpError as e: | ||
with ux_utils.print_exception_no_traceback(): | ||
e.message += ( | ||
f'\n auto{option_str} can only be set/unset for ' | ||
f'{global_user_state.ClusterStatus.UP.value} clusters.') | ||
raise e from None | ||
except exceptions.NotSupportedError as e: | ||
with ux_utils.print_exception_no_traceback(): | ||
e.message += (f'\n auto{option_str} is only supported by backend: ' | ||
f'{backends.CloudVmRayBackend.NAME}') | ||
raise e from None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does it make sense to establish the convention that programmatic APIs should throw full stacktraces if possible (to ease debugging)? I think it's ok to defer to the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is a good point! In that case, we may want to change the behavior of ux_utils.print_exception_no_traceback
based on the entrypoint (CLI or programmatic API)
That's right! Thanks. Potential issue (1) Intentionally passing a wrong profile string
(2) Passing the correct profile string
Aside: I now think that carefully allowing certain operations on a non-owned cluster may not be worth the effort. This is because it's harder for us to get right or for users to remember the logic. On the other hand, if it's clear cut the user doesn't need to learn new concepts: if one doesn't own a cluster, no operations are allowed. That said, let's ship it and see what user feedback is ;) |
dd6171d
to
577a38f
Compare
Thanks for the review @concretevitamin!
This is a good call, but a bit difficult to decide the boundary. Since our
Is it because only the spot controller has the autostop set up? We will only check the identity when we need to query the cloud CLI. For those clusters without autostop set up, there is no need to check or warn since the operator should work. However, for the cluster with autostop set up, there is no guarantee that the following codes will work, i.e. the warning might be useful as a disclaimer. Wdyt?
Modified the logs based on the suggestions in the previous comments. The logging for a non-UP cluster will look like the following:
I agree that the user may find the boundary a bit complicated, and we can modify it based on the feedback. One thing that may be noted is that we are not making multi-account officially supported after this PR, as that requires more detailed edge case handling. Instead, I would rather regard this PR as a safeguard for the user trying to switch from one account to another entirely and stay with the new account. ; ) |
As discussed offline, we decide to ban all operations for the identity mismatch for now. PTAL. The new behavior will be:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks for the great work @Michaelvll!
Co-authored-by: Zongheng Yang <[email protected]>
…amin/sky-experiments into add-user-identity-to-cluster
…nto add-user-identity-to-cluster
Thanks for the excellent review @concretevitamin! Just tested again with the smoke tests and it works. (7874618) |
Describe the changes in this PR:
With #1489, when a user switches account,
sky status -r
will remove the previously launched cluster from the table due to the cluster cannot be found in the new account, causing resource leakage. It could happen whenever user switches account on any cloud. This is to get rid of the problem.IMPORTANT: A user with multiple identities for a cloud need to do the following things after upgrading to this PR:
sky status -r
so that the owner information can be correctly updated in our cluster table cache.A normal user will not be affected by the PR.
Tested (run the relevant ones):
sky launch -c min --cloud aws
; switch to another account;sky status -r
sky launch -c min --cloud aws
; switch to another account;sky launch -c min
;sky down min
;sky stop min
;sky start min
sky launch -c min --cloud gcp
; switch to another account;sky status -r
sky launch -c min --cloud azure
; switch to another account;sky status -r
./tests/run_smoke_tests.sh
(except for TPU VM tests Add safe guard for provisioning/terminating TPU VM and fix spot launch TPU resource leak #1500 and list source files [Storage] Uploading list of files to bucket fails #1510)../tests/backward_compatibility.sh