Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Spot] Fix spot failure reason when cloud is specified #1714

Merged
merged 8 commits into from
Feb 23, 2023

Conversation

Michaelvll
Copy link
Collaborator

@Michaelvll Michaelvll commented Feb 23, 2023

Patches #1655

When the cloud is specified, we don't raise the ResourcesUnavailableError but the actual precheck errors. We need to catch those exceptions in the recovery_strategy.

Tested (run the relevant ones):

  • sky spot launch -n spot-maskgit-minerl-maskgit-interp --cloud gcp echo hi

@Michaelvll Michaelvll changed the title Fix spot failure reason when cloud is specified [Spot] Fix spot failure reason when cloud is specified Feb 23, 2023
@infwinston
Copy link
Member

Thanks for the fix! @Michaelvll Looks like in this PR we catch three errors

except (exceptions.InvalidClusterNameError,
                    exceptions.NotSupportedError,
                    exceptions.CloudUserIdentityError) as e:

I'm thinking how to make sure we won't miss out more/keep tracking new errors in the future.
Should we update InvalidClusterNameError to the docstring of sky.launch()?

skypilot/sky/execution.py

Lines 393 to 397 in 71697f7

Raises:
exceptions.ClusterOwnerIdentityMismatchError: if the cluster is
owned by another user.
exceptions.ResourcesUnavailableError: if the requested resources
cannot be satisfied. The failover_history of the exception

or maybe beyond this PR, should we try to categorize errors into a fixed/smaller set of categories and include details in e.reason? like maybe ClusterOwnerIdentityMismatchError and CloudUserIdentityError can be grouped together? Correct me if I'm wrong as I'm not super familiar with potential errors and how we handled them.

@Michaelvll
Copy link
Collaborator Author

I'm thinking how to make sure we won't miss out more/keep tracking new errors in the future.

Actually, let us only catch the InvalidClusterNameError for now to fix the issue current issue first. I would say one thing we need to do for unhandled exceptions, we may want to complete the list of Raises: in all the functions, so it would be easier for us to handle the exceptions in the caller.

I was trying to find a tool for python that allows java-style error signatures for functions but failed to find it.

type method (arguments) throws Exception1, Exception2, … {  }

Should we update InvalidClusterNameError to the docstring of sky.launch()?

Good point! Added to the docstr.

or maybe beyond this PR, should we try to categorize errors into a fixed/smaller set of categories and include details in e.reason? like maybe ClusterOwnerIdentityMismatchError and CloudUserIdentityError can be grouped together?

Yes, we need to make the exceptions simpler and easier to track in our backend, but also to make sure not overly use an exception like ResourcesUnavailableError. One example is that the ProvisionPrechecksError should not only be used in the spot code path, but also be used directly in the backend for combining any exceptions that happen before we actually try to provision the cluster. Let us create an issue and handle it in the future. Wdyt?

Copy link
Member

@infwinston infwinston left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good! let's raise an issue for further improvement.

sky/backends/cloud_vm_ray_backend.py Outdated Show resolved Hide resolved
@Michaelvll Michaelvll merged commit c94b43b into master Feb 23, 2023
@Michaelvll Michaelvll deleted the fix-spot-failure-reason branch February 23, 2023 07:35
sumanthgenz pushed a commit to sumanthgenz/skypilot that referenced this pull request Mar 15, 2023
…1714)

* Fix spot failure reason when cloud is specified

* format

* fix

* format

* Update exceptions in docstr

* fix docstr

* Add error

* format
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants