-
Notifications
You must be signed in to change notification settings - Fork 544
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix spot recovery without cloud specified #1077
Conversation
launched_region = handle.launched_resources.region | ||
new_resources = resources.copy(region=launched_region) | ||
new_resources = resources.copy(cloud=launched_cloud, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Q: why would specifying region only not work here? It should work in a new task.
Maybe I'm not understanding this:
This fixes the problem caused by #1014, which disables launching the cluster with region specified but without cloud specified.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Q: why would specifying region only not work here? It should work in a new task.
In #1014, we change the behavior in a new task by enforcing the user to specify the cloud when the region is specified to be consistent with the zone.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. Is there a reason that has to be the case? That seems to introduce frictions:
- We can easily deduce the cloud from a region/zone
- Imagine a single-cloud user - it's clear to that user what the cloud is
This is an orthogonal issue that shouldn't block this PR. cc @infwinston
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Those are great points! I think the reason is mainly for potential conflicts among zones in different clouds. @infwinston can say more about this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason was mainly to avoid potential confusion and future region/zone name conflicts.
For example, users may run into this typo between
us-west-1a
and us-west1-a
where the former is a zone in AWS, the latter one is in GCP.
If we automatically derive cloud then we might unintentionally launch a wrong one for users.
Also, it avoids potential region/zone name conflicts with other clouds.
Since specifying region/zone can be considered an advanced usage, I think it'd be better to ask more clarity from users. Does that makes sense?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These make sense, but overall I still think deducing could be better:
- The typo cases would be caught by confirmation
- Conflicts can be resolved when they arise.
- However, single-cloud users are most prevalent now and their UX problem (requiring them to type something redundant) doesn't have a workaround.
This fixes the problem caused by #1014, which disables launching the cluster with region specified but without cloud specified. That makes the spot recovery strategy fails to recover a spot task without cloud specified.