[Serve] Fail early for user app failure and expose failure reasons #3411
…il-early-for-user-error
…arly-for-user-error
Thanks! It looks mostly great for me. Left some nits and after all smoke tests added and passed it should be good to go 🫡
sky/serve/autoscalers.py
if info.is_ready:
    self.latest_version_ever_ready = self.latest_version
elif (info.status_property.unrecoverable_failure() and
      self.latest_version_ever_ready < self.latest_version):
    # Stop scaling if one replica of the latest version
    # failed, it is likely that a fatal error happens to the
    # user application and may lead to an infinite termination
    # and restart.
    return []
Should we do the elif in a new for loop? If replica 1 is unrecoverable and replica 2 is ready, and the order in replica_infos is [replica_1, replica_2], iiuc here we will first execute the loop body for replica 1 and stop scaling, which is not expected as replica 2 is ready. Instead, should we loop over all replicas and update self.latest_version_ever_ready first?
Good point! When unrecoverable_failure() happens, it is likely all the replicas will fail, but changing it to two loops should be safer.
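For reference, a minimal sketch of the two-loop idea discussed above. This is not the code from the PR: ReplicaInfo, Autoscaler, should_stop_scaling, the unrecoverable flag (standing in for info.status_property.unrecoverable_failure()), and the initial value of latest_version_ever_ready are all hypothetical and only illustrate why the order of replica_infos stops mattering.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ReplicaInfo:
    """Hypothetical stand-in for the real replica info object."""
    is_ready: bool
    # Assumption: a plain flag replacing
    # info.status_property.unrecoverable_failure().
    unrecoverable: bool = False


class Autoscaler:
    """Hypothetical autoscaler that keeps only the version bookkeeping."""

    def __init__(self, latest_version: int) -> None:
        self.latest_version = latest_version
        # Assumption: no replica of the latest version has been ready yet.
        self.latest_version_ever_ready = latest_version - 1

    def should_stop_scaling(self, replica_infos: List[ReplicaInfo]) -> bool:
        # Pass 1: record readiness for every replica before judging failures.
        for info in replica_infos:
            if info.is_ready:
                self.latest_version_ever_ready = self.latest_version

        # Pass 2: a failure is treated as fatal only if the latest version
        # has never been ready.
        if self.latest_version_ever_ready < self.latest_version:
            return any(info.unrecoverable for info in replica_infos)
        return False
```

With replica_infos ordered as [replica_1 (unrecoverable), replica_2 (ready)], the first pass marks the latest version as ready, so the second pass does not stop scaling. With a single-loop if/elif, the same input would stop at replica_1.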
Just wondering about edge cases, e.g. a replica that only narrowly misses the initial delay seconds, where the outcome depends on network speed 🤔
Good point! I suppose the latest two-loop version should guard against those concerns : )
@@ -333,17 +340,17 @@ def to_replica_status(self) -> serve_state.ReplicaStatus:
    return serve_state.ReplicaStatus.FAILED
-    return serve_state.ReplicaStatus.FAILED
+    return serve_state.ReplicaStatus.FAILED_USER_APP
should we update this to a status w/ more information as well?
I am thinking of using FAILED to always mean FAILED_USER_APP, as FAILED represents a user app error in managed spot jobs. Wdyt? cc'ing @concretevitamin @romilbhardwaj for some input here as well : )
Oh, if that is the case, agreed that we keep it aligned w/ spot jobs ;)
LGTM! thx :)
Co-authored-by: Tian Xia <[email protected]>
…arly-for-user-error
…kypilot into fail-early-for-user-error
Tested (run the relevant ones):
bash format.sh
pytest tests/test_smoke.py
pytest tests/test_smoke.py --serve
pytest tests/test_smoke.py::test_fill_in_the_name
pytest tests/test_smoke.py::test_skyserve_user_bug_restart
pytest tests/test_smoke.py::test_skyserve_failures
bash tests/backward_comaptibility_tests.sh