[Serve] Fail early for user app failure and expose failure reasons #3411

Michaelvll · 2024-04-02T23:31:24Z

Tested (run the relevant ones):

- Code formatting: bash format.sh
- Any manual or new tests for this PR (please specify below)
- All smoke tests: pytest tests/test_smoke.py
  - pytest tests/test_smoke.py --serve
- Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  - pytest tests/test_smoke.py::test_skyserve_user_bug_restart
  - pytest tests/test_smoke.py::test_skyserve_failures
  - pytest tests/test_smoke.py --serve
- Backward compatibility tests: bash tests/backward_comaptibility_tests.sh

cblmemo

Thanks! It looks mostly great to me. I left some nits; once all smoke tests are added and pass, it should be good to go 🫡

sky/serve/autoscalers.py

cblmemo · 2024-04-04T00:52:04Z

+                    if info.is_ready:
+                        self.latest_version_ever_ready = self.latest_version
+                elif (info.status_property.unrecoverable_failure() and
+                      self.latest_version_ever_ready < self.latest_version):
+                    # Stop scaling if one of the replicas of the latest
+                    # version failed. A fatal error has likely occurred in
+                    # the user application, and scaling further may lead to
+                    # an infinite terminate-and-restart loop.
+                    return []


Should we do the elif in a new for loop? If replica 1 is unrecoverable and replica 2 is ready, and the order in replica_infos is [replica_1, replica_2], iiuc we will first execute the loop body for replica 1 and stop scaling, which is not expected since replica 2 is ready. Instead, should we loop over all replicas and update self.latest_version_ever_ready first?

Good point! When unrecoverable_failure() happens, it is likely all the replicas will fail, but changing it to two loops should be safer.

Just wondering about edge cases, e.g. a replica that is about to miss the initial delay seconds, where whether it becomes ready in time depends on network speed 🤔

Good point! I suppose the latest two-loop version should guard against these concerns : )
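For readers following this thread, the two-loop idea can be sketched roughly as below. This is a minimal illustration, not the code merged in this PR: `ReplicaInfo`, `ScalingGuard`, `should_stop_scaling`, and the `unrecoverable` flag are hypothetical stand-ins, and the version comparison is an assumption about how the surrounding loop filters replicas.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ReplicaInfo:
    """Hypothetical, simplified stand-in for a replica record."""
    replica_id: int
    version: int
    is_ready: bool = False
    # Stand-in for info.status_property.unrecoverable_failure().
    unrecoverable: bool = False


class ScalingGuard:
    """Tracks the newest version that has ever had a ready replica."""

    def __init__(self, latest_version: int) -> None:
        self.latest_version = latest_version
        self.latest_version_ever_ready = -1

    def should_stop_scaling(self, replica_infos: List[ReplicaInfo]) -> bool:
        # Loop 1: record readiness for all replicas first, so the decision
        # does not depend on the order of replica_infos.
        for info in replica_infos:
            if info.version == self.latest_version and info.is_ready:
                self.latest_version_ever_ready = self.latest_version

        # The latest version has been ready at least once: keep scaling.
        if self.latest_version_ever_ready >= self.latest_version:
            return False

        # Loop 2: the latest version was never ready, so any unrecoverable
        # failure on it suggests a fatal user application error.
        for info in replica_infos:
            if info.version == self.latest_version and info.unrecoverable:
                return True
        return False
```

With the order [replica_1, replica_2] from the comment above (replica 1 unrecoverable, replica 2 ready), this sketch keeps scaling, because readiness is recorded before any failure is inspected.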

sky/serve/replica_managers.py

cblmemo · 2024-04-04T00:56:04Z

@@ -333,17 +340,17 @@ def to_replica_status(self) -> serve_state.ReplicaStatus:
                return serve_state.ReplicaStatus.FAILED


Suggested change

- return serve_state.ReplicaStatus.FAILED
+ return serve_state.ReplicaStatus.FAILED_USER_APP

Should we update this to a status w/ more information as well?

I am thinking of using FAILED to always mean a user app failure (i.e. FAILED_USER_APP), as FAILED represents a user app error in managed spot jobs. Wdyt? cc'ing @concretevitamin @romilbhardwaj for some input here as well : )

Oh, if that is the case, agreed that we keep it aligned w/ spot jobs ;)
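As a rough illustration of the convention agreed on here, the sketch below keeps a plain FAILED for user app errors and uses more specific statuses for other failure reasons. The enum members other than FAILED, the `classify` helper, and all of its parameters are hypothetical names chosen for this example; they are not taken from sky/serve/serve_state.py.

```python
import enum


class ReplicaStatus(enum.Enum):
    """Hypothetical subset of replica statuses, for illustration only."""
    STARTING = 'STARTING'
    READY = 'READY'
    # Plain FAILED is reserved for user app errors, mirroring how FAILED
    # is described for managed spot jobs in the discussion above.
    FAILED = 'FAILED'
    FAILED_INITIAL_DELAY = 'FAILED_INITIAL_DELAY'
    FAILED_PROBING = 'FAILED_PROBING'


def classify(user_app_failed: bool, probe_ok: bool, ever_ready: bool,
             initial_delay_expired: bool) -> ReplicaStatus:
    """Maps simplified replica facts to a status (assumed logic)."""
    if user_app_failed:
        return ReplicaStatus.FAILED
    if probe_ok:
        return ReplicaStatus.READY
    if ever_ready:
        # The replica was ready once but the probe fails now.
        return ReplicaStatus.FAILED_PROBING
    if initial_delay_expired:
        # The replica never became ready within the initial delay.
        return ReplicaStatus.FAILED_INITIAL_DELAY
    return ReplicaStatus.STARTING
```

Checking the user app failure first means a crashed application is always reported as FAILED, whatever the probe history looks like.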

sky/serve/serve_state.py

tests/skyserve/failures/probing.py

tests/skyserve/spot/base_ondemand_fallback.yaml

tests/test_smoke.py

cblmemo

LGTM! thx :)

sky/serve/autoscalers.py

sky/serve/replica_managers.py

cblmemo · 2024-04-05T02:00:37Z

sky/serve/replica_managers.py

@@ -333,17 +340,17 @@ def to_replica_status(self) -> serve_state.ReplicaStatus:
                return serve_state.ReplicaStatus.FAILED


Oh, if that is the case, agreed that we keep it aligned w/ spot jobs ;)

Michaelvll added 16 commits April 1, 2024 22:08

expose detailed replica failure and rename service failure to crash loop

2ed2419

fix path in test

6ee1b3b

add target qps

a6594f3

shorter wait time

a0c7359

fix smoke

f6a89c2

fix smoke test

6b9348b

add ; back

12db511

Revert crash loop

e4936fc

update failed_status

2beb015

typo

c20fa14

format

6256b71

Add initial delay failure

e558410

Add initial delay failure

5a6a091

format

837ac83

do not scale when user app fails

75c5069

Merge branch 'service-ux' of github.com:skypilot-org/skypilot into fail-early-for-user-error

c3fff15

Michaelvll mentioned this pull request Apr 2, 2024

[Serve] Expose detailed replica failure reason and rename service failure to crash loop #3403

Closed

6 tasks

Michaelvll added 13 commits April 2, 2024 23:47

format

d8f812e

Add tests for failure statuses

6e6a662

Merge branch 'master' of github.com:skypilot-org/skypilot into fail-early-for-user-error

f63f5e9

syntax error

4260891

make service termination more robust

6b0ed01

fix smoke test

0e89159

Fix permission issue if not tpu is not needed

afefa2d

fix test

8ca9a8a

fail early for initial delay timeout

b91b28c

format

40866bf

format

d3cedaa

remove unecessary logger

b0d9e23

fix

7957567

explicit annotation of scale down

ebfd136

Michaelvll requested a review from cblmemo April 3, 2024 07:56

Michaelvll added 3 commits April 3, 2024 21:48

fix logs

f8a8d7d

format

8841b4c

fix test with non spot

137db2a

Michaelvll marked this pull request as ready for review April 4, 2024 00:41

cblmemo reviewed Apr 4, 2024

Address comments

d7ea851

Michaelvll requested review from cblmemo, concretevitamin and MaoZiming April 4, 2024 03:43

Michaelvll added 2 commits April 4, 2024 03:47

add comments

5bf863f

longer time

f2b3545

cblmemo approved these changes Apr 5, 2024

Michaelvll and others added 5 commits April 5, 2024 16:45

Update sky/serve/autoscalers.py

8d9bcd7

Co-authored-by: Tian Xia <[email protected]>

Update sky/serve/autoscalers.py

23d5a67

Co-authored-by: Tian Xia <[email protected]>

Update sky/serve/replica_managers.py

7eb30ef

Co-authored-by: Tian Xia <[email protected]>

Merge branch 'master' of github.com:skypilot-org/skypilot into fail-early-for-user-error

9ec2924

Merge branch 'fail-early-for-user-error' of github.com:skypilot-org/skypilot into fail-early-for-user-error

1ef7388

Michaelvll merged commit 48a5c63 into master Apr 7, 2024
20 checks passed

Michaelvll deleted the fail-early-for-user-error branch April 7, 2024 03:24

cblmemo mentioned this pull request Jun 21, 2024

[Serve][UX] Fine-grained reason for a replica failure #3180

Closed
