[Core] skylet periodically update job status #3469
Merged
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
The periodic job status update was accidentally dropped due to the job queue scheduling. This will cause job failed in
ray job
failed to be set toFAILED
but stuck in RUNNING.The following could reproduce the issue:
git reset --hard 1bdc
(the commit before the fix for the ray.wait timeout, [Core] Fix jobs longer than 12 days #3460)timeout=600
forray.wait
to simulate the issue incloud_vm_ray_backend.py
sky serve up -n test-index-recover-2 examples/serve/http_server/task.yaml --cloud gcp
After 10 mins, the serve controller job fails, but the service is shown
READY
forever, andsky queue sky-serve-controller-<hash>
shows the controller job is running, whileray job list
shows the job is failed already on the controller VM.Tested (run the relevant ones):
bash format.sh
pytest tests/test_smoke.py
pytest tests/test_smoke.py::test_fill_in_the_name
bash tests/backward_comaptibility_tests.sh