[Bug] Update wait function in test_detached_actor #635
Merged: DmitriGekhtman merged 9 commits into ray-project:master from kevin85421:fix-wait-function on Oct 24, 2022
This looks good! Just one request for a bit more documentation.
DmitriGekhtman approved these changes on Oct 21, 2022
Let's wait for CI to finish.
DmitriGekhtman approved these changes on Oct 24, 2022
lowang-bh pushed a commit to lowang-bh/kuberay that referenced this pull request on Sep 24, 2023
This commit implements a wait function for head pod restart in test_detached_actor.
Why are these changes needed?
In KinD E2E tests, we use kubectl wait to block the process while the system is not ready. However, kubectl wait --for=condition=Ready is not reliable right after a resource is deleted. See the Example 1 section in #618 for more details.
[Example]
The test test_detached_actor kills GCS on the head pod and uses kubectl wait to make sure that the new head pod is ready. However, in my experiment, the head pod needs about 60 seconds to crash after the GCS server is killed, and it stays at READY: 1/1, STATUS: Running until the crash. Hence, kubectl wait can return while the old head pod is still running, so it cannot guarantee that the new head pod is ready.
kuberay/tests/compatibility-test.py, lines 319 to 327 in ea6e8d1
The workaround in #619 replaces kubectl wait with time.sleep(180). This PR implements a wait function for head pod restart.
Explanations for some changes
Kill the gcs_server process on the head pod
Replace ps aux | grep gcs_server | ... | xargs kill with pkill gcs_server. The results of pgrep gcs_server and ps aux | grep gcs_server | grep -v grep | awk '{print $2}' are the same.
restart_count
The restart policy of the head pod is Always. Hence, when the GCS server is killed, the old head pod is restarted rather than a new one being created.
If the old head pod is restarted, restart_count will increase by 1.
If a new head pod is created, restart_count will become 0.
Use restart_count to ensure the head pod has actually died.
ray:nightly is buggy. It is likely to create a new head pod instead of restarting the old one. Hence, restart_count will become 0 and this test will fail.
When all containers in pods are "READY" and all pods are "Running", it still takes tens of seconds for all Ray processes to become ready.
time.sleep(post_wait_sec)
retry_with_timeout(lambda: ray.init(address='ray://127.0.0.1:10001', ...), timeout = 180)
Does ray.init have any timeout argument?
Why do I pass client.CoreV1Api() as a function argument?
If client.CoreV1Api() instances are created in each function call, unittest will report "ResourceWarning: unclosed SSLSocket".
Related issue number
This PR solved a part of #618.
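For readers who want to see the restart_count check and the retry idea from "Explanations for some changes" in concrete form, here is a minimal sketch. It is not the code merged in this PR. The signature of retry_with_timeout is guessed from the call shown above. The names head_pod_status and wait_for_head_pod_restart, the default namespace, and the label selector ray.io/node-type=head are assumptions made for illustration. The API object is passed in as an argument, as the description suggests, so a single client.CoreV1Api() instance can be reused.

```python
import time


def retry_with_timeout(func, timeout=180, interval=1.0):
    """Call func() until it stops raising or `timeout` seconds have elapsed.

    Sketch only: the signature is inferred from the call in the PR text.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            return func()
        except Exception as err:
            if time.monotonic() >= deadline:
                raise TimeoutError(f"gave up after {timeout} seconds") from err
            time.sleep(interval)


def head_pod_status(k8s_v1_api, namespace, label_selector):
    """Return (restart_count, all_containers_ready) for the first matching pod.

    `k8s_v1_api` is expected to behave like kubernetes.client.CoreV1Api().
    """
    pods = k8s_v1_api.list_namespaced_pod(
        namespace=namespace, label_selector=label_selector
    )
    if not pods.items:
        raise RuntimeError("no head pod found")
    statuses = pods.items[0].status.container_statuses or []
    if not statuses:
        raise RuntimeError("head pod has no container status yet")
    return statuses[0].restart_count, all(s.ready for s in statuses)


def wait_for_head_pod_restart(
    k8s_v1_api,
    old_restart_count,
    namespace="default",                      # assumption
    label_selector="ray.io/node-type=head",   # assumption
    timeout=300,
    interval=1.0,
):
    """Block until restart_count exceeds old_restart_count and the pod is ready."""

    def check():
        count, ready = head_pod_status(k8s_v1_api, namespace, label_selector)
        if count <= old_restart_count or not ready:
            raise RuntimeError(f"restart_count={count}, ready={ready}")
        return count

    return retry_with_timeout(check, timeout=timeout, interval=interval)
```

Against a real cluster this would be called roughly as wait_for_head_pod_restart(client.CoreV1Api(), old_restart_count), after loading the kubeconfig with config.load_kube_config() from the kubernetes package. If a new head pod is created instead of the old one being restarted, restart_count resets to 0 and this sketch times out, which matches the failure mode described for ray:nightly.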