Search before asking
I searched the issues and found no similar issues.
KubeRay Component
ci
What happened + What you expected to happen
In KinD E2E tests, we use kubectl wait so that the test blocks only while the system is not yet ready. However, it is not a good idea to use kubectl wait --for=condition=Ready after deleting a resource.
[Example 1]
The test test_detached_actor kills the GCS process on the head pod and uses kubectl wait to make sure that the new head pod is ready. However, in my experiments, the head pod takes about 60 seconds to crash after the GCS server is killed, and until then it still shows READY: 1/1, STATUS: Running. Hence, kubectl wait cannot guarantee that the new head pod is ready: it can return as soon as it sees the old, still-Running head pod.

```
kubectl wait --for=condition=Ready pod/$(kubectl get pods -A | grep -e "-head" | awk "{print \$2}") --timeout=900s
```

(See kuberay/tests/compatibility-test.py, lines 319 to 327 in ea6e8d1.)
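One possible mitigation for this case, sketched below with a hypothetical `sh` helper (this is not the code at the permalink above, and it assumes the operator replaces the crashed head pod with a new Pod rather than restarting its container in place), is to record the old head pod's name before killing GCS, wait for that exact pod to be deleted, and only then wait for a head pod to become Ready:

```python
import subprocess

def sh(cmd):
    # Run a shell command, mirroring how the test harness executes "kubectl ..." strings.
    return subprocess.check_output(cmd, shell=True, text=True).strip()

# Record the current (old) head pod name before injecting the failure.
old_head = sh('kubectl get pods -A | grep -e "-head" | awk \'{print $2}\'')
# ... kill the GCS server process on old_head here ...
# Wait for the old head pod to actually go away before waiting for readiness.
sh(f"kubectl wait --for=delete pod/{old_head} --timeout=300s")
sh('kubectl wait --for=condition=Ready pod/$(kubectl get pods -A | grep -e "-head" '
   '| awk \'{print $2}\') --timeout=900s')
```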
[Example 2]
READY: 1/1 does not imply STATUS: Running. Hence, kubectl wait --for=condition=ready sometimes treats a pod with READY: 1/1, STATUS: Terminating as meeting the condition. See the post "How can a pod have status ready and terminating?" for more details. For example:

```
INFO:kuberay_utils.utils:executing cmd: kubectl wait --for=condition=ready pod -l rayCluster=raycluster-compatibility-test --all --timeout=900s
pod/raycluster-external-redis-head-x97nr condition met
pod/raycluster-external-redis-worker-small-group-b66wg condition met
INFO:kuberay_utils.utils:executing cmd: kubectl get pods -A
NAMESPACE   NAME                                                  READY   STATUS        RESTARTS   AGE
default     raycluster-external-redis-head-x97nr                  1/1     Running       0          39s
default     raycluster-external-redis-worker-small-group-9mjgb    1/1     Running       0          39s
default     raycluster-external-redis-worker-small-group-b66wg    1/1     Terminating   0          39s
```
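A more general way to avoid both pitfalls is to poll `kubectl get pods -o json` from the test harness instead of relying on kubectl wait. The sketch below is hypothetical helper code (not existing KubeRay test code): it ignores pods that are Terminating (i.e. have a deletionTimestamp), requires phase Running plus the Ready condition, and can additionally require that the matched pod's UID differs from the pod observed before the fault was injected:

```python
import json
import subprocess
import time

def wait_for_new_ready_pod(label_selector, old_uid=None, timeout_s=900, interval_s=5):
    """Poll until a Running, Ready, non-terminating pod matches label_selector.

    If old_uid is given, the pod must also be a *different* pod than the one
    observed before the failure was injected (Example 1).
    """
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        out = subprocess.check_output(
            ["kubectl", "get", "pods", "-l", label_selector, "-o", "json"], text=True
        )
        for pod in json.loads(out).get("items", []):
            meta, status = pod["metadata"], pod["status"]
            if meta.get("deletionTimestamp"):       # Terminating: ignore (Example 2)
                continue
            if old_uid and meta["uid"] == old_uid:  # still the old pod (Example 1)
                continue
            if status.get("phase") != "Running":
                continue
            if any(c["type"] == "Ready" and c["status"] == "True"
                   for c in status.get("conditions", [])):
                return meta["name"]
        time.sleep(interval_s)
    raise TimeoutError(f"no new Ready pod matching {label_selector!r} within {timeout_s}s")
```

For Example 1, the test would capture the old head pod's UID before killing GCS and pass it as old_uid, so the wait cannot be satisfied by the pod that is about to crash.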
kevin85421 changed the title from "(Draft) [Bug] Revisit the use of kubectl wait to avoid flakiness" to "[Bug] Revisit the use of kubectl wait to avoid flakiness" on Oct 5, 2022.
We should remove all "kubectl wait" commands because kubectl wait does not know the expected state of the RayCluster (e.g., how many Pods there should be). In addition, #704 will also increase the instability.
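Along the same lines, a hedged sketch of what could replace the remaining kubectl wait calls (a hypothetical helper, not an existing KubeRay utility): the test itself states how many Pods the RayCluster is expected to have, and we poll until exactly that many Pods are Running, Ready, and not terminating:

```python
import json
import subprocess
import time

def wait_for_expected_pods(label_selector, expected_count, timeout_s=900, interval_s=5):
    """Poll until exactly `expected_count` Running, Ready, non-terminating pods
    match label_selector."""
    pods, ready = [], []
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        out = subprocess.check_output(
            ["kubectl", "get", "pods", "-l", label_selector, "-o", "json"], text=True
        )
        pods = json.loads(out)["items"]
        ready = [
            p for p in pods
            if not p["metadata"].get("deletionTimestamp")
            and p["status"].get("phase") == "Running"
            and any(c["type"] == "Ready" and c["status"] == "True"
                    for c in p["status"].get("conditions", []))
        ]
        if len(pods) == expected_count and len(ready) == expected_count:
            return
        time.sleep(interval_s)
    raise TimeoutError(
        f"expected {expected_count} Ready pods for {label_selector!r}, "
        f"got {len(ready)} Ready out of {len(pods)}"
    )

# e.g. one head pod and one worker pod in the compatibility test:
# wait_for_expected_pods("rayCluster=raycluster-compatibility-test", expected_count=2)
```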
Reproduction script
Anything else
No response
Are you willing to submit a PR?