[Bug] Revisit the use of kubectl wait to avoid flakiness #618

Closed · 2 tasks done
kevin85421 opened this issue Oct 5, 2022 · 1 comment · Fixed by #1341
Labels: bug (Something isn't working)

kevin85421 commented Oct 5, 2022

Search before asking

  • I searched the issues and found no similar issues.

KubeRay Component

ci

What happened + What you expected to happen

In the KinD E2E tests, we use kubectl wait to block the test process until the system is ready. However, kubectl wait --for=condition=Ready is unreliable when it is issued right after deleting a resource.

[Example 1]
The test test_detached_actor kills the GCS process on the head pod and uses kubectl wait to make sure that the new head pod is ready. However, in my experiment, the head pod takes about 60 seconds to crash after the GCS server is killed, and it remains READY: 1/1, STATUS: Running until then. Hence, kubectl wait may match the old head pod and cannot guarantee that the new head pod is ready.

# Kill the GCS process on the head node. If fate sharing is enabled,
# the whole head node pod will terminate.
utils.shell_assert_success(
    'kubectl exec -it $(kubectl get pods -A| grep -e "-head" | awk "{print \\$2}") -- /bin/bash -c "ps aux | grep gcs_server | grep -v grep | awk \'{print \$2}\' | xargs kill"')
# Wait for the new head node to be created.
time.sleep(10)
# Make sure the new head is ready.
utils.shell_assert_success(
    'kubectl wait --for=condition=Ready pod/$(kubectl get pods -A | grep -e "-head" | awk "{print \$2}") --timeout=900s')
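
For illustration only, a more robust approach would be to record the head pod name before killing GCS and then poll until a head pod with a different name is Running and Ready. This is a minimal sketch under that assumption; wait_for_new_head_pod and its parameters are hypothetical, not existing kuberay_utils helpers.

import subprocess
import time

def wait_for_new_head_pod(old_head_name, timeout_s=900, interval_s=5):
    # Hypothetical helper: poll until the head pod name differs from the
    # pre-kill name AND the pod reports READY 1/1 with STATUS Running,
    # instead of trusting a fixed sleep plus kubectl wait.
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        out = subprocess.run(
            'kubectl get pods -A | grep -e "-head"',
            shell=True, capture_output=True, text=True).stdout.strip()
        for line in out.splitlines():
            # kubectl get pods -A columns: NAMESPACE NAME READY STATUS RESTARTS AGE
            fields = line.split()
            name, ready, status = fields[1], fields[2], fields[3]
            if name != old_head_name and ready == "1/1" and status == "Running":
                return name
        time.sleep(interval_s)
    raise TimeoutError("new head pod never became Ready")

The test would capture the old head pod name right before killing gcs_server and call a helper like this instead of the fixed sleep(10) followed by kubectl wait.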

[Example 2]
READY: 1/1 does not imply STATUS: Running. Hence, kubectl wait --for=condition=ready sometimes considers a pod with READY: 1/1, STATUS: Terminating to have met the condition. See the post "How can a pod have status ready and terminating?" for more details.

  • Link

    INFO:kuberay_utils.utils:executing cmd: kubectl wait --for=condition=ready pod -l rayCluster=raycluster-compatibility-test --all --timeout=900s
    pod/raycluster-external-redis-head-xvhcg condition met
    pod/raycluster-external-redis-worker-small-group-r5s2s condition met
    INFO:kuberay_utils.utils:executing cmd: kubectl get pods -A
    NAMESPACE            NAME                                                 READY   STATUS        RESTARTS   AGE
    default              raycluster-external-redis-head-xvhcg                 1/1     Running       0          37s
    default              raycluster-external-redis-worker-small-group-jdsdd   0/1     Running       0          37s
    default              raycluster-external-redis-worker-small-group-r5s2s   1/1     Terminating   0          37s
    
  • Link

    INFO:kuberay_utils.utils:executing cmd: kubectl wait --for=condition=ready pod -l rayCluster=raycluster-compatibility-test --all --timeout=900s
    pod/raycluster-external-redis-head-x97nr condition met
    pod/raycluster-external-redis-worker-small-group-b66wg condition met
    INFO:kuberay_utils.utils:executing cmd: kubectl get pods -A
    NAMESPACE            NAME                                                 READY   STATUS        RESTARTS   AGE
    default              raycluster-external-redis-head-x97nr                 1/1     Running       0          39s
    default              raycluster-external-redis-worker-small-group-9mjgb   1/1     Running       0          39s
    default              raycluster-external-redis-worker-small-group-b66wg   1/1     Terminating   0          39s
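
A hedged sketch of a readiness check that is not fooled by Terminating pods (the helper name and label-selector argument are illustrative, not part of the existing test utilities): a Terminating pod keeps Ready=True until its containers actually stop, but it always carries metadata.deletionTimestamp, so that field can be checked in addition to the Ready condition.

import json
import subprocess

def ready_non_terminating_pods(label_selector):
    # Hypothetical check: list pods that report the Ready condition AND are
    # not being deleted (i.e. have no metadata.deletionTimestamp).
    out = subprocess.run(
        f"kubectl get pods -l {label_selector} -o json",
        shell=True, capture_output=True, text=True).stdout
    ready = []
    for pod in json.loads(out)["items"]:
        if pod["metadata"].get("deletionTimestamp"):
            continue  # Skip Terminating pods even if they still look Ready.
        conditions = pod.get("status", {}).get("conditions", [])
        if any(c["type"] == "Ready" and c["status"] == "True" for c in conditions):
            ready.append(pod["metadata"]["name"])
    return ready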

Reproduction script

RAY_IMAGE=rayproject/ray:2.0.0 python3 tests/compatibility-test.py RayFTTestCase.test_detached_actor 2>&1 | tee log
RAY_IMAGE=rayproject/ray:2.0.0 python3 tests/compatibility-test.py RayFTTestCase.test_kill_head 2>&1 | tee log

Anything else

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
kevin85421 added the "bug" label on Oct 5, 2022
kevin85421 changed the title from "(Draft) [Bug] Revisit the use of kubectl wait to avoid flakiness" to "[Bug] Revisit the use of kubectl wait to avoid flakiness" on Oct 5, 2022
kevin85421 self-assigned this on Oct 5, 2022
kevin85421 (Member, Author) commented:

We should remove all uses of "kubectl wait" because the command does not know the expected state of the RayCluster (e.g. how many Pods there should be). In addition, #704 will also increase the instability.
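
For illustration, a polling helper along those lines might look like the following. This is only a sketch under the assumption that the caller knows the expected Pod count from the RayCluster spec; the helper name and arguments are hypothetical.

import subprocess
import time

def wait_for_running_pods(label_selector, expected, timeout_s=900, interval_s=5):
    # Hypothetical replacement for "kubectl wait": the caller supplies the
    # expected Pod count (known from the RayCluster spec), and we poll until
    # exactly that many Pods are in the Running phase.
    deadline = time.time() + timeout_s
    running = 0
    while time.time() < deadline:
        out = subprocess.run(
            f"kubectl get pods -l {label_selector} "
            "--field-selector=status.phase=Running --no-headers",
            shell=True, capture_output=True, text=True).stdout.strip()
        running = len(out.splitlines()) if out else 0
        if running == expected:
            return
        time.sleep(interval_s)
    raise TimeoutError(f"expected {expected} Running Pods, last saw {running}")

In practice this could be combined with the deletionTimestamp check sketched above so that Terminating Pods are not counted toward the expected total.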
