[Bug] Revisit the use of kubectl wait to avoid flakiness #618

Closed · 2 tasks done
kevin85421 opened this issue Oct 5, 2022 · 1 comment · Fixed by #1341
Labels: bug (Something isn't working)

kevin85421 commented Oct 5, 2022

Search before asking

  • I searched the issues and found no similar issues.

KubeRay Component

ci

What happened + What you expected to happen

In the KinD E2E tests, we use kubectl wait to block the test process until the system is ready. However, kubectl wait --for=condition=Ready is unreliable when it is issued right after deleting a resource.

[Example 1]
The test test_detached_actor kills the GCS process on the head pod and uses kubectl wait to make sure that the new head pod is ready. However, in my experiment, the head pod takes about 60 seconds to crash after the GCS server is killed, and it remains READY: 1/1, STATUS: Running until then. Hence, kubectl wait may match the old head pod and cannot guarantee that the new head pod is ready.

# Kill the GCS process on the head node. If fate sharing is enabled,
# the whole head node pod will terminate.
utils.shell_assert_success(
    'kubectl exec -it $(kubectl get pods -A| grep -e "-head" | awk "{print \\$2}") -- /bin/bash -c "ps aux | grep gcs_server | grep -v grep | awk \'{print \$2}\' | xargs kill"')
# Wait for the new head node to be created.
time.sleep(10)
# Make sure the new head is ready.
utils.shell_assert_success(
    'kubectl wait --for=condition=Ready pod/$(kubectl get pods -A | grep -e "-head" | awk "{print \$2}") --timeout=900s')
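
For illustration only, a more robust approach would be to record the head pod name before killing GCS and then poll until a head pod with a different name is Running and Ready. This is a minimal sketch under that assumption; wait_for_new_head_pod and its parameters are hypothetical, not existing kuberay_utils helpers.

import subprocess
import time

def wait_for_new_head_pod(old_head_name, timeout_s=900, interval_s=5):
    # Hypothetical helper: poll until the head pod name differs from the
    # pre-kill name AND the pod reports READY 1/1 with STATUS Running,
    # instead of trusting a fixed sleep plus kubectl wait.
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        out = subprocess.run(
            'kubectl get pods -A | grep -e "-head"',
            shell=True, capture_output=True, text=True).stdout.strip()
        for line in out.splitlines():
            # kubectl get pods -A columns: NAMESPACE NAME READY STATUS RESTARTS AGE
            fields = line.split()
            name, ready, status = fields[1], fields[2], fields[3]
            if name != old_head_name and ready == "1/1" and status == "Running":
                return name
        time.sleep(interval_s)
    raise TimeoutError("new head pod never became Ready")

The test would capture the old head pod name right before killing gcs_server and call a helper like this instead of the fixed sleep(10) followed by kubectl wait.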

[Example 2]
READY: 1/1 does not imply STATUS: Running. Hence, kubectl wait --for=condition=ready sometimes considers a pod with READY: 1/1, STATUS: Terminating to have met the condition. See the post "How can a pod have status ready and terminating?" for more details.

  • Link

    INFO:kuberay_utils.utils:executing cmd: kubectl wait --for=condition=ready pod -l rayCluster=raycluster-compatibility-test --all --timeout=900s
    pod/raycluster-external-redis-head-xvhcg condition met
    pod/raycluster-external-redis-worker-small-group-r5s2s condition met
    INFO:kuberay_utils.utils:executing cmd: kubectl get pods -A
    NAMESPACE            NAME                                                 READY   STATUS        RESTARTS   AGE
    default              raycluster-external-redis-head-xvhcg                 1/1     Running       0          37s
    default              raycluster-external-redis-worker-small-group-jdsdd   0/1     Running       0          37s
    default              raycluster-external-redis-worker-small-group-r5s2s   1/1     Terminating   0          37s
    
  • Link

    INFO:kuberay_utils.utils:executing cmd: kubectl wait --for=condition=ready pod -l rayCluster=raycluster-compatibility-test --all --timeout=900s
    pod/raycluster-external-redis-head-x97nr condition met
    pod/raycluster-external-redis-worker-small-group-b66wg condition met
    INFO:kuberay_utils.utils:executing cmd: kubectl get pods -A
    NAMESPACE            NAME                                                 READY   STATUS        RESTARTS   AGE
    default              raycluster-external-redis-head-x97nr                 1/1     Running       0          39s
    default              raycluster-external-redis-worker-small-group-9mjgb   1/1     Running       0          39s
    default              raycluster-external-redis-worker-small-group-b66wg   1/1     Terminating   0          39s
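
A hedged sketch of a readiness check that is not fooled by Terminating pods (the helper name and label-selector argument are illustrative, not part of the existing test utilities): a Terminating pod keeps Ready=True until its containers actually stop, but it always carries metadata.deletionTimestamp, so that field can be checked in addition to the Ready condition.

import json
import subprocess

def ready_non_terminating_pods(label_selector):
    # Hypothetical check: list pods that report the Ready condition AND are
    # not being deleted (i.e. have no metadata.deletionTimestamp).
    out = subprocess.run(
        f"kubectl get pods -l {label_selector} -o json",
        shell=True, capture_output=True, text=True).stdout
    ready = []
    for pod in json.loads(out)["items"]:
        if pod["metadata"].get("deletionTimestamp"):
            continue  # Skip Terminating pods even if they still look Ready.
        conditions = pod.get("status", {}).get("conditions", [])
        if any(c["type"] == "Ready" and c["status"] == "True" for c in conditions):
            ready.append(pod["metadata"]["name"])
    return ready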

Reproduction script

RAY_IMAGE=rayproject/ray:2.0.0 python3 tests/compatibility-test.py RayFTTestCase.test_detached_actor 2>&1 | tee log
RAY_IMAGE=rayproject/ray:2.0.0 python3 tests/compatibility-test.py RayFTTestCase.test_kill_head 2>&1 | tee log

Anything else

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
kevin85421 added the "bug" label on Oct 5, 2022
kevin85421 changed the title from "(Draft) [Bug] Revisit the use of kubectl wait to avoid flakiness" to "[Bug] Revisit the use of kubectl wait to avoid flakiness" on Oct 5, 2022
kevin85421 self-assigned this on Oct 5, 2022
kevin85421 (Member, Author) commented:

We should remove all uses of "kubectl wait" because the command does not know the expected state of the RayCluster (e.g. how many Pods there should be). In addition, #704 will also increase the instability.
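
For illustration, a polling helper along those lines might look like the following. This is only a sketch under the assumption that the caller knows the expected Pod count from the RayCluster spec; the helper name and arguments are hypothetical.

import subprocess
import time

def wait_for_running_pods(label_selector, expected, timeout_s=900, interval_s=5):
    # Hypothetical replacement for "kubectl wait": the caller supplies the
    # expected Pod count (known from the RayCluster spec), and we poll until
    # exactly that many Pods are in the Running phase.
    deadline = time.time() + timeout_s
    running = 0
    while time.time() < deadline:
        out = subprocess.run(
            f"kubectl get pods -l {label_selector} "
            "--field-selector=status.phase=Running --no-headers",
            shell=True, capture_output=True, text=True).stdout.strip()
        running = len(out.splitlines()) if out else 0
        if running == expected:
            return
        time.sleep(interval_s)
    raise TimeoutError(f"expected {expected} Running Pods, last saw {running}")

In practice this could be combined with the deletionTimestamp check sketched above so that Terminating Pods are not counted toward the expected total.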
