
[Bug] Double creation of Pods causes flaky tests #704

Closed
2 tasks done
kevin85421 opened this issue Nov 9, 2022 · 8 comments
Assignees: kevin85421
Labels: bug (Something isn't working), P1 (Issue that should be fixed within a few weeks)

Comments

@kevin85421
Member

Search before asking

  • I searched the issues and found no similar issues.

KubeRay Component

ray-operator

What happened + What you expected to happen

The function create_kuberay_cluster in compatibility-test.py is very flaky. As shown in the example below, the Kubernetes cluster had one head Pod and one worker Pod. The cluster state fulfills the goal state defined in ray-cluster.mini.yaml.template, but the test still reports an error: timed out waiting for the condition on pods/raycluster-mini-worker-small-group-tp5l2.

2022-11-09:06:11:02,784 INFO     [utils.py:47] executing cmd: kubectl wait --for=condition=ready pod -l rayCluster=raycluster-compatibility-test --all --timeout=900s
pod/raycluster-mini-head-msxms condition met
pod/raycluster-mini-worker-small-group-hmgnj condition met
error: timed out waiting for the condition on pods/raycluster-mini-worker-small-group-tp5l2
2022-11-09:06:26:05,571 INFO     [utils.py:47] executing cmd: kubectl get pods -A
NAMESPACE            NAME                                         READY   STATUS    RESTARTS   AGE
default              raycluster-mini-head-msxms                   1/1     Running   0          15m
default              raycluster-mini-worker-small-group-hmgnj     1/1     Running   0          15m
kube-system          coredns-558bd4d5db-gpttf                     1/1     Running   0          19m
kube-system          coredns-558bd4d5db-w7x6x                     1/1     Running   0          19m
kube-system          etcd-kind-control-plane                      1/1     Running   0          19m
kube-system          kindnet-f8w9t                                1/1     Running   0          19m
kube-system          kube-apiserver-kind-control-plane            1/1     Running   0          19m
kube-system          kube-controller-manager-kind-control-plane   1/1     Running   0          19m
kube-system          kube-proxy-44d89                             1/1     Running   0          19m
kube-system          kube-scheduler-kind-control-plane            1/1     Running   0          19m
local-path-storage   local-path-provisioner-547f784dff-g497d      1/1     Running   0          19m
ray-system           kuberay-apiserver-79f45d466c-nvkbm           1/1     Running   0          15m
ray-system           kuberay-operator-ddd74bf68-2pkm5             1/1     Running   0          15m

Check the log for more details. The operator creates two worker Pods consecutively (log lines 1368 and 1390), which indicates that the reconciliation at line 1390 does not know that line 1368 has already created a worker Pod. Line 1430 then finds two worker Pods and deletes one of them, because the goal state is 1 head + 1 worker.

  • Line 1368
    2022-11-09T06:11:02.809Z	INFO	controllers.RayCluster	reconcilePods	{"creating worker for group": "small-group", "index 0": "in total 1"}
    
  • Line 1390
    2022-11-09T06:11:02.823Z	INFO	controllers.RayCluster	reconcilePods	{"creating worker for group": "small-group", "index 0": "in total 1"}
    
  • Line 1430
    2022-11-09T06:11:02.865Z	INFO	controllers.RayCluster	Randomly deleting pod 	{"index ": 0, "/": 1, "with name": "raycluster-mini-worker-small-group-tp5l2"}
    

My guess is that the root cause is an inconsistency between the informer cache and the Kubernetes API server. The inconsistency bites because of non-idempotent operations in KubeRay: the operator deletes and creates Pods directly, and these operations change the cluster state. The sketch below illustrates the race.
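To make the race concrete, here is a simplified, hypothetical sketch of the "list from cache, create the difference" pattern; it is not KubeRay's actual reconciler, and the label and Pod spec are placeholders. The Pod count is read from the informer-backed cache, so two reconcile passes that run before the cache catches up both see zero workers and both create a Pod; a later pass that finally sees both Pods has to delete one.

```go
package sketch

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// reconcileWorkers is a simplified, hypothetical sketch of the non-idempotent
// pattern described above. It is NOT KubeRay's actual code.
func reconcileWorkers(ctx context.Context, c client.Client, namespace string, desired int) error {
	// c is backed by the informer cache, which can lag behind the API server.
	var pods corev1.PodList
	if err := c.List(ctx, &pods,
		client.InNamespace(namespace),
		client.MatchingLabels{"ray.io/group": "small-group"}, // illustrative label
	); err != nil {
		return err
	}

	// If a Pod created by the previous reconcile pass is not visible in the
	// cache yet, len(pods.Items) is still 0 here and a second worker is created
	// (the double creation at log lines 1368 and 1390).
	for i := len(pods.Items); i < desired; i++ {
		pod := corev1.Pod{
			ObjectMeta: metav1.ObjectMeta{
				GenerateName: "raycluster-mini-worker-small-group-",
				Namespace:    namespace,
				Labels:       map[string]string{"ray.io/group": "small-group"},
			},
			Spec: corev1.PodSpec{
				Containers: []corev1.Container{{Name: "ray-worker", Image: "rayproject/ray:2.0.0"}},
			},
		}
		if err := c.Create(ctx, &pod); err != nil {
			return err
		}
	}

	// A later pass that finally sees both Pods must delete one to get back to
	// `desired` (the "Randomly deleting pod" message at log line 1430).
	return nil
}
```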

Possible Solutions

I discussed some possible solutions with @DmitriGekhtman:

Deployment

  • We can set the replicas field to reach the goal number of worker Pods. Setting replicas is an idempotent operation (see the sketch after this list).
  • Cons: a Deployment does not let us delete a specific Pod.
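For comparison, here is a minimal sketch of the scale-by-replicas approach with controller-runtime, assuming the workers were backed by a Deployment (they are not today; the function name is made up). Writing the same desired count repeatedly converges to the same state instead of creating extra Pods:

```go
package sketch

import (
	"context"

	appsv1 "k8s.io/api/apps/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// scaleWorkers declares the desired worker count on a (hypothetical) worker
// Deployment instead of creating or deleting Pods directly. Repeating the call
// with the same `desired` value is a no-op, so a stale cache cannot lead to
// double creation.
func scaleWorkers(ctx context.Context, c client.Client, namespace, name string, desired int32) error {
	var deploy appsv1.Deployment
	if err := c.Get(ctx, client.ObjectKey{Namespace: namespace, Name: name}, &deploy); err != nil {
		return err
	}
	if deploy.Spec.Replicas != nil && *deploy.Spec.Replicas == desired {
		return nil // already at the goal state
	}
	deploy.Spec.Replicas = &desired
	// The ReplicaSet controller decides which Pods to add or remove, which is
	// also why we lose the ability to delete a specific Pod.
	return c.Update(ctx, &deploy)
}
```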

Kruise

Reproduction script

https://github.com/ray-project/kuberay/actions/runs/3424423296/jobs/5706807752

Anything else

No response

Are you willing to submit a PR?

  • Yes, I am willing to submit a PR!
@kevin85421 kevin85421 added the bug Something isn't working label Nov 9, 2022
@kevin85421 kevin85421 self-assigned this Nov 9, 2022
@kevin85421
Member Author

cc @DmitriGekhtman @Jeffwan @sihanwang41 @jasoonn for thoughts.

@DmitriGekhtman
Collaborator

DmitriGekhtman commented Nov 10, 2022

cc @akanso also

We should try to get a sense for how frequent this behavior is -- in principle, if you occasionally create an extra pod and then quickly delete it, that's not the worst thing in the world. That's "eventual consistency"...

The replica set controller underlying deployments doesn't use logic that's much fancier than what we're doing in KubeRay -- maybe we have some other unknown issue.

@DmitriGekhtman
Collaborator

The immediate action item is to fix the test such that it doesn't assume a stable name for the worker pod.
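One way to do that is to wait on the number of Ready Pods matching the label selector instead of waiting on named Pods, re-listing on every poll so a short-lived extra Pod does not matter. A rough client-go sketch (the actual fix may differ):

```go
package sketch

import (
	"context"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// waitForReadyPods polls until exactly `expected` Pods matching `selector` are
// Ready. Because the Pod set is re-listed on every poll, an extra Pod that
// briefly appears and is then deleted does not make the wait time out.
func waitForReadyPods(ctx context.Context, cs kubernetes.Interface, namespace, selector string, expected int) error {
	return wait.PollImmediate(5*time.Second, 15*time.Minute, func() (bool, error) {
		pods, err := cs.CoreV1().Pods(namespace).List(ctx, metav1.ListOptions{LabelSelector: selector})
		if err != nil {
			return false, err
		}
		ready := 0
		for _, p := range pods.Items {
			for _, cond := range p.Status.Conditions {
				if cond.Type == corev1.PodReady && cond.Status == corev1.ConditionTrue {
					ready++
				}
			}
		}
		return len(pods.Items) == expected && ready == expected, nil
	})
}
```

For this test, that would look roughly like waitForReadyPods(ctx, cs, "default", "rayCluster=raycluster-compatibility-test", 2).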

@DmitriGekhtman
Collaborator

Hmm, I don't remember the 1.13.0 test being flaky in the past.
Could there be a new issue in master causing this?

@DmitriGekhtman DmitriGekhtman added the P1 Issue that should be fixed within a few weeks label Nov 10, 2022
@kevin85421
Member Author

kevin85421 commented Nov 10, 2022

Hmm, I don't remember the 1.13.0 test being flaky in the past. Could there be a new issue in master causing this?

https://github.com/ray-project/kuberay/pull/609/files

I removed time.sleep(60) in this PR. Before that change, the redundant worker had already been removed by the time kubectl wait executed; without the sleep, kubectl wait selects the extra Pod and then times out after the operator deletes it.

@DmitriGekhtman
Collaborator

The replica set controller underlying deployments doesn't use logic that's much fancier than what we're doing in KubeRay -- maybe we have some other unknown issue.

Actually, this article suggests that the ReplicaSet controller does do some sort of maneuver to account for the stale cache:
https://medium.com/@timebertt/kubernetes-controllers-at-scale-clients-caches-conflicts-patches-explained-aa0f7a8b4332
We should figure out how it works, because double-launching Pods and then deleting a random Pod is generally not acceptable for Ray.
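For reference, that maneuver is usually called the expectations mechanism: before issuing creates or deletes, the ReplicaSet controller records how many Pod creations/deletions it expects to observe through its informer, and it refuses to scale that ReplicaSet again until those events have been observed (or the expectations expire). A stripped-down sketch of the pattern, not the actual kube-controller-manager code (which also tracks deletions by UID and uses a timeout):

```go
package sketch

import "sync"

// expectations is a stripped-down version of the pattern used by the ReplicaSet
// controller (ControllerExpectations in kube-controller-manager): record how
// many Pod creations we expect to see through the informer, and refuse to scale
// again until they have all been observed.
type expectations struct {
	mu      sync.Mutex
	pending map[string]int // cluster key -> creations not yet observed in the cache
}

func newExpectations() *expectations {
	return &expectations{pending: map[string]int{}}
}

// ExpectCreations is called right before issuing `n` Pod Create calls.
func (e *expectations) ExpectCreations(key string, n int) {
	e.mu.Lock()
	defer e.mu.Unlock()
	e.pending[key] += n
}

// CreationObserved is called from the Pod informer's Add handler.
func (e *expectations) CreationObserved(key string) {
	e.mu.Lock()
	defer e.mu.Unlock()
	if e.pending[key] > 0 {
		e.pending[key]--
	}
}

// Satisfied reports whether the cached Pod list can be trusted for scaling; if
// false, the sync loop should skip creating or deleting Pods this round.
func (e *expectations) Satisfied(key string) bool {
	e.mu.Lock()
	defer e.mu.Unlock()
	return e.pending[key] == 0
}
```

KubeRay could adopt the same pattern: record an expectation when it creates a worker Pod, clear it from the Pod informer's Add handler, and skip creating or deleting workers while expectations for the cluster are unmet.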

@DmitriGekhtman
Collaborator

@DmitriGekhtman
Collaborator

DmitriGekhtman commented Nov 12, 2022

Test flakiness was fixed in #705. I'm going to open an issue to discuss the underlying problem.
