
[Bug] Double creation of Pods causes flaky tests #704

Closed
2 tasks done
kevin85421 opened this issue Nov 9, 2022 · 8 comments
Assignees: kevin85421
Labels: bug (Something isn't working), P1 (Issue that should be fixed within a few weeks)

Comments

@kevin85421
Member

Search before asking

  • I searched the issues and found no similar issues.

KubeRay Component

ray-operator

What happened + What you expected to happen

The function create_kuberay_cluster in compatibility-test.py is very flaky. As shown in the example below, the Kubernetes cluster had one head Pod and one worker Pod. The cluster state fulfills the goal state defined in ray-cluster.mini.yaml.template, but the test still reports an error: timed out waiting for the condition on pods/raycluster-mini-worker-small-group-tp5l2.

2022-11-09:06:11:02,784 INFO     [utils.py:47] executing cmd: kubectl wait --for=condition=ready pod -l rayCluster=raycluster-compatibility-test --all --timeout=900s
pod/raycluster-mini-head-msxms condition met
pod/raycluster-mini-worker-small-group-hmgnj condition met
error: timed out waiting for the condition on pods/raycluster-mini-worker-small-group-tp5l2
2022-11-09:06:26:05,571 INFO     [utils.py:47] executing cmd: kubectl get pods -A
NAMESPACE            NAME                                         READY   STATUS    RESTARTS   AGE
default              raycluster-mini-head-msxms                   1/1     Running   0          15m
default              raycluster-mini-worker-small-group-hmgnj     1/1     Running   0          15m
kube-system          coredns-558bd4d5db-gpttf                     1/1     Running   0          19m
kube-system          coredns-558bd4d5db-w7x6x                     1/1     Running   0          19m
kube-system          etcd-kind-control-plane                      1/1     Running   0          19m
kube-system          kindnet-f8w9t                                1/1     Running   0          19m
kube-system          kube-apiserver-kind-control-plane            1/1     Running   0          19m
kube-system          kube-controller-manager-kind-control-plane   1/1     Running   0          19m
kube-system          kube-proxy-44d89                             1/1     Running   0          19m
kube-system          kube-scheduler-kind-control-plane            1/1     Running   0          19m
local-path-storage   local-path-provisioner-547f784dff-g497d      1/1     Running   0          19m
ray-system           kuberay-apiserver-79f45d466c-nvkbm           1/1     Running   0          15m
ray-system           kuberay-operator-ddd74bf68-2pkm5             1/1     Running   0          15m

Check the log for more details. The operator creates two worker Pods consecutively (log lines 1368 and 1390), which indicates that the reconciliation at line 1390 does not know that line 1368 has already created a worker Pod. Line 1430 then finds two worker Pods and deletes one of them, because the goal state is 1 head + 1 worker.

  • Line 1368
    2022-11-09T06:11:02.809Z	INFO	controllers.RayCluster	reconcilePods	{"creating worker for group": "small-group", "index 0": "in total 1"}
    
  • Line 1390
    2022-11-09T06:11:02.823Z	INFO	controllers.RayCluster	reconcilePods	{"creating worker for group": "small-group", "index 0": "in total 1"}
    
  • Line 1430
    2022-11-09T06:11:02.865Z	INFO	controllers.RayCluster	Randomly deleting pod 	{"index ": 0, "/": 1, "with name": "raycluster-mini-worker-small-group-tp5l2"}
    

My guess is that the root cause is an inconsistency between the informer cache and the Kubernetes API server. The inconsistency bites because of non-idempotent operations in KubeRay: the operator deletes and creates Pods directly, and these operations change the cluster state. The sketch below illustrates the race.
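To make the race concrete, here is a simplified, hypothetical sketch of the "list from cache, create the difference" pattern; it is not KubeRay's actual reconciler, and the label and Pod spec are placeholders. The Pod count is read from the informer-backed cache, so two reconcile passes that run before the cache catches up both see zero workers and both create a Pod; a later pass that finally sees both Pods has to delete one.

```go
package sketch

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// reconcileWorkers is a simplified, hypothetical sketch of the non-idempotent
// pattern described above. It is NOT KubeRay's actual code.
func reconcileWorkers(ctx context.Context, c client.Client, namespace string, desired int) error {
	// c is backed by the informer cache, which can lag behind the API server.
	var pods corev1.PodList
	if err := c.List(ctx, &pods,
		client.InNamespace(namespace),
		client.MatchingLabels{"ray.io/group": "small-group"}, // illustrative label
	); err != nil {
		return err
	}

	// If a Pod created by the previous reconcile pass is not visible in the
	// cache yet, len(pods.Items) is still 0 here and a second worker is created
	// (the double creation at log lines 1368 and 1390).
	for i := len(pods.Items); i < desired; i++ {
		pod := corev1.Pod{
			ObjectMeta: metav1.ObjectMeta{
				GenerateName: "raycluster-mini-worker-small-group-",
				Namespace:    namespace,
				Labels:       map[string]string{"ray.io/group": "small-group"},
			},
			Spec: corev1.PodSpec{
				Containers: []corev1.Container{{Name: "ray-worker", Image: "rayproject/ray:2.0.0"}},
			},
		}
		if err := c.Create(ctx, &pod); err != nil {
			return err
		}
	}

	// A later pass that finally sees both Pods must delete one to get back to
	// `desired` (the "Randomly deleting pod" message at log line 1430).
	return nil
}
```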

Possible Solutions

I discussed some possible solutions with @DmitriGekhtman:

Deployment

  • We can set the replicas field to reach the goal number of worker Pods. Setting replicas is an idempotent operation (see the sketch after this list).
  • Cons: a Deployment does not let us delete a specific Pod.
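For comparison, here is a minimal sketch of the scale-by-replicas approach with controller-runtime, assuming the workers were backed by a Deployment (they are not today; the function name is made up). Writing the same desired count repeatedly converges to the same state instead of creating extra Pods:

```go
package sketch

import (
	"context"

	appsv1 "k8s.io/api/apps/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// scaleWorkers declares the desired worker count on a (hypothetical) worker
// Deployment instead of creating or deleting Pods directly. Repeating the call
// with the same `desired` value is a no-op, so a stale cache cannot lead to
// double creation.
func scaleWorkers(ctx context.Context, c client.Client, namespace, name string, desired int32) error {
	var deploy appsv1.Deployment
	if err := c.Get(ctx, client.ObjectKey{Namespace: namespace, Name: name}, &deploy); err != nil {
		return err
	}
	if deploy.Spec.Replicas != nil && *deploy.Spec.Replicas == desired {
		return nil // already at the goal state
	}
	deploy.Spec.Replicas = &desired
	// The ReplicaSet controller decides which Pods to add or remove, which is
	// also why we lose the ability to delete a specific Pod.
	return c.Update(ctx, &deploy)
}
```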

Kruise

Reproduction script

https://github.com/ray-project/kuberay/actions/runs/3424423296/jobs/5706807752

Anything else

No response

Are you willing to submit a PR?

  • Yes, I am willing to submit a PR!
@kevin85421 kevin85421 added the bug Something isn't working label Nov 9, 2022
@kevin85421 kevin85421 self-assigned this Nov 9, 2022
@kevin85421
Member Author

cc @DmitriGekhtman @Jeffwan @sihanwang41 @jasoonn for thoughts.

@DmitriGekhtman
Collaborator

DmitriGekhtman commented Nov 10, 2022

cc @akanso also

We should try to get a sense for how frequent this behavior is -- in principle, if you occasionally create an extra pod and then quickly delete it, that's not the worst thing in the world. That's "eventual consistency"...

The replica set controller underlying deployments doesn't use logic that's much fancier than what we're doing in KubeRay -- maybe we have some other unknown issue.

@DmitriGekhtman
Collaborator

The immediate action item is to fix the test such that it doesn't assume a stable name for the worker pod.
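One way to do that is to wait on the number of Ready Pods matching the label selector instead of waiting on named Pods, re-listing on every poll so a short-lived extra Pod does not matter. A rough client-go sketch (the actual fix may differ):

```go
package sketch

import (
	"context"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// waitForReadyPods polls until exactly `expected` Pods matching `selector` are
// Ready. Because the Pod set is re-listed on every poll, an extra Pod that
// briefly appears and is then deleted does not make the wait time out.
func waitForReadyPods(ctx context.Context, cs kubernetes.Interface, namespace, selector string, expected int) error {
	return wait.PollImmediate(5*time.Second, 15*time.Minute, func() (bool, error) {
		pods, err := cs.CoreV1().Pods(namespace).List(ctx, metav1.ListOptions{LabelSelector: selector})
		if err != nil {
			return false, err
		}
		ready := 0
		for _, p := range pods.Items {
			for _, cond := range p.Status.Conditions {
				if cond.Type == corev1.PodReady && cond.Status == corev1.ConditionTrue {
					ready++
				}
			}
		}
		return len(pods.Items) == expected && ready == expected, nil
	})
}
```

For this test, that would look roughly like waitForReadyPods(ctx, cs, "default", "rayCluster=raycluster-compatibility-test", 2).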

@DmitriGekhtman
Collaborator

Hmm, I don't remember the 1.13.0 test being flaky in the past.
Could there be a new issue in master causing this?

@DmitriGekhtman DmitriGekhtman added the P1 Issue that should be fixed within a few weeks label Nov 10, 2022
@kevin85421
Member Author

kevin85421 commented Nov 10, 2022

Hmm, I don't remember the 1.13.0 test being flaky in the past. Could there be a new issue in master causing this?

https://github.com/ray-project/kuberay/pull/609/files

I removed time.sleep(60) in this PR. Before that change, the redundant worker had already been removed by the time kubectl wait executed; without the sleep, kubectl wait selects the extra Pod and then times out after the operator deletes it.

@DmitriGekhtman
Collaborator

The replica set controller underlying deployments doesn't use logic that's much fancier than what we're doing in KubeRay -- maybe we have some other unknown issue.

Actually, this article suggests that the ReplicaSet controller does do some sort of maneuver to account for the stale cache:
https://medium.com/@timebertt/kubernetes-controllers-at-scale-clients-caches-conflicts-patches-explained-aa0f7a8b4332
We should figure out how it works, because double-launching Pods and then deleting a random Pod is generally not acceptable for Ray.
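For reference, that maneuver is usually called the expectations mechanism: before issuing creates or deletes, the ReplicaSet controller records how many Pod creations/deletions it expects to observe through its informer, and it refuses to scale that ReplicaSet again until those events have been observed (or the expectations expire). A stripped-down sketch of the pattern, not the actual kube-controller-manager code (which also tracks deletions by UID and uses a timeout):

```go
package sketch

import "sync"

// expectations is a stripped-down version of the pattern used by the ReplicaSet
// controller (ControllerExpectations in kube-controller-manager): record how
// many Pod creations we expect to see through the informer, and refuse to scale
// again until they have all been observed.
type expectations struct {
	mu      sync.Mutex
	pending map[string]int // cluster key -> creations not yet observed in the cache
}

func newExpectations() *expectations {
	return &expectations{pending: map[string]int{}}
}

// ExpectCreations is called right before issuing `n` Pod Create calls.
func (e *expectations) ExpectCreations(key string, n int) {
	e.mu.Lock()
	defer e.mu.Unlock()
	e.pending[key] += n
}

// CreationObserved is called from the Pod informer's Add handler.
func (e *expectations) CreationObserved(key string) {
	e.mu.Lock()
	defer e.mu.Unlock()
	if e.pending[key] > 0 {
		e.pending[key]--
	}
}

// Satisfied reports whether the cached Pod list can be trusted for scaling; if
// false, the sync loop should skip creating or deleting Pods this round.
func (e *expectations) Satisfied(key string) bool {
	e.mu.Lock()
	defer e.mu.Unlock()
	return e.pending[key] == 0
}
```

KubeRay could adopt the same pattern: record an expectation when it creates a worker Pod, clear it from the Pod informer's Add handler, and skip creating or deleting workers while expectations for the cluster are unmet.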

@DmitriGekhtman
Collaborator

@DmitriGekhtman
Collaborator

DmitriGekhtman commented Nov 12, 2022

Test flakiness was fixed in #705. I'm going to open an issue to discuss the underlying problem.
