[Bug] Double creation of Pods causes flaky tests #704
Comments
cc @DmitriGekhtman @Jeffwan @sihanwang41 @jasoonn for thoughts.
cc @akanso also. We should try to get a sense of how frequent this behavior is -- in principle, if you occasionally create an extra pod and then quickly delete it, that's not the worst thing in the world. That's "eventual consistency"... The replica set controller underlying Deployments doesn't use logic that's much fancier than what we're doing in KubeRay -- maybe we have some other unknown issue.
The immediate action item is to fix the test such that it doesn't assume a stable name for the worker pod. |
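A minimal, self-contained sketch of that fix (not the actual test code): select worker pods by label instead of by their generated name, since a name like `raycluster-mini-worker-small-group-tp5l2` changes whenever the operator recreates the pod. The label key `ray.io/group` and the pod dictionaries here are illustrative assumptions, not the real test's data structures.

```python
# Hypothetical sketch: find worker pods by label selector rather than by a
# fixed generated name, so a recreated pod still matches.

def find_worker_pods(pods, group_label="small-group"):
    """Return pods whose labels mark them as workers of the given group."""
    return [
        p for p in pods
        if p.get("labels", {}).get("ray.io/group") == group_label
    ]

# Example pod list as the test might observe it (illustrative only).
pods = [
    {"name": "raycluster-mini-head-abcde",
     "labels": {"ray.io/node-type": "head"}},
    {"name": "raycluster-mini-worker-small-group-tp5l2",
     "labels": {"ray.io/group": "small-group"}},
]

workers = find_worker_pods(pods)
assert len(workers) == 1  # still 1 even if the pod's random suffix changes
```

A test written this way keeps passing when the operator deletes and recreates a worker, because the label, not the name, identifies the pod.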
Hmm, I don't remember the 1.13.0 test being flaky in the past.
https://github.com/ray-project/kuberay/pull/609/files I removed
Actually, this article suggests that the replicaset controller does some sort of maneuver to account for stale cache:
The CloneSet controller also appears to be smarter than ours:
Test flakiness was fixed in #705. I'm going to open an issue to discuss the underlying problem.
Search before asking
KubeRay Component
ray-operator
What happened + What you expected to happen
The function `create_kuberay_cluster` in `incompatibility-test.py` is very flaky. As shown in this example, the k8s cluster had one head pod and one worker pod. The cluster state fulfills the goal state defined in `ray-cluster.mini.yaml.template`, but it reports an error message: `error: timed out waiting for the condition on pods/raycluster-mini-worker-small-group-tp5l2`. Check the log for more details. The operator creates two worker pods consecutively (L1368 & L1390), which indicates that L1390 does not know L1368 has already created a worker pod. Next, L1430 finds that there are two worker pods, and thus one worker pod needs to be deleted (goal state: 1 head + 1 worker).
My guess is that the root cause is the inconsistency between the informer cache and the k8s API server. This inconsistency becomes harmful because of the non-idempotent operations in KubeRay: KubeRay deletes and creates Pods directly, and these operations change the cluster state.
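The failure mode described above can be illustrated with a minimal simulation (this is not KubeRay code; `FakeCluster`, `reconcile`, and the pod names are hypothetical): a reconcile loop that makes create/delete decisions from a possibly stale cache will create a second pod when the cache has not yet observed the first one, then delete the surplus pod once the cache catches up -- the L1368/L1390/L1430 sequence from the log.

```python
# Minimal simulation of a stale informer cache causing a double Pod
# creation followed by a corrective deletion.

class FakeCluster:
    """Authoritative API-server-side state."""
    def __init__(self):
        self.pods = []

    def create_pod(self, name):
        self.pods.append(name)

    def delete_pod(self, name):
        self.pods.remove(name)


def reconcile(cluster, cached_pods, desired=1):
    """Non-idempotent reconcile: decisions use the (possibly stale) cache."""
    diff = desired - len(cached_pods)
    if diff > 0:
        for _ in range(diff):
            cluster.create_pod(f"worker-{len(cluster.pods)}")
    elif diff < 0:
        for name in cached_pods[desired:]:
            cluster.delete_pod(name)


cluster = FakeCluster()
reconcile(cluster, cached_pods=[])      # like L1368: creates worker-0
reconcile(cluster, cached_pods=[])      # like L1390: stale cache, creates worker-1
assert len(cluster.pods) == 2           # two workers now exist
reconcile(cluster, cached_pods=list(cluster.pods))  # like L1430: deletes one
assert len(cluster.pods) == 1           # eventually consistent again
```

The cluster does converge, but each stale-cache reconcile issues a real mutation, which is exactly what makes the test's assumption of a stable pod name unsafe.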
Possible Solutions
I discussed some possible solutions with @DmitriGekhtman:
* Use a `Deployment`: set its `replicas` field to achieve the goal number of worker pods. "Set the `replicas`" is an idempotent operation.
* Kruise
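To show why the first option sidesteps the stale-cache problem, here is a sketch under the same simulation style as above (all names are hypothetical, not KubeRay or Kubernetes API code): the controller declares a desired `replicas` count instead of issuing create/delete calls, so repeating the same operation from a stale view changes nothing.

```python
# Sketch of the idempotent alternative: declare a desired replica count and
# let a separate (Deployment-like) controller converge Pods toward it.

class WorkerGroup:
    def __init__(self):
        self.replicas = 0


def set_replicas(group, desired):
    # Idempotent: applying the same call twice leaves the same state,
    # so a stale-cache retry never requests an extra Pod.
    group.replicas = desired


group = WorkerGroup()
set_replicas(group, 1)   # first reconcile
set_replicas(group, 1)   # stale-cache retry: harmless no-op
assert group.replicas == 1
```

Contrast this with the create/delete loop: there, replaying the same decision on stale data produced a second pod; here, replaying it is a no-op.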
Reproduction script
https://github.com/ray-project/kuberay/actions/runs/3424423296/jobs/5706807752
Anything else
No response
Are you willing to submit a PR?