[RayService] Skip update events without change #811
It looks like we should add a RateLimiter rather than an event filter: https://pkg.go.dev/sigs.k8s.io/controller-runtime/pkg/controller#Options
Signed-off-by: Sihan Wang <[email protected]>
Branch updated from 43ef4c9 to 964e95d.
Skipping metadata and status changes seems fine as well -- but are you sure you want to skip metadata changes?
For later, I think we should consider this too.
RayService is a self-driven reconciler. It reconciles every 2 seconds.
Right, reconciliation happens periodically by default anyway, but reconcilers should try to respond ASAP to relevant changes. I think we do use annotations in the KubeRay operator, so it makes sense to respond to changes to those. We might use labels in the future.
Anyway, my complaint is a nit for the reason that @brucez-anyscale points out: metadata changes will be reconciled either way, though the reconciliation could potentially be delayed by a couple of seconds.
If the RayService controller reconciles every two seconds, this solution makes sense. As mentioned above, the downside is the long latency. One quick question: where do we specify the 2-second period (SyncPeriod?)? cc @brucez-anyscale
Signed-off-by: Sihan Wang <[email protected]>
@sihanwang41 updated the PR, so there should not be a long-latency issue. The 2-second interval is set here.
It looks like the latency is still the same, but 2 seconds is okay for me. Resource events with the same generation are necessary for KubeRay because KubeRay is not an ideal (stateless and idempotent) operator.
Thank you for your reply!
I agree that the KubeRay operator's reconciliation is not 100% correct; we have identified some of the gaps in GitHub issues. Why does that imply that reconciling multiple times with the same resource version is necessary?
I think we can merge this PR, since it solves the immediate issue in a reasonable-enough way. |
In some cases, KubeRay needs multiple reconciliation iterations with the same CR spec to reach the goal state.
Example 1: In #704, the extra Pod will be killed in the next reconciliation, yet there is no generation update between these two reconciliations.
Example 2: In RayJob, if the first iteration creates the RayCluster successfully but fails to launch the job, the second reconciliation needs to use the HTTP client to create the job again.
We can also look at this question from a higher-level perspective. Repeated update events for a resource can be treated as the same event only when the reconciliation logic is stateless; in other words, the operator drives Kubernetes to the same goal state regardless of the previous cluster state. For example, we cannot discard (2) and (3) even though they are the same function. On the other hand, we can ignore (5) and (6) because they are idempotent. We can think of KubeRay as one big non-idempotent function.
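To make the idempotency argument concrete, here are a few toy reconcile steps. Dropping a duplicate update event is only safe if handling it twice has the same effect as handling it once. `clusterState`, `scaleTo`, `submitJob`, and `ensureJob` are hypothetical illustrations, not KubeRay code.

```go
package main

import "fmt"

// clusterState is a hypothetical, simplified view of what a reconciler manages.
type clusterState struct {
	Replicas      int
	JobsSubmitted int
}

// scaleTo is idempotent: applying it twice leaves the same state as applying it once.
func scaleTo(s clusterState, desired int) clusterState {
	s.Replicas = desired
	return s
}

// submitJob is not idempotent: each call submits another job, so a repeated
// event changes the outcome.
func submitJob(s clusterState) clusterState {
	s.JobsSubmitted++
	return s
}

// ensureJob is an idempotent variant: it submits only when no job exists yet.
func ensureJob(s clusterState) clusterState {
	if s.JobsSubmitted == 0 {
		s.JobsSubmitted = 1
	}
	return s
}

func main() {
	start := clusterState{}
	fmt.Println("scaleTo idempotent:", scaleTo(start, 3) == scaleTo(scaleTo(start, 3), 3))
	fmt.Println("submitJob idempotent:", submitJob(start) == submitJob(submitJob(start)))
	fmt.Println("ensureJob idempotent:", ensureJob(start) == ensureJob(ensureJob(start)))
}
```

`ensureJob` shows the usual remedy: check the current state before acting, so that a repeated event becomes harmless.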
The reconciler will still respond to pod creation and deletion; the event filter only affects the response to RayCluster resource changes.
For the pod reconciliation issue, I think it's better if the operator reconciles less frequently, since the problem is a stale cache. Reconciling less frequently would minimize the chance of read-after-write inconsistency. Of course, the ideal solution is to actually fix the pod reconciliation logic.
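The stale-cache problem can be illustrated with a toy cache that only reflects writes after a sync. If the reconciler runs again before the cache catches up, it sees too few pods and creates an extra one, which a later pass must delete. The `fakeCluster` type and its methods are hypothetical; real controllers read from controller-runtime's informer-backed cache, which updates asynchronously rather than on an explicit `sync` call.

```go
package main

import "fmt"

// fakeCluster models an API server plus a lagging local cache (hypothetical).
type fakeCluster struct {
	pods       int // pods that actually exist in the API server
	cachedPods int // what the controller's cache currently reports
}

// sync brings the cache up to date with the API server.
func (c *fakeCluster) sync() { c.cachedPods = c.pods }

// reconcile compares the cached pod count with desired and writes the
// difference straight to the API server; the cache is NOT updated here.
func (c *fakeCluster) reconcile(desired int) {
	diff := desired - c.cachedPods
	c.pods += diff // positive: create pods, negative: delete pods
}

func main() {
	c := &fakeCluster{}
	c.reconcile(1)
	c.reconcile(1) // runs before the cache has observed the first create
	fmt.Println("pods after two fast reconciles:", c.pods)
	c.sync()
	c.reconcile(1) // the cache now shows the extra pod, so it is deleted
	fmt.Println("pods after cache sync:", c.pods)
}
```

If the cache syncs between the two passes, the second pass sees one pod and does nothing, which is why reconciling less often reduces the chance of this race.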
Skip RayService reconcile when there is no config/label/annotation change
Signed-off-by: Sihan Wang <[email protected]>
Signed-off-by: Sihan Wang <[email protected]>
Why are these changes needed?
Skip the K8s reconcile when there is no config, label, or annotation change.
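A minimal sketch of such an update filter, assuming the decision is based on `metadata.generation` (which, for custom resources with the status subresource enabled, changes on spec updates but not on status-only updates), labels, and annotations. controller-runtime provides `predicate.GenerationChangedPredicate`, `predicate.LabelChangedPredicate`, and `predicate.AnnotationChangedPredicate` for this; the standalone version below uses a hypothetical `objectMeta` struct so it runs without Kubernetes dependencies, and it is not the PR's actual code.

```go
package main

import "fmt"

// objectMeta is a hypothetical stand-in for the fields the filter inspects.
type objectMeta struct {
	Generation  int64
	Labels      map[string]string
	Annotations map[string]string
}

// stringMapsEqual reports whether two string maps hold the same keys and values.
// A nil map and an empty map are treated as equal.
func stringMapsEqual(a, b map[string]string) bool {
	if len(a) != len(b) {
		return false
	}
	for k, v := range a {
		if bv, ok := b[k]; !ok || bv != v {
			return false
		}
	}
	return true
}

// shouldReconcileOnUpdate returns true when the spec (via generation), labels,
// or annotations changed; status-only updates are skipped.
func shouldReconcileOnUpdate(oldObj, newObj objectMeta) bool {
	return oldObj.Generation != newObj.Generation ||
		!stringMapsEqual(oldObj.Labels, newObj.Labels) ||
		!stringMapsEqual(oldObj.Annotations, newObj.Annotations)
}

func main() {
	base := objectMeta{Generation: 1, Labels: map[string]string{"app": "serve"}}
	statusOnly := base // status changed, but generation, labels, and annotations did not
	specChange := objectMeta{Generation: 2, Labels: base.Labels}
	annotated := objectMeta{Generation: 1, Labels: base.Labels, Annotations: map[string]string{"note": "x"}}

	fmt.Println("status-only update:", shouldReconcileOnUpdate(base, statusOnly))
	fmt.Println("spec update:", shouldReconcileOnUpdate(base, specChange))
	fmt.Println("annotation update:", shouldReconcileOnUpdate(base, annotated))
}
```

With controller-runtime, one possible wiring (not necessarily the PR's) is to combine the three real predicates with `predicate.Or(...)` and attach them through the builder's `WithEventFilter`.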
Related issue number
#28652
Checks
Tested locally: all reconciles occur within the 2-second interval.