[Task Manager] Shift mechanism can cause a cascade throughout a cluster #88369
Pinging @elastic/kibana-alerting-services (Team:Alerting Services)
A couple of possible directions of investigation came to mind:
It's worth implementing both and testing them against a series of perf tests on cloud... but that's hard, as cloud doesn't easily support deployments from anything other than Main.
IIRC, we are only shifting "forward" (adding a delay). I suspect it won't actually help, but I wonder if we'd consider shifting "backwards" (running a little sooner) as well. I don't think we'd want to "undelay" much, no more than a second, and so I suspect this would have little effect.
If you have a …
Yeah, there is that :-). I guess I was thinking that if we only ADD delays, then (I think) we are adding a bit of latency somehow. And maybe if we also went backwards in time instead of forward, sometimes, and in very small increments, it might have the same effect of "spreading things out" without any additional latency. I think, though, the way we are implementing the "shift" now, it's not really possible to go "backwards" in time anyway.
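As a minimal sketch of the idea being discussed (not Task Manager's actual implementation), a bidirectional shift could bound how far a poller is allowed to "undelay". The helper `computeShiftedDelay` and the names `pollIntervalMs`, `maxBackwardShiftMs`, and `conflictRatio` are all hypothetical:

```ts
// Hypothetical sketch only: not Task Manager's real shift logic.
interface ShiftConfig {
  pollIntervalMs: number; // base polling interval, e.g. 3000ms
  maxBackwardShiftMs: number; // cap on "undelaying", e.g. 1000ms
}

function computeShiftedDelay(
  { pollIntervalMs, maxBackwardShiftMs }: ShiftConfig,
  conflictRatio: number // fraction of recent claims that hit version_conflict
): number {
  // Shift by a random fraction of the interval, scaled by how contended we are.
  const rawShift = Math.random() * pollIntervalMs * conflictRatio;
  // Pick a direction at random, but never "undelay" by more than the cap,
  // so a backward shift only nudges the next poll slightly earlier.
  const backward = Math.random() < 0.5;
  const shift = backward ? -Math.min(rawShift, maxBackwardShiftMs) : rawShift;
  return Math.max(0, pollIntervalMs + shift);
}
```

The point of the bounded backward branch is to spread pollers out without only ever accumulating extra latency; as noted above, the current mechanism only adds delay.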
Note from triage: this issue needs some research into what we should do to solve the problem, and a review of the findings with the team.
I'm putting this on hold until I can test the result of this PR: #88210
Sadly this didn't have much of an impact. While running that additional cloud test, I ran a local experiment where I ran 8 Kibana instances in parallel with a 500ms polling interval. I tried a little hack in the shifting mechanism that basically keeps the mechanism as is but adds one little change: in addition to the … Taking into account @bmcconaghy's concerns, I don't want to spend more time on this research, but I do think this small change is worth implementing and testing on cloud at higher numbers.
Sounds good to me so long as the change is small and simple. I do think the long-term solution to this is some form of Kibana clustering/coordination.
Yeah, absolutely, the goal here is to make sure the existing mechanism reduces unnecessary noise, but the long-term solution is definitely going to require some form of coordination between nodes.
To be honest, I'm indifferent :) I see some benefit from having this behavior documented, but I agree that it likely isn't actionable in isolation.
I also agree with keeping the issue open. It wouldn't hurt to document this behaviour, as the TM health API exposes information about it. It also wouldn't hurt to gather telemetry to understand the urgency of solving this in a larger manner.
In 7.11 we introduced a self-balancing mechanism into Task Manager so that multiple Kibana instances can detect when their task claiming is causing version_conflicts and shift their polling mechanism to avoid this.
While this has helped by improving the performance of the Alerting Framework, it has also introduced a new problem: a Task Manager that shifts can clash with other TMs that were running fine.
When there is a large number of TMs (32 Kibana instances, for example), this can lead to a cascade of shifts across many instances.
We need to experiment with other ways of shifting in order to reduce the likelihood of this.
Perhaps by making the average threshold higher, or by avoiding a shift when conflicts were lower not that long ago, encouraging the recently shifted TM to shift again rather than causing a cascade in which they all shift. A rough sketch of that damping idea follows.
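As an illustration only (not the actual Task Manager implementation), a shift decision could require a sustained high average of version_conflicts and skip the shift if the poller was calm only a few cycles ago. The function name, window sizes, and thresholds below are assumptions:

```ts
// Hypothetical damping heuristic, not Task Manager's real logic.
// Assumes a rolling window of per-cycle version_conflict ratios (0..1).
const CONFLICT_THRESHOLD = 0.4; // shift only when the average ratio is high
const CALM_LOOKBACK = 3; // number of recent cycles to inspect for calm

function shouldShift(recentConflictRatios: number[]): boolean {
  if (recentConflictRatios.length < CALM_LOOKBACK) return false;

  const avg =
    recentConflictRatios.reduce((sum, r) => sum + r, 0) /
    recentConflictRatios.length;
  // Require a sustained high average rather than reacting to a single spike.
  if (avg < CONFLICT_THRESHOLD) return false;

  // If conflicts were low only a few cycles ago, this poller was likely running
  // fine and the contention probably came from a neighbour that just shifted
  // onto it. Staying put encourages the recently shifted TM to move again
  // instead of triggering a cluster-wide cascade.
  const recentlyCalm = recentConflictRatios
    .slice(-CALM_LOOKBACK)
    .some((r) => r < CONFLICT_THRESHOLD / 2);
  return !recentlyCalm;
}
```

Raising the average threshold and checking for recent calm both bias the system toward fewer, more deliberate shifts, which is the behaviour this issue is asking for.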
We should also add telemetry around this so we can get an idea of how this behaves out in the wild.