
[Task Manager] Shift mechanism can cause a cascade throughout a cluster #88369

Open
gmmorris opened this issue Jan 14, 2021 · 13 comments
Assignees
Labels
- estimate:needs-research: Estimated as too large and requires research to break down into workable issues
- Feature:Task Manager
- impact:medium: Addressing this issue will have a medium level of impact on the quality/strength of our product.
- R&D: Research and development ticket (not meant to produce code, but to make a decision)
- resilience: Issues related to Platform resilience in terms of scale, performance & backwards compatibility
- Team:ResponseOps: Label for the ResponseOps team (formerly the Cases and Alerting teams)

Comments

@gmmorris
Contributor

In 7.11 we introduced a self-balancing mechanism into Task Manager so that multiple Kibana instances can detect when their task claiming is causing version_conflicts and shift their polling cycle to avoid this.

While this has helped by improving the performance of the Alerting Framework, it has also introduced a new problem: a Task Manager that shifts can clash with other TMs that were running fine.
When there is a large number of TMs (32 Kibana instances, for example), this can lead to a cascade of shifts across many instances.

We need to experiment with other ways of shifting in order to reduce the likelihood of this.
Perhaps we could raise the average threshold, or avoid a shift if conflicts were lower not that long ago, encouraging the recently shifted TM to shift again rather than causing a cascade in which they all shift.

We should also add telemetry around this so we can get an idea of how this behaves out in the wild.
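For readers unfamiliar with the mechanism, here is a minimal sketch of the kind of self-balancing loop described above. This is not the actual Kibana implementation; the names, window size, and threshold are all illustrative assumptions. Each TM tracks version_conflicts per claim cycle and, once the running conflict ratio crosses a threshold, delays ("shifts") its next poll by a random fraction of the interval:

```ts
// Hypothetical sketch of the shift mechanism described above (illustrative only).
interface CycleStats {
  versionConflicts: number;
  tasksClaimed: number;
}

const POLL_INTERVAL_MS = 3000;   // assumed polling interval
const CONFLICT_THRESHOLD = 0.8;  // assumed conflict ratio that triggers a shift
const WINDOW = 10;               // cycles used for the running average

const history: CycleStats[] = [];

function recordCycle(stats: CycleStats): void {
  history.push(stats);
  if (history.length > WINDOW) history.shift();
}

function averageConflictRatio(): number {
  const totals = history.reduce(
    (acc, s) => ({
      versionConflicts: acc.versionConflicts + s.versionConflicts,
      tasksClaimed: acc.tasksClaimed + s.tasksClaimed,
    }),
    { versionConflicts: 0, tasksClaimed: 0 }
  );
  return totals.tasksClaimed === 0 ? 0 : totals.versionConflicts / totals.tasksClaimed;
}

// A shift of N ms means the next poll fires N ms later than scheduled,
// effectively moving this TM's claiming "slot" relative to its neighbours.
function computeShiftMs(): number {
  if (averageConflictRatio() <= CONFLICT_THRESHOLD) return 0;
  return Math.floor(Math.random() * POLL_INTERVAL_MS);
}
```

The cascade arises because every TM runs this loop independently: when one TM shifts, it can land on a neighbour's slot, pushing that neighbour over the threshold on the next cycles.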

@gmmorris gmmorris added Feature:Task Manager Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) labels Jan 14, 2021
@elasticmachine
Contributor

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

@gmmorris
Contributor Author

A couple of possible directions of investigation that came to me:

  1. Perhaps we can change the mechanism so that the average version-clash rate required to shift is different for a Task Manager that has already shifted in the past few cycles than for a Task Manager that's experiencing clashes for the first time in a while? This would bias towards the same TM shifting repeatedly until it finds a good slot, instead of causing a cascade where both TMs shift (a rough sketch of this idea follows at the end of this comment).

  2. Perhaps we can use a mechanism like the back pressure we apply when TM experiences 429 errors? If version-clashes are high, we slow the polling interval down (make it longer) until clashes reduce. The downsides are that you might end up with all TMs spaced out to the point where the rate of task claiming is too slow, and that it becomes harder to reason about the system as a whole, as polling intervals could vary widely across the cluster.

It's worth implementing both and testing them against a series of perf tests on cloud... but that's hard, as cloud doesn't easily support deployments from anything other than main.
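A rough sketch of idea 1, purely to make the intent concrete (the thresholds and window are made-up numbers, not a concrete proposal): a TM that has not shifted recently needs a higher conflict ratio before it moves, while a TM that just shifted is allowed to keep hunting for a free slot with a lower bar, so it doesn't dislodge stable neighbours.

```ts
// Illustrative only: bias towards the recently shifted TM shifting again.
const BASE_THRESHOLD = 0.8;             // ratio needed for a "stable" TM to shift
const RECENTLY_SHIFTED_THRESHOLD = 0.5; // easier to shift again if we just shifted
const RECENT_SHIFT_WINDOW = 5;          // cycles that count as "recent"

function shouldShift(conflictRatio: number, cyclesSinceLastShift: number): boolean {
  const threshold =
    cyclesSinceLastShift <= RECENT_SHIFT_WINDOW
      ? RECENTLY_SHIFTED_THRESHOLD
      : BASE_THRESHOLD;
  return conflictRatio > threshold;
}
```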

@pmuellr
Member

pmuellr commented Jan 19, 2021

IIRC, we are only shifting "forward" (adding a delay). I suspect it won't actually help, but I wonder if we'd consider shifting "backwards" (running a little sooner) as well. I don't think we'd want to "undelay" much, like no more than a second, and so I suspect this will have little effect.

@mikecote mikecote added the R&D Research and development ticket (not meant to produce code, but to make a decision) label Jan 19, 2021
@gmmorris
Contributor Author

gmmorris commented Jan 19, 2021

IIRC, we are only shifting "forward" (adding a delay). I suspect it won't actually help, but I wonder if we'd consider shifting "backwards" (running a little sooner) as well. I don't think we'd want to "undelay" much, like no more than a second, and so I suspect this will have little effect.

If you have a 3s interval, what's the difference between shifting by -1s and shifting by 2s?
Isn't it the same? 🤔 Since all TMs are running at a 3s interval... in my head those sound the same...

@pmuellr
Member

pmuellr commented Jan 19, 2021

If you have a 3s interval, what's the difference between shifting by -1s and shifting by 2s?
Isn't it the same?

Yeah, there is that :-). I guess I was thinking that if we only ADD delays, then (I think) we are adding a bit of latency somehow. And maybe if we also went backwards in time instead of forward, sometimes, and in very small increments, it may have the same effect of "spreading things out" without any additional latency. I think though, the way we are implementing the "shift" now, it's not really possible to go "backwards" in time anyway.
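As a small aside, the equivalence being discussed can be seen as modular arithmetic over the polling interval (a throwaway illustration, using the 3s example from this thread):

```ts
const intervalMs = 3000;

// Normalise a shift into [0, intervalMs): -1000 and +2000 map to the same phase.
const phase = (shiftMs: number) => ((shiftMs % intervalMs) + intervalMs) % intervalMs;

console.log(phase(-1000)); // 2000
console.log(phase(2000));  // 2000
```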

@mikecote
Contributor

Note from triage: this issue needs some research on what we should do to solve the problem and review it with the team.

@gmmorris gmmorris self-assigned this Jan 21, 2021
@gmmorris
Contributor Author

I'm putting this on hold until I can test the result of this PR: #88210
Local experimentation shows that the cascading actually drops thanks to some of that cleanup, so I want to test this on cloud at scale.

@gmmorris
Contributor Author

I'm putting this on hold until I can test the result of this PR: #88210
Local experimentation shows that the cascading actually drops thanks to some of that cleanup, so I want to test this on cloud at scale.

Sadly this didn't have much of an impact.
It looks like it might have reduced shifting a tad, but nothing impactful.

While running that additional cloud test, I ran a local experiment with 8 Kibana instances in parallel and a 500ms polling interval.
It was quite easy to recreate this issue locally and visibly see the conflicts and the result of the overzealous shifting.

I tried a little hack in the shifting mechanism that keeps the mechanism as is but adds one small change: in addition to the p50 indicator, we also calculate the trend by comparing the average version_conflicts of the last few cycles to the few before it, and we avoid shifting if the trend is downwards.
Locally I saw this reduce the shifting dramatically, and after a minute of shifting back and forth all Kibana instances settled on a point in time where, for the most part, they were achieving a version_conflicts rate below our threshold.
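To make the tweak concrete, here is an illustrative sketch of the trend guard described above. It is not the actual patch; the window size and helper names are assumptions. The idea is simply to skip a shift when conflicts are already trending down:

```ts
function mean(values: number[]): number {
  return values.length === 0 ? 0 : values.reduce((a, b) => a + b, 0) / values.length;
}

// `conflictsPerCycle` is ordered oldest -> newest.
function isTrendingDown(conflictsPerCycle: number[], window = 3): boolean {
  if (conflictsPerCycle.length < window * 2) return false;
  const recent = conflictsPerCycle.slice(-window);
  const previous = conflictsPerCycle.slice(-window * 2, -window);
  return mean(recent) < mean(previous);
}

// Keep the existing p50-based decision, but suppress the shift when the
// conflict trend is already downwards.
function shouldShiftWithTrendGuard(
  p50AboveThreshold: boolean,
  conflictsPerCycle: number[]
): boolean {
  return p50AboveThreshold && !isTrendingDown(conflictsPerCycle);
}
```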

Taking into account @bmcconaghy's concerns I don't want to spend more time on this research, but I do think this small change is worth implementing and testing on cloud at higher numbers.
Would love to hear thoughts.

@bmcconaghy
Contributor

Sounds good to me so long as the change is small and simple. I do think the long term solution to this is some form of Kibana clustering/coordination.

@gmmorris
Contributor Author

Sounds good to me so long as the change is small and simple. I do think the long term solution to this is some form of Kibana clustering/coordination.

Yeah, absolutely. The goal here is to make sure the existing mechanism reduces unnecessary noise, but the long-term solution is definitely going to require some form of coordination between nodes.

@gmmorris gmmorris added the resilience Issues related to Platform resilience in terms of scale, performance & backwards compatibility label Jul 15, 2021
@gmmorris gmmorris added the loe:needs-research This issue requires some research before it can be worked on or estimated label Aug 11, 2021
@gmmorris gmmorris added the estimate:needs-research Estimated as too large and requires research to break down into workable issues label Aug 18, 2021
@gmmorris gmmorris removed the loe:needs-research This issue requires some research before it can be worked on or estimated label Sep 2, 2021
@gmmorris gmmorris added the impact:medium Addressing this issue will have a medium level of impact on the quality/strength of our product. label Sep 16, 2021
@ymao1
Contributor

ymao1 commented Dec 3, 2021

@mikecote @kobelb Is there value in keeping this issue open, since we are aware of the upper bound on the number of Kibanas that can run in parallel, and it seems like we should address that larger issue instead of this one?

@kobelb
Contributor

kobelb commented Dec 3, 2021

To be honest, I'm indifferent :) I see some benefit from having this behavior documented, but I agree that it likely isn't actionable in isolation.

@mikecote
Contributor

mikecote commented Dec 6, 2021

I also agree to keep the issue open. It wouldn't hurt to document this behaviour, as the TM health API exposes information about this. It wouldn't hurt to also gather telemetry to understand the urgency of solving this in a larger manner.

@kobelb kobelb added the needs-team Issues missing a team label label Jan 31, 2022
@botelastic botelastic bot removed the needs-team Issues missing a team label label Jan 31, 2022