
[Task Manager] Shift mechanism can cause a cascade throughout a cluster #88369

Open
gmmorris opened this issue Jan 14, 2021 · 13 comments
Assignees
Labels
- estimate:needs-research: Estimated as too large and requires research to break down into workable issues
- Feature:Task Manager
- impact:medium: Addressing this issue will have a medium level of impact on the quality/strength of our product.
- R&D: Research and development ticket (not meant to produce code, but to make a decision)
- resilience: Issues related to Platform resilience in terms of scale, performance & backwards compatibility
- Team:ResponseOps: Label for the ResponseOps team (formerly the Cases and Alerting teams)

Comments

@gmmorris
Contributor

In 7.11 we introduced a self-balancing mechanism into Task Manager so that multiple Kibana instances can detect when their task claiming is causing version_conflicts and shift their polling cycle to avoid this.

While this has helped by improving the performance of the Alerting Framework, it has also introduced a new problem: a Task Manager that shifts can clash with other TMs that were running fine.
When there is a large number of TMs (32 Kibana instances, for example), this can lead to a cascade of shifts across many instances.

We need to experiment with other ways of shifting in order to reduce the likelihood of this.
Perhaps we could raise the average threshold, or avoid a shift if conflicts were lower not that long ago, encouraging the recently shifted TM to shift again rather than causing a cascade in which they all shift.

We should also add telemetry around this so we can get an idea of how this behaves out in the wild.
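For readers unfamiliar with the mechanism, here is a minimal sketch of the kind of self-balancing loop described above. This is not the actual Kibana implementation; the names, window size, and threshold are all illustrative assumptions. Each TM tracks version_conflicts per claim cycle and, once the running conflict ratio crosses a threshold, delays ("shifts") its next poll by a random fraction of the interval:

```ts
// Hypothetical sketch of the shift mechanism described above (illustrative only).
interface CycleStats {
  versionConflicts: number;
  tasksClaimed: number;
}

const POLL_INTERVAL_MS = 3000;   // assumed polling interval
const CONFLICT_THRESHOLD = 0.8;  // assumed conflict ratio that triggers a shift
const WINDOW = 10;               // cycles used for the running average

const history: CycleStats[] = [];

function recordCycle(stats: CycleStats): void {
  history.push(stats);
  if (history.length > WINDOW) history.shift();
}

function averageConflictRatio(): number {
  const totals = history.reduce(
    (acc, s) => ({
      versionConflicts: acc.versionConflicts + s.versionConflicts,
      tasksClaimed: acc.tasksClaimed + s.tasksClaimed,
    }),
    { versionConflicts: 0, tasksClaimed: 0 }
  );
  return totals.tasksClaimed === 0 ? 0 : totals.versionConflicts / totals.tasksClaimed;
}

// A shift of N ms means the next poll fires N ms later than scheduled,
// effectively moving this TM's claiming "slot" relative to its neighbours.
function computeShiftMs(): number {
  if (averageConflictRatio() <= CONFLICT_THRESHOLD) return 0;
  return Math.floor(Math.random() * POLL_INTERVAL_MS);
}
```

The cascade arises because every TM runs this loop independently: when one TM shifts, it can land on a neighbour's slot, pushing that neighbour over the threshold on the next cycles.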

@gmmorris gmmorris added Feature:Task Manager Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) labels Jan 14, 2021
@elasticmachine
Contributor

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

@gmmorris
Contributor Author

A couple of possible directions of investigation that came to me:

  1. Perhaps we can change the mechanism so that the average version-clash rate required to shift is different for a Task Manager that has already shifted in the past few cycles than for a Task Manager that's experiencing clashes for the first time in a while? This would bias towards the same TM shifting repeatedly until it finds a good slot, instead of causing a cascade where both TMs shift (a rough sketch of this idea follows at the end of this comment).

  2. Perhaps we can use a mechanism like the back pressure we apply when TM experiences 429 errors? If version-clashes are high, we slow the polling interval down (make it longer) until clashes reduce. The downsides are that you might end up with all TMs spaced out to the point where the rate of task claiming is too slow, and that it becomes harder to reason about the system as a whole, as polling intervals could vary widely across the cluster.

It's worth implementing both and testing them against a series of perf tests on cloud... but that's hard, as cloud doesn't easily support deployments from anything other than main.
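A rough sketch of idea 1, purely to make the intent concrete (the thresholds and window are made-up numbers, not a concrete proposal): a TM that has not shifted recently needs a higher conflict ratio before it moves, while a TM that just shifted is allowed to keep hunting for a free slot with a lower bar, so it doesn't dislodge stable neighbours.

```ts
// Illustrative only: bias towards the recently shifted TM shifting again.
const BASE_THRESHOLD = 0.8;             // ratio needed for a "stable" TM to shift
const RECENTLY_SHIFTED_THRESHOLD = 0.5; // easier to shift again if we just shifted
const RECENT_SHIFT_WINDOW = 5;          // cycles that count as "recent"

function shouldShift(conflictRatio: number, cyclesSinceLastShift: number): boolean {
  const threshold =
    cyclesSinceLastShift <= RECENT_SHIFT_WINDOW
      ? RECENTLY_SHIFTED_THRESHOLD
      : BASE_THRESHOLD;
  return conflictRatio > threshold;
}
```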

@pmuellr
Member

pmuellr commented Jan 19, 2021

IIRC, we are only shifting "forward" (adding a delay). I suspect it won't actually help, but I wonder if we'd consider shifting "backwards" (running a little sooner) as well. I don't think we'd want to "undelay" much, like no more than a second, and so I suspect this will have little effect.

@mikecote mikecote added the R&D Research and development ticket (not meant to produce code, but to make a decision) label Jan 19, 2021
@gmmorris
Contributor Author

gmmorris commented Jan 19, 2021

IIRC, we are only shifting "forward" (adding a delay). I suspect it won't actually help, but I wonder if we'd consider shifting "backwards" (running a little sooner) as well. I don't think we'd want to "undelay" much, like no more than a second, and so I suspect this will have little effect.

If you have a 3s interval, what's the difference between shifting by -1s and shifting by 2s?
Isn't it the same? 🤔 Since all TMs are running at a 3s interval... in my head those sound the same...

@pmuellr
Member

pmuellr commented Jan 19, 2021

If you have a 3s interval, what's the difference between shifting by -1s and shifting by 2s?
Isn't it the same?

Yeah, there is that :-). I guess I was thinking that if we only ADD delays, then (I think) we are adding a bit of latency somehow. And maybe if we also went backwards in time instead of forward, sometimes, and in very small increments, it may have the same effect of "spreading things out" without any additional latency. I think though, the way we are implementing the "shift" now, it's not really possible to go "backwards" in time anyway.
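As a small aside, the equivalence being discussed can be seen as modular arithmetic over the polling interval (a throwaway illustration, using the 3s example from this thread):

```ts
const intervalMs = 3000;

// Normalise a shift into [0, intervalMs): -1000 and +2000 map to the same phase.
const phase = (shiftMs: number) => ((shiftMs % intervalMs) + intervalMs) % intervalMs;

console.log(phase(-1000)); // 2000
console.log(phase(2000));  // 2000
```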

@mikecote
Contributor

Note from triage: this issue needs some research on what we should do to solve the problem and review it with the team.

@gmmorris gmmorris self-assigned this Jan 21, 2021
@gmmorris
Contributor Author

I'm putting this on hold until I can test the result of this PR: #88210
Local experimentation shows that the cascading actually drops thanks to some of that cleanup, so I want to test this on cloud at scale.

@gmmorris
Contributor Author

I'm putting this on hold until I can test the result of this PR: #88210
Local experimentation shows that the cascading actually drops thanks to some of that cleanup, so I want to test this on cloud at scale.

Sadly this didn't have much of an impact.
It looks like it might have reduced shifting a tad, but nothing impactful.

While running that additional cloud test, I ran a local experiment with 8 Kibana instances in parallel and a 500ms polling interval.
It was quite easy to recreate this issue locally and visibly see the conflicts and the result of the overzealous shifting.

I tried a little hack in the shifting mechanism that keeps the mechanism as is but adds one small change: in addition to the p50 indicator, we also calculate the trend by comparing the average version_conflicts of the last few cycles to the few before it, and we avoid shifting if the trend is downwards.
Locally I saw this reduce the shifting dramatically, and after a minute of shifting back and forth all Kibana instances settled on a point in time where, for the most part, they were achieving a version_conflicts rate below our threshold.
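To make the tweak concrete, here is an illustrative sketch of the trend guard described above. It is not the actual patch; the window size and helper names are assumptions. The idea is simply to skip a shift when conflicts are already trending down:

```ts
function mean(values: number[]): number {
  return values.length === 0 ? 0 : values.reduce((a, b) => a + b, 0) / values.length;
}

// `conflictsPerCycle` is ordered oldest -> newest.
function isTrendingDown(conflictsPerCycle: number[], window = 3): boolean {
  if (conflictsPerCycle.length < window * 2) return false;
  const recent = conflictsPerCycle.slice(-window);
  const previous = conflictsPerCycle.slice(-window * 2, -window);
  return mean(recent) < mean(previous);
}

// Keep the existing p50-based decision, but suppress the shift when the
// conflict trend is already downwards.
function shouldShiftWithTrendGuard(
  p50AboveThreshold: boolean,
  conflictsPerCycle: number[]
): boolean {
  return p50AboveThreshold && !isTrendingDown(conflictsPerCycle);
}
```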

Taking into account @bmcconaghy's concerns I don't want to spend more time on this research, but I do think this small change is worth implementing and testing on cloud at higher numbers.
Would love to hear thoughts.

@bmcconaghy
Contributor

Sounds good to me so long as the change is small and simple. I do think the long term solution to this is some form of Kibana clustering/coordination.

@gmmorris
Contributor Author

Sounds good to me so long as the change is small and simple. I do think the long term solution to this is some form of Kibana clustering/coordination.

Yeah, absolutely. The goal here is to make sure the existing mechanism reduces unnecessary noise, but the long-term solution is definitely going to require some form of coordination between nodes.

@gmmorris gmmorris added the resilience Issues related to Platform resilience in terms of scale, performance & backwards compatibility label Jul 15, 2021
@gmmorris gmmorris added the loe:needs-research This issue requires some research before it can be worked on or estimated label Aug 11, 2021
@gmmorris gmmorris added the estimate:needs-research Estimated as too large and requires research to break down into workable issues label Aug 18, 2021
@gmmorris gmmorris removed the loe:needs-research This issue requires some research before it can be worked on or estimated label Sep 2, 2021
@gmmorris gmmorris added the impact:medium Addressing this issue will have a medium level of impact on the quality/strength of our product. label Sep 16, 2021
@ymao1
Contributor

ymao1 commented Dec 3, 2021

@mikecote @kobelb Is there value in keeping this issue open, since we are aware of the upper bound on the number of Kibanas that can run in parallel, and it seems like we should address that larger issue instead of this one?

@kobelb
Contributor

kobelb commented Dec 3, 2021

To be honest, I'm indifferent :) I see some benefit from having this behavior documented, but I agree that it likely isn't actionable in isolation.

@mikecote
Contributor

mikecote commented Dec 6, 2021

I also agree to keep the issue open. It wouldn't hurt to document this behaviour, as the TM health API exposes information about this. It wouldn't hurt to also gather telemetry to understand the urgency of solving this in a larger manner.

@kobelb kobelb added the needs-team Issues missing a team label label Jan 31, 2022
@botelastic botelastic bot removed the needs-team Issues missing a team label label Jan 31, 2022