Soft retry limits - AAE fullsync #772
Comments
@martinsumner just to be clear, is the process growth on the source or the sink?
As discussed in side-channel: the AAE full-sync code is difficult to trace through, production experience shows that testing has so far been deficient, and the solution is so brittle as to imply there can only be a small handful of production users (perhaps only one). The intention for 3.0 is to replace it with something altogether simpler, so extensive effort to unpick the code and improve test coverage for 2.2.5 doesn't make sense at this stage. If we can improve this behaviour for soft exits only, to reduce the chance of crashes, that will be good enough. There's no need to invest significant effort in refactoring code or tests to move this forward for the long term.
When performing AAE full-sync, if the vnode elected to co-ordinate with is doing a tree rebuild (and given multi-hour reload times this would seem to be a common event), the co-ordination attempt will soft exit.
The soft exit is captured and handled here:
https://github.com/basho/riak_repl/blob/2.1.8/src/riak_repl2_fscoordinator.erl#L537-L554
This checks against a retry limit, and presumably prompts a retry if the limit is not reached. The default soft retry limit is set to infinity.
In production we see very rapid retries, which escalate the sys process count and ultimately impact the stability of the cluster.
Attempts are underway to control this behaviour by setting a max retry limit. However, the default setting appears to be unsafe and should probably be changed. Perhaps there should also be a wait between retries. Perhaps there is also an underlying issue causing the process count to stack up with the retries.
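To make the two suggestions concrete (a finite retry limit and a wait between retries), here is a rough sketch in Python rather than Erlang. It is not the riak_repl code: `SoftExit`, `backoff_delay` and `run_with_soft_retries` are hypothetical names, and the doubling, capped wait is just one possible policy, not something the coordinator does today.

```python
import time
from typing import Callable, Optional


class SoftExit(Exception):
    """Hypothetical stand-in for a soft exit (e.g. the peer vnode is rebuilding its tree)."""


def backoff_delay(retry: int, base_wait: float, max_wait: float) -> float:
    """Wait before retry number `retry` (1-based): doubles each time, capped at max_wait."""
    return min(max_wait, base_wait * (2 ** (retry - 1)))


def run_with_soft_retries(
    attempt: Callable[[], str],
    max_retries: Optional[int],
    base_wait: float = 1.0,
    max_wait: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call attempt() until it succeeds or the soft retry limit is exhausted.

    max_retries=None mimics an unbounded ("infinity") limit; any finite value
    bounds the number of retries that a persistent soft exit can trigger.
    """
    retries = 0
    while True:
        try:
            return attempt()
        except SoftExit:
            if max_retries is not None and retries >= max_retries:
                raise  # limit reached: give up instead of retrying again
            retries += 1
            # Pause before the next attempt so retries cannot pile up rapidly.
            sleep(backoff_delay(retries, base_wait, max_wait))
```

With `max_retries=None` and no wait this degenerates into the rapid, unbounded retry loop described above. A finite limit bounds the total work, and the wait spaces attempts out while the tree rebuild completes.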