CBGT: Dead node detection autofailover #1075
After speaking to @steveyen, it appears there are a few options:
Requirements
@steveyen -- need some help debugging this.
I have two Sync Gateways running, and the CFG is currently:
Immediately after killing one of the Sync Gateways, the CFG is: https://gist.github.com/tleyden/c5e1c452fa3e95a6900c
In the remaining Sync Gateway, the following is shown in the logs: https://gist.github.com/tleyden/0bc928851cf96343f536
The final CFG is: https://gist.github.com/tleyden/e9b481d63f3fb4144b81
The full logs for SG1 are here:
Hi Traun, are the calls to cbgt.CfgRemoveNodeDef() returning any errors? If there is no error logging around them yet, could you add some for more diagnosis? More info on the two calls with NODE_DEFS_KNOWN and NODE_DEFS_WANTED is here...
Here's the code: it has error checking, and I don't see the error being emitted.
btw, how did you determine that? Let me know if you have any tips on how to debug aside from checking the errors.
WHOOPS, I see a bug. Should be
On the debugging, from the Cfg snapshots that you had gist'ed, I saw that one node had disappeared from nodesKnown but not from the nodesWanted section of the Cfg.
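The symptom described above (a node gone from nodesKnown but still listed in nodesWanted) suggests that dead-node cleanup has to cover both sections. Below is a minimal sketch of that idea in Go. It does not use cbgt itself: `memCfg`, `removeNodeDef`, and `removeDeadNode` are hypothetical stand-ins made up for illustration, and the real `cbgt.CfgRemoveNodeDef()` call has its own signature that should be checked against the cbgt source.

```go
package main

import "fmt"

// Section names used by this sketch. They are illustrative values,
// not taken from the cbgt source.
const (
	nodeDefsKnown  = "known"
	nodeDefsWanted = "wanted"
)

// memCfg is a hypothetical in-memory stand-in for the cluster Cfg:
// section kind -> set of node UUIDs.
type memCfg map[string]map[string]bool

// removeNodeDef is a hypothetical stand-in for a per-section removal call.
// It fails if the requested section does not exist.
func removeNodeDef(cfg memCfg, kind, uuid string) error {
	nodes, ok := cfg[kind]
	if !ok {
		return fmt.Errorf("unknown node defs kind %q", kind)
	}
	delete(nodes, uuid)
	return nil
}

// removeDeadNode removes uuid from BOTH the known and wanted sections.
// It attempts every section even if one fails, and returns the first error.
func removeDeadNode(cfg memCfg, uuid string) error {
	var firstErr error
	for _, kind := range []string{nodeDefsKnown, nodeDefsWanted} {
		if err := removeNodeDef(cfg, kind, uuid); err != nil && firstErr == nil {
			firstErr = fmt.Errorf("remove %s from %s: %w", uuid, kind, err)
		}
	}
	return firstErr
}

func main() {
	cfg := memCfg{
		nodeDefsKnown:  {"sg1": true, "sg2": true},
		nodeDefsWanted: {"sg1": true, "sg2": true},
	}
	if err := removeDeadNode(cfg, "sg2"); err != nil {
		fmt.Println("failover cleanup failed:", err)
	}
	// Both sections should now list only sg1.
	fmt.Println(len(cfg[nodeDefsKnown]), len(cfg[nodeDefsWanted])) // prints "1 1"
}
```

Looping over both section kinds in one helper makes it harder to remove a node from one section and forget the other, and wrapping the error with the section name shows which removal failed.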
@steveyen - The autofailover is still not working -- here are the steps to reproduce:
Actual: sg1 is only getting DCP updates for vbuckets where vbucket id >= 512, but nothing for vbucket id < 512, which are the vbuckets that sg2 was handling. What debug info can I collect?
First suspicion... looks like a legit cbgt bug (!), as I just saw something similar with a cbft distributed cluster.
Yeah, I think there may be a bug in the cbgt code that recalculates the DCP streams (CalcFeedsDelta). Looking through it.
Please see CBGT fix, commit a119e02, which corrects the CalcFeedsDelta for the scenario of a shrinking cluster.
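To make the shrinking-cluster scenario concrete, here is a rough sketch in Go of what a feeds delta has to produce after a node dies. It is not the cbgt `CalcFeedsDelta` implementation or its signature: `feedsDelta` and `vbRange` are hypothetical helpers, and the total of 1024 vbuckets is an assumption chosen to match the 512 split reported above.

```go
package main

import (
	"fmt"
	"sort"
)

// feedsDelta compares the vbuckets a node currently streams with the
// vbuckets it should stream after replanning. It returns the IDs that
// need a new stream (add) and the IDs whose stream should be closed (remove).
func feedsDelta(current, wanted []int) (add, remove []int) {
	cur := make(map[int]bool, len(current))
	for _, vb := range current {
		cur[vb] = true
	}
	want := make(map[int]bool, len(wanted))
	for _, vb := range wanted {
		if want[vb] {
			continue // ignore duplicate IDs in wanted
		}
		want[vb] = true
		if !cur[vb] {
			add = append(add, vb)
		}
	}
	for vb := range cur {
		if !want[vb] {
			remove = append(remove, vb)
		}
	}
	sort.Ints(add)
	sort.Ints(remove)
	return add, remove
}

// vbRange returns the vbucket IDs lo, lo+1, ..., hi-1.
func vbRange(lo, hi int) []int {
	ids := make([]int, 0, hi-lo)
	for vb := lo; vb < hi; vb++ {
		ids = append(ids, vb)
	}
	return ids
}

func main() {
	// Assumed scenario: sg1 streamed vbuckets 512..1023 and sg2 streamed 0..511.
	// After sg2 dies, sg1 should stream all of 0..1023.
	add, remove := feedsDelta(vbRange(512, 1024), vbRange(0, 1024))
	fmt.Println(len(add), add[0], add[len(add)-1], len(remove)) // prints "512 0 511 0"
}
```

In the failure reported in this thread, the surviving node never picked up the streams for vbuckets below 512; in terms of this sketch, that corresponds to the `add` set coming back empty when the cluster shrinks.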
Cool! Thanks, will try it and let you know.
As documented here, couchbaselabs/cbgt#22,
CBGT does provide a way to do this via an API. However, that would only apply for SG nodes taken down intentionally.