CBGT: Dead node detection autofailover #1075

zgramana · 2015-08-14T18:45:44Z

As documented in couchbaselabs/cbgt#22, CBGT does provide a way to do this via an API. However, that only applies to SG nodes that are taken down intentionally.

tleyden · 2015-08-14T22:07:28Z

After speaking to @steveyen, it appears there are a few options:

Tight integration with ns_server would allow this feature to be inherited -- of course, this is probably a long way off.
This problem has already been solved by CBFS in a simple and elegant manner that requires no inter-node communication. That part can be refactored out of CBFS into a shared library that CBGT could use.

tleyden · 2015-08-17T19:44:04Z

Requirements

A timeout is specified (e.g., 30s). If a node is deemed down for longer than the timeout (e.g., it hasn't been seen for > 30s), then any other node that notices this will remove it from the CBGT Cfg so that its Pindexes are redistributed. This removal either has to be idempotent, or the nodes will need to coordinate.
If the node comes online again (perhaps it restarted, perhaps there was a temporary network partition), it will need to be re-added to the CBGT Cfg. Not sure whether this can be done from the node itself, or whether another node would need to do it.
Some protection should be added so that an SG node stuck in a "reboot loop" does not wreak havoc on the system.

tleyden · 2015-09-01T22:55:08Z

@steveyen -- need some help debugging this.

I have two sync gateways running, and the CFG is currently:

https://gist.githubusercontent.com/tleyden/bffa2eae29f1dfa9ece8/raw/60c8528ad34da2c09da9d4465cf48153e0344967/gistfile1.txt

Immediately after killing one of the Sync Gateways, the CFG is:

https://gist.github.com/tleyden/c5e1c452fa3e95a6900c

and in the remaining Sync Gateway, the following is shown in the logs:

https://gist.github.com/tleyden/0bc928851cf96343f536

and the final CFG is:

https://gist.github.com/tleyden/e9b481d63f3fb4144b81

the full logs for SG1 are here:

https://gist.github.com/tleyden/f956a682dfc7f659c8da

steveyen · 2015-09-02T01:05:27Z

Hi Traun,
Looks like the invocation of cbgt.CfgRemoveNodeDef() with NODE_DEFS_KNOWN worked, but the invocation of cbgt.CfgRemoveNodeDef() with NODE_DEFS_WANTED didn't work.

Are the calls to cbgt.CfgRemoveNodeDef() returning any errors?

Or, if it isn't there already, could you add some error logging for more diagnosis?

More info on the two calls with NODE_DEFS_KNOWN and NODE_DEFS_WANTED is here:

couchbaselabs/cbgt#26

tleyden · 2015-09-02T01:37:15Z

Here's the code:

https://github.com/couchbase/sync_gateway/blob/feature/distributed_index_autofailover/src/github.com/couchbase/sync_gateway/base/sgw_pindex.go#L306-L326

It has error checking, and I don't see any error being emitted.

"invocation of cbgt.CfgRemoveNodeDef() with NODE_DEFS_WANTED didn't work."

Btw, how did you determine that? Let me know if you have any tips on how to debug, aside from checking the errors.

tleyden · 2015-09-02T01:41:11Z

WHOOPS, I see a bug.

Should be

// Assumes kinds is a []string holding cbgt.NODE_DEFS_KNOWN and cbgt.NODE_DEFS_WANTED.
for _, kind := range kinds {
    log.Printf("call cbgt.CfgRemoveNodeDef with nodeuuid: %v cfg: %+v", nodeUuid, h.Cfg)
    if err := cbgt.CfgRemoveNodeDef(
        h.Cfg,
        kind, // the fix: pass the loop variable here
        nodeUuid,
        h.CbgtVersion,
    ); err != nil {
        log.Printf("Warning: attempted to remove %v (%v) from CBGT but failed: %v", nodeUuid, kind, err)
    }
}

steveyen · 2015-09-02T02:57:11Z

On the debugging, from the Cfg snapshots that you had gist'ed, I saw that one node had disappeared from nodesKnown but not from the nodesWanted section of the Cfg.
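
The comparison described above can be automated. The following is a rough sketch that assumes the node UUIDs from the nodesKnown and nodesWanted sections have already been extracted from a Cfg snapshot into two sets; `StaleWanted` is a hypothetical helper, not a cbgt function.

```go
package main

import "sort"

// StaleWanted returns the node UUIDs that appear in wanted but not in known,
// sorted. Both maps are keyed by node UUID; a true value marks membership.
func StaleWanted(known, wanted map[string]bool) []string {
	stale := []string{}
	for uuid, present := range wanted {
		if present && !known[uuid] {
			stale = append(stale, uuid)
		}
	}
	sort.Strings(stale)
	return stale
}
```

A non-empty result points at a node that was removed from one section of the Cfg but not the other, which is the symptom seen in the snapshots.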

tleyden · 2015-09-03T01:11:26Z

@steveyen - The autofailover is still not working -- here are the steps to reproduce:

Start sg1 with clean data dir and empty couchbase buckets
Add a user and a doc
Start sg2
Add a doc
Kill sg2
Capture CFG json
Add more docs

Actual: sg1 only gets DCP updates for vbuckets with vbucket id >= 512, and nothing for vbucket id < 512, which are the vbuckets that sg2 was handling.
Expected: sg1 should get DCP updates for all vbuckets after sg2 is shut down.
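
One way to make the actual-versus-expected check repeatable is to record which vbucket IDs produced DCP updates and report the gaps. This is only a sketch: `Range` and `MissingVBuckets` are hypothetical names, and the total of 1024 vbuckets used in the note below is an assumption rather than something stated in this thread.

```go
package main

// Range is an inclusive span of vbucket IDs.
type Range struct{ First, Last int }

// MissingVBuckets returns the inclusive ranges of vbucket IDs in
// [0, numVBuckets) that are absent from seen.
func MissingVBuckets(seen map[int]bool, numVBuckets int) []Range {
	var gaps []Range
	start := -1 // first ID of the gap currently being extended, or -1 if none
	for id := 0; id < numVBuckets; id++ {
		switch {
		case !seen[id] && start < 0:
			start = id // a new gap begins
		case seen[id] && start >= 0:
			gaps = append(gaps, Range{start, id - 1}) // the gap ended just before id
			start = -1
		}
	}
	if start >= 0 {
		gaps = append(gaps, Range{start, numVBuckets - 1})
	}
	return gaps
}
```

Assuming 1024 vbuckets in total, a run in which only IDs 512 through 1023 were seen yields the single range 0 to 511, matching the behaviour reported here. After a successful failover the result should be empty.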

What debug info can I collect?

steveyen · 2015-09-03T04:19:09Z

First suspicion... looks like a legit cbgt bug (!), as I just saw something similar with a cbft distributed cluster.

steveyen · 2015-09-03T04:30:47Z

Yeah, I think there's maybe a bug in the cbgt code that recalculates the DCP streams (CalcFeedsDelta). Looking through it.

steveyen · 2015-09-03T05:17:33Z

Please see the CBGT fix, commit a119e02, which corrects CalcFeedsDelta for the scenario of a shrinking cluster.
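
To illustrate the general idea of a feed delta, here is a simplified sketch. It is not cbgt's CalcFeedsDelta, whose inputs and feed model differ; `FeedsDelta` and the feed names in the note below are made up for the example. It compares the feeds a node is currently running with the feeds it should be running and returns what to start and what to stop.

```go
package main

import "sort"

// FeedsDelta compares the feeds a node currently runs with the feeds it
// should run. It returns the feed names to start (add) and to stop (remove),
// each sorted. A true map value marks membership.
func FeedsDelta(current, wanted map[string]bool) (add, remove []string) {
	for name, want := range wanted {
		if want && !current[name] {
			add = append(add, name)
		}
	}
	for name, running := range current {
		if running && !wanted[name] {
			remove = append(remove, name)
		}
	}
	sort.Strings(add)
	sort.Strings(remove)
	return add, remove
}
```

In the shrinking-cluster case, a surviving node that runs only a feed named `feed_512_1023` and is now wanted for both `feed_0_511` and `feed_512_1023` gets `feed_0_511` in the add list and nothing in the remove list.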

tleyden · 2015-09-03T05:35:24Z

Cool! Thanks, will try it and let you know.

tleyden · 2015-09-03T17:44:03Z

It works now!

Bring sg1 online
Add user and docs
Bring sg2 online
Add docs
Take sg2 offline
Verify that sg1 now handles DCP messages for all vbuckets

Here are the Sync Gw logs:

sg1
sg2

zgramana added this to the 1.2.0 milestone Aug 14, 2015

zgramana added chore in progress labels Aug 14, 2015

zgramana assigned tleyden Aug 14, 2015

zgramana added enhancement and removed chore labels Aug 14, 2015

tleyden changed the title ~~Dead node detection~~ CBGT: Dead node detection Aug 21, 2015

tleyden changed the title ~~CBGT: Dead node detection~~ CBGT: Dead node detection autofailover Aug 28, 2015

tleyden pushed a commit that referenced this issue Sep 2, 2015

Possible fix for #1075

aae9522

tleyden pushed a commit that referenced this issue Sep 2, 2015

Possible fix for #1075

041792b

tleyden pushed a commit that referenced this issue Sep 3, 2015

Possible fix for #1075

77db7eb

tleyden pushed a commit that referenced this issue Sep 3, 2015

Possible fix for #1075

d725d72

tleyden pushed a commit that referenced this issue Sep 3, 2015

Implement CBGT AutoFailover

11a39f8

tleyden mentioned this issue Sep 3, 2015

Implement CBGT AutoFailover #1114

Merged

zgramana added the dcache label Sep 4, 2015

tleyden pushed a commit that referenced this issue Sep 4, 2015

Implement CBGT AutoFailover

71ed697

tleyden closed this as completed Sep 10, 2015

tleyden removed the in progress label Sep 10, 2015
