CBGT: Dead node detection autofailover #1075

Closed
zgramana opened this issue Aug 14, 2015 · 13 comments

@zgramana
Contributor

As documented in couchbaselabs/cbgt#22, CBGT does provide a way to do this via an API. However, that would only apply to SG nodes taken down intentionally.

@tleyden
Contributor

tleyden commented Aug 14, 2015

After speaking to @steveyen, it appears there are a few options:

  • Tight integration with ns_server would allow this feature to be inherited -- of course, this is probably a long way off.
  • This problem has already been solved by CBFS in a simple and elegant manner that requires no inter-node communication. That part can be refactored out of CBFS into a shared library that CBGT could use.

@tleyden
Contributor

tleyden commented Aug 17, 2015

Requirements

  • A timeout is specified (e.g., 30s). If a node has not been seen for longer than the timeout (e.g., hasn't been seen for > 30s), any other node that notices this removes it from the CBGT Cfg so its Pindexes are redistributed. This either has to be idempotent, or nodes will need to coordinate.
  • If the node comes online again (perhaps it restarted, perhaps there was a temporary network partition), it will need to be re-added to the CBGT Cfg. Not sure whether this can be done from the node itself or whether another node would need to do it.
  • Some protection should be added so that an SG node stuck in a "reboot loop" does not wreak havoc on the system.
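
A minimal sketch of the first requirement, to make the timeout and idempotency points concrete. Everything here is hypothetical (the `Detector` type, `RemoveNode`, and the plain map standing in for the Cfg); it is not CBGT or Sync Gateway code, and it does not cover re-adding nodes or reboot-loop protection.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// Detector tracks when each node was last seen. Hypothetical type,
// not part of CBGT or Sync Gateway.
type Detector struct {
	timeout  time.Duration
	lastSeen map[string]time.Time
}

func NewDetector(timeout time.Duration) *Detector {
	return &Detector{timeout: timeout, lastSeen: map[string]time.Time{}}
}

// Heartbeat records that nodeUUID was seen at time now.
func (d *Detector) Heartbeat(nodeUUID string, now time.Time) {
	d.lastSeen[nodeUUID] = now
}

// DeadNodes returns, sorted, the nodes not seen for longer than the timeout.
func (d *Detector) DeadNodes(now time.Time) []string {
	var dead []string
	for uuid, seen := range d.lastSeen {
		if now.Sub(seen) > d.timeout {
			dead = append(dead, uuid)
		}
	}
	sort.Strings(dead)
	return dead
}

// RemoveNode drops a node from the wanted set. Removing a node that is
// already absent is a no-op, so several observers can call it safely.
func RemoveNode(wanted map[string]bool, nodeUUID string) {
	delete(wanted, nodeUUID)
}

// demo runs a two-node scenario in which sg2 goes silent.
func demo() map[string]bool {
	start := time.Date(2015, 8, 17, 0, 0, 0, 0, time.UTC)
	d := NewDetector(30 * time.Second)
	d.Heartbeat("sg1", start)
	d.Heartbeat("sg2", start)
	d.Heartbeat("sg1", start.Add(40*time.Second)) // sg2 sends nothing more

	wanted := map[string]bool{"sg1": true, "sg2": true}
	for _, uuid := range d.DeadNodes(start.Add(45 * time.Second)) {
		RemoveNode(wanted, uuid)
		RemoveNode(wanted, uuid) // a second observer repeating the call changes nothing
	}
	return wanted
}

func main() {
	fmt.Println(demo())
}
```

Passing the clock in as an argument keeps the check deterministic; a real detector would more likely read heartbeats from shared storage and call `time.Now()`.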

tleyden changed the title from "Dead node detection" to "CBGT: Dead node detection" on Aug 21, 2015
tleyden changed the title from "CBGT: Dead node detection" to "CBGT: Dead node detection autofailover" on Aug 28, 2015

@tleyden
Contributor

tleyden commented Sep 1, 2015

@steveyen -- need some help debugging this.

I have two sync gateways running, and the CFG is currently:

https://gist.githubusercontent.com/tleyden/bffa2eae29f1dfa9ece8/raw/60c8528ad34da2c09da9d4465cf48153e0344967/gistfile1.txt

Immediately after killing one of the Sync Gateways, the CFG is:

https://gist.github.com/tleyden/c5e1c452fa3e95a6900c

and in the remaining Sync Gateway, the following is shown in the logs:

https://gist.github.com/tleyden/0bc928851cf96343f536

and the final CFG is:

https://gist.github.com/tleyden/e9b481d63f3fb4144b81

the full logs for SG1 are here:

https://gist.github.com/tleyden/f956a682dfc7f659c8da

@steveyen
Member

steveyen commented Sep 2, 2015

Hi Traun,
Looks like the invocation of cbgt.CfgRemoveNodeDef() with NODE_DEFS_KNOWN worked, but the invocation of cbgt.CfgRemoveNodeDef() with NODE_DEFS_WANTED didn't work.

Are the calls to cbgt.CfgRemoveNodeDef() returning any errors?

Or, if there isn't any already, could you add some error logging for more diagnosis?

More info on the two calls with NODE_DEFS_KNOWN and NODE_DEFS_WANTED is here:

couchbaselabs/cbgt#26

@tleyden
Contributor

tleyden commented Sep 2, 2015

Here's the code:

https://github.com/couchbase/sync_gateway/blob/feature/distributed_index_autofailover/src/github.com/couchbase/sync_gateway/base/sgw_pindex.go#L306-L326

It has error checking, and I don't see any error being emitted.

> invocation of cbgt.CfgRemoveNodeDef() with NODE_DEFS_WANTED didn't work.

btw, how did you determine that? Let me know if you have any tips on how to debug aside from checking the errors.

@tleyden
Contributor

tleyden commented Sep 2, 2015

WHOOPS, I see a bug.

Should be

    // Assumes kinds is a []string such as
    // {cbgt.NODE_DEFS_KNOWN, cbgt.NODE_DEFS_WANTED}.
    for _, kind := range kinds {
        log.Printf("call cbgt.CfgRemoveNodeDef with nodeuuid: %v cfg: %+v", nodeUuid, h.Cfg)
        if err := cbgt.CfgRemoveNodeDef(
            h.Cfg,
            kind, // <-- this argument was the bug
            nodeUuid,
            h.CbgtVersion,
        ); err != nil {
            log.Printf("Warning: attempted to remove %v (%v) from CBGT but failed: %v", nodeUuid, kind, err)
        }
    }

tleyden pushed a commit that referenced this issue Sep 2, 2015
@steveyen
Member

steveyen commented Sep 2, 2015

On the debugging, from the Cfg snapshots that you had gist'ed, I saw that one node had disappeared from nodesKnown but not from the nodesWanted section of the Cfg.
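
The same comparison can be done mechanically. This is only a sketch: it assumes the node UUIDs from the nodesKnown and nodesWanted sections have already been pulled out of a Cfg snapshot into two string slices, and the UUIDs and the `onlyIn` helper are made up for illustration.

```go
package main

import (
	"fmt"
	"sort"
)

// onlyIn returns, sorted, the UUIDs present in a but missing from b.
func onlyIn(a, b []string) []string {
	inB := make(map[string]bool, len(b))
	for _, uuid := range b {
		inB[uuid] = true
	}
	var out []string
	for _, uuid := range a {
		if !inB[uuid] {
			out = append(out, uuid)
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	// Made-up UUIDs standing in for the two sections of a Cfg snapshot.
	nodesKnown := []string{"uuid-sg1"}
	nodesWanted := []string{"uuid-sg1", "uuid-sg2"}

	fmt.Println("wanted but not known:", onlyIn(nodesWanted, nodesKnown))
	fmt.Println("known but not wanted:", onlyIn(nodesKnown, nodesWanted))
}
```

A non-empty "wanted but not known" list after a failover is the symptom described above: the node was removed from one section of the Cfg but not the other.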

tleyden pushed a commit that referenced this issue Sep 2, 2015
tleyden pushed a commit that referenced this issue Sep 3, 2015
@tleyden
Contributor

tleyden commented Sep 3, 2015

@steveyen - The autofailover is still not working -- here are the steps to reproduce:

  • Start sg1 with clean data dir and empty couchbase buckets
  • Add a user and a doc
  • Start sg2
  • Add a doc
  • Kill sg2
  • Capture CFG json
  • Add more docs

Actual: sg1 only receives DCP updates for vbuckets with id >= 512 and nothing for vbuckets with id < 512, which are the ones sg2 was handling.
Expected: sg1 should receive DCP updates for all vbuckets after sg2 is shut down.

What debug info can I collect?
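
One cheap thing to collect is which vbucket ids are not producing DCP updates at all. The sketch below assumes the ids seen in the sg1 logs have been gathered into a set and that the bucket has 1024 vbuckets (an assumption, consistent with the 512 split above); `missingRanges` is a hypothetical helper, not part of Sync Gateway.

```go
package main

import "fmt"

// missingRanges returns the contiguous ranges of vbucket ids in [0, total)
// that are absent from seen, as inclusive [start, end] pairs.
func missingRanges(seen map[int]bool, total int) [][2]int {
	var ranges [][2]int
	start := -1 // -1 means we are not currently inside a gap
	for id := 0; id < total; id++ {
		switch {
		case !seen[id] && start < 0:
			start = id // a gap begins here
		case seen[id] && start >= 0:
			ranges = append(ranges, [2]int{start, id - 1})
			start = -1
		}
	}
	if start >= 0 {
		ranges = append(ranges, [2]int{start, total - 1})
	}
	return ranges
}

func main() {
	const numVBuckets = 1024 // assumed vbucket count
	seen := map[int]bool{}
	for id := 512; id < numVBuckets; id++ {
		seen[id] = true // sg1 only sees the upper half
	}
	fmt.Println(missingRanges(seen, numVBuckets)) // prints [[0 511]]
}
```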

@steveyen
Member

steveyen commented Sep 3, 2015

First suspicion... looks like a legit cbgt bug (!), as I just saw something similar with a cbft distributed cluster.

@steveyen
Member

steveyen commented Sep 3, 2015

Yeah, I think there's maybe a bug in the cbgt code that recalculates the DCP streams (CalcFeedsDelta). Looking through it.

@steveyen
Member

steveyen commented Sep 3, 2015

Please see CBGT fix, commit a119e02 which corrects the CalcFeedsDelta for the scenario of a shrinking cluster.
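
For readers unfamiliar with the idea, here is a rough illustration of what a delta calculation has to yield when a cluster shrinks. It is not the CBGT code and does not reflect CalcFeedsDelta's actual signature; `streamDelta` and `vbRange` are hypothetical, the inputs are assumed to contain no duplicates, and the 1024-vbucket layout is an assumption.

```go
package main

import (
	"fmt"
	"sort"
)

// streamDelta compares the vbuckets a node currently streams with the ones
// it should stream, and returns which streams to open and which to close.
func streamDelta(current, wanted []int) (toOpen, toClose []int) {
	cur := make(map[int]bool, len(current))
	for _, id := range current {
		cur[id] = true
	}
	want := make(map[int]bool, len(wanted))
	for _, id := range wanted {
		want[id] = true
		if !cur[id] {
			toOpen = append(toOpen, id)
		}
	}
	for _, id := range current {
		if !want[id] {
			toClose = append(toClose, id)
		}
	}
	sort.Ints(toOpen)
	sort.Ints(toClose)
	return toOpen, toClose
}

// vbRange returns the vbucket ids in [lo, hi).
func vbRange(lo, hi int) []int {
	ids := make([]int, 0, hi-lo)
	for id := lo; id < hi; id++ {
		ids = append(ids, id)
	}
	return ids
}

func main() {
	// sg1 held the upper half; after sg2 is removed it should hold everything.
	toOpen, toClose := streamDelta(vbRange(512, 1024), vbRange(0, 1024))
	fmt.Println(len(toOpen), "streams to open,", len(toClose), "to close")
}
```

In the shrinking case the interesting output is `toOpen`: the surviving node has nothing to close, but it must pick up every vbucket the departed node used to handle.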

@tleyden
Contributor

tleyden commented Sep 3, 2015

Cool! Thanks, will try it and let you know.

@tleyden
Contributor

tleyden commented Sep 3, 2015

It works now!

  • Bring sg1 online
  • Add user and docs
  • Bring sg2 online
  • Add docs
  • Take sg2 offline
  • Verify that sg1 now handles DCP messages for all vbuckets

Here are the Sync Gw logs:

tleyden pushed a commit that referenced this issue Sep 3, 2015
tleyden pushed a commit that referenced this issue Sep 3, 2015
zgramana added the dcache label Sep 4, 2015
tleyden pushed a commit that referenced this issue Sep 4, 2015
tleyden closed this as completed Sep 10, 2015