
Adds in-place manual recovery support. #140

Merged
merged 8 commits into f-upgrade-and-recovery on Jul 21, 2016

Conversation

@slackpad (Contributor) commented Jul 19, 2016

This adds several important capabilities to help in upgrading to the new Raft protocol version:

  1. We can migrate an existing peers.json file, which was effectively the source of truth for cluster membership in the old version of the library, before that information moved fully into snapshots + the Raft log as the official source.
  2. If we are using protocol version 0, which doesn't support server IDs, operators can continue to use peers.json as an interface for manually recovering from a loss of quorum.
  3. We left ourselves room for a more full-featured recovery manager by giving the new RecoverCluster interface a complete Configuration object to consume. This lets us manually pick which server is a voter for manual elections (make one server a voter and the rest nonvoters; the single voter will elect itself), as well as set basically any other configuration we want.

This is based on the branch in #139.

@slackpad force-pushed the f-recovery-temp branch 2 times, most recently from a13969d to f75a881 on July 19, 2016 06:57
@slackpad (Contributor, Author)

/cc @ongardie and @sean- to take a look.

@slackpad mentioned this pull request on Jul 19, 2016
@slackpad force-pushed the f-recovery-temp branch 2 times, most recently from 83c1730 to 0a36b7f on July 19, 2016 08:15
@slackpad (Contributor, Author) commented Jul 19, 2016

Thinking through how I did RecoverCluster(): it adds the new log entry in the same term as the last log entry. This seems safe because if you apply it on all the servers and they all had the same term, then any server with a higher index that should win the election will still have the higher index after this operation. If you look back through the commits here, I originally didn't touch the log at all, but there were too many edge cases to manage if the server sent a snapshot with the older configuration, so this seemed like the cleanest solution. Need to make sure this is safe in all cases, though.

```go
	Type: LogConfiguration,
	Data: encodeConfiguration(configuration),
}
if conf.ProtocolVersion < 1 {
```


Why would you ever want to bootstrap a new cluster with version 0?

@slackpad (Contributor, Author) replied:


During the phase where we are transitioning to protocol version 1 but have set it to 0, we may not want to have this new entry in the log in case an older server joins, since we are saying if protocol version == 0 that we are compatible with old servers.

@ongardie-sfdc

> We can migrate an existing peers.json file, which is basically the source of truth for the old version of the library before this support was moved to be fully in snapshots + raft log as the official source.

I think @superfell and I had convinced ourselves that a running cluster would have the same information in its snapshot or log already, so we wouldn't need to ever read peers.json again. Was that just wishful thinking? :)

> Thinking through how I did RecoverCluster() by adding the new log entry in the same term as the last log entry. This seems safe

Assuming all the servers go through this procedure, the set of servers that are eligible to become leader would stay the same. So that's good. But I'm gonna need to think about this.

@slackpad (Contributor, Author)

Since peer changes used to be stored in the log and snapshots, I think we could probably just delete peers.json in almost all cases and things would work fine. Consul had some places where it would manipulate this file at startup, though, and if the file is present it could also mean the operator wants to mess with it, so in Consul I'm planning to ingest it and then blow it away. For most normal use cases we could just delete it and things would work fine. I should have put "truth" in quotes :-)

@slackpad changed the title from "Adds manual recovery capability." to "Adds in-place upgrade and manual recovery support." on Jul 21, 2016
@slackpad changed the title from "Adds in-place upgrade and manual recovery support." to "Adds in-place manual recovery support." on Jul 21, 2016
@slackpad merged commit 24c4add into f-upgrade-and-recovery on Jul 21, 2016
@slackpad deleted the f-recovery-temp branch on July 21, 2016 22:33