
Adds in-place manual recovery support. #140

Merged
merged 8 commits into f-upgrade-and-recovery on Jul 21, 2016

Conversation

@slackpad (Contributor) commented Jul 19, 2016

This adds several important capabilities to help in upgrading to the new Raft protocol version:

  1. We can migrate an existing peers.json file, which was effectively the source of truth for cluster membership in the old version of the library, before that information moved fully into snapshots + the Raft log as the official source.
  2. If we are using protocol version 0, which doesn't support server IDs, operators can continue to use peers.json as an interface for manually recovering from a loss of quorum.
  3. We left ourselves room for a more full-featured recovery manager by giving the new RecoverCluster interface a complete Configuration object to consume. This lets us manually pick which server is a voter for manual elections (make one server a voter and the rest nonvoters; the single voter will elect itself), as well as set basically any other configuration we want.

This is based on the branch in #139.

@slackpad force-pushed the f-recovery-temp branch 2 times, most recently from a13969d to f75a881 on July 19, 2016 06:57
@slackpad (Contributor, Author)

/cc @ongardie and @sean- to take a look.

@slackpad mentioned this pull request on Jul 19, 2016
@slackpad force-pushed the f-recovery-temp branch 2 times, most recently from 83c1730 to 0a36b7f on July 19, 2016 08:15
@slackpad (Contributor, Author) commented Jul 19, 2016

Thinking through how I did RecoverCluster(): it adds the new log entry in the same term as the last log entry. This seems safe because if you apply it on all the servers and they all had the same term, then any server with a higher index that should win the election will still have the higher index after this operation. If you look back through the commits here, I originally didn't touch the log at all, but there were too many edge cases to manage if the server sent a snapshot with the older configuration, so this seemed like the cleanest solution. Need to make sure this is safe in all cases, though.

```go
	Type: LogConfiguration,
	Data: encodeConfiguration(configuration),
}
if conf.ProtocolVersion < 1 {
```


Why would you ever want to bootstrap a new cluster with version 0?

@slackpad (Contributor, Author) replied:


During the phase where we are transitioning to protocol version 1 but have set it to 0, we may not want to have this new entry in the log in case an older server joins, since we are saying if protocol version == 0 that we are compatible with old servers.

@ongardie-sfdc

> We can migrate an existing peers.json file, which is basically the source of truth for the old version of the library before this support was moved to be fully in snapshots + raft log as the official source.

I think @superfell and I had convinced ourselves that a running cluster would have the same information in its snapshot or log already, so we wouldn't need to ever read peers.json again. Was that just wishful thinking? :)

> Thinking through how I did RecoverCluster() by adding the new log entry in the same term as the last log entry. This seems safe

Assuming all the servers go through this procedure, the set of servers that are eligible to become leader would stay the same. So that's good. But I'm gonna need to think about this.

@slackpad (Contributor, Author)

Since peer changes used to be stored in the log and snapshots, I think we could probably just delete peers.json in almost all cases and things would work fine. Consul had some places where it would manipulate this file at startup, though, and if the file is present it could also mean the operator wants to mess with it, so in Consul I'm planning to ingest it and then blow it away. For most normal use cases we could just delete it and things would work fine. I should have put "truth" in quotes :-)

@slackpad changed the title from "Adds manual recovery capability." to "Adds in-place upgrade and manual recovery support." on Jul 21, 2016
@slackpad changed the title from "Adds in-place upgrade and manual recovery support." to "Adds in-place manual recovery support." on Jul 21, 2016
@slackpad merged commit 24c4add into f-upgrade-and-recovery on Jul 21, 2016
@slackpad deleted the f-recovery-temp branch on July 21, 2016 22:33