
Adds in-place upgrade and manual recovery support. #139

Merged
slackpad merged 52 commits into issue-84-integration from f-upgrade-and-recovery on Aug 1, 2016

Conversation

@slackpad (Contributor) commented Jul 18, 2016

This adds several important capabilities to help in upgrading to the new Raft protocol version:

  1. We can migrate an existing peers.json file, which was the source of truth for peer membership in the old version of the library, before that state moved fully into the snapshots + Raft log as the official source.
  2. If we are running protocol version 0, which doesn't support server IDs, operators can continue to use peers.json as an interface to manually recover from a loss of quorum.
  3. We left room for a more full-featured recovery manager by giving the new RecoverCluster interface a complete Configuration object to consume. This lets us manually pick which server is a voter for manual elections (set one server to a voter and the rest to nonvoters; the single voter will elect itself), as well as set basically any other configuration we want (see the sketch after this list).
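
For illustration, here's a minimal sketch of the kind of Configuration an operator could hand to RecoverCluster for a manual election. The Server/Configuration field names and the Voter/Nonvoter constants follow the new server-ID scheme described here, but treat the exact names and the RecoverCluster signature as assumptions until this lands.

package recoverysketch

import "github.com/hashicorp/raft"

// buildRecoveryConfiguration sketches a manual-recovery configuration:
// one server is made a voter (it will elect itself) and the rest are
// nonvoters. IDs and addresses are placeholders.
func buildRecoveryConfiguration() raft.Configuration {
	return raft.Configuration{
		Servers: []raft.Server{
			{Suffrage: raft.Voter, ID: "server-a", Address: "10.0.0.1:8300"},
			{Suffrage: raft.Nonvoter, ID: "server-b", Address: "10.0.0.2:8300"},
			{Suffrage: raft.Nonvoter, ID: "server-c", Address: "10.0.0.3:8300"},
		},
	}
}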

This also gives a path for introducing Raft servers running the new version of the library into a cluster running the old code. Things would work like this:

// These are the versions of the protocol (which includes RPC messages as
// well as Raft-specific log entries) that this server can _understand_. Use
// the ProtocolVersion member of the Config object to control the version of
// the protocol to use when _speaking_ to other servers. This is not currently
// written into snapshots so they are unversioned. Note that depending on the
// protocol version being spoken, some otherwise understood RPC messages may be
// refused. See isVersionCompatible for details of this logic.
//
// There are notes about the upgrade path in the description of the versions
// below. If you are starting a fresh cluster then there's no reason not to
// jump right to the latest protocol version. If you need to interoperate with
// older, version 0 Raft servers you'll need to drive the cluster through the
// different versions in order.
//
// The version details are complicated, but here's a summary of what's required
// to get from a version 0 cluster to version 3:
//
// 1. In version N of your app that starts using the new Raft library with
//    versioning, set ProtocolVersion to 1.
// 2. Make version N+1 of your app require version N as a prerequisite (all
//    servers must be upgraded). For version N+1 of your app set ProtocolVersion
//    to 2.
// 3. Similarly, make version N+2 of your app require version N+1 as a
//    prerequisite. For version N+2 of your app, set ProtocolVersion to 3.
//
// During this upgrade, older cluster members will still have Server IDs equal
// to their network addresses. To upgrade an older member and give it an ID, it
// needs to leave the cluster and re-enter:
//
// 1. Remove the server from the cluster with RemoveServer, using its network
//    address as its ServerID.
// 2. Update the server's config to a better ID (restarting the server).
// 3. Add the server back to the cluster with AddVoter, using its new ID.
//
// You can do this during the rolling upgrade from N+1 to N+2 of your app, or
// as a rolling change at any time after the upgrade.
//
// Version History
//
// 0: Original Raft library before versioning was added. Servers running this
//    version of the Raft library use AddPeerDeprecated/RemovePeerDeprecated
//    for all configuration changes, and have no support for LogConfiguration.
// 1: First versioned protocol, used to interoperate with old servers, and begin
//    the migration path to newer versions of the protocol. Under this version
//    all configuration changes are propagated using the now-deprecated
//    RemovePeerDeprecated Raft log entry. This means that server IDs are always
//    set to be the same as the server addresses (since the old log entry type
//    cannot transmit an ID), and only AddPeer/RemovePeer APIs are supported.
//    Servers running this version of the protocol can understand the new
//    LogConfiguration Raft log entry but will never generate one so they can
//    remain compatible with version 0 Raft servers in the cluster.
// 2: Transitional protocol used when migrating an existing cluster to the new
//    server ID system. Server IDs are still set to be the same as server
//    addresses, but all configuration changes are propagated using the new
//    LogConfiguration Raft log entry type, which can carry full ID information.
//    This version supports the old AddPeer/RemovePeer APIs as well as the new
//    ID-based AddVoter/RemoveServer APIs which should be used when adding
//    version 3 servers to the cluster later. This version sheds all
//    interoperability with version 0 servers, but can interoperate with newer
//    Raft servers running with protocol version 1 since they can understand the
//    new LogConfiguration Raft log entry, and this version can still understand
//    their RemovePeerDeprecated Raft log entries. We need this protocol version
//    as an intermediate step between 1 and 3 so that servers will propagate the
//    ID information that will come from newly-added (or -rolled) servers using
//    protocol version 3, but since they are still using their address-based IDs
//    from the previous step they will still be able to track commitments and
//    their own voting status properly. If we skipped this step, servers would
//    be started with their new IDs, but they wouldn't see themselves in the old
//    address-based configuration, so none of the servers would think they had a
//    vote.
// 3: Protocol adding full support for server IDs and new ID-based server APIs
//    (AddVoter, AddNonvoter, etc.), old AddPeer/RemovePeer APIs are no longer
//    supported. Version 2 servers should be swapped out by removing them from
//    the cluster one-by-one and re-adding them with updated configuration for
//    this protocol version, along with their server ID. The remove/add cycle
//    is required to populate their server ID. Note that removing must be done
//    by ID, which will be the old server's address.

// These are versions of snapshots that this server can _understand_. Currently,
// it is always assumed that this server generates the latest version, though
// this may be changed in the future to include a configurable version.
// Version History
//
// 0: Original Raft library before versioning was added. The peers portion of
//    these snapshots is encoded in the legacy format which requires decodePeers
//    to parse. This version of snapshots should only be produced by the
//    unversioned Raft library.
// 1: New format which adds support for a full configuration structure and its
//    associated log index, with support for server IDs and non-voting server
//    modes. To ease upgrades, this also includes the legacy peers structure but
//    that will never be used by servers that understand version 1 snapshots.
//    Since the original Raft library didn't enforce any versioning, we must
//    include the legacy peers structure for this version, but we can deprecate
//    it in the next snapshot version.
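
As an illustration of the remove/re-add cycle described in the comment above (not part of the quoted code), a rolling re-ID of one legacy member might look roughly like this. The AddVoter/RemoveServer signatures are assumptions based on the ID-based APIs this PR introduces.

package upgradesketch

import "github.com/hashicorp/raft"

// reIDServer removes a legacy member (whose ID equals its address) and
// adds it back under its new server ID. Step 2, restarting the server
// with the new ID configured, happens out of band between the calls.
func reIDServer(r *raft.Raft, oldAddr raft.ServerAddress, newID raft.ServerID) error {
	if err := r.RemoveServer(raft.ServerID(oldAddr), 0, 0).Error(); err != nil {
		return err
	}
	return r.AddVoter(newID, oldAddr, 0, 0).Error()
}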

This isn't super great, but will give us a path to keep things compatible with existing clusters as we roll out the changes. We can make some higher-level tooling in Consul to help orchestrate this.

@slackpad force-pushed the f-upgrade-and-recovery branch from d7d93e5 to 19587a7 on July 18, 2016 23:18
@slackpad (Contributor Author) commented Jul 18, 2016

@ongardie this should address TODO item 2 from #84. If you have a few minutes PTAL - thanks!

@slackpad slackpad mentioned this pull request Jul 18, 2016
if protocolVersion < 1 || localID == "" {
	// During the transition to the new ID system, keep this as an
	// INFO level message. Once the new scheme has been out for a
	logger.Printf("[WARN] raft: No server ID given, using network address: %v. This default will be removed in the future. Set server ID explicitly in config.",
Review comment (Contributor):
Can you comment this as COMPAT and a TODO so it's easily greppable that there is a pending action to enable a deprecation WARN message?

@ongardie-sfdc commented:

sorry @slackpad, was too busy generating code, will review

@slackpad (Contributor Author) commented:

@ongardie-sfdc thanks! If you have a chance please take a look at #140 as well - it's based on this branch so relative to this PR.

if config.ProtocolVersion < ProtocolVersionMin ||
	config.ProtocolVersion > ProtocolVersionMax {
	return fmt.Errorf("Protocol version %d must be >= %d and <= %d",
		config.ProtocolVersion, ProtocolVersionMin, ProtocolVersionMax)
}
if config.HeartbeatTimeout < 5*time.Millisecond {

Review comment:

Add a new validation check: when ProtocolVersion > 0, LocalID should be nonempty.
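
A sketch of what that check could look like, assuming LocalID ends up as a field on Config (names illustrative, in the style of the validation code quoted above):

// Hypothetical additional clause for the validation function:
if config.ProtocolVersion > 0 && config.LocalID == "" {
	return fmt.Errorf("LocalID cannot be empty when ProtocolVersion is %d",
		config.ProtocolVersion)
}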

@ongardie-sfdc commented:

This sets some good groundwork.

I think having the config dictate which version the server speaks may be too simplistic. Here's an example: Bring up server A with version n + 1 while servers B and C still have version n. Server A can send B a LogConfiguration entry, then crash. B can become leader and have to replicate that LogConfiguration. But B isn't allowed to speak n + 1 yet. That's awkward.

@slackpad (Contributor Author) commented:

@ongardie-sfdc agree the manual part isn't ideal but we don't have a good mechanism to negotiate. I was thinking I'd ship the next release of Consul (N) w/protocol version set to 0 so the upgrade works with the old servers, but all new servers will be able to understand 1. The release after that (N+1) would switch to protocol version 1, and we'd make N a prerequisite for N+1.

I suppose we could try to add a version to some existing messages and then upshift if we see that, but given that we can add incompatible things to the log I don't think that will work if an older server suddenly shows up later.

@ongardie-sfdc commented:

... I don't think that will work if an older server suddenly shows up later.

If an older server suddenly shows up later with old code, the best thing for it to do is crash. If we're clever, there might be some way to make that happen, though there aren't many panics around.

@slackpad (Contributor Author) commented Jul 20, 2016

I just meant that if we are in protocol 0 mode we should work with old servers, even if we bootstrap a new cluster. In the > 0 world, old servers will print errors for all peer changes and will never be able to accept them; I'll have to do a little thinking/testing to see if that's enough to keep them out of a cluster. With our current message serialization library we could add the protocol version to all our RPCs and reject anything claiming version 0 (which is what new servers will see by default when receiving an old message) if we are running > 0 - this might be a clean way to stop shenanigans after the upgrade is complete.
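
A rough sketch of that idea; the RPCHeader/checkRPCHeader names are illustrative, not an existing API:

// RPCHeader would be embedded in every RPC request so the receiver can
// see which protocol version the sender speaks. Old senders produce a
// zero value here.
type RPCHeader struct {
	ProtocolVersion ProtocolVersion
}

// checkRPCHeader rejects unversioned (version 0) traffic once this
// server has moved past protocol version 0.
func (r *Raft) checkRPCHeader(header RPCHeader) error {
	if r.conf.ProtocolVersion > 0 && header.ProtocolVersion < 1 {
		return ErrUnsupportedProtocol
	}
	return nil
}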

@sean- (Contributor) commented Jul 21, 2016 via email

	}

} else {
	r.logger.Printf("[ERR] raft: Ignoring un-versioned command: %#v", rpc.Command)

Review comment:

How does this work? Don't we need to listen to unversioned commands until we've reached some version ourselves?

Follow-up review comment:

And isn't that the same as ErrUnsupportedProtocol?

@slackpad (Contributor Author) replied:
Unversioned commands come through as version 0 because of the way MsgPack decoding works - the new receivers get zero-valued version info. I updated the comment to reflect this.
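
A small illustration of that zero-value behavior, using the msgpack codec package the library already uses for its transport (exact import path assumed):

package main

import (
	"bytes"
	"fmt"

	"github.com/hashicorp/go-msgpack/codec"
)

// oldRPC mimics a message from a pre-versioning server: no version field.
type oldRPC struct {
	Term uint64
}

// newRPC is what current servers decode into; Version is absent on the
// wire for old senders, so it comes out as the zero value 0.
type newRPC struct {
	Version int
	Term    uint64
}

func main() {
	var buf bytes.Buffer
	h := &codec.MsgpackHandle{}
	if err := codec.NewEncoder(&buf, h).Encode(oldRPC{Term: 7}); err != nil {
		panic(err)
	}
	var out newRPC
	if err := codec.NewDecoder(&buf, h).Decode(&out); err != nil {
		panic(err)
	}
	fmt.Println(out.Version, out.Term) // prints: 0 7
}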

@slackpad (Contributor Author) replied:
If we get a message without a version at all it's kind of a bug in the code, so I wanted that to be distinguishable from a normal "talking to an old, unsupported thing" error.

This can later be the place where we put a cluster ID.
// file at that location, returning a recovery manager to apply this
// configuration at startup. If there's no recovery defined then this will return
// nil, which is a valid thing to pass to NewRaft().
func NewPeersJSONRecovery(base string) (*PeersJSONRecovery, error) {
@ongardie-sfdc commented Jul 30, 2016:
A "recovery manager" doesn't seem to really be a thing, and I don't think it needs to be. How about simply:

func ReadPeersJSON(path string) (Configuration, error)

The caller would have the burden of path := filepath.Join(base, "peers.json") and os.Remove(path), which I think would be entirely tolerable.

And then I'd rename this file to peersjson.go or something like that.
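
The caller side of that suggestion would then look roughly like this (a sketch only, written against the proposed ReadPeersJSON signature; the apply callback stands in for whatever ends up consuming the Configuration at startup):

package peersjsonsketch

import (
	"os"
	"path/filepath"

	"github.com/hashicorp/raft"
)

// recoverFromPeersJSON reads peers.json, hands the recovered Configuration
// to the caller-supplied apply function, and removes the file so recovery
// only happens once.
func recoverFromPeersJSON(base string, apply func(raft.Configuration) error) error {
	path := filepath.Join(base, "peers.json")
	configuration, err := raft.ReadPeersJSON(path)
	if err != nil {
		return err
	}
	if err := apply(configuration); err != nil {
		return err
	}
	return os.Remove(path)
}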

@slackpad (Contributor Author) replied:
Good catch - I originally had this horrible thing that was passed in to NewRaft() that waited for things to replicate, etc. This is a leftover from that so I'll rename this and remove the Disarm() shiz.

Reply:

phew

// well as Raft-specific log entries) that this server can _understand_. Use
// the ProtocolVersion member of the Config object to control the version of
// the protocol to use when _speaking_ to other servers. This is not currently
// written into snapshots so they are unversioned. Note that depending on the

Review comment:

Any reason not to write some version number into snapshots?

@slackpad (Contributor Author) replied:
Since the snapshots are somewhat divorced from Raft log entry types I figured it would be overkill. You can think of them as being at "snapshot format version 0" :-)

Reply:

One thing we need to guarantee is that if you don't understand something in your log or snapshot, you crash. It's not cool to skip it like we're doing now.

This could be a server was running new code, wrote out a log entry/snapshot with important new information, then somehow got downgraded.

  • For log entries, we can probably agree to do this using new log entry types. But we're missing the check in NewRaft and Recover right now.
  • For snapshots, we don't have a way to know.

Otherwise, this could be that we've deleted the code that reads in old log entries/snapshot formats. Someday we need to remove that cruft. How can we guarantee that no log entry or snapshot is so old? I think the solution has to be of the form: write out a new snapshot in a newer format before you upgrade to code that stops knowing how to read older log entry/snapshot formats. How do we arrange that?

BTW, I think all this might lead us to having multiple versions covering different things, at least internally. You can't read in snapshots with newer versions than what you support, but the snapshot format won't change on every upgrade. So I think it needs its own class of version number.
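
For the log-entry half of this, a hedged sketch of the kind of check that could run in NewRaft and Recover (the log type names come from the library; the helper itself is hypothetical):

// checkLogTypes refuses to proceed if any stored entry has a type this
// build doesn't understand, instead of silently skipping it.
func checkLogTypes(logs LogStore, first, last uint64) error {
	for idx := first; idx <= last; idx++ {
		var entry Log
		if err := logs.GetLog(idx, &entry); err != nil {
			return err
		}
		switch entry.Type {
		case LogCommand, LogNoop, LogBarrier, LogConfiguration,
			LogAddPeerDeprecated, LogRemovePeerDeprecated:
			// Understood; keep going.
		default:
			return fmt.Errorf("unknown log type %d at index %d, refusing to start",
				entry.Type, idx)
		}
	}
	return nil
}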

@slackpad (Contributor Author) replied:
This should probably be done as a separate PR, but we could add a snapshot version as well as a function similar to RecoverCluster() that doesn't take a new configuration but rolls up the logs and takes a snapshot, writing that out in the latest version. That tool would let us deprecate things by saying "you must run this on all your servers prior to running X" (it could maybe be in NewRaft() as something we do automatically for folks).

Reply (Contributor):
a function similar to RecoverCluster() that doesn't take a new configuration but rolls up the logs and takes a snapshot, writing that out in the latest version.

That implicitly commits everything, so we can't do that safely.

it could maybe be in NewRaft() as something we do automatically for folks

Maybe after an upgrade once your commit index has reached ???, you force a snapshot. Filling in the ??? is tricky, hmm.

@slackpad (Contributor Author) commented Jul 30, 2016:
Re-reviewed your Raft paper and I see why this is dangerous, even at startup time. After an outage we have to do the best we can because we don't have the right peers to determine if the last few logs were committed or not, so the best we can do is assume they were. We wouldn't want to do this process any other time. The non-outage snapshot policy is to snapshot committed stuff only, so I see why we can't roll up willy-nilly without implicitly committing.

It's not going to be safe to ingest the old peers.json initially like I'm doing now over in the Consul integration branch, so I'll get rid of that behavior.

@slackpad (Contributor Author) commented Jul 30, 2016:
Fixed this over in Consul by adding some code to blow away the peers.json file on the first boot - hashicorp/consul@771ba18.

@slackpad (Contributor Author) replied:
Sorry for my late night thrashing here - I edited the last few comments based on my current understanding.

@slackpad (Contributor Author) commented Jul 30, 2016:
Tomorrow I'll take a shot at versioning snapshots and making things die for log entries that aren't understood - that'll make things more robust.

Still not sure what a good deprecation story looks like for deciding when it's safe to delete the code that handles old versions, given the uncertainty of snapshot timing. We could do something simple like run a cleanup goroutine that fires off a snapshot request 15 minutes after a to-be-deprecated log entry gets committed (or if we see one in the log at startup).

This would only cause up to 4 snapshots per hour (or whatever) even if there were tons of these deprecated entries flying by, and would eliminate the deprecated entries within 15 minutes of finishing an upgrade once there were no servers producing the old entries any longer. The ongoing cost would be very low as long as the cleanup routine doesn't need to scan the logs; it can just select on a one-item buffered channel that gets pinged in a non-blocking fashion whenever we encounter something that needs cleanup.
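
The non-blocking ping is a standard Go pattern; a minimal sketch with illustrative names:

package cleanupsketch

import "time"

// cleanupCh is the one-item buffered channel; noteDeprecatedEntry pings it
// without blocking, and the cleanup goroutine coalesces pings into at most
// one delayed snapshot request at a time.
var cleanupCh = make(chan struct{}, 1)

func noteDeprecatedEntry() {
	select {
	case cleanupCh <- struct{}{}:
	default: // a cleanup is already pending; drop the ping
	}
}

func cleanupLoop(requestSnapshot func()) {
	for range cleanupCh {
		time.Sleep(15 * time.Minute) // the grace period suggested above
		requestSnapshot()
	}
}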

@slackpad (Contributor Author) replied:
Adding the snapshot versioning uncovered an issue where we have dropped support for old peers in the snapshot (we can read it but not produce it) - we will need to write both in order to upgrade so that old servers can read new snapshots during the migration. I'll unpack this at the same time as snapshot versioning.

@ongardie-sfdc commented:

Ok, sorry for delaying a few days. I think I'm done with this pass of reviewing. It's a hairy issue, but thanks for taking it on, and you're making great progress. Have a good weekend.

@slackpad (Contributor Author) commented:

@ongardie-sfdc thanks for taking a look. I'm finishing up the last bit of changes now and will push up shortly. There's a follow on change in #143 that I'll rebase off of this.

Appreciate your detailed scrub of this one - have a good weekend as well!

@slackpad (Contributor Author) commented Jul 31, 2016

Ok I think I've got all the issues addressed. Unless there are any huge bugs I'd like to merge this and fix things in separate PRs at this point. I was able to form a cluster with a legacy Consul 0.6.4 binary and a build using this and #143!

// it in the next snapshot version.
const (
SnapshotVersionMin = 0
SnapshotVersionMax = 1

Review comment:

What controls which SnapshotVersion the server generates?

Right now it's set to SnapshotVersionMax, which would be problematic if it were non-backwards-compatible. For example, a new server could generate a new snapshot, send it to an old server, and that server would panic.

It seems like this should either be a function of the ProtocolVersion, or it should be specified separately in the config.

@slackpad (Contributor Author) replied:
I'll add a function for now that takes the protocol version and returns the snapshot version to use so we don't hard code the max constant all over the place.
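
Something along these lines; the actual protocol-to-snapshot mapping is an assumption to be settled in the follow-up:

// getSnapshotVersion is a sketch of a single place that decides which
// snapshot version to write for a given protocol version, instead of
// hard coding SnapshotVersionMax everywhere.
func getSnapshotVersion(protocolVersion ProtocolVersion) SnapshotVersion {
	if protocolVersion < 1 {
		// Stay on the legacy snapshot format for unversioned clusters.
		return 0
	}
	return 1
}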

@ongardie-sfdc commented:

@slackpad left you two more comments (1. SnapshotVersion, 2. dispositionRPC), then lgtm to merge. The issue of when is it safe to remove support for old log entry types and snapshot versions can be a separate PR.

@slackpad slackpad merged commit 6e3917d into issue-84-integration Aug 1, 2016
@slackpad slackpad deleted the f-upgrade-and-recovery branch October 7, 2016 18:15