-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adds in-place upgrade and manual recovery support. #139
Conversation
d7d93e5
to
19587a7
Compare
logger.Printf("[WARN] raft: No server ID given, using network address: %v. This default will be removed in the future. Set server ID explicitly in config.", | ||
if protocolVersion < 1 || localID == "" { | ||
// During the transition to the new ID system, keep this as an | ||
// INFO level message. Once the new scheme has been out for a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you comment this as COMPAT
and a TODO
so it's easily grepable that there is a pending action to enable a depreciate WARN
message?
sorry @slackpad, was too busy generating code, will review |
@ongardie-sfdc thanks! If you have a chance please take a look at #140 as well - it's based on this branch so relative to this PR. |
config.ProtocolVersion > ProtocolVersionMax { | ||
return fmt.Errorf("Protocol version %d must be >= %d and <= %d", | ||
config.ProtocolVersion, ProtocolVersionMin, ProtocolVersionMax) | ||
} | ||
if config.HeartbeatTimeout < 5*time.Millisecond { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a new validation check: when ProtocolVersion > 0, LocalID should be nonempty.
This sets some good groundwork. I think having the config dictate which version the server speaks may be too simplistic. Here's an example: Bring up server A with version n + 1 while servers B and C still have version n. Server A can send B a LogConfiguration entry, then crash. B can become leader and have to replicate that LogConfiguration. But B isn't allowed to speak n + 1 yet. That's awkward. |
@ongardie-sfdc agree the manual part isn't ideal but we don't have a good mechanism to negotiate. I was thinking I'd ship the next release of Consul (N) w/protocol version set to 0 so the upgrade works with the old servers, but all new servers will be able to understand 1. The release after that (N+1) would switch to protocol version 1, and we'd make N a prerequisite for N+1. I suppose we could try to add a version to some existing messages and then upshift if we see that, but given that we can add incompatible things to the log I don't think that will work if an older server suddenly shows up later. |
If an older server suddenly shows up later with old code, the best thing for it to do is crash. If we're clever, there might be some way to make that happen, though there aren't many panics around. |
I just meant if we are in protocol 0 mode we should work with old servers, even if we bootstrap a new cluster. In the > 0 world, old servers will print errors for all peer changes and will never be able to accept them. Will have to do a little thinking/testing to see if that's enough to keep them out of a cluster. With our current message serialization library we could add the protocol version to all our RPCs and reject anything from 0 (which it will default to on new servers when receiving an old message) if we are > 0 - this might be a clean way to stop shenanigans after the upgrade is complete. |
Automatically add them to the blacklist and ignore them.
|
} | ||
|
||
} else { | ||
r.logger.Printf("[ERR] raft: Ignoring un-versioned command: %#v", rpc.Command) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How does this work? Don't we need to listen to unversioned commands until we've reached some version ourselves?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And isn't that the same as ErrUnsupportedProtocol?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unversioned commands come through as version 0 because of the way MsgPack decoding works - the new receivers get zero-valued version info. I updated the comment to reflect this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we get a message without a version at all it's kind of a bug in the code, so I wanted that to be distinguishable from a normal "talking to an old, unsupported thing" error.
This can later be the place where we put a cluster ID.
// file at that location, returning a recovery manager to apply this | ||
// configuration at startup. If there's no recovery defined then this will return | ||
// nil, which is a valid thing to pass to NewRaft(). | ||
func NewPeersJSONRecovery(base string) (*PeersJSONRecovery, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A "recovery manager" doesn't seem to really be a thing, and I don't think it needs to be. How about simply:
func ReadPeersJSON(path) (Configuration, error)
The caller would have the burden of path := filepath.Join(base, "peers.json")
and os.Remove(path)
, which I think would be entirely tolerable.
And then I'd rename this file to peersjson.go or something like that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch - I originally had this horrible thing that was passed in to NewRaft()
that waited for things to replicate, etc. This is a leftover from that so I'll rename this and remove the Disarm()
shiz.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
phew
// well as Raft-specific log entries) that this server can _understand_. Use | ||
// the ProtocolVersion member of the Config object to control the version of | ||
// the protocol to use when _speaking_ to other servers. This is not currently | ||
// written into snapshots so they are unversioned. Note that depending on the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any reason not to write some version number into snapshots?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since the snapshots are somewhat divorced from Raft log entry types I figured it would be overkill. You can think of them as being at "snapshot format version 0" :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One thing we need to guarantee is that if you don't understand something in your log or snapshot, you crash. It's not cool to skip it like we're doing now.
This could be a server was running new code, wrote out a log entry/snapshot with important new information, then somehow got downgraded.
- For log entries, we can probably agree to do this using new log entry types. But we're missing the check in NewRaft and Recover right now.
- For snapshots, we don't have a way to know.
Otherwise, this could be that we've deleted the code that reads in old log entries/snapshot formats. Someday we need to remove that cruft. How can we guarantee that no log entry or snapshot is so old? I think the solution has to be of the form: write out a new snapshot in a newer format before you upgrade to code that stops knowing how to read older log entry/snapshot formats. How do we arrange that?
BTW, I think all this might lead us to having multiple versions covering different things, at least internally. You can't read in snapshots with newer versions than what you support, but the snapshot format won't change on every upgrade. So I think it needs its own class of version number.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should probably be done as a separate PR, but we could add a snapshot version as well as a function similar to RecoverCluster()
that doesn't take a new configuration but rolls up the logs and takes a snapshot, writing that out in the latest version. That tool should allow for us to deprecate things by saying "you must run this on all your servers prior to running X (it could maybe be in NewRaft()
as something we do automatically for folks).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a function similar to RecoverCluster() that doesn't take a new configuration but rolls up the logs and takes a snapshot, writing that out in the latest version.
That implicitly commits everything, so we can't do that safely.
it could maybe be in NewRaft() as something we do automatically for folks
Maybe after an upgrade once your commit index has reached ???, you force a snapshot. Filling in the ??? is tricky, hmm.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Re-reviewed your Raft paper and I see why this is dangerous, even at startup time. After an outage we have to do the best we can because we don't have the right peers to determine if the last few logs were committed or not, so the best we can do is assume they were. We wouldn't want to do this process any other time. The non-outage snapshot policy is to snapshot committed stuff only, so I see why we can't roll up willy-nilly without implicitly committing.
It's not going to be safe to ingest the old peers.json initially like I'm doing now over in the Consul integration branch, so I'll get rid of that behavior.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed this over in Consul by adding some code to blow away the peers.json file on the first boot - hashicorp/consul@771ba18.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for my late night thrashing here - I edited the last few comments based on my current understanding.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tomorrow I'll take a shot at versioning snapshots and making things die for log entries that aren't understood - that'll make things more robust.
Still not sure what a great deprecation story is for when it's safe to delete code for handling old versions because of the uncertainty of snapshot timing. We could do something simple like run a cleanup goroutine that will fire off a snapshot request 15 minutes after a to-be-deprecated log entry gets committed (or if we see one in the log at startup).
This would only cause up to 4 snapshots per hour (or whatever) even if there were tons of these deprecated entries flying by, and would eliminate the deprecated entries within 15 minutes of finishing an upgrade once there were no servers producing the old entries any longer. The ongoing cost would be very low as long as the cleanup routine doesn't need to scan the logs; it can just select on a one-item buffered channel that gets pinged in a non-blocking fashion whenever we encounter something that needs cleanup.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adding the snapshot versioning uncovered an issue where we have dropped support for old peers in the snapshot (we can read it but not produce it) - we will need to write both in order to upgrade so that old servers can read new snapshots during the migration. I'll unpack this at the same time as snapshot versioning.
Ok, sorry for delaying a few days. I think I'm done with this pass of reviewing. It's a hairy issue, but thanks for taking it on, and you're making great progress. Have a good weekend. |
@ongardie-sfdc thanks for taking a look. I'm finishing up the last bit of changes now and will push up shortly. There's a follow on change in #143 that I'll rebase off of this. Appreciate your detailed scrub of this one - have a good weekend as well! |
…on of the library.
Ok I think I've got all the issues addressed. Unless there are any huge bugs I'd like to merge this and fix things in separate PRs at this point. I was able to form a cluster with a legacy Consul 0.6.4 binary and a build using this and #143! |
// it in the next snapshot version. | ||
const ( | ||
SnapshotVersionMin = 0 | ||
SnapshotVersionMax = 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What controls which SnapshotVersion the server generates?
Right now it's set to SnapshotVersionMax, which would be problematic if it were non-backwards-compatible. For example, a new server could generate a new snapshot, send it to an old server, and that server would panic.
It seems like this should either be a function of the ProtocolVersion, or it should be specified separately in the config.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll add a function for now that takes the protocol version and returns the snapshot version to use so we don't hard code the max constant all over the place.
@slackpad left you two more comments (1. SnapshotVersion, 2. dispositionRPC), then lgtm to merge. The issue of when is it safe to remove support for old log entry types and snapshot versions can be a separate PR. |
This adds several important capabilities to help in upgrading to the new Raft protocol version:
This also gives a path for introducing Raft servers running the new version of the library into a cluster running the old code. Things would work like this:
This isn't super great, but will give us a path to keep things compatible with existing clusters as we roll out the changes. We can make some higher-level tooling in Consul to help orchestrate this.