From 6ae0ff7b07fc06b7e4c74985fcb095f2fdcada68 Mon Sep 17 00:00:00 2001
From: Tobias Schottdorf
Date: Wed, 20 Apr 2016 07:57:56 -0400
Subject: [PATCH] [RFC] Leader-evaluated Raft

This is a hastily written initial draft, released early to get feedback for
more refined iterations.

[ci skip]
---
 docs/RFCS/leader_evaluated_raft.md | 281 +++++++++++++++++++++++++++++
 1 file changed, 281 insertions(+)
 create mode 100644 docs/RFCS/leader_evaluated_raft.md

diff --git a/docs/RFCS/leader_evaluated_raft.md b/docs/RFCS/leader_evaluated_raft.md
new file mode 100644
index 000000000000..825c17df29fd
--- /dev/null
+++ b/docs/RFCS/leader_evaluated_raft.md
@@ -0,0 +1,281 @@

- Feature Name: leader_evaluated_raft
- Status: draft
- Start Date: 2016-04-19
- Authors: Tobias Schottdorf
- RFC PR: (PR # after acceptance of initial draft)
- Cockroach Issue: (one or more # from the issue tracker)


# Summary

Pursue an idea by @spencerkimball brought up in the context of
[#5985](https://github.com/cockroachdb/cockroach/pull/5985): move the bulk of
the logic involved in applying commands to a Replica upstream of Raft.

# Motivation

The [first exploration of
migrations](https://github.com/cockroachdb/cockroach/pull/5985) highlighted
the complexity concomitant with migrations on top of the current Raft
execution model.

As of writing, the leading replica is the only replica that proposes commands
to Raft. However, each replica applies commands individually, which leads to
large swaths of code living "downstream" of Raft. Since all of this code needs
to produce identical output even during migrations, a satisfactory migration
story is almost unachievable, a point made by several reviewers (and the
author).

By having the proposing replica compute the effects on the key-value state
before proposing the command to Raft, much of this downstream code moves
upstream of Raft; naively, what's left are simple writes to key-value pairs
(i.e. all commands move upstream except for a simplified MVCCPut operation).
While some complexity remains, this promises a much simpler migration story,
as well as de-triplicating work previously done by each individual replica.

# Detailed design

## Execution path

### Proposing commands

The main change in the execution path will be in `proposeRaftCommand`, which
currently takes a `BatchRequest` and creates a `pendingCmd`, which has the
form

```go
type pendingCmd struct {
	ctx     context.Context
	idKey   storagebase.CmdIDKey
	done    chan roachpb.ResponseWithError // Used to signal waiting RPC handler
	raftCmd struct {
		RangeID       RangeID
		OriginReplica ReplicaDescriptor
		Cmd           BatchRequest
	}
}
```

The main change within this struct is the type of `Cmd`: it changes from
`BatchRequest` to a new type which carries the effects of the application of
the batch, as computed (by the leader) earlier in `proposeRaftCommand`.

For simplicity, in this draft we will assume that the `raftCmd` struct will be
modified according to these requirements.
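For concreteness, the modified struct might look roughly as follows. This is
only a sketch: the field keeps its current name, and the `RaftCmd` type is the
candidate message introduced below; none of the naming is settled.

```go
// Sketch only: identical to the struct above, except that the BatchRequest
// is replaced by the precomputed effects of the batch.
type pendingCmd struct {
	ctx     context.Context
	idKey   storagebase.CmdIDKey
	done    chan roachpb.ResponseWithError // Used to signal waiting RPC handler
	raftCmd struct {
		RangeID       RangeID
		OriginReplica ReplicaDescriptor
		Cmd           RaftCmd // was BatchRequest; carries the computed effects
	}
}
```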
The "effects" contain

* any key-value writes: for example the outcome of a `Put`, `DeleteRange`,
  `CPut`, `BeginTransaction`, or `LeaderLease`.
  Some alternatives exist: writes could be encoded as `[]MVCCKeyValue`, or even
  as post-MVCC raw (i.e. opaque) key-value pairs. We'll assume the former for
  now.
* parameters for any "triggers" which must run with the *application* of the
  command. In particular, this is necessary on `EndTransaction` with a commit
  trigger, when a transaction is aborted (in which case the abort cache must
  be populated), and when leadership changes (to apply the new lease).
  There are likely more, but this already shows that these triggers (which are
  downstream of Raft) will require various pieces of data to be passed around
  with them.

Thus, we arrive at the following (proto-backed) candidate struct:

```proto
# Pseudo-Go-Protobuf, all naming ad-hoc and TBD.

message RaftCmd {
  optional SomeMetadata # tbd (proposing replica, MVCCStats diff, ...)
  repeated oneof effect {
    # Applied in field and slice order.
    repeated Write writes = ...;  # see MVCC section
    optional Split ...;
    optional Merge ...;
    optional ChangeReplicas ...;
    optional ModifiedSpan ...;
    optional LeaderLease ...;
    optional AbortCache ...;
    optional TSCache ...;     # if effect not computable from `Writes`
    # Those below may not be required, but something to think about
    optional GCRange ...;     # allow optimized GC (some logic downstream of Raft?)
    optional ClearRange ...;  # allow for dropping all kv-pairs without a huge cmd
  }
}
```

Since this simple structure is easy to reason about, it seems adequate to
strive for the minimum amount of interaction between the various effects
(though the data required for the triggers may benefit from more interaction
between them in some cases).

### Computing the write set

Before a leader proposes a command, it has the job of translating a
`BatchRequest` into a `RaftCmd`. Note that we assume that writes wait for
both reads and writes in the command queue (so that at the point at which
the `RaftCmd` is constructed, all prior mutations have been carried out and
the timestamp cache updated).

This largely takes over the work which is currently done in

```
applyRaftCommandInBatch -> executeWriteBatch -> executeBatch
  -> (ReplicaCommands) -> (MVCC)
```

This is hopefully largely mechanical, with the bulk of the work in the MVCC
layer, outlined in the next section.

The work to be carried out in `ReplicaCommands` is mostly isolating the
various side effects (most prominently the commit and leader lease triggers).
In the process, it should be possible to remove these methods from
`(*Replica)` (though that is not a stated goal of this RFC).
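To illustrate how these pieces might fit together, here is a rough sketch of
the proposer-side evaluation step. It is not a definitive implementation: the
function name `evaluateProposal`, the modified `executeBatch` signature that
returns trigger parameters instead of applying them, and the `RaftCmd` fields
are all assumptions made for the example; the write-tracking batch is the
`Engine` extension described in the next section.

```go
// Sketch only (all names and signatures hypothetical): evaluate a
// BatchRequest against a write-tracking batch and package the result as a
// RaftCmd instead of committing the batch locally.
func (r *Replica) evaluateProposal(
	ctx context.Context, ba roachpb.BatchRequest,
) (*RaftCmd, *roachpb.BatchResponse, *roachpb.Error) {
	// Assumed: NewBatch hands back a write-tracking batch (see the next
	// section). It is never committed here; it only records the write set.
	batch := r.store.Engine().NewBatch()
	defer batch.Close()

	// Assumed: executeBatch is changed to collect trigger parameters (commit
	// trigger, lease change, abort cache entries, ...) rather than applying
	// them as side effects downstream of Raft.
	br, triggers, pErr := r.executeBatch(ctx, batch, ba)
	if pErr != nil {
		return nil, nil, pErr
	}

	writes, err := batch.GetWrites()
	if err != nil {
		return nil, nil, roachpb.NewError(err)
	}

	// The response is returned to the waiting client as today; the effects
	// are what gets proposed to Raft.
	return &RaftCmd{Writes: writes, Triggers: triggers}, br, nil
}
```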
### MVCC & Engine Layer

Operations in the MVCC layer (`MVCC{Get,Put,Scan,DeleteRange,...}`) currently
return some variation of `([]roachpb.KeyValue, []roachpb.Intent, error)` and
operate on the underlying `engine.Engine` (which in practice is a RocksDB
batch) with heavy use of iterators.

With the new code, we want mostly the same, but with two important changes:

* we do not actually intend to commit the batch (though we could do this on
  the proposing node to avoid overhead)
* we need to construct the repeated `Write` field, which (in some way or
  another) tracks the key-value pairs written (i.e. created, changed or
  deleted) over the lifetime of the batch.

It's also desirable to avoid invasive changes to the MVCC layer. The following
extension to the `Engine` interface seems adequate (modulo changes to improve
buffer reuse):

```go
type Engine interface {
	// ...
	NewBatch() Batch
}

var ErrNotTracking = errors.New("batch is not tracking writes")

type Writes []MVCCKeyValue // to be discussed; could also be raw KV ([][2]byte)

type Batch interface {
	Engine
	// GetWrites returns a slice of Write updates which were carried out during
	// the batch, or ErrNotTracking if the batch does not support this feature.
	// The ordering of the writes is not guaranteed to be stable and may not
	// correspond to the actual order in which the writes were carried out.
	// However, the result of applying the updates in the returned order *is*
	// stable, i.e. non-commutative (overlapping) writes can't rearrange
	// themselves freely. [i.e. can use a map under the hood]
	// The returned Writes are read-only and must not be used after their Batch
	// is finalized. [i.e. can reclaim the memory at that point, or we add an
	// explicit method to finalize the writes if that is more convenient]
	GetWrites() (Writes, error)
}
```

Naively, this can be achieved by wrapping a "regular" batch with an
appropriate in-memory map which is populated on writes; a sketch follows
below. A better implementation may be hard to get, even knowing the underlying
`Engine` implementation (i.e. currently almost exclusively RocksDB). Some
refactoring should assure that buffers used for writing MVCC values are
assumed "owned" by the engine (so that populating the internal map can freely
hold on to the memory). The memory can be fed back into the pool at a higher
level (i.e. after having applied/discarded the batch).
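As an illustration of the naive approach, here is a minimal sketch of such a
write-tracking batch. It assumes the engine exposes a `Put`-style write method
with roughly this signature (an assumption, not the current interface) and
uses the `Batch`/`Writes` types proposed above; a real implementation would
also have to cover deletions and the other mutation methods, as well as the
buffer-ownership concerns just mentioned.

```go
// Sketch only: wraps a regular engine batch and records the latest write for
// each key in an in-memory map.
type trackingBatch struct {
	Batch                          // the wrapped "regular" batch
	writes map[string]MVCCKeyValue // last write per key wins
}

// Put forwards the write to the wrapped batch and remembers it.
func (b *trackingBatch) Put(key MVCCKey, value []byte) error {
	if err := b.Batch.Put(key, value); err != nil {
		return err
	}
	b.writes[string(key.Key)] = MVCCKeyValue{Key: key, Value: value}
	return nil
}

// GetWrites returns the accumulated writes. Map iteration order is not
// stable, but overlapping writes have already collapsed to their final
// value, so applying the returned slice yields a deterministic result.
func (b *trackingBatch) GetWrites() (Writes, error) {
	writes := make(Writes, 0, len(b.writes))
	for _, w := range b.writes {
		writes = append(writes, w)
	}
	return writes, nil
}
```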
With this strategy, the MVCC changes are fairly locally scoped and not too
invasive.

## Raft command application

This will be a new, straightforward facility which applies a `RaftCmd` against
the underlying `Engine` in a batch. The point of this whole change is that
this part of the stack becomes very simple and free of side effects.

`(*Replica).executeCmd` is likely a good place for most of this code to live.

## Replay protection

Replay protection is naively an issue because applying a logicless batch of
writes is at odds with our current method, which largely relies on
write/timestamp conflicts.
However, as @bdarnell pointed out, as of recently (#5973) we (mostly) colocate
the Raft leader and the leader lease holder on the same node and have more
options for dealing with this (i.e. making the application of a command
conditional on the Raft log position at which it was originally proposed).
This needs to be fleshed out further, but does appear to be manageable.

### Test changes

These should be limited to those tests that exercise triggers and/or use
lower-level Raft internals inside of the `storage` package (plus some
surprises which are certain to occur with a change of this magnitude).

## Implementation strategy

Some lower-hanging fruit exists:

* the MVCC/Engine changes don't affect upgrades or existing functionality, so
  they can be carried out freely.
* the change in replay protection mentioned above can also be developed and
  tested separately. This seems wise since it's an absolute necessity.
* separating the side effects of Raft command application should also be
  doable; most of these should come up naturally when trying to remove the
  `(*Replica)` receiver on most of the `ReplicaCommands` (`LeaderLease`,
  `Put`, ...) and manually going through the rest of the call stack up to
  `processRaftCommand`.

The goal here should be to identify all the necessary refactors to keep them
out of the remaining "forklift" portion of the change.

## Migration strategy

My suggestion (which aims for minimal developer burden while allowing for some
palatable version of an "upgrade path") is the following:

The official upgrade (and downgrade) path is changing all zone configs to hold
only one replica and doing a stop-the-world upgrade. We supply a tool that,
when run against a cluster, "truncates" all the Raft logs by replacing any
command within them with `NoopRequest` (or, if we have an assertion against
trivial batches somewhere, a `CPut` against a practically impossible existing
value). Stopping all nodes, running the tool, restarting with new binaries,
and then restoring the zone config is the official version change across this
version (or just dump-restore, if available by then).

That is not a good migration story by any means, but everything else is
dangerously involved. It draws justification from the fact that this is the
basic change that allows for a well-defined migration process and in fact lays
much of the groundwork for it.

I'm happy to discuss more involved options, but I doubt there's any free lunch
here.

# Drawbacks

* DeleteRange could create very large writes

## Is this a scary change?

Yes and no. It's certainly scary

* if you try to migrate it properly (see above)
* if you try to forklift it all
* that with a change of this magnitude we could stumble upon surprises
  in late-ish stages which call the whole change into question
* that we need this RFC to materialize in code soon and that a proper
  migration story is blocked on it. That means we'll have to have a bunch of
  people working on it simultaneously without too much distraction.

As I understand it, the change is very ambitious, but not as vast in scope as,
for example, the plans we have for distributed SQL (which, of course, are not
on such a tight deadline).

# Alternatives

Other than leaving the current system in place and pushing complexity onto the
migration story, none.

# Unresolved questions

(Updated as comments roll in)