- Feature Name: leader_evaluated_raft
- Status: draft
- Start Date: 2016-04-19
- Authors: Tobias Schottdorf
- RFC PR: (PR # after acceptance of initial draft)
- Cockroach Issue: (one or more # from the issue tracker)

# Summary

Pursue an idea brought up by @spencerkimball in the context of
[#5985](https://github.com/cockroachdb/cockroach/pull/5985): move the bulk of
the logic involved in applying commands to a Replica upstream of Raft.

# Motivation

The [first exploration of
migrations](https://github.com/cockroachdb/cockroach/pull/5985) highlighted
the complexity concomitant with migrations on top of the current Raft
execution model.

As of this writing, the leading replica is the only replica that proposes
commands to Raft. However, each replica applies commands individually, which
leads to large swaths of code living "downstream" of Raft. Since all of this
code needs to produce identical output even during migrations, a satisfactory
migration story is almost unachievable, an observation made by several
reviewers (and the author).

By having the proposing replica compute the effects on the key-value state
before proposing the command to Raft, much of this downstream code moves
upstream of Raft; naively, what's left are simple writes to key-value pairs
(i.e. all commands move upstream except for a simplified `MVCCPut` operation).
While some complexity remains, this holds promise for a much simpler migration
story, as well as de-triplicating work previously done by each individual
replica.

# Detailed design

## Execution path

### Proposing commands

The main change in the execution path will be in `proposeRaftCommand`, which
currently takes a `BatchRequest` and creates a `pendingCmd` of the form

```go
type pendingCmd struct {
  ctx     context.Context
  idKey   storagebase.CmdIDKey
  done    chan roachpb.ResponseWithError // Used to signal waiting RPC handler
  raftCmd struct {
    RangeID       RangeID
    OriginReplica ReplicaDescriptor
    Cmd           BatchRequest
  }
}
```

Within this struct, the main change is to the type of `Cmd`: it changes from
`BatchRequest` to a new type which carries the effects of applying the batch,
as computed (by the leader) earlier in `proposeRaftCommand`.

For simplicity, in this draft we will assume that the `raftCmd` struct will
be modified according to these requirements.
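
Concretely, a rough sketch of the modified struct could look as follows (the
`RaftCmd` name anticipates the proto message proposed below; everything here
is illustrative rather than final):

```go
// Rough sketch only: RaftCmd stands in for the proto-backed message proposed
// below; field names and layout are illustrative, not final.
type pendingCmd struct {
  ctx     context.Context
  idKey   storagebase.CmdIDKey
  done    chan roachpb.ResponseWithError // Used to signal waiting RPC handler
  raftCmd struct {
    RangeID       RangeID
    OriginReplica ReplicaDescriptor
    Cmd           RaftCmd // the computed effects; replaces BatchRequest
  }
}
```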

The "effects" contain

* any key-value writes: for example the outcome of a `Put`, `DeleteRange`,
  `CPut`, `BeginTransaction`, or `LeaderLease`.
  Some alternatives exist: writes could be encoded as `[]MVCCKeyValue`, or
  even as post-MVCC raw (i.e. opaque) key-value pairs. We'll assume the former
  for now.
* parameters for any "triggers" which must run with the *application* of the
  command. In particular, this is necessary on `EndTransaction` with a commit
  trigger, when a transaction is aborted (in which case the abort cache must
  be populated), and when leadership changes (to apply the new lease).
  There are likely more, but this already shows that these triggers (which are
  downstream of Raft) will require various pieces of data to be passed around
  with them.

Thus, we arrive at the following (proto-backed) candidate struct:

```proto
# Pseudo-Go-Protobuf, all naming ad-hoc and TBD.
message RaftCmd {
  optional SomeMetadata  # tbd (proposing replica, MVCCStats diff, ...)
  repeated oneof effect {
    # Applied in field and slice order.
    repeated Write writes = ...;  # see MVCC section
    optional Split ...;
    optional Merge ...;
    optional ChangeReplicas ...;
    optional ModifiedSpan ...;
    optional LeaderLease ...;
    optional AbortCache ...;
    optional TSCache ...;  # if effect not computable from `Writes`
    # Those below may not be required, but something to think about:
    optional GCRange ...;    # allow optimized GC (some logic downstream of Raft?)
    optional ClearRange ...; # allow for dropping all kv-pairs without huge cmd
  }
}
```

Since this simple structure is easy to reason about, it seems adequate to
strive for the minimum amount of interaction between the various effects
(though the data required for the triggers may benefit from more interaction
between them in some cases).

### Computing the write set

Before a leader proposes a command, it has the job of translating a
`BatchRequest` into a `RaftCmd`. Note that we assume that writes wait for
both reads and writes in the command queue (so that at the point at which
the `RaftCmd` is constructed, all prior mutations have been carried out and
the timestamp cache updated).

This largely takes over the work which was done in

```
applyRaftCommandInBatch -> executeWriteBatch -> executeBatch
  -> (ReplicaCommands) -> (MVCC)
```

This is hopefully largely mechanical, with the bulk of work in the MVCC layer,
outlined in the next section.
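
As a sketch of the intended shape of this translation (the function name,
signature, and plumbing are hypothetical and heavily abbreviated; the
write-tracking `Batch` is described in the MVCC section below):

```go
// evaluateProposal illustrates the proposal-side evaluation: the batch is
// executed against a write-tracking engine batch, the resulting write set
// (and, not shown, any trigger data) is packed into a RaftCmd, and the batch
// itself is discarded rather than committed.
func evaluateProposal(eng Engine, ba roachpb.BatchRequest) (RaftCmd, error) {
  batch := eng.NewBatch()
  defer batch.Close() // never committed; we only want its effects

  // Run the existing ReplicaCommands/MVCC logic against the batch (signature
  // abbreviated). Side effects such as commit triggers or lease changes are
  // captured as trigger fields on the command, not executed here.
  if _, err := executeBatch(batch, ba); err != nil {
    return RaftCmd{}, err
  }

  writes, err := batch.GetWrites()
  if err != nil {
    return RaftCmd{}, err
  }
  return RaftCmd{Writes: writes}, nil
}
```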

The work to be carried out in `ReplicaCommands` is mostly isolating the
various side effects (most prominently the commit and leader lease triggers).
In the process, it should be possible to remove these methods from
`(*Replica)` (though that is not a stated goal of this RFC).

### MVCC & Engine Layer

Operations in the MVCC layer (`MVCC{Get,Put,Scan,DeleteRange,...}`) currently
return some variation of `([]roachpb.KeyValue, []roachpb.Intent, error)` and
operate on the underlying `engine.Engine` (which in practice is a RocksDB
batch) with heavy use of iterators.

With the new code, we want mostly the same, but with two important changes:

* we do not actually intend to commit the batch (though we could do this on
  the proposing node to avoid overhead)
* we need to construct the repeated `Write` field, which (in some way or
  another) tracks the key-value pairs written (i.e. created, changed or
  deleted) over the lifetime of the batch.

It's also desirable to avoid invasive changes to the MVCC layer. The following
extension to the `Engine` interface seems adequate (mod changes to improve
buffer reuse):

```go
type Engine interface {
  // ...
  NewBatch() Batch
}

var ErrNotTracking = errors.New("batch is not tracking writes")

type Writes []MVCCKeyValue // to be discussed; could also be raw KV ([][2]byte)

type Batch interface {
  Engine
  // GetWrites returns a slice of Write updates which were carried out during
  // the batch, or ErrNotTracking if the batch does not support this feature.
  // The ordering of the writes is not guaranteed to be stable and may not
  // correspond to the actual order in which the writes were carried out.
  // However, the result of applying the updates in the returned order *is*
  // stable, i.e. non-commutative (overlapping) writes can't rearrange
  // themselves freely. [i.e. can use a map under the hood]
  // The returned Writes are read-only and must not be used after their Batch
  // is finalized. [i.e. can reclaim the memory at that point, or we add an
  // explicit method to finalize the writes if that is more convenient]
  GetWrites() (Writes, error)
}
```
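
A brief usage sketch of this contract (surrounding code elided; whether a copy
is needed depends on how the `Writes` end up being serialized into the
command):

```go
// Sketch: the proposer reads the write set back out of the tracking batch and
// copies/serializes it into the RaftCmd before finalizing the batch, since
// the returned Writes may alias memory owned by the batch.
b := eng.NewBatch()
// ... MVCC operations executed against b on behalf of the proposal ...
writes, err := b.GetWrites()
if err != nil {
  return err // e.g. ErrNotTracking
}
cmd.Writes = append(Writes(nil), writes...) // copy before finalizing b
b.Close() // the batch (and the memory behind writes) can now be reclaimed
```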

Naively, this can be achieved by wrapping a "regular" batch with an
appropriate in-memory map which is populated on writes. A better
implementation may be hard to come by, even with knowledge of the underlying
`Engine` implementation (i.e. currently almost exclusively RocksDB). Some
refactoring should ensure that buffers used for writing MVCC values are
treated as "owned" by the engine (so that populating the internal map can
freely hold on to the memory); the memory can then be fed back into the pool
at a higher level (i.e. after the batch has been applied or discarded).
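
A minimal sketch of that naive strategy (names and method signatures are
approximations; `encodeKey` is a hypothetical helper that turns an `MVCCKey`
into a map key):

```go
// trackingBatch sketches the naive strategy: it wraps a regular Batch and
// records the latest write per key in an in-memory map. Go map iteration is
// unordered, which is why GetWrites only promises that the *result* of
// applying the returned writes is stable, not their ordering.
type trackingBatch struct {
  Batch // the wrapped "regular" batch

  writes map[string]MVCCKeyValue // last write per encoded key
}

func (b *trackingBatch) Put(key MVCCKey, value []byte) error {
  if err := b.Batch.Put(key, value); err != nil {
    return err
  }
  b.writes[encodeKey(key)] = MVCCKeyValue{Key: key, Value: value}
  return nil
}

// Clear and the other mutating methods would be wrapped analogously,
// recording deletions (e.g. as a nil Value) in the same map.

func (b *trackingBatch) GetWrites() (Writes, error) {
  ws := make(Writes, 0, len(b.writes))
  for _, kv := range b.writes {
    ws = append(ws, kv)
  }
  return ws, nil
}
```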

With this strategy, the MVCC changes are fairly locally scoped and not too
invasive.

## Raft command application

This will be a new, straightforward facility which applies a `RaftCmd` against
the underlying `Engine` in a batch. The point of this whole change is that
this part of the stack is very straightforward and free of side effects.

`(*Replica).executeCmd` is likely a good place where most of this code would
live.
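
For illustration, a minimal sketch of such a facility, assuming the `RaftCmd`
message above and hypothetical batch methods (`Put`, `Clear`, `Commit`,
`Close`):

```go
// applyRaftCmd sketches the downstream-of-Raft apply step: all decisions were
// made upstream, so application reduces to replaying the command's write set
// into a batch (plus running any triggers it carries) and committing it.
func applyRaftCmd(eng Engine, cmd RaftCmd) error {
  batch := eng.NewBatch()
  defer batch.Close()

  // Writes are applied in the order in which they appear in the command.
  for _, w := range cmd.Writes {
    if w.Value == nil {
      // In this sketch, a nil value denotes a deletion.
      if err := batch.Clear(w.Key); err != nil {
        return err
      }
      continue
    }
    if err := batch.Put(w.Key, w.Value); err != nil {
      return err
    }
  }

  // Triggers (splits, merges, lease changes, abort cache updates, ...) would
  // be applied here from the corresponding optional fields.

  return batch.Commit()
}
```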

## Replay protection

Replay protection is naively an issue because applying a logicless batch of
writes is at odds with our current method, which largely relies on
write/timestamp conflicts.
However, as @bdarnell pointed out, as of recently (#5973) we (mostly) colocate
the Raft leader and the leader lease holder on the same node and have more
options for dealing with this (i.e. making the application of a command
conditional on the Raft log position it was originally proposed at).
This needs to be fleshed out further, but does appear to be manageable.
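
A sketch of the parenthetical idea above (field and helper names
hypothetical): the proposer records the log index at which it expects the
command to apply, and replicas skip application at any other index.

```go
// checkProposalPosition illustrates position-conditional application: a
// command that ends up at a log index other than the one it was proposed for
// (e.g. due to a replay or reproposal) is rejected instead of having its
// write set applied a second time.
func checkProposalPosition(proposedAtIndex, applyingAtIndex uint64) error {
  if proposedAtIndex != 0 && proposedAtIndex != applyingAtIndex {
    return fmt.Errorf("command proposed for index %d applying at index %d; skipping",
      proposedAtIndex, applyingAtIndex)
  }
  return nil
}
```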

### Test changes

These should be limited to those tests that exercise triggers and/or use
lower-level Raft internals inside of the `storage` package (plus some
surprises which are certain to occur with a change of this magnitude).

## Implementation strategy

Some lower-hanging fruit exist:

* the MVCC/Engine changes don't affect upgrades or existing functionality, so
  they can be carried out freely.
* the change in replay protection mentioned above can also be developed/tested
  separately. This seems wise since it's an absolute necessity.
* separating the side effects of Raft applications should also be doable; most
  of these should come up naturally when trying to remove the `(*Replica)`
  receiver on most of the `ReplicaCommands` (`LeaderLease`, `Put`, ...) and
  manually going through the rest of the call stack up to
  `processRaftCommand`.

The goal here should be to identify all the necessary refactors to keep them
out of the remaining "forklift" portion of the change.

## Migration strategy

My suggestion (which aims for minimal developer burden while allowing for some
palatable version of an upgrade path) is the following:

The official upgrade (and downgrade) path is to change all zone configs to
hold only one replica and to do a stop-the-world upgrade. We supply a tool
that, when run against a cluster, "truncates" all the Raft logs by replacing
any command within them with `NoopRequest` (or, if we have an assertion
against trivial batches somewhere, a `CPut` against a practically impossible
existing value). Stopping all nodes, running the tool, restarting with new
binaries, and then restoring the zone config is the official version change
across this version (or just dump-restore if available by then).

That is not a good migration story by any means, but everything else is
dangerously involved. It draws justification from the fact that this is the
basic change that allows for a well-defined migration process and in fact lays
much of the groundwork for it.

I'm happy to discuss more involved options, but I doubt there's any free lunch
here.

# Drawbacks

* `DeleteRange` could create very large write sets (and hence very large Raft
  commands).

## Is this a scary change?

Yes and no. It's certainly scary

* if you try to migrate it properly (see above)
* if you try to forklift it all
* that with a change of this magnitude we could stumble upon surprises
  in late-ish stages which call the whole change into question
* that we need this RFC to materialize in code soon and that a proper
  migration story is blocked on it. That means we'll have to have a bunch of
  people working on it simultaneously without too much distraction.

As I understand it, the change is very ambitious, but not as vast in scope as,
for example, the plans we have with distributed SQL (which, of course, are not
on such a tight deadline).

# Alternatives

Other than leaving the current system in place and pushing complexity onto the
migration story, none.

# Unresolved questions

(Updated as comments roll in)