
RFC: dedicated storage engine for Raft #16361

Merged
irfansharif merged 1 commit into cockroachdb:master from the raftlog-rfc branch on Jun 14, 2017

Conversation

irfansharif
Contributor

Each Replica is backed by a single instance of RocksDB which is used to
store all modifications to the underlying state machine in addition to
storing all consensus state. This RFC proposes the separation of the
two, outlines the motivations for doing so and alternatives considered.

@cockroach-teamcity
Member

This change is Reviewable

@petermattis
Collaborator

:lgtm: Nice RFC!


Review status: 0 of 1 files reviewed at latest revision, 4 unresolved discussions, some commit checks pending.


docs/RFCS/dedicated_raft_storage.md, line 114 at r1 (raw file):

```go
func BenchmarkBatchCommitSequential(b *testing.B) {
```

I can see the value in this benchmark, but I'd be happier with an alternating benchmark that used 2 separate RocksDB instances. Hopefully this would get similar numbers to this benchmark.
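
For illustration only, an alternating benchmark along these lines might look like the sketch below; `tempEngine`, `newTempEngine`, and `commitSynced` are hypothetical stand-ins for opening a scratch on-disk RocksDB instance and committing a synced batch, not actual CockroachDB APIs.

```go
// Rough sketch of an alternating benchmark: synced batch commits are spread
// across two independent RocksDB instances instead of a single shared one.
// tempEngine, newTempEngine, and commitSynced are assumed helpers.
func BenchmarkBatchCommitAlternating(b *testing.B) {
	for _, valueSize := range []int{1 << 10, 1 << 12, 1 << 14, 1 << 16, 1 << 18, 1 << 20} {
		b.Run(fmt.Sprintf("vs=%d", valueSize), func(b *testing.B) {
			engines := [2]*tempEngine{newTempEngine(b), newTempEngine(b)}
			defer engines[0].Close()
			defer engines[1].Close()

			value := make([]byte, valueSize)
			b.SetBytes(int64(valueSize))
			b.ResetTimer()
			for i := 0; i < b.N; i++ {
				// Alternate the target instance every iteration so both
				// engines see interleaved synced writes, as they would on a
				// node hosting separate raft and KV engines.
				commitSynced(b, engines[i%2], fmt.Sprintf("key-%06d", i), value)
			}
		})
	}
}
```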


docs/RFCS/dedicated_raft_storage.md, line 241 at r1 (raw file):

done for all reads, we will stop writing out raft entries to the original
RocksDB instance.  Raft specific reads, writes and log truncations will now be
serviced by the new instance. </br>

Since we're punting on the migration story for now, it seems worthwhile to structure this so that both a shared RocksDB instance and a separate RocksDB instance can be used, controlled by an env var.

Also, given that the goal in this RFC is a performance improvement, it would be good to structure work so that we can get a full-system sanity check of the performance soon, even if this means doing something slightly hacky.
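
A minimal sketch of what that env-var switch might look like, assuming placeholder types (`engine`) and a hypothetical `openRocksDB` constructor; the variable name borrows `COCKROACH_DEDICATED_RAFT_STORAGE`, which the eventual implementation used, and the `raft` subdirectory matches the layout mentioned later in the RFC.

```go
// Illustrative only: open the regular engine, and open a dedicated raft
// engine (e.g. under a "raft" subdirectory of the store) only when the env
// var asks for it; otherwise both handles refer to the shared instance.
func openEngines(storeDir string) (kvEng, raftEng engine, err error) {
	kvEng, err = openRocksDB(storeDir) // placeholder for the existing constructor
	if err != nil {
		return nil, nil, err
	}
	raftEng = kvEng // default: shared instance, i.e. today's behavior
	if os.Getenv("COCKROACH_DEDICATED_RAFT_STORAGE") == "ENABLED" {
		if raftEng, err = openRocksDB(filepath.Join(storeDir, "raft")); err != nil {
			return nil, nil, err
		}
	}
	return kvEng, raftEng, nil
}
```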


docs/RFCS/dedicated_raft_storage.md, line 259 at r1 (raw file):

TBD. The Raft log is tightly coupled to our consistency checker, this will be
something to be wary of in order to transition safely.

Also TBD is how to actually start using this RocksDB instance. One thought is that only new nodes would use a separate RocksDB instance for the Raft log. Another is to add a store-level migration. Let's not explore these too deeply yet, just mention them briefly.


docs/RFCS/dedicated_raft_storage.md, line 292 at r1 (raw file):

for the Raft log usage patterns. Possible reasons for doing so:
- A native implementation in Go would avoid the CGo overhead we incur crossing
  the G/C++ boundary

s/G/Go/g


Comments from Reviewable

@a-robinson
Contributor

:lgtm: Great writeup, @irfansharif


Review status: 0 of 1 files reviewed at latest revision, 6 unresolved discussions, all commit checks successful.


docs/RFCS/dedicated_raft_storage.md, line 114 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

I can see the value in this benchmark, but I'd be happier with an alternating benchmark that used 2 separate RocksDB instances. Hopefully this would get similar numbers to this benchmark.

Yeah, I was going to say the same thing. There might be some interference between the two that would be very relevant to the accuracy of the benchmark.


docs/RFCS/dedicated_raft_storage.md, line 81 at r2 (raw file):

    for _, valueSize := range []int{1 << 10, 1 << 12, ..., 1 << 20} {
		b.Run(fmt.Sprintf("vs=%d", valueSize), func(b *testing.B) {
            // ...

You may want to clean up the indentation in here to make it render properly


docs/RFCS/dedicated_raft_storage.md, line 247 at r2 (raw file):

RocksDB by default uses faster fdatasync() to sync files. We'll need to
use fsync() instead in filesystems like ext3 where you can lose files after a
reboot, this can be done so by setting Options::use_fsync to true (for the
WriteBatch).

This comment is kinda coming out of the blue. Feel free to punt these questions out into a separate issue, but does this problem affect us today? Will we always enable use_fsync? If not, how will we tell when we need to?


Comments from Reviewable

@irfansharif
Contributor Author

Review status: 0 of 1 files reviewed at latest revision, 6 unresolved discussions.


docs/RFCS/dedicated_raft_storage.md, line 241 at r1 (raw file):

it would be good to structure work so that we can get a full-system sanity check of the performance soon, even if this means doing something slightly hacky.

I'm not sure I follow, do you mind clarifying?


docs/RFCS/dedicated_raft_storage.md, line 81 at r2 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

You may want to clean up the indentation in here to make it render properly

hmm, strange. works for me™:
✂-1


docs/RFCS/dedicated_raft_storage.md, line 247 at r2 (raw file):
Source: rocksdb/wiki.

does this problem affect us today?

possibly, yes.

Will we always enable use_fsync? If not, how will we tell when we need to?

As mentioned above, only for ext3 filesystems. This is a flag set on rocksdb::Options; it should happen here but doesn't seem to. As for how, I'm sure there's a way to detect whether the instance is configured to run on an ext3 filesystem.

This jumped out at me as well; I couldn't find any instances of us accounting for this, but I'll file + address it separately once I confirm.
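
Purely as an illustration of one way such detection could work on Linux (not something CockroachDB does today), the filesystem type of the mount containing the store directory can be looked up, and `use_fsync` forced when it reports ext3:

```go
// Illustrative, Linux-only sketch: report the filesystem type of the mount
// containing path by scanning /proc/self/mounts and picking the longest
// matching mount point. A result of "ext3" would suggest forcing fsync.
func fsTypeForPath(path string) (string, error) {
	data, err := os.ReadFile("/proc/self/mounts")
	if err != nil {
		return "", err
	}
	bestLen, bestType := -1, ""
	for _, line := range strings.Split(string(data), "\n") {
		fields := strings.Fields(line) // device, mount point, fstype, ...
		if len(fields) < 3 {
			continue
		}
		mnt, fstype := fields[1], fields[2]
		if strings.HasPrefix(path, mnt) && len(mnt) > bestLen {
			bestLen, bestType = len(mnt), fstype
		}
	}
	return bestType, nil
}
```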


Comments from Reviewable

@a-robinson
Contributor

Reviewed 1 of 1 files at r2, 1 of 1 files at r3.
Review status: all files reviewed at latest revision, 5 unresolved discussions, all commit checks successful.


docs/RFCS/dedicated_raft_storage.md, line 81 at r2 (raw file):

Previously, irfansharif (irfan sharif) wrote…

hmm, strange. works for me™:
✂-1

Not on github in chrome:

Screen Shot 2017-06-06 at 5.51.06 PM.png

Reviewable is even indicating that there's some whitespace weirdness in that some of the lines have the red >> arrows and others don't.


Comments from Reviewable

irfansharif force-pushed the raftlog-rfc branch 2 times, most recently from 0ce250b to 3db903f on June 6, 2017 23:04
@irfansharif
Contributor Author

Review status: 0 of 1 files reviewed at latest revision, 5 unresolved discussions.


docs/RFCS/dedicated_raft_storage.md, line 241 at r1 (raw file):

Previously, irfansharif (irfan sharif) wrote…

it would be good to structure work so that we can get a full-system sanity check of the performance soon, even if this means doing something slightly hacky.

I'm not sure I follow, do you mind clarifying?

Added these notes to the RFC.


docs/RFCS/dedicated_raft_storage.md, line 259 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Also TBD is how to actually start using this RocksDB instance. One thought is that only new nodes would use a separate RocksDB instance for the Raft log. Another is to add a store-level migration. Let's not explore these too deeply yet, just mention them briefly.

Done.


docs/RFCS/dedicated_raft_storage.md, line 81 at r2 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

Not on github in chrome:

Screen Shot 2017-06-06 at 5.51.06 PM.png

Reviewable is even indicating that there's some whitespace weirdness in that some of the lines have the red >> arrows and others don't.

strange, Done.


Comments from Reviewable

@bdarnell
Contributor

bdarnell commented Jun 7, 2017

:lgtm:


Review status: 0 of 1 files reviewed at latest revision, 9 unresolved discussions, some commit checks pending.


docs/RFCS/dedicated_raft_storage.md, line 272 at r4 (raw file):

TBD. Some things to take note of:
- The Raft log is tightly coupled to our consistency checker, this will be
  something to be wary of in order to transition safely.

I'm not seeing the issue here. The consistency checker doesn't check the raft log itself. All we require is that the consistency checker gets the right snapshot of the main database, which is controlled by the way we apply entries and track the AppliedIndex in the main database (not the raft one)


docs/RFCS/dedicated_raft_storage.md, line 274 at r4 (raw file):

  something to be wary of in order to transition safely.
- How do we actually start using this RocksDB instance? One thought is that only
  new nodes would use a separate RocksDB instance for the Raft log. Another

I'd prefer a general migration story instead of a one-time "new nodes use the new format" migration. In addition to solving the problem for existing clusters, a general migration process could be used to move the raft rocksdb instance from one location to another (such as to the non-volatile memory discussed below)


docs/RFCS/dedicated_raft_storage.md, line 284 at r4 (raw file):

It is not immediately obvious how much disk space should be allocated for the
Raft specific RocksDB engine. This is not something we had to concern ourselves

If we weren't concerned about it before, why should we be concerned about it now? It doesn't seem like anything has changed. RocksDB's management of the disk space doesn't change just because there are two instances. (The raft logs compete for space with the regular data just as much today).


docs/RFCS/dedicated_raft_storage.md, line 378 at r4 (raw file):

in non-volatile memory instead of disk (for etcd/raft)<sup>[6]</sup>.
Given we're proposing a separate storage engine for the Raft log, in the
presence of more suitable hardware medium it should be easy enough to configure

Even without more specialized hardware it might be desirable to configure the raft rocksdb and regular rocksdb to use different disks.


Comments from Reviewable

@tbg
Member

tbg commented Jun 7, 2017

:lgtm:


Reviewed 1 of 1 files at r4.
Review status: all files reviewed at latest revision, 11 unresolved discussions, some commit checks pending.


docs/RFCS/dedicated_raft_storage.md, line 241 at r1 (raw file):

I'm not sure I follow, do you mind clarifying?

Build it so that we can benchmark it ASAP, even if that benchmarked version has blemishes or unaddressed problems that are not expected to influence benchmarks. Just to avoid finding out at the end that the numbers are not as favorable after having put in a lot of work.


docs/RFCS/dedicated_raft_storage.md, line 57 at r4 (raw file):

alternative below).

By having a dedicated storage engine for Raft's persistent state we can address

Can you briefly list which keys actually go in the new engine? Off the top of my head I'm thinking HardState and log keys, but would be good to have a list here.


docs/RFCS/dedicated_raft_storage.md, line 256 at r4 (raw file):

**NB**: There's a subtle edge case to be wary of with respect to raft log
truncations, before truncating the raft log we need to ensure that the
application of the truncated entries have actually been persisted.</br>

You're worried that you would purge the log entries, but then not commit the updated TruncatedState, right? This command is going to be interesting anyways because I think it wants to write to both engines. Think this is worth expanding.


Comments from Reviewable

@irfansharif
Contributor Author

irfansharif commented Jun 8, 2017

Review status: all files reviewed at latest revision, 11 unresolved discussions, some commit checks pending.


docs/RFCS/dedicated_raft_storage.md, line 114 at r1 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

Yeah, I was going to say the same thing. There might be some interference between the two that would be very relevant to the accuracy of the benchmark.

hmm, I was looking at this today and tried out the alternating benchmark idea using 2 separate RocksDB instances; it's mostly what I expected (similar to the results posted here), save for value sizes of 65536 bytes (64 KiB).

~ benchstat perf-old.txt perf-new.txt
name                      old time/op    new time/op    delta
BatchCommit/vs=1024-4       74.0µs ± 3%    71.5µs ± 2%   -3.39%  (p=0.000 n=17+17)
BatchCommit/vs=4096-4        120µs ± 9%     109µs ±16%   -9.18%  (p=0.000 n=20+20)
BatchCommit/vs=16384-4       324µs ±11%     213µs ± 2%  -34.24%  (p=0.000 n=20+18)
BatchCommit/vs=65536-4      1.13ms ±33%    1.13ms ±10%     ~     (p=0.214 n=20+19)
BatchCommit/vs=262144-4     3.63ms ± 4%    2.92ms ± 4%  -19.64%  (p=0.000 n=20+20)
BatchCommit/vs=1048576-4    10.4ms ± 5%     8.1ms ± 5%  -21.97%  (p=0.000 n=16+20)

name                      old speed      new speed      delta
BatchCommit/vs=1024-4     13.8MB/s ± 3%  14.3MB/s ± 2%   +3.50%  (p=0.000 n=17+17)
BatchCommit/vs=4096-4     34.1MB/s ± 8%  37.6MB/s ±14%  +10.31%  (p=0.000 n=20+20)
BatchCommit/vs=16384-4    50.6MB/s ±10%  76.9MB/s ± 2%  +51.81%  (p=0.000 n=20+18)
BatchCommit/vs=65536-4    58.5MB/s ±26%  57.9MB/s ±10%     ~     (p=0.214 n=20+19)
BatchCommit/vs=262144-4   72.2MB/s ± 4%  89.8MB/s ± 4%  +24.42%  (p=0.000 n=20+20)
BatchCommit/vs=1048576-4   101MB/s ± 5%   130MB/s ± 5%  +28.17%  (p=0.000 n=16+20)

The actual code is here. Any ideas for what's so special about 64 KiB?


Comments from Reviewable

@petermattis
Collaborator

Review status: all files reviewed at latest revision, 11 unresolved discussions, some commit checks pending.


docs/RFCS/dedicated_raft_storage.md, line 114 at r1 (raw file):

Previously, irfansharif (irfan sharif) wrote…

hmm, I was looking at this today and tried out the alternating benchmark idea using 2 separate RocksDB instances; it's mostly what I expected (similar to the results posted here), save for value sizes of 65536 bytes (64 KiB).

~ benchstat perf-old.txt perf-new.txt
name                      old time/op    new time/op    delta
BatchCommit/vs=1024-4       74.0µs ± 3%    71.5µs ± 2%   -3.39%  (p=0.000 n=17+17)
BatchCommit/vs=4096-4        120µs ± 9%     109µs ±16%   -9.18%  (p=0.000 n=20+20)
BatchCommit/vs=16384-4       324µs ±11%     213µs ± 2%  -34.24%  (p=0.000 n=20+18)
BatchCommit/vs=65536-4      1.13ms ±33%    1.13ms ±10%     ~     (p=0.214 n=20+19)
BatchCommit/vs=262144-4     3.63ms ± 4%    2.92ms ± 4%  -19.64%  (p=0.000 n=20+20)
BatchCommit/vs=1048576-4    10.4ms ± 5%     8.1ms ± 5%  -21.97%  (p=0.000 n=16+20)

name                      old speed      new speed      delta
BatchCommit/vs=1024-4     13.8MB/s ± 3%  14.3MB/s ± 2%   +3.50%  (p=0.000 n=17+17)
BatchCommit/vs=4096-4     34.1MB/s ± 8%  37.6MB/s ±14%  +10.31%  (p=0.000 n=20+20)
BatchCommit/vs=16384-4    50.6MB/s ±10%  76.9MB/s ± 2%  +51.81%  (p=0.000 n=20+18)
BatchCommit/vs=65536-4    58.5MB/s ±26%  57.9MB/s ±10%     ~     (p=0.214 n=20+19)
BatchCommit/vs=262144-4   72.2MB/s ± 4%  89.8MB/s ± 4%  +24.42%  (p=0.000 n=20+20)
BatchCommit/vs=1048576-4   101MB/s ± 5%   130MB/s ± 5%  +28.17%  (p=0.000 n=16+20)

The actual code is here. Any ideas for what's so special about 64 KiB?

The benchmark looks good. The vs=65536 runs have high variance. Were you doing anything else on the machine while running the benchmarks? Even browsing the web can foul things up.


Comments from Reviewable

@irfansharif
Contributor Author

Review status: all files reviewed at latest revision, 11 unresolved discussions, some commit checks pending.


docs/RFCS/dedicated_raft_storage.md, line 114 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

The benchmark looks good. The vs=65536 runs have high variance. Were you doing anything else on the machine while running the benchmarks? Even browsing the web can foul things up.

nope, and I've gotten similar results across multiple runs of the same experiment. Poking around, I found something else that was curious.
Here are two versions of a benchmark; the only difference is that one has a single loop body and writes out b.N synced & unsynced writes (interleaved), vs. two loop bodies with b.N synced & unsynced writes each.
As before, they're directed at separately running instances; here are the perf differences (especially stark for 64 KiB workloads):

~ benchstat perf-interleaved.txt perf-sequential.txt
name                      old time/op    new time/op     delta
BatchCommit/vs=1024-4       70.1µs ± 2%     68.6µs ± 5%   -2.15%  (p=0.021 n=8+10)
BatchCommit/vs=4096-4        102µs ± 1%       97µs ± 7%   -4.10%  (p=0.013 n=9+10)
BatchCommit/vs=16384-4       207µs ± 5%      188µs ± 4%   -9.46%  (p=0.000 n=9+9)
BatchCommit/vs=65536-4      1.07ms ±12%     0.62ms ± 9%  -41.90%  (p=0.000 n=8+9)
BatchCommit/vs=262144-4     2.90ms ± 8%     2.70ms ± 4%   -6.68%  (p=0.000 n=9+10)
BatchCommit/vs=1048576-4    8.06ms ± 9%     7.90ms ± 5%     ~     (p=0.631 n=10+10)

name                      old speed      new speed       delta
BatchCommit/vs=1024-4     14.6MB/s ± 2%   14.9MB/s ± 4%   +2.22%  (p=0.021 n=8+10)
BatchCommit/vs=4096-4     40.3MB/s ± 1%   42.1MB/s ± 7%   +4.37%  (p=0.013 n=9+10)
BatchCommit/vs=16384-4    78.6MB/s ± 5%   87.4MB/s ± 3%  +11.09%  (p=0.000 n=10+9)
BatchCommit/vs=65536-4    61.6MB/s ±13%  105.5MB/s ± 8%  +71.32%  (p=0.000 n=8+9)
BatchCommit/vs=262144-4   90.6MB/s ± 7%   97.1MB/s ± 4%   +7.13%  (p=0.000 n=9+10)
BatchCommit/vs=1048576-4   130MB/s ± 8%    133MB/s ± 5%     ~     (p=0.631 n=10+10)

The only reasonable conclusion I can draw from this is that the separately running instances are not as isolated as expected. Given that this is running on a single-disk machine, it's possibly something to do with disk and/or OS buffers? I can update the RFC with these results (i.e. the skewed 64 KiB workloads) and make a note to investigate further down the line. I also tried initializing each instance with its own separately sized cache and upper disk usage limit (changed setupMVCCRocksDB to this end to not use the defaults), but to no effect.
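
To make the two shapes concrete, the difference is roughly the following, with `engine` as a placeholder handle and `commitBatch` a hypothetical helper that commits one batch of the given value, optionally synced (not the actual benchmark code):

```go
// Interleaved: each iteration issues one synced write (raft-log style) and
// one unsynced write (state-machine style), alternating between the two
// separately running instances.
func runInterleaved(b *testing.B, raftEng, kvEng engine, value []byte) {
	for i := 0; i < b.N; i++ {
		commitBatch(b, raftEng, value, true /* sync */)
		commitBatch(b, kvEng, value, false /* sync */)
	}
}

// Sequential: all b.N synced writes first, then all b.N unsynced writes, so
// the two instances never see interleaved traffic within a run.
func runSequential(b *testing.B, raftEng, kvEng engine, value []byte) {
	for i := 0; i < b.N; i++ {
		commitBatch(b, raftEng, value, true /* sync */)
	}
	for i := 0; i < b.N; i++ {
		commitBatch(b, kvEng, value, false /* sync */)
	}
}
```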


Comments from Reviewable

@irfansharif
Contributor Author

Review status: 0 of 1 files reviewed at latest revision, 11 unresolved discussions.


docs/RFCS/dedicated_raft_storage.md, line 57 at r4 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Can you briefly list which keys actually go in the new engine? Off the top of my head I'm thinking HardState and log keys, but would be good to have a list here.

Done.


docs/RFCS/dedicated_raft_storage.md, line 256 at r4 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

You're worried that you would purge the log entries, but then not commit the updated TruncatedState, right? This command is going to be interesting anyways because I think it wants to write to both engines. Think this is worth expanding.

Done.


docs/RFCS/dedicated_raft_storage.md, line 272 at r4 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

I'm not seeing the issue here. The consistency checker doesn't check the raft log itself. All we require is that the consistency checker gets the right snapshot of the main database, which is controlled by the way we apply entries and track the AppliedIndex in the main database (not the raft one)

ah, I was thinking of TruncatedState being coupled to ReplicaState (thereby not allowing truncations to happen independently across replicas). Removed.


docs/RFCS/dedicated_raft_storage.md, line 274 at r4 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

I'd prefer a general migration story instead of a one-time "new nodes use the new format" migration. In addition to solving the problem for existing clusters, a general migration process could be used to move the raft rocksdb instance from one location to another (such as to the non-volatile memory discussed below)

I've added some thoughts I have on this; PTAL, as I've found little precedent for this in the past.


docs/RFCS/dedicated_raft_storage.md, line 284 at r4 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

If we weren't concerned about it before, why should we be concerned about it now? It doesn't seem like anything has changed. RocksDB's management of the disk space doesn't change just because there are two instances. (The raft logs compete for space with the regular data just as much today).

I was under the impression that each RocksDB instance was bounded by the maxSize it was initialized with (storage/engine/rocksdb.go: NewRocksDB). Assuming each instance has to be initialized with its own maxSize, knowing ahead of time what ratio of the total available disk space gets assigned to each instance is not yet obvious. The point about investigating multiple embedded RocksDB databases was to this end, i.e. whether both instances can somehow be configured to compete for the same total space as before.


docs/RFCS/dedicated_raft_storage.md, line 378 at r4 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Even without more specialized hardware it might be desirable to configure the raft rocksdb and regular rocksdb to use different disks.

Added.


Comments from Reviewable

@petermattis
Collaborator

Review status: 0 of 1 files reviewed at latest revision, 11 unresolved discussions.


docs/RFCS/dedicated_raft_storage.md, line 114 at r1 (raw file):

Previously, irfansharif (irfan sharif) wrote…

nope, and I've gotten similar results across multiple runs of the same experiment. Poking around, I found something else that was curious.
Here are two versions of a benchmark; the only difference is that one has a single loop body and writes out b.N synced & unsynced writes (interleaved), vs. two loop bodies with b.N synced & unsynced writes each.
As before, they're directed at separately running instances; here are the perf differences (especially stark for 64 KiB workloads):

~ benchstat perf-interleaved.txt perf-sequential.txt
name                      old time/op    new time/op     delta
BatchCommit/vs=1024-4       70.1µs ± 2%     68.6µs ± 5%   -2.15%  (p=0.021 n=8+10)
BatchCommit/vs=4096-4        102µs ± 1%       97µs ± 7%   -4.10%  (p=0.013 n=9+10)
BatchCommit/vs=16384-4       207µs ± 5%      188µs ± 4%   -9.46%  (p=0.000 n=9+9)
BatchCommit/vs=65536-4      1.07ms ±12%     0.62ms ± 9%  -41.90%  (p=0.000 n=8+9)
BatchCommit/vs=262144-4     2.90ms ± 8%     2.70ms ± 4%   -6.68%  (p=0.000 n=9+10)
BatchCommit/vs=1048576-4    8.06ms ± 9%     7.90ms ± 5%     ~     (p=0.631 n=10+10)

name                      old speed      new speed       delta
BatchCommit/vs=1024-4     14.6MB/s ± 2%   14.9MB/s ± 4%   +2.22%  (p=0.021 n=8+10)
BatchCommit/vs=4096-4     40.3MB/s ± 1%   42.1MB/s ± 7%   +4.37%  (p=0.013 n=9+10)
BatchCommit/vs=16384-4    78.6MB/s ± 5%   87.4MB/s ± 3%  +11.09%  (p=0.000 n=10+9)
BatchCommit/vs=65536-4    61.6MB/s ±13%  105.5MB/s ± 8%  +71.32%  (p=0.000 n=8+9)
BatchCommit/vs=262144-4   90.6MB/s ± 7%   97.1MB/s ± 4%   +7.13%  (p=0.000 n=9+10)
BatchCommit/vs=1048576-4   130MB/s ± 8%    133MB/s ± 5%     ~     (p=0.631 n=10+10)

The only reasonable conclusion I can draw from this is that the separately running instances are not as isolated as expected. Given that this is running on a single-disk machine, it's possibly something to do with disk and/or OS buffers? I can update the RFC with these results (i.e. the skewed 64 KiB workloads) and make a note to investigate further down the line. I also tried initializing each instance with its own separately sized cache and upper disk usage limit (changed setupMVCCRocksDB to this end to not use the defaults), but to no effect.

Given my recent work on #14108, we'll definitely have to pay attention to how the separate RocksDB instances interact. For example, I seem to recall that we can configure them to use a shared background thread pool. Would that be better, or should they have separate thread pools for background compactions?


docs/RFCS/dedicated_raft_storage.md, line 284 at r4 (raw file):

Previously, irfansharif (irfan sharif) wrote…

I was under the impression that each RocksDB instance was bounded by the maxSize it was initialized with (storage/engine/rocksdb.go: NewRocksDB). Assuming each instance has to be initialized with its own maxSize, knowing ahead of time what ratio of the total available disk space gets assigned to each instance is not yet obvious. The point about investigating multiple embedded RocksDB databases was to this end, i.e. whether both instances can somehow be configured to compete for the same total space as before.

The maxSize setting is only used for in-memory databases which are used for testing. Well, it looks like you might be able to set it for on-disk databases, but I'm not sure what we even do with the parameter in that instance.


Comments from Reviewable

@irfansharif
Contributor Author

Review status: 0 of 1 files reviewed at latest revision, 11 unresolved discussions.


docs/RFCS/dedicated_raft_storage.md, line 114 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Given my recent work on #14108, we'll definitely have to pay attention to how the separate RocksDB instances interact. For example, I seem to recall that we can configure them to use a shared background thread pool. Would that be better, or should they have separate thread pools for background compactions?

they can be configured to use a shared bg compaction thread pool, yes; as for whether or not they should be, I think that will have to be determined experimentally. AFAICT the usage patterns here let us get away with very few compaction threads (we only need to compact what we truncate away, and compactions need not happen frequently).


docs/RFCS/dedicated_raft_storage.md, line 284 at r4 (raw file):

Previously, petermattis (Peter Mattis) wrote…

The maxSize setting is only used for in-memory databases which are used for testing. Well, it looks like you might be able to set it for on-disk databases, but I'm not sure what we even do with the parameter in that instance.

welp, you're right - apparently we do nothing in that case. pkg/base/store_spec.go: StoreSpec.SizeInBytes threw me off; it gets passed down from the cli (we can do --store-path=/mnt/ssd01,size=0.02TiB) but then isn't used to size on-disk instances (it's only used when calculating 'free space' for re-balancing purposes).

Removed this whole section.


Comments from Reviewable

@bdarnell
Contributor

Review status: 0 of 1 files reviewed at latest revision, 13 unresolved discussions.


docs/RFCS/dedicated_raft_storage.md, line 247 at r5 (raw file):

`raft` under our existing RocksDB storage directory.
At the time of writing the keys that would need to be written to the new engine
are the log keys and `HardState`.

We may want to consider using two column families for this, since the log keys are write-once and short-lived, while the hard state is overwritten frequently but never goes away completely.


docs/RFCS/dedicated_raft_storage.md, line 280 at r5 (raw file):

disk. If the node crashes at this point it will fail to load the
`TruncatedState` and has no way to bridge the gap between the last persisted
`ReplicaState` and the oldest entry in the truncated Raft log.

This seems bad. Do you have an idea for how to handle this? Maybe the truncation state should be stored in the log rocksdb instead of the regular one.


docs/RFCS/dedicated_raft_storage.md, line 313 at r5 (raw file):

migration at this time.

An online approach that could enable live migrations moving consensus state

We could also do a partially-online migration by using the new format for all new Replicas, so that as splits and rebalances occur we'd gradually move to the new format.


Comments from Reviewable

@tbg
Member

tbg commented Jun 12, 2017

Reviewed 1 of 1 files at r6.
Review status: all files reviewed at latest revision, 11 unresolved discussions.


docs/RFCS/dedicated_raft_storage.md, line 247 at r5 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

We may want to consider using two column families for this, since the log keys are write-once and short-lived, while the hard state is overwritten frequently but never goes away completely.

No doubt you're aware, just wanted to point out explicitly that log keys are only usually write-once (log tail can be replaced after leadership change). If they were truly write once, we could use the RocksDB SingleDelete optimization, however useful that one may be.


docs/RFCS/dedicated_raft_storage.md, line 280 at r5 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

This seems bad. Do you have an idea for how to handle this? Maybe the truncation state should be stored in the log rocksdb instead of the regular one.

I think you just have to sync when you write the TruncatedState. Is that a big downside?

You can do better: you can only truncate the log once TruncatedState has been synced to disk. If we had a mechanism for finding out when that has happened, we could just remove the log entries then.

I don't know if it's relevant here, but it could be related: we should refactor the way evalTruncateLog works. It currently takes writes all the way through the proposer-evaluated KV machinery, and at least from the graphs it looks like that's enough traffic to impair Raft throughput on its own. We could lower the actual ranged clear below Raft (after all, no migration concerns there). We would be relaxing, somewhat, the stats which are now authoritative and would then only become "real" once the Raft log had actually been purged all the way up to the TruncatedState. I think there's no problem with that.

I haven't thought much about moving TruncatedState to the other engine. Seems possible too.


docs/RFCS/dedicated_raft_storage.md, line 313 at r5 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

We could also do a partially-online migration by using the new format for all new Replicas, so that as splits and rebalances occur we'd gradually move to the new format.

That's possible, but is it really a useful approach? We try to minimize data movement, and stable deployments could take forever to actually upgrade.


Comments from Reviewable

irfansharif force-pushed the raftlog-rfc branch 2 times, most recently from c61fe8f to cd8feb9 on June 12, 2017 16:02
irfansharif force-pushed the raftlog-rfc branch 2 times, most recently from 83ec940 to 4ac998e on June 12, 2017 16:04
@irfansharif
Contributor Author

thank you for the detailed reviews everyone! I don't see any outstanding comments so will be moving it to final comment period now.


Review status: 0 of 1 files reviewed at latest revision, 11 unresolved discussions.


docs/RFCS/dedicated_raft_storage.md, line 247 at r5 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

No doubt you're aware, just wanted to point out explicitly that log keys are only usually write-once (log tail can be replaced after leadership change). If they were truly write once, we could use the RocksDB SingleDelete optimization, however useful that one may be.

Noted.


docs/RFCS/dedicated_raft_storage.md, line 280 at r5 (raw file):

Do you have an idea for how to handle this?

not yet, I didn't get far enough into my experimentation branch to run into this. I have little context on the actual code implementation here, but I will mention the alternatives considered when implementing (storing the truncated state in the log RocksDB was one, but I don't know how much internal coupling there is yet).


Comments from Reviewable

@tbg
Member

tbg commented Jun 12, 2017

Reviewed 1 of 1 files at r7.
Review status: all files reviewed at latest revision, 11 unresolved discussions.


docs/RFCS/dedicated_raft_storage.md, line 280 at r5 (raw file):

Previously, irfansharif (irfan sharif) wrote…

Do you have an idea for how to handle this?

not yet, I didn't get far enough into my experimentation branch to run into this. I have little context on the actual code implementation here, but I will mention the alternatives considered when implementing (storing the truncated state in the log RocksDB was one, but I don't know how much internal coupling there is yet).

This section doesn't reflect the discussion here yet. I think you can just list these alternatives here and then decide on one when you're implementing.


Comments from Reviewable

@bdarnell
Contributor

Reviewed 1 of 1 files at r5.
Review status: 0 of 1 files reviewed at latest revision, 11 unresolved discussions.


docs/RFCS/dedicated_raft_storage.md, line 247 at r5 (raw file):

Previously, irfansharif (irfan sharif) wrote…

Noted.

Ah, right. I was thinking about the naming scheme used for sideloaded sstables, which include the term and are really write-once.


docs/RFCS/dedicated_raft_storage.md, line 280 at r5 (raw file):

I think you just have to sync when you write the TruncatedState. Is that a big downside?

This will be the only time we explicitly sync the KV rocksdb. It'll be an expensive sync and might cause performance problems. I think using the other engine will be better if it works.


Comments from Reviewable

@petermattis
Collaborator

Review status: 0 of 1 files reviewed at latest revision, 11 unresolved discussions.


docs/RFCS/dedicated_raft_storage.md, line 280 at r5 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

I think you just have to sync when you write the TruncatedState. Is that a big downside?

This will be the only time we explicitly sync the KV rocksdb. It'll be an expensive sync and might cause performance problems. I think using the other engine will be better if it works.

It's not just the TruncatedState that needs to be synced, but the associated effects of those entries, right? For example, if I Put(a) the primary RocksDB engine must have synced a before I can truncate that Put operation from the Raft log.


Comments from Reviewable

@tbg
Member

tbg commented Jun 12, 2017

Review status: 0 of 1 files reviewed at latest revision, 11 unresolved discussions, some commit checks pending.


docs/RFCS/dedicated_raft_storage.md, line 280 at r5 (raw file):

Previously, petermattis (Peter Mattis) wrote…

It's not just the TruncatedState that needs to be synced, but the associated effects of those entries, right? For example, if I Put(a) the primary RocksDB engine must have synced a before I can truncate that Put operation from the Raft log.

You're right, we need to explicitly sync whenever the synced AppliedIndex < new first log index.


Comments from Reviewable

@bdarnell
Contributor

Review status: 0 of 1 files reviewed at latest revision, 11 unresolved discussions, some commit checks pending.


docs/RFCS/dedicated_raft_storage.md, line 280 at r5 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

You're right, we need to explicitly sync whenever the synced AppliedIndex < new first log index.

Oops, I should have read the whole paragraph more closely. It's not just TruncatedState so moving that to the other engine probably won't help. In that case I think syncing the KV engine when writing TruncatedState is probably the best we can do.
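
A sketch of the resulting ordering, with placeholder types: `syncer` stands in for whichever engine interface exposes an explicit sync, and `clearLogEntries` for deleting raft log keys from the dedicated engine. The concrete mechanism was left to the implementation phase.

```go
// syncer is a placeholder for an engine handle that can be explicitly synced.
type syncer interface {
	Sync() error
}

// truncateRaftLog sketches the ordering discussed above: the KV engine must
// have durably applied (and recorded, via TruncatedState/AppliedIndex)
// everything at or below truncIndex before those entries are removed from
// the dedicated raft engine.
func truncateRaftLog(kvEng, raftEng syncer, syncedAppliedIndex, truncIndex uint64) error {
	if syncedAppliedIndex < truncIndex {
		// The effects of the entries being truncated may not be durable yet;
		// force an (expensive) explicit sync of the KV engine first.
		if err := kvEng.Sync(); err != nil {
			return err
		}
	}
	// Only now is it safe to delete entries at or below truncIndex from the
	// raft engine. clearLogEntries is a hypothetical helper.
	return clearLogEntries(raftEng, truncIndex)
}
```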


Comments from Reviewable

irfansharif force-pushed the raftlog-rfc branch 2 times, most recently from 3b0e6c9 to 9673bd8 on June 12, 2017 21:00
@irfansharif
Contributor Author

Review status: 0 of 1 files reviewed at latest revision, 11 unresolved discussions, some commit checks pending.


docs/RFCS/dedicated_raft_storage.md, line 280 at r5 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Oops, I should have read the whole paragraph more closely. It's not just TruncatedState so moving that to the other engine probably won't help. In that case I think syncing the KV engine when writing TruncatedState is probably the best we can do.

Expanded the paragraph further for clarity and added takeaways from the discussion here.


Comments from Reviewable

@tbg
Member

tbg commented Jun 13, 2017

Reviewed 1 of 1 files at r8.
Review status: all files reviewed at latest revision, 11 unresolved discussions, all commit checks successful.


Comments from Reviewable

Each Replica is backed by a single instance of RocksDB which is used to
store all modifications to the underlying state machine in addition to
storing all consensus state. This RFC proposes the separation of the
two, outlines the motivations for doing so and alternatives considered.
irfansharif merged commit 45119f1 into cockroachdb:master on Jun 14, 2017
irfansharif deleted the raftlog-rfc branch on June 14, 2017 15:33
irfansharif added a commit to irfansharif/cockroach that referenced this pull request Jul 1, 2017
Implements cockroachdb#16361.

This is a breaking change. To see why, consider that prior to this we
stored all consensus data, in addition to all system metadata and
user-level keys, in the same, single RocksDB instance. Here we introduce
a separate, dedicated instance for raft data (log entries and
HardState). Cockroach nodes simply restarting with these changes, unless
migrated properly, will fail to find the most recent raft log entries
and HardState data in the new RocksDB instance.

Also consider a cluster running mixed versions (nodes with dedicated
raft storage and nodes without): what would the communication between
nodes look like in light of proposer-evaluated KV? Currently we
propagate a storagebase.WriteBatch through raft containing a serialized
representation of a RocksDB write batch; this models the changes to be
made to the single underlying RocksDB instance.
For log truncation requests where we delete log entries and/or admin
splits where we write initial HardState for newly formed replicas, we
need to similarly propagate a write batch (through raft) addressing the
new RocksDB instance (if the recipient node is one with these changes)
or the original RocksDB instance (if the recipient node is one without
these changes). What if an older version node is the raft leader and is
therefore the one upstream of raft, propagating storagebase.WriteBatches
with raft data changes but addressed to the original RocksDB instance?
What would rollbacks look like?

To this end we introduce three modes of operation,
{Disabled,Transitioning,Enabled}RaftStorage. We've made it so that it is
safe to transition from DisabledRaftStorage to
TransitioningRaftStorage, from TransitioningRaftStorage to
EnabledRaftStorage, and the reverse for rollbacks. Transition from one
mode to the next will take place when all the nodes in the cluster are
on the same previous mode. The operation mode is set by an env var
COCKROACH_DEDICATED_RAFT_STORAGE={DISABLED,TRANSITIONING,ENABLED}

- DisabledRaftStorage mode preserves the previous behavior in that we use
  a single RocksDB instance for both raft and user-level KV data.
- EnabledRaftStorage mode enables the use of the dedicated RocksDB
  instance for raft data. Raft log entries and the HardState are stored on
  this instance alone.
- TransitioningRaftStorage mode uses both RocksDB instances for raft data
  interoperably, the raft specific and the regular instance. We use
  this mode to facilitate rolling upgrades.

Most of this commit is careful plumbing of an extra
engine.{Engine,Batch,Reader,Writer,ReadWriter} for whenever we need to
interact with the new RocksDB instance. In DisabledRaftStorage both
these instances refer to the same underlying engine (thus preserving the
previous behaviour). The following pattern is oft repeated:

  batch := ...
  batchRaft := batch
  if TransitioningRaftStorage || EnabledRaftStorage {
    batchRaft = ...
  }

Here are some initial performance numbers:

  ~ benchstat perf-disabled perf-enabled
    name                             old time/op  new time/op  delta
    ReplicaRaftStorage/vs=1024-4      320µs ± 6%   385µs ±18%  +20.29%  (p=0.000 n=10+10)
    ReplicaRaftStorage/vs=4096-4      613µs ±14%   580µs ± 2%     ~     (p=0.278 n=10+9)
    ReplicaRaftStorage/vs=16384-4    2.59ms ± 3%  2.05ms ± 4%  -20.87%  (p=0.000 n=10+9)
    ReplicaRaftStorage/vs=65536-4    4.11ms ± 7%  3.29ms ± 3%  -19.97%  (p=0.000 n=10+10)
    ReplicaRaftStorage/vs=262144-4   13.4ms ± 8%  10.7ms ± 3%  -20.39%  (p=0.000 n=10+10)
    ReplicaRaftStorage/vs=1048576-4  56.8ms ± 3%  36.4ms ± 2%  -35.91%  (p=0.000 n=10+10)
irfansharif added a commit to irfansharif/cockroach that referenced this pull request Jul 10, 2017
Implements cockroachdb#16361.

This is a breaking change. To see why, consider that prior to this we
stored all consensus data, in addition to all system metadata and
user-level keys, in the same, single RocksDB instance. Here we introduce
a separate, dedicated instance for raft data (log entries and
HardState). Cockroach nodes simply restarting with these changes, unless
migrated properly, will fail to find the most recent raft log entries
and HardState data in the new RocksDB instance.

Also consider a cluster running mixed versions (nodes with dedicated
raft storage and nodes without): what would the communication between
nodes look like in light of proposer-evaluated KV? Currently we
propagate a storagebase.WriteBatch through raft containing a serialized
representation of a RocksDB write batch; this models the changes to be
made to the single underlying RocksDB instance.
For log truncation requests where we delete log entries and/or admin
splits where we write initial HardState for newly formed replicas, we
need to similarly propagate a write batch (through raft) addressing the
new RocksDB instance (if the recipient node is one with these changes)
or the original RocksDB instance (if the recipient node is one without
these changes). What if an older version node is the raft leader and is
therefore the one upstream of raft, propagating storagebase.WriteBatches
with raft data changes but addressed to the original RocksDB instance?
What would rollbacks look like?

To this end we introduce three modes of operation,
transitioningRaftStorage and enabledRaftStorage (this is implicit if
we're not in transitioning mode). We've made it so that it is safe to
transition from an older cockroach version to
transitioningRaftStorage, from transitioningRaftStorage to
enabled, and the reverse for rollbacks. Transition from one
mode to the next will take place when all the nodes in the cluster are
on the same previous mode. The operation mode is set by an env var
COCKROACH_DEDICATED_RAFT_STORAGE={DISABLED,TRANSITIONING,ENABLED}

- In the old version we use a single RocksDB instance for both raft
  and user-level KV data
- In transitioningRaftStorage mode we use both RocksDB instances for raft
  data interoperably, the raft specific and the regular instance. We use
  this mode to facilitate rolling upgrades
- In enabled mode we use the dedicated RocksDB instance for raft data.
  Raft log entries and the HardState are stored on this instance alone

Most of this commit is careful plumbing of an extra
engine.{Engine,Batch,Reader,Writer,ReadWriter} for whenever we need to
interact with the new RocksDB instance.