
RFC: dedicated storage engine for Raft #16361

Merged
irfansharif merged 1 commit into cockroachdb:master from the raftlog-rfc branch on Jun 14, 2017

Conversation

irfansharif
Contributor

Each Replica is backed by a single instance of RocksDB which is used to
store all modifications to the underlying state machine in addition to
storing all consensus state. This RFC proposes the separation of the
two, outlines the motivations for doing so and alternatives considered.

@cockroach-teamcity
Member

This change is Reviewable

@petermattis
Collaborator

:lgtm: Nice RFC!


Review status: 0 of 1 files reviewed at latest revision, 4 unresolved discussions, some commit checks pending.


docs/RFCS/dedicated_raft_storage.md, line 114 at r1 (raw file):

```go
func BenchmarkBatchCommitSequential(b *testing.B) {
```

I can see the value in this benchmark, but I'd be happier with an alternating benchmark that used 2 separate RocksDB instances. Hopefully this would get similar numbers to this benchmark.
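
For illustration only, an alternating benchmark along these lines might look like the sketch below; `tempEngine`, `newTempEngine`, and `commitSynced` are hypothetical stand-ins for opening a scratch on-disk RocksDB instance and committing a synced batch, not actual CockroachDB APIs.

```go
// Rough sketch of an alternating benchmark: synced batch commits are spread
// across two independent RocksDB instances instead of a single shared one.
// tempEngine, newTempEngine, and commitSynced are assumed helpers.
func BenchmarkBatchCommitAlternating(b *testing.B) {
	for _, valueSize := range []int{1 << 10, 1 << 12, 1 << 14, 1 << 16, 1 << 18, 1 << 20} {
		b.Run(fmt.Sprintf("vs=%d", valueSize), func(b *testing.B) {
			engines := [2]*tempEngine{newTempEngine(b), newTempEngine(b)}
			defer engines[0].Close()
			defer engines[1].Close()

			value := make([]byte, valueSize)
			b.SetBytes(int64(valueSize))
			b.ResetTimer()
			for i := 0; i < b.N; i++ {
				// Alternate the target instance every iteration so both
				// engines see interleaved synced writes, as they would on a
				// node hosting separate raft and KV engines.
				commitSynced(b, engines[i%2], fmt.Sprintf("key-%06d", i), value)
			}
		})
	}
}
```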


docs/RFCS/dedicated_raft_storage.md, line 241 at r1 (raw file):

done for all reads, we will stop writing out raft entries to the original
RocksDB instance.  Raft specific reads, writes and log truncations will now be
serviced by the new instance. </br>

Since we're punting on the migration story for now, it seems worthwhile to structure this so that both a shared RocksDB instance and a separate RocksDB instance can be used, controlled by an env var.

Also, given that the goal in this RFC is a performance improvement, it would be good to structure work so that we can get a full-system sanity check of the performance soon, even if this means doing something slightly hacky.
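
A minimal sketch of what that env-var switch might look like, assuming placeholder types (`engine`) and a hypothetical `openRocksDB` constructor; the variable name borrows `COCKROACH_DEDICATED_RAFT_STORAGE`, which the eventual implementation used, and the `raft` subdirectory matches the layout mentioned later in the RFC.

```go
// Illustrative only: open the regular engine, and open a dedicated raft
// engine (e.g. under a "raft" subdirectory of the store) only when the env
// var asks for it; otherwise both handles refer to the shared instance.
func openEngines(storeDir string) (kvEng, raftEng engine, err error) {
	kvEng, err = openRocksDB(storeDir) // placeholder for the existing constructor
	if err != nil {
		return nil, nil, err
	}
	raftEng = kvEng // default: shared instance, i.e. today's behavior
	if os.Getenv("COCKROACH_DEDICATED_RAFT_STORAGE") == "ENABLED" {
		if raftEng, err = openRocksDB(filepath.Join(storeDir, "raft")); err != nil {
			return nil, nil, err
		}
	}
	return kvEng, raftEng, nil
}
```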


docs/RFCS/dedicated_raft_storage.md, line 259 at r1 (raw file):

TBD. The Raft log is tightly coupled to our consistency checker, this will be
something to be wary of in order to transition safely.

Also TBD is how to actually start using this RocksDB instance. One thought is that only new nodes would use a separate RocksDB instance for the Raft log. Another is to add a store-level migration. Let's not explore these too deeply yet, just mention them briefly.


docs/RFCS/dedicated_raft_storage.md, line 292 at r1 (raw file):

for the Raft log usage patterns. Possible reasons for doing so:
- A native implementation in Go would avoid the CGo overhead we incur crossing
  the G/C++ boundary

s/G/Go/g


Comments from Reviewable

@a-robinson
Contributor

:lgtm: Great writeup, @irfansharif


Review status: 0 of 1 files reviewed at latest revision, 6 unresolved discussions, all commit checks successful.


docs/RFCS/dedicated_raft_storage.md, line 114 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

I can see the value in this benchmark, but I'd be happier with an alternating benchmark that used 2 separate RocksDB instances. Hopefully this would get similar numbers to this benchmark.

Yeah, I was going to say the same thing. There might be some interference between the two that would be very relevant to the accuracy of the benchmark.


docs/RFCS/dedicated_raft_storage.md, line 81 at r2 (raw file):

    for _, valueSize := range []int{1 << 10, 1 << 12, ..., 1 << 20} {
		b.Run(fmt.Sprintf("vs=%d", valueSize), func(b *testing.B) {
            // ...

You may want to clean up the indentation in here to make it render properly


docs/RFCS/dedicated_raft_storage.md, line 247 at r2 (raw file):

RocksDB by default uses faster fdatasync() to sync files. We'll need to
use fsync() instead in filesystems like ext3 where you can lose files after a
reboot, this can be done so by setting Options::use_fsync to true (for the
WriteBatch).

This comment is kinda coming out of the blue. Feel free to punt these questions out into a separate issue, but does this problem affect us today? Will we always enable use_fsync? If not, how will we tell when we need to?


Comments from Reviewable

@irfansharif
Contributor Author

Review status: 0 of 1 files reviewed at latest revision, 6 unresolved discussions.


docs/RFCS/dedicated_raft_storage.md, line 241 at r1 (raw file):

it would be good to structure work so that we can get a full-system sanity check of the performance soon, even if this means doing something slightly hacky.

I'm not sure I follow, do you mind clarifying?


docs/RFCS/dedicated_raft_storage.md, line 81 at r2 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

You may want to clean up the indentation in here to make it render properly

hmm, strange. works for me™:
✂-1


docs/RFCS/dedicated_raft_storage.md, line 247 at r2 (raw file):
Source: rocksdb/wiki.

does this problem affect us today?

possibly, yes.

Will we always enable use_fsync? If not, how will we tell when we need to?

As mentioned above, only for ext3 filesystems. This is a flag set on rocksdb::Options; it should happen here but doesn't seem to. As for how, I'm sure there's a way to detect whether the instance is configured to run on an ext3 filesystem.

This jumped out at me as well; I couldn't find any instances of us accounting for this, but I'll file + address it separately once I confirm.
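
Purely as an illustration of one way such detection could work on Linux (not something CockroachDB does today), the filesystem type of the mount containing the store directory can be looked up, and `use_fsync` forced when it reports ext3:

```go
// Illustrative, Linux-only sketch: report the filesystem type of the mount
// containing path by scanning /proc/self/mounts and picking the longest
// matching mount point. A result of "ext3" would suggest forcing fsync.
func fsTypeForPath(path string) (string, error) {
	data, err := os.ReadFile("/proc/self/mounts")
	if err != nil {
		return "", err
	}
	bestLen, bestType := -1, ""
	for _, line := range strings.Split(string(data), "\n") {
		fields := strings.Fields(line) // device, mount point, fstype, ...
		if len(fields) < 3 {
			continue
		}
		mnt, fstype := fields[1], fields[2]
		if strings.HasPrefix(path, mnt) && len(mnt) > bestLen {
			bestLen, bestType = len(mnt), fstype
		}
	}
	return bestType, nil
}
```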


Comments from Reviewable

@a-robinson
Contributor

Reviewed 1 of 1 files at r2, 1 of 1 files at r3.
Review status: all files reviewed at latest revision, 5 unresolved discussions, all commit checks successful.


docs/RFCS/dedicated_raft_storage.md, line 81 at r2 (raw file):

Previously, irfansharif (irfan sharif) wrote…

hmm, strange. works for me™:
✂-1

Not on github in chrome:

Screen Shot 2017-06-06 at 5.51.06 PM.png

Reviewable is even indicating that there's some whitespace weirdness in that some of the lines have the red >> arrows and others don't.


Comments from Reviewable

irfansharif force-pushed the raftlog-rfc branch 2 times, most recently from 0ce250b to 3db903f on June 6, 2017 23:04
@irfansharif
Contributor Author

Review status: 0 of 1 files reviewed at latest revision, 5 unresolved discussions.


docs/RFCS/dedicated_raft_storage.md, line 241 at r1 (raw file):

Previously, irfansharif (irfan sharif) wrote…

it would be good to structure work so that we can get a full-system sanity check of the performance soon, even if this means doing something slightly hacky.

I'm not sure I follow, do you mind clarifying?

Added these notes to the RFC.


docs/RFCS/dedicated_raft_storage.md, line 259 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Also TBD is how to actually start using this RocksDB instance. One thought is that only new nodes would use a separate RocksDB instance for the Raft log. Another is to add a store-level migration. Let's not explore these too deeply yet, just mention them briefly.

Done.


docs/RFCS/dedicated_raft_storage.md, line 81 at r2 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

Not on github in chrome:

Screen Shot 2017-06-06 at 5.51.06 PM.png

Reviewable is even indicating that there's some whitespace weirdness in that some of the lines have the red >> arrows and others don't.

strange, Done.


Comments from Reviewable

@bdarnell
Contributor

bdarnell commented Jun 7, 2017

:lgtm:


Review status: 0 of 1 files reviewed at latest revision, 9 unresolved discussions, some commit checks pending.


docs/RFCS/dedicated_raft_storage.md, line 272 at r4 (raw file):

TBD. Some things to take note of:
- The Raft log is tightly coupled to our consistency checker, this will be
  something to be wary of in order to transition safely.

I'm not seeing the issue here. The consistency checker doesn't check the raft log itself. All we require is that the consistency checker gets the right snapshot of the main database, which is controlled by the way we apply entries and track the AppliedIndex in the main database (not the raft one)


docs/RFCS/dedicated_raft_storage.md, line 274 at r4 (raw file):

  something to be wary of in order to transition safely.
- How do we actually start using this RocksDB instance? One thought is that only
  new nodes would use a separate RocksDB instance for the Raft log. Another

I'd prefer a general migration story instead of a one-time "new nodes use the new format" migration. In addition to solving the problem for existing clusters, a general migration process could be used to move the raft rocksdb instance from one location to another (such as to the non-volatile memory discussed below)


docs/RFCS/dedicated_raft_storage.md, line 284 at r4 (raw file):

It is not immediately obvious how much disk space should be allocated for the
Raft specific RocksDB engine. This is not something we had to concern ourselves

If we weren't concerned about it before, why should we be concerned about it now? It doesn't seem like anything has changed. RocksDB's management of the disk space doesn't change just because there are two instances. (The raft logs compete for space with the regular data just as much today).


docs/RFCS/dedicated_raft_storage.md, line 378 at r4 (raw file):

in non-volatile memory instead of disk (for etcd/raft)<sup>[6]</sup>.
Given we're proposing a separate storage engine for the Raft log, in the
presence of more suitable hardware medium it should be easy enough to configure

Even without more specialized hardware it might be desirable to configure the raft rocksdb and regular rocksdb to use different disks.


Comments from Reviewable

@tbg
Member

tbg commented Jun 7, 2017

:lgtm:


Reviewed 1 of 1 files at r4.
Review status: all files reviewed at latest revision, 11 unresolved discussions, some commit checks pending.


docs/RFCS/dedicated_raft_storage.md, line 241 at r1 (raw file):

I'm not sure I follow, do you mind clarifying?

Build it so that we can benchmark it ASAP, even if that benchmarked version has blemishes or unaddressed problems that are not expected to influence benchmarks. Just to avoid finding out at the end that the numbers are not as favorable after having put in a lot of work.


docs/RFCS/dedicated_raft_storage.md, line 57 at r4 (raw file):

alternative below).

By having a dedicated storage engine for Raft's persistent state we can address

Can you briefly list which keys actually go in the new engine? Off the top of my head I'm thinking HardState and log keys, but would be good to have a list here.


docs/RFCS/dedicated_raft_storage.md, line 256 at r4 (raw file):

**NB**: There's a subtle edge case to be wary of with respect to raft log
truncations, before truncating the raft log we need to ensure that the
application of the truncated entries have actually been persisted.</br>

You're worried that you would purge the log entries, but then not commit the updated TruncatedState, right? This command is going to be interesting anyways because I think it wants to write to both engines. Think this is worth expanding.


Comments from Reviewable

@irfansharif
Contributor Author

irfansharif commented Jun 8, 2017

Review status: all files reviewed at latest revision, 11 unresolved discussions, some commit checks pending.


docs/RFCS/dedicated_raft_storage.md, line 114 at r1 (raw file):

Previously, a-robinson (Alex Robinson) wrote…

Yeah, I was going to say the same thing. There might be some interference between the two that would be very relevant to the accuracy of the benchmark.

hmm, I was looking at this today and tried out the alternating benchmark idea using 2 separate RocksDB instances; it's mostly what I expected (similar to the results posted here), save for value sizes of 65536 bytes (64 KiB).

~ benchstat perf-old.txt perf-new.txt
name                      old time/op    new time/op    delta
BatchCommit/vs=1024-4       74.0µs ± 3%    71.5µs ± 2%   -3.39%  (p=0.000 n=17+17)
BatchCommit/vs=4096-4        120µs ± 9%     109µs ±16%   -9.18%  (p=0.000 n=20+20)
BatchCommit/vs=16384-4       324µs ±11%     213µs ± 2%  -34.24%  (p=0.000 n=20+18)
BatchCommit/vs=65536-4      1.13ms ±33%    1.13ms ±10%     ~     (p=0.214 n=20+19)
BatchCommit/vs=262144-4     3.63ms ± 4%    2.92ms ± 4%  -19.64%  (p=0.000 n=20+20)
BatchCommit/vs=1048576-4    10.4ms ± 5%     8.1ms ± 5%  -21.97%  (p=0.000 n=16+20)

name                      old speed      new speed      delta
BatchCommit/vs=1024-4     13.8MB/s ± 3%  14.3MB/s ± 2%   +3.50%  (p=0.000 n=17+17)
BatchCommit/vs=4096-4     34.1MB/s ± 8%  37.6MB/s ±14%  +10.31%  (p=0.000 n=20+20)
BatchCommit/vs=16384-4    50.6MB/s ±10%  76.9MB/s ± 2%  +51.81%  (p=0.000 n=20+18)
BatchCommit/vs=65536-4    58.5MB/s ±26%  57.9MB/s ±10%     ~     (p=0.214 n=20+19)
BatchCommit/vs=262144-4   72.2MB/s ± 4%  89.8MB/s ± 4%  +24.42%  (p=0.000 n=20+20)
BatchCommit/vs=1048576-4   101MB/s ± 5%   130MB/s ± 5%  +28.17%  (p=0.000 n=16+20)

The actual code is here. Any ideas for what's so special about 64 KiB?


Comments from Reviewable

@petermattis
Collaborator

Review status: all files reviewed at latest revision, 11 unresolved discussions, some commit checks pending.


docs/RFCS/dedicated_raft_storage.md, line 114 at r1 (raw file):

Previously, irfansharif (irfan sharif) wrote…

hmm, I was looking at this today and tried out the alternating benchmark idea using 2 separate RocksDB instances; it's mostly what I expected (similar to the results posted here), save for value sizes of 65536 bytes (64 KiB).

~ benchstat perf-old.txt perf-new.txt
name                      old time/op    new time/op    delta
BatchCommit/vs=1024-4       74.0µs ± 3%    71.5µs ± 2%   -3.39%  (p=0.000 n=17+17)
BatchCommit/vs=4096-4        120µs ± 9%     109µs ±16%   -9.18%  (p=0.000 n=20+20)
BatchCommit/vs=16384-4       324µs ±11%     213µs ± 2%  -34.24%  (p=0.000 n=20+18)
BatchCommit/vs=65536-4      1.13ms ±33%    1.13ms ±10%     ~     (p=0.214 n=20+19)
BatchCommit/vs=262144-4     3.63ms ± 4%    2.92ms ± 4%  -19.64%  (p=0.000 n=20+20)
BatchCommit/vs=1048576-4    10.4ms ± 5%     8.1ms ± 5%  -21.97%  (p=0.000 n=16+20)

name                      old speed      new speed      delta
BatchCommit/vs=1024-4     13.8MB/s ± 3%  14.3MB/s ± 2%   +3.50%  (p=0.000 n=17+17)
BatchCommit/vs=4096-4     34.1MB/s ± 8%  37.6MB/s ±14%  +10.31%  (p=0.000 n=20+20)
BatchCommit/vs=16384-4    50.6MB/s ±10%  76.9MB/s ± 2%  +51.81%  (p=0.000 n=20+18)
BatchCommit/vs=65536-4    58.5MB/s ±26%  57.9MB/s ±10%     ~     (p=0.214 n=20+19)
BatchCommit/vs=262144-4   72.2MB/s ± 4%  89.8MB/s ± 4%  +24.42%  (p=0.000 n=20+20)
BatchCommit/vs=1048576-4   101MB/s ± 5%   130MB/s ± 5%  +28.17%  (p=0.000 n=16+20)

The actual code is here. Any ideas for what's so special about 64 KiB?

The benchmark looks good. The vs=65536 runs have high variance. Were you doing anything else on the machine while running the benchmarks? Even browsing the web can foul things up.


Comments from Reviewable

@irfansharif
Contributor Author

Review status: all files reviewed at latest revision, 11 unresolved discussions, some commit checks pending.


docs/RFCS/dedicated_raft_storage.md, line 114 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

The benchmark looks good. The vs=65536 runs have high variance. Were you doing anything else on the machine while running the benchmarks? Even browsing the web can foul things up.

nope, and I've gotten similar results across multiple runs of the same experiment. Poking around, I found something else that was curious.
Here are two versions of a benchmark; the only difference is that one has a single loop body and writes out b.N synced & unsynced writes (interleaved), vs. two loop bodies with b.N synced & unsynced writes each.
As before, they're directed at separately running instances; here are the perf differences (especially stark for 64 KiB workloads):

~ benchstat perf-interleaved.txt perf-sequential.txt
name                      old time/op    new time/op     delta
BatchCommit/vs=1024-4       70.1µs ± 2%     68.6µs ± 5%   -2.15%  (p=0.021 n=8+10)
BatchCommit/vs=4096-4        102µs ± 1%       97µs ± 7%   -4.10%  (p=0.013 n=9+10)
BatchCommit/vs=16384-4       207µs ± 5%      188µs ± 4%   -9.46%  (p=0.000 n=9+9)
BatchCommit/vs=65536-4      1.07ms ±12%     0.62ms ± 9%  -41.90%  (p=0.000 n=8+9)
BatchCommit/vs=262144-4     2.90ms ± 8%     2.70ms ± 4%   -6.68%  (p=0.000 n=9+10)
BatchCommit/vs=1048576-4    8.06ms ± 9%     7.90ms ± 5%     ~     (p=0.631 n=10+10)

name                      old speed      new speed       delta
BatchCommit/vs=1024-4     14.6MB/s ± 2%   14.9MB/s ± 4%   +2.22%  (p=0.021 n=8+10)
BatchCommit/vs=4096-4     40.3MB/s ± 1%   42.1MB/s ± 7%   +4.37%  (p=0.013 n=9+10)
BatchCommit/vs=16384-4    78.6MB/s ± 5%   87.4MB/s ± 3%  +11.09%  (p=0.000 n=10+9)
BatchCommit/vs=65536-4    61.6MB/s ±13%  105.5MB/s ± 8%  +71.32%  (p=0.000 n=8+9)
BatchCommit/vs=262144-4   90.6MB/s ± 7%   97.1MB/s ± 4%   +7.13%  (p=0.000 n=9+10)
BatchCommit/vs=1048576-4   130MB/s ± 8%    133MB/s ± 5%     ~     (p=0.631 n=10+10)

The only reasonable conclusion I can draw from this is that the separately running instances are not as isolated as expected. Given that this is running on a single-disk machine, it's possibly something to do with disk and/or OS buffers? I can update the RFC with these results (i.e. the skewed 64 KiB workloads) and make a note to investigate further down the line. I also tried initializing each instance with its own separately sized cache and upper disk usage limit (changed setupMVCCRocksDB to this end to not use the defaults), but to no effect.
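
To make the two shapes concrete, the difference is roughly the following, with `engine` as a placeholder handle and `commitBatch` a hypothetical helper that commits one batch of the given value, optionally synced (not the actual benchmark code):

```go
// Interleaved: each iteration issues one synced write (raft-log style) and
// one unsynced write (state-machine style), alternating between the two
// separately running instances.
func runInterleaved(b *testing.B, raftEng, kvEng engine, value []byte) {
	for i := 0; i < b.N; i++ {
		commitBatch(b, raftEng, value, true /* sync */)
		commitBatch(b, kvEng, value, false /* sync */)
	}
}

// Sequential: all b.N synced writes first, then all b.N unsynced writes, so
// the two instances never see interleaved traffic within a run.
func runSequential(b *testing.B, raftEng, kvEng engine, value []byte) {
	for i := 0; i < b.N; i++ {
		commitBatch(b, raftEng, value, true /* sync */)
	}
	for i := 0; i < b.N; i++ {
		commitBatch(b, kvEng, value, false /* sync */)
	}
}
```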


Comments from Reviewable

@irfansharif
Contributor Author

Review status: 0 of 1 files reviewed at latest revision, 11 unresolved discussions.


docs/RFCS/dedicated_raft_storage.md, line 57 at r4 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Can you briefly list which keys actually go in the new engine? Off the top of my head I'm thinking HardState and log keys, but would be good to have a list here.

Done.


docs/RFCS/dedicated_raft_storage.md, line 256 at r4 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

You're worried that you would purge the log entries, but then not commit the updated TruncatedState, right? This command is going to be interesting anyways because I think it wants to write to both engines. Think this is worth expanding.

Done.


docs/RFCS/dedicated_raft_storage.md, line 272 at r4 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

I'm not seeing the issue here. The consistency checker doesn't check the raft log itself. All we require is that the consistency checker gets the right snapshot of the main database, which is controlled by the way we apply entries and track the AppliedIndex in the main database (not the raft one)

ah, I was thinking of TruncatedState being coupled to ReplicaState (thereby not allowing truncations to happen independently across replicas). Removed.


docs/RFCS/dedicated_raft_storage.md, line 274 at r4 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

I'd prefer a general migration story instead of a one-time "new nodes use the new format" migration. In addition to solving the problem for existing clusters, a general migration process could be used to move the raft rocksdb instance from one location to another (such as to the non-volatile memory discussed below)

I've added some thoughts I have on this; PTAL, as I've found little precedent for this in the past.


docs/RFCS/dedicated_raft_storage.md, line 284 at r4 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

If we weren't concerned about it before, why should we be concerned about it now? It doesn't seem like anything has changed. RocksDB's management of the disk space doesn't change just because there are two instances. (The raft logs compete for space with the regular data just as much today).

I was under the impression that each RocksDB instance was bounded by the maxSize it was initialized with (storage/engine/rocksdb.go: NewRocksDB). Assuming each instance has to be initialized with its own maxSize, knowing ahead of time what ratio of the total available disk space gets assigned to each instance is not yet obvious. The point about investigating multiple embedded RocksDB databases was to this end, i.e. whether both instances can somehow be configured to compete for the same total space as before.


docs/RFCS/dedicated_raft_storage.md, line 378 at r4 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Even without more specialized hardware it might be desirable to configure the raft rocksdb and regular rocksdb to use different disks.

Added.


Comments from Reviewable

@petermattis
Collaborator

Review status: 0 of 1 files reviewed at latest revision, 11 unresolved discussions.


docs/RFCS/dedicated_raft_storage.md, line 114 at r1 (raw file):

Previously, irfansharif (irfan sharif) wrote…

nope, and I've gotten similar results across multiple runs of the same experiment. Poking around, I found something else that was curious.
Here are two versions of a benchmark; the only difference is that one has a single loop body and writes out b.N synced & unsynced writes (interleaved), vs. two loop bodies with b.N synced & unsynced writes each.
As before, they're directed at separately running instances; here are the perf differences (especially stark for 64 KiB workloads):

~ benchstat perf-interleaved.txt perf-sequential.txt
name                      old time/op    new time/op     delta
BatchCommit/vs=1024-4       70.1µs ± 2%     68.6µs ± 5%   -2.15%  (p=0.021 n=8+10)
BatchCommit/vs=4096-4        102µs ± 1%       97µs ± 7%   -4.10%  (p=0.013 n=9+10)
BatchCommit/vs=16384-4       207µs ± 5%      188µs ± 4%   -9.46%  (p=0.000 n=9+9)
BatchCommit/vs=65536-4      1.07ms ±12%     0.62ms ± 9%  -41.90%  (p=0.000 n=8+9)
BatchCommit/vs=262144-4     2.90ms ± 8%     2.70ms ± 4%   -6.68%  (p=0.000 n=9+10)
BatchCommit/vs=1048576-4    8.06ms ± 9%     7.90ms ± 5%     ~     (p=0.631 n=10+10)

name                      old speed      new speed       delta
BatchCommit/vs=1024-4     14.6MB/s ± 2%   14.9MB/s ± 4%   +2.22%  (p=0.021 n=8+10)
BatchCommit/vs=4096-4     40.3MB/s ± 1%   42.1MB/s ± 7%   +4.37%  (p=0.013 n=9+10)
BatchCommit/vs=16384-4    78.6MB/s ± 5%   87.4MB/s ± 3%  +11.09%  (p=0.000 n=10+9)
BatchCommit/vs=65536-4    61.6MB/s ±13%  105.5MB/s ± 8%  +71.32%  (p=0.000 n=8+9)
BatchCommit/vs=262144-4   90.6MB/s ± 7%   97.1MB/s ± 4%   +7.13%  (p=0.000 n=9+10)
BatchCommit/vs=1048576-4   130MB/s ± 8%    133MB/s ± 5%     ~     (p=0.631 n=10+10)

The only reasonable conclusion I can draw from this is that the separately running instances are not as isolated as expected. Given that this is running on a single-disk machine, it's possibly something to do with disk and/or OS buffers? I can update the RFC with these results (i.e. the skewed 64 KiB workloads) and make a note to investigate further down the line. I also tried initializing each instance with its own separately sized cache and upper disk usage limit (changed setupMVCCRocksDB to this end to not use the defaults), but to no effect.

Given my recent work on #14108, we'll definitely have to pay attention to how the separate RocksDB instances interact. For example, I seem to recall that we can configure them to use a shared background thread pool. Would that be better, or should they have separate thread pools for background compactions?


docs/RFCS/dedicated_raft_storage.md, line 284 at r4 (raw file):

Previously, irfansharif (irfan sharif) wrote…

I was under the impression that each RocksDB instance was bounded by the maxSize it was initialized with (storage/engine/rocksdb.go: NewRocksDB). Assuming each instance has to be initialized with its own maxSize, knowing ahead of time what ratio of the total available disk space gets assigned to each instance is not yet obvious. The point about investigating multiple embedded RocksDB databases was to this end, i.e. whether both instances can somehow be configured to compete for the same total space as before.

The maxSize setting is only used for in-memory databases which are used for testing. Well, it looks like you might be able to set it for on-disk databases, but I'm not sure what we even do with the parameter in that instance.


Comments from Reviewable

@irfansharif
Contributor Author

Review status: 0 of 1 files reviewed at latest revision, 11 unresolved discussions.


docs/RFCS/dedicated_raft_storage.md, line 114 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Given my recent work on #14108, we'll definitely have to pay attention to how the separate RocksDB instances interact. For example, I seem to recall that we can configure them to use a shared background thread pool. Would that be better, or should they have separate thread pools for background compactions?

they can be configured to use a shared bg compaction thread pool, yes; as for whether or not they should be, I think that will have to be determined experimentally. AFAICT the usage patterns here let us get away with very few compaction threads (we only need to compact what we truncate away, and compactions need not happen frequently).


docs/RFCS/dedicated_raft_storage.md, line 284 at r4 (raw file):

Previously, petermattis (Peter Mattis) wrote…

The maxSize setting is only used for in-memory databases which are used for testing. Well, it looks like you might be able to set it for on-disk databases, but I'm not sure what we even do with the parameter in that instance.

welp, you're right - apparently we do nothing in that case. pkg/base/store_spec.go: StoreSpec.SizeInBytes threw me off; it gets passed down from the cli (we can do --store-path=/mnt/ssd01,size=0.02TiB) but then isn't used to size on-disk instances (it's only used when calculating 'free space' for re-balancing purposes).

Removed this whole section.


Comments from Reviewable

@bdarnell
Contributor

Review status: 0 of 1 files reviewed at latest revision, 13 unresolved discussions.


docs/RFCS/dedicated_raft_storage.md, line 247 at r5 (raw file):

`raft` under our existing RocksDB storage directory.
At the time of writing the keys that would need to be written to the new engine
are the log keys and `HardState`.

We may want to consider using two column families for this, since the log keys are write-once and short-lived, while the hard state is overwritten frequently but never goes away completely.


docs/RFCS/dedicated_raft_storage.md, line 280 at r5 (raw file):

disk. If the node crashes at this point it will fail to load the
`TruncatedState` and has no way to bridge the gap between the last persisted
`ReplicaState` and the oldest entry in the truncated Raft log.

This seems bad. Do you have an idea for how to handle this? Maybe the truncation state should be stored in the log rocksdb instead of the regular one.


docs/RFCS/dedicated_raft_storage.md, line 313 at r5 (raw file):

migration at this time.

An online approach that could enable live migrations moving consensus state

We could also do a partially-online migration by using the new format for all new Replicas, so that as splits and rebalances occur we'd gradually move to the new format.


Comments from Reviewable

@tbg
Member

tbg commented Jun 12, 2017

Reviewed 1 of 1 files at r6.
Review status: all files reviewed at latest revision, 11 unresolved discussions.


docs/RFCS/dedicated_raft_storage.md, line 247 at r5 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

We may want to consider using two column families for this, since the log keys are write-once and short-lived, while the hard state is overwritten frequently but never goes away completely.

No doubt you're aware, just wanted to point out explicitly that log keys are only usually write-once (log tail can be replaced after leadership change). If they were truly write once, we could use the RocksDB SingleDelete optimization, however useful that one may be.


docs/RFCS/dedicated_raft_storage.md, line 280 at r5 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

This seems bad. Do you have an idea for how to handle this? Maybe the truncation state should be stored in the log rocksdb instead of the regular one.

I think you just have to sync when you write the TruncatedState. Is that a big downside?

You can do better: you can only truncate the log once TruncatedState has been synced to disk. If we had a mechanism for finding out when that has happened, we could just remove the log entries then.

I don't know if it's relevant here, but it could be related: we should refactor the way evalTruncateLog works. It currently takes writes all the way through the proposer-evaluated KV machinery, and at least from the graphs it looks like that's enough traffic to impair Raft throughput on its own. We could lower the actual ranged clear below Raft (after all, no migration concerns there). We would be relaxing, somewhat, the stats which are now authoritative and would then only become "real" once the Raft log had actually been purged all the way up to the TruncatedState. I think there's no problem with that.

I haven't thought much about moving TruncatedState to the other engine. Seems possible too.


docs/RFCS/dedicated_raft_storage.md, line 313 at r5 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

We could also do a partially-online migration by using the new format for all new Replicas, so that as splits and rebalances occur we'd gradually move to the new format.

That's possible, but is it really a useful approach? We try to minimize data movement, and stable deployments could take forever to actually upgrade.


Comments from Reviewable

irfansharif force-pushed the raftlog-rfc branch 2 times, most recently from c61fe8f to cd8feb9 on June 12, 2017 16:02
irfansharif force-pushed the raftlog-rfc branch 2 times, most recently from 83ec940 to 4ac998e on June 12, 2017 16:04
@irfansharif
Contributor Author

thank you for the detailed reviews everyone! I don't see any outstanding comments so will be moving it to final comment period now.


Review status: 0 of 1 files reviewed at latest revision, 11 unresolved discussions.


docs/RFCS/dedicated_raft_storage.md, line 247 at r5 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

No doubt you're aware, just wanted to point out explicitly that log keys are only usually write-once (log tail can be replaced after leadership change). If they were truly write once, we could use the RocksDB SingleDelete optimization, however useful that one may be.

Noted.


docs/RFCS/dedicated_raft_storage.md, line 280 at r5 (raw file):

Do you have an idea for how to handle this?

not yet, I didn't get far enough into my experimentation branch to run into this. I have little context on the actual code implementation here, but I will mention the alternatives considered when implementing (storing the truncated state in the log RocksDB was one, but I don't know how much internal coupling there is yet).


Comments from Reviewable

@tbg
Member

tbg commented Jun 12, 2017

Reviewed 1 of 1 files at r7.
Review status: all files reviewed at latest revision, 11 unresolved discussions.


docs/RFCS/dedicated_raft_storage.md, line 280 at r5 (raw file):

Previously, irfansharif (irfan sharif) wrote…

Do you have an idea for how to handle this?

not yet, I didn't get far enough into my experimentation branch to run into this. I have little context on the actual code implementation here, but I will mention the alternatives considered when implementing (storing the truncated state in the log RocksDB was one, but I don't know how much internal coupling there is yet).

This section doesn't reflect the discussion here yet. I think you can just list these alternatives here and then decide on one when you're implementing.


Comments from Reviewable

@bdarnell
Contributor

Reviewed 1 of 1 files at r5.
Review status: 0 of 1 files reviewed at latest revision, 11 unresolved discussions.


docs/RFCS/dedicated_raft_storage.md, line 247 at r5 (raw file):

Previously, irfansharif (irfan sharif) wrote…

Noted.

Ah, right. I was thinking about the naming scheme used for sideloaded sstables, which include the term and are really write-once.


docs/RFCS/dedicated_raft_storage.md, line 280 at r5 (raw file):

I think you just have to sync when you write the TruncatedState. Is that a big downside?

This will be the only time we explicitly sync the KV rocksdb. It'll be an expensive sync and might cause performance problems. I think using the other engine will be better if it works.


Comments from Reviewable

@petermattis
Collaborator

Review status: 0 of 1 files reviewed at latest revision, 11 unresolved discussions.


docs/RFCS/dedicated_raft_storage.md, line 280 at r5 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

I think you just have to sync when you write the TruncatedState. Is that a big downside?

This will be the only time we explicitly sync the KV rocksdb. It'll be an expensive sync and might cause performance problems. I think using the other engine will be better if it works.

It's not just the TruncatedState that needs to be synced, but the associated effects of those entries, right? For example, if I Put(a) the primary RocksDB engine must have synced a before I can truncate that Put operation from the Raft log.


Comments from Reviewable

@tbg
Member

tbg commented Jun 12, 2017

Review status: 0 of 1 files reviewed at latest revision, 11 unresolved discussions, some commit checks pending.


docs/RFCS/dedicated_raft_storage.md, line 280 at r5 (raw file):

Previously, petermattis (Peter Mattis) wrote…

It's not just the TruncatedState that needs to be synced, but the associated effects of those entries, right? For example, if I Put(a) the primary RocksDB engine must have synced a before I can truncate that Put operation from the Raft log.

You're right, we need to explicitly sync whenever the synced AppliedIndex < new first log index.


Comments from Reviewable

@bdarnell
Contributor

Review status: 0 of 1 files reviewed at latest revision, 11 unresolved discussions, some commit checks pending.


docs/RFCS/dedicated_raft_storage.md, line 280 at r5 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

You're right, we need to explicitly sync whenever the synced AppliedIndex < new first log index.

Oops, I should have read the whole paragraph more closely. It's not just TruncatedState so moving that to the other engine probably won't help. In that case I think syncing the KV engine when writing TruncatedState is probably the best we can do.
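
A sketch of the resulting ordering, with placeholder types: `syncer` stands in for whichever engine interface exposes an explicit sync, and `clearLogEntries` for deleting raft log keys from the dedicated engine. The concrete mechanism was left to the implementation phase.

```go
// syncer is a placeholder for an engine handle that can be explicitly synced.
type syncer interface {
	Sync() error
}

// truncateRaftLog sketches the ordering discussed above: the KV engine must
// have durably applied (and recorded, via TruncatedState/AppliedIndex)
// everything at or below truncIndex before those entries are removed from
// the dedicated raft engine.
func truncateRaftLog(kvEng, raftEng syncer, syncedAppliedIndex, truncIndex uint64) error {
	if syncedAppliedIndex < truncIndex {
		// The effects of the entries being truncated may not be durable yet;
		// force an (expensive) explicit sync of the KV engine first.
		if err := kvEng.Sync(); err != nil {
			return err
		}
	}
	// Only now is it safe to delete entries at or below truncIndex from the
	// raft engine. clearLogEntries is a hypothetical helper.
	return clearLogEntries(raftEng, truncIndex)
}
```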


Comments from Reviewable

irfansharif force-pushed the raftlog-rfc branch 2 times, most recently from 3b0e6c9 to 9673bd8 on June 12, 2017 21:00
@irfansharif
Contributor Author

Review status: 0 of 1 files reviewed at latest revision, 11 unresolved discussions, some commit checks pending.


docs/RFCS/dedicated_raft_storage.md, line 280 at r5 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Oops, I should have read the whole paragraph more closely. It's not just TruncatedState so moving that to the other engine probably won't help. In that case I think syncing the KV engine when writing TruncatedState is probably the best we can do.

Expanded the paragraph further for clarity and added takeaways from the discussion here.


Comments from Reviewable

@tbg
Member

tbg commented Jun 13, 2017

Reviewed 1 of 1 files at r8.
Review status: all files reviewed at latest revision, 11 unresolved discussions, all commit checks successful.


Comments from Reviewable

Each Replica is backed by a single instance of RocksDB which is used to
store all modifications to the underlying state machine in addition to
storing all consensus state. This RFC proposes the separation of the
two, outlines the motivations for doing so and alternatives considered.
irfansharif merged commit 45119f1 into cockroachdb:master on Jun 14, 2017
irfansharif deleted the raftlog-rfc branch on June 14, 2017 15:33
irfansharif added a commit to irfansharif/cockroach that referenced this pull request Jul 1, 2017
Implements cockroachdb#16361.

This is a breaking change. To see why, consider that prior to this we
stored all consensus data, in addition to all system metadata and
user-level keys, in the same, single RocksDB instance. Here we introduce
a separate, dedicated instance for raft data (log entries and
HardState). Cockroach nodes simply restarting with these changes, unless
migrated properly, will fail to find the most recent raft log entries
and HardState data in the new RocksDB instance.

Also consider a cluster running mixed versions (nodes with dedicated
raft storage and nodes without): what would the communication between
nodes look like in light of proposer-evaluated KV? Currently we
propagate a storagebase.WriteBatch through raft containing a serialized
representation of a RocksDB write batch; this models the changes to be
made to the single underlying RocksDB instance.
For log truncation requests where we delete log entries and/or admin
splits where we write initial HardState for newly formed replicas, we
need to similarly propagate a write batch (through raft) addressing the
new RocksDB instance (if the recipient node is one with these changes)
or the original RocksDB instance (if the recipient node is one without
these changes). What if an older version node is the raft leader and is
therefore the one upstream of raft, propagating storagebase.WriteBatches
with raft data changes but addressed to the original RocksDB instance?
What would rollbacks look like?

To this end we introduce three modes of operation,
{Disabled,Transitioning,Enabled}RaftStorage. We've made it so that it is
safe to transition from DisabledRaftStorage to
TransitioningRaftStorage, from TransitioningRaftStorage to
EnabledRaftStorage, and the reverse for rollbacks. Transition from one
mode to the next will take place when all the nodes in the cluster are
on the same previous mode. The operation mode is set by an env var
COCKROACH_DEDICATED_RAFT_STORAGE={DISABLED,TRANSITIONING,ENABLED}

- DisabledRaftStorage mode preserves the previous behavior in that we use
  a single RocksDB instance for both raft and user-level KV data.
- EnabledRaftStorage mode enables the use of the dedicated RocksDB
  instance for raft data. Raft log entries and the HardState are stored on
  this instance alone.
- TransitioningRaftStorage mode uses both RocksDB instances for raft data
  interoperably, the raft specific and the regular instance. We use
  this mode to facilitate rolling upgrades.

Most of this commit is careful plumbing of an extra
engine.{Engine,Batch,Reader,Writer,ReadWriter} for whenever we need to
interact with the new RocksDB instance. In DisabledRaftStorage both
these instances refer to the same underlying engine (thus preserving the
previous behaviour). The following pattern is oft repeated:

  batch := ...
  batchRaft := batch
  if TransitioningRaftStorage || EnabledRaftStorage {
    batchRaft = ...
  }

Here are some initial performance numbers:

  ~ benchstat perf-disabled perf-enabled
    name                             old time/op  new time/op  delta
    ReplicaRaftStorage/vs=1024-4      320µs ± 6%   385µs ±18%  +20.29%  (p=0.000 n=10+10)
    ReplicaRaftStorage/vs=4096-4      613µs ±14%   580µs ± 2%     ~     (p=0.278 n=10+9)
    ReplicaRaftStorage/vs=16384-4    2.59ms ± 3%  2.05ms ± 4%  -20.87%  (p=0.000 n=10+9)
    ReplicaRaftStorage/vs=65536-4    4.11ms ± 7%  3.29ms ± 3%  -19.97%  (p=0.000 n=10+10)
    ReplicaRaftStorage/vs=262144-4   13.4ms ± 8%  10.7ms ± 3%  -20.39%  (p=0.000 n=10+10)
    ReplicaRaftStorage/vs=1048576-4  56.8ms ± 3%  36.4ms ± 2%  -35.91%  (p=0.000 n=10+10)
irfansharif added a commit to irfansharif/cockroach that referenced this pull request Jul 10, 2017
Implements cockroachdb#16361.

This is a breaking change. To see why, consider that prior to this we
stored all consensus data, in addition to all system metadata and
user-level keys, in the same, single RocksDB instance. Here we introduce
a separate, dedicated instance for raft data (log entries and
HardState). Cockroach nodes simply restarting with these changes, unless
migrated properly, will fail to find the most recent raft log entries
and HardState data in the new RocksDB instance.

Also consider a cluster running mixed versions (nodes with dedicated
raft storage and nodes without): what would the communication between
nodes look like in light of proposer-evaluated KV? Currently we
propagate a storagebase.WriteBatch through raft containing a serialized
representation of a RocksDB write batch; this models the changes to be
made to the single underlying RocksDB instance.
For log truncation requests where we delete log entries and/or admin
splits where we write initial HardState for newly formed replicas, we
need to similarly propagate a write batch (through raft) addressing the
new RocksDB instance (if the recipient node is one with these changes)
or the original RocksDB instance (if the recipient node is one without
these changes). What if an older version node is the raft leader and is
therefore the one upstream of raft, propagating storagebase.WriteBatches
with raft data changes but addressed to the original RocksDB instance?
What would rollbacks look like?

To this end we introduce three modes of operation,
transitioningRaftStorage and enabledRaftStorage (this is implicit if
we're not in transitioning mode). We've made it so that it is safe to
transition from an older cockroach version to
transitioningRaftStorage, from transitioningRaftStorage to
enabled, and the reverse for rollbacks. Transition from one
mode to the next will take place when all the nodes in the cluster are
on the same previous mode. The operation mode is set by an env var
COCKROACH_DEDICATED_RAFT_STORAGE={DISABLED,TRANSITIONING,ENABLED}

- In the old version we use a single RocksDB instance for both raft
  and user-level KV data
- In transitioningRaftStorage mode we use both RocksDB instances for raft
  data interoperably, the raft specific and the regular instance. We use
  this mode to facilitate rolling upgrades
- In enabled mode we use the dedicated RocksDB instance for raft data.
  Raft log entries and the HardState are stored on this instance alone

Most of this commit is careful plumbing of an extra
engine.{Engine,Batch,Reader,Writer,ReadWriter} for whenever we need to
interact with the new RocksDB instance.