RFC: dedicated storage engine for Raft #16361
Conversation
Review status: 0 of 1 files reviewed at latest revision, 4 unresolved discussions, some commit checks pending. docs/RFCS/dedicated_raft_storage.md, line 114 at r1 (raw file):
I can see the value in this benchmark, but I'd be happier with an alternating benchmark that used 2 separate RocksDB instances. Hopefully this would get similar numbers to this benchmark. docs/RFCS/dedicated_raft_storage.md, line 241 at r1 (raw file):
Since we're punting on the migration story for now, it seems worthwhile to structure this so that both a shared RocksDB instance and a separate RocksDB instance can be used, controlled by an env var. Also, given that the goal in this RFC is a performance improvement, it would be good to structure work so that we can get a full-system sanity check of the performance soon, even if this means doing something slightly hacky. docs/RFCS/dedicated_raft_storage.md, line 259 at r1 (raw file):
Also TBD is how to actually start using this RocksDB instance. One thought is that only new nodes would use a separate RocksDB instance for the Raft log. Another is to add a store-level migration. Let's not explore these too deeply yet, just mention them briefly. docs/RFCS/dedicated_raft_storage.md, line 292 at r1 (raw file):
Comments from Reviewable |
Great writeup, @irfansharif Review status: 0 of 1 files reviewed at latest revision, 6 unresolved discussions, all commit checks successful. docs/RFCS/dedicated_raft_storage.md, line 114 at r1 (raw file): Previously, petermattis (Peter Mattis) wrote…
Yeah, I was going to say the same thing. There might be some interference between the two that would be very relevant to the accuracy of the benchmark. docs/RFCS/dedicated_raft_storage.md, line 81 at r2 (raw file):
You may want to clean up the indentation in here to make it render properly docs/RFCS/dedicated_raft_storage.md, line 247 at r2 (raw file):
This comment is kinda coming out of the blue. Feel free to punt these questions out into a separate issue, but does this problem affect us today? Will we always enable
Review status: 0 of 1 files reviewed at latest revision, 6 unresolved discussions. docs/RFCS/dedicated_raft_storage.md, line 241 at r1 (raw file):
I'm not sure I follow, do you mind clarifying? docs/RFCS/dedicated_raft_storage.md, line 81 at r2 (raw file): Previously, a-robinson (Alex Robinson) wrote…
docs/RFCS/dedicated_raft_storage.md, line 247 at r2 (raw file):
possibly, yes.
like mentioned above, only for This jumped out to me as well, I couldn't find instances of us accounting for this fact but I'll file + address separately once I confirm.
Reviewed 1 of 1 files at r2, 1 of 1 files at r3. docs/RFCS/dedicated_raft_storage.md, line 81 at r2 (raw file): Not on github in chrome: Reviewable is even indicating that there's some whitespace weirdness in that some of the lines have the red >> arrows and others don't.
Force-pushed 0ce250b to 3db903f.
Review status: 0 of 1 files reviewed at latest revision, 5 unresolved discussions. docs/RFCS/dedicated_raft_storage.md, line 241 at r1 (raw file): Previously, irfansharif (irfan sharif) wrote…
Added these notes to the RFC. docs/RFCS/dedicated_raft_storage.md, line 259 at r1 (raw file): Previously, petermattis (Peter Mattis) wrote…
Done. docs/RFCS/dedicated_raft_storage.md, line 81 at r2 (raw file): Previously, a-robinson (Alex Robinson) wrote…
strange, Done.
Review status: 0 of 1 files reviewed at latest revision, 9 unresolved discussions, some commit checks pending. docs/RFCS/dedicated_raft_storage.md, line 272 at r4 (raw file):
I'm not seeing the issue here. The consistency checker doesn't check the raft log itself. All we require is that the consistency checker gets the right snapshot of the main database, which is controlled by the way we apply entries and track the AppliedIndex in the main database (not the raft one) docs/RFCS/dedicated_raft_storage.md, line 274 at r4 (raw file):
I'd prefer a general migration story instead of a one-time "new nodes use the new format" migration. In addition to solving the problem for existing clusters, a general migration process could be used to move the raft rocksdb instance from one location to another (such as to the non-volatile memory discussed below) docs/RFCS/dedicated_raft_storage.md, line 284 at r4 (raw file):
If we weren't concerned about it before, why should we be concerned about it now? It doesn't seem like anything has changed. RocksDB's management of the disk space doesn't change just because there are two instances. (The raft logs compete for space with the regular data just as much today). docs/RFCS/dedicated_raft_storage.md, line 378 at r4 (raw file):
Even without more specialized hardware it might be desirable to configure the raft rocksdb and regular rocksdb to use different disks.
Reviewed 1 of 1 files at r4. docs/RFCS/dedicated_raft_storage.md, line 241 at r1 (raw file):
Build it so that we can benchmark it ASAP, even if that benchmarked version has blemishes or unaddressed problems that are not expected to influence benchmarks. Just to avoid finding out at the end that the numbers are not as favorable after having put in a lot of work. docs/RFCS/dedicated_raft_storage.md, line 57 at r4 (raw file):
Can you briefly list which keys actually go in the new engine? Off the top of my head I'm thinking docs/RFCS/dedicated_raft_storage.md, line 256 at r4 (raw file):
You're worried that you would purge the log entries, but then not commit the updated
Review status: all files reviewed at latest revision, 11 unresolved discussions, some commit checks pending. docs/RFCS/dedicated_raft_storage.md, line 114 at r1 (raw file): Previously, a-robinson (Alex Robinson) wrote…
hmm, I was looking at this today and tried out the alternating benchmark idea using 2 separate RocksDB instances, it's mostly what I expect (similar to results posted here) save for value sizes of 65536 bytes (64 KiB).
The actual code is here. Any ideas for what's so special about 64 KiB?
Review status: all files reviewed at latest revision, 11 unresolved discussions, some commit checks pending. docs/RFCS/dedicated_raft_storage.md, line 114 at r1 (raw file): Previously, irfansharif (irfan sharif) wrote…
The benchmark looks good. The
Review status: all files reviewed at latest revision, 11 unresolved discussions, some commit checks pending. docs/RFCS/dedicated_raft_storage.md, line 114 at r1 (raw file): Previously, petermattis (Peter Mattis) wrote…
nope, and I've gotten similar results across multiple runs of the same experiment. Poking around this I found something else that was curious.
Only reasonable conclusion I can draw from this is that the separately running instances
Review status: 0 of 1 files reviewed at latest revision, 11 unresolved discussions. docs/RFCS/dedicated_raft_storage.md, line 57 at r4 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
Done. docs/RFCS/dedicated_raft_storage.md, line 256 at r4 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
Done. docs/RFCS/dedicated_raft_storage.md, line 272 at r4 (raw file): Previously, bdarnell (Ben Darnell) wrote…
ah, I was conflating docs/RFCS/dedicated_raft_storage.md, line 274 at r4 (raw file): Previously, bdarnell (Ben Darnell) wrote…
I've added some thoughts I have on this, PTAL as I've found little precedent for this in the past. docs/RFCS/dedicated_raft_storage.md, line 284 at r4 (raw file): Previously, bdarnell (Ben Darnell) wrote…
I was under the impression that each RocksDB instance was bounded by the docs/RFCS/dedicated_raft_storage.md, line 378 at r4 (raw file): Previously, bdarnell (Ben Darnell) wrote…
Added.
Review status: 0 of 1 files reviewed at latest revision, 11 unresolved discussions. docs/RFCS/dedicated_raft_storage.md, line 114 at r1 (raw file): Previously, irfansharif (irfan sharif) wrote…
Given my recent work on #14108, we'll definitely have to pay attention to how the separate RocksDB instances interact. For example, I seem to recall that we can configure them to use a shared background thread pool. Would that be better, or should they have separate thread pools for background compactions? docs/RFCS/dedicated_raft_storage.md, line 284 at r4 (raw file): Previously, irfansharif (irfan sharif) wrote…
The
Review status: 0 of 1 files reviewed at latest revision, 11 unresolved discussions. docs/RFCS/dedicated_raft_storage.md, line 114 at r1 (raw file): Previously, petermattis (Peter Mattis) wrote…
they can be configured to use a shared bg compaction thread pool yes, as for whether or not they should be I think it'll have to be determined experimentally. AFAICT the usage patterns here let us get away with very few compaction threads (we only need to compact what we truncate away, compactions need not happen frequently). docs/RFCS/dedicated_raft_storage.md, line 284 at r4 (raw file): Previously, petermattis (Peter Mattis) wrote…
welp, you're right - apparently we do nothing in that case. Removed this whole section.
Review status: 0 of 1 files reviewed at latest revision, 13 unresolved discussions. docs/RFCS/dedicated_raft_storage.md, line 247 at r5 (raw file):
We may want to consider using two column families for this, since the log keys are write-once and short-lived, while the hard state is overwritten frequently but never goes away completely. docs/RFCS/dedicated_raft_storage.md, line 280 at r5 (raw file):
This seems bad. Do you have an idea for how to handle this? Maybe the truncation state should be stored in the log rocksdb instead of the regular one. docs/RFCS/dedicated_raft_storage.md, line 313 at r5 (raw file):
We could also do a partially-online migration by using the new format for all new Replicas, so that as splits and rebalances occur we'd gradually move to the new format.
Reviewed 1 of 1 files at r6. docs/RFCS/dedicated_raft_storage.md, line 247 at r5 (raw file): Previously, bdarnell (Ben Darnell) wrote…
No doubt you're aware, just wanted to point out explicitly that log keys are only usually write-once (log tail can be replaced after leadership change). If they were truly write once, we could use the RocksDB SingleDelete optimization, however useful that one may be. docs/RFCS/dedicated_raft_storage.md, line 280 at r5 (raw file): Previously, bdarnell (Ben Darnell) wrote…
I think you just have to sync when you write the You can do better: you can only truncate the log once I don't know if it's relevant here, but could be related: we should refactor the way I haven't thought much about moving docs/RFCS/dedicated_raft_storage.md, line 313 at r5 (raw file): Previously, bdarnell (Ben Darnell) wrote…
That's possible, but is it really a useful approach? We try to minimize data movement, and stable deployments could take forever to actually upgrade.
Force-pushed c61fe8f to cd8feb9.
Force-pushed 83ec940 to 4ac998e.
thank you for the detailed reviews everyone! I don't see any outstanding comments so will be moving it to final comment period now. Review status: 0 of 1 files reviewed at latest revision, 11 unresolved discussions. docs/RFCS/dedicated_raft_storage.md, line 247 at r5 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
Noted. docs/RFCS/dedicated_raft_storage.md, line 280 at r5 (raw file):
not yet, didn't get far enough into my experimentation branch to run into this. I have little context into the actual code implementation here but will mention alternatives considered when implementing (truncated state stored in the log RocksDB was one but I don't know how much internal coupling there is as yet).
Reviewed 1 of 1 files at r7. docs/RFCS/dedicated_raft_storage.md, line 280 at r5 (raw file): Previously, irfansharif (irfan sharif) wrote…
This section doesn't reflect the discussion here yet. I think you can just list these alternatives here and then decide on one when you're implementing.
Reviewed 1 of 1 files at r5. docs/RFCS/dedicated_raft_storage.md, line 247 at r5 (raw file): Previously, irfansharif (irfan sharif) wrote…
Ah, right. I was thinking about the naming scheme used for sideloaded sstables, which include the term and are really write-once. docs/RFCS/dedicated_raft_storage.md, line 280 at r5 (raw file):
This will be the only time we explicitly sync the KV rocksdb. It'll be an expensive sync and might cause performance problems. I think using the other engine will be better if it works.
Review status: 0 of 1 files reviewed at latest revision, 11 unresolved discussions. docs/RFCS/dedicated_raft_storage.md, line 280 at r5 (raw file): Previously, bdarnell (Ben Darnell) wrote…
It's not just the
Review status: 0 of 1 files reviewed at latest revision, 11 unresolved discussions, some commit checks pending. docs/RFCS/dedicated_raft_storage.md, line 280 at r5 (raw file): Previously, petermattis (Peter Mattis) wrote…
You're right, we need to explicitly sync whenever the synced AppliedIndex < new first log index.
Review status: 0 of 1 files reviewed at latest revision, 11 unresolved discussions, some commit checks pending. docs/RFCS/dedicated_raft_storage.md, line 280 at r5 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
Oops, I should have read the whole paragraph more closely. It's not just TruncatedState so moving that to the other engine probably won't help. In that case I think syncing the KV engine when writing TruncatedState is probably the best we can do.
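The invariant settled on in this thread can be sketched as a toy model: before raft log entries are truncated away from the dedicated raft engine, the KV engine must have durably synced an applied index covering those entries, and if it hasn't, it must be synced first. All names below are hypothetical; this is a simplified illustration of the rule, not the actual CockroachDB implementation.

```go
package main

import "fmt"

// state is a toy model of the durability bookkeeping discussed above.
type state struct {
	syncedAppliedIndex uint64 // last applied index durably synced in the KV engine
	firstLogIndex      uint64 // first raft log index still retained in the raft engine
	kvSyncs            int    // number of explicit (expensive) KV-engine syncs
}

// truncate removes raft log entries below newFirstIndex. If the KV engine has
// not durably applied up through newFirstIndex-1, it must be synced first:
// otherwise a crash could lose both the log entries and their applied effects.
func (s *state) truncate(newFirstIndex uint64) {
	if s.syncedAppliedIndex < newFirstIndex {
		s.kvSyncs++                              // explicit sync of the KV engine
		s.syncedAppliedIndex = newFirstIndex - 1 // now durable up through the truncated entries
	}
	s.firstLogIndex = newFirstIndex
}

func main() {
	s := &state{syncedAppliedIndex: 100, firstLogIndex: 1}
	s.truncate(50)  // already durable past index 50: no sync needed
	s.truncate(200) // not yet durable: forces a KV-engine sync
	fmt.Println(s.kvSyncs, s.firstLogIndex)
}
```

The point of the model is that the expensive sync only fires when truncation outruns the synced applied state, which matches the "sync whenever the synced AppliedIndex < new first log index" conclusion above.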
Force-pushed 3b0e6c9 to 9673bd8.
Review status: 0 of 1 files reviewed at latest revision, 11 unresolved discussions, some commit checks pending. docs/RFCS/dedicated_raft_storage.md, line 280 at r5 (raw file): Previously, bdarnell (Ben Darnell) wrote…
Expanded the paragraph further for clarity and added takeaways from the discussion here.
Reviewed 1 of 1 files at r8.
Each Replica is backed by a single instance of RocksDB which is used to store all modifications to the underlying state machine in addition to storing all consensus state. This RFC proposes the separation of the two, outlines the motivations for doing so, and discusses the alternatives considered.
Implements cockroachdb#16361. This is a breaking change. To see why, consider that prior to this we stored all consensus data, in addition to all system metadata and user-level keys, in the same, single RocksDB instance. Here we introduce a separate, dedicated instance for raft data (log entries and HardState). Cockroach nodes simply restarting with these changes, unless migrated properly, will fail to find the most recent raft log entries and HardState data in the new RocksDB instance.

Also consider a cluster running mixed versions (nodes with dedicated raft storage and nodes without): what would the communication between nodes look like in light of proposer-evaluated KV? Currently we propagate a storagebase.WriteBatch through raft containing a serialized representation of a RocksDB write batch; this models the changes to be made to the single underlying RocksDB instance. For log truncation requests where we delete log entries and/or admin splits where we write the initial HardState for newly formed replicas, we need to similarly propagate a write batch (through raft) addressing the new RocksDB instance (if the recipient node is one with these changes) or the original RocksDB instance (if the recipient node is one without these changes). What if an older-version node is the raft leader and is therefore the one upstream of raft, propagating storagebase.WriteBatches with raft data changes but addressed to the original RocksDB instance? What would rollbacks look like?

To this end we introduce three modes of operation: {Disabled,Transitioning,Enabled}RaftStorage. We've made it so that it is safe to transition from DisabledRaftStorage to TransitioningRaftStorage, from TransitioningRaftStorage to EnabledRaftStorage, and the reverse for rollbacks. Transition from one mode to the next will take place when all the nodes in the cluster are on the same previous mode.
The operation mode is set by an env var COCKROACH_DEDICATED_RAFT_STORAGE={DISABLED,TRANSITIONING,ENABLED}:
- DisabledRaftStorage mode preserves the previous behavior in that we use a single RocksDB instance for both raft and user-level KV data.
- EnabledRaftStorage mode enables the use of the dedicated RocksDB instance for raft data. Raft log entries and the HardState are stored on this instance alone.
- TransitioningRaftStorage mode uses both RocksDB instances for raft data interoperably, the raft-specific and the regular instance. We use this mode to facilitate rolling upgrades.

Most of this commit is careful plumbing of an extra engine.{Engine,Batch,Reader,Writer,ReadWriter} for whenever we need to interact with the new RocksDB instance. In DisabledRaftStorage mode both these instances refer to the same underlying engine (thus preserving the previous behaviour). The following pattern is oft repeated:

    batch := ...
    batchRaft := batch
    if TransitioningRaftStorage || EnabledRaftStorage {
        batchRaft = ...
    }

Here are some initial performance numbers:

    ~ benchstat perf-disabled perf-enabled
    name                             old time/op  new time/op  delta
    ReplicaRaftStorage/vs=1024-4      320µs ± 6%   385µs ±18%  +20.29%  (p=0.000 n=10+10)
    ReplicaRaftStorage/vs=4096-4      613µs ±14%   580µs ± 2%     ~     (p=0.278 n=10+9)
    ReplicaRaftStorage/vs=16384-4    2.59ms ± 3%  2.05ms ± 4%  -20.87%  (p=0.000 n=10+9)
    ReplicaRaftStorage/vs=65536-4    4.11ms ± 7%  3.29ms ± 3%  -19.97%  (p=0.000 n=10+10)
    ReplicaRaftStorage/vs=262144-4   13.4ms ± 8%  10.7ms ± 3%  -20.39%  (p=0.000 n=10+10)
    ReplicaRaftStorage/vs=1048576-4  56.8ms ± 3%  36.4ms ± 2%  -35.91%  (p=0.000 n=10+10)
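The oft-repeated batch/batchRaft pattern from the commit message above can be made concrete with a toy sketch. The `Batch` type and `writeRaftEntry` helper below are hypothetical stand-ins for the real `engine.Batch` plumbing; only the routing logic mirrors the quoted pattern.

```go
package main

import "fmt"

// Batch is a toy stand-in for engine.Batch; the real plumbing threads
// engine.{Engine,Batch,Reader,Writer,ReadWriter} through the storage package.
type Batch struct {
	name string
	ops  []string
}

func (b *Batch) Put(key, value string) {
	b.ops = append(b.ops, key+"="+value)
}

// Operation modes as named in the commit message.
const (
	DisabledRaftStorage = iota
	TransitioningRaftStorage
	EnabledRaftStorage
)

// writeRaftEntry routes a raft-specific write to the dedicated raft batch when
// a separate instance is in use, and to the shared batch otherwise -- the same
// shape as the quoted batch/batchRaft pattern.
func writeRaftEntry(mode int, batch, batchRaft *Batch, key, value string) *Batch {
	b := batch
	if mode == TransitioningRaftStorage || mode == EnabledRaftStorage {
		b = batchRaft
	}
	b.Put(key, value)
	return b
}

func main() {
	kv := &Batch{name: "kv"}
	raft := &Batch{name: "raft"}
	b := writeRaftEntry(EnabledRaftStorage, kv, raft, "raftlog/5", "entry")
	fmt.Println(b.name) // the dedicated raft batch received the write
}
```

In DisabledRaftStorage mode the helper degenerates to writing the shared batch, which is how the commit preserves the previous single-instance behaviour.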
Implements cockroachdb#16361. This is a breaking change. To see why, consider that prior to this we stored all consensus data, in addition to all system metadata and user-level keys, in the same, single RocksDB instance. Here we introduce a separate, dedicated instance for raft data (log entries and HardState). Cockroach nodes simply restarting with these changes, unless migrated properly, will fail to find the most recent raft log entries and HardState data in the new RocksDB instance.

Also consider a cluster running mixed versions (nodes with dedicated raft storage and nodes without): what would the communication between nodes look like in light of proposer-evaluated KV? Currently we propagate a storagebase.WriteBatch through raft containing a serialized representation of a RocksDB write batch; this models the changes to be made to the single underlying RocksDB instance. For log truncation requests where we delete log entries and/or admin splits where we write the initial HardState for newly formed replicas, we need to similarly propagate a write batch (through raft) addressing the new RocksDB instance (if the recipient node is one with these changes) or the original RocksDB instance (if the recipient node is one without these changes). What if an older-version node is the raft leader and is therefore the one upstream of raft, propagating storagebase.WriteBatches with raft data changes but addressed to the original RocksDB instance? What would rollbacks look like?

To this end we introduce two explicit modes of operation, transitioningRaftStorage and enabledRaftStorage (the latter is implicit if we're not in transitioning mode). We've made it so that it is safe to transition from an older cockroach version to transitioningRaftStorage, from transitioningRaftStorage to enabled, and the reverse for rollbacks. Transition from one mode to the next will take place when all the nodes in the cluster are on the same previous mode.
The operation mode is set by an env var COCKROACH_DEDICATED_RAFT_STORAGE={DISABLED,TRANSITIONING,ENABLED}:
- In the old version we use a single RocksDB instance for both raft and user-level KV data.
- In transitioningRaftStorage mode we use both RocksDB instances for raft data interoperably, the raft-specific and the regular instance. We use this mode to facilitate rolling upgrades.
- In enabled mode we use the dedicated RocksDB instance for raft data. Raft log entries and the HardState are stored on this instance alone.

Most of this commit is careful plumbing of an extra engine.{Engine,Batch,Reader,Writer,ReadWriter} for whenever we need to interact with the new RocksDB instance.
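A minimal sketch of reading the env var into a mode follows. The type name and the defaulting behavior (unset or unrecognized values falling back to the old single-instance behavior) are assumptions for illustration, not taken from the actual commit.

```go
package main

import (
	"fmt"
	"os"
)

// RaftStorageMode mirrors the three operation modes described above; the
// names and default handling here are hypothetical.
type RaftStorageMode int

const (
	DisabledRaftStorage RaftStorageMode = iota
	TransitioningRaftStorage
	EnabledRaftStorage
)

// modeFromEnv maps COCKROACH_DEDICATED_RAFT_STORAGE to a mode, falling back
// to the pre-change behavior (disabled) for unset or unrecognized values.
func modeFromEnv(value string) RaftStorageMode {
	switch value {
	case "ENABLED":
		return EnabledRaftStorage
	case "TRANSITIONING":
		return TransitioningRaftStorage
	default:
		return DisabledRaftStorage
	}
}

func main() {
	mode := modeFromEnv(os.Getenv("COCKROACH_DEDICATED_RAFT_STORAGE"))
	fmt.Println(mode)
}
```

Defaulting to disabled matters for rolling upgrades: a node started without the env var must behave exactly like an old-version node.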