-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
storage: skip Pebble write-ahead log during Raft entry application #38322
Comments
Like mentioned earlier in Slack, this sounds really promising. Do you have any intuition on what benchmark needles this would move, and by how much? |
I really like the idea of using the raft log as the WAL, and decoupling it as much as possible from writing of data pages. I'm seeing a number of systems do something similar (ex. https://www.microsoft.com/en-us/research/uploads/prod/2019/05/socrates.pdf). The idea is that the WAL gets stored to a "log service", which makes it durable and then applies that in distributed fashion to various page caches and stores. In Socrates, there are 3 levels of data pages. When a data page is needed, the local cache (usually on local SSD) is checked. If the page is not there, then the much larger (but slower) page servers are consulted. If those don't have it, then the page is fetched from super-cheap, ultra-high-capacity (but higher latency) blob storage. Assuming we eventually scale up into 100's of TBs or petabytes of data, we may want to consider similar ideas. They also allow for more efficient use of EBS, since currently when we use EBS we effectively store 9 copies of the data (3 CRDB replicas * 3 EBS replicas of the data). Having a clean, well-defined separation between log storage and data storage will be essential to opening up these and other future possibilities. |
@nvanbenschoten Raft log entries and applied commands are not the only data we write to RocksDB. For example, gossip stores bootstrap info in the local RocksDB instance. I think we'd want these writes to have the same guarantees as they currently do. You've probably already thought about this, but I think it should be explicit here: the only RocksDB writes which would avoid the WAL are those for Raft command application. We'll need to check the behavior of RocksDB with regards to mixing writes which use the WAL and skip the WAL in the same instance. In Pebble, this will work as you'd expect, because the Pebble commit pipeline writes each batch to the WAL individually, while RocksDB tries to combine batches. I recall there was some special code in the RocksDB commit path to handle this scenario, but can't recall the details. And our grouping of batches in |
It's tough to say for sure, but we can use rafttoy to get an idea. I ran rafttoy with its Raft Log + Data Combined vs. Raft Log + Data (with WAL) Split
In the second set of numbers, I'm comparing a configuration with single Pebble instance for the Raft log and data storage against a configuration with two Pebble instances, one for the Raft log and one for data storage. In this scenario, the Pebble instance used only for data storage has its WAL disabled, so Raft log application is not writing to a WAL at all. We again see that at peak throughput levels the split instances roughly doubles throughput. Raft Log + Data Combined vs. Raft Log + Data (without WAL) Split
Aggregated together, we see that the maximum throughputs of each configuration are:
Note: I'm not sure what to make of the lower concurrencies being better with a merged Pebble instance. I wonder if we're so far away from saturation at these concurrencies that the extra WAL writes are somehow smoothing syncs out. Perhaps @petermattis has some intuition here.
I agree. Even if we don't move to distinct log and data "services" like Microsoft does with Socrates in that paper, keeping the abstractions separate with a clear boundary between them opens up a lot of flexibility for us to place different constraints on those components. @andreimatei has been thinking in this area recently as well.
I hadn't thought about this yet, but you're right. We'll still need a WAL for the data engine and we will need to be careful about which writes to the data engine can skip the WAL (i.e. Raft entry application) and which will still need to go in the WAL. This may require us to re-evaluate the level of durability we demand of various writes throughout our system, which I expect to be a healthy exercise. I've always been a little nervous that the rate at which we sync our single RocksDB WAL might be good at hiding areas where we're not clear enough about durability requirements necessary for correctness. The subtlety of this in areas like etcd-io/etcd#7625 (comment) shines a light on how tricky this is, though perhaps we're already very conservative (i.e. always sync) and the fear is unfounded. |
I put together a prototype that sets The prototype gave a noticeable win, but it was less pronounced than I was hoping for:
This required me to split the I also plan to rebase the prototype on top of #38568 and re-run the numbers once that's merged. |
Hmm, those numbers are not nearly as compelling as what you were seeing before. I recall you mentioning in person that using an actual separate RocksDB instance for the Raft log appeared to give a significant win for rafttoy.
I seem to recall that the RocksDB commit pipeline internally has some funkiness when batches have different options. It is possible that the source of the problem is there. And it is possible that using a single RocksDB instance is the problem itself. The WAL-skipping batches still need to wait for the previous WAL-writing batches to sync. Now that I think about it, that seems likely to be the problem. |
I talked to @ajkr about this and he said that it's likely that RocksDB is batching the two groups together under the hood. He's going to look into whether there's an easy way to avoid this. Perhaps we can disable this batching entirely since we're already doing it ourselves? If not, this change may require the raft log / data store split. Even with that split, we'll need to be careful about this effect, given your point in #38322 (comment) about some writes still requiring durability in the data store engine. |
Yeah, they might be grouped into the same uber-batch. But even without that grouping there is a problem. Consider what happens if there is a commit of batch A that writes to the WAL and a concurrent commit of batch B that skips the WAL. If the commit of B happens after the commit of A it will get queued up behind A and require waiting for A to be written to the WAL and then applied to the memtable. The sync of the WAL itself will happen afterwards, though. Hmm, I'm not sure if this is a problem or not. |
The above prototype focused on qps performance numbers, but could the more important dimension here be the IOPS? If the WAL is essentially off for the data engine it would only be flushing its memtable periodically. Roughly speaking now we write every bit three times: once into the raft log wal (and never out to an SST, since it's hopefully tombstoned before it gets there, though we then need to flush out the tombstone - I'm going to ignore this, we have thought about SingleDelete to avoid this), once into the data wal, and once into a (data) SST. Shouldn't average IOPS go down by around 33% with the prototype? |
My mental model is a little unclear here about how extra writes to the WAL map to extra IOPS. Certainly if we were syncing twice as much, we would expect double the number of writes, but we're not. When @ajkr took a look at why we weren't seeing the expected improvement here, he found that the extra WAL writes were bloating the amount we wrote to the WAL as expected, but that we were writing so little between syncs that the extra writes weren't causing us to spill into extra SSD pages, so the number of writes from the SSD's perspective remained constant. There's definitely some experimentation to be done here. |
I wonder whether we should go a step further and in cloud deployments move the raft logs of a node to a separate EBS volume (EBS is virtualized storage, and is already charged in a fine-grained manner, so there shouldn't be a cost increase). This would help with "repaving" by transferring the EBS volume corresponding to the state machine, which is where most of the data volume is. Such a transfer would not reduce resiliency since the old node would still have a raft log and can participate in the quorum (it just can't apply since it has no state machine). |
Interesting idea. I think the biggest blocker to acheiving this in cockroach is the fact that we share the storage engine containing the state machine for many ranges in a single store (LSM and EBS volume). If we had a separate LSM per range, this sort of thing becomes more feasible. It does carry other questions involve safe log truncation and snapshots. |
While I'm here, I'd like to raise a point that has come up offline a number of times. Using a separate volume for applied state might make new hardware configurations reasonable. In particular, it might be reasonable to use a single, very fast storage device for the raft logs and then apply entries onto HDDs. In a cloud setting this might look like a small, but heavily IOP provisioned EBS volume. Server configurations which lots of large disks and a single, smallish SSD are not uncommon for cool-cold storage applications. Such a server is now rather unadvisable with cockroach. Perhaps with a separate storage engine, one could achieve reasonably good performance for write-mostly workloads at a dramatically lower cost per byte. |
Not clear to me why we need an LSM per range. I was specifically thinking of the repaving/decommissioning use case, where all the ranges are being moved to a new node (the latest number for a 10TB node decommission was 12 hours, I think). Am I missing something? If we also consider rebalancing, the following is a copy-paste of an idea I had mentioned to the storage folks a few months ago
|
One insight we can steal from this issue that may also be helpful for placing the state machine in disaggregated storage is that the state machine itself stores the Raft applied index of each Range in the In a way, this is saying the same thing as the rest of this issue. Only, this issue is describing how a replica's local LSM's memtable could be ephemeral. The extension here is expanding this to a replica whose entire local LSM is ephemeral, assuming we have some other durable store backing a consistent prefix of it. Interestingly, such an architecture would motivate replicated log truncation again, because log truncation would be a function of the single durable state machine. Of course, this leaves out all the hard details.
An LSM per range makes these kinds of things easier because we can associate a set of files with exactly one Range. So it becomes more straightforward to dump a Range's state machine to something like S3. Or to pull a stale snapshot of a Range in from S3. |
As much as I'm a fan of these offhand discussions about sticking the applied state of a range in some external storage system, it feels quite premature and speculative. Maybe there are places we want to take cockroach where pursuing such a strategy makes sense but there will have to likely be a clear and tactical motivation for such a change. Either way, better isolating the raft state and applied state seems beneficial along that path as well as supporting a good many other goals. Here's the way I'd break it down: Soon:
Then, speculatively:
|
I think this may lead to extra overhead for io when make raft log and wal together into a log engine, it may lead to io problems unless it is a log engine shared by all raft groups. Here is a benchmark: tested on a 7200 rpm hdd not share log engine:
share log engine:
not share log engine code:
share log engine code:
|
That is the plan. One engine for the raft logs, one for the applied state. |
What's the design proposal for sharing the raft logs? Although adopting key-value engines such as pebble to store the separate raft log could be easy, while it would introduce redundant IO since pebble has its own WAL,too. Designing a dedicate log store for shariing raft logs is non-trivial, given no redundant IO permitted. |
In the ReplicasStorage design we stop making any assumptions regarding what is durable in the state machine when syncing a batch that commits changes to the raft log. This implies the need to make raft log truncation more loosely coupled than it is now, since we can truncate only when certain that the state machine is durable up to the truncation index. Current raft log truncation flows through raft and even though the RaftTruncatedStateKey is not a replicated key, it is coupled in the sense that the truncation is done below raft when processing the corresponding log entry (that asked for truncation to be done). The current setup also has correctness issues wrt maintaining the raft log size, when passing the delta bytes for a truncation. We compute the delta at proposal time (to avoid repeating iteration over the entries in all replicas), but we do not pass the first index corresponding to the truncation, so gaps or overlaps cannot be noticed at truncation time. We do want to continue to have the raft leader guide the truncation since we do not want either leader or followers to over-truncate, given our desire to serve snapshots from any replica. In the loosely coupled approach implemented here, the truncation request that flows through raft serves as an upper bound on what can be truncated. The truncation request includes an ExpectedFirstIndex. This is further propagated using ReplicatedEvalResult.RaftExpectedFirstIndex. This ExpectedFirstIndex allows one to notice gaps or overlaps when enacting a sequence of truncations, which results in setting the Replica.raftLogSizeTrusted to false. The correctness issue with Replica.raftLogSize is not fully addressed since there are existing consistency issues when evaluating a TruncateLogRequest (these are now noted in a code comment). Below raft, the truncation requests are queued onto a Replica in pendingLogTruncations. The queueing and dequeuing is managed by a raftLogTruncator that takes care of merging pending truncation requests and enacting the truncations when the durability of the state machine advances. The pending truncation requests are taken into account in the raftLogQueue when deciding whether to do another truncation. Most of the behavior of the raftLogQueue is unchanged. The new behavior is gated on a LooselyCoupledRaftLogTruncation cluster version. Additionally, the new behavior can be turned off using the kv.raft_log.enable_loosely_coupled_truncation cluster setting, which is true by default. The latter is expected to be a safety switch for 1 release after which we expect to remove it. That removal will also cleanup some duplicated code (that was non-trivial to refactor and share) between the previous coupled and new loosely coupled truncation. Informs cockroachdb#36262 Informs cockroachdb#16624,cockroachdb#38322 Release note: None
In the ReplicasStorage design we stop making any assumptions regarding what is durable in the state machine when syncing a batch that commits changes to the raft log. This implies the need to make raft log truncation more loosely coupled than it is now, since we can truncate only when certain that the state machine is durable up to the truncation index. Current raft log truncation flows through raft and even though the RaftTruncatedStateKey is not a replicated key, it is coupled in the sense that the truncation is done below raft when processing the corresponding log entry (that asked for truncation to be done). The current setup also has correctness issues wrt maintaining the raft log size, when passing the delta bytes for a truncation. We compute the delta at proposal time (to avoid repeating iteration over the entries in all replicas), but we do not pass the first index corresponding to the truncation, so gaps or overlaps cannot be noticed at truncation time. We do want to continue to have the raft leader guide the truncation since we do not want either leader or followers to over-truncate, given our desire to serve snapshots from any replica. In the loosely coupled approach implemented here, the truncation request that flows through raft serves as an upper bound on what can be truncated. The truncation request includes an ExpectedFirstIndex. This is further propagated using ReplicatedEvalResult.RaftExpectedFirstIndex. This ExpectedFirstIndex allows one to notice gaps or overlaps when enacting a sequence of truncations, which results in setting the Replica.raftLogSizeTrusted to false. The correctness issue with Replica.raftLogSize is not fully addressed since there are existing consistency issues when evaluating a TruncateLogRequest (these are now noted in a code comment). Below raft, the truncation requests are queued onto a Replica in pendingLogTruncations. The queueing and dequeuing is managed by a raftLogTruncator that takes care of merging pending truncation requests and enacting the truncations when the durability of the state machine advances. The pending truncation requests are taken into account in the raftLogQueue when deciding whether to do another truncation. Most of the behavior of the raftLogQueue is unchanged. The new behavior is gated on a LooselyCoupledRaftLogTruncation cluster version. Additionally, the new behavior can be turned off using the kv.raft_log.enable_loosely_coupled_truncation cluster setting, which is true by default. The latter is expected to be a safety switch for 1 release after which we expect to remove it. That removal will also cleanup some duplicated code (that was non-trivial to refactor and share) between the previous coupled and new loosely coupled truncation. Informs cockroachdb#36262 Informs cockroachdb#16624,cockroachdb#38322 Release note: None
In the ReplicasStorage design we stop making any assumptions regarding what is durable in the state machine when syncing a batch that commits changes to the raft log. This implies the need to make raft log truncation more loosely coupled than it is now, since we can truncate only when certain that the state machine is durable up to the truncation index. Current raft log truncation flows through raft and even though the RaftTruncatedStateKey is not a replicated key, it is coupled in the sense that the truncation is done below raft when processing the corresponding log entry (that asked for truncation to be done). The current setup also has correctness issues wrt maintaining the raft log size, when passing the delta bytes for a truncation. We compute the delta at proposal time (to avoid repeating iteration over the entries in all replicas), but we do not pass the first index corresponding to the truncation, so gaps or overlaps cannot be noticed at truncation time. We do want to continue to have the raft leader guide the truncation since we do not want either leader or followers to over-truncate, given our desire to serve snapshots from any replica. In the loosely coupled approach implemented here, the truncation request that flows through raft serves as an upper bound on what can be truncated. The truncation request includes an ExpectedFirstIndex. This is further propagated using ReplicatedEvalResult.RaftExpectedFirstIndex. This ExpectedFirstIndex allows one to notice gaps or overlaps when enacting a sequence of truncations, which results in setting the Replica.raftLogSizeTrusted to false. The correctness issue with Replica.raftLogSize is not fully addressed since there are existing consistency issues when evaluating a TruncateLogRequest (these are now noted in a code comment). Below raft, the truncation requests are queued onto a Replica in pendingLogTruncations. The queueing and dequeuing is managed by a raftLogTruncator that takes care of merging pending truncation requests and enacting the truncations when the durability of the state machine advances. The pending truncation requests are taken into account in the raftLogQueue when deciding whether to do another truncation. Most of the behavior of the raftLogQueue is unchanged. The new behavior is gated on a LooselyCoupledRaftLogTruncation cluster version. Additionally, the new behavior can be turned off using the kv.raft_log.enable_loosely_coupled_truncation.enabled cluster setting, which is true by default. The latter is expected to be a safety switch for 1 release after which we expect to remove it. That removal will also cleanup some duplicated code (that was non-trivial to refactor and share) between the previous coupled and new loosely coupled truncation. Note, this PR is the first of two -- loosely coupled truncation is turned off via a constant in this PR. The next one will eliminate the constant and put it under the control of the cluster setting. Informs cockroachdb#36262 Informs cockroachdb#16624,cockroachdb#38322 Release note (ops change): The cluster setting kv.raft_log.enable_loosely_coupled_truncation.enabled can be used to disable loosely coupled truncation.
In the ReplicasStorage design we stop making any assumptions regarding what is durable in the state machine when syncing a batch that commits changes to the raft log. This implies the need to make raft log truncation more loosely coupled than it is now, since we can truncate only when certain that the state machine is durable up to the truncation index. Current raft log truncation flows through raft and even though the RaftTruncatedStateKey is not a replicated key, it is coupled in the sense that the truncation is done below raft when processing the corresponding log entry (that asked for truncation to be done). The current setup also has correctness issues wrt maintaining the raft log size, when passing the delta bytes for a truncation. We compute the delta at proposal time (to avoid repeating iteration over the entries in all replicas), but we do not pass the first index corresponding to the truncation, so gaps or overlaps cannot be noticed at truncation time. We do want to continue to have the raft leader guide the truncation since we do not want either leader or followers to over-truncate, given our desire to serve snapshots from any replica. In the loosely coupled approach implemented here, the truncation request that flows through raft serves as an upper bound on what can be truncated. The truncation request includes an ExpectedFirstIndex. This is further propagated using ReplicatedEvalResult.RaftExpectedFirstIndex. This ExpectedFirstIndex allows one to notice gaps or overlaps when enacting a sequence of truncations, which results in setting the Replica.raftLogSizeTrusted to false. The correctness issue with Replica.raftLogSize is not fully addressed since there are existing consistency issues when evaluating a TruncateLogRequest (these are now noted in a code comment). Below raft, the truncation requests are queued onto a Replica in pendingLogTruncations. The queueing and dequeuing is managed by a raftLogTruncator that takes care of merging pending truncation requests and enacting the truncations when the durability of the state machine advances. The pending truncation requests are taken into account in the raftLogQueue when deciding whether to do another truncation. Most of the behavior of the raftLogQueue is unchanged. The new behavior is gated on a LooselyCoupledRaftLogTruncation cluster version. Additionally, the new behavior can be turned off using the kv.raft_log.enable_loosely_coupled_truncation.enabled cluster setting, which is true by default. The latter is expected to be a safety switch for 1 release after which we expect to remove it. That removal will also cleanup some duplicated code (that was non-trivial to refactor and share) between the previous coupled and new loosely coupled truncation. Note, this PR is the first of two -- loosely coupled truncation is turned off via a constant in this PR. The next one will eliminate the constant and put it under the control of the cluster setting. Informs cockroachdb#36262 Informs cockroachdb#16624,cockroachdb#38322 Release note (ops change): The cluster setting kv.raft_log.enable_loosely_coupled_truncation.enabled can be used to disable loosely coupled truncation.
In the ReplicasStorage design we stop making any assumptions regarding what is durable in the state machine when syncing a batch that commits changes to the raft log. This implies the need to make raft log truncation more loosely coupled than it is now, since we can truncate only when certain that the state machine is durable up to the truncation index. Current raft log truncation flows through raft and even though the RaftTruncatedStateKey is not a replicated key, it is coupled in the sense that the truncation is done below raft when processing the corresponding log entry (that asked for truncation to be done). The current setup also has correctness issues wrt maintaining the raft log size, when passing the delta bytes for a truncation. We compute the delta at proposal time (to avoid repeating iteration over the entries in all replicas), but we do not pass the first index corresponding to the truncation, so gaps or overlaps cannot be noticed at truncation time. We do want to continue to have the raft leader guide the truncation since we do not want either leader or followers to over-truncate, given our desire to serve snapshots from any replica. In the loosely coupled approach implemented here, the truncation request that flows through raft serves as an upper bound on what can be truncated. The truncation request includes an ExpectedFirstIndex. This is further propagated using ReplicatedEvalResult.RaftExpectedFirstIndex. This ExpectedFirstIndex allows one to notice gaps or overlaps when enacting a sequence of truncations, which results in setting the Replica.raftLogSizeTrusted to false. The correctness issue with Replica.raftLogSize is not fully addressed since there are existing consistency issues when evaluating a TruncateLogRequest (these are now noted in a code comment). Below raft, the truncation requests are queued onto a Replica in pendingLogTruncations. The queueing and dequeuing is managed by a raftLogTruncator that takes care of merging pending truncation requests and enacting the truncations when the durability of the state machine advances. The pending truncation requests are taken into account in the raftLogQueue when deciding whether to do another truncation. Most of the behavior of the raftLogQueue is unchanged. The new behavior is gated on a LooselyCoupledRaftLogTruncation cluster version. Additionally, the new behavior can be turned off using the kv.raft_log.enable_loosely_coupled_truncation.enabled cluster setting, which is true by default. The latter is expected to be a safety switch for 1 release after which we expect to remove it. That removal will also cleanup some duplicated code (that was non-trivial to refactor and share) between the previous coupled and new loosely coupled truncation. Note, this PR is the first of two -- loosely coupled truncation is turned off via a constant in this PR. The next one will eliminate the constant and put it under the control of the cluster setting. Informs cockroachdb#36262 Informs cockroachdb#16624,cockroachdb#38322 Release note (ops change): The cluster setting kv.raft_log.enable_loosely_coupled_truncation.enabled can be used to disable loosely coupled truncation.
In the ReplicasStorage design we stop making any assumptions regarding what is durable in the state machine when syncing a batch that commits changes to the raft log. This implies the need to make raft log truncation more loosely coupled than it is now, since we can truncate only when certain that the state machine is durable up to the truncation index. Current raft log truncation flows through raft and even though the RaftTruncatedStateKey is not a replicated key, it is coupled in the sense that the truncation is done below raft when processing the corresponding log entry (that asked for truncation to be done). The current setup also has correctness issues wrt maintaining the raft log size, when passing the delta bytes for a truncation. We compute the delta at proposal time (to avoid repeating iteration over the entries in all replicas), but we do not pass the first index corresponding to the truncation, so gaps or overlaps cannot be noticed at truncation time. We do want to continue to have the raft leader guide the truncation since we do not want either leader or followers to over-truncate, given our desire to serve snapshots from any replica. In the loosely coupled approach implemented here, the truncation request that flows through raft serves as an upper bound on what can be truncated. The truncation request includes an ExpectedFirstIndex. This is further propagated using ReplicatedEvalResult.RaftExpectedFirstIndex. This ExpectedFirstIndex allows one to notice gaps or overlaps when enacting a sequence of truncations, which results in setting the Replica.raftLogSizeTrusted to false. The correctness issue with Replica.raftLogSize is not fully addressed since there are existing consistency issues when evaluating a TruncateLogRequest (these are now noted in a code comment). Below raft, the truncation requests are queued onto a Replica in pendingLogTruncations. The queueing and dequeuing is managed by a raftLogTruncator that takes care of merging pending truncation requests and enacting the truncations when the durability of the state machine advances. The pending truncation requests are taken into account in the raftLogQueue when deciding whether to do another truncation. Most of the behavior of the raftLogQueue is unchanged. The new behavior is gated on a LooselyCoupledRaftLogTruncation cluster version. Additionally, the new behavior can be turned off using the kv.raft_log.enable_loosely_coupled_truncation.enabled cluster setting, which is true by default. The latter is expected to be a safety switch for 1 release after which we expect to remove it. That removal will also cleanup some duplicated code (that was non-trivial to refactor and share) between the previous coupled and new loosely coupled truncation. Note, this PR is the first of two -- loosely coupled truncation is turned off via a constant in this PR. The next one will eliminate the constant and put it under the control of the cluster setting. Informs cockroachdb#36262 Informs cockroachdb#16624,cockroachdb#38322 Release note (ops change): The cluster setting kv.raft_log.loosely_coupled_truncation.enabled can be used to disable loosely coupled truncation.
In the ReplicasStorage design we stop making any assumptions regarding what is durable in the state machine when syncing a batch that commits changes to the raft log. This implies the need to make raft log truncation more loosely coupled than it is now, since we can truncate only when certain that the state machine is durable up to the truncation index. Current raft log truncation flows through raft and even though the RaftTruncatedStateKey is not a replicated key, it is coupled in the sense that the truncation is done below raft when processing the corresponding log entry (that asked for truncation to be done). The current setup also has correctness issues wrt maintaining the raft log size, when passing the delta bytes for a truncation. We compute the delta at proposal time (to avoid repeating iteration over the entries in all replicas), but we do not pass the first index corresponding to the truncation, so gaps or overlaps cannot be noticed at truncation time. We do want to continue to have the raft leader guide the truncation since we do not want either leader or followers to over-truncate, given our desire to serve snapshots from any replica. In the loosely coupled approach implemented here, the truncation request that flows through raft serves as an upper bound on what can be truncated. The truncation request includes an ExpectedFirstIndex. This is further propagated using ReplicatedEvalResult.RaftExpectedFirstIndex. This ExpectedFirstIndex allows one to notice gaps or overlaps when enacting a sequence of truncations, which results in setting the Replica.raftLogSizeTrusted to false. The correctness issue with Replica.raftLogSize is not fully addressed since there are existing consistency issues when evaluating a TruncateLogRequest (these are now noted in a code comment). Below raft, the truncation requests are queued onto a Replica in pendingLogTruncations. The queueing and dequeuing is managed by a raftLogTruncator that takes care of merging pending truncation requests and enacting the truncations when the durability of the state machine advances. The pending truncation requests are taken into account in the raftLogQueue when deciding whether to do another truncation. Most of the behavior of the raftLogQueue is unchanged. The new behavior is gated on a LooselyCoupledRaftLogTruncation cluster version. Additionally, the new behavior can be turned off using the kv.raft_log.enable_loosely_coupled_truncation.enabled cluster setting, which is true by default. The latter is expected to be a safety switch for 1 release after which we expect to remove it. That removal will also cleanup some duplicated code (that was non-trivial to refactor and share) between the previous coupled and new loosely coupled truncation. Note, this PR is the first of two -- loosely coupled truncation is turned off via a constant in this PR. The next one will eliminate the constant and put it under the control of the cluster setting. Informs cockroachdb#36262 Informs cockroachdb#16624,cockroachdb#38322 Release note (ops change): The cluster setting kv.raft_log.loosely_coupled_truncation.enabled can be used to disable loosely coupled truncation.
In the ReplicasStorage design we stop making any assumptions regarding what is durable in the state machine when syncing a batch that commits changes to the raft log. This implies the need to make raft log truncation more loosely coupled than it is now, since we can truncate only when certain that the state machine is durable up to the truncation index. Current raft log truncation flows through raft and even though the RaftTruncatedStateKey is not a replicated key, it is coupled in the sense that the truncation is done below raft when processing the corresponding log entry (that asked for truncation to be done). The current setup also has correctness issues wrt maintaining the raft log size, when passing the delta bytes for a truncation. We compute the delta at proposal time (to avoid repeating iteration over the entries in all replicas), but we do not pass the first index corresponding to the truncation, so gaps or overlaps cannot be noticed at truncation time. We do want to continue to have the raft leader guide the truncation since we do not want either leader or followers to over-truncate, given our desire to serve snapshots from any replica. In the loosely coupled approach implemented here, the truncation request that flows through raft serves as an upper bound on what can be truncated. The truncation request includes an ExpectedFirstIndex. This is further propagated using ReplicatedEvalResult.RaftExpectedFirstIndex. This ExpectedFirstIndex allows one to notice gaps or overlaps when enacting a sequence of truncations, which results in setting the Replica.raftLogSizeTrusted to false. The correctness issue with Replica.raftLogSize is not fully addressed since there are existing consistency issues when evaluating a TruncateLogRequest (these are now noted in a code comment). Below raft, the truncation requests are queued onto a Replica in pendingLogTruncations. The queueing and dequeuing is managed by a raftLogTruncator that takes care of merging pending truncation requests and enacting the truncations when the durability of the state machine advances. The pending truncation requests are taken into account in the raftLogQueue when deciding whether to do another truncation. Most of the behavior of the raftLogQueue is unchanged. The new behavior is gated on a LooselyCoupledRaftLogTruncation cluster version. Additionally, the new behavior can be turned off using the kv.raft_log.enable_loosely_coupled_truncation.enabled cluster setting, which is true by default. The latter is expected to be a safety switch for 1 release after which we expect to remove it. That removal will also cleanup some duplicated code (that was non-trivial to refactor and share) between the previous coupled and new loosely coupled truncation. Note, this PR is the first of two -- loosely coupled truncation is turned off via a constant in this PR. The next one will eliminate the constant and put it under the control of the cluster setting. Informs cockroachdb#36262 Informs cockroachdb#16624,cockroachdb#38322 Release note (ops change): The cluster setting kv.raft_log.loosely_coupled_truncation.enabled can be used to disable loosely coupled truncation.
cc @cockroachdb/replication |
We have marked this issue as stale because it has been inactive for |
CockroachDB currently stores Raft log entries and data in the same RocksDB storage engine. Both log entries and their applied state are written to the RocksDB WAL, even though only log entries are required to be durable before acknowledging writes. We use this fact to avoid a call to fdatasync when applying Raft log entries, opting to only do so when appending them to a Replica's Raft log. #17500 suggests that we could even go further and delay Raft log application significantly, moving it to an asynchronous process.
Problems
This approach has two main performance problems:
Raft Log Entries + Data in a shared WAL
Raft log appends are put in the same WAL as written (but not fsync-ed) applied entry state. This means that even though applied entry state in the WAL isn't immediately made durable, it is synced almost immediately by anyone appending log entries. This slows down the process of appending to the Raft log because doing so often needs to flush more of the WAL than it's intending to.
#7807 observed that we could move the Raft log to a separate storage engine and that doing so would avoid this issue. The Raft log would use its own write-ahead log and would get tighter control over the amount of data that is flushed when it calls fdatasync. That change also hints at the use of a specialized storage engine for Raft log entries. The unmerged implementation ended up using RocksDB, but it could have opted for something else.
Data written to a WAL
Past the issue with applied state being written to the same WAL as the Raft log, it is also a problem that applied state is written to a WAL, to begin with. A write-ahead log is meant to provide control over durability and atomicity of writes, but the Raft log already serves this purpose by durably and atomically recording the writes that will soon be applied. Paying for twice the durability has a cost - write amplification. A piece of data is often written to the Raft log WAL, the Raft log LSM (if not deleted in the memtable, see #8979 (comment)), the data engine WAL, and the data engine LSM. This extra write amplification reduces the overall write throughput that the system can support. Ideally, a piece of data would only be written to the Raft log WAL and the data engine LSM.
Proposed Solution
One fairly elegant solution to address both of these concerns is to move the Raft log to a separate storage engine and to disable the data engine's WAL. Doing so on its own almost works, but it runs into issues with crash recovery and with Raft log truncation.
Both of these concerns are due to the same root cause - naively, this approach makes it hard to know when applied entry state has become durable. This is hard to determine because applied entry state skips any WAL and is added only to RocksDB's memtable. To determine when this data becomes durable, we would need to know when data is compacted from the memtable to L0 in the LSM. Furthermore, we often would like to know which Raft log entries this now-durable data corresponds to, which is difficult to answer. For instance, to determine how much of the Raft log of a Replica we can safely truncate, we'd like to ask the question: "what is the highest applied index for this Replica that is durably applied?".
The solution here is to use the
RangeAppliedStateKey
, which is updated when each log entry is applied (or at least in the same batch of entries, see #37426 (comment)). This key contains the RaftAppliedIndex, which in a sense allows us to map a RocksDB sequence number to a Raft log index. We then ask the question: "what is the durable state of this key?". To answer this question, we can query this key using theReadTier::kPersistedTier
option inReadOptions.read_tier
. This instructs the lookup to skip the memtable and only read from the LSM, which is the exact set of semantics that we want.Using this technique, we can then ask the question about the highest applied Raft log index for a particular Replica. We can then use this, in combination with recent improvements in #34660, to perform Raft log truncation only up to durable applied log indexes. To do this, the
raftLogQueue
on a Replica would simply query theRangeAppliedState.raft_applied_index
usingReadTier::kPersistedTier
and bound the index it can truncate to up to this value. Similarly, crash recovery could use the same strategy to know where it needs to reapply from. In the second case, explicitly usingReadTier::kPersistedTier
might not be necessary because the memtable updates from before the crash will already be gone and the memtable will be empty.For all of this to work, we need to ensure that we maintain the property that the application of a Raft log entry is always in the same RocksDB batch as the corresponding update to the
RangeAppliedStateKey
and that RockDB will always guarantee that these updates will atomically move from the memtable to the LSM. These properties should be true currently.The solution fixes both of the performance problems listed above and should increase write throughput in CockroachDB. This solution also opens CockroachDB up to future flexibility with the Raft log storage engine. Even if it was initially moved to another RocksDB instance, it could be specialized and iterated on in isolation going forward.
Jira issue: CRDB-5635
Epic CRDB-40197
The text was updated successfully, but these errors were encountered: