Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

storage: skip Pebble write-ahead log during Raft entry application #38322

Open
nvanbenschoten opened this issue Jun 20, 2019 · 25 comments
Open
Assignees
Labels
A-storage Relating to our storage engine (Pebble) on-disk storage. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) C-performance Perf of queries or internals. Solution not expected to change functional behavior. T-kv KV Team

Comments

@nvanbenschoten
Copy link
Member

nvanbenschoten commented Jun 20, 2019

CockroachDB currently stores Raft log entries and data in the same RocksDB storage engine. Both log entries and their applied state are written to the RocksDB WAL, even though only log entries are required to be durable before acknowledging writes. We use this fact to avoid a call to fdatasync when applying Raft log entries, opting to only do so when appending them to a Replica's Raft log. #17500 suggests that we could even go further and delay Raft log application significantly, moving it to an asynchronous process.

Problems

This approach has two main performance problems:

Raft Log Entries + Data in a shared WAL

Raft log appends are put in the same WAL as written (but not fsync-ed) applied entry state. This means that even though applied entry state in the WAL isn't immediately made durable, it is synced almost immediately by anyone appending log entries. This slows down the process of appending to the Raft log because doing so often needs to flush more of the WAL than it's intending to.

#7807 observed that we could move the Raft log to a separate storage engine and that doing so would avoid this issue. The Raft log would use its own write-ahead log and would get tighter control over the amount of data that is flushed when it calls fdatasync. That change also hints at the use of a specialized storage engine for Raft log entries. The unmerged implementation ended up using RocksDB, but it could have opted for something else.

Data written to a WAL

Past the issue with applied state being written to the same WAL as the Raft log, it is also a problem that applied state is written to a WAL, to begin with. A write-ahead log is meant to provide control over durability and atomicity of writes, but the Raft log already serves this purpose by durably and atomically recording the writes that will soon be applied. Paying for twice the durability has a cost - write amplification. A piece of data is often written to the Raft log WAL, the Raft log LSM (if not deleted in the memtable, see #8979 (comment)), the data engine WAL, and the data engine LSM. This extra write amplification reduces the overall write throughput that the system can support. Ideally, a piece of data would only be written to the Raft log WAL and the data engine LSM.

Proposed Solution

One fairly elegant solution to address both of these concerns is to move the Raft log to a separate storage engine and to disable the data engine's WAL. Doing so on its own almost works, but it runs into issues with crash recovery and with Raft log truncation.

Both of these concerns are due to the same root cause - naively, this approach makes it hard to know when applied entry state has become durable. This is hard to determine because applied entry state skips any WAL and is added only to RocksDB's memtable. To determine when this data becomes durable, we would need to know when data is compacted from the memtable to L0 in the LSM. Furthermore, we often would like to know which Raft log entries this now-durable data corresponds to, which is difficult to answer. For instance, to determine how much of the Raft log of a Replica we can safely truncate, we'd like to ask the question: "what is the highest applied index for this Replica that is durably applied?".

The solution here is to use the RangeAppliedStateKey, which is updated when each log entry is applied (or at least in the same batch of entries, see #37426 (comment)). This key contains the RaftAppliedIndex, which in a sense allows us to map a RocksDB sequence number to a Raft log index. We then ask the question: "what is the durable state of this key?". To answer this question, we can query this key using the ReadTier::kPersistedTier option in ReadOptions.read_tier. This instructs the lookup to skip the memtable and only read from the LSM, which is the exact set of semantics that we want.

Using this technique, we can then ask the question about the highest applied Raft log index for a particular Replica. We can then use this, in combination with recent improvements in #34660, to perform Raft log truncation only up to durable applied log indexes. To do this, the raftLogQueue on a Replica would simply query the RangeAppliedState.raft_applied_index using ReadTier::kPersistedTier and bound the index it can truncate to up to this value. Similarly, crash recovery could use the same strategy to know where it needs to reapply from. In the second case, explicitly using ReadTier::kPersistedTier might not be necessary because the memtable updates from before the crash will already be gone and the memtable will be empty.

For all of this to work, we need to ensure that we maintain the property that the application of a Raft log entry is always in the same RocksDB batch as the corresponding update to the RangeAppliedStateKey and that RockDB will always guarantee that these updates will atomically move from the memtable to the LSM. These properties should be true currently.

The solution fixes both of the performance problems listed above and should increase write throughput in CockroachDB. This solution also opens CockroachDB up to future flexibility with the Raft log storage engine. Even if it was initially moved to another RocksDB instance, it could be specialized and iterated on in isolation going forward.

Jira issue: CRDB-5635

Epic CRDB-40197

@nvanbenschoten nvanbenschoten added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) C-performance Perf of queries or internals. Solution not expected to change functional behavior. A-storage Relating to our storage engine (Pebble) on-disk storage. labels Jun 20, 2019
@tbg
Copy link
Member

tbg commented Jun 20, 2019

Like mentioned earlier in Slack, this sounds really promising. Do you have any intuition on what benchmark needles this would move, and by how much?

@andy-kimball
Copy link
Contributor

I really like the idea of using the raft log as the WAL, and decoupling it as much as possible from writing of data pages. I'm seeing a number of systems do something similar (ex. https://www.microsoft.com/en-us/research/uploads/prod/2019/05/socrates.pdf). The idea is that the WAL gets stored to a "log service", which makes it durable and then applies that in distributed fashion to various page caches and stores. In Socrates, there are 3 levels of data pages. When a data page is needed, the local cache (usually on local SSD) is checked. If the page is not there, then the much larger (but slower) page servers are consulted. If those don't have it, then the page is fetched from super-cheap, ultra-high-capacity (but higher latency) blob storage.

Assuming we eventually scale up into 100's of TBs or petabytes of data, we may want to consider similar ideas. They also allow for more efficient use of EBS, since currently when we use EBS we effectively store 9 copies of the data (3 CRDB replicas * 3 EBS replicas of the data). Having a clean, well-defined separation between log storage and data storage will be essential to opening up these and other future possibilities.

@petermattis
Copy link
Collaborator

@nvanbenschoten Raft log entries and applied commands are not the only data we write to RocksDB. For example, gossip stores bootstrap info in the local RocksDB instance. I think we'd want these writes to have the same guarantees as they currently do. You've probably already thought about this, but I think it should be explicit here: the only RocksDB writes which would avoid the WAL are those for Raft command application.

We'll need to check the behavior of RocksDB with regards to mixing writes which use the WAL and skip the WAL in the same instance. In Pebble, this will work as you'd expect, because the Pebble commit pipeline writes each batch to the WAL individually, while RocksDB tries to combine batches. I recall there was some special code in the RocksDB commit path to handle this scenario, but can't recall the details. And our grouping of batches in storage/engine might negate any RocksDB special handling. Cc @ajkr

@nvanbenschoten
Copy link
Member Author

Do you have any intuition on what benchmark needles this would move, and by how much?

It's tough to say for sure, but we can use rafttoy to get an idea. I ran rafttoy with its parallel-append pipeline, which most closely resembles Cockroach as it is today (plus #37426) and posted some numbers below. In the first set of numbers, I'm comparing a configuration with a single Pebble instance for the Raft log and data storage against a configuration with two Pebble instances, one for the Raft log and one for data storage. Nothing is changed with the WALs. This latter approach is essentially what was proposed and prototyped in #7807. We see that at peak throughput levels, the split instances roughly doubles throughput, which is presumably in part because we no longer have Raft entry application getting mixed in the same WAL as the Raft log writes.

Raft Log + Data Combined vs. Raft Log + Data (with WAL) Split

name                          old time/op    new time/op    delta
Raft/conc=32/bytes=256-16       60.0µs ± 9%    69.7µs ± 8%   +16.23%  (p=0.000 n=15+15)
Raft/conc=64/bytes=256-16       20.9µs ± 5%    28.5µs ± 1%   +36.23%  (p=0.000 n=15+11)
Raft/conc=128/bytes=256-16      12.2µs ± 2%    13.6µs ± 5%   +11.09%  (p=0.000 n=12+15)
Raft/conc=256/bytes=256-16      10.9µs ± 1%     8.8µs ± 8%   -18.52%  (p=0.000 n=11+15)
Raft/conc=512/bytes=256-16      11.1µs ± 2%     7.0µs ± 8%   -37.57%  (p=0.000 n=12+15)
Raft/conc=1024/bytes=256-16     10.0µs ±16%     5.8µs ±10%   -41.91%  (p=0.000 n=15+15)
Raft/conc=2048/bytes=256-16     10.3µs ±21%     5.2µs ± 5%   -49.96%  (p=0.000 n=14+14)
Raft/conc=4096/bytes=256-16     10.9µs ± 7%     5.2µs ± 4%   -52.40%  (p=0.000 n=13+14)
Raft/conc=8192/bytes=256-16     9.65µs ±23%    6.27µs ± 1%   -34.98%  (p=0.000 n=15+12)
Raft/conc=16384/bytes=256-16    11.2µs ± 8%    10.1µs ± 3%    -9.59%  (p=0.000 n=15+15)

name                          old speed      new speed      delta
Raft/conc=32/bytes=256-16     4.27MB/s ± 8%  3.68MB/s ± 8%   -13.92%  (p=0.000 n=15+15)
Raft/conc=64/bytes=256-16     12.3MB/s ± 5%   9.0MB/s ± 1%   -26.55%  (p=0.000 n=15+12)
Raft/conc=128/bytes=256-16    20.9MB/s ± 2%  18.9MB/s ± 5%    -9.90%  (p=0.000 n=12+15)
Raft/conc=256/bytes=256-16    23.6MB/s ± 1%  29.0MB/s ± 9%   +22.91%  (p=0.000 n=11+15)
Raft/conc=512/bytes=256-16    23.0MB/s ± 2%  36.9MB/s ± 8%   +60.34%  (p=0.000 n=12+15)
Raft/conc=1024/bytes=256-16   25.8MB/s ±18%  44.0MB/s ± 9%   +70.72%  (p=0.000 n=15+15)
Raft/conc=2048/bytes=256-16   24.0MB/s ± 9%  49.4MB/s ± 5%  +105.45%  (p=0.000 n=12+13)
Raft/conc=4096/bytes=256-16   23.5MB/s ± 7%  49.3MB/s ± 4%  +109.87%  (p=0.000 n=13+14)
Raft/conc=8192/bytes=256-16   27.1MB/s ±25%  40.8MB/s ± 1%   +50.36%  (p=0.000 n=15+12)
Raft/conc=16384/bytes=256-16  22.9MB/s ± 8%  25.2MB/s ± 3%   +10.45%  (p=0.000 n=15+15)

In the second set of numbers, I'm comparing a configuration with single Pebble instance for the Raft log and data storage against a configuration with two Pebble instances, one for the Raft log and one for data storage. In this scenario, the Pebble instance used only for data storage has its WAL disabled, so Raft log application is not writing to a WAL at all. We again see that at peak throughput levels the split instances roughly doubles throughput.

Raft Log + Data Combined vs. Raft Log + Data (without WAL) Split

name                          old time/op    new time/op    delta
Raft/conc=32/bytes=256-16       60.0µs ± 9%    67.1µs ± 5%   +11.80%  (p=0.000 n=15+15)
Raft/conc=64/bytes=256-16       20.9µs ± 5%    27.9µs ± 4%   +33.36%  (p=0.000 n=15+15)
Raft/conc=128/bytes=256-16      12.2µs ± 2%    12.8µs ± 7%    +4.66%  (p=0.003 n=12+15)
Raft/conc=256/bytes=256-16      10.9µs ± 1%     8.1µs ± 2%   -25.73%  (p=0.000 n=11+11)
Raft/conc=512/bytes=256-16      11.1µs ± 2%     6.0µs ±14%   -45.96%  (p=0.000 n=12+15)
Raft/conc=1024/bytes=256-16     10.0µs ±16%     5.0µs ± 5%   -50.18%  (p=0.000 n=15+15)
Raft/conc=2048/bytes=256-16     10.3µs ±21%     5.7µs ±25%   -44.45%  (p=0.000 n=14+15)
Raft/conc=4096/bytes=256-16     10.9µs ± 7%     5.5µs ±26%   -49.97%  (p=0.000 n=13+15)
Raft/conc=8192/bytes=256-16     9.65µs ±23%    6.03µs ± 4%   -37.47%  (p=0.000 n=15+15)
Raft/conc=16384/bytes=256-16    11.2µs ± 8%     9.4µs ± 2%   -16.18%  (p=0.000 n=15+15)

name                          old speed      new speed      delta
Raft/conc=32/bytes=256-16     4.27MB/s ± 8%  3.82MB/s ± 5%   -10.62%  (p=0.000 n=15+15)
Raft/conc=64/bytes=256-16     12.3MB/s ± 5%   9.2MB/s ± 4%   -25.02%  (p=0.000 n=15+15)
Raft/conc=128/bytes=256-16    20.9MB/s ± 2%  20.0MB/s ± 7%    -4.34%  (p=0.004 n=12+15)
Raft/conc=256/bytes=256-16    23.6MB/s ± 1%  31.7MB/s ± 2%   +34.65%  (p=0.000 n=11+11)
Raft/conc=512/bytes=256-16    23.0MB/s ± 2%  42.7MB/s ±13%   +85.75%  (p=0.000 n=12+15)
Raft/conc=1024/bytes=256-16   25.8MB/s ±18%  51.2MB/s ± 5%   +98.87%  (p=0.000 n=15+15)
Raft/conc=2048/bytes=256-16   24.0MB/s ± 9%  46.0MB/s ±29%   +91.39%  (p=0.000 n=12+15)
Raft/conc=4096/bytes=256-16   23.5MB/s ± 7%  48.1MB/s ±22%  +104.49%  (p=0.000 n=13+15)
Raft/conc=8192/bytes=256-16   27.1MB/s ±25%  42.5MB/s ± 4%   +56.43%  (p=0.000 n=15+15)
Raft/conc=16384/bytes=256-16  22.9MB/s ± 8%  27.2MB/s ± 2%   +19.11%  (p=0.000 n=15+15)

Aggregated together, we see that the maximum throughputs of each configuration are:

merged instance              = 27.1MB/s
split instance + data WAL    = 49.4MB/s
split instance + no data WAL = 51.2MB/s

Note: I'm not sure what to make of the lower concurrencies being better with a merged Pebble instance. I wonder if we're so far away from saturation at these concurrencies that the extra WAL writes are somehow smoothing syncs out. Perhaps @petermattis has some intuition here.

I really like the idea of using the raft log as the WAL, and decoupling it as much as possible from writing of data pages. I'm seeing a number of systems do something similar...

I agree. Even if we don't move to distinct log and data "services" like Microsoft does with Socrates in that paper, keeping the abstractions separate with a clear boundary between them opens up a lot of flexibility for us to place different constraints on those components. @andreimatei has been thinking in this area recently as well.

Raft log entries and applied commands are not the only data we write to RocksDB. For example, gossip stores bootstrap info in the local RocksDB instance. I think we'd want these writes to have the same guarantees as they currently do. You've probably already thought about this, but I think it should be explicit here: the only RocksDB writes which would avoid the WAL are those for Raft command application.

I hadn't thought about this yet, but you're right. We'll still need a WAL for the data engine and we will need to be careful about which writes to the data engine can skip the WAL (i.e. Raft entry application) and which will still need to go in the WAL. This may require us to re-evaluate the level of durability we demand of various writes throughout our system, which I expect to be a healthy exercise. I've always been a little nervous that the rate at which we sync our single RocksDB WAL might be good at hiding areas where we're not clear enough about durability requirements necessary for correctness. The subtlety of this in areas like etcd-io/etcd#7625 (comment) shines a light on how tricky this is, though perhaps we're already very conservative (i.e. always sync) and the fear is unfounded.

@nvanbenschoten
Copy link
Member Author

I put together a prototype that sets WriteOptions::disableWAL to true during Raft entry application. This would potentially allow us to address this issue without the need for a separate storage engine for the Raft log. I verified that the change appeared to work by monitoring the size of the WAL for a given workload before and after the change and noticing about a 50% reduction in traffic.

The prototype gave a noticeable win, but it was less pronounced than I was hoping for:

name                            old ops/sec  new ops/sec  delta
kv0/size=1/cores=16/nodes=3      29.9k ± 0%   31.1k ± 0%   +3.89%  (p=0.008 n=5+5)
kv0/size=256/cores=16/nodes=3    28.0k ± 1%   28.7k ± 0%   +2.45%  (p=0.008 n=5+5)
kv0/size=8192/cores=16/nodes=3   3.30k ± 2%   3.49k ± 1%   +5.65%  (p=0.008 n=5+5)

name                            old p50(ms)  new p50(ms)  delta
kv0/size=1/cores=16/nodes=3       1.50 ± 0%    1.40 ± 0%   -6.67%  (p=0.008 n=5+5)
kv0/size=256/cores=16/nodes=3     1.60 ± 0%    1.50 ± 0%   -6.25%  (p=0.008 n=5+5)
kv0/size=8192/cores=16/nodes=3    10.0 ± 0%     8.9 ± 0%  -11.00%  (p=0.029 n=4+4)

name                            old p99(ms)  new p99(ms)  delta
kv0/size=1/cores=16/nodes=3       4.70 ± 0%    4.70 ± 0%     ~     (all equal)
kv0/size=256/cores=16/nodes=3     5.20 ± 0%    5.20 ± 0%     ~     (all equal)
kv0/size=8192/cores=16/nodes=3    71.3 ± 6%    65.0 ± 0%   -8.84%  (p=0.016 n=5+4)

This required me to split the engine.rocksDBBatch pipeline into a with-WAL group and a without-WAL group for the purposes of batching, so it's possible I messed something up there or that the loss in batching is counteracting the reduction in write amplification. I'll run the numbers again with the split groups but no use of WriteOptions::disableWAL to isolate the change.

I also plan to rebase the prototype on top of #38568 and re-run the numbers once that's merged.

@petermattis
Copy link
Collaborator

The prototype gave a noticeable win, but it was less pronounced than I was hoping for

Hmm, those numbers are not nearly as compelling as what you were seeing before. I recall you mentioning in person that using an actual separate RocksDB instance for the Raft log appeared to give a significant win for rafttoy.

This required me to split the engine.rocksDBBatch pipeline into a with-WAL group and a without-WAL group for the purposes of batching, so it's possible I messed something up there or that the loss in batching is counteracting the reduction in write amplification. I'll run the numbers again with the split groups but no use of WriteOptions::disableWAL to isolate the change.

I seem to recall that the RocksDB commit pipeline internally has some funkiness when batches have different options. It is possible that the source of the problem is there. And it is possible that using a single RocksDB instance is the problem itself. The WAL-skipping batches still need to wait for the previous WAL-writing batches to sync. Now that I think about it, that seems likely to be the problem.

@nvanbenschoten
Copy link
Member Author

I seem to recall that the RocksDB commit pipeline internally has some funkiness when batches have different options. It is possible that the source of the problem is there. And it is possible that using a single RocksDB instance is the problem itself. The WAL-skipping batches still need to wait for the previous WAL-writing batches to sync. Now that I think about it, that seems likely to be the problem.

I talked to @ajkr about this and he said that it's likely that RocksDB is batching the two groups together under the hood. He's going to look into whether there's an easy way to avoid this. Perhaps we can disable this batching entirely since we're already doing it ourselves?

If not, this change may require the raft log / data store split. Even with that split, we'll need to be careful about this effect, given your point in #38322 (comment) about some writes still requiring durability in the data store engine.

@petermattis
Copy link
Collaborator

I talked to @ajkr about this and he said that it's likely that RocksDB is batching the two groups together under the hood. He's going to look into whether there's an easy way to avoid this. Perhaps we can disable this batching entirely since we're already doing it ourselves?

Yeah, they might be grouped into the same uber-batch. But even without that grouping there is a problem. Consider what happens if there is a commit of batch A that writes to the WAL and a concurrent commit of batch B that skips the WAL. If the commit of B happens after the commit of A it will get queued up behind A and require waiting for A to be written to the WAL and then applied to the memtable. The sync of the WAL itself will happen afterwards, though. Hmm, I'm not sure if this is a problem or not.

@tbg
Copy link
Member

tbg commented Sep 26, 2019

The above prototype focused on qps performance numbers, but could the more important dimension here be the IOPS? If the WAL is essentially off for the data engine it would only be flushing its memtable periodically. Roughly speaking now we write every bit three times: once into the raft log wal (and never out to an SST, since it's hopefully tombstoned before it gets there, though we then need to flush out the tombstone - I'm going to ignore this, we have thought about SingleDelete to avoid this), once into the data wal, and once into a (data) SST. Shouldn't average IOPS go down by around 33% with the prototype?

@nvanbenschoten
Copy link
Member Author

My mental model is a little unclear here about how extra writes to the WAL map to extra IOPS. Certainly if we were syncing twice as much, we would expect double the number of writes, but we're not. When @ajkr took a look at why we weren't seeing the expected improvement here, he found that the extra WAL writes were bloating the amount we wrote to the WAL as expected, but that we were writing so little between syncs that the extra writes weren't causing us to spill into extra SSD pages, so the number of writes from the SSD's perspective remained constant.

There's definitely some experimentation to be done here.

@sumeerbhola
Copy link
Collaborator

I wonder whether we should go a step further and in cloud deployments move the raft logs of a node to a separate EBS volume (EBS is virtualized storage, and is already charged in a fine-grained manner, so there shouldn't be a cost increase).

This would help with "repaving" by transferring the EBS volume corresponding to the state machine, which is where most of the data volume is. Such a transfer would not reduce resiliency since the old node would still have a raft log and can participate in the quorum (it just can't apply since it has no state machine).

@ajwerner
Copy link
Contributor

I wonder whether we should go a step further and in cloud deployments move the raft logs of a node to a separate EBS volume (EBS is virtualized storage, and is already charged in a fine-grained manner, so there shouldn't be a cost increase).

This would help with "repaving" by transferring the EBS volume corresponding to the state machine, which is where most of the data volume is. Such a transfer would not reduce resiliency since the old node would still have a raft log and can participate in the quorum (it just can't apply since it has no state machine).

Interesting idea. I think the biggest blocker to acheiving this in cockroach is the fact that we share the storage engine containing the state machine for many ranges in a single store (LSM and EBS volume). If we had a separate LSM per range, this sort of thing becomes more feasible. It does carry other questions involve safe log truncation and snapshots.

@ajwerner
Copy link
Contributor

While I'm here, I'd like to raise a point that has come up offline a number of times. Using a separate volume for applied state might make new hardware configurations reasonable. In particular, it might be reasonable to use a single, very fast storage device for the raft logs and then apply entries onto HDDs. In a cloud setting this might look like a small, but heavily IOP provisioned EBS volume. Server configurations which lots of large disks and a single, smallish SSD are not uncommon for cool-cold storage applications. Such a server is now rather unadvisable with cockroach. Perhaps with a separate storage engine, one could achieve reasonably good performance for write-mostly workloads at a dramatically lower cost per byte.

@sumeerbhola
Copy link
Collaborator

I think the biggest blocker to acheiving this in cockroach is the fact that we share the storage engine containing the state machine for many ranges in a single store (LSM and EBS volume). If we had a separate LSM per range, this sort of thing becomes more feasible.

Not clear to me why we need an LSM per range. I was specifically thinking of the repaving/decommissioning use case, where all the ranges are being moved to a new node (the latest number for a 10TB node decommission was 12 hours, I think). Am I missing something?

If we also consider rebalancing, the following is a copy-paste of an idea I had mentioned to the storage folks a few months ago

I was thinking again about the storage density issue as it relates to decommissioning and rebalancing a node with 10TB of data. That is, how we do avoid the cost of copying all that data around. It seems unlikely that cloud blob storage, which allows multiple readers/writers, will have a good enough SLA any time in the future.

EBS is virtualized storage, and is already charged in a fine-grained manner, so there isn’t a cost benefit of a 10TB EBS versus a 50GB EBS. An EBS that is too small will cause too much internal fragmentation and inter-EBS load-balancing. So say we had 10 1TB EBSs/stores per node. Load based rebalancing and decommissioning would just move a 1TB store to a different node, which would usually just cause a 10% increase in the accepting node (if 10% is too much over-provisioning we could further reduce the store size, but there is a tradeoff). If byte sizes of ranges are relatively stable, there wouldn’t be too much inter-store rebalancing. And we could be lax and allow stores to have cpu consumption more than 10% for a while as long as we can find a node to host it. If that load spike turns out not to be transient we would eventually resort to inter-store rebalancing.

@nvanbenschoten
Copy link
Member Author

One insight we can steal from this issue that may also be helpful for placing the state machine in disaggregated storage is that the state machine itself stores the Raft applied index of each Range in the RangeAppliedStateKey. Since all updates to the state machine are atomic, we can always take an arbitrarily stale snapshot of the state machine and determine which log entries need to be applied to it to catch it up, assuming our log hasn't been truncated past this point.

In a way, this is saying the same thing as the rest of this issue. Only, this issue is describing how a replica's local LSM's memtable could be ephemeral. The extension here is expanding this to a replica whose entire local LSM is ephemeral, assuming we have some other durable store backing a consistent prefix of it.

Interestingly, such an architecture would motivate replicated log truncation again, because log truncation would be a function of the single durable state machine.

Of course, this leaves out all the hard details.

Not clear to me why we need an LSM per range.

An LSM per range makes these kinds of things easier because we can associate a set of files with exactly one Range. So it becomes more straightforward to dump a Range's state machine to something like S3. Or to pull a stale snapshot of a Range in from S3.

@ajwerner
Copy link
Contributor

As much as I'm a fan of these offhand discussions about sticking the applied state of a range in some external storage system, it feels quite premature and speculative. Maybe there are places we want to take cockroach where pursuing such a strategy makes sense but there will have to likely be a clear and tactical motivation for such a change. Either way, better isolating the raft state and applied state seems beneficial along that path as well as supporting a good many other goals.

Here's the way I'd break it down:

Soon:

  • Make all of the raft processing properly async (kv: make disk I/O asynchronous with respect to Raft state machine #17500).
  • Unhook the raft state storage engine from the applied state storage engine (this issue).
  • Consider doing something smart to how we tune the raft log storage engine to avoid writing much of anything to the actual LSM.
  • Consider doing something smart about scheduling how we group commands for application (preferring leaseholders, preferring system ranges, etc).

Then, speculatively:

  • Refactor replication in the code base in general to be an isolated interface with the requisite hooks.
  • Look around and see what else is going on before doing anything rash

@nnsgmsone
Copy link

nnsgmsone commented Apr 30, 2021

I think this may lead to extra overhead for io when make raft log and wal together into a log engine, it may lead to io problems unless it is a log engine shared by all raft groups. Here is a benchmark: tested on a 7200 rpm hdd

not share log engine:

wkB/s
7292.00
3620.00
8844.00
440.00
9464.00
7452.00
6916.00
7444.00
6292.00
104.00
4716.00
5760.00

share log engine:

wkB/s
1024.00
25112.00
35140.00
35144.00
35136.00
964.00
35140.00
28.00
35144.00
30136.00
35188.00

not share log engine code:

package main

import (
    "fmt"
    "os"
)

const (
    Loop  = 100
    Batch = 10000                                                                                                                                                                                                  
)

var data []byte
var fs []*os.File

func init() {
    data = make([]byte, 512)
    fs = make([]*os.File, 10)
    for i := 0; i < 10; i++ {
        fs[i], _ = os.Create(fmt.Sprintf("%v.dat", i))
    }
}

func main() {
    for i := 0; i < Loop; i++ {
        write()
    }
}

func write() {
    for i := 0; i < Batch; i++ {
        fs[i%10].Write(data)
    }
    for _, f := range fs {
        f.Sync()
    }
}

share log engine code:

package main

import (
    "fmt"
    "os"
)

const (
    Loop  = 100
    Batch = 10000                                                                                                                                                                                                  
)

var data []byte
var fs []*os.File

func init() {
    data = make([]byte, 512)
    fs = make([]*os.File, 2)
    for i := 0; i < 2; i++ {
        fs[i], _ = os.Create(fmt.Sprintf("%v.dat", i))
    }
}

func main() {
    for i := 0; i < Loop; i++ {
        write()
    }
}

func write() {
    for i := 0; i < Batch; i++ {
        fs[0].Write(data)
        fs[1].Write(data)
    }
    fs[0].Sync()
}

@ajwerner
Copy link
Contributor

unless it is a log engine shared by all raft groups.

That is the plan. One engine for the raft logs, one for the applied state.

@yingfeng
Copy link

yingfeng commented Apr 30, 2021

unless it is a log engine shared by all raft groups.

That is the plan. One engine for the raft logs, one for the applied state.

What's the design proposal for sharing the raft logs? Although adopting key-value engines such as pebble to store the separate raft log could be easy, while it would introduce redundant IO since pebble has its own WAL,too. Designing a dedicate log store for shariing raft logs is non-trivial, given no redundant IO permitted.

sumeerbhola added a commit to sumeerbhola/cockroach that referenced this issue Feb 17, 2022
In the ReplicasStorage design we stop making any assumptions
regarding what is durable in the state machine when syncing a batch
that commits changes to the raft log. This implies the need to
make raft log truncation more loosely coupled than it is now, since
we can truncate only when certain that the state machine is durable
up to the truncation index.

Current raft log truncation flows through raft and even though the
RaftTruncatedStateKey is not a replicated key, it is coupled in
the sense that the truncation is done below raft when processing
the corresponding log entry (that asked for truncation to be done).

The current setup also has correctness issues wrt maintaining the
raft log size, when passing the delta bytes for a truncation. We
compute the delta at proposal time (to avoid repeating iteration over
the entries in all replicas), but we do not pass the first index
corresponding to the truncation, so gaps or overlaps cannot be
noticed at truncation time.

We do want to continue to have the raft leader guide the truncation
since we do not want either leader or followers to over-truncate,
given our desire to serve snapshots from any replica. In the loosely
coupled approach implemented here, the truncation request that flows
through raft serves as an upper bound on what can be truncated.

The truncation request includes an ExpectedFirstIndex. This is
further propagated using ReplicatedEvalResult.RaftExpectedFirstIndex.
This ExpectedFirstIndex allows one to notice gaps or overlaps when
enacting a sequence of truncations, which results in setting the
Replica.raftLogSizeTrusted to false. The correctness issue with
Replica.raftLogSize is not fully addressed since there are existing
consistency issues when evaluating a TruncateLogRequest (these
are now noted in a code comment).

Below raft, the truncation requests are queued onto a Replica
in pendingLogTruncations. The queueing and dequeuing is managed
by a raftLogTruncator that takes care of merging pending truncation
requests and enacting the truncations when the durability of the
state machine advances.

The pending truncation requests are taken into account in the
raftLogQueue when deciding whether to do another truncation.
Most of the behavior of the raftLogQueue is unchanged.

The new behavior is gated on a LooselyCoupledRaftLogTruncation
cluster version. Additionally, the new behavior can be turned
off using the kv.raft_log.enable_loosely_coupled_truncation
cluster setting, which is true by default. The latter is expected
to be a safety switch for 1 release after which we expect to
remove it. That removal will also cleanup some duplicated code
(that was non-trivial to refactor and share) between the previous
coupled and new loosely coupled truncation.

Informs cockroachdb#36262
Informs cockroachdb#16624,cockroachdb#38322

Release note: None
sumeerbhola added a commit to sumeerbhola/cockroach that referenced this issue Feb 17, 2022
In the ReplicasStorage design we stop making any assumptions
regarding what is durable in the state machine when syncing a batch
that commits changes to the raft log. This implies the need to
make raft log truncation more loosely coupled than it is now, since
we can truncate only when certain that the state machine is durable
up to the truncation index.

Current raft log truncation flows through raft and even though the
RaftTruncatedStateKey is not a replicated key, it is coupled in
the sense that the truncation is done below raft when processing
the corresponding log entry (that asked for truncation to be done).

The current setup also has correctness issues wrt maintaining the
raft log size, when passing the delta bytes for a truncation. We
compute the delta at proposal time (to avoid repeating iteration over
the entries in all replicas), but we do not pass the first index
corresponding to the truncation, so gaps or overlaps cannot be
noticed at truncation time.

We do want to continue to have the raft leader guide the truncation
since we do not want either leader or followers to over-truncate,
given our desire to serve snapshots from any replica. In the loosely
coupled approach implemented here, the truncation request that flows
through raft serves as an upper bound on what can be truncated.

The truncation request includes an ExpectedFirstIndex. This is
further propagated using ReplicatedEvalResult.RaftExpectedFirstIndex.
This ExpectedFirstIndex allows one to notice gaps or overlaps when
enacting a sequence of truncations, which results in setting the
Replica.raftLogSizeTrusted to false. The correctness issue with
Replica.raftLogSize is not fully addressed since there are existing
consistency issues when evaluating a TruncateLogRequest (these
are now noted in a code comment).

Below raft, the truncation requests are queued onto a Replica
in pendingLogTruncations. The queueing and dequeuing is managed
by a raftLogTruncator that takes care of merging pending truncation
requests and enacting the truncations when the durability of the
state machine advances.

The pending truncation requests are taken into account in the
raftLogQueue when deciding whether to do another truncation.
Most of the behavior of the raftLogQueue is unchanged.

The new behavior is gated on a LooselyCoupledRaftLogTruncation
cluster version. Additionally, the new behavior can be turned
off using the kv.raft_log.enable_loosely_coupled_truncation
cluster setting, which is true by default. The latter is expected
to be a safety switch for 1 release after which we expect to
remove it. That removal will also cleanup some duplicated code
(that was non-trivial to refactor and share) between the previous
coupled and new loosely coupled truncation.

Informs cockroachdb#36262
Informs cockroachdb#16624,cockroachdb#38322

Release note: None
sumeerbhola added a commit to sumeerbhola/cockroach that referenced this issue Feb 18, 2022
In the ReplicasStorage design we stop making any assumptions
regarding what is durable in the state machine when syncing a batch
that commits changes to the raft log. This implies the need to
make raft log truncation more loosely coupled than it is now, since
we can truncate only when certain that the state machine is durable
up to the truncation index.

Current raft log truncation flows through raft and even though the
RaftTruncatedStateKey is not a replicated key, it is coupled in
the sense that the truncation is done below raft when processing
the corresponding log entry (that asked for truncation to be done).

The current setup also has correctness issues wrt maintaining the
raft log size, when passing the delta bytes for a truncation. We
compute the delta at proposal time (to avoid repeating iteration over
the entries in all replicas), but we do not pass the first index
corresponding to the truncation, so gaps or overlaps cannot be
noticed at truncation time.

We do want to continue to have the raft leader guide the truncation
since we do not want either leader or followers to over-truncate,
given our desire to serve snapshots from any replica. In the loosely
coupled approach implemented here, the truncation request that flows
through raft serves as an upper bound on what can be truncated.

The truncation request includes an ExpectedFirstIndex. This is
further propagated using ReplicatedEvalResult.RaftExpectedFirstIndex.
This ExpectedFirstIndex allows one to notice gaps or overlaps when
enacting a sequence of truncations, which results in setting the
Replica.raftLogSizeTrusted to false. The correctness issue with
Replica.raftLogSize is not fully addressed since there are existing
consistency issues when evaluating a TruncateLogRequest (these
are now noted in a code comment).

Below raft, the truncation requests are queued onto a Replica
in pendingLogTruncations. The queueing and dequeuing is managed
by a raftLogTruncator that takes care of merging pending truncation
requests and enacting the truncations when the durability of the
state machine advances.

The pending truncation requests are taken into account in the
raftLogQueue when deciding whether to do another truncation.
Most of the behavior of the raftLogQueue is unchanged.

The new behavior is gated on a LooselyCoupledRaftLogTruncation
cluster version. Additionally, the new behavior can be turned
off using the kv.raft_log.enable_loosely_coupled_truncation.enabled
cluster setting, which is true by default. The latter is expected
to be a safety switch for 1 release after which we expect to
remove it. That removal will also cleanup some duplicated code
(that was non-trivial to refactor and share) between the previous
coupled and new loosely coupled truncation.

Note, this PR is the first of two -- loosely coupled truncation
is turned off via a constant in this PR. The next one will
eliminate the constant and put it under the control of the cluster
setting.

Informs cockroachdb#36262
Informs cockroachdb#16624,cockroachdb#38322

Release note (ops change): The cluster setting
kv.raft_log.enable_loosely_coupled_truncation.enabled can be used
to disable loosely coupled truncation.
sumeerbhola added a commit to sumeerbhola/cockroach that referenced this issue Feb 18, 2022
In the ReplicasStorage design we stop making any assumptions
regarding what is durable in the state machine when syncing a batch
that commits changes to the raft log. This implies the need to
make raft log truncation more loosely coupled than it is now, since
we can truncate only when certain that the state machine is durable
up to the truncation index.

Current raft log truncation flows through raft and even though the
RaftTruncatedStateKey is not a replicated key, it is coupled in
the sense that the truncation is done below raft when processing
the corresponding log entry (that asked for truncation to be done).

The current setup also has correctness issues wrt maintaining the
raft log size, when passing the delta bytes for a truncation. We
compute the delta at proposal time (to avoid repeating iteration over
the entries in all replicas), but we do not pass the first index
corresponding to the truncation, so gaps or overlaps cannot be
noticed at truncation time.

We do want to continue to have the raft leader guide the truncation
since we do not want either leader or followers to over-truncate,
given our desire to serve snapshots from any replica. In the loosely
coupled approach implemented here, the truncation request that flows
through raft serves as an upper bound on what can be truncated.

The truncation request includes an ExpectedFirstIndex. This is
further propagated using ReplicatedEvalResult.RaftExpectedFirstIndex.
This ExpectedFirstIndex allows one to notice gaps or overlaps when
enacting a sequence of truncations, which results in setting the
Replica.raftLogSizeTrusted to false. The correctness issue with
Replica.raftLogSize is not fully addressed since there are existing
consistency issues when evaluating a TruncateLogRequest (these
are now noted in a code comment).

Below raft, the truncation requests are queued onto a Replica
in pendingLogTruncations. The queueing and dequeuing is managed
by a raftLogTruncator that takes care of merging pending truncation
requests and enacting the truncations when the durability of the
state machine advances.

The pending truncation requests are taken into account in the
raftLogQueue when deciding whether to do another truncation.
Most of the behavior of the raftLogQueue is unchanged.

The new behavior is gated on a LooselyCoupledRaftLogTruncation
cluster version. Additionally, the new behavior can be turned
off using the kv.raft_log.enable_loosely_coupled_truncation.enabled
cluster setting, which is true by default. The latter is expected
to be a safety switch for 1 release after which we expect to
remove it. That removal will also cleanup some duplicated code
(that was non-trivial to refactor and share) between the previous
coupled and new loosely coupled truncation.

Note, this PR is the first of two -- loosely coupled truncation
is turned off via a constant in this PR. The next one will
eliminate the constant and put it under the control of the cluster
setting.

Informs cockroachdb#36262
Informs cockroachdb#16624,cockroachdb#38322

Release note (ops change): The cluster setting
kv.raft_log.enable_loosely_coupled_truncation.enabled can be used
to disable loosely coupled truncation.
sumeerbhola added a commit to sumeerbhola/cockroach that referenced this issue Feb 21, 2022
In the ReplicasStorage design we stop making any assumptions
regarding what is durable in the state machine when syncing a batch
that commits changes to the raft log. This implies the need to
make raft log truncation more loosely coupled than it is now, since
we can truncate only when certain that the state machine is durable
up to the truncation index.

Current raft log truncation flows through raft and even though the
RaftTruncatedStateKey is not a replicated key, it is coupled in
the sense that the truncation is done below raft when processing
the corresponding log entry (that asked for truncation to be done).

The current setup also has correctness issues wrt maintaining the
raft log size, when passing the delta bytes for a truncation. We
compute the delta at proposal time (to avoid repeating iteration over
the entries in all replicas), but we do not pass the first index
corresponding to the truncation, so gaps or overlaps cannot be
noticed at truncation time.

We do want to continue to have the raft leader guide the truncation
since we do not want either leader or followers to over-truncate,
given our desire to serve snapshots from any replica. In the loosely
coupled approach implemented here, the truncation request that flows
through raft serves as an upper bound on what can be truncated.

The truncation request includes an ExpectedFirstIndex. This is
further propagated using ReplicatedEvalResult.RaftExpectedFirstIndex.
This ExpectedFirstIndex allows one to notice gaps or overlaps when
enacting a sequence of truncations, which results in setting the
Replica.raftLogSizeTrusted to false. The correctness issue with
Replica.raftLogSize is not fully addressed since there are existing
consistency issues when evaluating a TruncateLogRequest (these
are now noted in a code comment).

Below raft, the truncation requests are queued onto a Replica
in pendingLogTruncations. The queueing and dequeuing is managed
by a raftLogTruncator that takes care of merging pending truncation
requests and enacting the truncations when the durability of the
state machine advances.

The pending truncation requests are taken into account in the
raftLogQueue when deciding whether to do another truncation.
Most of the behavior of the raftLogQueue is unchanged.

The new behavior is gated on a LooselyCoupledRaftLogTruncation
cluster version. Additionally, the new behavior can be turned
off using the kv.raft_log.enable_loosely_coupled_truncation.enabled
cluster setting, which is true by default. The latter is expected
to be a safety switch for 1 release after which we expect to
remove it. That removal will also cleanup some duplicated code
(that was non-trivial to refactor and share) between the previous
coupled and new loosely coupled truncation.

Note, this PR is the first of two -- loosely coupled truncation
is turned off via a constant in this PR. The next one will
eliminate the constant and put it under the control of the cluster
setting.

Informs cockroachdb#36262
Informs cockroachdb#16624,cockroachdb#38322

Release note (ops change): The cluster setting
kv.raft_log.enable_loosely_coupled_truncation.enabled can be used
to disable loosely coupled truncation.
sumeerbhola added a commit to sumeerbhola/cockroach that referenced this issue Feb 22, 2022
In the ReplicasStorage design we stop making any assumptions
regarding what is durable in the state machine when syncing a batch
that commits changes to the raft log. This implies the need to
make raft log truncation more loosely coupled than it is now, since
we can truncate only when certain that the state machine is durable
up to the truncation index.

Current raft log truncation flows through raft and even though the
RaftTruncatedStateKey is not a replicated key, it is coupled in
the sense that the truncation is done below raft when processing
the corresponding log entry (that asked for truncation to be done).

The current setup also has correctness issues wrt maintaining the
raft log size, when passing the delta bytes for a truncation. We
compute the delta at proposal time (to avoid repeating iteration over
the entries in all replicas), but we do not pass the first index
corresponding to the truncation, so gaps or overlaps cannot be
noticed at truncation time.

We do want to continue to have the raft leader guide the truncation
since we do not want either leader or followers to over-truncate,
given our desire to serve snapshots from any replica. In the loosely
coupled approach implemented here, the truncation request that flows
through raft serves as an upper bound on what can be truncated.

The truncation request includes an ExpectedFirstIndex. This is
further propagated using ReplicatedEvalResult.RaftExpectedFirstIndex.
This ExpectedFirstIndex allows one to notice gaps or overlaps when
enacting a sequence of truncations, which results in setting the
Replica.raftLogSizeTrusted to false. The correctness issue with
Replica.raftLogSize is not fully addressed since there are existing
consistency issues when evaluating a TruncateLogRequest (these
are now noted in a code comment).

Below raft, the truncation requests are queued onto a Replica
in pendingLogTruncations. The queueing and dequeuing is managed
by a raftLogTruncator that takes care of merging pending truncation
requests and enacting the truncations when the durability of the
state machine advances.

The pending truncation requests are taken into account in the
raftLogQueue when deciding whether to do another truncation.
Most of the behavior of the raftLogQueue is unchanged.

The new behavior is gated on a LooselyCoupledRaftLogTruncation
cluster version. Additionally, the new behavior can be turned
off using the kv.raft_log.enable_loosely_coupled_truncation.enabled
cluster setting, which is true by default. The latter is expected
to be a safety switch for 1 release after which we expect to
remove it. That removal will also cleanup some duplicated code
(that was non-trivial to refactor and share) between the previous
coupled and new loosely coupled truncation.

Note, this PR is the first of two -- loosely coupled truncation
is turned off via a constant in this PR. The next one will
eliminate the constant and put it under the control of the cluster
setting.

Informs cockroachdb#36262
Informs cockroachdb#16624,cockroachdb#38322

Release note (ops change): The cluster setting
kv.raft_log.loosely_coupled_truncation.enabled can be used to disable
loosely coupled truncation.
maryliag pushed a commit to maryliag/cockroach that referenced this issue Feb 28, 2022
In the ReplicasStorage design we stop making any assumptions
regarding what is durable in the state machine when syncing a batch
that commits changes to the raft log. This implies the need to
make raft log truncation more loosely coupled than it is now, since
we can truncate only when certain that the state machine is durable
up to the truncation index.

Current raft log truncation flows through raft and even though the
RaftTruncatedStateKey is not a replicated key, it is coupled in
the sense that the truncation is done below raft when processing
the corresponding log entry (that asked for truncation to be done).

The current setup also has correctness issues wrt maintaining the
raft log size, when passing the delta bytes for a truncation. We
compute the delta at proposal time (to avoid repeating iteration over
the entries in all replicas), but we do not pass the first index
corresponding to the truncation, so gaps or overlaps cannot be
noticed at truncation time.

We do want to continue to have the raft leader guide the truncation
since we do not want either leader or followers to over-truncate,
given our desire to serve snapshots from any replica. In the loosely
coupled approach implemented here, the truncation request that flows
through raft serves as an upper bound on what can be truncated.

The truncation request includes an ExpectedFirstIndex. This is
further propagated using ReplicatedEvalResult.RaftExpectedFirstIndex.
This ExpectedFirstIndex allows one to notice gaps or overlaps when
enacting a sequence of truncations, which results in setting the
Replica.raftLogSizeTrusted to false. The correctness issue with
Replica.raftLogSize is not fully addressed since there are existing
consistency issues when evaluating a TruncateLogRequest (these
are now noted in a code comment).

Below raft, the truncation requests are queued onto a Replica
in pendingLogTruncations. The queueing and dequeuing is managed
by a raftLogTruncator that takes care of merging pending truncation
requests and enacting the truncations when the durability of the
state machine advances.

The pending truncation requests are taken into account in the
raftLogQueue when deciding whether to do another truncation.
Most of the behavior of the raftLogQueue is unchanged.

The new behavior is gated on a LooselyCoupledRaftLogTruncation
cluster version. Additionally, the new behavior can be turned
off using the kv.raft_log.enable_loosely_coupled_truncation.enabled
cluster setting, which is true by default. The latter is expected
to be a safety switch for 1 release after which we expect to
remove it. That removal will also cleanup some duplicated code
(that was non-trivial to refactor and share) between the previous
coupled and new loosely coupled truncation.

Note, this PR is the first of two -- loosely coupled truncation
is turned off via a constant in this PR. The next one will
eliminate the constant and put it under the control of the cluster
setting.

Informs cockroachdb#36262
Informs cockroachdb#16624,cockroachdb#38322

Release note (ops change): The cluster setting
kv.raft_log.loosely_coupled_truncation.enabled can be used to disable
loosely coupled truncation.
RajivTS pushed a commit to RajivTS/cockroach that referenced this issue Mar 6, 2022
In the ReplicasStorage design we stop making any assumptions
regarding what is durable in the state machine when syncing a batch
that commits changes to the raft log. This implies the need to
make raft log truncation more loosely coupled than it is now, since
we can truncate only when certain that the state machine is durable
up to the truncation index.

Current raft log truncation flows through raft and even though the
RaftTruncatedStateKey is not a replicated key, it is coupled in
the sense that the truncation is done below raft when processing
the corresponding log entry (that asked for truncation to be done).

The current setup also has correctness issues wrt maintaining the
raft log size, when passing the delta bytes for a truncation. We
compute the delta at proposal time (to avoid repeating iteration over
the entries in all replicas), but we do not pass the first index
corresponding to the truncation, so gaps or overlaps cannot be
noticed at truncation time.

We do want to continue to have the raft leader guide the truncation
since we do not want either leader or followers to over-truncate,
given our desire to serve snapshots from any replica. In the loosely
coupled approach implemented here, the truncation request that flows
through raft serves as an upper bound on what can be truncated.

The truncation request includes an ExpectedFirstIndex. This is
further propagated using ReplicatedEvalResult.RaftExpectedFirstIndex.
This ExpectedFirstIndex allows one to notice gaps or overlaps when
enacting a sequence of truncations, which results in setting the
Replica.raftLogSizeTrusted to false. The correctness issue with
Replica.raftLogSize is not fully addressed since there are existing
consistency issues when evaluating a TruncateLogRequest (these
are now noted in a code comment).

Below raft, the truncation requests are queued onto a Replica
in pendingLogTruncations. The queueing and dequeuing is managed
by a raftLogTruncator that takes care of merging pending truncation
requests and enacting the truncations when the durability of the
state machine advances.

The pending truncation requests are taken into account in the
raftLogQueue when deciding whether to do another truncation.
Most of the behavior of the raftLogQueue is unchanged.

The new behavior is gated on a LooselyCoupledRaftLogTruncation
cluster version. Additionally, the new behavior can be turned
off using the kv.raft_log.enable_loosely_coupled_truncation.enabled
cluster setting, which is true by default. The latter is expected
to be a safety switch for 1 release after which we expect to
remove it. That removal will also cleanup some duplicated code
(that was non-trivial to refactor and share) between the previous
coupled and new loosely coupled truncation.

Note, this PR is the first of two -- loosely coupled truncation
is turned off via a constant in this PR. The next one will
eliminate the constant and put it under the control of the cluster
setting.

Informs cockroachdb#36262
Informs cockroachdb#16624,cockroachdb#38322

Release note (ops change): The cluster setting
kv.raft_log.loosely_coupled_truncation.enabled can be used to disable
loosely coupled truncation.
@tbg tbg added T-kv-replication and removed T-kv KV Team labels Nov 4, 2022
@blathers-crl
Copy link

blathers-crl bot commented Nov 4, 2022

cc @cockroachdb/replication

@erikgrinaker erikgrinaker changed the title storage: skip RocksDB write-ahead log during Raft entry application storage: skip Pebble write-ahead log during Raft entry application Nov 7, 2022
Copy link

github-actions bot commented May 1, 2024

We have marked this issue as stale because it has been inactive for
18 months. If this issue is still relevant, removing the stale label
or adding a comment will keep it active. Otherwise, we'll close it in
10 days to keep the issue queue tidy. Thank you for your contribution
to CockroachDB!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-storage Relating to our storage engine (Pebble) on-disk storage. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) C-performance Perf of queries or internals. Solution not expected to change functional behavior. T-kv KV Team
Projects
No open projects
Status: Incoming
Development

No branches or pull requests

10 participants