storage: truncate log only between first index and truncate index
Raft log truncations currently perform two steps (there may be others, but for the sake of this discussion, let's consider only these two):
1. above raft, they compute the stats of all raft log entries up to the truncation entry.
2. beneath raft, they use ClearIterRange to clear all raft log entries up to the truncation entry.

In both steps, operations are performed on all entries up to the truncation entry, and in both steps these operations start from entry 0. A comment added in #16993 gives some idea as to why:

> // We start at index zero because it's always possible that a previous
> // truncation did not clean up entries made obsolete by the previous
> // truncation.

My current understanding is that this case, where a Raft log has been truncated but its entries not cleaned up, is only possible if a node crashes between `applyRaftCommand` and `handleEvalResultRaftMuLocked`. This brings up the question: why don't we truncate raft entries downstream of raft in `applyRaftCommand`? That way, the entries could be deleted atomically with the update to the `RaftTruncatedStateKey`, and we wouldn't have to worry about the two ever diverging or about Raft entries being leaked. That seems like a trivial change, and if it were made, would the approach here be safe? I don't see a reason why not.

For motivation on why we should explore this: I've found that when running `sysbench oltp_insert` on a fresh cluster without pre-splits to measure single-range write throughput, raft log truncation accounts for about 20% of CPU utilization. If we switch the ClearIterRange to a ClearRange downstream of raft, we improve throughput by 13% and reduce the amount of CPU that raft log truncation uses to about 5%. It's obvious why this speeds up the actual truncation itself downstream of raft. The reason it also speeds up the stats computation is less clear, but it may be allowing a RocksDB iterator to more easily skip over the deleted entry keys.

If we make the change proposed here (sketched below), we improve throughput by 28% and reduce the amount of CPU that raft log truncation uses to a negligible amount (< 1%, hard to tell exactly). The reason this speeds up both the truncation and the stats computation is that it avoids iterating over RocksDB tombstones for all Raft entries that have ever existed on the range. The throughput improvements are of course exaggerated because we are isolating the throughput of a single range, but they're significant enough to warrant exploring whether we can make this approach work.

Finally, the outsized impact of this small change naturally justifies further exploration. If we could make the change here safe (i.e. if we could depend on `replica.FirstIndex()` to always be a lower bound on raft log entry keys), could we make similar changes elsewhere? Are there other places where we iterate over an entire raft log keyspace and inadvertently run into all of the deletion tombstones when we could simply skip to `replica.FirstIndex()`? At a minimum, I believe that `clearRangeData` fits this description, so there may be room to speed up snapshots and replica GC.

Release note (performance improvement): Reduce the cost of Raft log truncations and increase single-range throughput.
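Below is a minimal Go sketch of the bounded truncation described above. The `engine` interface and `raftLogKey` helper are simplified stand-ins invented for illustration, not CockroachDB's actual storage APIs:

```go
package raftlog

import "encoding/binary"

// raftLogKey is a simplified stand-in for the real raft log key encoding:
// a range-ID prefix followed by the big-endian log index, so entries
// within a range sort by index.
func raftLogKey(rangeID, index uint64) []byte {
	key := make([]byte, 16)
	binary.BigEndian.PutUint64(key[:8], rangeID)
	binary.BigEndian.PutUint64(key[8:], index)
	return key
}

// engine is a hypothetical subset of a storage engine's write API.
type engine interface {
	// ClearRange drops all keys in [start, end) with a single range
	// deletion, without iterating over individual keys.
	ClearRange(start, end []byte) error
}

// truncateLog clears raft log entries in [firstIndex, truncateIndex).
// Starting at firstIndex rather than at entry 0 means the operation
// never touches the tombstones left behind by earlier truncations.
func truncateLog(eng engine, rangeID, firstIndex, truncateIndex uint64) error {
	if firstIndex >= truncateIndex {
		return nil // nothing to clear
	}
	return eng.ClearRange(
		raftLogKey(rangeID, firstIndex),
		raftLogKey(rangeID, truncateIndex),
	)
}
```

This only works under the invariant discussed above: `firstIndex` must be a true lower bound on live raft log entry keys, which is why the entries would need to be deleted atomically with the `RaftTruncatedStateKey` update.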
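The stats computation above raft can be bounded the same way. Another sketch in the same vein, reusing `raftLogKey` from above and assuming a hypothetical `iterator` interface:

```go
package raftlog // continues the sketch above

import "bytes"

// iterator is a hypothetical subset of a storage iterator's API.
type iterator interface {
	Seek(key []byte) // position at the first key >= key
	Valid() bool
	Key() []byte
	Value() []byte
	Next()
}

// truncationBytes returns the total size of the raft log entries in
// [firstIndex, truncateIndex). Seeking straight to firstIndex lets the
// iterator skip past the deletion tombstones of already-truncated
// entries instead of scanning from entry 0.
func truncationBytes(it iterator, rangeID, firstIndex, truncateIndex uint64) int64 {
	var size int64
	end := raftLogKey(rangeID, truncateIndex)
	for it.Seek(raftLogKey(rangeID, firstIndex)); it.Valid() && bytes.Compare(it.Key(), end) < 0; it.Next() {
		size += int64(len(it.Key()) + len(it.Value()))
	}
	return size
}
```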