
storage: significant space remains in RaftLog after dropping table #26339

Closed
petermattis opened this issue Jun 2, 2018 · 7 comments
Labels
A-storage Relating to our storage engine (Pebble) on-disk storage. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) X-stale

Comments

@petermattis
Collaborator

While performing some basic experiments to verify space reclamation, I discovered something interesting: significant space can remain in the RaftLog for the ranges of a dropped table.

The setup: against a local 1-node cluster (created via roachprod create local -n 1; roachprod start local), I ran:

~ workload run kv --init --read-percent=0 --max-block-bytes=1025 --min-block-bytes=1024 --max-ops=1000000 --splits=10000

[For fast testing purposes, at this point I stopped the cluster and packaged up the ~/local directory into a 1.3GB tar].

I then ran:

~ roachprod sql local -- -e "ALTER TABLE kv.kv EXPERIMENTAL CONFIGURE ZONE 'gc: {ttlseconds: 10}'"
~ roachprod sql local -- -e "DROP TABLE kv.kv"

After the compactions completed, only 1GiB out of the initial 1.3GiB had been deleted. A manual cockroach debug compact of the storage directory did not recover any additional space. The output of cockroach debug keys --sizes shows 8.3MiB of live keys and 301MiB of live values. 97% of this data is from RaftLog keys/values.

I'm not sure why these RaftLog entries are sticking around given that this is a 1-node cluster. Perhaps the sheer number of ranges prevented the Raft log queue from keeping up with truncations.

Cc @m-schneider as this might be related to the failure of the drop table roachtest.

@petermattis petermattis added the A-storage Relating to our storage engine (Pebble) on-disk storage. label Jun 2, 2018
@petermattis petermattis added this to the 2.1 milestone Jun 2, 2018
@petermattis
Collaborator Author

I'm not sure why these RaftLog entries are sticking around given that this is a 1-node cluster. Perhaps the sheer number of ranges prevented the Raft log queue from keeping up with truncations.

Ah, this has an easy explanation: the Raft log queue only truncates if there is at least 64KiB of data to delete. The debug keys --sizes output shows that none of the 10k ranges has more than 64KiB. If we estimate that each range's Raft log contains 32KiB of data, we would expect the Raft logs across all of the ranges in the table to contain 312MiB, which is very close to what I'm seeing.
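
As a back-of-the-envelope check (illustrative arithmetic only, not actual cockroach code):

package main

import "fmt"

func main() {
	const (
		ranges      = 10000    // --splits=10000 from the workload invocation
		perRangeLog = 32 << 10 // assumed ~32KiB of Raft log per range
	)
	total := ranges * perRangeLog
	// 10,000 ranges * 32KiB comes out to roughly the ~310MiB of live RaftLog data observed.
	fmt.Printf("%.1f MiB\n", float64(total)/(1<<20)) // prints 312.5 MiB
}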

@benesch Merging of empty ranges would mitigate this issue. Or we could make the Raft log queue heuristic more aggressive after a ClearRange operation.

PS An empty range with an empty Raft log consumes ~260 bytes on disk due to the various range-local keys (e.g., /RangeLastGC, /RangeAppliedState, /RaftTruncatedState, etc.).

@bdarnell
Contributor

bdarnell commented Jun 4, 2018

Merging of empty ranges would mitigate this issue. Or we could make the Raft log queue heuristic more aggressive after a ClearRange operation.

Or we could have deleteAllRowsFast (which is responsible for actually sending the ClearRange operations) also send TruncateLog commands.
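
Very roughly, something like this (the helpers below are made-up stand-ins for the real ClearRange/TruncateLog plumbing, not actual cockroach APIs):

// Sketch only: span, clearTableSpan, rangesOverlapping, and truncateRaftLog are
// hypothetical helpers used for illustration.
func dropTableDataFast(spans []span) error {
	for _, sp := range spans {
		// Clear the user data for this chunk of the table (today's behavior).
		if err := clearTableSpan(sp); err != nil {
			return err
		}
		// Additionally truncate the Raft logs of the ranges we just cleared,
		// instead of waiting for the Raft log queue's 64KiB heuristic to fire.
		for _, rangeID := range rangesOverlapping(sp) {
			if err := truncateRaftLog(rangeID); err != nil {
				return err
			}
		}
	}
	return nil
}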

@tbg tbg modified the milestones: 2.1, 2.2 Jul 19, 2018
@tbg tbg added the C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) label Jul 22, 2018
@petermattis petermattis removed this from the 2.2 milestone Oct 5, 2018
@tbg tbg self-assigned this Oct 24, 2018
@tbg
Member

tbg commented Oct 26, 2018

This problem will go away for empty ranges once we have the merge queue.

It might be relevant for "regular" ranges, though. I've wondered whether we should add a truncation criterion to the Raft log queue: if a range has been quiescent for a while (or insert your favorite criterion for a range being worth trimming the fat), we'd use a low value (3) instead of the default of 100 here:

// RaftLogQueueStaleThreshold is the minimum threshold for stale raft log
// entries. A stale entry is one which all replicas of the range have
// progressed past and thus is no longer needed and can be truncated.
RaftLogQueueStaleThreshold = 100

I've noticed that truncations never really manage to empty out the log. This is because when, say, index 100 is completely replicated, we don't include 100 in the truncation but only delete up to and including 99. And in performing the truncation, we add an entry to the Raft log ourselves, so the resulting log ends up with three elements. That's why any value lower than 3 above would lead to an infinite loop.
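
To spell out the arithmetic (illustrative only, not the actual raftLogQueue code):

package main

import "fmt"

func main() {
	quorumIndex := uint64(100) // index that every replica has fully replicated
	truncateUpTo := quorumIndex - 1
	fmt.Println("delete entries <=", truncateUpTo) // entry 100 itself survives
	// The TruncateLog proposal also appends a fresh entry to the log, so the
	// log never shrinks to zero this way; in practice it bottoms out at a few
	// entries, which is why a threshold below 3 would keep re-queueing forever.
}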

@bdarnell is this a technical limitation of Raft ("log can't be completely empty") or could we in theory apply a truncation that results in a zero-length log?

@petermattis
Collaborator Author

It might be relevant for "regular" ranges, though. I've wondered whether we should add a truncation criterion to the Raft log queue: if a range has been quiescent for a while (or insert your favorite criterion for a range being worth trimming the fat), we'd use a low value (3) instead of the default of 100 here:

Perhaps there should be a different truncation criterion for ranges that are discovered by the scanner vs. ranges that are preemptively added to the Raft log queue due to activity. The scanner could truncate quiescent ranges even if there isn't much to truncate, while preemptively added ranges would continue to use the existing logic, where they wait for a significant chunk of log to accumulate before truncating.
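
Something along these lines, as a rough sketch (the function and the quiescence check are hypothetical, not existing raftLogQueue code):

package main

import "fmt"

// shouldTruncate sketches a two-track heuristic: scanner-discovered quiescent
// ranges are truncated even when little data would be reclaimed, while ranges
// queued because of activity keep the existing "wait for a sizable chunk" rule.
func shouldTruncate(fromScanner, quiescent bool, truncatableBytes int64) bool {
	const sizeThreshold = 64 << 10 // the existing 64KiB heuristic
	if fromScanner && quiescent {
		return truncatableBytes > 0 // trim whatever is there
	}
	return truncatableBytes >= sizeThreshold
}

func main() {
	fmt.Println(shouldTruncate(true, true, 8<<10))   // true: quiescent range found by the scanner
	fmt.Println(shouldTruncate(false, false, 8<<10)) // false: active range below 64KiB
}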

@bdarnell
Contributor

is this a technical limitation of Raft ("log can't be completely empty") or could we in theory apply a truncation that results in a zero-length log?

It's not a fundamental limitation (we start with a zero-length log), but as long as we put the index to truncate in the command itself, it's a little tricky to know what raft log position we're going to end up committing at. There's also a slightly higher risk of needing a raft snapshot if the truncation deletes itself since there's only one chance to broadcast the log entry before it gets deleted.

It sounds like there are some off-by-one errors in there. We should easily be able to truncate the log down to a single entry instead of leaving 3 behind.

@bdarnell
Contributor

This problem will go away for empty ranges once we have the merge queue.

Mostly. After DELETE FROM tbl, the table will merge down to a single range, and that range can't be merged with its neighbors, so we could still have one range's worth of raft logs lying around for this empty table.

@tbg
Member

tbg commented Mar 28, 2019

We can now truncate raft logs locally, which means that any range, upon quiescing, can just clear out the remainder of the raft log. (cc #36262)
