
storage: significant space remains in RaftLog after dropping table #26339

Closed
petermattis opened this issue Jun 2, 2018 · 7 comments
Labels
A-storage Relating to our storage engine (Pebble) on-disk storage. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) X-stale

Comments

@petermattis
Collaborator

While performing some basic experiments to verify space reclamation, I discovered something interesting: significant space can remain in the RaftLog for the ranges of a dropped table.

The setup: against a local 1-node cluster (created via roachprod create local -n 1; roachprod start local), I ran:

~ workload run kv --init --read-percent=0 --max-block-bytes=1025 --min-block-bytes=1024 --max-ops=1000000 --splits=10000

[For fast testing purposes, at this point I stopped the cluster and packaged up the ~/local directory into a 1.3GB tar].

I then ran:

~ roachprod sql local -- -e "ALTER TABLE kv.kv EXPERIMENTAL CONFIGURE ZONE 'gc: {ttlseconds: 10}'"
~ roachprod sql local -- -e "DROP TABLE kv.kv"

After the compactions completed, only 1GiB out of the initial 1.3GiB had been deleted. A manual cockroach debug compact of the storage directory did not recover any additional space. The output of cockroach debug keys --sizes shows 8.3MiB of live keys and 301MiB of live values. 97% of this data is from RaftLog keys/values.

I'm not sure why these RaftLog entries are sticking around given that this is a 1-node cluster. Perhaps the sheer number of ranges prevented the Raft log queue from keeping up with truncations.

Cc @m-schneider as this might be related to the failure of the drop table roachtest.

@petermattis petermattis added the A-storage Relating to our storage engine (Pebble) on-disk storage. label Jun 2, 2018
@petermattis petermattis added this to the 2.1 milestone Jun 2, 2018
@petermattis
Collaborator Author

I'm not sure why these RaftLog entries are sticking around given that this is a 1-node cluster. Perhaps the sheer number of ranges prevented the Raft log queue from keeping up with truncations.

Ah, this has an easy explanation: the Raft log queue only truncates if there is at least 64KiB of data to delete. The debug keys --sizes output shows that none of the 10k ranges has more than 64KiB. If we estimate that each range's Raft log contains 32KiB of data, we would expect the Raft logs across all of the ranges in the table to contain 312MiB, which is very close to what I'm seeing.
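
As a back-of-the-envelope check (illustrative arithmetic only, not actual cockroach code):

package main

import "fmt"

func main() {
	const (
		ranges      = 10000    // --splits=10000 from the workload invocation
		perRangeLog = 32 << 10 // assumed ~32KiB of Raft log per range
	)
	total := ranges * perRangeLog
	// 10,000 ranges * 32KiB comes out to roughly the ~310MiB of live RaftLog data observed.
	fmt.Printf("%.1f MiB\n", float64(total)/(1<<20)) // prints 312.5 MiB
}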

@benesch Merging of empty ranges would mitigate this issue. Or we could make the Raft log queue heuristic more aggressive after a ClearRange operation.

PS An empty range with an empty Raft log consumes ~260 bytes on disk due to the various range-local keys (e.g., /RangeLastGC, /RangeAppliedState, /RaftTruncatedState, etc.).

@bdarnell
Contributor

bdarnell commented Jun 4, 2018

Merging of empty ranges would mitigate this issue. Or we could make the Raft log queue heuristic more aggressive after a ClearRange operation.

Or we could have deleteAllRowsFast (which is responsible for actually sending the ClearRange operations) also send TruncateLog commands.
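
Very roughly, something like this (the helpers below are made-up stand-ins for the real ClearRange/TruncateLog plumbing, not actual cockroach APIs):

// Sketch only: span, clearTableSpan, rangesOverlapping, and truncateRaftLog are
// hypothetical helpers used for illustration.
func dropTableDataFast(spans []span) error {
	for _, sp := range spans {
		// Clear the user data for this chunk of the table (today's behavior).
		if err := clearTableSpan(sp); err != nil {
			return err
		}
		// Additionally truncate the Raft logs of the ranges we just cleared,
		// instead of waiting for the Raft log queue's 64KiB heuristic to fire.
		for _, rangeID := range rangesOverlapping(sp) {
			if err := truncateRaftLog(rangeID); err != nil {
				return err
			}
		}
	}
	return nil
}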

@tbg tbg modified the milestones: 2.1, 2.2 Jul 19, 2018
@tbg tbg added the C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) label Jul 22, 2018
@petermattis petermattis removed this from the 2.2 milestone Oct 5, 2018
@tbg tbg self-assigned this Oct 24, 2018
@tbg
Member

tbg commented Oct 26, 2018

This problem will go away for empty ranges once we have the merge queue.

It might be relevant for "regular" ranges, though. I've wondered whether we should add a truncation criterion to the Raft log queue: if a range has been quiescent for a while (or insert your favorite criterion for a range being worth trimming the fat), we'd use a low value (3) instead of the default of 100 here:

// RaftLogQueueStaleThreshold is the minimum threshold for stale raft log
// entries. A stale entry is one which all replicas of the range have
// progressed past and thus is no longer needed and can be truncated.
RaftLogQueueStaleThreshold = 100

I've noticed that truncations never really manage to empty out the log. This is because when, say, index 100 is completely replicated, we don't include 100 in the truncation but only delete up to and including 99. And in performing the truncation, we add an entry to the Raft log ourselves, so the resulting log ends up with three elements. That's why any value lower than 3 above would lead to an infinite loop.
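
To spell out the arithmetic (illustrative only, not the actual raftLogQueue code):

package main

import "fmt"

func main() {
	quorumIndex := uint64(100) // index that every replica has fully replicated
	truncateUpTo := quorumIndex - 1
	fmt.Println("delete entries <=", truncateUpTo) // entry 100 itself survives
	// The TruncateLog proposal also appends a fresh entry to the log, so the
	// log never shrinks to zero this way; in practice it bottoms out at a few
	// entries, which is why a threshold below 3 would keep re-queueing forever.
}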

@bdarnell is this a technical limitation of Raft ("log can't be completely empty") or could we in theory apply a truncation that results in a zero-length log?

@petermattis
Collaborator Author

It might be relevant for "regular" ranges, though. I've wondered whether we should add a truncation criterion to the Raft log queue: if a range has been quiescent for a while (or insert your favorite criterion for a range being worth trimming the fat), we'd use a low value (3) instead of the default of 100 here:

Perhaps there should be a different truncation criterion for ranges that are discovered by the scanner vs. ranges that are preemptively added to the Raft log queue due to activity. The scanner could truncate quiescent ranges even if there isn't much to truncate, while preemptively added ranges would continue to use the existing logic, where they wait for a significant chunk of log to accumulate before truncating.
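
Something along these lines, as a rough sketch (the function and the quiescence check are hypothetical, not existing raftLogQueue code):

package main

import "fmt"

// shouldTruncate sketches a two-track heuristic: scanner-discovered quiescent
// ranges are truncated even when little data would be reclaimed, while ranges
// queued because of activity keep the existing "wait for a sizable chunk" rule.
func shouldTruncate(fromScanner, quiescent bool, truncatableBytes int64) bool {
	const sizeThreshold = 64 << 10 // the existing 64KiB heuristic
	if fromScanner && quiescent {
		return truncatableBytes > 0 // trim whatever is there
	}
	return truncatableBytes >= sizeThreshold
}

func main() {
	fmt.Println(shouldTruncate(true, true, 8<<10))   // true: quiescent range found by the scanner
	fmt.Println(shouldTruncate(false, false, 8<<10)) // false: active range below 64KiB
}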

@bdarnell
Contributor

is this a technical limitation of Raft ("log can't be completely empty") or could we in theory apply a truncation that results in a zero-length log?

It's not a fundamental limitation (we start with a zero-length log), but as long as we put the index to truncate in the command itself, it's a little tricky to know what raft log position we're going to end up committing at. There's also a slightly higher risk of needing a raft snapshot if the truncation deletes itself since there's only one chance to broadcast the log entry before it gets deleted.

It sounds like there are some off-by-one errors in there. We should easily be able to truncate the log down to a single entry instead of leaving 3 behind.

@bdarnell
Contributor

This problem will go away for empty ranges once we have the merge queue.

Mostly. After DELETE FROM tbl, the table will merge down to a single range, and that range can't be merged with its neighbors, so we could still have one range's worth of raft logs lying around for this empty table.

@tbg
Member

tbg commented Mar 28, 2019

We can now truncate raft logs locally, which means that any range, upon quiescing, can just clear out the remainder of the raft log. (cc #36262)
