compaction: Don't split user keys during compactions #734
Comments
I thought the RocksDB motivation was to prevent a single sstable from getting too large if there were too many seqnums for a key, which can result in a larger number of index blocks (2-level indexes) and a bigger range tombstone block, both of which can affect read performance. This is unlikely to be a problem in L0, where we have adopted the stricter behavior for flushes, but that doesn't necessarily hold for L6, which has most of the data. But because CockroachDB is storing MVCC keys in these user keys, it seems only the MVCCMetadata, which uses timestamp 0, could have a large number of sequence numbers. And IIRC, we are sparing in the use of snapshots (though I don't know the list of places we use snapshots), so we should be quick to GC older seqnums.
The sstable splitting logic in RocksDB was inherited from LevelDB, which didn't need to worry about atomic compaction units because it doesn't have range tombstones. I looked at the history of how this was added, and my read was that splitting at the next user key was simply not considered as an alternative to atomic compaction units. I'm not sure why. How much bigger can the range tombstone block get from splitting at the next user key? I suspect very little in practice, though perhaps I'm failing to imagine some problematic corner case. How much bigger can the index block get from splitting at the next user key? I think that depends on how many snapshots are active. AFAIK, we only use snapshots for replica rebalancing, so we're only going to have a few active at any time. Even if there were 1000 snapshots active, that translates to the potential to add 1000 extra records to the sstable, which seems reasonable when L6 sstables typically have hundreds of thousands of records.
@nicktrav — just realized there's actually an issue for this |
During a compaction, if the current sstable hits the file size limit, defer finishing the sstable if the next sstable would share a user key. This is the current behavior of flushes, and this change brings parity between the two.

This change is motivated by the introduction of range keys (see #1339). It ensures we can always cleanly truncate range keys that span sstable boundaries.

This commit also removes (keyspan.Fragmenter).FlushTo. Now that we prohibit splitting sstables in the middle of a user key, the Fragmenter's FlushTo function is unnecessary. Compactions and flushes always use the TruncateAndFlushTo variant.

This change required a tweak to the way grandparent limits are applied, in order to switch the grandparent splitter's comparison into a >= comparison. This was necessary due to the shift in interpreting `splitterSuggestion`s as exclusive boundaries.

Close #734.
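As a rough illustration of the splitting rule described above, here is a minimal sketch (not Pebble's actual compaction code): `shouldFinishOutput` and its parameters are hypothetical names, and plain byte equality stands in for the configured user-key comparer. Once the size limit is reached, the output is only finished when the next key's user key differs from the last key written, so a single user key is never split across two output sstables.

```go
package compactsketch

import "bytes"

// shouldFinishOutput reports whether the current output sstable should be
// finished before writing nextUserKey. Even after the size limit is hit,
// the split is deferred while the next key shares the user key of the last
// key already written to the output.
func shouldFinishOutput(lastWrittenUserKey, nextUserKey []byte, curSize, sizeLimit uint64) bool {
	if curSize < sizeLimit {
		// Below the size limit: keep writing to the current sstable.
		return false
	}
	// At or above the limit: only split once the user key changes.
	return !bytes.Equal(lastWrittenUserKey, nextUserKey)
}
```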
Currently, there are some cases in the compaction loop where user keys could be output to two different sstables, like this:
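A hypothetical example (file numbers, keys, and sequence numbers below are made up for illustration):

```
000041.sst:  ... foo#5,SET  foo#4,SET           <- hits the size limit mid-way through user key "foo"
000042.sst:  foo#3,SET  foo#2,DEL  bar#9,SET    <- the same user key continues in the next sstable
```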
This results in the implicit creation of an "atomic compaction group": both of these sstables must be present in compactions together, or it's possible for deleted keys to reappear after a sequence of compactions (see the comment above `expandInputs` on how and why this happens). These implications of splitting user keys are ultimately unhelpful; splitting does not increase compaction parallelization or reduce compaction sizes. The only reason we split user keys is to maintain similarity in behaviour with RocksDB. We should explore not splitting user keys across different sstables for all compactions (something we already do for flushes as of #675). This should help simplify some compaction logic.