compaction: Don't split user keys during compactions #734
Comments
I thought the RocksDB motivation was to prevent a single sstable from getting too large if there were too many seqnums for a key, which can result in a larger number of index blocks (2-level indexes) and a bigger range tombstone block, both of which can affect read performance. This is unlikely to be a problem in L0, where we have adopted the stricter behavior for flushes, but that doesn't necessarily hold for L6, which has most of the data. But because CockroachDB is storing MVCC keys in these user keys, it seems only the MVCCMetadata, which uses timestamp 0, could have a large number of sequence numbers. And IIRC, we are sparing in the use of snapshots (though I don't know the list of places we use snapshots), so we should be quick to GC older seqnums.
The sstable splitting logic in RocksDB was inherited from LevelDB, which didn't need to worry about atomic compaction units because it doesn't have range tombstones. I looked at the history of how this was added, and my read was that splitting at the next user key was simply not considered as an alternative to atomic compaction units. I'm not sure why. How much bigger can the range tombstone block get from splitting at the next user key? I suspect very little in practice, though perhaps I'm failing to imagine some problematic corner case. How much bigger can the index block get from splitting at the next user key? I think that depends on how many snapshots are active. AFAIK, we only use snapshots for replica rebalancing, so we're only going to have a few active at any time. Even if there were 1000 snapshots active, that translates to the potential to add 1000 extra records to the sstable, which seems reasonable when L6 sstables typically have hundreds of thousands of records.
@nicktrav — just realized there's actually an issue for this |
During a compaction, if the current sstable hits the file size limit, defer finishing the sstable if the next sstable would share a user key. This is the current behavior of flushes, and this change brings parity between the two.

This change is motivated by the introduction of range keys (see #1339). It ensures we can always cleanly truncate range keys that span sstable boundaries.

This commit also removes (keyspan.Fragmenter).FlushTo. Now that we prohibit splitting sstables in the middle of a user key, the Fragmenter's FlushTo function is unnecessary. Compactions and flushes always use the TruncateAndFlushTo variant.

This change required a tweak to the way grandparent limits are applied, in order to switch the grandparent splitter's comparison into a >= comparison. This was necessary due to the shift in interpreting `splitterSuggestion`s as exclusive boundaries.

Close #734.
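As a rough illustration of the splitting rule described above, here is a minimal sketch (not Pebble's actual compaction code): `shouldFinishOutput` and its parameters are hypothetical names, and plain byte equality stands in for the configured user-key comparer. Once the size limit is reached, the output is only finished when the next key's user key differs from the last key written, so a single user key is never split across two output sstables.

```go
package compactsketch

import "bytes"

// shouldFinishOutput reports whether the current output sstable should be
// finished before writing nextUserKey. Even after the size limit is hit,
// the split is deferred while the next key shares the user key of the last
// key already written to the output.
func shouldFinishOutput(lastWrittenUserKey, nextUserKey []byte, curSize, sizeLimit uint64) bool {
	if curSize < sizeLimit {
		// Below the size limit: keep writing to the current sstable.
		return false
	}
	// At or above the limit: only split once the user key changes.
	return !bytes.Equal(lastWrittenUserKey, nextUserKey)
}
```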
Currently, there are some cases in the compaction loop where user keys could be output to two different sstables, like this:
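A hypothetical example (file numbers, keys, and sequence numbers below are made up for illustration):

```
000041.sst:  ... foo#5,SET  foo#4,SET           <- hits the size limit mid-way through user key "foo"
000042.sst:  foo#3,SET  foo#2,DEL  bar#9,SET    <- the same user key continues in the next sstable
```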
This results in the implicit creation of an "atomic compaction group": both of these sstables must be present in compactions together, or it's possible for deleted keys to reappear after a sequence of compactions (see the comment above `expandInputs` on how and why this happens). These implications of splitting user keys are ultimately unhelpful; splitting does not increase compaction parallelization or reduce compaction sizes. The only reason we split user keys is to maintain similarity in behaviour with RocksDB. We should explore not splitting user keys across different sstables for all compactions (something we already do for flushes as of #675). This should help simplify some compaction logic.