compaction: Don't split user keys during compactions #734

Closed
itsbilal opened this issue Jun 10, 2020 · 3 comments · Fixed by #1470

@itsbilal
Member

Currently, there are some cases in the compaction loop where user keys could be output to two different sstables, like this:

000157.sst:
a.RANGEDEL.6:f
b.SET.15:foo
b.MERGE.8:bar

000158.sst:
b.RANGEDEL.6:f
b.SET.5:baz
b.DEL.3

This results in the implicit creation of an "atomic compaction group"; both these SSTables must
be present in compactions together, or it's possible for deleted keys to reappear after a sequence of compactions (see the comment above expandInputs on how and why this happens).
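
As a rough illustration of why these two tables are welded together, here is a minimal sketch (in Go, with invented names; not Pebble's actual expandInputs code) of the expansion a compaction picker has to perform so that it never picks only one side of a shared user key:

```go
package compactionsketch

import "bytes"

// fileMeta is a simplified stand-in for sstable metadata; only the user-key
// bounds matter for this sketch.
type fileMeta struct {
	smallestUserKey, largestUserKey []byte
}

// expandToAtomicUnit grows the half-open range of files [start, end) within a
// sorted level until neither edge of the range splits a user key across two
// files. Illustrative only; Pebble's expandInputs handles more than this.
func expandToAtomicUnit(level []fileMeta, start, end int) (int, int) {
	// Grow left while the file before the range ends on the same user key
	// that the first file in the range starts on.
	for start > 0 && bytes.Equal(level[start-1].largestUserKey, level[start].smallestUserKey) {
		start--
	}
	// Grow right while the last file in the range ends on the same user key
	// that the next file starts on.
	for end < len(level) && bytes.Equal(level[end-1].largestUserKey, level[end].smallestUserKey) {
		end++
	}
	return start, end
}
```

For the files above, 000157.sst's largest user key and 000158.sst's smallest user key are both b, so picking either file forces the other into the same compaction.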

The atomic compaction groups that result from splitting user keys are ultimately unhelpful: they work against compaction parallelization and against keeping compactions small. The only reason why we split user keys is to maintain similarity in behaviour with RocksDB. We should explore not splitting user keys across different sstables for all compactions (something we already do for flushes as of #675). This should also help simplify some compaction logic.

@sumeerbhola
Collaborator

The only reason why we split user keys is to maintain similarity in behaviour with RocksDB.

I thought the RocksDB motivation was to prevent a single sstable from getting too large if there were too many seqnums for a key, which can result in a larger number of index blocks (2-level indexes) and a bigger range tombstone block, both of which can affect read performance.

This is unlikely to be a problem in L0, where we have adopted the stricter behavior for flushes, but that doesn't necessarily hold for L6, which has most of the data. But because CockroachDB's user keys are MVCC keys, it seems only the MVCCMetadata key, which uses timestamp 0, could accumulate a large number of sequence numbers. And IIRC we are sparing in our use of snapshots (though I don't know the full list of places we use them), so older seqnums should be quick to GC.

@petermattis
Collaborator

I thought the RocksDB motivation was to prevent a single sstable from getting too large if there were too many seqnums for a key, which can result in a larger number of index blocks (2-level indexes) and a bigger range tombstone block, both of which can affect read performance.

The sstable splitting logic in RocksDB was inherited from LevelDB, which didn't need to worry about atomic compaction units because it has no range tombstones. I looked at the history of how this behavior was added, and my read is that splitting at the next user key was simply not considered as an alternative to atomic compaction units. I'm not sure why.

How much bigger can the range tombstone block get from splitting at the next user key? I suspect very little in practice, but perhaps I'm failing to imagine some problematic corner case.

How much bigger can the index block get from splitting at the next user key? I think that is dependent on how many snapshots are active. AFAIK we only use snapshots for replica rebalancing, so we'll only have a few active at any time. Even if there were 1000 snapshots active, that translates to the potential for 1000 extra records in the sstable, which seems reasonable when L6 sstables typically have hundreds of thousands of records.
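
For intuition, here is a back-of-the-envelope sketch (a hypothetical Go helper, not Pebble's actual elision logic) of the bound behind that estimate: each open snapshot can force at most one additional version of a user key to be retained, so 1000 snapshots caps the extra records at roughly 1000.

```go
package snapshotsketch

// retainedVersions estimates how many versions of a single user key a
// compaction must keep: the newest version overall, plus, for each open
// snapshot, the newest version visible to it. Visibility is assumed here to
// mean seqnum <= snapshot seqnum; the exact comparison is engine-specific.
// versionSeqNums must be sorted in descending order. The result is bounded by
// len(snapshotSeqNums)+1, which is the arithmetic behind the estimate above.
// Illustrative only; this is not Pebble's actual elision logic.
func retainedVersions(versionSeqNums, snapshotSeqNums []uint64) int {
	retained := make(map[uint64]struct{})
	if len(versionSeqNums) > 0 {
		retained[versionSeqNums[0]] = struct{}{}
	}
	for _, snap := range snapshotSeqNums {
		for _, v := range versionSeqNums {
			if v <= snap {
				// The newest version visible to this snapshot must survive.
				retained[v] = struct{}{}
				break
			}
		}
	}
	return len(retained)
}
```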

@jbowens
Collaborator

jbowens commented Dec 10, 2021

@nicktrav — just realized there's actually an issue for this

jbowens added a commit to jbowens/pebble that referenced this issue Jan 25, 2022
During a compaction, if the current sstable hits the file size limit, defer
finishing the sstable if the next sstable would share a user key. This is the
current behavior of flushes, and this change brings parity between the two.

This change is motivated by the introduction of range keys (see cockroachdb#1339).
This ensures we can always cleanly truncate range keys that span sstable
boundaries.

This commit also removes (keyspan.Fragmenter).FlushTo. Now that we prohibit
splitting sstables in the middle of a user key, the Fragmenter's FlushTo
function is unnecessary. Compactions and flushes always use the
TruncateAndFlushTo variant.

This change required a tweak to the way grandparent limits are applied, in
order to switch the grandparent splitter's comparison into a >= comparison.
This was necessary due to the shift in interpreting `splitterSuggestion`s as
exclusive boundaries.

Close cockroachdb#734.
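
A minimal sketch of the splitting rule the commit message describes, using hypothetical names rather than the actual splitter interface: an output sstable that has reached its size limit is only finished once the next key starts a new user key.

```go
package splittersketch

import "bytes"

// shouldFinishOutput reports whether a compaction may close its current
// output sstable before writing the next key. Hypothetical helper; the real
// change threads this rule through Pebble's output-splitter machinery.
func shouldFinishOutput(curSize, sizeLimit uint64, prevUserKey, nextUserKey []byte) bool {
	if curSize < sizeLimit {
		// Still under the size limit: keep writing to the current sstable.
		return false
	}
	// At or over the limit: only finish the sstable once the next key begins
	// a new user key, so no user key is ever split across two outputs.
	return !bytes.Equal(prevUserKey, nextUserKey)
}
```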
jbowens added a commit to jbowens/pebble that referenced this issue Jan 31, 2022
jbowens added a commit that referenced this issue Feb 3, 2022