
Backport the various range tombstone PRs #8

Merged

Conversation


@petermattis petermattis commented Jul 13, 2018

Backport the various range tombstone PRs in the same order that we expect them to land upstream. Note that the target for this PR is crl-release-5.13.new, not crl-release-5.13. After this merges, I'll replace crl-release-5.13 with crl-release-5.13.new. The last 7 commits were from #2, #3 and #4 with minor updates to range_del_aggregator.cc to adapt to the new code structure. The first commit is facebook#3800.



lingbin and others added 5 commits July 13, 2018 10:15
…tor multiple times"

Summary:
The original commit facebook#3635 hurt performance for users who aren't using range deletions because of unneeded std::set operations, so it was reverted by commit 44653c7 (see facebook#3672).

To fix this, move the set into the aggregator and add a check on the add path, i.e., a file will be added only if the aggregator's internal representation is non-nullptr.
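
Since the exact identifiers were lost from this summary, here is a minimal sketch of the idea with hypothetical names (not the actual RocksDB code): the aggregator itself tracks which files' tombstones it has already ingested, and the set is only touched once range deletions actually exist.

```cpp
#include <cstdint>
#include <memory>
#include <set>

// Hypothetical sketch of the fix described above.
class RangeDelAggregatorSketch {
 public:
  // Returns true the first time tombstones from `file_number` should be
  // added, false on repeat calls. The set is only consulted once the
  // aggregator has been initialized (rep_ is non-null), so workloads
  // without range deletions pay no std::set cost.
  bool ShouldAddFile(uint64_t file_number) {
    if (rep_ == nullptr) {
      return false;  // no range deletions seen yet; nothing to track
    }
    return added_files_.insert(file_number).second;
  }

 private:
  struct Rep {};                    // stands in for the real tombstone maps
  std::unique_ptr<Rep> rep_;        // lazily created on first tombstone
  std::set<uint64_t> added_files_;  // files whose tombstones were ingested
};
```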

The db_bench command which found the performance regression:

> ./db_bench --benchmarks=fillrandom,seekrandomwhilewriting --threads=1 --num=1000000 --reads=150000 --key_size=66 --value_size=1262 --statistics=0 --compression_ratio=0.5 --histogram=1 --seek_nexts=1 --stats_per_interval=1 --stats_interval_seconds=600 --max_background_flushes=4 --num_multi_db=1 --max_background_compactions=16 --seed=1522388277 -write_buffer_size=1048576 --level0_file_num_compaction_trigger=10000 --compression_type=none

Before and after the modification, I re-ran this command on the same machine; the results are as follows:

**fillrandom**

| Table | P50 | P75 | P99 | P99.9 | P99.99 |
| ---- | --- | --- | --- | ----- | ------ |
| before commit | 5.92 | 8.57 | 19.63 | 980.97 | 12196.00 |
| after commit | 5.91 | 8.55 | 19.34 | 965.56 | 13513.56 |

**seekrandomwhilewriting**

| Table | P50 | P75 | P99 | P99.9 | P99.99 |
| ---- | --- | --- | --- | ----- | ------ |
| before commit | 1418.62 | 1867.01 | 3823.28 | 4980.99 | 9240.00 |
| after commit | 1450.54 | 1880.61 | 3962.87 | 5429.60 | 7542.86 |
Closes facebook#3800

Differential Revision: D7874245

Pulled By: ajkr

fbshipit-source-id: 2e8bec781b3f7399246babd66395c88619534a17

Summary:
Add a new table property, rocksdb.num.range-deletions, which tracks the
number of range deletions in a block-based table. Range deletions are no
longer counted in rocksdb.num.entries; as discovered in PR facebook#3778, there
are various code paths that implicitly assume that rocksdb.num.entries
counts only true keys, not range deletions.

/cc ajkr nvanbenschoten
Closes facebook#4016

Differential Revision: D8527575

Pulled By: ajkr

fbshipit-source-id: 92e7edbe78fda53756a558013c9fb496e7764fd7
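
To show how the new property surfaces to users, here is a short sketch that dumps it for every live sstable. It assumes the TableProperties member is named num_range_deletions, mirroring the property string; treat the name as an assumption.

```cpp
#include <iostream>

#include "rocksdb/db.h"
#include "rocksdb/table_properties.h"

// Sketch: print the range-deletion count recorded in each sstable's
// table properties. Assumes TableProperties exposes num_range_deletions
// for the rocksdb.num.range-deletions property.
void DumpRangeDeletionCounts(rocksdb::DB* db) {
  rocksdb::TablePropertiesCollection props;
  rocksdb::Status s = db->GetPropertiesOfAllTables(&props);
  if (!s.ok()) {
    return;  // e.g., I/O error while reading table metadata
  }
  for (const auto& entry : props) {
    std::cout << entry.first << ": " << entry.second->num_range_deletions
              << " range deletions\n";
  }
}
```
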
Summary:
This fixes the same performance issue that facebook#3992 fixes but with much more invasive cleanup.

I'm more excited about this PR because it paves the way for fixing another problem we uncovered at Cockroach where range deletion tombstones can cause massive compactions. For example, suppose L4 contains deletions from [a, c) and [x, z) and no other keys, and L5 is entirely empty. L6, however, is full of data. When compacting L4 -> L5, we'll end up with one file that spans, massively, from [a, z). When we go to compact L5 -> L6, we'll have to rewrite all of L6! If, instead of range deletions in L4, we had keys a, b, x, y, and z, RocksDB would have been smart enough to create two files in L5: one for a and b and another for x, y, and z.

With the changes in this PR, it will be possible to adjust the compaction logic to split tombstones/start new output files when they would span too many files in the grandparent level.
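
To make that concrete, here is a self-contained sketch (not RocksDB code; all names are illustrative) of the kind of check the compaction logic can use once tombstones can be split: close the current output file, or split the tombstone, as soon as the output would overlap too much grandparent data.

```cpp
#include <cstdint>
#include <string>
#include <vector>

struct FileRange {
  std::string smallest, largest;  // user-key bounds of a grandparent file
  uint64_t size;                  // file size in bytes
};

// Illustrative check: returns true when extending the current output
// file through `next_key` would overlap more than `max_overlap` bytes
// of the grandparent level, i.e., the compaction should cut a new
// output file (or split a tombstone) here. `grandparents` is sorted.
bool ShouldCutOutput(const std::vector<FileRange>& grandparents,
                     const std::string& output_smallest,
                     const std::string& next_key, uint64_t max_overlap) {
  uint64_t overlapped = 0;
  for (const auto& f : grandparents) {
    if (f.largest < output_smallest || f.smallest > next_key) {
      continue;  // file lies entirely outside the output's key range
    }
    overlapped += f.size;
    if (overlapped > max_overlap) {
      return true;
    }
  }
  return false;
}
```

In the [a, z) example, splitting the tombstone at grandparent boundaries would yield two narrow L5 files instead of one file spanning the entire level.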

ajkr please take a look when you have a minute!
Pull Request resolved: facebook#4014

Differential Revision: D8773253

Pulled By: ajkr

fbshipit-source-id: ec62fa85f648fdebe1380b83ed997f9baec35677

During iteration, skip sstables that are entirely covered by a range
tombstone. The check to do this is loose: we check that an sstable's
[smallest_key,largest_key) at largest_seqno is covered by a range
tombstone.
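
A sketch of that loose check (simplified, illustrative types; the real code works through the RangeDelAggregator on internal keys): a file is skippable only when a single tombstone spans its whole user-key range at a sequence number newer than anything in the file.

```cpp
#include <cstdint>
#include <string>

struct Tombstone {
  std::string start, end;  // covers user keys in [start, end)
  uint64_t seqno;
};

// Illustrative version of the pessimistic whole-file check: skip the
// sstable only if the tombstone contains every key the file could hold
// and is newer than the file's newest entry.
bool SstableEntirelyCovered(const Tombstone& t,
                            const std::string& smallest_key,
                            const std::string& largest_key,
                            uint64_t largest_seqno) {
  return t.start <= smallest_key && t.end >= largest_key &&
         t.seqno > largest_seqno;
}
```
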
@petermattis petermattis requested a review from benesch July 13, 2018 17:50
@petermattis (Author)

Sorry about the mega-PR. Let me know if there is a way to make this easier to review. Would you prefer a series of smaller PRs?

@benesch benesch left a comment

:lgtm:

I didn't go through this with a fine-toothed comb, though. Mostly assumed that code was unchanged, except for anything touching the RangeDelAggregator.

Reviewed 3 of 3 files at r1, 8 of 8 files at r2, 10 of 10 files at r3, 5 of 5 files at r4, 1 of 1 files at r5, 4 of 4 files at r6, 11 of 11 files at r7, 2 of 2 files at r8, 10 of 10 files at r9, 6 of 6 files at r10.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved


db/range_del_aggregator.cc, line 80 at r6 (raw file):

    // Only supported in CollapsedRangeDelMap.
    return RangeTombstone(Slice(), Slice(), 0);
  }

Seems like this should be an assertion failure?

@petermattis petermattis (Author) left a comment

Reviewable status: :shipit: complete! all files reviewed, all discussions resolved


db/range_del_aggregator.cc, line 80 at r6 (raw file):

Previously, benesch (Nikhil Benesch) wrote…

Seems like this should be an assertion failure?

Yeah. This was the prior behavior of the code. I'll toss in an abort() before merging.
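
For concreteness, the change described presumably replaces the silent fallback quoted above with something like this (a sketch, not the merged diff):

```cpp
    // Only supported in CollapsedRangeDelMap. Reaching here in any other
    // implementation is a logic error, so fail loudly instead of
    // returning an empty tombstone.
    abort();
  }
```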

RangeDelAggregator::GetTombstone retrieves the range tombstone at the
specified user-key and sequence number. This will be used by a subsequent
commit to allow skipping over ranges of keys during iteration which are
covered by a tombstone.

BlockBasedTableIterator now maintains a current tombstone which is a
range of keys that are either all deleted in the sstable or all
non-deleted. It uses this key range to quickly skip over keys during
iteration. The check for non-deleted is pessimistic: a key lying within
a non-deleted tombstone range might still be deleted by a more precise
check of the key's sequence number, which is handled by the normal
RangeDelAggregator::ShouldDelete check in DBIter.
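
A simplified, standalone sketch of how a cached tombstone range lets the iterator short-circuit (illustrative names; the real logic lives inside BlockBasedTableIterator):

```cpp
#include <string>

struct TombstoneRange {
  std::string start, end;  // user-key range [start, end)
  bool deleted;            // true: every key in the range is covered
};

// Illustrative skip: while positioned inside a fully-deleted cached
// range, seek straight to the range's end instead of examining every
// key. Keys in a non-deleted range still receive the precise
// per-sequence-number ShouldDelete check later, in DBIter.
template <typename Iter>
void SkipCoveredKeys(Iter* iter, const TombstoneRange& range) {
  if (range.deleted && iter->Valid() && iter->key() >= range.start &&
      iter->key() < range.end) {
    iter->Seek(range.end);  // jump over the covered span in one step
  }
}
```
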
When adding range tombstones to a RangeDelAggregator, truncate the
tombstones to the sstable boundaries. This is done as an alternative to
truncating the range tombstones within the sstables themselves as it
properly handles existing data where range tombstones are not truncated
in sstables. Truncating range tombstones to sstable boundaries avoids
having to process all of the sstables with overlapping tombstones as a
unit.

See facebook#4050
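
A minimal sketch of the truncation step (illustrative; end-key handling around the sentinel is more subtle in the real code): clamp each tombstone read from an sstable to that file's own key bounds before handing it to the aggregator.

```cpp
#include <algorithm>
#include <string>

struct Tombstone {
  std::string start, end;  // covers user keys in [start, end)
};

// Illustrative truncation: keys outside this file's bounds are owned by
// (and deleted via) the neighboring files, so after clamping, the
// overlapping files no longer need to be processed as a unit.
Tombstone TruncateToFileBounds(Tombstone t, const std::string& file_smallest,
                               const std::string& file_largest) {
  t.start = std::max(t.start, file_smallest);
  t.end = std::min(t.end, file_largest);
  return t;
}
```
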
Do not consider the range tombstone sentinel key as causing 2 adjacent
sstables in a level to overlap. When a range tombstone's end key is the
largest key in an sstable, the sstable's end key is set to a "sentinel"
value that is the smallest key in the next sstable with a sequence
number of kMaxSequenceNumber. This "sentinel" is guaranteed to not
overlap in internal-key space with the next sstable. Unfortunately,
GetOverlappingInputs uses user keys to determine overlap and was thus
considering 2 adjacent sstables in a level to overlap if they were
separated by this sentinel key. This in turn would cause compactions to
be larger than necessary.

This previous behavior of GetOverlappingInputs was necessary due to the
following scenario:

  * Write a delete range [a, d).
  * After compaction, this deleted range may be added to multiple sst
    files: a.sst, b.sst, c.sst, but the boundaries of these sst files
    are [a, b), [b, c), [c, d).
  * When a.sst and b.sst reach the bottommost level, the delete range of
    the sst files will be dropped.
  * Write a new key in the range [a, c).
  * When the newly written key [a, c) reaches the bottommost level, its
    sequence number will be set to zero.
  * When c.sst later compacts with the newly written key in the range
    [a, c), the sequence number of that key is zero, and the key will be
    incorrectly dropped.

That scenario no longer occurs because we are truncating range deletion
tombstones to sstable boundaries when adding them to RangeDelAggregator.
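
A sketch of the adjusted overlap test (simplified to user keys plus a sequence number; kMaxSequenceNumber's value matches RocksDB's 56-bit sequence space): two adjacent files that share a user key only through the sentinel no longer count as overlapping.

```cpp
#include <cstdint>
#include <string>

constexpr uint64_t kMaxSequenceNumber = (1ull << 56) - 1;

struct Boundary {
  std::string user_key;
  uint64_t seqno;
};

// Illustrative: a file whose largest key is the sentinel (the next
// file's smallest user key at kMaxSequenceNumber) shares that user key
// with its neighbor without overlapping it in internal-key space, so
// GetOverlappingInputs should not pull both files into one compaction.
bool FilesOverlap(const Boundary& left_largest,
                  const Boundary& right_smallest) {
  if (left_largest.user_key < right_smallest.user_key) {
    return false;  // disjoint user-key ranges
  }
  if (left_largest.user_key == right_smallest.user_key &&
      left_largest.seqno == kMaxSequenceNumber) {
    return false;  // range tombstone sentinel, not a real shared key
  }
  return true;
}
```
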
@petermattis petermattis merged commit f91cf6f into cockroachdb:crl-release-5.13.new Jul 13, 2018
@petermattis petermattis deleted the pmattis/backport branch July 13, 2018 19:16