Backport the various range tombstone PRs #8
Conversation
Avoid adding tombstones of the same file to RangeDelAggregator multiple times

Summary: The original commit facebook#3635 hurt performance for users who aren't using range deletions, because of unneeded std::set operations, so it was reverted by commit 44653c7 (see facebook#3672). To fix this, move the set into RangeDelAggregator and add a check in AddTombstones, i.e., a file is added only if its range tombstone iterator is non-nullptr.

The db_bench command which found the performance regression:

```
./db_bench --benchmarks=fillrandom,seekrandomwhilewriting --threads=1 --num=1000000 --reads=150000 --key_size=66 \
  --value_size=1262 --statistics=0 --compression_ratio=0.5 --histogram=1 --seek_nexts=1 --stats_per_interval=1 \
  --stats_interval_seconds=600 --max_background_flushes=4 --num_multi_db=1 --max_background_compactions=16 \
  --seed=1522388277 -write_buffer_size=1048576 --level0_file_num_compaction_trigger=10000 --compression_type=none
```

I re-ran this command on the same machine before and after the modification; the results are as follows:

**fillrandom**

| | P50 | P75 | P99 | P99.9 | P99.99 |
| ---- | --- | --- | --- | ----- | ------ |
| before commit | 5.92 | 8.57 | 19.63 | 980.97 | 12196.00 |
| after commit | 5.91 | 8.55 | 19.34 | 965.56 | 13513.56 |

**seekrandomwhilewriting**

| | P50 | P75 | P99 | P99.9 | P99.99 |
| ---- | --- | --- | --- | ----- | ------ |
| before commit | 1418.62 | 1867.01 | 3823.28 | 4980.99 | 9240.00 |
| after commit | 1450.54 | 1880.61 | 3962.87 | 5429.60 | 7542.86 |

Closes facebook#3800
Differential Revision: D7874245
Pulled By: ajkr
fbshipit-source-id: 2e8bec781b3f7399246babd66395c88619534a17
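For illustration, here is a minimal sketch of the dedup check described above. The class, member, and method names (`FileDedupSketch`, `seen_files_`, `MarkFileSeen`) are invented for this sketch, not the actual diff:

```cpp
#include <cstdint>
#include <set>

// Hypothetical sketch: remember which files' tombstones have already been
// added, so tombstone registration happens at most once per file.
class FileDedupSketch {
 public:
  // Returns true only the first time a file number is seen; on true, the
  // caller would add the file's tombstones (if its iterator is non-nullptr).
  bool MarkFileSeen(uint64_t file_number) {
    return seen_files_.insert(file_number).second;
  }

 private:
  std::set<uint64_t> seen_files_;  // files whose tombstones were added
};
```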
Summary: Add a new table property, rocksdb.num.range-deletions, which tracks the number of range deletions in a block-based table. Range deletions are no longer counted in rocksdb.num.entries; as discovered in PR facebook#3778, there are various code paths that implicitly assume that rocksdb.num.entries counts only true keys, not range deletions.

/cc ajkr nvanbenschoten

Closes facebook#4016
Differential Revision: D8527575
Pulled By: ajkr
fbshipit-source-id: 92e7edbe78fda53756a558013c9fb496e7764fd7
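As a usage sketch, the new property should be visible through the table-properties API; this assumes the `TableProperties` struct field is named `num_range_deletions`, matching the property string above:

```cpp
#include <cstdio>

#include "rocksdb/db.h"
#include "rocksdb/table_properties.h"

// Print per-file entry and range-deletion counts for every table file.
void PrintRangeDeletionCounts(rocksdb::DB* db) {
  rocksdb::TablePropertiesCollection props;
  rocksdb::Status s = db->GetPropertiesOfAllTables(&props);
  if (!s.ok()) {
    return;
  }
  for (const auto& entry : props) {
    std::printf(
        "%s: entries=%llu range-deletions=%llu\n", entry.first.c_str(),
        static_cast<unsigned long long>(entry.second->num_entries),
        static_cast<unsigned long long>(entry.second->num_range_deletions));
  }
}
```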
Summary: This fixes the same performance issue that facebook#3992 fixes, but with much more invasive cleanup. I'm more excited about this PR because it paves the way for fixing another problem we uncovered at Cockroach, where range deletion tombstones can cause massive compactions. For example, suppose L4 contains deletions from [a, c) and [x, z) and no other keys, and L5 is entirely empty. L6, however, is full of data. When compacting L4 -> L5, we'll end up with one file that spans, massively, from [a, z). When we go to compact L5 -> L6, we'll have to rewrite all of L6! If, instead of range deletions in L4, we had keys a, b, x, y, and z, RocksDB would have been smart enough to create two files in L5: one for a and b and another for x, y, and z.

With the changes in this PR, it will be possible to adjust the compaction logic to split tombstones/start new output files when they would span too many files in the grandparent level.

ajkr please take a look when you have a minute!

Pull Request resolved: facebook#4014
Differential Revision: D8773253
Pulled By: ajkr
fbshipit-source-id: ec62fa85f648fdebe1380b83ed997f9baec35677
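The splitting heuristic this paves the way for might look roughly like the following; this is an invented illustration of the idea, not code from this PR:

```cpp
#include <cstdint>

// Invented sketch: cut the current compaction output file (and truncate any
// open range tombstone at the cut) once it overlaps too much grandparent
// data, so a single tombstone-only file cannot force rewriting all of L6.
bool ShouldStartNewOutputFile(uint64_t overlapped_grandparent_bytes,
                              uint64_t max_grandparent_overlap_bytes) {
  return overlapped_grandparent_bytes > max_grandparent_overlap_bytes;
}
```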
During iteration, skip sstables that are entirely covered by a range tombstone. The check to do this is loose: we check that an sstable's [smallest_key, largest_key) at largest_seqno is covered by a range tombstone.
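Concretely, the loose check amounts to something like the sketch below; the helper is invented for illustration (it uses RocksDB's internal `RangeTombstone` type, where a tombstone at `seq_` deletes keys with strictly smaller sequence numbers):

```cpp
#include "db/dbformat.h"  // RangeTombstone, SequenceNumber (internal header)
#include "rocksdb/comparator.h"
#include "rocksdb/slice.h"

namespace rocksdb {

// A tombstone covers the whole file if it is newer than every key in the
// file (seq_ > largest_seqno) and its user-key range [start_key_, end_key_)
// contains the file's [smallest, largest).
bool TombstoneCoversFile(const RangeTombstone& t, const Slice& smallest,
                         const Slice& largest, SequenceNumber largest_seqno,
                         const Comparator* ucmp) {
  return t.seq_ > largest_seqno &&
         ucmp->Compare(t.start_key_, smallest) <= 0 &&
         ucmp->Compare(largest, t.end_key_) <= 0;
}

}  // namespace rocksdb
```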
Sorry about the mega-PR. Let me know if there is a way to make this easier to review. Would you prefer a series of smaller PRs?
I didn't go through this with a fine-toothed comb, though. Mostly assumed that code was unchanged, except for anything touching the RangeDelAggregator.
Reviewed 3 of 3 files at r1, 8 of 8 files at r2, 10 of 10 files at r3, 5 of 5 files at r4, 1 of 1 files at r5, 4 of 4 files at r6, 11 of 11 files at r7, 2 of 2 files at r8, 10 of 10 files at r9, 6 of 6 files at r10.
Reviewable status: complete! All files reviewed, all discussions resolved.
db/range_del_aggregator.cc, line 80 at r6 (raw file):
```cpp
  // Only supported in CollapsedRangeDelMap.
  return RangeTombstone(Slice(), Slice(), 0);
}
```
Seems like this should be an assertion failure?
Reviewable status: complete! All files reviewed, all discussions resolved.
db/range_del_aggregator.cc, line 80 at r6 (raw file):
> Previously, benesch (Nikhil Benesch) wrote…
>
> Seems like this should be an assertion failure?
Yeah. This was the prior behavior of the code. I'll toss in an `abort()` before merging.
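Presumably the stub would end up looking something like this sketch (not the merged diff):

```cpp
#include <cstdlib>

#include "db/dbformat.h"  // RangeTombstone, SequenceNumber (internal header)

// Only supported in CollapsedRangeDelMap; reaching here is a logic error,
// so fail loudly instead of returning a meaningless empty tombstone.
RangeTombstone GetTombstone(const Slice& /*user_key*/,
                            SequenceNumber /*seqno*/) {
  abort();
}
```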
RangeDelAggregator::GetTombstone retrieves the range tombstone at the specified user-key and sequence number. This will be used by a subsequent commit to allow skipping over ranges of keys that are covered by a tombstone during iteration.
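The shape of the API, reconstructed from this description (the exact signature is an assumption, not the actual header):

```cpp
#include "db/dbformat.h"  // RangeTombstone, SequenceNumber (internal header)
#include "rocksdb/slice.h"

// Returns the tombstone whose user-key range contains user_key and which is
// visible at seqno; an empty tombstone (empty keys, seq 0) means "none".
rocksdb::RangeTombstone GetTombstone(const rocksdb::Slice& user_key,
                                     rocksdb::SequenceNumber seqno);
```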
BlockBasedTableIterator now maintains a current tombstone, which is a range of keys that are either all deleted in the sstable or all non-deleted. It uses this key range to quickly skip over keys during iteration. The check for non-deleted is pessimistic: a key lying within a non-deleted tombstone range might still be deleted by a more precise check of the key's sequence number, which is taken care of by the normal RangeDelAggregator::ShouldDelete check in DBIter.
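A rough sketch of the skip follows; the helper is invented, it glosses over internal-key encoding (a real implementation must seek with an internal key, not a user key), and, as noted above, DBIter's ShouldDelete still re-checks sequence numbers:

```cpp
#include "db/dbformat.h"  // ExtractUserKey, RangeTombstone (internal header)
#include "rocksdb/comparator.h"
#include "table/internal_iterator.h"

namespace rocksdb {

// If the iterator sits inside a tombstone range known to be deleted, seek
// straight to the tombstone's end instead of calling Next() per covered key.
void MaybeSkipDeletedRange(InternalIterator* iter, const RangeTombstone& t,
                           bool range_is_deleted, const Comparator* ucmp) {
  if (range_is_deleted && iter->Valid() &&
      ucmp->Compare(ExtractUserKey(iter->key()), t.end_key_) < 0) {
    iter->Seek(t.end_key_);  // simplification: real code seeks an internal key
  }
}

}  // namespace rocksdb
```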
When adding range tombstones to a RangeDelAggregator, truncate the tombstones to the sstable boundaries. This is done as an alternative to truncating the range tombstones within the sstables themselves, as it properly handles existing data where range tombstones are not truncated in sstables. Truncating range tombstones to sstable boundaries avoids having to process all of the sstables with overlapping tombstones as a unit. See facebook#4050
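Truncation itself is essentially a clamp of the tombstone's user-key range to the file's bounds, roughly like this invented helper (one subtlety the real code must handle: a file's largest key is inclusive while tombstone end keys are exclusive):

```cpp
#include "db/dbformat.h"  // RangeTombstone (internal header)
#include "rocksdb/comparator.h"
#include "rocksdb/slice.h"

namespace rocksdb {

// Clamp a tombstone to the sstable's user-key bounds before adding it to the
// RangeDelAggregator. Invented helper for illustration only.
RangeTombstone TruncateToFileBounds(RangeTombstone t, const Slice& smallest,
                                    const Slice& largest,
                                    const Comparator* ucmp) {
  if (ucmp->Compare(t.start_key_, smallest) < 0) {
    t.start_key_ = smallest;
  }
  if (ucmp->Compare(t.end_key_, largest) > 0) {
    t.end_key_ = largest;
  }
  return t;
}

}  // namespace rocksdb
```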
Do not consider the range tombstone sentinel key as causing 2 adjacent sstables in a level to overlap. When a range tombstone's end key is the largest key in an sstable, the sstable's end key is set to a "sentinel" value that is the smallest key in the next sstable with a sequence number of kMaxSequenceNumber. This "sentinel" is guaranteed to not overlap in internal-key space with the next sstable. Unfortunately, GetOverlappingInputs uses user-keys to determine overlap and was thus considering 2 adjacent sstables in a level to overlap if they were separated by this sentinel key. This in turn would cause compactions to be larger than necessary.

The previous behavior of GetOverlappingInputs was necessary due to the following scenario:

* Write a delete range [a, d).
* After compaction, this deleted range may be added to multiple sst files: a.sst, b.sst, c.sst, but the boundaries of these sst files are [a, b), [b, c), [c, d).
* When a.sst and b.sst reach the bottommost level, the delete range of those sst files will be dropped.
* Write a new key in the range [a, c).
* When the newly written key reaches the bottommost level, its sequence number will be set to zero.
* When c.sst is later compacted with that key, the key's sequence number is zero, and the key will be incorrectly dropped.

That scenario no longer occurs because we are truncating range deletion tombstones to sstable boundaries when adding them to RangeDelAggregator.
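The sentinel test reduces to a user-key comparison plus a sequence-number check, along these lines (invented helper; `GetInternalKeySeqno` and `kMaxSequenceNumber` are existing RocksDB internals):

```cpp
#include "db/dbformat.h"  // InternalKey, GetInternalKeySeqno, kMaxSequenceNumber
#include "rocksdb/comparator.h"

namespace rocksdb {

// True when two adjacent files are separated only by a range tombstone
// sentinel: the left file's largest key is (user_key, kMaxSequenceNumber),
// where user_key equals the right file's smallest user key. Such files do
// not truly overlap in internal-key space and need not be compacted together.
bool SeparatedBySentinel(const InternalKey& left_largest,
                         const InternalKey& right_smallest,
                         const Comparator* ucmp) {
  return GetInternalKeySeqno(left_largest.Encode()) == kMaxSequenceNumber &&
         ucmp->Compare(left_largest.user_key(), right_smallest.user_key()) == 0;
}

}  // namespace rocksdb
```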
Force-pushed from c7dcb99 to d6b0247.
Backport the various range tombstone PRs in the same order that we expect them to land upstream. Note that the target for this PR is `crl-release-5.13.new`, not `crl-release-5.13`. After this merges, I'll replace `crl-release-5.13` with `crl-release-5.13.new`. The last 7 commits were from #2, #3 and #4, with minor updates to `range_del_aggregator.cc` to adapt to the new code structure. The first commit is facebook#3800.
to adapt to the new code structure. The first commit is facebook#3800.This change is![Reviewable](https://camo.githubusercontent.com/1541c4039185914e83657d3683ec25920c672c6c5c7ab4240ee7bff601adec0b/68747470733a2f2f72657669657761626c652e696f2f7265766965775f627574746f6e2e737667)