Speed up performance of range tombstones #3992
Conversation
Range collapsing is complicated enough that it deserves testing of its internal representation.
Tombstone start points are inclusive, so collapsing a tombstone that starts at the same key as an existing tombstone can be handled exactly the same way as collapsing a tombstone that starts before an existing tombstone.
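A minimal sketch of why that works, assuming a collapsed representation that maps each inclusive start key to the sequence number in effect from that key until the next key (the map alias and the helper `AddStartPoint` below are hypothetical, not the PR's code): looking up the covering entry with `upper_bound` handles "starts at the same key" and "starts before/inside an existing tombstone" through the same path.

```cpp
#include <algorithm>
#include <cstdint>
#include <iterator>
#include <map>
#include <string>

// Hypothetical collapsed representation: key -> seqnum of the newest
// tombstone covering [key, next key); 0 means nothing is covered there.
using CollapsedMap = std::map<std::string, uint64_t>;

// Record the start point of a new tombstone [start, ...) at `seq`.
void AddStartPoint(CollapsedMap& collapsed, const std::string& start,
                   uint64_t seq) {
  // upper_bound finds the first entry strictly greater than `start`. Because
  // start points are inclusive, the entry just before it (if any) is the one
  // whose span covers `start`, whether `start` equals that entry's key or
  // falls strictly inside its span, so one code path handles both cases.
  auto it = collapsed.upper_bound(start);
  uint64_t covering = (it == collapsed.begin()) ? 0 : std::prev(it)->second;
  collapsed[start] = std::max(seq, covering);
}
```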
db/range_del_aggregator.cc (outdated)
@@ -354,6 +340,16 @@ Status RangeDelAggregator::AddTombstone(RangeTombstone tombstone) {
    }
  }

  // If we're looking at either the last new tombstone or the last existing
The previous code did `new_to_old_end_cmp = 1`, which would only advance the tombstone map iter, right?

I'd probably just have to sit down and really read this code, but I don't understand why the condition here is symmetric and always results in the new tombstone ending before the existing. If `new_range_dels_iter_end` is not null, but `tombstone_map_iter_end` is `nullptr`, isn't the situation that the new tombstone "sticks out" of the existing one?
> The previous code did `new_to_old_end_cmp = 1`, which would only advance the tombstone map iter, right?

Correct. That resulted in looking at every existing tombstone after the new tombstone.

> I'd probably just have to sit down and really read this code, but I don't understand why the condition here is symmetric and always results in the new tombstone ending before the existing. If `new_range_dels_iter_end` is not null, but `tombstone_map_iter_end` is `nullptr`, isn't the situation that the new tombstone "sticks out" of the existing one?
Yeah, I went back and forth on the clearest way to express this. The way to think about it, I think, is that `new_range_dels_iter_end == nullptr` means the new tombstone is just one point wide. Suppose we're adding tombstone [a, f) when tombstone [e, j) already exists:
@2 a----f
@1 e----j
The existing tombstones are [e, j) @ 1 and [j, nullptr) @ 0. The new tombstones are [a, f) @ 2 and [f, nullptr) @ 0. As before, [j, nullptr) @ 0 is infinitely wide. I'm claiming that, instead of also pretending that [f, nullptr) @ 0 is infinitely wide, we can pretend it's exactly one point wide, i.e., [f, f] @ 0, because it serves only to mark the endpoint of the real tombstone [a, f) @ 2.
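To make that concrete, here is a small sketch of the collapsed map contents before and after the merge, using a plain std::map<std::string, uint64_t> rather than the PR's internal types:

```cpp
#include <cstdint>
#include <iostream>
#include <map>
#include <string>

int main() {
  // Collapsed representation (illustrative only): key -> seqnum of the
  // newest tombstone covering [key, next key); 0 marks "nothing covered".

  // Existing tombstone [e, j) @ 1 collapses to {e: 1, j: 0}.
  std::map<std::string, uint64_t> before{{"e", 1}, {"j", 0}};

  // After adding [a, f) @ 2: from a to f the newest cover is seq 2, from f
  // to j coverage falls back to the existing seq 1, and from j onward
  // nothing is covered. The old boundary at e becomes redundant; whether it
  // is dropped or kept (with value 2) is an implementation detail.
  std::map<std::string, uint64_t> after{{"a", 2}, {"f", 1}, {"j", 0}};

  for (const auto& [key, seq] : before) std::cout << key << "->" << seq << " ";
  std::cout << "\n";
  for (const auto& [key, seq] : after) std::cout << key << "->" << seq << " ";
  std::cout << "\n";
}
```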
Anyway, as soon as we see a single-point (new) tombstone, we know its end key is less than the end of whatever existing tombstone we're comparing it against. Why? Well, we're guaranteed that (a) any two tombstones that the loop compares overlap, and (b) existing tombstones are not single points. There's no way to satisfy those guarantees unless the existing tombstone's upper bound is greater than the single-point (new) tombstone's key.
Here are the four cases spelled out:
| new_range_dels_iter_end | tombstone_map_iter_end | explanation |
|---|---|---|
| not null | not null | This case is easy: compare the two ends directly. |
| not null | null | Also easy: a null existing tombstone end means the tombstone extends infinitely to the right, so it obviously ends after any finite range deletion tombstone. |
| null | not null | New tombstone end is less by the logic above. |
| null | null | New tombstone end is less by the logic above. |
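A rough sketch of how those four cases could be folded into a single comparison helper (the function name `CompareEnds` is hypothetical; only the parameter names follow the discussion, and this is not the PR's actual code):

```cpp
#include "rocksdb/comparator.h"
#include "rocksdb/slice.h"

// Hypothetical helper: returns <0 if the new tombstone ends before the
// existing one, 0 if they end together, >0 otherwise, interpreting nullptr
// ends as in the table above.
int CompareEnds(const rocksdb::Comparator* cmp,
                const rocksdb::Slice* new_range_dels_iter_end,
                const rocksdb::Slice* tombstone_map_iter_end) {
  if (new_range_dels_iter_end != nullptr && tombstone_map_iter_end != nullptr) {
    // Row 1: both ends are real keys; compare them directly.
    return cmp->Compare(*new_range_dels_iter_end, *tombstone_map_iter_end);
  }
  if (new_range_dels_iter_end != nullptr) {
    // Row 2: a null existing end extends infinitely to the right, so any
    // finite new end comes first.
    return -1;
  }
  // Rows 3 and 4: a null new end marks a single-point tombstone, which by
  // the overlap argument always ends before the existing tombstone.
  return -1;
}
```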
This is all very confusing because this method is written as if it's going to merge multiple points at once. @ajkr back in #1614 (comment) you wrote:
> one of the motivations for using a vector of new points is that, in a future optimization, we're considering merging multiple new points simultaneously (e.g., all points from a snapshot stripe in an SST file that we just read), which should be faster than merging new points one-at-a-time.
Is that still your eventual plan?
Good find. We could make use of the ordering of range deletions in the meta-block (within each snapshot stripe). But doing this is not on any foreseeable roadmap, and I hope it'll be made unnecessary by Cockroach's optimizations :). So feel free to get rid of this overly general code if you wish.
These conditions are guaranteed by the loop condition. No need to recheck them.
Force-pushed from bde941c to 825f0eb.
The current implementation of range deletion tombstones in RocksDB suffers from a performance bug that causes excessive CPU usage on every read operation in a database with many range tombstones. Dropping a large table can easily result in several thousand range deletion tombstones in one store, resulting in an unusable cluster as documented in cockroachdb#24029.

Backport a refactoring of range deletion tombstones that fixes the performance problem. This refactoring has also been proposed upstream as facebook/rocksdb#4014. A more minimal change was also proposed in facebook/rocksdb#3992; that patch better highlights the exact nature of the bug than the patch backported here, for those looking to understand the problem. But this refactoring, though more invasive, gets us one step closer to solving a related problem where range deletions can cause excessively large compactions (cockroachdb#26693). These large compactions do not appear to brick the cluster but undoubtedly have some impact on performance.

Fix cockroachdb#24029.

Release note: None
I will close this and review #4014 instead.
The range deletion aggregator supports "uncollapsed" and "collapsed" tombstone maps. Uncollapsed tombstone maps are naturally represented as a std::multiset, while collapsed tombstone maps are naturally represented as a std::map. The previous implementation used a std::multimap that was general enough to store either the uncollapsed or the collapsed representation. Confusingly, its keys and values had different meanings depending on which representation was being used.

Extract a TombstoneMap interface which has two concrete implementations, uncollapsed and collapsed. This allows the implementations to use their natural representation and keeps the code for manipulating that representation within one class. In the process, simplify and comment the CollapsedTombstoneMap::AddTombstone method.

This refactor brings with it a large performance improvement. The details of the performance bug in the old implementation, and a more targeted fix, can be found in an earlier, abandoned PR, facebook#3992.

The refactor also exposed a bug in the ObsoleteTombstoneCleanup test, which was installing tombstones like [dr1, dr1) that cover no keys due to the exclusivity of the upper bound.
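A rough sketch of the shape of that split, with simplified stand-in types (the struct and class bodies below are illustrative only; the real classes carry RocksDB's Slice and SequenceNumber types and more methods):

```cpp
#include <cstdint>
#include <iterator>
#include <map>
#include <set>
#include <string>
#include <utility>

// Simplified stand-in for rocksdb::RangeTombstone: covers [start, end) at seq.
struct Tombstone {
  std::string start, end;
  uint64_t seq;
};

// The extracted interface: callers ask whether a key written at `seq` is
// covered by a newer tombstone, without knowing the representation.
class TombstoneMap {
 public:
  virtual ~TombstoneMap() = default;
  virtual void AddTombstone(Tombstone t) = 0;
  virtual bool ShouldDelete(const std::string& key, uint64_t seq) const = 0;
};

// Natural uncollapsed representation: every tombstone kept as-is in a
// multiset ordered by start key; a lookup may scan many overlapping tombstones.
class UncollapsedTombstoneMap : public TombstoneMap {
 public:
  void AddTombstone(Tombstone t) override { rep_.insert(std::move(t)); }
  bool ShouldDelete(const std::string& key, uint64_t seq) const override {
    for (const auto& t : rep_) {
      if (t.start <= key && key < t.end && seq < t.seq) return true;
    }
    return false;
  }

 private:
  struct ByStart {
    bool operator()(const Tombstone& a, const Tombstone& b) const {
      return a.start < b.start;
    }
  };
  std::multiset<Tombstone, ByStart> rep_;
};

// Natural collapsed representation: key -> seqnum covering [key, next key);
// a lookup is a single ordered search regardless of how many tombstones
// were added.
class CollapsedTombstoneMap : public TombstoneMap {
 public:
  // Collapsing on insert is the interesting part; a standalone sketch of it
  // appears after the PR description below.
  void AddTombstone(Tombstone t) override;
  bool ShouldDelete(const std::string& key, uint64_t seq) const override {
    auto it = rep_.upper_bound(key);
    if (it == rep_.begin()) return false;  // key precedes every tombstone
    return seq < std::prev(it)->second;
  }

 private:
  std::map<std::string, uint64_t> rep_;
};
```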
Summary: This fixes the same performance issue that #3992 fixes, but with much more invasive cleanup. I'm more excited about this PR because it paves the way for fixing another problem we uncovered at Cockroach where range deletion tombstones can cause massive compactions.

For example, suppose L4 contains deletions from [a, c) and [x, z) and no other keys, and L5 is entirely empty. L6, however, is full of data. When compacting L4 -> L5, we'll end up with one file that spans, massively, from [a, z). When we go to compact L5 -> L6, we'll have to rewrite all of L6! If, instead of range deletions in L4, we had keys a, b, x, y, and z, RocksDB would have been smart enough to create two files in L5: one for a and b and another for x, y, and z.

With the changes in this PR, it will be possible to adjust the compaction logic to split tombstones/start new output files when they would span too many files in the grandparent level.

ajkr please take a look when you have a minute!

Pull Request resolved: #4014
Differential Revision: D8773253
Pulled By: ajkr
fbshipit-source-id: ec62fa85f648fdebe1380b83ed997f9baec35677
I've discovered what seems to be a bug in the collapsing of overlapping range tombstones. Specifically, when adding a new tombstone to a collapsing RangeDelAggregator that overlapped with an existing tombstone, every tombstone to the right of the existing tombstone would be scanned over, instead of exiting the loop after the new tombstone's endpoints were inserted.
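As a hedged illustration of the intended behavior (not the PR's actual code), here is a straightforward collapse over a std::map<std::string, uint64_t> representation; the point is that the loop over existing boundaries stops as soon as it reaches the new tombstone's end key, rather than scanning every tombstone to its right:

```cpp
#include <algorithm>
#include <cstdint>
#include <iterator>
#include <map>
#include <string>

// Hypothetical collapsed map: key -> seqnum of the newest tombstone covering
// [key, next key); 0 means nothing is covered from that key onward.
using CollapsedMap = std::map<std::string, uint64_t>;

// Precondition: start < end.
void AddTombstone(CollapsedMap& rep, const std::string& start,
                  const std::string& end, uint64_t seq) {
  // Seqnum in effect just before `end`, so coverage resumes correctly there.
  auto after_end = rep.upper_bound(end);
  uint64_t seq_at_end =
      (after_end == rep.begin()) ? 0 : std::prev(after_end)->second;

  // Seqnum in effect at `start`, then record the new tombstone's start point.
  auto after_start = rep.upper_bound(start);
  uint64_t seq_at_start =
      (after_start == rep.begin()) ? 0 : std::prev(after_start)->second;
  rep[start] = std::max(seq, seq_at_start);

  // Raise every boundary strictly inside (start, end) to at least `seq`.
  // Crucially, the loop exits once it passes `end`; boundaries further right
  // cannot be affected, which is exactly what the buggy path kept scanning.
  for (auto it = rep.upper_bound(start); it != rep.end() && it->first < end;
       ++it) {
    it->second = std::max(it->second, seq);
  }

  // Restore the old coverage at `end` if no boundary already sits there.
  if (rep.find(end) == rep.end()) {
    rep[end] = seq_at_end;
  }
}
```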
I have more testing/benchmarking to do, but the initial results are quite promising!
(That's seekrandom -num=100 on a database created with fillrandom.)