Speed up performance of range tombstones #3992

Closed
benesch wants to merge 4 commits

Conversation

benesch (Contributor) commented Jun 13, 2018

I've discovered what seems to be a bug in the collapsing of overlapping range tombstones. Specifically, when a new tombstone that overlapped an existing tombstone was added to a collapsing RangeDelAggregator, every tombstone to the right of the existing tombstone was scanned, instead of the loop exiting once the new tombstone's endpoints had been inserted.
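
The fix, roughly, is that only the entries overlapping the new tombstone need to be touched, so the merge loop can stop once the new tombstone's end point has been handled. Below is a minimal sketch of that idea against a collapsed map; it is not the actual RocksDB code, and (unlike the real aggregator) it assumes the new tombstone's sequence number is newer than anything it overlaps:

```cpp
#include <cstdint>
#include <iterator>
#include <map>
#include <string>

// Sketch of the collapsed representation: each entry maps a start key to the
// sequence number in effect from that key up to the next key in the map;
// a value of 0 means "no tombstone covers this gap".
using CollapsedMap = std::map<std::string, uint64_t>;

void AddTombstone(CollapsedMap& m, const std::string& start,
                  const std::string& end, uint64_t seq) {
  if (start >= end) return;  // [x, x) covers no keys
  // Remember the seqnum in effect just before `end`, so the tail of any
  // partially covered existing tombstone survives past the new one.
  uint64_t seq_after_end = 0;
  auto it = m.upper_bound(end);
  if (it != m.begin()) seq_after_end = std::prev(it)->second;
  // Only the entries inside [start, end) need touching; the fix amounts to
  // stopping here instead of also walking every entry to the right of `end`.
  m.erase(m.lower_bound(start), m.lower_bound(end));
  m[end] = seq_after_end;
  m[start] = seq;
}
```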

I have more testing/benchmarking to do, but the initial results are quite promising!

Before
    0 tombstones  14.330 micros/op 69783 ops/sec; (51 of 100 found)
 ~500 tombstones  646.740 micros/op 1546 ops/sec; (60 of 100 found)
~5000 tombstones  100793.870 micros/op 9 ops/sec; (58 of 100 found)

After
    0 tombstones  13.680 micros/op 73099 ops/sec; (51 of 100 found)
 ~500 tombstones  568.480 micros/op 1759 ops/sec; (60 of 100 found)
~5000 tombstones  5700.330 micros/op 175 ops/sec; (58 of 100 found)

(That's seekrandom -num=100 on a database created with fillrandom.)

benesch added 2 commits June 13, 2018 03:25
Range collapsing is complicated enough that it deserves testing of its
internal representation.
Tombstone start points are inclusive, so collapsing a tombstone that
starts at the same key as an existing tombstone can be handled exactly
the same way as collapsing a tombstone that starts before an existing
tombstone.
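
Concretely (an illustrative sketch, not the PR's code; the helper name is hypothetical), the start-key comparison can fold the two cases into a single branch:

```cpp
#include "rocksdb/comparator.h"
#include "rocksdb/slice.h"

// Hypothetical helper: because start keys are inclusive, "new starts at the
// same key as existing" and "new starts before existing" are handled by the
// same branch.
bool NewStartsAtOrBefore(const rocksdb::Slice& new_start,
                         const rocksdb::Slice& existing_start,
                         const rocksdb::Comparator* ucmp) {
  return ucmp->Compare(new_start, existing_start) <= 0;
}
```
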
@@ -354,6 +340,16 @@ Status RangeDelAggregator::AddTombstone(RangeTombstone tombstone) {
}
}

// If we're looking at either the last new tombstone or the last existing
A reviewer commented:
The previous code did new_to_old_end_cmp = 1 which would only advance the tombstone map iter, right?

I'd probably just have to sit down and really read this code, but I don't understand why the condition here is symmetric and always results in the new tombstone ending before the existing. If new_range_dels_iter_end is not null, but tombstone_map_iter_end is nullptr, isn't the situation that the new tombstone "sticks out" of the existing one?

benesch (Contributor, Author) Jun 13, 2018

> The previous code did new_to_old_end_cmp = 1 which would only advance the tombstone map iter, right?

Correct. That resulted in looking at every existing tombstone after the new tombstone.

> I'd probably just have to sit down and really read this code, but I don't understand why the condition here is symmetric and always results in the new tombstone ending before the existing. If new_range_dels_iter_end is not null, but tombstone_map_iter_end is nullptr, isn't the situation that the new tombstone "sticks out" of the existing one?

Yeah, I went back and forth on the clearest way to express this. The way to think about it—I think—is that the new_range_dels_iter_end == nullptr means the new tombstone is just one point wide. Suppose we're adding tombstone [a, f) when tombstone [e, j) already exists:

@2  a----f
@1     e----j

The existing tombstones are [e, j) @ 1 and [j, nullptr) @ 0. The new tombstones are [a, f) @ 2 and [f, nullptr) @ 0. As before, [j, nullptr) @ 0 is infinitely wide. I'm claiming that, instead of also pretending that [f, nullptr) @ 0 is infinitely wide, we can pretend it's exactly one point wide, i.e., [f, f] @ 0, because it serves only to mark the endpoint of the real tombstone [a, f) @ 2.

Anyway, as soon as we see a single-point (new) tombstone, we know its end key is less than whatever existing tombstone we're comparing it against. Why? Well, we're guaranteed that (a) any two tombstones that the loop compares are overlapping, and (b) existing tombstones are not single points. There's no way to satisfy both guarantees unless the existing tombstone's upper bound is greater than the single-point (new) tombstone's key.

Here are the four cases spelled out:

| new_range_dels_iter_end | tombstone_map_iter_end | explanation |
| --- | --- | --- |
| not null | not null | This case is easy: compare the two ends directly. |
| not null | null | Also easy: a null existing tombstone end means the tombstone extends infinitely to the right, so it obviously ends after any finite range deletion tombstone. |
| null | not null | New tombstone end is less, by the logic above. |
| null | null | New tombstone end is less, by the logic above. |
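
In code, the comparison described by the table might look roughly like this (an illustrative sketch, not the actual diff; the function name is made up, while the parameter names follow the discussion):

```cpp
#include "rocksdb/comparator.h"
#include "rocksdb/slice.h"

// Sketch of the end-point comparison in the table above; a negative result
// means the new tombstone ends before the existing one.
int CompareTombstoneEnds(const rocksdb::Slice* new_range_dels_iter_end,
                         const rocksdb::Slice* tombstone_map_iter_end,
                         const rocksdb::Comparator* ucmp) {
  if (new_range_dels_iter_end != nullptr && tombstone_map_iter_end != nullptr) {
    // Both ends are real keys: compare them directly.
    return ucmp->Compare(*new_range_dels_iter_end, *tombstone_map_iter_end);
  }
  if (new_range_dels_iter_end != nullptr && tombstone_map_iter_end == nullptr) {
    // The existing tombstone extends infinitely to the right, so any finite
    // new tombstone ends first.
    return -1;
  }
  // new_range_dels_iter_end == nullptr: the new sentinel is one point wide,
  // so by the argument above it ends before the existing tombstone in both
  // remaining cases.
  return -1;
}
```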

This is all very confusing because this method is written as if it's going to merge multiple points at once. @ajkr back in #1614 (comment) you wrote:

> one of the motivations for using a vector of new points is that, in a future optimization, we're considering merging multiple new points simultaneously (e.g., all points from a snapshot stripe in an SST file that we just read), which should be faster than merging new points one-at-a-time.

Is that still your eventual plan?

ajkr (Contributor) Jul 9, 2018

Good find. We could make use of the ordering of range deletions in the meta-block (within each snapshot stripe). But doing this is not on any foreseeable roadmap, and I hope it'll be made unnecessary by Cockroach's optimizations :). So feel free to get rid of this overly general code if you wish.

benesch added 2 commits June 13, 2018 12:30
These conditions are guaranteed by the loop condition. No need to
recheck them.
benesch force-pushed the fast-range-tombstones branch from bde941c to 825f0eb on June 13, 2018 at 16:34
benesch added a commit to benesch/cockroach that referenced this pull request Jun 21, 2018
The current implementation of range deletion tombstones in RocksDB
suffers from a performance bug that causes excessive CPU usage on every
read operation in a database with many range tombstones. Dropping a
large table can easily result in several thousand range deletion
tombstones in one store, resulting in an unusable cluster as documented
in cockroachdb#24029.

Backport a refactoring of range deletion tombstones that fixes the
performance problem. This refactoring has also been proposed upstream as
facebook/rocksdb#4014.

A more minimal change was also proposed in facebook/rocksdb#3992; that
patch better highlights the exact nature of the bug than the patch
backported here, for those looking to understand the problem. But this
refactoring, though more invasive, gets us one step closer to solving a
related problem where range deletions can cause excessively large
compactions (cockroachdb#26693). These large compactions do not appear
to brick the cluster but undoubtedly have some impact on performance.

Fix cockroachdb#24029.

Release note: None
ajkr self-requested a review on July 9, 2018 at 19:54
ajkr (Contributor) commented Jul 9, 2018

I will close this and review #4014 instead.

ajkr closed this Jul 9, 2018
benesch added a commit to benesch/rocksdb that referenced this pull request Jul 12, 2018
The range deletion aggregator supports "uncollapsed" and "collapsed"
tombstone maps. Uncollapsed tombstone maps are naturally represented as
a std::multiset, while collapsed tombstone maps are naturally
represented as a std::map.

The previous implementation used a std::multimap that was general enough
to store either the uncollapsed or the collapsed representation.
Confusingly, its keys and values had different meanings depending on
which representation was being used.

Extract a TombstoneMap interface which has two concrete implementations,
uncollapsed and collapsed. This allows the implementations to use their
natural representation, and keeps the code for manipulating that
representation within one class.

In the process, simplify and comment the
CollapsedTombstoneMap::AddTombstone method. This refactor brings with it
a large performance improvement. The details of the performance bug in
the old implementation, and a more targeted fix, can be found in an
earlier, abandoned PR, facebook#3992.

The refactor also exposed a bug in the ObsoleteTombstoneCleanup test,
which was installing tombstones like [dr1, dr1) that cover no keys due
to the exclusivity of the upper bound.
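
To make the distinction concrete, here is a rough sketch of the shape such a split could take, using plain std::string keys and uint64_t sequence numbers instead of RocksDB's internal types; all names and signatures here are hypothetical, not taken from the actual patch:

```cpp
#include <cstdint>
#include <iterator>
#include <map>
#include <set>
#include <string>
#include <utility>

// Hypothetical shapes only, not the code from the follow-up PR.
struct Tombstone {
  std::string start, end;  // covers [start, end), end exclusive
  uint64_t seq;
};

class TombstoneMap {
 public:
  virtual ~TombstoneMap() = default;
  virtual void AddTombstone(Tombstone t) = 0;
  // True if some tombstone newer than `seq` covers `key`.
  virtual bool ShouldDelete(const std::string& key, uint64_t seq) const = 0;
};

// Uncollapsed: tombstones are stored as-is, so overlapping ranges may
// coexist and a multiset ordered by start key is the natural representation.
class UncollapsedTombstoneMap : public TombstoneMap {
 public:
  void AddTombstone(Tombstone t) override { rep_.insert(std::move(t)); }
  bool ShouldDelete(const std::string& key, uint64_t seq) const override {
    for (const auto& t : rep_) {
      if (t.start <= key && key < t.end && t.seq > seq) return true;
    }
    return false;
  }

 private:
  struct ByStart {
    bool operator()(const Tombstone& a, const Tombstone& b) const {
      return a.start < b.start;
    }
  };
  std::multiset<Tombstone, ByStart> rep_;
};

// Collapsed: the key space is partitioned into disjoint intervals, each
// mapped to the seqnum covering it (0 = uncovered), so a std::map from start
// key to seqnum is the natural representation and lookups are a single search.
class CollapsedTombstoneMap : public TombstoneMap {
 public:
  void AddTombstone(Tombstone t) override {
    // Same newest-wins simplification as the sketch in the PR description.
    if (t.start >= t.end) return;
    uint64_t seq_after_end = 0;
    auto it = rep_.upper_bound(t.end);
    if (it != rep_.begin()) seq_after_end = std::prev(it)->second;
    rep_.erase(rep_.lower_bound(t.start), rep_.lower_bound(t.end));
    rep_[t.end] = seq_after_end;
    rep_[t.start] = t.seq;
  }
  bool ShouldDelete(const std::string& key, uint64_t seq) const override {
    auto it = rep_.upper_bound(key);
    if (it == rep_.begin()) return false;
    return std::prev(it)->second > seq;
  }

 private:
  std::map<std::string, uint64_t> rep_;
};
```
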
facebook-github-bot pushed a commit that referenced this pull request Jul 12, 2018
Summary:
This fixes the same performance issue that #3992 fixes but with much more invasive cleanup.

I'm more excited about this PR because it paves the way for fixing another problem we uncovered at Cockroach where range deletion tombstones can cause massive compactions. For example, suppose L4 contains deletions from [a, c) and [x, z) and no other keys, and L5 is entirely empty. L6, however, is full of data. When compacting L4 -> L5, we'll end up with one file that spans, massively, from [a, z). When we go to compact L5 -> L6, we'll have to rewrite all of L6! If, instead of range deletions in L4, we had keys a, b, x, y, and z, RocksDB would have been smart enough to create two files in L5: one for a and b and another for x, y, and z.

With the changes in this PR, it will be possible to adjust the compaction logic to split tombstones/start new output files when they would span too many files in the grandparent level.

ajkr please take a look when you have a minute!
Pull Request resolved: #4014

Differential Revision: D8773253

Pulled By: ajkr

fbshipit-source-id: ec62fa85f648fdebe1380b83ed997f9baec35677
petermattis pushed a commit to petermattis/rocksdb that referenced this pull request Jul 13, 2018
rcane pushed a commit to rcane/rocksdb that referenced this pull request Sep 13, 2018