Speed up performance of range tombstones #3992
Conversation
Range collapsing is complicated enough that it deserves testing of its internal representation.
Tombstone start points are inclusive, so collapsing a tombstone that starts at the same key as an existing tombstone can be handled exactly the same way as collapsing a tombstone that starts before an existing tombstone.
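A minimal sketch of why that works, assuming a collapsed representation that maps each inclusive start key to the sequence number in effect from that key until the next key (the map alias and the helper `AddStartPoint` below are hypothetical, not the PR's code): looking up the covering entry with `upper_bound` handles "starts at the same key" and "starts before/inside an existing tombstone" through the same path.

```cpp
#include <algorithm>
#include <cstdint>
#include <iterator>
#include <map>
#include <string>

// Hypothetical collapsed representation: key -> seqnum of the newest
// tombstone covering [key, next key); 0 means nothing is covered there.
using CollapsedMap = std::map<std::string, uint64_t>;

// Record the start point of a new tombstone [start, ...) at `seq`.
void AddStartPoint(CollapsedMap& collapsed, const std::string& start,
                   uint64_t seq) {
  // upper_bound finds the first entry strictly greater than `start`. Because
  // start points are inclusive, the entry just before it (if any) is the one
  // whose span covers `start`, whether `start` equals that entry's key or
  // falls strictly inside its span, so one code path handles both cases.
  auto it = collapsed.upper_bound(start);
  uint64_t covering = (it == collapsed.begin()) ? 0 : std::prev(it)->second;
  collapsed[start] = std::max(seq, covering);
}
```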
db/range_del_aggregator.cc (outdated)
@@ -354,6 +340,16 @@ Status RangeDelAggregator::AddTombstone(RangeTombstone tombstone) {
    }
  }

  // If we're looking at either the last new tombstone or the last existing
The previous code did `new_to_old_end_cmp = 1`, which would only advance the tombstone map iter, right?

I'd probably just have to sit down and really read this code, but I don't understand why the condition here is symmetric and always results in the new tombstone ending before the existing. If `new_range_dels_iter_end` is not null, but `tombstone_map_iter_end` is `nullptr`, isn't the situation that the new tombstone "sticks out" of the existing one?
> The previous code did `new_to_old_end_cmp = 1`, which would only advance the tombstone map iter, right?

Correct. That resulted in looking at every existing tombstone after the new tombstone.

> I'd probably just have to sit down and really read this code, but I don't understand why the condition here is symmetric and always results in the new tombstone ending before the existing. If `new_range_dels_iter_end` is not null, but `tombstone_map_iter_end` is `nullptr`, isn't the situation that the new tombstone "sticks out" of the existing one?
Yeah, I went back and forth on the clearest way to express this. The way to think about it, I think, is that `new_range_dels_iter_end == nullptr` means the new tombstone is just one point wide. Suppose we're adding tombstone [a, f) when tombstone [e, j) already exists:
@2 a----f
@1 e----j
The existing tombstones are [e, j) @ 1 and [j, nullptr) @ 0. The new tombstones are [a, f) @ 2 and [f, nullptr) @ 0. As before, [j, nullptr) @ 0 is infinitely wide. I'm claiming that, instead of also pretending that [f, nullptr) @ 0 is infinitely wide, we can pretend it's exactly one point wide, i.e., [f, f] @ 0, because it serves only to mark the endpoint of the real tombstone [a, f) @ 2.
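To make that concrete, here is a small sketch of the collapsed map contents before and after the merge, using a plain std::map<std::string, uint64_t> rather than the PR's internal types:

```cpp
#include <cstdint>
#include <iostream>
#include <map>
#include <string>

int main() {
  // Collapsed representation (illustrative only): key -> seqnum of the
  // newest tombstone covering [key, next key); 0 marks "nothing covered".

  // Existing tombstone [e, j) @ 1 collapses to {e: 1, j: 0}.
  std::map<std::string, uint64_t> before{{"e", 1}, {"j", 0}};

  // After adding [a, f) @ 2: from a to f the newest cover is seq 2, from f
  // to j coverage falls back to the existing seq 1, and from j onward
  // nothing is covered. The old boundary at e becomes redundant; whether it
  // is dropped or kept (with value 2) is an implementation detail.
  std::map<std::string, uint64_t> after{{"a", 2}, {"f", 1}, {"j", 0}};

  for (const auto& [key, seq] : before) std::cout << key << "->" << seq << " ";
  std::cout << "\n";
  for (const auto& [key, seq] : after) std::cout << key << "->" << seq << " ";
  std::cout << "\n";
}
```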
Anyway, as soon as we see a single-point (new) tombstone, we know its end key is less than the end of whatever existing tombstone we're comparing it against. Why? Well, we're guaranteed that (a) any two tombstones that the loop compares overlap, and (b) existing tombstones are not single points. There's no way to satisfy those guarantees unless the existing tombstone's upper bound is greater than the single-point (new) tombstone's key.
Here are the four cases spelled out:
| new_range_dels_iter_end | tombstone_map_iter_end | explanation |
|---|---|---|
| not null | not null | This case is easy: compare the two ends directly. |
| not null | null | Also easy: a null existing tombstone end means the tombstone extends infinitely to the right, so it obviously ends after any finite range deletion tombstone. |
| null | not null | New tombstone end is less by the logic above. |
| null | null | New tombstone end is less by the logic above. |
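A rough sketch of how those four cases could be folded into a single comparison helper (the function name `CompareEnds` is hypothetical; only the parameter names follow the discussion, and this is not the PR's actual code):

```cpp
#include "rocksdb/comparator.h"
#include "rocksdb/slice.h"

// Hypothetical helper: returns <0 if the new tombstone ends before the
// existing one, 0 if they end together, >0 otherwise, interpreting nullptr
// ends as in the table above.
int CompareEnds(const rocksdb::Comparator* cmp,
                const rocksdb::Slice* new_range_dels_iter_end,
                const rocksdb::Slice* tombstone_map_iter_end) {
  if (new_range_dels_iter_end != nullptr && tombstone_map_iter_end != nullptr) {
    // Row 1: both ends are real keys; compare them directly.
    return cmp->Compare(*new_range_dels_iter_end, *tombstone_map_iter_end);
  }
  if (new_range_dels_iter_end != nullptr) {
    // Row 2: a null existing end extends infinitely to the right, so any
    // finite new end comes first.
    return -1;
  }
  // Rows 3 and 4: a null new end marks a single-point tombstone, which by
  // the overlap argument always ends before the existing tombstone.
  return -1;
}
```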
This is all very confusing because this method is written as if it's going to merge multiple points at once. @ajkr back in #1614 (comment) you wrote:
> one of the motivations for using a vector of new points is that, in a future optimization, we're considering merging multiple new points simultaneously (e.g., all points from a snapshot stripe in an SST file that we just read), which should be faster than merging new points one-at-a-time.
Is that still your eventual plan?
Good find. We could make use of the ordering of range deletions in the meta-block (within each snapshot stripe). But doing this is not on any foreseeable roadmap, and I hope it'll be made unnecessary by Cockroach's optimizations :). So feel free to get rid of this overly general code if you wish.
These conditions are guaranteed by the loop condition. No need to recheck them.
Force-pushed from bde941c to 825f0eb.
The current implementation of range deletion tombstones in RocksDB suffers from a performance bug that causes excessive CPU usage on every read operation in a database with many range tombstones. Dropping a large table can easily result in several thousand range deletion tombstones in one store, resulting in an unusable cluster as documented in cockroachdb#24029.

Backport a refactoring of range deletion tombstones that fixes the performance problem. This refactoring has also been proposed upstream as facebook/rocksdb#4014. A more minimal change was also proposed in facebook/rocksdb#3992; that patch better highlights the exact nature of the bug than the patch backported here, for those looking to understand the problem. But this refactoring, though more invasive, gets us one step closer to solving a related problem where range deletions can cause excessively large compactions (cockroachdb#26693). These large compactions do not appear to brick the cluster but undoubtedly have some impact on performance.

Fix cockroachdb#24029.

Release note: None
I will close this and review #4014 instead.
The range deletion aggregator supports "uncollapsed" and "collapsed" tombstone maps. Uncollapsed tombstone maps are naturally represented as a std::multiset, while collapsed tombstone maps are naturally represented as a std::map. The previous implementation used a std::multimap that was general enough to store either the uncollapsed or the collapsed representation. Confusingly, its keys and values had different meanings depending on which representation was being used.

Extract a TombstoneMap interface which has two concrete implementations, uncollapsed and collapsed. This allows the implementations to use their natural representation and keeps the code for manipulating that representation within one class. In the process, simplify and comment the CollapsedTombstoneMap::AddTombstone method.

This refactor brings with it a large performance improvement. The details of the performance bug in the old implementation, and a more targeted fix, can be found in an earlier, abandoned PR, facebook#3992.

The refactor also exposed a bug in the ObsoleteTombstoneCleanup test, which was installing tombstones like [dr1, dr1) that cover no keys due to the exclusivity of the upper bound.
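A rough sketch of the shape of that split, with simplified stand-in types (the struct and class bodies below are illustrative only; the real classes carry RocksDB's Slice and SequenceNumber types and more methods):

```cpp
#include <cstdint>
#include <iterator>
#include <map>
#include <set>
#include <string>
#include <utility>

// Simplified stand-in for rocksdb::RangeTombstone: covers [start, end) at seq.
struct Tombstone {
  std::string start, end;
  uint64_t seq;
};

// The extracted interface: callers ask whether a key written at `seq` is
// covered by a newer tombstone, without knowing the representation.
class TombstoneMap {
 public:
  virtual ~TombstoneMap() = default;
  virtual void AddTombstone(Tombstone t) = 0;
  virtual bool ShouldDelete(const std::string& key, uint64_t seq) const = 0;
};

// Natural uncollapsed representation: every tombstone kept as-is in a
// multiset ordered by start key; a lookup may scan many overlapping tombstones.
class UncollapsedTombstoneMap : public TombstoneMap {
 public:
  void AddTombstone(Tombstone t) override { rep_.insert(std::move(t)); }
  bool ShouldDelete(const std::string& key, uint64_t seq) const override {
    for (const auto& t : rep_) {
      if (t.start <= key && key < t.end && seq < t.seq) return true;
    }
    return false;
  }

 private:
  struct ByStart {
    bool operator()(const Tombstone& a, const Tombstone& b) const {
      return a.start < b.start;
    }
  };
  std::multiset<Tombstone, ByStart> rep_;
};

// Natural collapsed representation: key -> seqnum covering [key, next key);
// a lookup is a single ordered search regardless of how many tombstones
// were added.
class CollapsedTombstoneMap : public TombstoneMap {
 public:
  // Collapsing on insert is the interesting part; a standalone sketch of it
  // appears after the PR description below.
  void AddTombstone(Tombstone t) override;
  bool ShouldDelete(const std::string& key, uint64_t seq) const override {
    auto it = rep_.upper_bound(key);
    if (it == rep_.begin()) return false;  // key precedes every tombstone
    return seq < std::prev(it)->second;
  }

 private:
  std::map<std::string, uint64_t> rep_;
};
```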
Summary: This fixes the same performance issue that #3992 fixes, but with much more invasive cleanup. I'm more excited about this PR because it paves the way for fixing another problem we uncovered at Cockroach where range deletion tombstones can cause massive compactions.

For example, suppose L4 contains deletions from [a, c) and [x, z) and no other keys, and L5 is entirely empty. L6, however, is full of data. When compacting L4 -> L5, we'll end up with one file that spans, massively, from [a, z). When we go to compact L5 -> L6, we'll have to rewrite all of L6! If, instead of range deletions in L4, we had keys a, b, x, y, and z, RocksDB would have been smart enough to create two files in L5: one for a and b and another for x, y, and z.

With the changes in this PR, it will be possible to adjust the compaction logic to split tombstones/start new output files when they would span too many files in the grandparent level.

ajkr please take a look when you have a minute!

Pull Request resolved: #4014
Differential Revision: D8773253
Pulled By: ajkr
fbshipit-source-id: ec62fa85f648fdebe1380b83ed997f9baec35677
I've discovered what seems to be a bug in the collapsing of overlapping range tombstones. Specifically, when adding a new tombstone to a collapsing RangeDelAggregator that overlapped with an existing tombstone, every tombstone to the right of the existing tombstone would be scanned over, instead of exiting the loop after the new tombstone's endpoints were inserted.
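As a hedged illustration of the intended behavior (not the PR's actual code), here is a straightforward collapse over a std::map<std::string, uint64_t> representation; the point is that the loop over existing boundaries stops as soon as it reaches the new tombstone's end key, rather than scanning every tombstone to its right:

```cpp
#include <algorithm>
#include <cstdint>
#include <iterator>
#include <map>
#include <string>

// Hypothetical collapsed map: key -> seqnum of the newest tombstone covering
// [key, next key); 0 means nothing is covered from that key onward.
using CollapsedMap = std::map<std::string, uint64_t>;

// Precondition: start < end.
void AddTombstone(CollapsedMap& rep, const std::string& start,
                  const std::string& end, uint64_t seq) {
  // Seqnum in effect just before `end`, so coverage resumes correctly there.
  auto after_end = rep.upper_bound(end);
  uint64_t seq_at_end =
      (after_end == rep.begin()) ? 0 : std::prev(after_end)->second;

  // Seqnum in effect at `start`, then record the new tombstone's start point.
  auto after_start = rep.upper_bound(start);
  uint64_t seq_at_start =
      (after_start == rep.begin()) ? 0 : std::prev(after_start)->second;
  rep[start] = std::max(seq, seq_at_start);

  // Raise every boundary strictly inside (start, end) to at least `seq`.
  // Crucially, the loop exits once it passes `end`; boundaries further right
  // cannot be affected, which is exactly what the buggy path kept scanning.
  for (auto it = rep.upper_bound(start); it != rep.end() && it->first < end;
       ++it) {
    it->second = std::max(it->second, seq);
  }

  // Restore the old coverage at `end` if no boundary already sits there.
  if (rep.find(end) == rep.end()) {
    rep[end] = seq_at_end;
  }
}
```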
I have more testing/benchmarking to do, but the initial results are quite promising!
(That's seekrandom -num=100 on a database created with fillrandom.)