Collapse range deletions #1614
Conversation
@ajkr updated the pull request
@ajkr updated the pull request
@ajkr updated the pull request
@ajkr has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
@ajkr updated the pull request (changes since last import)
@ajkr updated the pull request (changes since last import)
ping
Summary: made db_stress capable of adding range deletions to its db and verifying their correctness. i'll make db_crashtest.py use this option later once the collapsing optimization (#1614) is committed because currently it slows down the test too much. Closes #1625 Differential Revision: D4293939 Pulled By: ajkr fbshipit-source-id: d3beb3a
Some general questions, since I want to know more about the work:
- When do we have a RangeDelAggregator? How does it work together with iterators? Do the two levels of TwoLevelIterator each hold a range aggregator?
- Are the ranges added to the aggregator in any particular order, like increasing order of sequence id? Maybe that could be used to simplify the code.
@@ -50,6 +58,14 @@ bool RangeDelAggregator::ShouldDelete(const ParsedInternalKey& parsed) {
    return false;
  }
  const auto& tombstone_map = GetTombstoneMap(parsed.sequence);
  if (collapse_deletions_) {
    auto iter = tombstone_map.upper_bound(parsed.user_key.ToString());
questions:
- The key of the map is a Slice. Should we compare Slices instead of strings?
- Why not just query lower_bound instead of getting upper_bound and then iterating backward?
- yes, not sure what I was thinking
- lower_bound gives us the first point >= search key, but we want the last point <= search key.
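For reference, a minimal sketch of this point, using a hypothetical simplified tombstone map (start user key mapped to seqnum, not RocksDB's actual types): upper_bound() followed by one step back yields the last entry whose start key is <= the search key, which lower_bound() cannot give directly.

```cpp
#include <cstdint>
#include <map>
#include <string>

// Simplified stand-in for the collapsed tombstone map: start user key -> seqnum.
std::map<std::string, uint64_t> tombstones;

// upper_bound() returns the first entry whose start key is strictly greater
// than user_key; stepping back one entry yields the last entry whose start
// key is <= user_key, i.e. the interval that could cover it. lower_bound()
// would instead return the first entry >= user_key -- the wrong side.
const uint64_t* CoveringSeq(const std::string& user_key) {
  auto iter = tombstones.upper_bound(user_key);
  if (iter == tombstones.begin()) {
    return nullptr;  // every tombstone starts after user_key
  }
  --iter;
  return &iter->second;
}
```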
// above loop advances one too far
new_range_dels_iter = last_range_dels_iter;
auto tombstone_map_iter =
    tombstone_map.upper_bound(new_range_dels_iter->start_key_);
lower_bound?
// until the next tombstone starts. For gaps between real tombstones and
// for the last real tombstone, we denote end keys by inserting fake
// tombstones with sequence number zero.
std::vector<RangeTombstone> new_range_dels{
I think you only need to insert the fake tombstone if tombstone.end_key < tombstone_map_begin. Would this look cleaner? I think it would simplify the whole method if you had only one tombstone to add in the general case instead of having a new_range_dels vector.
if (tombstone_map.empty() || tombstone.end_key < tombstone_map_begin) {
tombstone_map.emplace(seq, tombstone);
tombstone_map.emplace(0, Tombstone(tombstone.end_key, null, 0));
return;
}
if (tombstone.start_key < tombstone_map_begin) {
tombstone_map.emplace(seq, tombstone);
}
// general case begin.
one of the motivations for using a vector of new points is that, in a future optimization, we're considering merging multiple new points simultaneously (e.g., all points from a snapshot stripe in an SST file that we just read), which should be faster than merging new points one-at-a-time.
ok
// raising the seqnum for the to-be-inserted element (we insert the max
// seqnum between the next new interval and the unterminated interval).
SequenceNumber untermed_seq = kMaxSequenceNumber;
SequenceNumber prev_seen_seq = 0;
comment what's prev_seen_seq.
sure
RangeTombstone(
    Slice(), Slice(),
    std::max(
        untermed_seq == kMaxSequenceNumber ? 0 : untermed_seq,
set untermed_seq to 0 whenever you set it to kMaxSequenceNumber, to save a check here?
oh, untermed_seq == 0 means we just covered an existing point with seqnum 0, in which case we need line 240 to evaluate to true so we can force insertion of the new point. but I think this logic will be obsolete now with your idea to split the existing ranges when they overlap a new point.
    *new_range_dels_iter_end, *tombstone_map_iter_end);
}

if (new_to_old_start_cmp < 0) {
I feel the logic here can be simplified. What if we break it into three steps:
- break the existing range that covers new_range_start into two.
- break the existing range that covers new_range_end into two.
- update all ranges that are fully covered by the new range.
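A rough sketch of those three steps over a flat interval map (start key mapped to seqnum, each entry running until the next start key; hypothetical names, not the actual RangeDelAggregator code):

```cpp
#include <algorithm>
#include <cstdint>
#include <iterator>
#include <map>
#include <string>

using IntervalMap = std::map<std::string, uint64_t>;  // start key -> seqnum

// Seqnum covering key k, or 0 (the sentinel "no tombstone") if none.
uint64_t Covering(const IntervalMap& m, const std::string& k) {
  auto it = m.upper_bound(k);
  return it == m.begin() ? 0 : std::prev(it)->second;
}

void AddRange(IntervalMap& m, const std::string& start,
              const std::string& end, uint64_t seq) {
  if (start >= end) return;
  // Steps 1 and 2: split the existing ranges covering `start` and `end` by
  // inserting entries that preserve the old seqnum on either side.
  // emplace() is a no-op if an entry already begins at that key.
  uint64_t end_old = Covering(m, end);
  uint64_t start_old = Covering(m, start);
  m.emplace(end, end_old);
  m.emplace(start, start_old);
  // Step 3: raise the seqnum of every entry now fully inside [start, end),
  // keeping any existing seqnum that is newer than the inserted tombstone's.
  for (auto it = m.lower_bound(start); it != m.end() && it->first < end; ++it) {
    it->second = std::max(it->second, seq);
  }
}
```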
great idea, thanks :)
If we're going to keep a vector of new ranges instead of one range at a time, then the current logic looks better, though I still find it hard to understand and it probably can be simplified.
Thanks so much for the review!
Sorry I missed your general questions earlier:
@ajkr has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
@ajkr updated the pull request (changes since last import)
@ajkr updated the pull request (changes since last import)
I think the logic is correct, so I'm letting it pass. But still, some random thoughts. I actually don't know which of these will simplify the code.
- We probably still want to add one range at a time instead of merging two range maps. Say you merge n range maps, each with n elements: adding one element at a time takes O(n^2 log(n^2)) with binary search, while merging one such range map into the full range map at a time, each time doing a linear scan of the two maps, takes O(n^3) time. Plus it will look simpler.
- The input container (a vector) and the existing range map (a map) are in different formats, making the code look messy.
- The existing range map probably doesn't need to be keyed by start key; could it be a std::set with a custom less-than operator? Would that look simpler? Something like `iter->second.seq` would become `iter->seq`, which looks shorter.
- If you don't hold the empty range (range with seq=0) in the existing range map, will it make the code simpler? You then don't need to handle the extra empty range.
Totally up to you, and definitely fine to come back and revisit the logic later.
Totally up to you and definitely fine to come back to revisit the logic later.
@@ -179,7 +179,6 @@ Status RangeDelAggregator::AddTombstone(RangeTombstone tombstone) {
    // raising the seqnum for the to-be-inserted element (we insert the max
    // seqnum between the next new interval and the unterminated interval).
    SequenceNumber untermed_seq = kMaxSequenceNumber;
-   SequenceNumber prev_seen_seq = 0;
good job getting rid of this variable. It makes the state transition simpler.
Addressing your comments:
2,3. Sure, I like your idea of using set with custom comparator :).
But well, I know the actual running time needs to be benchmarked with a workload. The O(log N) operations on std::map carry a big constant factor.
I see. Yeah, I forgot about the seq_id > new seq_id case. Sure, go ahead. All the above is just to share my two cents and have a bit of discussion.
Thanks. 2,3. I tried this, but unfortunately ran into the issue that a set's keys are supposed to be immutable. We could mark end_key_ and seq_ as "mutable" in the RangeTombstone struct, but I don't think it'd be intuitive to readers why start_key_ is immutable and the others aren't. So I think it's cleaner to continue with map for now.
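As an aside, the constraint described here is easy to demonstrate: std::set exposes its elements as const, so any field updated after insertion must be declared mutable. A sketch with a hypothetical simplified struct (not the actual RangeTombstone):

```cpp
#include <cstdint>
#include <set>
#include <string>

// Only start_key_ participates in ordering, but std::set treats the whole
// element as const, so end_key_ and seq_ must be declared mutable to remain
// updatable after insertion -- which obscures which field is really the key.
struct Tombstone {
  std::string start_key_;
  mutable std::string end_key_;
  mutable uint64_t seq_;
  bool operator<(const Tombstone& o) const { return start_key_ < o.start_key_; }
};
```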
@ajkr has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
cool.
Summary: when writing RangeDelAggregator::AddToBuilder, I forgot that there are sentinel tombstones in the middle of the interval map since gaps between real tombstones are represented with sentinels. blame: #1614 Closes #1804 Differential Revision: D4460426 Pulled By: ajkr fbshipit-source-id: 69444b5
Added a tombstone-collapsing mode to RangeDelAggregator, which eliminates overlap in the TombstoneMap. In this mode, we can check whether a tombstone covers a user key using upper_bound() (i.e., binary search). However, the tradeoff is the overhead to add tombstones is now higher, so at first I've only enabled it for range scans (compaction/flush/user iterators), where we expect a high number of calls to ShouldDelete() for the same tombstones. Point queries like Get() will still use the linear scan approach.
Also in this diff I changed RangeDelAggregator's TombstoneMap to use multimap with user keys instead of map with internal keys. Callers sometimes provided ParsedInternalKey directly, from which it would've required string copying to derive an internal key Slice with which we could search the map.
Test Plan: unit tests
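For illustration, a minimal sketch of the two lookup strategies this summary describes, using hypothetical simplified types rather than the actual RangeDelAggregator API: the uncollapsed multimap is checked with a linear scan, while the collapsed map answers coverage with a single upper_bound().

```cpp
#include <cstdint>
#include <iterator>
#include <map>
#include <string>

struct Range { std::string end; uint64_t seq; };
// Uncollapsed: possibly-overlapping tombstones, keyed by start user key.
using Uncollapsed = std::multimap<std::string, Range>;
// Collapsed: non-overlapping intervals; seqnum 0 marks a gap (sentinel).
using Collapsed = std::map<std::string, uint64_t>;

// Point-query path (Get()): cheap to build, O(#tombstones) per lookup.
bool CoveredLinear(const Uncollapsed& m, const std::string& key) {
  for (const auto& kv : m) {
    if (kv.first <= key && key < kv.second.end) return true;
  }
  return false;
}

// Range-scan path: costlier to build, O(log #tombstones) per lookup.
bool CoveredCollapsed(const Collapsed& m, const std::string& key) {
  auto it = m.upper_bound(key);
  return it != m.begin() && std::prev(it)->second != 0;
}
```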