
Collapse range deletions #1614

Closed · wants to merge 8 commits

Conversation

@ajkr (Contributor) commented Dec 3, 2016:

Added a tombstone-collapsing mode to RangeDelAggregator, which eliminates overlap in the TombstoneMap. In this mode, we can check whether a tombstone covers a user key using upper_bound() (i.e., binary search). However, the tradeoff is that the overhead of adding tombstones is now higher, so at first I've only enabled it for range scans (compaction/flush/user iterators), where we expect a high number of calls to ShouldDelete() for the same tombstones. Point queries like Get() will still use the linear-scan approach.

Also in this diff I changed RangeDelAggregator's TombstoneMap to use a multimap with user keys instead of a map with internal keys. Callers sometimes provide a ParsedInternalKey directly, and deriving an internal-key Slice from it to search the map would have required string copying.

Test Plan: unit tests

@ajkr changed the title from "tmp" to "Collapse range deletions" (Dec 3, 2016)
@facebook-github-bot: @ajkr updated the pull request - view changes

@facebook-github-bot: @ajkr updated the pull request - view changes

@facebook-github-bot: @ajkr updated the pull request - view changes

@facebook-github-bot: @ajkr has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot: @ajkr updated the pull request - view changes - changes since last import

@facebook-github-bot: @ajkr updated the pull request - view changes - changes since last import

@ajkr (Author) commented Dec 6, 2016:

ping

facebook-github-bot pushed a commit that referenced this pull request Dec 7, 2016
Summary:
Made db_stress capable of adding range deletions to its DB and verifying their correctness. I'll make db_crashtest.py use this option later, once the collapsing optimization (#1614) is committed, because currently it slows down the test too much.
Closes #1625

Differential Revision: D4293939

Pulled By: ajkr

fbshipit-source-id: d3beb3a
@yiwu-arbug (Contributor) left a comment:

Some general questions, since I want to know more about the work:

  1. When do we have a RangeDelAggregator? How does it work together with iterators? Do the two levels of TwoLevelIterator each hold a range aggregator?
  2. Are the ranges added to the aggregator in any particular order, like increasing order of sequence id? Maybe that could be used to simplify the code.

@@ -50,6 +58,14 @@ bool RangeDelAggregator::ShouldDelete(const ParsedInternalKey& parsed) {
return false;
}
const auto& tombstone_map = GetTombstoneMap(parsed.sequence);
if (collapse_deletions_) {
auto iter = tombstone_map.upper_bound(parsed.user_key.ToString());
Contributor:

Questions:

  1. The key of the map is a Slice; should we compare Slices instead of strings?
  2. Why not just query lower_bound instead of getting upper_bound and then iterating backward?

Author:

  1. Yes, not sure what I was thinking.
  2. lower_bound gives us the first point >= the search key, but we want the last point <= the search key.

// above loop advances one too far
new_range_dels_iter = last_range_dels_iter;
auto tombstone_map_iter =
tombstone_map.upper_bound(new_range_dels_iter->start_key_);
Contributor:

lower_bound?

// until the next tombstone starts. For gaps between real tombstones and
// for the last real tombstone, we denote end keys by inserting fake
// tombstones with sequence number zero.
std::vector<RangeTombstone> new_range_dels{
Contributor:

I think you only need to insert the fake tombstone if tombstone.end_key < tombstone_map_begin. Would this look cleaner? I think it would simplify the whole method if you had only one tombstone to add in the general case, instead of a new_range_dels vector:

    if (tombstone_map.empty() || tombstone.end_key < tombstone_map_begin) {
        tombstone_map.emplace(seq, tombstone);
        tombstone_map.emplace(0, Tombstone(tombstone.end_key, null, 0));
        return;
    }
    if (tombstone.start_key < tombstone_map_begin) {
        tombstone_map.emplace(seq, tombstone);
    }
    // general case begin.

Author:

One of the motivations for using a vector of new points is that, in a future optimization, we're considering merging multiple new points simultaneously (e.g., all points from a snapshot stripe in an SST file that we just read), which should be faster than merging new points one at a time.

Contributor:

ok

// raising the seqnum for the to-be-inserted element (we insert the max
// seqnum between the next new interval and the unterminated interval).
SequenceNumber untermed_seq = kMaxSequenceNumber;
SequenceNumber prev_seen_seq = 0;
Contributor:

Add a comment explaining what prev_seen_seq is.

Author:

sure

RangeTombstone(
Slice(), Slice(),
std::max(
untermed_seq == kMaxSequenceNumber ? 0 : untermed_seq,
Contributor:

Set untermed_seq to 0 instead whenever you would set it to kMaxSequenceNumber, to save a check here?

Author:

Oh, untermed_seq == 0 means we just covered an existing point with seqnum 0, in which case we need line 240 to evaluate to true so we can force insertion of the new point. But I think this logic will be obsolete now with your idea of splitting the existing ranges when they overlap a new point.

*new_range_dels_iter_end, *tombstone_map_iter_end);
}

if (new_to_old_start_cmp < 0) {
Contributor:

I feel the logic here can be simplified. What if we break it into three steps:

  1. Break the existing range that covers new_range_start into two.
  2. Break the existing range that covers new_range_end into two.
  3. Update all ranges that are fully covered by the new range.

Author:

great idea, thanks :)

Contributor:

If we're going to keep a vector of new ranges instead of adding one range at a time, then the current logic looks better, though I still find it hard to understand and suspect it could be simplified.

@ajkr (Author) left a comment:

Thanks so much for the review!


@ajkr (Author) commented Dec 16, 2016:

Sorry I missed your general questions earlier:

  1. One RangeDelAggregator is constructed for every read operation, whether it's a get or an iterator. For iteration there is one RangeDelAggregator managed at the top-level, which is either DBIter (iterators for users) or CompactionIterator (iterators for flush or compaction). The lower-level iterators add tombstones into it, e.g., each time an SST file is read (https://github.com/facebook/rocksdb/blob/master/db/table_cache.cc#L243). Also the logic for reading values is updated to check whether keys are covered by range deletions with ShouldDelete().

  2. The order is kind of arbitrary. For example, the Get() code path uses FilePicker to decide which file to check next for a key. We add the range deletions for that file just before checking it for the key.

@facebook-github-bot: @ajkr has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot: @ajkr updated the pull request - view changes - changes since last import

@facebook-github-bot: @ajkr updated the pull request - view changes - changes since last import

@yiwu-arbug (Contributor) left a comment:

I think the logic is correct, so I'm letting it pass. But still, some random thoughts; I actually don't know which of these would simplify the code:

  1. We should probably still add one range at a time instead of merging two range maps. Say you merge n range maps, each with n elements: adding one element at a time takes O(n^2 log(n^2)) with binary search, whereas merging one whole range map into the full map at a time, each time doing a linear scan of the two maps, takes O(n^3). Plus it would look simpler.
  2. The input container (a vector) and the existing range map (a map) are in different formats, which makes the code look messy.
  3. The existing range map probably doesn't need to be keyed by start key; it could be a std::set with a custom less-than operator. Would that look simpler? Something like iter->second.seq would become iter->seq, which is shorter.
  4. If you didn't hold the empty ranges (ranges with seq=0) in the existing range map, would that make the code simpler? You then wouldn't need to handle the extra empty ranges.

Totally up to you, and definitely fine to come back and revisit the logic later.

@@ -179,7 +179,6 @@ Status RangeDelAggregator::AddTombstone(RangeTombstone tombstone) {
// raising the seqnum for the to-be-inserted element (we insert the max
// seqnum between the next new interval and the unterminated interval).
SequenceNumber untermed_seq = kMaxSequenceNumber;
SequenceNumber prev_seen_seq = 0;
Contributor:

Good job getting rid of this variable. It makes the state transition simpler.

@ajkr (Author) commented Dec 19, 2016:

Addressing your comments:

  1. I think for the complexity analysis we also have to account for scanning the tombstones whose key-ranges are fully contained in the new tombstone's key-range. This is for modifying/erasing tombstones that are now covered. Maybe it's unfair to claim this is an O(n^2) scan for each tombstone, since tombstones will probably be similar sizes, but I think in the worst case it will be.

2, 3. Sure, I like your idea of using a set with a custom comparator :).

  4. Oh, but we actually use tombstones with seq 0 (note there can be multiple, not just at the end, in case there are gaps between tombstones) as sentinels to denote end keys. For example, if we want to store one tombstone with begin_key="a", end_key="b", and seq=10, then the map will contain {"a"->RangeTombstone(seq=10), "b"->RangeTombstone(seq=0)}. I thought this might be simpler because we don't have to maintain end_key, since each tombstone extends until the next one starts.

@yiwu-arbug (Contributor) commented Dec 19, 2016:

  1. A range fully covered by the new tombstone will be removed only once. Well, a range can be split, but the number of ranges we end up with is O(N) (N = number of total input ranges), so the amortized cost per range is O(log N) (log N for removal from a balanced binary search tree).

But, well, I know the actual running time needs to be benchmarked with a workload; O(log N) operations on std::map carry a big constant factor.

@ajkr (Author) commented Dec 19, 2016:

  1. True, it can only be removed once, but it may be considered by many scans if it has a higher seqnum than the new tombstones being added; i.e., it's not guaranteed to be removed just because its key-range is fully covered. Anyway, I can leave this for now and we can revisit later for further optimization, OK?

@yiwu-arbug (Contributor):

I see. Yeah, I forgot about the seq_id > new seq_id case. Sure, go ahead. All of the above was just to share my two cents and have a bit of discussion.

@ajkr (Author) commented Dec 20, 2016:

Thanks.

2, 3. I tried this, but unfortunately ran into the issue that a set's keys are supposed to be immutable. We could mark end_key_ and seq_ as "mutable" in the RangeTombstone struct, but I don't think it would be intuitive to readers why start_key_ is immutable while the others aren't. So I think it's cleaner to continue with map for now.

@facebook-github-bot: @ajkr has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@yiwu-arbug (Contributor):

Cool.

facebook-github-bot pushed a commit that referenced this pull request Jan 25, 2017
Summary:
When writing RangeDelAggregator::AddToBuilder, I forgot that there are sentinel tombstones in the middle of the interval map, since gaps between real tombstones are represented with sentinels.

blame: #1614
Closes #1804

Differential Revision: D4460426

Pulled By: ajkr

fbshipit-source-id: 69444b5
ajkr added a commit that referenced this pull request Jan 27, 2017 (same summary; Closes #1804, Differential Revision: D4460426, Pulled By: ajkr, fbshipit-source-id: 69444b5)
4 participants