Collapse range deletions #1614
Conversation
@ajkr updated the pull request
@ajkr updated the pull request
@ajkr updated the pull request
@ajkr has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
@ajkr updated the pull request (changes since last import)
@ajkr updated the pull request (changes since last import)
ping
Summary: made db_stress capable of adding range deletions to its db and verifying their correctness. i'll make db_crashtest.py use this option later once the collapsing optimization (#1614) is committed because currently it slows down the test too much. Closes #1625 Differential Revision: D4293939 Pulled By: ajkr fbshipit-source-id: d3beb3a
Some general questions, since I want to know more about the work:
- When do we have a RangeDelAggregator? How does it work together with iterators? Do the two levels of TwoLevelIterator each hold a range aggregator?
- Are the ranges added to the aggregator in any particular order, like increasing order of sequence id? Maybe that could be used to simplify the code.
@@ -50,6 +58,14 @@ bool RangeDelAggregator::ShouldDelete(const ParsedInternalKey& parsed) {
    return false;
  }
  const auto& tombstone_map = GetTombstoneMap(parsed.sequence);
  if (collapse_deletions_) {
    auto iter = tombstone_map.upper_bound(parsed.user_key.ToString());
questions:
- The key of the map is a Slice. Should we compare Slices instead of strings?
- Why not just query lower_bound instead of getting upper_bound and then iterating backward?
- yes, not sure what I was thinking
- lower_bound gives us the first point >= search key, but we want the last point <= search key.
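For reference, a minimal sketch of this point, using a hypothetical simplified tombstone map (start user key mapped to seqnum, not RocksDB's actual types): upper_bound() followed by one step back yields the last entry whose start key is <= the search key, which lower_bound() cannot give directly.

```cpp
#include <cstdint>
#include <map>
#include <string>

// Simplified stand-in for the collapsed tombstone map: start user key -> seqnum.
std::map<std::string, uint64_t> tombstones;

// upper_bound() returns the first entry whose start key is strictly greater
// than user_key; stepping back one entry yields the last entry whose start
// key is <= user_key, i.e. the interval that could cover it. lower_bound()
// would instead return the first entry >= user_key -- the wrong side.
const uint64_t* CoveringSeq(const std::string& user_key) {
  auto iter = tombstones.upper_bound(user_key);
  if (iter == tombstones.begin()) {
    return nullptr;  // every tombstone starts after user_key
  }
  --iter;
  return &iter->second;
}
```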
// above loop advances one too far
new_range_dels_iter = last_range_dels_iter;
auto tombstone_map_iter =
    tombstone_map.upper_bound(new_range_dels_iter->start_key_);
lower_bound?
// until the next tombstone starts. For gaps between real tombstones and
// for the last real tombstone, we denote end keys by inserting fake
// tombstones with sequence number zero.
std::vector<RangeTombstone> new_range_dels{
I think you only need to insert the fake tombstone if tombstone.end_key < tombstone_map_begin. Would this look cleaner? I think it would simplify the whole method if you had only one tombstone to add in the general case instead of having a new_range_dels vector.
if (tombstone_map.empty() || tombstone.end_key < tombstone_map_begin) {
tombstone_map.emplace(seq, tombstone);
tombstone_map.emplace(0, Tombstone(tombstone.end_key, null, 0));
return;
}
if (tombstone.start_key < tombstone_map_begin) {
tombstone_map.emplace(seq, tombstone);
}
// general case begin.
one of the motivations for using a vector of new points is that, in a future optimization, we're considering merging multiple new points simultaneously (e.g., all points from a snapshot stripe in an SST file that we just read), which should be faster than merging new points one-at-a-time.
ok
// raising the seqnum for the to-be-inserted element (we insert the max
// seqnum between the next new interval and the unterminated interval).
SequenceNumber untermed_seq = kMaxSequenceNumber;
SequenceNumber prev_seen_seq = 0;
comment what's prev_seen_seq.
sure
RangeTombstone(
    Slice(), Slice(),
    std::max(
        untermed_seq == kMaxSequenceNumber ? 0 : untermed_seq,
set untermed_seq to 0 whenever you set it to kMaxSequenceNumber, to save a check here?
oh, untermed_seq == 0 means we just covered an existing point with seqnum 0, in which case we need line 240 to evaluate to true so we can force insertion of the new point. but I think this logic will be obsolete now with your idea to split the existing ranges when they overlap a new point.
    *new_range_dels_iter_end, *tombstone_map_iter_end);
}

if (new_to_old_start_cmp < 0) {
I feel the logic here can be simplified. What if we break it into three steps:
- break the existing range that covers new_range_start into two.
- break the existing range that covers new_range_end into two.
- update all ranges that are fully covered by the new range.
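A rough sketch of those three steps over a flat interval map (start key mapped to seqnum, each entry running until the next start key; hypothetical names, not the actual RangeDelAggregator code):

```cpp
#include <algorithm>
#include <cstdint>
#include <iterator>
#include <map>
#include <string>

using IntervalMap = std::map<std::string, uint64_t>;  // start key -> seqnum

// Seqnum covering key k, or 0 (the sentinel "no tombstone") if none.
uint64_t Covering(const IntervalMap& m, const std::string& k) {
  auto it = m.upper_bound(k);
  return it == m.begin() ? 0 : std::prev(it)->second;
}

void AddRange(IntervalMap& m, const std::string& start,
              const std::string& end, uint64_t seq) {
  if (start >= end) return;
  // Steps 1 and 2: split the existing ranges covering `start` and `end` by
  // inserting entries that preserve the old seqnum on either side.
  // emplace() is a no-op if an entry already begins at that key.
  uint64_t end_old = Covering(m, end);
  uint64_t start_old = Covering(m, start);
  m.emplace(end, end_old);
  m.emplace(start, start_old);
  // Step 3: raise the seqnum of every entry now fully inside [start, end),
  // keeping any existing seqnum that is newer than the inserted tombstone's.
  for (auto it = m.lower_bound(start); it != m.end() && it->first < end; ++it) {
    it->second = std::max(it->second, seq);
  }
}
```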
great idea, thanks :)
If we're going to keep a vector of new ranges instead of one range at a time, then the current logic looks better, though I still find it hard to understand and it probably can be simplified.
Thanks so much for the review!
Sorry I missed your general questions earlier:
@ajkr has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
@ajkr updated the pull request (changes since last import)
@ajkr updated the pull request (changes since last import)
I think the logic is correct, so I'm letting it pass. But still, some random thoughts. I actually don't know which of these will simplify the code.
- We probably still want to add one range at a time instead of merging two range maps. Say you merge n range maps, each with n elements: adding one element at a time takes O(n^2 log(n^2)) with binary search, while merging one such range map into the full range map at a time, each time doing a linear scan of the two maps, takes O(n^3) time. Plus it will look simpler.
- The input container (a vector) and the existing range map (a map) are in different formats, making the code look messy.
- The existing range map probably doesn't need to be keyed by start key; could it be a std::set with a custom less-than operator? Would that look simpler? Something like `iter->second.seq` would become `iter->seq`, which looks shorter.
- If you don't hold the empty range (range with seq=0) in the existing range map, will it make the code simpler? You then don't need to handle the extra empty range.
Totally up to you, and definitely fine to come back and revisit the logic later.
Totally up to you and definitely fine to come back to revisit the logic later.
@@ -179,7 +179,6 @@ Status RangeDelAggregator::AddTombstone(RangeTombstone tombstone) {
    // raising the seqnum for the to-be-inserted element (we insert the max
    // seqnum between the next new interval and the unterminated interval).
    SequenceNumber untermed_seq = kMaxSequenceNumber;
-   SequenceNumber prev_seen_seq = 0;
good job getting rid of this variable. It makes the state transition simpler.
Addressing your comments:
2,3. Sure, I like your idea of using set with custom comparator :).
But well, I know the actual running time needs to be benchmarked with a workload. The O(log N) operations on std::map carry a big constant factor.
I see. Yeah, I forgot about the seq_id > new seq_id case. Sure, go ahead. All the above is just to share my two cents and have a bit of discussion.
Thanks. 2,3. I tried this, but unfortunately ran into the issue that a set's keys are supposed to be immutable. We could mark end_key_ and seq_ as "mutable" in the RangeTombstone struct, but I don't think it'd be intuitive to readers why start_key_ is immutable and the others aren't. So I think it's cleaner to continue with map for now.
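As an aside, the constraint described here is easy to demonstrate: std::set exposes its elements as const, so any field updated after insertion must be declared mutable. A sketch with a hypothetical simplified struct (not the actual RangeTombstone):

```cpp
#include <cstdint>
#include <set>
#include <string>

// Only start_key_ participates in ordering, but std::set treats the whole
// element as const, so end_key_ and seq_ must be declared mutable to remain
// updatable after insertion -- which obscures which field is really the key.
struct Tombstone {
  std::string start_key_;
  mutable std::string end_key_;
  mutable uint64_t seq_;
  bool operator<(const Tombstone& o) const { return start_key_ < o.start_key_; }
};
```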
@ajkr has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
cool.
Summary: when writing RangeDelAggregator::AddToBuilder, I forgot that there are sentinel tombstones in the middle of the interval map since gaps between real tombstones are represented with sentinels. blame: #1614 Closes #1804 Differential Revision: D4460426 Pulled By: ajkr fbshipit-source-id: 69444b5
Added a tombstone-collapsing mode to RangeDelAggregator, which eliminates overlap in the TombstoneMap. In this mode, we can check whether a tombstone covers a user key using upper_bound() (i.e., binary search). However, the tradeoff is the overhead to add tombstones is now higher, so at first I've only enabled it for range scans (compaction/flush/user iterators), where we expect a high number of calls to ShouldDelete() for the same tombstones. Point queries like Get() will still use the linear scan approach.
Also in this diff I changed RangeDelAggregator's TombstoneMap to use multimap with user keys instead of map with internal keys. Callers sometimes provided ParsedInternalKey directly, from which it would've required string copying to derive an internal key Slice with which we could search the map.
Test Plan: unit tests
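For illustration, a minimal sketch of the two lookup strategies this summary describes, using hypothetical simplified types rather than the actual RangeDelAggregator API: the uncollapsed multimap is checked with a linear scan, while the collapsed map answers coverage with a single upper_bound().

```cpp
#include <cstdint>
#include <iterator>
#include <map>
#include <string>

struct Range { std::string end; uint64_t seq; };
// Uncollapsed: possibly-overlapping tombstones, keyed by start user key.
using Uncollapsed = std::multimap<std::string, Range>;
// Collapsed: non-overlapping intervals; seqnum 0 marks a gap (sentinel).
using Collapsed = std::map<std::string, uint64_t>;

// Point-query path (Get()): cheap to build, O(#tombstones) per lookup.
bool CoveredLinear(const Uncollapsed& m, const std::string& key) {
  for (const auto& kv : m) {
    if (kv.first <= key && key < kv.second.end) return true;
  }
  return false;
}

// Range-scan path: costlier to build, O(log #tombstones) per lookup.
bool CoveredCollapsed(const Collapsed& m, const std::string& key) {
  auto it = m.upper_bound(key);
  return it != m.begin() && std::prev(it)->second != 0;
}
```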