-
Notifications
You must be signed in to change notification settings - Fork 472
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
db: fix cases where SeekPrefixGE prefix doesn't match key #3845
base: master
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 7 files reviewed, 1 unresolved discussion (waiting on @jbowens)
merging_iter.go
line 907 at r1 (raw file):
func (m *mergingIter) seekGE(key []byte, level int, flags base.SeekGEFlags) error { if invariants.Enabled && m.lower != nil && m.heap.cmp(key, m.lower) < 0 { m.logger.Fatalf("mergingIter: lower bound violation: %s < %s\n%s", key, m.lower, debug.Stack())
Any good reason for this code to use Fatalf
instead of panic
? I'm thinking that panic
should be just fine under invariants.Enabled
since it won't happen in production anyway
This commit adds `SeekPrefixGE` assertions verifying that the `prefix` actually is the prefix for the `key`. This was not always the case so the offending code paths are adjusted. In the future, we should create a wrapper iterator that verifies this sort of thing. Informs cockroachdb#3794
7cdedc8
to
92187e1
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm looking at this strict stuff for the first time, and there is a question below related to that and rangedel handling. Rest looks great.
Reviewed 5 of 7 files at r1, 2 of 2 files at r2, all commit messages.
Reviewable status: all files reviewed, 7 unresolved discussions (waiting on @jbowens and @RaduBerinde)
merging_iter.go
line 907 at r1 (raw file):
Previously, RaduBerinde wrote…
Any good reason for this code to use
Fatalf
instead ofpanic
? I'm thinking thatpanic
should be just fine underinvariants.Enabled
since it won't happen in production anyway
no good reason, afaik.
merging_iter.go
line 961 at r2 (raw file):
for ; level < len(m.levels); level++ { if invariants.Enabled && m.lower != nil && m.heap.cmp(key, m.lower) < 0 { m.logger.Fatalf("mergingIter: lower bound violation: %s < %s\n%s", key, m.lower, debug.Stack())
I think key
is getting updated in this loop, which is why the assertion was inside the loop.
merging_iter.go
line 986 at r2 (raw file):
// keys, and there might exist live range keys within the range // tombstone's span that need to be observed to trigger a switch to // combined iteration.
Is there an invariant that if we are doing a prefix seek, that key
is always a prefix of m.prefix
, even though it may advance in this loop?
Which is why we can't seek using l.tombstone.End
if it does not match the prefix? Say the prefix was c and we were seeking to c@30. And level 1 landed at c@5, and had a tombstone [a,f). We can't seek the lower levels to f. So we will seek them using c@30. Say level 2 lands at c@20. Now we have in some sense lost the benefit of the tombstone, in that we have not seeked far enough. We won't bother comparing with the tombstone so that's ok-ish. Say the next prefix seek done by the user was c@3. Then level 1 may not have a c, but it still has the tombstone. Then we will seek these lower levels to c@3.
Is this strictness resulting in unnecessary seeking in a part of the key space which is known to be range deleted?
sstable/reader_iter_single_lvl.go
line 676 at r2 (raw file):
// Iterator is not positioned based on last seek. flags = flags.DisableTrySeekUsingNext() i.didNotPositionOnLastSeekGE = false
Is this just a defensive addition, in case we start optimizing by not positioning on the regular SeekGE
path too?
sstable/reader_iter_two_lvl.go
line 40 at r2 (raw file):
// is set, the TrySeekUsingNext optimization is disabled on the next seek. // This happens for example when the bloom filter excludes a prefix. didNotPositionOnLastSeekGE bool
Is it ok for the default value for a new iterator to be false? Will that cause us to assume that it was positioned? Do we get away with it because the caller will not set seekGEFlagTrySeekUsingNext
?
sstable/reader_iter_two_lvl.go
line 404 at r2 (raw file):
if flags.TrySeekUsingNext() && (i.secondLevel.exhaustedBounds == +1 || (PD(&i.secondLevel.data).IsDataInvalidated() && i.secondLevel.index.IsDataInvalidated())) && err == nil {
How about adding a comment:
// INVARIANT: flags.TrySeekUsingNext(). Which implies i.didNotPositionOnLastSeekGE
was false at the start of SeekPrefixGE
, and is still false. So future seeks will continue to be able to use the current position to optimize.
sstable/reader_iter_two_lvl.go
line 529 at r2 (raw file):
} if !dontSeekWithinSingleLevelIter {
How about adding something like. Otherwise it is a bit confusing to see singleLevelIterator
also having a if-block that disables TrySeekUsingNext, and worrying that it will get executed.
if invariants.Enabled && i.didNotPositionOnLastSeekGE {
// i.didNotPositionOnLastSeekGE must always be false, since we've fiddled with that setting, and we want
// singleLevelIterator to blindly do what we are telling it to in flags.
panic(...)
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 3 of 7 files at r1, all commit messages.
Reviewable status: all files reviewed, 7 unresolved discussions (waiting on @RaduBerinde and @sumeerbhola)
merging_iter.go
line 986 at r2 (raw file):
Previously, sumeerbhola wrote…
Is there an invariant that if we are doing a prefix seek, that
key
is always a prefix ofm.prefix
, even though it may advance in this loop?Which is why we can't seek using
l.tombstone.End
if it does not match the prefix? Say the prefix was c and we were seeking to c@30. And level 1 landed at c@5, and had a tombstone [a,f). We can't seek the lower levels to f. So we will seek them using c@30. Say level 2 lands at c@20. Now we have in some sense lost the benefit of the tombstone, in that we have not seeked far enough. We won't bother comparing with the tombstone so that's ok-ish. Say the next prefix seek done by the user was c@3. Then level 1 may not have a c, but it still has the tombstone. Then we will seek these lower levels to c@3.Is this strictness resulting in unnecessary seeking in a part of the key space which is known to be range deleted?
Hm I think this is true. Could we get away with not performing the remaining lower-level seeks at all and instead remember on their mergingIterLevel
s that these levels' iterators are unpositioned? I think no because a subsequent SeekGE(k2) may find that k2 is < the position of the higher-level iterators (if they too were re-positioned by range deletes), which could cause the lower levels' keys to improperly appear to be undeleted by higher-level tombstones.
We would either need to a) remember l.tombstone.End
so that a subsequent SeekGE(k2)
seeks to max(k2,l.tombstone.End)
, or b) maintain an invariant that all the higher levels remain in the heap and their l.tombstone
s cascade down to l.tombstone.End
such that a subsequent SeekGE(k2)
will end up adjusting its seek key to the original l.tombstone.End
by the time it reaches the unpositioned levels.
sstable/reader_iter_two_lvl.go
line 40 at r2 (raw file):
Do we get away with it because the caller will not set
seekGEFlagTrySeekUsingNext
?
I think this is true. The levelIter takes care to disable the TrySeekUsingNext flag when it opens a new file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: all files reviewed, 7 unresolved discussions (waiting on @sumeerbhola)
merging_iter.go
line 961 at r2 (raw file):
Previously, sumeerbhola wrote…
I think
key
is getting updated in this loop, which is why the assertion was inside the loop.
Right, but we only update it to a larger one.
merging_iter.go
line 986 at r2 (raw file):
Previously, jbowens (Jackson Owens) wrote…
Hm I think this is true. Could we get away with not performing the remaining lower-level seeks at all and instead remember on their
mergingIterLevel
s that these levels' iterators are unpositioned? I think no because a subsequent SeekGE(k2) may find that k2 is < the position of the higher-level iterators (if they too were re-positioned by range deletes), which could cause the lower levels' keys to improperly appear to be undeleted by higher-level tombstones.We would either need to a) remember
l.tombstone.End
so that a subsequentSeekGE(k2)
seeks tomax(k2,l.tombstone.End)
, or b) maintain an invariant that all the higher levels remain in the heap and theirl.tombstone
s cascade down tol.tombstone.End
such that a subsequentSeekGE(k2)
will end up adjusting its seek key to the originall.tombstone.End
by the time it reaches the unpositioned levels.
Would doing SeekGE(tombstone.End)
on the lower levels be better? AFAICT that's the closest thing to what we were doing before?
sstable/reader_iter_single_lvl.go
line 676 at r2 (raw file):
Previously, sumeerbhola wrote…
Is this just a defensive addition, in case we start optimizing by not positioning on the regular
SeekGE
path too?
No, in case we get a SeekGE w/TrySeekUsingNext
after a SeekPrefixGE
(which I ran into if I remember correctly).
sstable/reader_iter_two_lvl.go
line 40 at r2 (raw file):
Previously, jbowens (Jackson Owens) wrote…
Do we get away with it because the caller will not set
seekGEFlagTrySeekUsingNext
?I think this is true. The levelIter takes care to disable the TrySeekUsingNext flag when it opens a new file.
Good point though.. I can change it to allowSeekUsingNext
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: all files reviewed, 7 unresolved discussions (waiting on @sumeerbhola)
merging_iter.go
line 986 at r2 (raw file):
Previously, RaduBerinde wrote…
Would doing
SeekGE(tombstone.End)
on the lower levels be better? AFAICT that's the closest thing to what we were doing before?
Jackson, I don't understand your point - is there a correctness issue if we just remember which levels we are allowed to use TrySeekUsingNext? It seems equivalent to the case where the existing SeekPrefixGE
did not match the bloom filter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: all files reviewed, 7 unresolved discussions (waiting on @jbowens and @RaduBerinde)
merging_iter.go
line 986 at r2 (raw file):
Would doing
SeekGE(tombstone.End)
on the lower levels be better? AFAICT that's the closest thing to what we were doing before?
We may want something that knows the prefix so the levelIter
will seek the manifest.LevelIter
, but realizing that the seek key is beyond the prefix, will not load the file containing key, and return nil.
How important is this case to optimize? (given that we now use excise for rebalancing) |
I think the difference is that in the simple bloom filter case, we still seeked all the levels to the correct file. This paragraph in that TrySeekUsingNext PR is referring to this problem but it's very high level: https://github.com/cockroachdb/pebble/pull/3329/files#diff-5f17b9292c13dad71775aec366c4df449009f7fc211d69ed6cb134e6880feff0R64-R69 Ah yeah, here's a more concrete example in the mergingIter comments: Lines 929 to 954 in 25ac678
I'm having trouble articulating the invariant we're relying on here. [That's never a good sign :)] |
I think range deletions are still common although maybe not particularly broad ones. I believe MVCC GC will write them if there's a sufficiently long span of all keys that need GC-ing. I wouldn't be surprised if it's not important to optimize in practice, but I'm still a little wary of a regression. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: all files reviewed, 7 unresolved discussions (waiting on @jbowens and @RaduBerinde)
merging_iter.go
line 986 at r2 (raw file):
Previously, sumeerbhola wrote…
Would doing
SeekGE(tombstone.End)
on the lower levels be better? AFAICT that's the closest thing to what we were doing before?We may want something that knows the prefix so the
levelIter
will seek themanifest.LevelIter
, but realizing that the seek key is beyond the prefix, will not load the file containing key, and return nil.
Hmm, I'm forgetting the benefit of restricting the key in SeekPrefixGE
such that prefix
must be a prefix of it.
Also, I was forgetting the motivation for the plan to change SeekPrefixGE
so that every iterator has SeekPrefixGEStrict
(actually, is that the plan?) and went back to looking at 1f7e8ee and #2182 and #2178 (review), where we started doing strictness inside mergingIter
and can't follow the reasoning anymore.
The logic to init a heap is
n := h.len()
for i := n/2 - 1; i >= 0; i-- {
h.down(i, n)
}
Say we have a heap of 8 elements (memtable plus all levels with 1 sub-level in L0). I think the min and max number of comparisons is 7 and 11 respectively.
A comparison between a key that does not match the prefix and one that does match the prefix will stop after the same number of bytes as the check that the former key matches the prefix. So prefix checks are not necessarily faster (in one of those threads there was a claim of "faster prefix checks").
For levels with bloom filter checking (everything except memtable, and L6?), the probability that the key satisfies the prefix is very high (let's assume 100% for now) -- I may be forgetting something but I think we don't do an explicit prefix check in the sstable iterators. I think the earlier analysis ignored this high probability.
So let's consider some cases:
Case 1: even with prefix comparisons there are 8 elements
The code before that commit would do [7, 11] key comparisons to build the heap. Now we do 8 comparisons to check the prefix, that will all say yes. Then [7, 11] key comparisons to build the heap. So we are doing more work. In fairness, the old comments do mention the possibility of regression.
Case 2: prefix comparisons can eliminate memtable and L6, and there are 6 elements left
The code before that commit would do [7, 11] key comparisons to build the heap. Now we have done 8 comparisons for the prefix, and we need to build a heap with 6 elements. I think the latter is [5, 7] comparisons. So the total is still worse.
Case 3: prefix comparisons eliminate everything but L0 and L6
L1-L5 would not have returned a key. So the code before that commit would be working with 3 elements, corresponding to memtable, L0, L6. That is 2 key comparisons to build the heap. Now we do 3 prefix comparisons, and then 1 key comparison to build the heap, for a total of 4.
Case 4: prefix comparisons eliminate everything but L6
L0-L5 would not have returned a key. So the code before that commit would be working with 2 elements, corresponding to memtable, L6. That is 1 key comparison to build the heap. Now we do 2 prefix comparisons, and then no key comparison to build the heap, for a total of 2.
There is also a mention of benefit for iteration "If all of the keys with the prefix reside in a single level, iteration over the prefix also benefits from the reduced heap size removing k log n comparisons, where k is the number of times the iterator is stepped." I am not sure this is precise enough:
If everything except the top of the heap does not have the prefix, they are beyond the prefix. Iteration requires 2 key comparisons when there are at least 3 elements in the heap, since the top won't change. So we do have a benefit if we have managed to filter the heap down to 1 element, or 2 elements, since then iteration is doing 0 or 1 key comparison respectively. I think the CockroachDB cases related to MVCCGet which find a key (common case, I think), and doing a read for a MVCCPut, don't do a Next. If we wanted to optimize this iteration, we could have the mergingIter know (a) which levels already have a high probability of matching the prefix, so it doesn't waste prefix comparisons (b) remember which of the levels that don't have that high probability and for which it has not performed a prefix check. In a Next, it would do a prefix check for (b) and remove the entry if it does not match the prefix. If it does it would flip the bit, so it doesn't bother doing it again while that level iter is not nexted.
I am probably missing an interesting case, or missing some other obvious thing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: all files reviewed, 7 unresolved discussions (waiting on @RaduBerinde and @sumeerbhola)
merging_iter.go
line 986 at r2 (raw file):
So prefix checks are not necessarily faster (in one of those threads there was a claim of "faster prefix checks").
I haven't benchmarked to measure the difference, but because prefix checks are equality comparisons (eg, we know n the length of the prefix, then we're determining equality of the first n bytes), it seems like it should be faster. If Split(key) != len(prefix)
, then we can early exit. Otherwise, we're performing an equality comparison, which is generally faster than a comparison.
For levels with bloom filter checking (everything except memtable, and L6?), the probability that the key satisfies the prefix is very high (let's assume 100% for now)
Hrm, it'd be interesting to have some measure of this: how frequently the bloom filter fails to exclude a table but the table doesn't contain a key. I know the filter effectiveness in terms of percent of tables successfully excluded is high, and that's inversely related to the false positive rate, but I don't know if we can infer much from it without knowing the true frequency of a prefix in multiple levels.
I agree that if we replace the top-level Iterator prefix check with leaf prefix checks that never exclude keys, we're increasing the amount of work. Although with columnar blocks, this won't be the case because our initial seek will be performed on the prefixes themselves, and we can implement strictness for free as a part of the block seek.
So we do have a benefit if we have managed to filter the heap down to 1 element, or 2 elements, since then iteration is doing 0 or 1 key comparison respectively.
I think the benefit is also not just constrained to a single level containing the prefix, but I haven't stepped through the possibilities to bound the comparisons for various combinations of levels containing prefixes or not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: all files reviewed, 7 unresolved discussions (waiting on @sumeerbhola)
merging_iter.go
line 986 at r2 (raw file):
Previously, jbowens (Jackson Owens) wrote…
So prefix checks are not necessarily faster (in one of those threads there was a claim of "faster prefix checks").
I haven't benchmarked to measure the difference, but because prefix checks are equality comparisons (eg, we know n the length of the prefix, then we're determining equality of the first n bytes), it seems like it should be faster. If
Split(key) != len(prefix)
, then we can early exit. Otherwise, we're performing an equality comparison, which is generally faster than a comparison.For levels with bloom filter checking (everything except memtable, and L6?), the probability that the key satisfies the prefix is very high (let's assume 100% for now)
Hrm, it'd be interesting to have some measure of this: how frequently the bloom filter fails to exclude a table but the table doesn't contain a key. I know the filter effectiveness in terms of percent of tables successfully excluded is high, and that's inversely related to the false positive rate, but I don't know if we can infer much from it without knowing the true frequency of a prefix in multiple levels.
I agree that if we replace the top-level Iterator prefix check with leaf prefix checks that never exclude keys, we're increasing the amount of work. Although with columnar blocks, this won't be the case because our initial seek will be performed on the prefixes themselves, and we can implement strictness for free as a part of the block seek.
So we do have a benefit if we have managed to filter the heap down to 1 element, or 2 elements, since then iteration is doing 0 or 1 key comparison respectively.
I think the benefit is also not just constrained to a single level containing the prefix, but I haven't stepped through the possibilities to bound the comparisons for various combinations of levels containing prefixes or not.
- The prefix not matching key problem.
The main motivation for me is for understandability of the code. SeekPrefixGE definition says "The supplied prefix will be the prefix of the given key returned by that Split function.". I am trying to make the code match the definition. If we want to relax the definition, we have to define what the operation even means when the prefix doesn't match, what are the expectations from the iterator, and why would anyone ever call something like that. Is there a reasonable proposal around that definition?
A secondary practical motivation is that we will probably want to keep statistics about bloom filter false positives - i.e. when the bloom filter passes but we find no key with the prefix. We would need to check the prefix to know if really the bloom filter failed to exclude something, or we're looking for an entirely different key.
- The strictness problem
Like before, the main motivation here is understandability. Some operators implement a SeekPrefixGEStrict
which leads to a second TopLevelIterator
interface which makes things more complicated. There are instances in the code where SeekPrefixGE
is used with the expectation that it is strict: pebble.Iterator.SeekPrefixGE
is strict and calls l.iter.SeekPrefixGE
(which is strict because there's a merging iterator under there). This is a mess.
A practical motivation is that the new columnar format allows us to very cheaply know when the prefix changes, so if we push the strictness down to those iterators, we can avoid that comparison altogether. Even without that, prefix checks can be cheaper (like Jackson mentioned above, and also we could do a quick check of the last e.g. 8 bytes of the prefix - we expect some common prefix because the keys are ordered).
Back to the PR: I don't understand all the nuances, but it feels like we want this SeekPrefixGE(tombstone.End)
call not necessarily for what it does at the sstable level, but for what it does inside the levelIter
? If that is the case, can't we have a separate method on levelIter
that does whatever it is that we need? The merging iterator already has a reference to each *levelIter
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: all files reviewed, 7 unresolved discussions (waiting on @jbowens and @RaduBerinde)
merging_iter.go
line 986 at r2 (raw file):
Previously, RaduBerinde wrote…
- The prefix not matching key problem.
The main motivation for me is for understandability of the code. SeekPrefixGE definition says "The supplied prefix will be the prefix of the given key returned by that Split function.". I am trying to make the code match the definition. If we want to relax the definition, we have to define what the operation even means when the prefix doesn't match, what are the expectations from the iterator, and why would anyone ever call something like that. Is there a reasonable proposal around that definition?
A secondary practical motivation is that we will probably want to keep statistics about bloom filter false positives - i.e. when the bloom filter passes but we find no key with the prefix. We would need to check the prefix to know if really the bloom filter failed to exclude something, or we're looking for an entirely different key.
- The strictness problem
Like before, the main motivation here is understandability. Some operators implement a
SeekPrefixGEStrict
which leads to a secondTopLevelIterator
interface which makes things more complicated. There are instances in the code whereSeekPrefixGE
is used with the expectation that it is strict:pebble.Iterator.SeekPrefixGE
is strict and callsl.iter.SeekPrefixGE
(which is strict because there's a merging iterator under there). This is a mess.A practical motivation is that the new columnar format allows us to very cheaply know when the prefix changes, so if we push the strictness down to those iterators, we can avoid that comparison altogether. Even without that, prefix checks can be cheaper (like Jackson mentioned above, and also we could do a quick check of the last e.g. 8 bytes of the prefix - we expect some common prefix because the keys are ordered).
Back to the PR: I don't understand all the nuances, but it feels like we want this
SeekPrefixGE(tombstone.End)
call not necessarily for what it does at the sstable level, but for what it does inside thelevelIter
? If that is the case, can't we have a separate method onlevelIter
that does whatever it is that we need? The merging iterator already has a reference to each*levelIter
.
I suspect there is a path forward that allows for enforcing that the prefix matches the key, while not becoming a de-optimization. I am mostly thinking about this in the context of the other PR, since the correctness and optimization reasoning need to go hand in hand.
Regarding strictness, I am ok with whatever you all decide. I was merely trying to point out that some of the original performance motivation for strictness may be flawed, in that it didn't take into account that the sstable levels, which will typically dominate the levels, may already have a high probability that the key matches the prefix. I see the point of colblk
allowing for cheap prefix change checks -- that is beneficial for SeekPrefixGE followed by Nexts, but won't help with just SeekPrefixGE (which may be a more common case for CRDB's OLTP workloads).
I think it should come at no extra cost to keep track of whether the binary search of the prefix ended in an equality or non-equality for SeekPrefixGE. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: all files reviewed, 7 unresolved discussions (waiting on @RaduBerinde and @sumeerbhola)
merging_iter.go
line 986 at r2 (raw file):
won't help with just SeekPrefixGE
It still will if bloom filters are imperfect, yes?
This commit adds
SeekPrefixGE
assertions verifying that theprefix
actually is the prefix for thekey
. This was not always the case so the offending code paths are adjusted.In the future, we should create a wrapper iterator that verifies this sort of thing.
Informs #3794