Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

db: fix cases where SeekPrefixGE prefix doesn't match key #3845

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

RaduBerinde
Copy link
Member

This commit adds SeekPrefixGE assertions verifying that the prefix actually is the prefix for the key. This was not always the case so the offending code paths are adjusted.

In the future, we should create a wrapper iterator that verifies this sort of thing.

Informs #3794

@RaduBerinde RaduBerinde requested a review from jbowens August 12, 2024 20:23
@RaduBerinde RaduBerinde requested a review from a team as a code owner August 12, 2024 20:23
@cockroach-teamcity
Copy link
Member

This change is Reviewable

Copy link
Member Author

@RaduBerinde RaduBerinde left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 7 files reviewed, 1 unresolved discussion (waiting on @jbowens)


merging_iter.go line 907 at r1 (raw file):

func (m *mergingIter) seekGE(key []byte, level int, flags base.SeekGEFlags) error {
	if invariants.Enabled && m.lower != nil && m.heap.cmp(key, m.lower) < 0 {
		m.logger.Fatalf("mergingIter: lower bound violation: %s < %s\n%s", key, m.lower, debug.Stack())

Any good reason for this code to use Fatalf instead of panic? I'm thinking that panic should be just fine under invariants.Enabled since it won't happen in production anyway

This commit adds `SeekPrefixGE` assertions verifying that the `prefix`
actually is the prefix for the `key`. This was not always the case so
the offending code paths are adjusted.

In the future, we should create a wrapper iterator that verifies this
sort of thing.

Informs cockroachdb#3794
Copy link
Collaborator

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm looking at this strict stuff for the first time, and there is a question below related to that and rangedel handling. Rest looks great.

Reviewed 5 of 7 files at r1, 2 of 2 files at r2, all commit messages.
Reviewable status: all files reviewed, 7 unresolved discussions (waiting on @jbowens and @RaduBerinde)


merging_iter.go line 907 at r1 (raw file):

Previously, RaduBerinde wrote…

Any good reason for this code to use Fatalf instead of panic? I'm thinking that panic should be just fine under invariants.Enabled since it won't happen in production anyway

no good reason, afaik.


merging_iter.go line 961 at r2 (raw file):

	for ; level < len(m.levels); level++ {
		if invariants.Enabled && m.lower != nil && m.heap.cmp(key, m.lower) < 0 {
			m.logger.Fatalf("mergingIter: lower bound violation: %s < %s\n%s", key, m.lower, debug.Stack())

I think key is getting updated in this loop, which is why the assertion was inside the loop.


merging_iter.go line 986 at r2 (raw file):

		// keys, and there might exist live range keys within the range
		// tombstone's span that need to be observed to trigger a switch to
		// combined iteration.

Is there an invariant that if we are doing a prefix seek, that key is always a prefix of m.prefix, even though it may advance in this loop?

Which is why we can't seek using l.tombstone.End if it does not match the prefix? Say the prefix was c and we were seeking to c@30. And level 1 landed at c@5, and had a tombstone [a,f). We can't seek the lower levels to f. So we will seek them using c@30. Say level 2 lands at c@20. Now we have in some sense lost the benefit of the tombstone, in that we have not seeked far enough. We won't bother comparing with the tombstone so that's ok-ish. Say the next prefix seek done by the user was c@3. Then level 1 may not have a c, but it still has the tombstone. Then we will seek these lower levels to c@3.

Is this strictness resulting in unnecessary seeking in a part of the key space which is known to be range deleted?


sstable/reader_iter_single_lvl.go line 676 at r2 (raw file):

		// Iterator is not positioned based on last seek.
		flags = flags.DisableTrySeekUsingNext()
		i.didNotPositionOnLastSeekGE = false

Is this just a defensive addition, in case we start optimizing by not positioning on the regular SeekGE path too?


sstable/reader_iter_two_lvl.go line 40 at r2 (raw file):

	// is set, the TrySeekUsingNext optimization is disabled on the next seek.
	// This happens for example when the bloom filter excludes a prefix.
	didNotPositionOnLastSeekGE bool

Is it ok for the default value for a new iterator to be false? Will that cause us to assume that it was positioned? Do we get away with it because the caller will not set seekGEFlagTrySeekUsingNext?


sstable/reader_iter_two_lvl.go line 404 at r2 (raw file):

	if flags.TrySeekUsingNext() &&
		(i.secondLevel.exhaustedBounds == +1 || (PD(&i.secondLevel.data).IsDataInvalidated() && i.secondLevel.index.IsDataInvalidated())) &&
		err == nil {

How about adding a comment:

// INVARIANT: flags.TrySeekUsingNext(). Which implies i.didNotPositionOnLastSeekGE was false at the start of SeekPrefixGE, and is still false. So future seeks will continue to be able to use the current position to optimize.


sstable/reader_iter_two_lvl.go line 529 at r2 (raw file):

	}

	if !dontSeekWithinSingleLevelIter {

How about adding something like. Otherwise it is a bit confusing to see singleLevelIterator also having a if-block that disables TrySeekUsingNext, and worrying that it will get executed.

if invariants.Enabled && i.didNotPositionOnLastSeekGE {
   // i.didNotPositionOnLastSeekGE must always be false, since we've fiddled with that setting, and we want
   // singleLevelIterator to blindly do what we are telling it to in flags.
   panic(...)
}

Copy link
Collaborator

@jbowens jbowens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 3 of 7 files at r1, all commit messages.
Reviewable status: all files reviewed, 7 unresolved discussions (waiting on @RaduBerinde and @sumeerbhola)


merging_iter.go line 986 at r2 (raw file):

Previously, sumeerbhola wrote…

Is there an invariant that if we are doing a prefix seek, that key is always a prefix of m.prefix, even though it may advance in this loop?

Which is why we can't seek using l.tombstone.End if it does not match the prefix? Say the prefix was c and we were seeking to c@30. And level 1 landed at c@5, and had a tombstone [a,f). We can't seek the lower levels to f. So we will seek them using c@30. Say level 2 lands at c@20. Now we have in some sense lost the benefit of the tombstone, in that we have not seeked far enough. We won't bother comparing with the tombstone so that's ok-ish. Say the next prefix seek done by the user was c@3. Then level 1 may not have a c, but it still has the tombstone. Then we will seek these lower levels to c@3.

Is this strictness resulting in unnecessary seeking in a part of the key space which is known to be range deleted?

Hm I think this is true. Could we get away with not performing the remaining lower-level seeks at all and instead remember on their mergingIterLevels that these levels' iterators are unpositioned? I think no because a subsequent SeekGE(k2) may find that k2 is < the position of the higher-level iterators (if they too were re-positioned by range deletes), which could cause the lower levels' keys to improperly appear to be undeleted by higher-level tombstones.

We would either need to a) remember l.tombstone.End so that a subsequent SeekGE(k2) seeks to max(k2,l.tombstone.End), or b) maintain an invariant that all the higher levels remain in the heap and their l.tombstones cascade down to l.tombstone.End such that a subsequent SeekGE(k2) will end up adjusting its seek key to the original l.tombstone.End by the time it reaches the unpositioned levels.


sstable/reader_iter_two_lvl.go line 40 at r2 (raw file):

Do we get away with it because the caller will not set seekGEFlagTrySeekUsingNext?

I think this is true. The levelIter takes care to disable the TrySeekUsingNext flag when it opens a new file.

Copy link
Member Author

@RaduBerinde RaduBerinde left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, 7 unresolved discussions (waiting on @sumeerbhola)


merging_iter.go line 961 at r2 (raw file):

Previously, sumeerbhola wrote…

I think key is getting updated in this loop, which is why the assertion was inside the loop.

Right, but we only update it to a larger one.


merging_iter.go line 986 at r2 (raw file):

Previously, jbowens (Jackson Owens) wrote…

Hm I think this is true. Could we get away with not performing the remaining lower-level seeks at all and instead remember on their mergingIterLevels that these levels' iterators are unpositioned? I think no because a subsequent SeekGE(k2) may find that k2 is < the position of the higher-level iterators (if they too were re-positioned by range deletes), which could cause the lower levels' keys to improperly appear to be undeleted by higher-level tombstones.

We would either need to a) remember l.tombstone.End so that a subsequent SeekGE(k2) seeks to max(k2,l.tombstone.End), or b) maintain an invariant that all the higher levels remain in the heap and their l.tombstones cascade down to l.tombstone.End such that a subsequent SeekGE(k2) will end up adjusting its seek key to the original l.tombstone.End by the time it reaches the unpositioned levels.

Would doing SeekGE(tombstone.End) on the lower levels be better? AFAICT that's the closest thing to what we were doing before?


sstable/reader_iter_single_lvl.go line 676 at r2 (raw file):

Previously, sumeerbhola wrote…

Is this just a defensive addition, in case we start optimizing by not positioning on the regular SeekGE path too?

No, in case we get a SeekGE w/TrySeekUsingNext after a SeekPrefixGE (which I ran into if I remember correctly).


sstable/reader_iter_two_lvl.go line 40 at r2 (raw file):

Previously, jbowens (Jackson Owens) wrote…

Do we get away with it because the caller will not set seekGEFlagTrySeekUsingNext?

I think this is true. The levelIter takes care to disable the TrySeekUsingNext flag when it opens a new file.

Good point though.. I can change it to allowSeekUsingNext?

Copy link
Member Author

@RaduBerinde RaduBerinde left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, 7 unresolved discussions (waiting on @sumeerbhola)


merging_iter.go line 986 at r2 (raw file):

Previously, RaduBerinde wrote…

Would doing SeekGE(tombstone.End) on the lower levels be better? AFAICT that's the closest thing to what we were doing before?

Jackson, I don't understand your point - is there a correctness issue if we just remember which levels we are allowed to use TrySeekUsingNext? It seems equivalent to the case where the existing SeekPrefixGE did not match the bloom filter.

@sumeerbhola sumeerbhola requested a review from jbowens August 15, 2024 00:21
Copy link
Collaborator

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, 7 unresolved discussions (waiting on @jbowens and @RaduBerinde)


merging_iter.go line 986 at r2 (raw file):

Would doing SeekGE(tombstone.End) on the lower levels be better? AFAICT that's the closest thing to what we were doing before?

We may want something that knows the prefix so the levelIter will seek the manifest.LevelIter, but realizing that the seek key is beyond the prefix, will not load the file containing key, and return nil.

@RaduBerinde
Copy link
Member Author

How important is this case to optimize? (given that we now use excise for rebalancing)

@jbowens
Copy link
Collaborator

jbowens commented Aug 15, 2024

Jackson, I don't understand your point - is there a correctness issue if we just remember which levels we are allowed to use TrySeekUsingNext? It seems equivalent to the case where the existing SeekPrefixGE did not match the bloom filter.

I think the difference is that in the simple bloom filter case, we still seeked all the levels to the correct file. This paragraph in that TrySeekUsingNext PR is referring to this problem but it's very high level: https://github.com/cockroachdb/pebble/pull/3329/files#diff-5f17b9292c13dad71775aec366c4df449009f7fc211d69ed6cb134e6880feff0R64-R69

Ah yeah, here's a more concrete example in the mergingIter comments:

pebble/merging_iter.go

Lines 929 to 954 in 25ac678

// Deterministically disable the TrySeekUsingNext optimizations sometimes in
// invariant builds to encourage the metamorphic tests to surface bugs. Note
// that we cannot disable the optimization within individual levels. It must
// be disabled for all levels or none. If one lower-level iterator performs
// a fresh seek whereas another takes advantage of its current iterator
// position, the heap can become inconsistent. Consider the following
// example:
//
// L5: [ [b-c) ] [ d ]*
// L6: [ b ] [e]*
//
// Imagine a SeekGE(a). The [b-c) range tombstone deletes the L6 point key
// 'b', resulting in the iterator positioned at d with the heap:
//
// {L5: d, L6: e}
//
// A subsequent SeekGE(b) is seeking to a larger key, so the caller may set
// TrySeekUsingNext()=true. If the L5 iterator used the TrySeekUsingNext
// optimization but the L6 iterator did not, the iterator would have the
// heap:
//
// {L6: b, L5: d}
//
// Because the L5 iterator has already advanced to the next sstable, the
// merging iterator cannot observe the [b-c) range tombstone and will
// mistakenly return L6's deleted point key 'b'.

I'm having trouble articulating the invariant we're relying on here. [That's never a good sign :)]

@jbowens
Copy link
Collaborator

jbowens commented Aug 15, 2024

How important is this case to optimize? (given that we now use excise for rebalancing)

I think range deletions are still common although maybe not particularly broad ones. I believe MVCC GC will write them if there's a sufficiently long span of all keys that need GC-ing. I wouldn't be surprised if it's not important to optimize in practice, but I'm still a little wary of a regression.

Copy link
Collaborator

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, 7 unresolved discussions (waiting on @jbowens and @RaduBerinde)


merging_iter.go line 986 at r2 (raw file):

Previously, sumeerbhola wrote…

Would doing SeekGE(tombstone.End) on the lower levels be better? AFAICT that's the closest thing to what we were doing before?

We may want something that knows the prefix so the levelIter will seek the manifest.LevelIter, but realizing that the seek key is beyond the prefix, will not load the file containing key, and return nil.

Hmm, I'm forgetting the benefit of restricting the key in SeekPrefixGE such that prefix must be a prefix of it.

Also, I was forgetting the motivation for the plan to change SeekPrefixGE so that every iterator has SeekPrefixGEStrict (actually, is that the plan?) and went back to looking at 1f7e8ee and #2182 and #2178 (review), where we started doing strictness inside mergingIter and can't follow the reasoning anymore.

The logic to init a heap is

	n := h.len()
	for i := n/2 - 1; i >= 0; i-- {
		h.down(i, n)
	}

Say we have a heap of 8 elements (memtable plus all levels with 1 sub-level in L0). I think the min and max number of comparisons is 7 and 11 respectively.

A comparison between a key that does not match the prefix and one that does match the prefix will stop after the same number of bytes as the check that the former key matches the prefix. So prefix checks are not necessarily faster (in one of those threads there was a claim of "faster prefix checks").

For levels with bloom filter checking (everything except memtable, and L6?), the probability that the key satisfies the prefix is very high (let's assume 100% for now) -- I may be forgetting something but I think we don't do an explicit prefix check in the sstable iterators. I think the earlier analysis ignored this high probability.

So let's consider some cases:

Case 1: even with prefix comparisons there are 8 elements

The code before that commit would do [7, 11] key comparisons to build the heap. Now we do 8 comparisons to check the prefix, that will all say yes. Then [7, 11] key comparisons to build the heap. So we are doing more work. In fairness, the old comments do mention the possibility of regression.

Case 2: prefix comparisons can eliminate memtable and L6, and there are 6 elements left

The code before that commit would do [7, 11] key comparisons to build the heap. Now we have done 8 comparisons for the prefix, and we need to build a heap with 6 elements. I think the latter is [5, 7] comparisons. So the total is still worse.

Case 3: prefix comparisons eliminate everything but L0 and L6

L1-L5 would not have returned a key. So the code before that commit would be working with 3 elements, corresponding to memtable, L0, L6. That is 2 key comparisons to build the heap. Now we do 3 prefix comparisons, and then 1 key comparison to build the heap, for a total of 4.

Case 4: prefix comparisons eliminate everything but L6

L0-L5 would not have returned a key. So the code before that commit would be working with 2 elements, corresponding to memtable, L6. That is 1 key comparison to build the heap. Now we do 2 prefix comparisons, and then no key comparison to build the heap, for a total of 2.

There is also a mention of benefit for iteration "If all of the keys with the prefix reside in a single level, iteration over the prefix also benefits from the reduced heap size removing k log n comparisons, where k is the number of times the iterator is stepped." I am not sure this is precise enough:
If everything except the top of the heap does not have the prefix, they are beyond the prefix. Iteration requires 2 key comparisons when there are at least 3 elements in the heap, since the top won't change. So we do have a benefit if we have managed to filter the heap down to 1 element, or 2 elements, since then iteration is doing 0 or 1 key comparison respectively. I think the CockroachDB cases related to MVCCGet which find a key (common case, I think), and doing a read for a MVCCPut, don't do a Next. If we wanted to optimize this iteration, we could have the mergingIter know (a) which levels already have a high probability of matching the prefix, so it doesn't waste prefix comparisons (b) remember which of the levels that don't have that high probability and for which it has not performed a prefix check. In a Next, it would do a prefix check for (b) and remove the entry if it does not match the prefix. If it does it would flip the bit, so it doesn't bother doing it again while that level iter is not nexted.

I am probably missing an interesting case, or missing some other obvious thing.

Copy link
Collaborator

@jbowens jbowens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, 7 unresolved discussions (waiting on @RaduBerinde and @sumeerbhola)


merging_iter.go line 986 at r2 (raw file):

So prefix checks are not necessarily faster (in one of those threads there was a claim of "faster prefix checks").

I haven't benchmarked to measure the difference, but because prefix checks are equality comparisons (eg, we know n the length of the prefix, then we're determining equality of the first n bytes), it seems like it should be faster. If Split(key) != len(prefix), then we can early exit. Otherwise, we're performing an equality comparison, which is generally faster than a comparison.

For levels with bloom filter checking (everything except memtable, and L6?), the probability that the key satisfies the prefix is very high (let's assume 100% for now)

Hrm, it'd be interesting to have some measure of this: how frequently the bloom filter fails to exclude a table but the table doesn't contain a key. I know the filter effectiveness in terms of percent of tables successfully excluded is high, and that's inversely related to the false positive rate, but I don't know if we can infer much from it without knowing the true frequency of a prefix in multiple levels.

I agree that if we replace the top-level Iterator prefix check with leaf prefix checks that never exclude keys, we're increasing the amount of work. Although with columnar blocks, this won't be the case because our initial seek will be performed on the prefixes themselves, and we can implement strictness for free as a part of the block seek.

So we do have a benefit if we have managed to filter the heap down to 1 element, or 2 elements, since then iteration is doing 0 or 1 key comparison respectively.

I think the benefit is also not just constrained to a single level containing the prefix, but I haven't stepped through the possibilities to bound the comparisons for various combinations of levels containing prefixes or not.

Copy link
Member Author

@RaduBerinde RaduBerinde left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, 7 unresolved discussions (waiting on @sumeerbhola)


merging_iter.go line 986 at r2 (raw file):

Previously, jbowens (Jackson Owens) wrote…

So prefix checks are not necessarily faster (in one of those threads there was a claim of "faster prefix checks").

I haven't benchmarked to measure the difference, but because prefix checks are equality comparisons (eg, we know n the length of the prefix, then we're determining equality of the first n bytes), it seems like it should be faster. If Split(key) != len(prefix), then we can early exit. Otherwise, we're performing an equality comparison, which is generally faster than a comparison.

For levels with bloom filter checking (everything except memtable, and L6?), the probability that the key satisfies the prefix is very high (let's assume 100% for now)

Hrm, it'd be interesting to have some measure of this: how frequently the bloom filter fails to exclude a table but the table doesn't contain a key. I know the filter effectiveness in terms of percent of tables successfully excluded is high, and that's inversely related to the false positive rate, but I don't know if we can infer much from it without knowing the true frequency of a prefix in multiple levels.

I agree that if we replace the top-level Iterator prefix check with leaf prefix checks that never exclude keys, we're increasing the amount of work. Although with columnar blocks, this won't be the case because our initial seek will be performed on the prefixes themselves, and we can implement strictness for free as a part of the block seek.

So we do have a benefit if we have managed to filter the heap down to 1 element, or 2 elements, since then iteration is doing 0 or 1 key comparison respectively.

I think the benefit is also not just constrained to a single level containing the prefix, but I haven't stepped through the possibilities to bound the comparisons for various combinations of levels containing prefixes or not.

  1. The prefix not matching key problem.

The main motivation for me is for understandability of the code. SeekPrefixGE definition says "The supplied prefix will be the prefix of the given key returned by that Split function.". I am trying to make the code match the definition. If we want to relax the definition, we have to define what the operation even means when the prefix doesn't match, what are the expectations from the iterator, and why would anyone ever call something like that. Is there a reasonable proposal around that definition?

A secondary practical motivation is that we will probably want to keep statistics about bloom filter false positives - i.e. when the bloom filter passes but we find no key with the prefix. We would need to check the prefix to know if really the bloom filter failed to exclude something, or we're looking for an entirely different key.

  1. The strictness problem

Like before, the main motivation here is understandability. Some operators implement a SeekPrefixGEStrict which leads to a second TopLevelIterator interface which makes things more complicated. There are instances in the code where SeekPrefixGE is used with the expectation that it is strict: pebble.Iterator.SeekPrefixGE is strict and calls l.iter.SeekPrefixGE (which is strict because there's a merging iterator under there). This is a mess.

A practical motivation is that the new columnar format allows us to very cheaply know when the prefix changes, so if we push the strictness down to those iterators, we can avoid that comparison altogether. Even without that, prefix checks can be cheaper (like Jackson mentioned above, and also we could do a quick check of the last e.g. 8 bytes of the prefix - we expect some common prefix because the keys are ordered).

Back to the PR: I don't understand all the nuances, but it feels like we want this SeekPrefixGE(tombstone.End) call not necessarily for what it does at the sstable level, but for what it does inside the levelIter? If that is the case, can't we have a separate method on levelIter that does whatever it is that we need? The merging iterator already has a reference to each *levelIter.

@sumeerbhola sumeerbhola requested a review from jbowens August 16, 2024 15:02
Copy link
Collaborator

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, 7 unresolved discussions (waiting on @jbowens and @RaduBerinde)


merging_iter.go line 986 at r2 (raw file):

Previously, RaduBerinde wrote…
  1. The prefix not matching key problem.

The main motivation for me is for understandability of the code. SeekPrefixGE definition says "The supplied prefix will be the prefix of the given key returned by that Split function.". I am trying to make the code match the definition. If we want to relax the definition, we have to define what the operation even means when the prefix doesn't match, what are the expectations from the iterator, and why would anyone ever call something like that. Is there a reasonable proposal around that definition?

A secondary practical motivation is that we will probably want to keep statistics about bloom filter false positives - i.e. when the bloom filter passes but we find no key with the prefix. We would need to check the prefix to know if really the bloom filter failed to exclude something, or we're looking for an entirely different key.

  1. The strictness problem

Like before, the main motivation here is understandability. Some operators implement a SeekPrefixGEStrict which leads to a second TopLevelIterator interface which makes things more complicated. There are instances in the code where SeekPrefixGE is used with the expectation that it is strict: pebble.Iterator.SeekPrefixGE is strict and calls l.iter.SeekPrefixGE (which is strict because there's a merging iterator under there). This is a mess.

A practical motivation is that the new columnar format allows us to very cheaply know when the prefix changes, so if we push the strictness down to those iterators, we can avoid that comparison altogether. Even without that, prefix checks can be cheaper (like Jackson mentioned above, and also we could do a quick check of the last e.g. 8 bytes of the prefix - we expect some common prefix because the keys are ordered).

Back to the PR: I don't understand all the nuances, but it feels like we want this SeekPrefixGE(tombstone.End) call not necessarily for what it does at the sstable level, but for what it does inside the levelIter? If that is the case, can't we have a separate method on levelIter that does whatever it is that we need? The merging iterator already has a reference to each *levelIter.

I suspect there is a path forward that allows for enforcing that the prefix matches the key, while not becoming a de-optimization. I am mostly thinking about this in the context of the other PR, since the correctness and optimization reasoning need to go hand in hand.

Regarding strictness, I am ok with whatever you all decide. I was merely trying to point out that some of the original performance motivation for strictness may be flawed, in that it didn't take into account that the sstable levels, which will typically dominate the levels, may already have a high probability that the key matches the prefix. I see the point of colblk allowing for cheap prefix change checks -- that is beneficial for SeekPrefixGE followed by Nexts, but won't help with just SeekPrefixGE (which may be a more common case for CRDB's OLTP workloads).

@RaduBerinde
Copy link
Member Author

I see the point of colblk allowing for cheap prefix change checks -- that is beneficial for SeekPrefixGE followed by Nexts, but won't help with just SeekPrefixGE (which may be a more common case for CRDB's OLTP workloads).

I think it should come at no extra cost to keep track of whether the binary search of the prefix ended in an equality or non-equality for SeekPrefixGE.

Copy link
Collaborator

@jbowens jbowens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, 7 unresolved discussions (waiting on @RaduBerinde and @sumeerbhola)


merging_iter.go line 986 at r2 (raw file):

won't help with just SeekPrefixGE

It still will if bloom filters are imperfect, yes?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants