
storage: introduce LockTableIterator #110319

Merged
merged 3 commits
Sep 13, 2023

Conversation

nvanbenschoten
Member

Informs #100193.
Closes #109644.
Closes #109570.

This PR introduces a new form of iterator, the LockTableIterator, which is an EngineIterator that iterates over locks in the lock table keyspace. It performs no translation of input or output keys or values, so it is used like a normal EngineIterator, with the limitation that it can only be used to iterate over the lock table keyspace.

The benefit of using a LockTableIterator is that it performs filtering of the locks in the lock table, only returning locks that match the configured filtering criteria and transparently skipping past locks that do not. The filtering criteria are expressed as a logical disjunction of two configuration parameters, at least one of which must be set:

  • MatchTxnID: if set, the iterator returns locks held by this transaction.

  • MatchMinStr: if set, the iterator returns locks held by any transaction with this strength or stronger.

Expressed abstractly as a SQL query, the filtering criteria are:

SELECT * FROM lock_table WHERE (MatchTxnID  != 0 AND txn_id = MatchTxnID)
                            OR (MatchMinStr != 0 AND strength >= MatchMinStr)
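The disjunction above can be sketched in Go. This is a minimal illustration with simplified types — the real iterator uses `lock.Strength` and `uuid.UUID`, and all names below are illustrative, not CockroachDB code:

```go
package main

import "fmt"

// Strength mirrors the ordering of lock strengths: higher is stronger.
type Strength int

const (
	None Strength = iota
	Shared
	Exclusive
	Intent
)

// LockTableIteratorOptions mirrors the two filter parameters described
// above. The zero value of each field means "unset".
type LockTableIteratorOptions struct {
	MatchTxnID  string   // return locks held by this transaction
	MatchMinStr Strength // return locks with this strength or stronger
}

// matches implements the disjunction from the SQL sketch: a lock is
// returned if either configured predicate accepts it.
func (o LockTableIteratorOptions) matches(txnID string, str Strength) bool {
	return (o.MatchTxnID != "" && txnID == o.MatchTxnID) ||
		(o.MatchMinStr != None && str >= o.MatchMinStr)
}

func main() {
	opts := LockTableIteratorOptions{MatchMinStr: Exclusive}
	fmt.Println(opts.matches("txn1", Shared)) // false: Shared is too weak
	fmt.Println(opts.matches("txn1", Intent)) // true: Intent >= Exclusive
}
```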

Pushing this filtering logic into the iterator is a convenience for its users. It also allows the iterator to use its knowledge of the lock table keyspace structure to efficiently skip past locks that do not match the filtering criteria. It does this by seeking past many ignored locks when appropriate to avoid cases of O(ignored_locks) work, instead performing at most O(matching_locks + locked_keys) work.

A common case where this matters is with shared locks. If the iterator is configured to ignore shared locks, a single key with a large number of shared locks can be skipped over with a single seek. Avoiding this unnecessary work is essential to avoiding quadratic behavior during shared lock acquisition and release.
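As a rough sketch of why a single seek helps here — assuming, as in the lock table encoding, that all locks on a user key are stored adjacently; the types and helper below are hypothetical, not CockroachDB code:

```go
package main

import "fmt"

// entry is a toy lock table entry; in the real keyspace, all locks on a
// user key sort adjacently.
type entry struct {
	key    string
	shared bool
}

// nextUserKeyIndex models a single seek: jump straight past every
// remaining lock on the current user key. A real iterator would issue
// one SeekEngineKeyGE to the next key prefix instead of stepping.
func nextUserKeyIndex(locks []entry, i int) int {
	j := i
	for j < len(locks) && locks[j].key == locks[i].key {
		j++
	}
	return j
}

func main() {
	locks := []entry{
		{"a", false},
		{"b", true}, {"b", true}, {"b", true}, {"b", true},
		{"c", false},
	}
	// Iterate, spending one "seek" per key full of ignored shared locks
	// rather than one step per ignored lock.
	steps := 0
	for i := 0; i < len(locks); {
		steps++
		if locks[i].shared {
			i = nextUserKeyIndex(locks, i) // one seek skips all shared locks on "b"
			continue
		}
		fmt.Println(locks[i].key)
		i++
	}
	fmt.Println("steps:", steps) // 3 steps, not 6
}
```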

The LockTableIterator will be used by intentInterleavingIter, MVCCCheckLock, MVCCAcquireLock, mvccPutUsingIter, MVCCResolveWriteIntent, MVCCResolveWriteIntentRange, computeMinIntentTimestamp, SeparatedIntentScanner, and more.


The PR introduces the abstraction of a LockTableIterator, but doesn't yet implement the logic to avoid O(ignored_locks) work in the iterator. This is left as a TODO which will be addressed in the next few weeks, but doesn't need to block this PR or use of the LockTableIterator.

The PR then copies over the 2 commits from #109570, which ignore non-intent locks in intentInterleavingIter. This could have been done in a separate PR, but it felt useful to demonstrate how much simpler and cleaner #109644 gets with this new abstraction. I can extract it back out if reviewers would like.


Release note: None


All intents are now separated.

Epic: None
Release note: None
Informs cockroachdb#100193.

This commit updates the intentInterleavingIter to ignore locks
in the lock table keyspace with strengths other than lock.Intent
(i.e. shared and exclusive locks). Future versions of the
iterator may expose information to users about whether any
non-intent locks were observed and, if so, which keys they were
found on. For now, no such information is exposed.

This change is needed for replicated locks. Non-locking reads
will not interact with these locks at all, so we want the
intentInterleavingIter to ignore them. Meanwhile, locking reads
and writes will use a different abstraction to peek into the
lock table and look for conflicts.

Release note: None

Collaborator

@sumeerbhola sumeerbhola left a comment


I didn't look at the tests, but the interface and integration with intentInterleavingIter look good.

Reviewed 3 of 9 files at r1, 1 of 1 files at r2, 1 of 5 files at r3, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @arulajmani and @nvanbenschoten)


pkg/storage/lock_table_iterator.go line 61 at r1 (raw file):

// and release.
type LockTableIterator struct {
	iter EngineIterator

There was some commentary in intentInterleavingIter about avoiding dynamic dispatch and using *pebbleIterator. Consider doing that here.


pkg/storage/lock_table_iterator.go line 139 at r1 (raw file):

// SeekEngineKeyGE implements the EngineIterator interface.
func (i *LockTableIterator) SeekEngineKeyGE(key EngineKey) (valid bool, err error) {
	if err := checkLockTableKey(key.Key); err != nil {

Is checkLockTableKey cheap enough, or should it be hidden behind invariants.Enabled?


pkg/storage/lock_table_iterator.go line 252 at r1 (raw file):

) (state pebble.IterValidityState, err error) {
	for {
		engineKey, err := i.iter.UnsafeEngineKey()

nit: can this use i.LockTableKeyVersion()

Member Author

@nvanbenschoten nvanbenschoten left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @arulajmani and @sumeerbhola)


pkg/storage/lock_table_iterator.go line 61 at r1 (raw file):

Previously, sumeerbhola wrote…

There was some commentary in intentInterleavingIter about avoiding dynamic dispatch and using *pebbleIterator. Consider doing that here.

I hoped to do that, but unfortunately, this won't always be a *pebbleIterator. There will be cases where it's a spanset.EngineIterator, when constructed from a spanset.spanSetReader. We don't have this problem in newIntentInterleavingIterator because iterators are constructed from a Reader below spanset.spanSetReader in the Reader.NewMVCCIterator hierarchy.

Dynamic dispatch on the intentIter (as opposed to the mvcc iter) doesn't seem to make much of a difference on the benchmarks we care about, though:

➜ benchstat full_static.txt full_dynamic.txt
name                                                                old time/op   new time/op   delta
MVCCScan_Pebble/rows=100/versions=1/valueSize=64/numRangeKeys=0-10   25.8µs ± 3%   26.2µs ± 3%  +1.72%  (p=0.000 n=49+44)

name                                                                old speed     new speed     delta
MVCCScan_Pebble/rows=100/versions=1/valueSize=64/numRangeKeys=0-10  248MB/s ± 3%  244MB/s ± 3%  -1.76%  (p=0.000 n=49+45)

➜ benchstat full_static.txt static_mvcc_dynamic_intent.txt
name                                                                old time/op   new time/op   delta
MVCCScan_Pebble/rows=100/versions=1/valueSize=64/numRangeKeys=0-10   25.8µs ± 3%   25.8µs ± 2%   ~     (p=0.163 n=49+45)

name                                                                old speed     new speed     delta
MVCCScan_Pebble/rows=100/versions=1/valueSize=64/numRangeKeys=0-10  248MB/s ± 3%  248MB/s ± 2%   ~     (p=0.163 n=49+45)

That's because these benchmarks we were concerned about when making this optimization don't even write any intents. However, I also don't see a difference with IntentInterleavingSeekGEAndIter/separated=true/version=1/intentStride=1/keyLen=10/seekStride=1, which should exercise the intentIter more than any other benchmark we have. So I'll leave this as is.


pkg/storage/lock_table_iterator.go line 139 at r1 (raw file):

Previously, sumeerbhola wrote…

Is checkLockTableKey cheap enough, or should be hidden behind invariants.Enabled?

It's inlined all the way down through the call to bytes.HasPrefix and the underlying bytes.Equal of two bytes, so it's just a few branches that should play well with a branch predictor.


pkg/storage/lock_table_iterator.go line 252 at r1 (raw file):

Previously, sumeerbhola wrote…

nit: can this use i.LockTableKeyVersion()

It can right now, but to address the TODO below, I'll need access to the EngineKey.Key. I'll leave this as is to avoid the churn when addressing that TODO.

Collaborator

@sumeerbhola sumeerbhola left a comment


:lgtm:

Reviewed 5 of 9 files at r1, 4 of 5 files at r3.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @arulajmani and @nvanbenschoten)


pkg/storage/lock_table_iterator_test.go line 580 at r3 (raw file):

// TestLockTableIteratorEquivalence is a quickcheck test that verifies that a
// LockTableIterator is equivalent to a raw engine iterator with non-matching
// lock keys filtered before writing to the engine.

nice

Collaborator

@arulajmani arulajmani left a comment


:lgtm:

Reviewed 9 of 9 files at r1, 1 of 1 files at r2, 5 of 5 files at r3, all commit messages.
Reviewable status: :shipit: complete! 2 of 0 LGTMs obtained (waiting on @nvanbenschoten and @sumeerbhola)


pkg/storage/lock_table_iterator.go line 332 at r1 (raw file):

// LockTableKeyVersion returns the strength and txn ID from the version of the
// current key.
func (i *LockTableIterator) LockTableKeyVersion() (lock.Strength, uuid.UUID, error) {

My IDE is telling me this isn't used anywhere -- is that because one of the subsequent commits makes use of this?


pkg/storage/lock_table_iterator_test.go line 160 at r1 (raw file):

}

// TestLockTableIterator is a datadriven test consisting of two commands:

nit: should we move this test to the top of the file?


pkg/storage/lock_table_iterator_test.go line 580 at r3 (raw file):

Previously, sumeerbhola wrote…

nice

+1

Member Author

@nvanbenschoten nvanbenschoten left a comment


TFTRs!

bors r=sumeerbhola,arulajmani

Reviewable status: :shipit: complete! 2 of 0 LGTMs obtained (waiting on @arulajmani and @sumeerbhola)


pkg/storage/lock_table_iterator.go line 332 at r1 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

My IDE is telling me this isn't used anywhere -- is that because one of the subsequent commits makes use of this?

Yes, a few of the subsequent PRs make use of this. It's not tripping up the linter, so I'll just leave it as is for now.


pkg/storage/lock_table_iterator_test.go line 160 at r1 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

nit: should we move this test to the top of the file?

I was following the example set by TestIntentInterleavingIter in intent_interleaving_iter_test.go. Since this is the only comment, I'll leave as is to avoid a round of CI.

@craig
Contributor

craig bot commented Sep 13, 2023

Build succeeded:

@craig craig bot merged commit 9006916 into cockroachdb:master Sep 13, 2023
@nvanbenschoten nvanbenschoten deleted the nvanbenschoten/lockTableIter branch September 13, 2023 21:24
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Sep 15, 2023
Informs cockroachdb#100193.

This commit addresses a TODO left by cockroachdb#110319 to implement an "iter
before seek" optimization in the LockTableIterator, similar to the one
that exists in the pebbleMVCCScanner. The optimization places an upper
bound on the number of iterations that a LockTableIterator that is
configured to ignore some or all shared locks will perform across the
shared locks on a single user key before seeking past them. This is used
to avoid iterating over all shared locks on a key when not necessary.

The optimization achieves the goal of avoiding cases of O(ignored_locks)
work in the LockTableIterator, instead performing at most
O(matching_locks + locked_keys) work. This is important for iteration
over the lock table (e.g. intentInterleavingIter), lock acquisition
(MVCCAcquireLock), and lock release (mvccReleaseLockInternal). There is
a caveat to these complexity bounds, however, in that they do not
consider LSM tombstones. This is being discussed in cockroachdb#110324.

Release note: None
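The iter-before-seek shape described in this commit message can be sketched as follows. The constant, types, and helper are illustrative only; the real implementation operates on engine keys and falls back to a SeekEngineKeyGE, and the actual step budget in CockroachDB may differ:

```go
package main

import "fmt"

// maxItersBeforeSeek bounds the number of cheap Next() steps spent on
// ignored shared locks on a single user key before falling back to one
// seek, the same shape as the pebbleMVCCScanner optimization mentioned
// above. The value here is illustrative.
const maxItersBeforeSeek = 5

type lockEntry struct {
	key    string
	shared bool
}

// skipIgnored advances past ignored shared locks starting at index i:
// it tries up to maxItersBeforeSeek cheap steps, then performs one
// "seek" straight past the current user key. It returns the new index
// and the number of iterator operations spent.
func skipIgnored(locks []lockEntry, i int) (int, int) {
	cur := locks[i].key
	ops := 0
	for n := 0; n < maxItersBeforeSeek; n++ {
		ops++
		i++
		if i >= len(locks) || locks[i].key != cur || !locks[i].shared {
			return i, ops
		}
	}
	ops++ // one seek past all remaining locks on the current key
	for i < len(locks) && locks[i].key == cur {
		i++
	}
	return i, ops
}

func main() {
	// 100 shared locks on "b": the step budget is exhausted quickly and
	// a single seek covers the rest, so total work stays O(1) per key.
	locks := []lockEntry{{"a", false}}
	for n := 0; n < 100; n++ {
		locks = append(locks, lockEntry{"b", true})
	}
	locks = append(locks, lockEntry{"c", false})
	next, ops := skipIgnored(locks, 1)
	fmt.Println(next, ops) // prints "101 6": lands on "c" after 6 ops, not 100
}
```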
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Sep 29, 2023
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Oct 2, 2023
craig bot pushed a commit that referenced this pull request Oct 2, 2023
110754: storage: implement iter-before-seek optimization for LockTableIterator r=nvanbenschoten a=nvanbenschoten

Informs #100193.

This commit addresses a TODO left by #110319 to implement an "iter before seek" optimization in the LockTableIterator, similar to the one that exists in the pebbleMVCCScanner. The optimization places an upper bound on the number of iterations that a LockTableIterator that is configured to ignore some or all shared locks will perform across the shared locks on a single user key before seeking past them. This is used to avoid iterating over all shared locks on a key when not necessary.

The optimization achieves the goal of avoiding cases of O(ignored_locks) work in the LockTableIterator, instead performing at most O(matching_locks + locked_keys) work. This is important for iteration over the lock table (e.g. intentInterleavingIter), lock acquisition (MVCCAcquireLock), and lock release (mvccReleaseLockInternal). There is a caveat to these complexity bounds, however, in that they do not consider LSM tombstones. This is being discussed in #110324.

Release note: None

111126: pkg/util/log: introduce new metrics to the logging package r=knz a=abarganier

See individual commit messages to review.

This PR introduces two new metrics to the logging package:
- `log.messages.count`
- `log.buffered.messages.dropped`

**log.messages.count**: This metric measures the count of messages logged on the node since startup. Note that this does not measure the fan-out of single log messages to the various configured logging sinks. This metric can be helpful in understanding log rates and volumes.

**log.buffered.messages.dropped**: Buffered network logging sinks have a `max-buffer-size` attribute, which determines, in bytes, how many log messages can be buffered. Any `fluent-server` or `http-server` log sink that makes use of a `buffering` attribute in its configuration (enabled by default) qualifies as a buffered network logging sink. If this buffer becomes full and an additional log message is sent to the buffered log sink, the buffer would exceed this `max-buffer-size`. Therefore, the buffered log sink drops the oldest messages in the buffer to make room for the new ones.

This PR also renames the metric `fluent.sink.conn.errors` to `log.fluent.sink.conn.errors`, for consistency.

Fixes: #72453

----

Release note (ops change): This patch sets the Metric Type
on the metric `log.fluent.sink.conn.errors`. Previously, the
Metric Type was incorrectly left unset.

Note that this is simply an update to the metric's metadata.
The behavior and purpose of the metric remains unchanged.

----

Release note (ops change): This patch introduces the metric,
`log.messages.count`.

This metric measures the count of messages logged on the
node since startup. Note that this does not measure the
fan-out of single log messages to the various configured
logging sinks.

This metric can be helpful in understanding log rates and
volumes.

----

Release note (ops change): This patch introduces a new metric,
`log.buffered.messages.dropped`.

Buffered network logging sinks have a `max-buffer-size` attribute,
which determines, in bytes, how many log messages can be buffered.
Any `fluent-server` or `http-server` log sink that makes use of
a `buffering` attribute in its configuration (enabled by default)
qualifies as a buffered network logging sink.

If this buffer becomes full, and an additional log message is sent
to the buffered log sink, the buffer would exceed this
`max-buffer-size`. Therefore, the buffered log sink drops the oldest
messages in the buffer to make room for the new ones.

`log.buffered.messages.dropped` counts the number of messages
dropped from the buffer. Note that the count is shared across all
buffered logging sinks.
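The drop-oldest behavior described in this note can be sketched as follows. This is a toy model: the type and field names are made up, and the real sink buffers formatted log entries rather than plain strings:

```go
package main

import "fmt"

// boundedBuffer sketches the drop-oldest behavior: when appending would
// exceed the byte limit, the oldest messages are evicted and a counter
// (the analogue of log.buffered.messages.dropped) records how many
// were lost.
type boundedBuffer struct {
	maxBytes int
	msgs     []string
	bytes    int
	dropped  int
}

// append adds msg, evicting from the front until it fits.
func (b *boundedBuffer) append(msg string) {
	for b.bytes+len(msg) > b.maxBytes && len(b.msgs) > 0 {
		b.bytes -= len(b.msgs[0])
		b.msgs = b.msgs[1:]
		b.dropped++
	}
	b.msgs = append(b.msgs, msg)
	b.bytes += len(msg)
}

func main() {
	b := &boundedBuffer{maxBytes: 10}
	for _, m := range []string{"aaaa", "bbbb", "cccc"} {
		b.append(m)
	}
	// "aaaa" was evicted to make room for "cccc".
	fmt.Println(b.msgs, b.dropped) // prints "[bbbb cccc] 1"
}
```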

111603: server: always START SERVICE SHARED in testserver.StartSharedProcessTenant() r=knz a=msbutler

Previously, StartSharedProcessTenant() would hang if it were run on a tenant that was created by a replication stream. This patch fixes this bug by ensuring `ALTER TENANT $1 START SERVICE SHARED` is run even if the tenant was already created.

Epic: none

Release note: None

Co-authored-by: Nathan VanBenschoten <[email protected]>
Co-authored-by: Alex Barganier <[email protected]>
Co-authored-by: Michael Butler <[email protected]>
THardy98 pushed a commit to THardy98/cockroach that referenced this pull request Oct 6, 2023
Successfully merging this pull request may close these issues.

storage: ignore shared and exclusive locks in intentInterleavingIterator
4 participants