
storage: introduce LockTableIterator #110319

Merged
merged 3 commits
Sep 13, 2023

Conversation

nvanbenschoten
Member

Informs #100193.
Closes #109644.
Closes #109570.

This PR introduces a new form of iterator, the LockTableIterator, which is an EngineIterator that iterates over locks in the lock table keyspace. It performs no translation of input or output keys or values, so it is used like a normal EngineIterator, with the limitation that it can only be used to iterate over the lock table keyspace.

The benefit of using a LockTableIterator is that it performs filtering of the locks in the lock table, only returning locks that match the configured filtering criteria and transparently skipping past locks that do not. The filtering criteria are expressed as a logical disjunction of two configuration parameters, at least one of which must be set:

  • MatchTxnID: if set, the iterator returns locks held by this transaction.

  • MatchMinStr: if set, the iterator returns locks held by any transaction with this strength or stronger.

Expressed abstractly as a SQL query, the filtering criteria are:

SELECT * FROM lock_table WHERE (MatchTxnID  != 0 AND txn_id = MatchTxnID)
                            OR (MatchMinStr != 0 AND strength >= MatchMinStr)
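The disjunction above can be sketched in Go. This is a minimal illustration with simplified types — the real iterator uses `lock.Strength` and `uuid.UUID`, and all names below are illustrative, not CockroachDB code:

```go
package main

import "fmt"

// Strength mirrors the ordering of lock strengths: higher is stronger.
type Strength int

const (
	None Strength = iota
	Shared
	Exclusive
	Intent
)

// LockTableIteratorOptions mirrors the two filter parameters described
// above. The zero value of each field means "unset".
type LockTableIteratorOptions struct {
	MatchTxnID  string   // return locks held by this transaction
	MatchMinStr Strength // return locks with this strength or stronger
}

// matches implements the disjunction from the SQL sketch: a lock is
// returned if either configured predicate accepts it.
func (o LockTableIteratorOptions) matches(txnID string, str Strength) bool {
	return (o.MatchTxnID != "" && txnID == o.MatchTxnID) ||
		(o.MatchMinStr != None && str >= o.MatchMinStr)
}

func main() {
	opts := LockTableIteratorOptions{MatchMinStr: Exclusive}
	fmt.Println(opts.matches("txn1", Shared)) // false: Shared is too weak
	fmt.Println(opts.matches("txn1", Intent)) // true: Intent >= Exclusive
}
```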

Pushing this filtering logic into the iterator is a convenience for its users. It also allows the iterator to use its knowledge of the lock table keyspace structure to efficiently skip past locks that do not match the filtering criteria. It does this by seeking past many ignored locks when appropriate to avoid cases of O(ignored_locks) work, instead performing at most O(matching_locks + locked_keys) work.

A common case where this matters is with shared locks. If the iterator is configured to ignore shared locks, a single key with a large number of shared locks can be skipped over with a single seek. Avoiding this unnecessary work is essential to avoiding quadratic behavior during shared lock acquisition and release.
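As a rough sketch of why a single seek helps here — assuming, as in the lock table encoding, that all locks on a user key are stored adjacently; the types and helper below are hypothetical, not CockroachDB code:

```go
package main

import "fmt"

// entry is a toy lock table entry; in the real keyspace, all locks on a
// user key sort adjacently.
type entry struct {
	key    string
	shared bool
}

// nextUserKeyIndex models a single seek: jump straight past every
// remaining lock on the current user key. A real iterator would issue
// one SeekEngineKeyGE to the next key prefix instead of stepping.
func nextUserKeyIndex(locks []entry, i int) int {
	j := i
	for j < len(locks) && locks[j].key == locks[i].key {
		j++
	}
	return j
}

func main() {
	locks := []entry{
		{"a", false},
		{"b", true}, {"b", true}, {"b", true}, {"b", true},
		{"c", false},
	}
	// Iterate, spending one "seek" per key full of ignored shared locks
	// rather than one step per ignored lock.
	steps := 0
	for i := 0; i < len(locks); {
		steps++
		if locks[i].shared {
			i = nextUserKeyIndex(locks, i) // one seek skips all shared locks on "b"
			continue
		}
		fmt.Println(locks[i].key)
		i++
	}
	fmt.Println("steps:", steps) // 3 steps, not 6
}
```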

The LockTableIterator will be used by intentInterleavingIter, MVCCCheckLock, MVCCAcquireLock, mvccPutUsingIter, MVCCResolveWriteIntent, MVCCResolveWriteIntentRange, computeMinIntentTimestamp, SeparatedIntentScanner, and more.


The PR introduces the abstraction of a LockTableIterator, but doesn't yet implement the logic to avoid O(ignored_locks) work in the iterator. This is left as a TODO which will be addressed in the next few weeks, but doesn't need to block this PR or use of the LockTableIterator.

The PR then copies over the 2 commits from #109570, which ignore non-intent locks in intentInterleavingIter. This could have been done in a separate PR, but it felt useful to demonstrate how much simpler and cleaner #109644 gets with this new abstraction. I can extract it back out if reviewers would like.


Release note: None


All intents are now separated.

Epic: None
Release note: None
Informs cockroachdb#100193.

This commit updates the intentInterleavingIter to ignore locks
in the lock table keyspace with strengths other than lock.Intent
(i.e. shared and exclusive locks). Future versions of the
iterator may expose information to users about whether any
non-intent locks were observed and, if so, which keys they were
found on. For now, no such information is exposed.

This change is needed for replicated locks. Non-locking reads
will not interact with these locks at all, so we want the
intentInterleavingIter to ignore them. Meanwhile, locking reads
and writes will use a different abstraction to peek into the
lock table and look for conflicts.

Release note: None

Collaborator

@sumeerbhola sumeerbhola left a comment


I didn't look at the tests, but the interface and integration with intentInterleavingIter look good.

Reviewed 3 of 9 files at r1, 1 of 1 files at r2, 1 of 5 files at r3, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @arulajmani and @nvanbenschoten)


pkg/storage/lock_table_iterator.go line 61 at r1 (raw file):

// and release.
type LockTableIterator struct {
	iter EngineIterator

There was some commentary in intentInterleavingIter about avoiding dynamic dispatch and using *pebbleIterator. Consider doing that here.


pkg/storage/lock_table_iterator.go line 139 at r1 (raw file):

// SeekEngineKeyGE implements the EngineIterator interface.
func (i *LockTableIterator) SeekEngineKeyGE(key EngineKey) (valid bool, err error) {
	if err := checkLockTableKey(key.Key); err != nil {

Is checkLockTableKey cheap enough, or should it be hidden behind invariants.Enabled?


pkg/storage/lock_table_iterator.go line 252 at r1 (raw file):

) (state pebble.IterValidityState, err error) {
	for {
		engineKey, err := i.iter.UnsafeEngineKey()

nit: can this use i.LockTableKeyVersion()

Member Author

@nvanbenschoten nvanbenschoten left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @arulajmani and @sumeerbhola)


pkg/storage/lock_table_iterator.go line 61 at r1 (raw file):

Previously, sumeerbhola wrote…

There was some commentary in intentInterleavingIter about avoiding dynamic dispatch and using *pebbleIterator. Consider doing that here.

I hoped to do that, but unfortunately, this won't always be a *pebbleIterator. There will be cases where it's a spanset.EngineIterator, when constructed from a spanset.spanSetReader. We don't have this problem in newIntentInterleavingIterator because iterators are constructed from a Reader below spanset.spanSetReader in the Reader.NewMVCCIterator hierarchy.

Dynamic dispatch on the intentIter (as opposed to the mvcc iter) doesn't seem to make much of a difference on the benchmarks we care about, though:

➜ benchstat full_static.txt full_dynamic.txt
name                                                                old time/op   new time/op   delta
MVCCScan_Pebble/rows=100/versions=1/valueSize=64/numRangeKeys=0-10   25.8µs ± 3%   26.2µs ± 3%  +1.72%  (p=0.000 n=49+44)

name                                                                old speed     new speed     delta
MVCCScan_Pebble/rows=100/versions=1/valueSize=64/numRangeKeys=0-10  248MB/s ± 3%  244MB/s ± 3%  -1.76%  (p=0.000 n=49+45)

➜ benchstat full_static.txt static_mvcc_dynamic_intent.txt
name                                                                old time/op   new time/op   delta
MVCCScan_Pebble/rows=100/versions=1/valueSize=64/numRangeKeys=0-10   25.8µs ± 3%   25.8µs ± 2%   ~     (p=0.163 n=49+45)

name                                                                old speed     new speed     delta
MVCCScan_Pebble/rows=100/versions=1/valueSize=64/numRangeKeys=0-10  248MB/s ± 3%  248MB/s ± 2%   ~     (p=0.163 n=49+45)

That's because these benchmarks we were concerned about when making this optimization don't even write any intents. However, I also don't see a difference with IntentInterleavingSeekGEAndIter/separated=true/version=1/intentStride=1/keyLen=10/seekStride=1, which should exercise the intentIter more than any other benchmark we have. So I'll leave this as is.


pkg/storage/lock_table_iterator.go line 139 at r1 (raw file):

Previously, sumeerbhola wrote…

Is checkLockTableKey cheap enough, or should be hidden behind invariants.Enabled?

It's inlined all the way down through the call to bytes.HasPrefix and the underlying bytes.Equal of two bytes, so it's just a few branches that should play well with a branch predictor.


pkg/storage/lock_table_iterator.go line 252 at r1 (raw file):

Previously, sumeerbhola wrote…

nit: can this use i.LockTableKeyVersion()

It can right now, but to address the TODO below, I'll need access to the EngineKey.Key. I'll leave this as is to avoid the churn when addressing that TODO.

Collaborator

@sumeerbhola sumeerbhola left a comment


:lgtm:

Reviewed 5 of 9 files at r1, 4 of 5 files at r3.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @arulajmani and @nvanbenschoten)


pkg/storage/lock_table_iterator_test.go line 580 at r3 (raw file):

// TestLockTableIteratorEquivalence is a quickcheck test that verifies that a
// LockTableIterator is equivalent to a raw engine iterator with non-matching
// lock keys filtered before writing to the engine.

nice

Collaborator

@arulajmani arulajmani left a comment


:lgtm:

Reviewed 9 of 9 files at r1, 1 of 1 files at r2, 5 of 5 files at r3, all commit messages.
Reviewable status: :shipit: complete! 2 of 0 LGTMs obtained (waiting on @nvanbenschoten and @sumeerbhola)


pkg/storage/lock_table_iterator.go line 332 at r1 (raw file):

// LockTableKeyVersion returns the strength and txn ID from the version of the
// current key.
func (i *LockTableIterator) LockTableKeyVersion() (lock.Strength, uuid.UUID, error) {

My IDE is telling me this isn't used anywhere -- is that because one of the subsequent commits makes use of this?


pkg/storage/lock_table_iterator_test.go line 160 at r1 (raw file):

}

// TestLockTableIterator is a datadriven test consisting of two commands:

nit: should we move this test to the top of the file?


pkg/storage/lock_table_iterator_test.go line 580 at r3 (raw file):

Previously, sumeerbhola wrote…

nice

+1

Member Author

@nvanbenschoten nvanbenschoten left a comment


TFTRs!

bors r=sumeerbhola,arulajmani

Reviewable status: :shipit: complete! 2 of 0 LGTMs obtained (waiting on @arulajmani and @sumeerbhola)


pkg/storage/lock_table_iterator.go line 332 at r1 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

My IDE is telling me this isn't used anywhere -- is that because one of the subsequent commits makes use of this?

Yes, a few of the subsequent PRs make use of this. It's not tripping up the linter, so I'll just leave it as is for now.


pkg/storage/lock_table_iterator_test.go line 160 at r1 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

nit: should we move this test to the top of the file?

I was following the example set by TestIntentInterleavingIter in intent_interleaving_iter_test.go. Since this is the only comment, I'll leave as is to avoid a round of CI.

@craig
Contributor

craig bot commented Sep 13, 2023

Build succeeded:

@craig craig bot merged commit 9006916 into cockroachdb:master Sep 13, 2023
@nvanbenschoten nvanbenschoten deleted the nvanbenschoten/lockTableIter branch September 13, 2023 21:24
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Sep 15, 2023
Informs cockroachdb#100193.

This commit addresses a TODO left by cockroachdb#110319 to implement an "iter
before seek" optimization in the LockTableIterator, similar to the one
that exists in the pebbleMVCCScanner. The optimization places an upper
bound on the number of iterations that a LockTableIterator that is
configured to ignore some or all shared locks will perform across the
shared locks on a single user key before seeking past them. This is used
to avoid iterating over all shared locks on a key when not necessary.

The optimization achieves the goal of avoiding cases of O(ignored_locks)
work in the LockTableIterator, instead performing at most
O(matching_locks + locked_keys) work. This is important for iteration
over the lock table (e.g. intentInterleavingIter), lock acquisition
(MVCCAcquireLock), and lock release (mvccReleaseLockInternal). There is
a caveat to these complexity bounds, however, in that they do not
consider LSM tombstones. This is being discussed in cockroachdb#110324.

Release note: None
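The iter-before-seek shape described in this commit message can be sketched as follows. The constant, types, and helper are illustrative only; the real implementation operates on engine keys and falls back to a SeekEngineKeyGE, and the actual step budget in CockroachDB may differ:

```go
package main

import "fmt"

// maxItersBeforeSeek bounds the number of cheap Next() steps spent on
// ignored shared locks on a single user key before falling back to one
// seek, the same shape as the pebbleMVCCScanner optimization mentioned
// above. The value here is illustrative.
const maxItersBeforeSeek = 5

type lockEntry struct {
	key    string
	shared bool
}

// skipIgnored advances past ignored shared locks starting at index i:
// it tries up to maxItersBeforeSeek cheap steps, then performs one
// "seek" straight past the current user key. It returns the new index
// and the number of iterator operations spent.
func skipIgnored(locks []lockEntry, i int) (int, int) {
	cur := locks[i].key
	ops := 0
	for n := 0; n < maxItersBeforeSeek; n++ {
		ops++
		i++
		if i >= len(locks) || locks[i].key != cur || !locks[i].shared {
			return i, ops
		}
	}
	ops++ // one seek past all remaining locks on the current key
	for i < len(locks) && locks[i].key == cur {
		i++
	}
	return i, ops
}

func main() {
	// 100 shared locks on "b": the step budget is exhausted quickly and
	// a single seek covers the rest, so total work stays O(1) per key.
	locks := []lockEntry{{"a", false}}
	for n := 0; n < 100; n++ {
		locks = append(locks, lockEntry{"b", true})
	}
	locks = append(locks, lockEntry{"c", false})
	next, ops := skipIgnored(locks, 1)
	fmt.Println(next, ops) // prints "101 6": lands on "c" after 6 ops, not 100
}
```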
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Sep 29, 2023
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Oct 2, 2023
craig bot pushed a commit that referenced this pull request Oct 2, 2023
110754: storage: implement iter-before-seek optimization for LockTableIterator r=nvanbenschoten a=nvanbenschoten

Informs #100193.

This commit addresses a TODO left by #110319 to implement an "iter before seek" optimization in the LockTableIterator, similar to the one that exists in the pebbleMVCCScanner. The optimization places an upper bound on the number of iterations that a LockTableIterator that is configured to ignore some or all shared locks will perform across the shared locks on a single user key before seeking past them. This is used to avoid iterating over all shared locks on a key when not necessary.

The optimization achieves the goal of avoiding cases of O(ignored_locks) work in the LockTableIterator, instead performing at most O(matching_locks + locked_keys) work. This is important for iteration over the lock table (e.g. intentInterleavingIter), lock acquisition (MVCCAcquireLock), and lock release (mvccReleaseLockInternal). There is a caveat to these complexity bounds, however, in that they do not consider LSM tombstones. This is being discussed in #110324.

Release note: None

111126: pkg/util/log: introduce new metrics to the logging package r=knz a=abarganier

See individual commit messages to review.

This PR introduces two new metrics to the logging package:
- `log.messages.count`
- `log.buffered.messages.dropped`

**log.messages.count**: This metric measures the count of messages logged on the node since startup. Note that this does not measure the fan-out of single log messages to the various configured logging sinks. This metric can be helpful in understanding log rates and volumes.

**log.buffered.messages.dropped**: Buffered network logging sinks have a `max-buffer-size` attribute, which determines, in bytes, how many log messages can be buffered. Any `fluent-server` or `http-server` log sink that makes use of a `buffering` attribute in its configuration (enabled by default) qualifies as a buffered network logging sink. If this buffer becomes full and an additional log message is sent to the buffered log sink, the buffer would exceed this `max-buffer-size`. Therefore, the buffered log sink drops the oldest messages in the buffer to make room for the new ones.

This PR also renames the metric `fluent.sink.conn.errors` to `log.fluent.sink.conn.errors`, for consistency.

Fixes: #72453

----

Release note (ops change): This patch sets the Metric Type
on the metric `log.fluent.sink.conn.errors`. Previously, the
Metric Type was incorrectly left unset.

Note that this is simply an update to the metric's metadata.
The behavior and purpose of the metric remains unchanged.

----

Release note (ops change): This patch introduces the metric,
`log.messages.count`.

This metric measures the count of messages logged on the
node since startup. Note that this does not measure the
fan-out of single log messages to the various configured
logging sinks.

This metric can be helpful in understanding log rates and
volumes.

----

Release note (ops change): This patch introduces a new metric,
`log.buffered.messages.dropped`.

Buffered network logging sinks have a `max-buffer-size` attribute,
which determines, in bytes, how many log messages can be buffered.
Any `fluent-server` or `http-server` log sink that makes use of
a `buffering` attribute in its configuration (enabled by default)
qualifies as a buffered network logging sink.

If this buffer becomes full, and an additional log message is sent
to the buffered log sink, the buffer would exceed this
`max-buffer-size`. Therefore, the buffered log sink drops the oldest
messages in the buffer to make room for the new ones.

`log.buffered.messages.dropped` counts the number of messages
dropped from the buffer. Note that the count is shared across all
buffered logging sinks.
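The drop-oldest behavior described in this note can be sketched as follows. This is a toy model: the type and field names are made up, and the real sink buffers formatted log entries rather than plain strings:

```go
package main

import "fmt"

// boundedBuffer sketches the drop-oldest behavior: when appending would
// exceed the byte limit, the oldest messages are evicted and a counter
// (the analogue of log.buffered.messages.dropped) records how many
// were lost.
type boundedBuffer struct {
	maxBytes int
	msgs     []string
	bytes    int
	dropped  int
}

// append adds msg, evicting from the front until it fits.
func (b *boundedBuffer) append(msg string) {
	for b.bytes+len(msg) > b.maxBytes && len(b.msgs) > 0 {
		b.bytes -= len(b.msgs[0])
		b.msgs = b.msgs[1:]
		b.dropped++
	}
	b.msgs = append(b.msgs, msg)
	b.bytes += len(msg)
}

func main() {
	b := &boundedBuffer{maxBytes: 10}
	for _, m := range []string{"aaaa", "bbbb", "cccc"} {
		b.append(m)
	}
	// "aaaa" was evicted to make room for "cccc".
	fmt.Println(b.msgs, b.dropped) // prints "[bbbb cccc] 1"
}
```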

111603: server: always START SERVICE SHARED in testserver.StartSharedProcessTenant() r=knz a=msbutler

Previously, StartSharedProcessTenant() would hang if it were run on a tenant that was created by a replication stream. This patch fixes this bug by ensuring `ALTER TENANT $1 START SERVICE SHARED` is run even if the tenant was already created.

Epic: none

Release note: None

Co-authored-by: Nathan VanBenschoten <[email protected]>
Co-authored-by: Alex Barganier <[email protected]>
Co-authored-by: Michael Butler <[email protected]>
THardy98 pushed a commit to THardy98/cockroach that referenced this pull request Oct 6, 2023
Successfully merging this pull request may close these issues.

storage: ignore shared and exclusive locks in intentInterleavingIterator
4 participants