kv: support FOR {UPDATE,SHARE} SKIP LOCKED #79134
Conversation
(Force-pushed from a35ea8c to 1da4050.)
Aren't we scared of inconsistent data sources? I thought throughout the SQL code we assume that data is internally consistent. Couldn't there be undefined behavior if parts of a request's execution aren't internally consistent? I don't have any depth to that question; I'm simply vaguely worried, since this is the first instance in which the SQL layer is deliberately tasked to handle data that may be seeing only "half" of a committed transaction (for example, it could be seeing a PK record, but not its corresponding entry on a secondary index). Maybe this is all fine, but I would be interested in hearing the rationale and how this fits into our consistency model.
This is great to see, thank you for doing this @nvanbenschoten! I have a similar concern to Tobi. For example, the optimizer uses information about foreign key presence to make decisions. What would happen if the optimizer assumes the presence of a foreign key, but the referenced row is un-scannable due to SKIP LOCKED? Or, similarly, for secondary indexes? I did a quick test and saw some surprising behavior (maybe a good surprise?). Window 1:
Window 2:
In window 2, the last command hangs until I abort the first txn. This is surprising! Since it's marked as SKIP LOCKED, I expected it to skip the locked row, which could have returned an inconsistent result. But, this didn't happen, which is great. Do you have special logic to prevent this case?
Yeah, @tbg is completely right. Nice catch!
I don't think this is by design, unfortunately. I suspect it's because we still need to land #60719 to make sure that the SkipLocked wait-policy makes its way to the index join's KV requests. I had forgotten about that PR, but it's probably what we should push on first.
I took a look at Postgres' codebase and did not find any reference to SkipLocked in their query optimizer. However, I was also unable to find a query whose query plan relied on consistency across multiple tables. For instance, I was not able to generate a query that pushed a limit through a join thanks to a validated foreign key. @andy-kimball speculates that PG might not do this kind of thing. This makes me wonder how many of these cases actually exist in CRDB. The only ones I can think of are:

- limit pushdown across index joins
- limit pushdown across lookup joins on validated foreign key references
One observation here is that the former is really a specialization of the latter. I don't think the optimizer treats index joins that way, but it's interesting to consider index joins as a form of lookup join across a foreign key reference. I think I recall some discussion about other cost-based optimizers treating each index as a separate table. The other observation is that it may not be very hard to track down the few optimization rules that rely on cross-row consistency and disable them when SKIP LOCKED is in use.
@jordanlewis I cherry-picked #60719 on to this PR and tested out your example. I saw what I think we expect to see if we don't do anything about query plan optimizations that depend on consistency:

```sql
-- terminal 1
root@127.0.0.1:26257/movr OPEN> update a set b=10 where a=1 returning *;
  a | b  | c
----+----+----
  1 | 10 |  3
(1 row)

-- terminal 2
root@:26257/movr OPEN> explain select * from a@a_c_idx limit 1 for update skip locked;
                   info
------------------------------------------
  distribution: local
  vectorized: true

  • index join
  │ table: a@a_pkey
  │ locking strength: for update
  │ locking wait policy: skip locked
  │
  └── • scan
        missing stats
        table: a@a_c_idx
        spans: LIMITED SCAN
        limit: 1
        locking strength: for update
        locking wait policy: skip locked
(15 rows)

root@:26257/movr OPEN> select * from a@a_c_idx limit 1 for update skip locked;
  a | b | c
----+---+----
(0 rows)
```

The index join doesn't crash or anything with the inconsistent state, it just behaves like an inner join, like we want it to. So the problem is really just that the limit pushdown through the join is incorrect. I'm still struggling to think of any optimizations beyond limit pushdown that are incompatible with inconsistent scans.
Nice, yeah, maybe there is just a finite number of places to check in the optimizer. On the flip side, I'm also concerned that some places in the execution engine would expect there to always be a row on the other side of a foreign key check or index join. Maybe @yuzefovich would have an idea here.
I'd be somewhat surprised if there were assumptions about foreign keys. I don't see any mention of FK validity in the execution engine, which would be needed if the execution engine was making assumptions. I'd be less surprised if there were assumptions about index joins because they are always implicitly valid. However, I don't see anything (though it's less clear what to search for) and index join seems to behave like an inner join, which is what we want. Certainly best to have @yuzefovich weigh in.
From the optimizer's perspective, could we consider … In other words, if every table had a virtual …
I don't think we make any such assumptions in the execution engine. We do take advantage of the fact that index joins have a 1-to-1 match with their input rows (from the secondary indexes) for the purposes of "estimated row count" / limits / limit hints, but even if those estimates were to turn out wrong, the execution would proceed to fetch more. The index join won't return enough rows only if its input (the scan) short-circuits itself, which is the case when the LIMIT is pushed through the index join (I can't think of anything else). So to me it sounds like Nathan's idea of somehow disabling some rules in the optimizer should work and be sufficient for correctness.
This is to prevent limit pushdown across the index join, right? Would we only do this if the PK was needed in the first place? We wouldn't want to disallow index-only scans with SKIP LOCKED, as that's a very common usage pattern.
Thanks for confirming @yuzefovich!
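The limit-pushdown hazard discussed above can be reduced to a toy model (all names and structures here are illustrative, not CockroachDB code): pushing a LIMIT below an index join lets the secondary-index scan stop before it knows whether the corresponding primary-key rows will be skipped as locked.

```python
# Toy model of the hazard: an index join from a secondary index into the
# primary key under SKIP LOCKED. Illustrative only, not CRDB code.

def index_join(secondary_pks, primary, locked_pks):
    """Join secondary-index entries to primary rows, skipping locked PK rows.

    This mirrors the observed behavior: under SKIP LOCKED the index join acts
    like an inner join, silently dropping entries whose PK row is locked."""
    return [primary[pk] for pk in secondary_pks
            if pk in primary and pk not in locked_pks]

secondary = [1, 2]                        # entries in the secondary index
primary = {1: (1, 10, 3), 2: (2, 20, 4)}  # pk -> full row
locked = {1}                              # another txn holds a lock on pk=1

# Correct plan: apply LIMIT 1 *above* the index join.
assert index_join(secondary, primary, locked)[:1] == [(2, 20, 4)]

# Incorrect plan: push LIMIT 1 *below* the join into the secondary-index scan.
# The scan returns only pk=1, the join skips it, and the query returns 0 rows
# even though an unlocked row exists.
assert index_join(secondary[:1], primary, locked) == []
```

This is exactly the `(0 rows)` result in the EXPLAIN example above: the plan is wrong only because the limit was pushed through the join, not because the join mishandles the inconsistent state.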
Column families pose another challenge here. When a row is split across multiple column families (i.e. multiple KV keys), each of its corresponding keys may or may not be locked at a given point in time. This makes SKIP LOCKED complex, because without any special handling, it may find a row that is "half locked" and return only the unlocked half of it. That would trick SQL into thinking the other columns in the locked half were NULL, even if this directly contradicted a NOT NULL constraint. Or worse, if this did not contradict a NOT NULL constraint, then SQL would be unable to distinguish a locked column family from a NULL column family. In either case, the "correct" behavior would be to not return any part of the row. We have a few options for how to address this:
Given the complexity involved with either of the first two options, the relative rarity of column families, and the limited cases where we expect SKIP LOCKED to be used, I'm leaning towards the third option (disallowing SKIP LOCKED on tables with multiple column families). I'm curious what others think.
I think #3 sounds fine. AFAICT SKIP LOCKED is a fairly rare and niche feature. I don't think we should spend time making it completely robust and general-purpose before validating that we need to. Let's release the simpler version limited to single-column-family tables and seek feedback.
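The half-locked-row hazard can be illustrated with a toy model (the key naming and structures are made up for illustration, not the actual KV encoding): skipping locked keys one at a time surfaces a partial row whose missing columns look like NULLs, while the simple restriction discussed above rejects multi-column-family rows up front.

```python
# Toy model of a row split across two column families (two KV keys).
# Key naming and structures are hypothetical.

def scan_skip_locked(kvs, locked_keys):
    # Naive per-key skipping: drop any KV whose key is locked.
    return {k: v for k, v in kvs.items() if k not in locked_keys}

row_kvs = {"r1/cf0": {"a": 1, "b": 10}, "r1/cf1": {"c": 3}}
locked = {"r1/cf1"}                      # only half of the row is locked

visible = scan_skip_locked(row_kvs, locked)
decoded = {}
for cols in visible.values():
    decoded.update(cols)
# Column "c" silently disappears: indistinguishable from a NULL value, even
# if the schema says it is NOT NULL.
assert decoded == {"a": 1, "b": 10}

def check_single_cf(kvs):
    # The simple restriction: reject rows spanning multiple column families.
    families = {k.split("/")[1] for k in kvs}
    if len(families) > 1:
        raise ValueError("SKIP LOCKED unsupported with multiple column families")
```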
(Force-pushed from 1da4050 to 29198b9.)
(Force-pushed from 29198b9 to 74357ec.)
@rytaft pointed out that the `IgnoreForeignKeys` flag could be set on all `AliasedTableExpr`s to skip optimizations using FK references. And I think we could add a similar flag to skip optimizations using self-join references. Let me see how hard that would be. I think we probably just have to plumb a flag into `checkSelfJoinCase`.
The KV part looks good to me, I only have a couple of minor comments. Also, this was a really fun read 💯
Reviewed 1 of 8 files at r1, 50 of 50 files at r7, 1 of 1 files at r8, 4 of 4 files at r9, 36 of 36 files at r10, 3 of 3 files at r11.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @arulajmani and @nvanbenschoten)
pkg/kv/kvserver/replica_gossip.go
line 185 at r10 (raw file):
br, result, pErr := evaluateBatch(ctx, kvserverbase.CmdIDKey(""), rw, rec, nil, &ba, nil, nil /* st */, uncertainty.Interval{}, true /* readOnly */)
nit: nil /* g */
pkg/kv/kvserver/concurrency/concurrency_control.go
line 776 at r10 (raw file):
// the SkipLocked wait policy to determine which keys they should skip over
// during evaluation.
IsKeyLocked(roachpb.Key) (bool, *enginepb.TxnMeta)
Instead of defining this function here, could (should?) we embed the `storage.LockTableView` interface here instead?
pkg/storage/mvcc.go
line 687 at r10 (raw file):
// conjunction with the SkipLocked wait policy to determine which keys they
// should skip over during evaluation.
IsKeyLocked(roachpb.Key) (bool, *enginepb.TxnMeta)
nit: s/IsKeyLocked/IsKeyLockedByConflictingTxn/?
pkg/storage/pebble_mvcc_scanner.go
line 342 at r9 (raw file):
meta enginepb.MVCCMetadata
// Bools copied over from MVCC{Scan,Get}Options. See the comment on the
// package level MVCCScan for what these mean.
nit: should we extend the comment to include some words about `skipLocked` as well?
pkg/storage/pebble_mvcc_scanner.go
line 687 at r9 (raw file):
// 8. The scanner has been configured with the skipLocked option. Ignore
// intents written by other transactions and seek to the next key.
// However, we return the intent separately if we have room; the caller
I was looking at some of your test cases and found it a bit confusing that we don't check for `FailOnMoreRecent` and return a `WriteIntentError` if it's set here. I think I've convinced myself that this is fine, but do you think it's worth adding a few words about this here?
pkg/storage/pebble_mvcc_scanner.go
line 706 at r9 (raw file):
// into case 11 below.
if p.checkUncertainty {
	// The intent's provisional value may be within the uncertainty window. Or
nit: indentation made this comment block go over 80 chars
pkg/kv/kvserver/concurrency/testdata/lock_table/skip_locked
line 87 at r10 (raw file):
# ---------------------------------------------------------------------------------
# req2 will scan the lock table with a Skip wait policy. It will not need to wait.
s/req2/req3
pkg/kv/kvserver/concurrency/testdata/lock_table/skip_locked
line 111 at r10 (raw file):
locked: true, holder: 00000000-0000-0000-0000-000000000001
is-key-locked r=req3 k=c
Similar to my comment elsewhere, changing these things to `is-key-locked-by-conflicting-txn` might help a bit in terms of readability.
pkg/storage/testdata/mvcc_histories/skip_locked
line 400 at r9 (raw file):
get ts=15 k=k4 skipLocked
----
get: "k4" -> intent {id=00000000 key=/Min pri=0.00000000 epo=0 ts=16.000000000,0 min=0,0 seq=0}
nit: do you think it's worth adding a comment explaining the flow of this particular test case? IMO It's kinda interesting!
pkg/kv/kvserver/replica_test.go
line 2989 at r11 (raw file):
blockWriters.Store(false)
go func() {
	_, pErr := tc.Sender().Send(ctx, baRead)
Should we also validate that the response only returned keys `a` and `b` as well?
Skimmed through the code, nice work! IIUC only the part about disabling some rules in the optimizer is not implemented yet, right?
Reviewed 1 of 8 files at r1, 50 of 50 files at r7, 1 of 1 files at r8, 2 of 4 files at r9, 24 of 36 files at r10, 2 of 3 files at r11, 6 of 6 files at r12, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)
-- commits
line 4 at r8:
nit: the commits are missing release notes.
pkg/kv/kvserver/replica_read.go
line 378 at r11 (raw file):
// SkipLocked wait policy.
if err := roachpb.ResponseKeyIterate(req, resp, func(key roachpb.Key) {
	getAlloc := new(struct {
nit: this seems like an opportunity for some pre-allocation. IIUC this anonymous function is called for each key returned across all `ba.Requests`, and some of the requests can get an empty response, some can get a single key, some can get multiple keys, so we don't know upfront exactly how many "gets" we need to allocate (unless we iterate over all responses upfront first to calculate that number). But we probably can allocate this in small "batches", say of 16 structs. Maybe leave a TODO?
pkg/roachpb/batch.go
line 563 at r10 (raw file):
// If ScanFormat == BATCH_RESPONSE.
if err := enginepb.ScanDecodeKeyValues(v.BatchResponses, func(key []byte, _ hlc.Timestamp, _ []byte) error {
	fn(Key(key).Clone())
Why do we make a copy of the key here but not for GetResponses nor for ScanResponses with KEY_VALUES format?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 8 files at r1, 50 of 50 files at r7, 1 of 1 files at r8.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @arulajmani and @nvanbenschoten)
pkg/storage/mvcc.go
line 681 at r9 (raw file):
// table (i.e. unreplicated locks) and locks only stored in the persistent lock
// table keyspace (i.e. replicated locks that have yet to be "discovered").
type LockTableView interface {
small nit: could even call this (single-purpose) interface something like `LockTableChecker` or something.
pkg/storage/mvcc.go
line 720 at r9 (raw file):
func (opts *MVCCGetOptions) errOnIntents() bool {
	return !opts.Inconsistent && !opts.SkipLocked
}
Should this not be `!(opts.Inconsistent && opts.SkipLocked)`?
pkg/storage/mvcc.go
line 2541 at r9 (raw file):
func (opts *MVCCScanOptions) errOnIntents() bool {
	return !opts.Inconsistent && !opts.SkipLocked
}
Same question as `MVCCGetOptions`.
pkg/storage/mvcc.go
line 687 at r10 (raw file):
Previously, arulajmani (Arul Ajmani) wrote…
nit: s/IsKeyLocked/IsKeyLockedByConflictingTxn/?
or `IsKeyLockedByOtherTxn`, but I agree it's a useful distinction.
pkg/storage/pebble_mvcc_scanner.go
line 950 at r9 (raw file):
// determine whether this is locked with an unreplicated lock. Replicated
// locks will be represented as intents, which will be skipped over in
// getAndAdvance.
nit: worth mentioning at all here what is meant to happen in the case of an unreplicated lock vs a replicated lock? I.e. it might be useful to be a bit more explicit here (though you do already mention we skip replicated locks elsewhere).
(Force-pushed from 74357ec to b938266.)
IIUC only the part about disabling some rules in the optimizer is not implemented yet, right?
Right. Specifically, the three changes we'll need to make in the optimizer are:
- disable limit pushdown across ForeignKeys when SKIP LOCKED is used on either side of the join
- disable limit pushdown across index joins when SKIP LOCKED is used on the table
- disable SKIP LOCKED on tables with column families
@michae2 has thoughts about how we can address the first two which sound promising. I don't imagine the third one will be too difficult to implement, but I don't know where that check should live.
My plan is to merge the KV portion of this when it's ready and hand the SQL portion over to the SQL Queries team. Does that sound reasonable? If so, should I plan to merge the last commit here with an understanding that there are certain query plans that won't be handled correctly until we disable some optimizations, or should I pull it out?
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @AlexTalks, @arulajmani, and @yuzefovich)
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit: the commits are missing release notes.
Done.
pkg/kv/kvserver/replica_gossip.go
line 185 at r10 (raw file):
Previously, arulajmani (Arul Ajmani) wrote…
nit: nil /* g */
Done.
pkg/kv/kvserver/replica_read.go
line 378 at r11 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit: this seems like an opportunity for some pre-allocation. IIUC this anonymous function is called for each key returned across all `ba.Requests`, and some of the requests can get an empty response, some can get a single key, some can get multiple keys, so we don't know upfront exactly how many "gets" we need to allocate (unless we iterate over all responses upfront first to calculate that number). But we probably can allocate this in small "batches", say of 16 structs. Maybe leave a TODO?
Good point. Added a TODO.
pkg/kv/kvserver/concurrency/concurrency_control.go
line 776 at r10 (raw file):
Previously, arulajmani (Arul Ajmani) wrote…
Instead of defining this function here, could (should?) we embed the `storage.LockTableView` interface here instead?
I'd rather not couple the external interfaces of the storage package to the internal interfaces in this package (notice that this is part of the unexported `lockTableGuard`). I think it's fine that they happen to overlap.
pkg/roachpb/batch.go
line 563 at r10 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
Why do we make a copy of the key here but not for GetResponses nor for ScanResponses with KEY_VALUES format?
The thought was that we are going to take these keys and put them into data structures (the refresh footprint, the timestamp cache), so we don't want to pull out subslices of a larger result buffer that also contains values and then retain long-lasting references to the underlying buffer. This is not a problem in the other cases, because the keys and values are already separate heap allocations. I added a comment.
pkg/storage/mvcc.go
line 681 at r9 (raw file):
Previously, AlexTalks (Alex Sarkesian) wrote…
small nit: could even call this (single-purpose) interface something like `LockTableChecker` or something.
I'd like to keep it a little more general, in case an immutable snapshot of the in-memory lock table can be used in other places. For instance, I imagine it could come in handy for work like #81259.
pkg/storage/mvcc.go
line 720 at r9 (raw file):
Previously, AlexTalks (Alex Sarkesian) wrote…
Should this not be `!(opts.Inconsistent && opts.SkipLocked)`?
No, we want to err on intents if `!(opts.Inconsistent || opts.SkipLocked)`. In other words, if neither of those flags are set.
pkg/storage/mvcc.go
line 687 at r10 (raw file):
Previously, AlexTalks (Alex Sarkesian) wrote…
or `IsKeyLockedByOtherTxn`, but I agree it's a useful distinction.
Good idea. Done.
pkg/storage/pebble_mvcc_scanner.go
line 342 at r9 (raw file):
Previously, arulajmani (Arul Ajmani) wrote…
nit: should we extend the comment to include some words about `skipLocked` as well?
Done.
pkg/storage/pebble_mvcc_scanner.go
line 687 at r9 (raw file):
Previously, arulajmani (Arul Ajmani) wrote…
I was looking at some of your test cases and found it a bit confusing that we don't check for `FailOnMoreRecent` and return a `WriteIntentError` if it's set here. I think I've convinced myself that this is fine, but do you think it's worth adding a few words about this here?
Thanks for raising this question! You are right that the semantics of scanning below a lock when `p.skipLocked && !p.failOnMoreRecent` were poorly defined and incorrect by any reasonable interpretation of a SkipLocked option when decoupled from a locking read. This case doesn't currently arise in practice, but we should still handle it correctly. I've gone ahead and reworked this code. This also required plumbing a desired locking strength into `IsKeyLockedByConflictingTxn` and then using this to detect non-conflicts by non-locking reads (i.e. those that set `failOnMoreRecent` to `false`). PTAL.
pkg/storage/pebble_mvcc_scanner.go
line 706 at r9 (raw file):
Previously, arulajmani (Arul Ajmani) wrote…
nit: indentation made this comment block go over 80 chars
Done.
pkg/kv/kvserver/concurrency/testdata/lock_table/skip_locked
line 87 at r10 (raw file):
Previously, arulajmani (Arul Ajmani) wrote…
s/req2/req3
Done.
pkg/kv/kvserver/concurrency/testdata/lock_table/skip_locked
line 111 at r10 (raw file):
Previously, arulajmani (Arul Ajmani) wrote…
Similar to my comment elsewhere, changing these things to `is-key-locked-by-conflicting-txn` might help a bit in terms of readability.
Done.
pkg/storage/testdata/mvcc_histories/skip_locked
line 400 at r9 (raw file):
Previously, arulajmani (Arul Ajmani) wrote…
nit: do you think it's worth adding a comment explaining the flow of this particular test case? IMO It's kinda interesting!
Discussed on your other comment.
pkg/kv/kvserver/replica_test.go
line 2989 at r11 (raw file):
Previously, arulajmani (Arul Ajmani) wrote…
Should we also validate that the response only returned keys `a` and `b` as well?
Done.
The plan makes sense to me. We probably don't want to merge things that are known to be incorrect, so I'd pull out the last commit, but I'll defer to @michae2 on this since it sounds like he'll pick up the SQL portion.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @AlexTalks, @arulajmani, @nvanbenschoten, and @yuzefovich)
pkg/roachpb/batch.go
line 563 at r10 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
The thought was that we are going to take these keys and put them into data structures (the refresh footprint, the timestamp cache), so we don't want to pull out subslices of a larger result buffer that also contains values and then retain long-lasting references to the underlying buffer. This is not a problem in the other cases, because the keys and values are already separate heap allocations. I added a comment.
I see, thanks for the explanation.
Working on that here: #82188
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @AlexTalks, @arulajmani, @nvanbenschoten, and @yuzefovich)
(Force-pushed from b938266 to 036e855.)
(Force-pushed from 036e855 to ff823cc.)
Rename two enum values for improved readability.
Refactor to make the code easier to read and extend.
This commit teaches the `pebbleMVCCScanner` about the skip-locked scan policy. When a scan is configured with the skipLocked option, it does not include locked keys in the result set. To support this, the MVCC layer needs to be given access to the in-memory lock table, so that it can determine whether keys are locked with an unreplicated lock. Replicated locks are represented as intents, which are skipped over in `getAndAdvance`.
This commit teaches KV concurrency control about the skip-locked scan policy.

Skip locked has a number of touch points in KV. To understand these, we first need to understand the isolation model of skip-locked. When a request is using a SkipLocked wait policy, it behaves as if run at a weaker isolation level for any keys that it skips over. If the read request does not return a key, it does not make a claim about whether that key does or does not exist or what the key's value was at the read's MVCC timestamp. Instead, it only makes a claim about the set of keys that are returned. For those keys which were not skipped and were returned (and often locked, if combined with a locking strength, though this is not required), serializable isolation is enforced.

Currently, skip locked does not impact latching. Requests using the SkipLocked wait policy acquire the same latches as before and wait on all latches ahead of them in line. However, a future commit will exploit optimistic evaluation to limit the cases in which skip locked blocks on latches.

Skip locked requests do not scan the lock table when initially sequencing. Instead, they capture a snapshot of the in-memory lock table while sequencing and scan the lock table as they perform their MVCC scan using the btree snapshot stored in the concurrency guard. MVCC was taught about skip locked in the previous commit.

Skip locked requests add point reads for each of the keys returned to the timestamp cache, instead of adding a single ranged read. This satisfies the weaker isolation level of skip locked. Because the issuing transaction is not intending to enforce serializable isolation across keys that were skipped by its request, it does not need to prevent writes below its read timestamp to keys that were skipped.

Similarly, skip locked requests only record refresh spans for the individual keys returned, instead of recording a refresh span across the entire read span.
Because the issuing transaction is not intending to enforce serializable isolation across keys that were skipped by its request, it does not need to validate that they have not changed if the transaction ever needs to refresh.
This commit teaches optimistic evaluation about the skip-locked scan policy. If a request is using a SkipLocked wait policy, we always perform optimistic evaluation. In Replica.collectSpansRead, SkipLocked reads are able to constrain their read spans down to point reads on just those keys that were returned and were not already locked. This means that there is a good chance that some or all of the write latches that the SkipLocked read would have blocked on won't overlap with the keys that the request ends up returning, so they won't conflict when checking for optimistic conflicts.
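The span-narrowing idea in that last commit can be sketched as a toy model (the names are illustrative only, not the real `Replica.collectSpansRead` signature): narrowing the read's declared spans to point reads on the returned keys lets most concurrent write latches pass the optimistic-conflict check.

```python
# Illustrative model of narrowing read spans for optimistic evaluation.

def collect_spans_read(returned_keys):
    # A SkipLocked read narrows its declared span to point reads on the keys
    # it actually returned.
    return [(k, k) for k in returned_keys]

def overlaps(span, key):
    lo, hi = span
    return lo <= key <= hi

def has_optimistic_conflict(read_spans, write_latch_keys):
    # Optimistic evaluation succeeds iff no write latch overlaps a read span.
    return any(overlaps(s, w) for s in read_spans for w in write_latch_keys)

# The read scanned [a, z) but returned only "b" and "d"; other keys were
# locked or absent.
narrowed = collect_spans_read(["b", "d"])

# Writers hold latches on keys the read skipped: no conflict, no re-evaluation.
assert not has_optimistic_conflict(narrowed, ["a", "c"])
# A writer on a returned key is a genuine conflict.
assert has_optimistic_conflict(narrowed, ["d"])
```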
(Force-pushed from ff823cc to dd4049b.)
TFTRs! I've extracted the final commit from this PR into #83627 and stripped it out of this PR. This allows us to land the KV portion of SKIP LOCKED without exposing it, so we can first resolve the optimizer changes that were discussed in #79134 (review).

bors r=arulajmani
Build succeeded.
KV portion of #40476.
Assists #62734.
Assists #72407.
Assists #78564.
NOTE: the SQL changes here were extracted from this PR and moved to #83627. This allows us to land the KV portion of this change without exposing it yet.
These semantics closely match those of FOR {UPDATE,SHARE} SKIP LOCKED in PostgreSQL. With SKIP LOCKED, any selected rows that cannot be immediately locked are skipped. Skipping locked rows provides an inconsistent view of the data, so this is not suitable for general purpose work, but can be used to avoid lock contention with multiple consumers accessing a queue-like table.
Here is a short video that explains why users might want to use SKIP LOCKED in Postgres. The same motivation applies to CockroachDB. However, SKIP LOCKED is not a complete solution to queues, as MVCC garbage will still become a major problem with sufficiently high consumer throughput. Even with a very low gc.ttl, CockroachDB does not garbage collect MVCC garbage fast enough to avoid slowing down consumers that scan from the head of a queue over MVCC tombstones of previously consumed queue entries.
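The contention-avoidance motivation can be mimicked in-process with non-blocking mutexes (a loose analogy for illustration, not a claim about how CockroachDB implements row locks): each consumer claims the first queue entry whose lock it can take without waiting, which is the essence of `SELECT ... FOR UPDATE SKIP LOCKED LIMIT 1`.

```python
import threading

# In-process analogue of queue consumers using SKIP LOCKED: each consumer
# scans the queue in order and claims the first entry it can lock without
# blocking, skipping entries held by other consumers.

queue = {i: threading.Lock() for i in range(4)}   # entry id -> row lock

def claim_next():
    for entry_id, lock in queue.items():
        if lock.acquire(blocking=False):          # skip locked: never wait
            return entry_id
    return None                                   # queue drained or all locked

# Two consumers each claim a distinct entry without blocking on each other;
# without SKIP LOCKED, the second would wait for the first to commit.
assert claim_next() == 0
assert claim_next() == 1
```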
Implementation
Skip locked has a number of touchpoints in Storage and KV. To understand these, we first need to understand the isolation model of skip-locked. When a request is using a SkipLocked wait policy, it behaves as if run at a weaker isolation level for any keys that it skips over. If the read request does not return a key, it does not make a claim about whether that key does or does not exist or what the key's value was at the read's MVCC timestamp. Instead, it only makes a claim about the set of keys that are returned. For those keys which were not skipped and were returned (and often locked, if combined with a locking strength, though this is not required), serializable isolation is enforced.
When the `pebbleMVCCScanner` is configured with the skipLocked option, it does not include locked keys in the result set. To support this, the MVCC layer needs to be given access to the in-memory lock table, so that it can determine whether keys are locked with an unreplicated lock. Replicated locks are represented as intents, which are skipped over in `getAndAdvance`.

Requests using the SkipLocked wait policy acquire the same latches as before and wait on all latches ahead of them in line. However, if a request is using a SkipLocked wait policy, we always perform optimistic evaluation. In `Replica.collectSpansRead`, SkipLocked reads are able to constrain their read spans down to point reads on just those keys that were returned and were not already locked. This means that there is a good chance that some or all of the write latches that the SkipLocked read would have blocked on won't overlap with the keys that the request ends up returning, so they won't conflict when checking for optimistic conflicts.
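A minimal sketch of that scanning flow, with made-up structures standing in for the scanner, the in-memory lock-table view, and intents (this models the skipping policy only, not the actual Go implementation):

```python
# Hypothetical stand-ins: kvs maps key -> (value, intent_txn_or_None), where
# an intent models a replicated lock; LockTableView models unreplicated locks.

class LockTableView:
    def __init__(self, locks):
        self.locks = locks               # key -> txn id holding the lock
    def is_key_locked_by_conflicting_txn(self, key, txn_id):
        holder = self.locks.get(key)
        return holder is not None and holder != txn_id

def mvcc_scan(kvs, lock_view, txn_id, skip_locked=False):
    out = []
    for key, (value, intent_txn) in sorted(kvs.items()):
        if skip_locked:
            if intent_txn not in (None, txn_id):
                continue                 # replicated lock: skip the intent
            if lock_view.is_key_locked_by_conflicting_txn(key, txn_id):
                continue                 # unreplicated lock: skip via the view
        out.append((key, value))
    return out

view = LockTableView({"b": "txn2"})      # unreplicated lock on "b"
kvs = {"a": ("va", None), "b": ("vb", None), "c": ("vc", "txn3")}
# txn1 skips "b" (unreplicated lock) and "c" (another txn's intent).
assert mvcc_scan(kvs, view, "txn1", skip_locked=True) == [("a", "va")]
```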
Skip locked requests do not scan the lock table when initially sequencing. Instead, they capture a snapshot of the in-memory lock table while sequencing and scan the lock table as they perform their MVCC scan using the btree snapshot stored in the concurrency guard. MVCC was taught about skip locked in the previous commit.
Skip locked requests add point reads for each of the keys returned to the timestamp cache, instead of adding a single ranged read. This satisfies the weaker isolation level of skip locked. Because the issuing transaction is not intending to enforce serializable isolation across keys that were skipped by its request, it does not need to prevent writes below its read timestamp to keys that were skipped.
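A toy model of that timestamp-cache difference (the real timestamp cache is an interval structure; this list-of-spans version and its names are illustrative only):

```python
# Point reads vs. one ranged read in a toy timestamp cache.

class TimestampCache:
    def __init__(self):
        self.spans = []                  # (lo, hi, ts) with inclusive bounds
    def add(self, lo, hi, ts):
        self.spans.append((lo, hi, ts))
    def max_read_ts(self, key):
        return max((ts for lo, hi, ts in self.spans if lo <= key <= hi),
                   default=0)

read_ts = 10
normal, skip_locked = TimestampCache(), TimestampCache()
normal.add("a", "z", read_ts)            # ordinary scan: one ranged read
for k in ["b", "d"]:                     # skip-locked scan: only returned keys
    skip_locked.add(k, k, read_ts)

# A later write to skipped key "c" below ts=10 is pushed by the ranged read...
assert normal.max_read_ts("c") == read_ts
# ...but not by the skip-locked read, which made no claim about "c".
assert skip_locked.max_read_ts("c") == 0
assert skip_locked.max_read_ts("b") == read_ts
```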
Similarly, skip locked requests only record refresh spans for the individual keys returned, instead of recording a refresh span across the entire read span. Because the issuing transaction is not intending to enforce serializable isolation across keys that were skipped by its request, it does not need to validate that they have not changed if the transaction ever needs to refresh.
Benchmarking
I haven't done any serious benchmarking with SKIP LOCKED yet, though I'd like to. At some point, I would like to build a simple queue-like workload into the `workload` tool and experiment with various consumer access patterns (non-locking reads, locking reads, skip-locked reads), indexing schemes, concurrency levels (for producers and consumers), and batch sizes.