
kv: support FOR {UPDATE,SHARE} SKIP LOCKED #79134

Merged
merged 5 commits
Jun 30, 2022

Conversation

nvanbenschoten
Member

@nvanbenschoten nvanbenschoten commented Mar 31, 2022

KV portion of #40476.
Assists #62734.
Assists #72407.
Assists #78564.

NOTE: the SQL changes here were extracted from this PR and moved to #83627. This allows us to land the KV portion of this change without exposing it yet.

CREATE TABLE kv (k INT PRIMARY KEY, v INT)
INSERT INTO kv VALUES (1, 1), (2, 2), (3, 3)


-- in session 1
BEGIN; UPDATE kv SET v = 0 WHERE k = 1 RETURNING *

  k | v
----+----
  1 | 0


-- in session 2
BEGIN; SELECT * FROM kv ORDER BY k LIMIT 1 FOR UPDATE SKIP LOCKED

  k | v
----+----
  2 | 2


-- in session 3
BEGIN; SELECT * FROM kv FOR UPDATE SKIP LOCKED

  k | v
----+----
  3 | 3

These semantics closely match those of FOR {UPDATE,SHARE} SKIP LOCKED in PostgreSQL. With SKIP LOCKED, any selected rows that cannot be immediately locked are skipped. Skipping locked rows provides an inconsistent view of the data, so this is not suitable for general purpose work, but can be used to avoid lock contention with multiple consumers accessing a queue-like table.

Here is a short video that explains why users might want to use SKIP LOCKED in Postgres. The same motivation applies to CockroachDB. However, SKIP LOCKED is not a complete solution to queues, as MVCC garbage will still become a major problem with sufficiently high consumer throughput. Even with a very low gc.ttl, CockroachDB does not garbage collect MVCC garbage fast enough to avoid slowing down consumers that scan from the head of a queue over MVCC tombstones of previously consumed queue entries.


Implementation

Skip locked has a number of touchpoints in Storage and KV. To understand these, we first need to understand the isolation model of skip-locked. When a request is using a SkipLocked wait policy, it behaves as if run at a weaker isolation level for any keys that it skips over. If the read request does not return a key, it does not make a claim about whether that key does or does not exist or what the key's value was at the read's MVCC timestamp. Instead, it only makes a claim about the set of keys that are returned. For those keys which were not skipped and were returned (and often locked, if combined with a locking strength, though this is not required), serializable isolation is enforced.

When the pebbleMVCCScanner is configured with the skipLocked option, it does not include locked keys in the result set. To support this, the MVCC layer needs access to the in-memory lock table so that it can determine whether keys are locked with unreplicated locks. Replicated locks are represented as intents, which are skipped over in getAndAdvance.

Requests using the SkipLocked wait policy acquire the same latches as before and wait on all latches ahead of them in line. However, if a request is using a SkipLocked wait policy, we always perform optimistic evaluation. In Replica.collectSpansRead, SkipLocked reads are able to constrain their read spans down to point reads on just those keys that were returned and were not already locked. This means there is a good chance that some or all of the write latches the SkipLocked read would have blocked on won't overlap with the keys the request ends up returning, so the optimistic conflict check will often succeed.
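A minimal sketch of that narrowing, with hypothetical helper names (the real logic lives in Replica.collectSpansRead and the latch manager's optimistic conflict check):

```go
package main

import "fmt"

type span struct{ start, end string } // end == "" means a point span

// collectPointSpans sketches the narrowing done for SkipLocked reads:
// rather than declaring the full scanned range, declare point spans on
// only the keys that were actually returned.
func collectPointSpans(returnedKeys []string) []span {
	spans := make([]span, 0, len(returnedKeys))
	for _, k := range returnedKeys {
		spans = append(spans, span{start: k})
	}
	return spans
}

// conflictsWith reports whether any declared point span falls inside a
// concurrent write latch's range, i.e. whether optimistic evaluation
// must fall back to waiting on the latch.
func conflictsWith(points []span, writeLatch span) bool {
	for _, p := range points {
		if p.start >= writeLatch.start && p.start < writeLatch.end {
			return true
		}
	}
	return false
}

func main() {
	// The scan covered [a, z) but only returned b and c; a write latch
	// over [d, f) no longer conflicts with the narrowed point spans.
	points := collectPointSpans([]string{"b", "c"})
	fmt.Println(conflictsWith(points, span{"d", "f"}))
}
```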

Skip locked requests do not scan the lock table when initially sequencing. Instead, they capture a snapshot of the in-memory lock table while sequencing and scan the lock table as they perform their MVCC scan using the btree snapshot stored in the concurrency guard. MVCC was taught about skip locked in the previous commit.
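The value of snapshotting at sequencing time is that the scan sees a stable view of the lock table. A toy illustration (the real snapshot is an immutable clone of the in-memory btree stored in the concurrency guard; a map copy stands in for it here):

```go
package main

import "fmt"

// snapshotLocks models capturing the lock table while sequencing: locks
// acquired after the snapshot is taken do not change which keys this
// request considers locked during its MVCC scan.
func snapshotLocks(live map[string]bool) map[string]bool {
	snap := make(map[string]bool, len(live))
	for k, v := range live {
		snap[k] = v
	}
	return snap
}

func main() {
	live := map[string]bool{"a": true}
	snap := snapshotLocks(live) // captured while sequencing
	live["b"] = true            // a lock acquired after sequencing
	// The scan, driven by the snapshot, still treats only "a" as locked.
	fmt.Println(snap["a"], snap["b"])
}
```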

Skip locked requests add point reads for each of the keys returned to the timestamp cache, instead of adding a single ranged read. This satisfies the weaker isolation level of skip locked. Because the issuing transaction is not intending to enforce serializable isolation across keys that were skipped by its request, it does not need to prevent writes below its read timestamp to keys that were skipped.

Similarly, skip locked requests only record refresh spans for the individual keys returned, instead of recording a refresh span across the entire read span. Because the issuing transaction is not intending to enforce serializable isolation across keys that were skipped by its request, it does not need to validate that those keys have not changed if the transaction ever needs to refresh.


Benchmarking

I haven't done any serious benchmarking with SKIP LOCKED yet, though I'd like to. At some point, I would like to build a simple queue-like workload into the workload tool and experiment with various consumer access patterns (non-locking reads, locking reads, skip-locked reads), indexing schemes, concurrency levels (for producers and consumers), and batch sizes.

@nvanbenschoten nvanbenschoten requested review from a team as code owners March 31, 2022 14:55
@nvanbenschoten nvanbenschoten requested a review from a team March 31, 2022 14:55

@nvanbenschoten nvanbenschoten force-pushed the nvanbenschoten/skipLocked branch 2 times, most recently from a35ea8c to 1da4050 Compare March 31, 2022 18:39
@tbg
Member

tbg commented Mar 31, 2022

Aren't we scared of inconsistent data sources? I thought throughout the SQL code we assume that data is internally consistent. Couldn't there be undefined behavior if parts of a request's execution aren't internally consistent? I don't have any depth to that question. I'm simply vaguely worried since this is the first instance in which the SQL layer deliberately is tasked to handle data that may be seeing only "half" of a committed transaction (for example, could be seeing a PK record, but not its corresponding entry on a secondary index). Maybe this is all fine, but I would be interested in hearing the rationale and how this fits into our consistency model.

@jordanlewis
Member

This is great to see, thank you for doing this @nvanbenschoten!

I have a similar concern to Tobi.

For example, the optimizer uses information about foreign key presence to make decisions. What would happen if the optimizer assumes the presence of a foreign key, but the foreign key is un-scannable due to SKIP LOCKED?

Or, for secondary indexes.

I did a quick test and saw some surprising behavior (maybe a good surprise?)

Window 1:

root@:26257/defaultdb> create table a (a PRIMARY KEY, b, c) AS select g,g+1,g+2 from generate_series(1,100) g(g);
CREATE TABLE AS
root@:26257/defaultdb> create index on a(c);
CREATE INDEX
root@:26257/defaultdb> begin;
BEGIN
root@:26257/defaultdb  OPEN> update a set b=10 where a=1 returning *;

Window 2:

root@:26257/defaultdb> begin;
BEGIN
root@:26257/defaultdb  OPEN> explain select * from a@a_c_idx limit 1 for update skip locked;
                                       info
-----------------------------------------------------------------------------------
  distribution: local
  vectorized: true

  • index join
  │ estimated row count: 1
  │ table: a@a_pkey
  │
  └── • scan
        estimated row count: 1 (1.0% of the table; stats collected 2 minutes ago)
        table: a@a_c_idx
        spans: LIMITED SCAN
        limit: 1
        locking strength: for update
        locking wait policy: skip locked
(14 rows)


Time: 1ms total (execution 1ms / network 0ms)

root@:26257/defaultdb  OPEN> select * from a@a_c_idx limit 1 for update skip locked;

In window 2, the last command hangs until I abort the first txn. This is surprising! Since it's marked as SKIP LOCKED, I'd have assumed that it would successfully read the non-locked secondary index key, try to index join to the locked primary index key, skip the result due to the lock, and come back with an inconsistent case: a secondary index key without a matching primary index key.

But, this didn't happen, which is great. Do you have special logic to prevent this case?

@jordanlewis jordanlewis requested a review from a team April 1, 2022 14:10
@nvanbenschoten
Member Author

Yeah, @tbg is completely right. Nice catch! If SKIP LOCKED allows scans to return inconsistent results, the SQL optimizer is going to have to adjust its expectations and disable certain optimizations. I don't really see a way around this. I'd be very interested in understanding the effect that SkipLocked has on Postgres' optimizer.

I did a quick test and saw some surprising behavior (maybe a good surprise?)
...
But, this didn't happen, which is great. Do you have special logic to prevent this case?

I don't think this is by design, unfortunately. I suspect it's because we still need to land #60719 to make sure that the SkipLocked wait-policy makes its way to the index join's KV requests. I had forgotten about that PR, but it's probably what we should push on first.

@nvanbenschoten
Member Author

I took a look at Postgres' codebase and did not find any reference to SkipLocked in their query optimizer. However, I was also unable to find a query whose query plan relied on consistency across multiple tables. For instance, I was not able to generate a query that pushed a limit through a join thanks to a validated foreign key. @andy-kimball speculates that PG might not do this kind of thing.

This makes me wonder how many of these cases actually exist in CRDB. The only ones I can think of are:

  1. index joins, where we assume a 1:1 pairing between each secondary index entry and each primary index entry.
  2. any optimization (like limit pushdown or any normalization rule powered by the JoinFiltersMatchAllLeftRows filter) that is only allowed when a foreign key is validated and also when the IgnoreForeignKeys option is not used.

One observation here is that the former is really a specialization of the latter. I don't think the optimizer treats index joins that way, but it's interesting to consider index joins as a form of lookup join across a foreign key reference. I think I recall some discussion about other cost-based optimizers treating each index as a separate table.

The other observation is that it may not be very hard to track down the few optimization rules that rely on cross-row consistency and disable them when SKIP LOCKED is in use.

@nvanbenschoten
Member Author

@jordanlewis I cherry-picked #60719 on to this PR and tested out your example. I saw what I think we expect to see if we don't do anything about query plan optimizations that depend on consistency:

-- terminal 1
root@127.0.0.1:26257/movr  OPEN> update a set b=10 where a=1 returning *;
  a | b  | c
----+----+----
  1 | 10 | 3
(1 row)

-- terminal 2
root@:26257/movr  OPEN> explain select * from a@a_c_idx limit 1 for update skip locked;
                   info
------------------------------------------
  distribution: local
  vectorized: true

  • index join
  │ table: a@a_pkey
  │ locking strength: for update
  │ locking wait policy: skip locked
  │
  └── • scan
        missing stats
        table: a@a_c_idx
        spans: LIMITED SCAN
        limit: 1
        locking strength: for update
        locking wait policy: skip locked
(15 rows)

root@:26257/movr  OPEN> select * from a@a_c_idx limit 1 for update skip locked;
  a | b | c
----+---+----
(0 rows)

The index join doesn't crash or anything with the inconsistent state, it just behaves like an inner join, like we want it to. So the problem is really just that the limit pushdown through the join is incorrect. I'm still struggling to think of any optimizations beyond limit pushdown that are incompatible with inconsistent scans.

@jordanlewis
Member

Nice, yeah, maybe there is just a finite number of places to check in the optimizer.

On the flip side, I'm also concerned that some places in the execution engine would expect there to always be a row on the other side of a foreign key check or index join. Maybe @yuzefovich would have an idea here.

@nvanbenschoten
Member Author

I'd be somewhat surprised if there were assumptions about foreign keys. I don't see any mention of FK validity in the execution engine, which would be needed if the execution engine was making assumptions. I'd be less surprised if there were assumptions about index joins because they are always implicitly valid. However, I don't see anything (though it's less clear what to search for) and index join seems to behave like an inner join, which is what we want. Certainly best to have @yuzefovich weigh in.

@michae2
Collaborator

michae2 commented Apr 5, 2022

From the optimizer's perspective, could we consider SKIP LOCKED to be a "virtual" predicate that only the PK could evaluate? Then I think the optimizer could function as it normally does.

In other words, if every table had a virtual is_this_row_currently_locked column only in the PK, and every SKIP LOCKED query added a predicate WHERE is_this_row_currently_locked = false then perhaps the optimizer would not need any special handling?

@yuzefovich
Member

I'm also concerned that some places in the execution engine would expect there to always be a row on the other side of a foreign key check or index join. Maybe @yuzefovich would have an idea here.

I don't think we make any such assumptions in the execution engine. We do take advantage of the fact that index joins have 1-to-1 match with their input rows (from the secondary indexes) for the purposes of "estimated row count" / limits / limit hints, but even if those estimates were to turn out wrong, the execution would proceed to fetch more. The index join won't return enough rows only if its input (the scan) short-circuits itself, which is the case when the LIMIT is pushed through the index join (I can't think of anything else). So to me it sounds like Nathan's idea of somehow disabling some rules in the optimizer should work and be sufficient for correctness.

@nvanbenschoten
Member Author

From the optimizer's perspective, could we consider SKIP LOCKED to be a "virtual" predicate that only the PK could evaluate? Then I think the optimizer could function as it normally does.

This is to prevent limit pushdown across the index join, right? Would we only do this if the PK was needed in the first place? We wouldn't want to disallow index-only scans with SKIP LOCKED, as that's a very common usage pattern.

I don't think we make any such assumptions in the execution engine.

Thanks for confirming @yuzefovich!

@nvanbenschoten
Member Author

Column families pose another challenge here. When a row is split across multiple column families (i.e. multiple KV keys), each of its corresponding keys may or may not be locked at a given point in time. This makes SKIP LOCKED complex, because without any special handling, it may find a row that is "half locked" and a Scan with a SkipLocked wait policy would only return the unlocked half.

That would trick SQL into thinking the other columns in the locked half were NULL, even if this directly contradicted a NOT NULL constraint. Or worse, if this did not contradict a NOT NULL constraint, then SQL would be unable to distinguish a locked column family from a NULL column family. In either case, the "correct" behavior would be to not return any part of the row.

We have a few options for how to address this:

  1. Return locked keys (without values) to the client during SkipLocked scans and let the client decide what to do when it finds a half-locked row. This is complex, as it would require an extension of the KV API and extra plumbing at a few levels in the stack. It would also require logic in SQL to piece rows back together and discard rows that were partially locked.
  2. Teach KV to skip over all keys in half-locked rows. KV is currently unaware of row boundaries (for the most part), but we do have some desire to make it more aware of the multiple keys that together form a single row. If it knew about row boundaries, it could avoid adding any key from a half-locked row to a Scan's result set.
  3. Disallow SKIP LOCKED scans over tables/indexes with multiple column families.

Given the complexity involved with either of the first two options, the relative rarity of column families, and the limited cases where we expect SKIP LOCKED to be used, I'm leaning towards the third option. I'm curious what others think.

@jordanlewis
Member

jordanlewis commented Apr 19, 2022

I think #3 sounds fine. AFAICT SKIP LOCKED is a fairly rare and niche feature. I don't think we should spend time making it completely robust and general-purpose before validating that we need to by releasing the simpler, 1-CF limited version and seeking feedback.

@nvanbenschoten nvanbenschoten force-pushed the nvanbenschoten/skipLocked branch from 1da4050 to 29198b9 Compare May 5, 2022 05:43
@nvanbenschoten nvanbenschoten force-pushed the nvanbenschoten/skipLocked branch from 29198b9 to 74357ec Compare May 17, 2022 16:14
@arulajmani arulajmani self-requested a review May 17, 2022 20:04
@michae2
Collaborator

michae2 commented May 17, 2022

@rytaft pointed out that the IgnoreForeignKeys flag could be set on all AliasedTableExpr to skip optimizations using FK references.

And I think we could add a similar flag to skip optimizations using self-join references. Let me see how hard that would be. I think we probably just have to plumb a flag into checkSelfJoinCase.

Collaborator

@arulajmani arulajmani left a comment


The KV part looks good to me, I only have a couple of minor comments. Also, this was a really fun read 💯

Reviewed 1 of 8 files at r1, 50 of 50 files at r7, 1 of 1 files at r8, 4 of 4 files at r9, 36 of 36 files at r10, 3 of 3 files at r11.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @arulajmani and @nvanbenschoten)


pkg/kv/kvserver/replica_gossip.go line 185 at r10 (raw file):

	br, result, pErr :=
		evaluateBatch(ctx, kvserverbase.CmdIDKey(""), rw, rec, nil, &ba, nil, nil /* st */, uncertainty.Interval{}, true /* readOnly */)

nit: nil /* g */


pkg/kv/kvserver/concurrency/concurrency_control.go line 776 at r10 (raw file):

	// the SkipLocked wait policy to determine which keys they should skip over
	// during evaluation.
	IsKeyLocked(roachpb.Key) (bool, *enginepb.TxnMeta)

Instead of defining this function here, could (should?) we embed the storage.LockTableView interface here instead?


pkg/storage/mvcc.go line 687 at r10 (raw file):

	// conjunction with the SkipLocked wait policy to determine which keys they
	// should skip over during evaluation.
	IsKeyLocked(roachpb.Key) (bool, *enginepb.TxnMeta)

nit: s/IsKeyLocked/IsKeyLockedByConflictingTxn/?


pkg/storage/pebble_mvcc_scanner.go line 342 at r9 (raw file):

	meta enginepb.MVCCMetadata
	// Bools copied over from MVCC{Scan,Get}Options. See the comment on the
	// package level MVCCScan for what these mean.

nit: should we extend the comment to include some words about skipLocked as well?


pkg/storage/pebble_mvcc_scanner.go line 687 at r9 (raw file):

			// 8. The scanner has been configured with the skipLocked option. Ignore
			// intents written by other transactions and seek to the next key.
			// However, we return the intent separately if we have room; the caller

I was looking at some of your testcases and I found it a bit confusing we don't check for FailOnMoreRecent and return a WriteIntentError if it's set here. I think I've convinced myself that this is fine, but do you think it's worth adding a few words about this here?


pkg/storage/pebble_mvcc_scanner.go line 706 at r9 (raw file):

			// into case 11 below.
			if p.checkUncertainty {
				// The intent's provisional value may be within the uncertainty window. Or

nit: indentation made this comment block go over 80 chars


pkg/kv/kvserver/concurrency/testdata/lock_table/skip_locked line 87 at r10 (raw file):


# ---------------------------------------------------------------------------------
# req2 will scan the lock table with a Skip wait policy. It will not need to wait.

s/req2/req3


pkg/kv/kvserver/concurrency/testdata/lock_table/skip_locked line 111 at r10 (raw file):

locked: true, holder: 00000000-0000-0000-0000-000000000001

is-key-locked r=req3 k=c

Similar to my comment elsewhere, changing these things to is-key-locked-by-conflicting-txn might help a bit in terms of readability.


pkg/storage/testdata/mvcc_histories/skip_locked line 400 at r9 (raw file):

get ts=15 k=k4 skipLocked
----
get: "k4" -> intent {id=00000000 key=/Min pri=0.00000000 epo=0 ts=16.000000000,0 min=0,0 seq=0}

nit: do you think it's worth adding a comment explaining the flow of this particular test case? IMO It's kinda interesting!


pkg/kv/kvserver/replica_test.go line 2989 at r11 (raw file):

			blockWriters.Store(false)
			go func() {
				_, pErr := tc.Sender().Send(ctx, baRead)

Should we also validate that the response only returned keys a and b as well?

Copy link
Member

@yuzefovich yuzefovich left a comment


Skimmed through the code, nice work! IIUC only the part about disabling some rules in the optimizer is not implemented yet, right?

Reviewed 1 of 8 files at r1, 50 of 50 files at r7, 1 of 1 files at r8, 2 of 4 files at r9, 24 of 36 files at r10, 2 of 3 files at r11, 6 of 6 files at r12, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)


-- commits line 4 at r8:
nit: the commits are missing release notes.


pkg/kv/kvserver/replica_read.go line 378 at r11 (raw file):

			// SkipLocked wait policy.
			if err := roachpb.ResponseKeyIterate(req, resp, func(key roachpb.Key) {
				getAlloc := new(struct {

nit: this seems like an opportunity for some pre-allocation. IIUC this anonymous function is called for each key returned across all ba.Requests, and some of the requests can get an empty response, some can get a single key, some can get multiple keys, so we don't know upfront exactly how many "gets" we need to allocate (unless we iterate over all responses upfront to calculate that number). But we probably can allocate this in small "batches", say of 16 structs. Maybe leave a TODO?


pkg/roachpb/batch.go line 563 at r10 (raw file):

		// If ScanFormat == BATCH_RESPONSE.
		if err := enginepb.ScanDecodeKeyValues(v.BatchResponses, func(key []byte, _ hlc.Timestamp, _ []byte) error {
			fn(Key(key).Clone())

Why do we make a copy of the key here but not for GetResponses nor for ScanResponses with KEY_VALUES format?

Copy link
Contributor

@AlexTalks AlexTalks left a comment


Reviewed 1 of 8 files at r1, 50 of 50 files at r7, 1 of 1 files at r8.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @arulajmani and @nvanbenschoten)


pkg/storage/mvcc.go line 681 at r9 (raw file):

// table (i.e. unreplicated locks) and locks only stored in the persistent lock
// table keyspace (i.e. replicated locks that have yet to be "discovered").
type LockTableView interface {

small nit: could even call this (single purpose) interface something like LockTableChecker or something.


pkg/storage/mvcc.go line 720 at r9 (raw file):

func (opts *MVCCGetOptions) errOnIntents() bool {
	return !opts.Inconsistent && !opts.SkipLocked

Should this not be !(opts.Inconsistent && opts.SkipLocked)?


pkg/storage/mvcc.go line 2541 at r9 (raw file):

func (opts *MVCCScanOptions) errOnIntents() bool {
	return !opts.Inconsistent && !opts.SkipLocked

Same question as MVCCGetOptions.


pkg/storage/mvcc.go line 687 at r10 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

nit: s/IsKeyLocked/IsKeyLockedByConflictingTxn/?

or IsKeyLockedByOtherTxn, but I agree it's a useful distinction.


pkg/storage/pebble_mvcc_scanner.go line 950 at r9 (raw file):

	// determine whether this is locked with an unreplicated lock. Replicated
	// locks will be represented as intents, which will be skipped over in
	// getAndAdvance.

nit: worth mentioning at all here what is meant to happen in the case of an unreplicated lock vs a replicated lock? I.e. it might be useful to be a bit more explicit here (though you do already mention we skip replicated locks elsewhere).

@nvanbenschoten nvanbenschoten force-pushed the nvanbenschoten/skipLocked branch from 74357ec to b938266 Compare June 1, 2022 21:14
Member Author

@nvanbenschoten nvanbenschoten left a comment


IIUC only the part about disabling some rules in the optimizer is not implemented yet, right?

Right. Specifically, the three changes we'll need to make in the optimizer are:

  1. disable limit pushdown across ForeignKeys when SKIP LOCKED is used on either side of the join
  2. disable limit pushdown across index joins when SKIP LOCKED is used on the table
  3. disable SKIP LOCKED on tables with column families

@michae2 has thoughts about how we can address the first two which sound promising. I don't imagine the third one will be too difficult to implement, but I don't know where that check should live.

My plan is to merge the KV portion of this when it's ready and hand the SQL portion over to the SQL Queries team. Does that sound reasonable? If so, should I plan to merge the last commit here with the understanding that certain query plans won't be handled correctly until we disable some optimizations, or should I pull it out?

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @AlexTalks, @arulajmani, and @yuzefovich)


-- commits line 4 at r8:

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: the commits are missing release notes.

Done.


pkg/kv/kvserver/replica_gossip.go line 185 at r10 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

nit: nil /* g */

Done.


pkg/kv/kvserver/replica_read.go line 378 at r11 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: this seems like an opportunity for some pre-allocation. IIUC this anonymous function is called for each key returned across all ba.Requests, and some of the requests can get an empty response, some can get a single key, some can get multiple keys, so we don't know upfront exactly how many "gets" we need to allocate (unless we iterate over all responses upfront to calculate that number). But we probably can allocate this in small "batches", say of 16 structs. Maybe leave a TODO?

Good point. Added a TODO.


pkg/kv/kvserver/concurrency/concurrency_control.go line 776 at r10 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

Instead of defining this function here, could (should?) we embed the storage.LockTableView interface here instead?

I'd rather not couple the external interfaces of the storage package to the internal interfaces in this package (notice that this is part of the unexported lockTableGuard). I think it's fine that they happen to overlap.


pkg/roachpb/batch.go line 563 at r10 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

Why do we make a copy of the key here but not for GetResponses nor for ScanResponses with KEY_VALUES format?

The thought was that we are going to take these keys and put them into data structures (the refresh footprint, the timestamp cache), so we don't want to pull out subslices of a larger result buffer that also contains values and then retain long-lasting references to the underlying buffer. This is not a problem in the other cases, because the keys and values are already separate heap allocations. I added a comment.


pkg/storage/mvcc.go line 681 at r9 (raw file):

Previously, AlexTalks (Alex Sarkesian) wrote…

small nit: could even call this (single purpose) interface something like LockTableChecker or something.

I'd like to keep it a little more general, in case an immutable snapshot of the in-memory lock table can be used in other places. For instance, I imagine it could come in handy for work like #81259.


pkg/storage/mvcc.go line 720 at r9 (raw file):

Previously, AlexTalks (Alex Sarkesian) wrote…

Should this not be !(opts.Inconsistent && opts.SkipLocked)?

No, we want to err on intents if !(opts.Inconsistent || opts.SkipLocked). In other words, only when neither of those flags is set.


pkg/storage/mvcc.go line 687 at r10 (raw file):

Previously, AlexTalks (Alex Sarkesian) wrote…

or IsKeyLockedByOtherTxn, but I agree it's a useful distinction.

Good idea. Done.


pkg/storage/pebble_mvcc_scanner.go line 342 at r9 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

nit: should we extend the comment to include some words about skipLocked as well?

Done.


pkg/storage/pebble_mvcc_scanner.go line 687 at r9 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

I was looking at some of your testcases and I found it a bit confusing we don't check for FailOnMoreRecent and return a WriteIntentError if it's set here. I think I've convinced myself that this is fine, but do you think it's worth adding a few words about this here?

Thanks for raising this question! You are right that the semantics of scanning below a lock when p.skipLocked && !p.failOnMoreRecent were poorly defined and incorrect by any reasonable interpretation of a SkipLocked option when decoupled from a locking read. This case doesn't currently arise in practice, but we should still handle it correctly. I've gone ahead and reworked this code. This also required plumbing a desired locking strength into IsKeyLockedByConflictingTxn and then using this to detect non-conflicts by non-locking reads (i.e. those that set failOnMoreRecent to false). PTAL.
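The strength-aware check described here can be sketched as follows (hypothetical names and a simplified compatibility table; the real method also compares transaction IDs, and here we assume the lock holder is another transaction):

```go
package main

import "fmt"

// Strength models the locking strength plumbed into the conflict check.
type Strength int

const (
	None      Strength = iota // non-locking read
	Shared                    // FOR SHARE
	Exclusive                 // FOR UPDATE
)

// keyLockedByConflictingTxn sketches the reworked check: a held lock
// only causes the key to be skipped if the scan's strength conflicts
// with it. A non-locking read (strength None) never conflicts, so it
// reads below locks rather than skipping them.
func keyLockedByConflictingTxn(held, scan Strength) bool {
	if scan == None {
		return false // non-locking reads never conflict
	}
	if held == Exclusive || scan == Exclusive {
		return true
	}
	return false // Shared is compatible with Shared
}

func main() {
	fmt.Println(keyLockedByConflictingTxn(Exclusive, None))      // non-locking read: no conflict
	fmt.Println(keyLockedByConflictingTxn(Exclusive, Exclusive)) // FOR UPDATE vs. exclusive lock: conflict
	fmt.Println(keyLockedByConflictingTxn(Shared, Shared))       // FOR SHARE vs. shared lock: no conflict
}
```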


pkg/storage/pebble_mvcc_scanner.go line 706 at r9 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

nit: indentation made this comment block go over 80 chars

Done.


pkg/kv/kvserver/concurrency/testdata/lock_table/skip_locked line 87 at r10 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

s/req2/req3

Done.


pkg/kv/kvserver/concurrency/testdata/lock_table/skip_locked line 111 at r10 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

Similar to my comment elsewhere, changing these things to is-key-locked-by-conflicting-txn might help a bit in terms of readability.

Done.


pkg/storage/testdata/mvcc_histories/skip_locked line 400 at r9 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

nit: do you think it's worth adding a comment explaining the flow of this particular test case? IMO It's kinda interesting!

Discussed on your other comment.


pkg/kv/kvserver/replica_test.go line 2989 at r11 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

Should we also validate that the response only returned keys a and b as well?

Done.

Member

@yuzefovich yuzefovich left a comment


The plan makes sense to me. We probably don't want to merge things that are known to be incorrect, so I'd pull out the last commit, but I'll defer to @michae2 on this since it sounds like he'll pick up the SQL portion.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @AlexTalks, @arulajmani, @nvanbenschoten, and @yuzefovich)


pkg/roachpb/batch.go line 563 at r10 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

The thought was that we are going to take these keys and put them into data structures (the refresh footprint, the timestamp cache), so we don't want to pull out subslices of a larger result buffer that also contains values and then retain long-lasting references to the underlying buffer. This is not a problem in the other cases, because the keys and values are already separate heap allocations. I added a comment.

I see, thanks for the explanation.

@michae2 michae2 (Collaborator) left a comment

Working on that here: #82188


@nvanbenschoten nvanbenschoten force-pushed the nvanbenschoten/skipLocked branch from b938266 to 036e855 Compare June 29, 2022 23:38
@nvanbenschoten nvanbenschoten changed the title kv/sql: support FOR {UPDATE,SHARE} SKIP LOCKED kv: support FOR {UPDATE,SHARE} SKIP LOCKED Jun 29, 2022
@nvanbenschoten nvanbenschoten force-pushed the nvanbenschoten/skipLocked branch from 036e855 to ff823cc Compare June 29, 2022 23:52
Rename two enum values for improved readability.
Refactor to make the code easier to read and extend.
This commit teaches the `pebbleMVCCScanner` about the skip-locked scan
policy.

When a scan is configured with the skipLocked option, it does not
include locked keys in the result set. To support this, the MVCC layer
needs to be provided access to the in-memory lock table, so that it can
determine whether keys are locked with an unreplicated lock. Replicated
locks are represented as intents, which will be skipped over in
getAndAdvance.
This commit teaches KV concurrency control about the skip-locked scan
policy.

Skip locked has a number of touch points in KV. To understand these, we
first need to understand the isolation model of skip-locked. When a
request is using a SkipLocked wait policy, it behaves as if run at a
weaker isolation level for any keys that it skips over. If the read
request does not return a key, it does not make a claim about whether
that key does or does not exist or what the key's value was at the
read's MVCC timestamp. Instead, it only makes a claim about the set of
keys that are returned. For those keys which were not skipped and were
returned (and often locked, if combined with a locking strength, though
this is not required), serializable isolation is enforced.

Currently, skip locked does not impact latching. Requests using the
SkipLocked wait policy acquire the same latches as before and wait on
all latches ahead of them in line. However, a future commit will exploit
optimistic evaluation to limit the cases in which skip locked blocks on
latches.

Skip locked requests do not scan the lock table when initially
sequencing. Instead, they capture a snapshot of the in-memory lock table
while sequencing and scan the lock table as they perform their MVCC scan
using the btree snapshot stored in the concurrency guard. MVCC was
taught about skip locked in the previous commit.

Skip locked requests add point reads to the timestamp cache for each of
the returned keys, instead of adding a single ranged read. This
satisfies the weaker isolation level of skip locked. Because the issuing
transaction is not intending to enforce serializable isolation across
keys that were skipped by its request, it does not need to prevent
writes below its read timestamp to keys that were skipped.

Similarly, skip locked requests only record refresh spans for the
individual keys returned, instead of recording a refresh span across the
entire read span. Because the issuing transaction is not intending to
enforce serializable isolation across keys that were skipped by its
request, it does not need to validate that they have not changed if the
transaction ever needs to refresh.
This commit teaches optimistic evaluation about the skip-locked scan
policy.

If a request is using a SkipLocked wait policy, we always perform
optimistic evaluation. In Replica.collectSpansRead, SkipLocked reads are
able to constrain their read spans down to point reads on just those
keys that were returned and were not already locked. This means that
there is a good chance that some or all of the write latches that the
SkipLocked read would have blocked on won't overlap with the keys that
the request ends up returning, so they won't conflict when checking for
optimistic conflicts.
@nvanbenschoten nvanbenschoten force-pushed the nvanbenschoten/skipLocked branch from ff823cc to dd4049b Compare June 30, 2022 05:39
@nvanbenschoten
Member Author

TFTRs!

I've extracted the final commit from this PR into #83627 and stripped it out of this PR. This allows us to land the KV portion of SKIP LOCKED without exposing it, so we can first resolve the optimizer changes that were discussed in #79134 (review).

bors r=arulajmani

@craig
Contributor

craig bot commented Jun 30, 2022

Build succeeded:

@craig craig bot merged commit 05ca68a into cockroachdb:master Jun 30, 2022
@nvanbenschoten nvanbenschoten deleted the nvanbenschoten/skipLocked branch June 30, 2022 19:01