kvserver, batcheval: pin Engine state during read-only command evaluation #76312
Conversation
Hey @nvanbenschoten, does this match your expectation of what we had discussed in the last few weeks? This may need a few tests, but I wanted to confirm that the approach is close to what you had in mind. I confirmed that pinning the engine state eagerly (and doing nothing else) fails. cc @arulajmani
Force-pushed from 63fd520 to 0e68cf0.
We'd discussed that we should measure the overhead of the pinning we do in this patch, with and without the additional overhead of calling `GetCurrentClosedTimestamp()`.
Looks like the pinning alone has a pretty significant cost (I recall @sumeerbhola had been mildly concerned about something like this?). I'll try diffing the CPU profiles between master and this patch.
Looks like the pinning alone has a pretty significant cost
Which benchmark is this?
The difference is that before, we were creating a `pebble.Iterator` when creating the first `pebbleIterator`, and then cloning it for future `pebbleIterator`s. So 1 create + (n-1) clones, while now we are doing 1 create + n clones. Most of the time n=1. An optimization would be to store a bool in `pebbleReadOnly` and `pebbleBatch` about whether the `{pebbleReadOnly,pebbleBatch}.iter` has been used before, and if not, don't bother cloning it.
There is also an existing bug where we may not close the iter:
```go
// Setting iter to nil is sufficient since it will be closed by one of the
// subsequent destroy calls.
p.iter = nil
```
This is not true if no one has created an MVCCIterator or EngineIterator. The bool will also help with fixing this.
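To make this concrete, here is a minimal, self-contained Go sketch of the suggested bookkeeping; it is not the actual `pkg/storage` code, and `readOnly`, `cachedIter`, `newIter`, and `iterUnused` are illustrative stand-ins. The bool lets the first caller take the cached iterator without cloning it, and lets `Close` detect (and fix) the leak described above:

```go
package main

import "fmt"

type iterator struct{ id int }

func (it *iterator) Close() { fmt.Printf("closed iter %d\n", it.id) }

// readOnly is a stand-in for pebbleReadOnly/pebbleBatch.
type readOnly struct {
	cachedIter *iterator // lazily created, then cloned for later callers
	iterUnused bool      // true while cachedIter has never been handed out
	nextID     int
}

func (ro *readOnly) newIter() *iterator {
	if ro.cachedIter == nil {
		// First request: create the iterator and mark it as not yet handed out.
		ro.nextID++
		ro.cachedIter = &iterator{id: ro.nextID}
		ro.iterUnused = true
	}
	if ro.iterUnused {
		// Hand out the cached iterator itself: 1 create + (n-1) clones,
		// instead of unconditionally cloning (1 create + n clones).
		ro.iterUnused = false
		return ro.cachedIter
	}
	// Later requests clone the pinned iterator so they share its snapshot
	// (modeled here as simply allocating another iterator).
	ro.nextID++
	return &iterator{id: ro.nextID}
}

func (ro *readOnly) Close() {
	// The fix for the leak: if the cached iterator was created but never
	// handed out, nothing else will close it, so close it here.
	if ro.cachedIter != nil && ro.iterUnused {
		ro.cachedIter.Close()
	}
	ro.cachedIter = nil
}

func main() {
	ro := &readOnly{}
	it := ro.newIter() // first caller reuses the cached iterator, no clone
	_ = it
	ro.Close()
}
```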
Reviewed 4 of 14 files at r1, 4 of 6 files at r2.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15 and @sumeerbhola)
pkg/kv/kvserver/replica_eval_context.go, line 55 at r2 (raw file):
```go
// GetClosedTimestamp implements the EvalContext interface.
func (ec *evalContextImpl) GetClosedTimestamp(_ context.Context) hlc.Timestamp {
```
this is implementing `ImmutableEvalContext`, yes? Odd that it is named immutable, but the `closedTS` field is protected by a mutex. Who modifies it?
Force-pushed from 1fa3a1d to 099c920.
Which benchmark is this?
I should've mentioned it before, but it was just kv95 running for 5 mins.
```
./cockroach workload run kv --init --tolerate-errors --concurrency=192 --splits=1000 --ramp=1m --duration=5m --read-percent=95 --min-block-bytes=4096 --max-block-bytes=4096 {pgurl:1-5}
```
I added the bookkeeping you mentioned to avoid redundantly cloning the iterator an extra time. Here are the numbers with that change (I'm not sure if they're directly comparable to the numbers I posted above, since I don't remember what cluster size I ran the previous numbers on):
```
name \ ops/sec   master.txt   pin-with-closedts.txt   pin-wo-closedts.txt
.                108k ± 2%    112k ± 1%               114k ± 0%

name \ p50       master.txt   pin-with-closedts.txt   pin-wo-closedts.txt
.                1.17 ± 6%    1.10 ± 0%               1.10 ± 0%

name \ p95       master.txt   pin-with-closedts.txt   pin-wo-closedts.txt
.                5.93 ± 2%    5.58 ± 4%               5.50 ± 0%

name \ p99       master.txt   pin-with-closedts.txt   pin-wo-closedts.txt
.                11.7 ± 3%    11.0 ± 0%               11.0 ± 0%
```
Surprisingly, it doesn't seem like any of the locking inside `Replica.GetCurrentClosedTimestamp()` has much of an effect on the workload. cc @nvanbenschoten.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @sumeerbhola)
pkg/kv/kvserver/replica_eval_context.go, line 55 at r2 (raw file):
Previously, sumeerbhola wrote…
this is implementing `ImmutableEvalContext`, yes? Odd that it is named immutable, but the `closedTS` field is protected by a mutex. Who modifies it?
You're right, we didn't need the mutex here.
@nvanbenschoten mentioned that we should run the numbers with/without the call into `GetCurrentClosedTimestamp()`. I ran them again for 10 iterations this time, and here they are.
Force-pushed from c0a1980 to b5a3994.
Force-pushed from b5a3994 to ac8da77.
In cockroachdb#76312, we're making it such that a replica's closed timestamp is computed during the creation of the batch request's evaluation context. This is to ensure that a replica's closed timestamp is captured before we pin the state of its storage engine (since otherwise, the closed timestamp could be _newer_ than the state of the engine the command is evaluating over). See: cockroachdb#70974.

However, computing the replica's current closed timestamp on every command evaluation can be expensive due to the mutual exclusion involved (it requires that we hold `Replica.mu` as well as a few locks inside the closed timestamp side transport). Since only the `QueryResolvedTimestampRequest` and `SubsumeRequest` need to read a replica's closed timestamp, this commit makes it such that we elide this work for every other request.

Release note: None
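As a rough illustration of the elision this describes (all types and the `needsClosedTimestamp` helper below are stand-ins, not the real kvserver API):

```go
package main

import "fmt"

type timestamp struct{ wallNanos int64 }

type replica struct{}

// Stand-in for Replica.GetCurrentClosedTimestamp, which is expensive because
// it takes Replica.mu plus locks inside the closed timestamp side transport.
func (r *replica) currentClosedTimestamp() timestamp {
	return timestamp{wallNanos: 123}
}

type evalContext struct {
	r        *replica
	closedTS timestamp // only populated when the batch needs it
}

// Only QueryResolvedTimestamp and Subsume read the replica's closed timestamp.
func needsClosedTimestamp(method string) bool {
	return method == "QueryResolvedTimestamp" || method == "Subsume"
}

func newEvalContext(r *replica, method string) *evalContext {
	ec := &evalContext{r: r}
	if needsClosedTimestamp(method) {
		// Capture the closed timestamp before the engine state is pinned, so
		// it can never be newer than the snapshot the command evaluates over.
		ec.closedTS = r.currentClosedTimestamp()
	}
	return ec
}

func main() {
	ec := newEvalContext(&replica{}, "Get")
	fmt.Println(ec.closedTS) // zero value: a Get never pays for the locking
}
```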
Force-pushed from ac8da77 to 5795da6.
Thanks for gathering those extra numbers. Is the only difference between `pin-with-closedts` and `pin-wo-closedts` the call into `GetCurrentClosedTimestamp()`?
Yep.
Yes, I've added a commit to this PR that makes it such that most requests won't call into `GetCurrentClosedTimestamp()`.
For even more completeness, do you have this direct comparison between master and the last commit of this PR?
Here is
Reviewed 2 of 14 files at r1, 1 of 6 files at r2, 11 of 15 files at r3, 4 of 4 files at r4, 4 of 4 files at r5, 7 of 7 files at r6, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15 and @sumeerbhola)
pkg/kv/kvserver/replica_eval_context.go, line 39 at r4 (raw file):
```go
func newEvalContextImpl(r *Replica) *evalContextImpl {
	return &evalContextImpl{
```
Should we pool these objects to avoid the new heap allocation? Their lifetime should be pretty well defined.
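Something like the following `sync.Pool`-based sketch, perhaps; the `evalContextImpl` fields here are simplified placeholders for the real struct:

```go
package main

import "sync"

type replica struct{}

// Simplified stand-in for kvserver's evalContextImpl.
type evalContextImpl struct {
	r        *replica
	closedTS int64
}

var evalCtxPool = sync.Pool{
	New: func() interface{} { return &evalContextImpl{} },
}

func newEvalContextImpl(r *replica, closedTS int64) *evalContextImpl {
	ec := evalCtxPool.Get().(*evalContextImpl)
	ec.r, ec.closedTS = r, closedTS
	return ec
}

// release must be called exactly once, after command evaluation finishes;
// the well-defined lifetime is what makes pooling safe here.
func (ec *evalContextImpl) release() {
	*ec = evalContextImpl{} // don't retain the replica across reuse
	evalCtxPool.Put(ec)
}

func main() {
	ec := newEvalContextImpl(&replica{}, 42)
	defer ec.release()
	_ = ec
}
```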
pkg/kv/kvserver/replica_eval_context.go, line 41 at r4 (raw file):
```go
	return &evalContextImpl{
		Replica:  r,
		closedTS: r.GetCurrentClosedTimestamp(context.TODO()),
```
Let's plumb a `ctx` to here.
pkg/kv/kvserver/batcheval/eval_context.go, line 142 at r4 (raw file):
```go
}

// ImmutableEvalContext is like EvalContext, but it encapsulates state that
```
The idea was to pass only this interface to read-only command evaluation functions, right? To do that, we may need to pull the `EvalContext` out of `CommandArgs`. Even if that is the plan, we can do it in a separate PR.
pkg/kv/kvserver/batcheval/eval_context.go, line 147 at r4 (raw file):
```go
// GetClosedTimestamp returns the current closed timestamp on the range.
// It is expected that a caller will have performed some action (either
// calling RevokeLease or WatchForMerge) to freeze further progression of
```
This comment reminds me that `GetClosedTimestamp` is also used during range subsumption and lease transfers. In those cases, we want the latest closed timestamp, not the closed timestamp at the time of the storage snapshot. The ordering is very important for those cases, as they need the closed timestamp to be computed after their call to `WatchForMerge` or `RevokeLease`. Do you have thoughts on what we should do here? Maybe expose two methods?
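One possible shape for the two-method idea, sketched with stand-in types (this is not the actual `EvalContext` interface): a pinned accessor for reads that must agree with the storage snapshot, and a read-through accessor for subsumption and lease transfers:

```go
package main

type timestamp struct{ wallNanos int64 }

// Captured when the EvalContext is created; stable for the life of the
// command, and guaranteed not newer than the pinned engine state.
type immutableEvalContext interface {
	GetClosedTimestamp() timestamp
}

// Reads through to the Replica at call time; subsumption and lease
// transfers would use this after WatchForMerge / RevokeLease.
type evalContext interface {
	immutableEvalContext
	GetCurrentClosedTimestamp() timestamp
}

func main() {}
```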
pkg/storage/pebble.go, line 1681 at r4 (raw file):
```go
	iter   cloneableIter
	unused bool
```
nit: should this be called `iterUnused`? Is it clear what it's referring to? Same question below in `pebbleBatch`.
Reviewed 5 of 15 files at r3, 3 of 4 files at r4.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15, @nvanbenschoten, and @sumeerbhola)
pkg/storage/pebble.go, line 1681 at r4 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
nit: should this be called `iterUnused`? Is it clear what it's referring to? Same question below in `pebbleBatch`.
+1 to `iterUnused`
85993: kvserver: allow certain read-only requests to drop latches before evaluation r=aayushshah15 a=aayushshah15

This commit introduces a change to the way certain types of read-only requests are evaluated. Traditionally, read-only requests have held their latches throughout their execution. This commit allows certain qualifying reads to release their latches earlier.

At a high level, reads may attempt to resolve all conflicts upfront by performing a sort of "validation" phase before they perform their MVCC scan. This validation phase performs a scan of the lock table keyspace in order to find any conflicting intents that may need to be resolved before the actual evaluation of the request over the MVCC keyspace. If no conflicting intents are found, then (since #76312) the request is guaranteed to be fully isolated against all other concurrent requests, and can be allowed to release its latches at this point. This allows the actual evaluation of the read (over the MVCC part of the keyspace) to proceed without latches being held, which is the main motivation of this work. This validation phase can be thought of as an extension of the validation that the concurrency manager already performs when requests are sequenced through it, by trying to detect any conflicting intents that have already been pulled into the in-memory lock table.

Additionally, for certain types of requests that can drop their latches early and do not need to access the `IntentHistory` for any of their parent txn's intents, this commit attempts to make their MVCC scan cheaper by eliminating the need for an `intentInterleavingIterator`. This is enabled by the observation that once the validation phase is complete, the only remaining intents in the read's declared span must be intents belonging to the reader's transaction. So if the reader doesn't need to read an intent that isn't the latest intent on a key, it doesn't need access to the key's `IntentHistory` (which lives in the lock-table keyspace), and doesn't need to use an `intentInterleavingIterator`.

Release note (performance improvement): certain types of reads will now have a far smaller contention footprint with conflicting concurrent writers.

Resolves #66485

Release justification: high benefit change to existing functionality, part of 22.2 roadmap

Co-authored-by: Aayush Shah <[email protected]>
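A very rough, self-contained sketch of the control flow described above; every type below is a stand-in for kvserver internals, and the point is only the ordering (pin the engine state, validate the lock table, drop latches, then do the MVCC scan):

```go
package main

import "fmt"

type guard struct{ held bool }

func (g *guard) release() { g.held = false }

type reader struct{ pinned bool }

func (r *reader) pinEngineState()                 { r.pinned = true } // from #76312
func (r *reader) scanLockTable() (conflicts bool) { return false }    // validation phase
func (r *reader) mvccScan()                       { fmt.Println("scanning without latches") }

func evalReadOnly(g *guard, r *reader) {
	r.pinEngineState() // the storage snapshot is fixed before latches drop
	if !r.scanLockTable() {
		// No conflicting intents: the read is fully isolated from all other
		// concurrent requests, so the latches are no longer needed.
		g.release()
	}
	r.mvccScan()
}

func main() {
	evalReadOnly(&guard{held: true}, &reader{})
}
```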
This commit makes it such that we eagerly pin the engine state of the `Reader` created during the evaluation of read-only requests.

Generally, reads will hold latches throughout the course of their evaluation (particularly, while they do their `MVCCScan`). Mainly, this commit paves the way for us to move to a world where we avoid holding latches during the `MVCCScan`. Additionally, it also lets us make MVCC garbage collection latchless, as described in #55293.
There are a few notable changes in this patch (a rough sketch of the resulting shape follows this description):

- Pinning the engine state eagerly runs into "kv: QueryResolvedTimestamp iterator snapshot may be captured before closed timestamp" #70974. To resolve this, the closed timestamp of the `Replica` is now captured at the time the `EvalContext` is created, and not during the command evaluation of `QueryResolvedTimestampRequest`.
- `EvalContext` now has an `ImmutableEvalContext` embedded into it. The `ImmutableEvalContext` is supposed to encapsulate state that must not change after the `EvalContext` is created. The closed timestamp of the replica is part of the `ImmutableEvalContext`.
- `Replica` no longer fully implements the `EvalContext` interface. Instead, it implements everything but `GetClosedTimestamp()` (which is implemented by `ImmutableEvalContext` instead).

Relates to #55293
Resolves #55461
Resolves #70974
Release note: None
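For illustration, a self-contained sketch of the shape described above; the names mirror this description, but the fields and interfaces are heavily simplified stand-ins:

```go
package main

import "fmt"

type timestamp struct{ wallNanos int64 }

type immutableEvalContext interface {
	GetClosedTimestamp() timestamp
}

type evalContext interface {
	immutableEvalContext
	Desc() string // stands in for the rest of the EvalContext surface
}

type replica struct{}

func (r *replica) Desc() string { return "r1" }

func (r *replica) currentClosedTimestamp() timestamp { return timestamp{100} }

// evalContextImpl = Replica plus a closed timestamp pinned at creation, so
// Replica itself no longer needs to implement GetClosedTimestamp.
type evalContextImpl struct {
	*replica
	closedTS timestamp
}

func (ec *evalContextImpl) GetClosedTimestamp() timestamp { return ec.closedTS }

func newEvalContextImpl(r *replica) *evalContextImpl {
	// Capture the closed timestamp before the engine state is pinned, so it
	// can never be newer than the snapshot the command evaluates over.
	return &evalContextImpl{replica: r, closedTS: r.currentClosedTimestamp()}
}

func main() {
	var ec evalContext = newEvalContextImpl(&replica{})
	fmt.Println(ec.Desc(), ec.GetClosedTimestamp())
}
```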