Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

storage: use lease start time to determine follower read safety #35130

Conversation

ajwerner
Copy link
Contributor

@ajwerner ajwerner commented Feb 21, 2019

This change obviously needs some testing. I'm putting it out to make sure I
understand the issue and proposed solution.

This change attempts to address a situation where a lease transfer may prevent
a follower read. This happens because after the lease transfer we no longer have
any information in the closedts storage regarding the range from the new
leaseholder.

Fixes #35129.
Fixes #32980.

Release note: None

@ajwerner ajwerner requested review from nvanbenschoten and a team February 21, 2019 19:07
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@@ -66,6 +66,12 @@ func (r *Replica) canServeFollowerRead(
r.store.cfg.ClosedTimestamp.Storage.(*ctstorage.MultiStorage).StringForNodes(lErr.LeaseHolder.NodeID),
)
}
// We may be able to serve the read based on the lease start time.
// It is safe to serve a read for requests which occur before this
// Replica's view of the lease start time.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is because anything proposed by a previous leaseholder and yet unobserved won't apply under this new lease, and under this lease nothing can be proposed that would affect timestamps earlier than the start time of the lease (in practice, this is enforced by the timestamp cache on the leaseholder).

The one concern that I have (which is probably not a real one but needs to be sleuthed) is that of lease extensions and how they affect the Start timestamp. We basically want Start to never move forward during the tenure of a single leaseholder (as measured via the lease sequence number).

It also seems that this works for expiration-based leases as well, so we do seem to have some kind of follower reads for expiration-based ranges, even though it's a pretty shitty one since the Start timestamp won't move forward.

Copy link
Member

@nvanbenschoten nvanbenschoten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this is basically what I had in mind. Thanks for getting the PR out so quickly.

We're also going to want this logic to help keep rangefeed closed timestamps moving forward on lease transfers as well. Here's one proposal to accomplish this and remove duplicated code:

  1. get rid of Provider.CanServe(ts), it should be equivalent to !Provider.MaxClosed.Less(ts). Switch over all tests
  2. create a new Replica.MaxClosed method that takes the closed timestamp and the lease into consideration
  3. use this new method in Replica.canServeFollowerRead and Replica.handleClosedTimestampUpdateRaftMuLocked

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @nvanbenschoten, and @tbg)


pkg/storage/replica_follower_read.go, line 71 at r1 (raw file):
This is where that logic lives. Your intuition holds for everything except for if the min proposed timestamp is set. In that case, we forward the start timestamp to this value even though we're performing an extension. This appears to still work out though, based on the comment:

// Note also that leasePostApply makes sure to update the timestamp cache in
// this case: even though the lease holder does not change, the the sequence
// number does and this triggers a low water mark bump.

Lease.Equivalent will return false if the Start timestamp of the lease changes, meaning that it will trigger a timestamp cache bump, so the comment checks out.


pkg/storage/replica_follower_read.go, line 72 at r1 (raw file):

			// It is safe to serve a read for requests which occur before this
			// Replica's view of the lease start time.
			if ba.Timestamp.Less(lease.Start) {

I think this should actually be !lease.Start.Less(ba.Timestamp). A leaseholder shouldn't be able to write exactly at its lease's start timestamp.

@ajwerner ajwerner force-pushed the ajwerner/use-lease-start-time-to-determine-follower-read-safety branch from 184c63c to f54ba93 Compare February 21, 2019 23:23
@ajwerner
Copy link
Contributor Author

@nvanbenschoten this still needs some more testing, especially targeted around lease transfers but PTAL and see if this is what you had in mind.

Copy link
Member

@nvanbenschoten nvanbenschoten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is exactly what I had in mind. @tbg does anything stick out to you as obviously wrong about switching from CanServe to MaxClosed?

Of course, this only makes me more excited for both targetted testing around lease transfers and more general invariant testing with lease movement, splitting, merging, rebalancing, and chaos 😃

Reviewed 6 of 6 files at r2.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @tbg)


pkg/storage/replica_follower_read.go, line 94 at r2 (raw file):

	maxClosed := r.store.cfg.ClosedTimestamp.Provider.MaxClosed(
		lease.Replica.NodeID, r.RangeID, ctpb.Epoch(lease.Epoch), ctpb.LAI(lai))
	if maxClosed.IsEmpty() || maxClosed.Less(lease.Start) {

We can replace this block with maxClosed.Forward(lease.Start)

Copy link
Member

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice simplification, I like it.

Reviewed 6 of 6 files at r2.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained

@ajwerner ajwerner force-pushed the ajwerner/use-lease-start-time-to-determine-follower-read-safety branch from f54ba93 to bc52593 Compare February 22, 2019 15:27
@ajwerner ajwerner force-pushed the ajwerner/use-lease-start-time-to-determine-follower-read-safety branch from bc52593 to fbde377 Compare March 4, 2019 15:21
@ajwerner
Copy link
Contributor Author

ajwerner commented Mar 4, 2019

I had been avoiding pushing an update to this diff because I was writing more tests on splits and merges but those tests have turned out to not be passing and need more investigation so I took the lease related testing from my testing branch and plopped it over here. Should be RFAL.

@ajwerner ajwerner force-pushed the ajwerner/use-lease-start-time-to-determine-follower-read-safety branch from fbde377 to 171e033 Compare March 8, 2019 20:23
@ajwerner
Copy link
Contributor Author

ajwerner commented Mar 8, 2019

I added another commit here which makes follower reads robust to splits. I could split it out into a separate PR if that makes this easier to review but the testing relied on code from the original PR.

@ajwerner ajwerner force-pushed the ajwerner/use-lease-start-time-to-determine-follower-read-safety branch 3 times, most recently from 9935227 to e9ca3ff Compare March 11, 2019 13:27
@ajwerner
Copy link
Contributor Author

@nvanbenschoten PTAL when you get a chance.

@ajwerner ajwerner force-pushed the ajwerner/use-lease-start-time-to-determine-follower-read-safety branch from e9ca3ff to ac5ac0d Compare March 11, 2019 22:10
Copy link
Member

@nvanbenschoten nvanbenschoten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 4 of 4 files at r3, 6 of 6 files at r4.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @nvanbenschoten)


pkg/storage/closed_timestamp_test.go, line 121 at r3 (raw file):

	// that we can always read from this timestamp from all replicas even while
	// lease transfers are ongoing. The test launches a goroutine to randomly
	// trigger transfers every

Finish this sentence :)


pkg/storage/closed_timestamp_test.go, line 154 at r3 (raw file):

	g.Go(transferLeasesRandomlyUntilDeadline)

	// Attempt to send read requests to	a replica in a tight loop until deadline

Weird spacing.


pkg/storage/replica.go, line 318 at r4 (raw file):

		// initialMaxClosed is the initial maxClosed timestamp for the replica as known
		// from its left-hand-side upon creation.
		initialMaxClosed hlc.Timestamp

I'm curious what your proposal was for informing the closed timestamp infrastructure of splits instead of introducing this extra state. It doesn't seem like much of a violation to inform a Provider that all of its knowledge about a (node, rangeID) also applies to a different (node, rangeID).


pkg/storage/replica_follower_read.go, line 94 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

We can replace this block with maxClosed.Forward(lease.Start)

Looks like this wasn't addressed. We can do the same for initialMaxClosed as well.

@ajwerner
Copy link
Contributor Author

I'm curious what your proposal was for informing the closed timestamp infrastructure of splits instead of introducing this extra state. It doesn't seem like much of a violation to inform a Provider that all of its knowledge about a (node, rangeID) also applies to a different (node, rangeID).

It was sort of gross. Modifying the ctpb.Entry values in storage is weird. The basic idea was the clone the MLIA map value in the entries in storage. This feels weird because these are legitimately messages from a remote store that I'll be modifying. That change to the API then has to propagate through the provider. The bigger issue is picking the right MLAI. Before I niavely just cloned what was in there but I don't think that works as there could have been a new update after the split. We sort of need to either have the LAI of the split itself. I don't exactly know where to find that. I'm wavering on whether that's the right path to follow.

@nvanbenschoten
Copy link
Member

The bigger issue is picking the right MLAI. Before I niavely just cloned what was in there but I don't think that works as there could have been a new update after the split.

I see, we want the largest closed timestamp due to an MLAI at or before the index of the split. Conveniently, the way you have it now gives exactly those semantics. I guess I'm ok with this approach then, although @tbg should also take a look.

Also, make sure you add a nice comment about how this works right at the point where we set initialMaxClosed. The dependencies there are subtle. If that method is called with a r.mu.state.LeaseAppliedIndex > leaseAppliedIndexOfSplit then we could get an incorrect result and would start serving inconsistent follower reads.

This change recognizes that lease transfer start times have provide write
invariants equivalent to those provided by the closed timestamp and MLAI and
thus the maximum closed timestamp for a replica can be viewed as the maximum
timestamp of that of the current replica lease start time and the timestamp
provided by the closed timestamp provider. Utilizing this fact makes follower
reads as well as rangefeeds more robust to lease transfers. Before this change
lease transfers to different nodes might cause closed timestamp information for
a range to become unavailable leading to an inability for a range to serve a
read at a timestamp which was previously determined to be safe.

Fixes cockroachdb#35129.

Release note: None
This commit prevents discontinuities in ability to read a value from a follower
after a split. The invariant we want to uphold is that values which were safe
to read from a follower will remain safe to read following the split. In the
face of merges the system provides the claim that timestamps which were readable
on both the left and right sides of the merge will remain readable after the
merge.

Release note: None
@ajwerner ajwerner force-pushed the ajwerner/use-lease-start-time-to-determine-follower-read-safety branch from ac5ac0d to 4c0cbf8 Compare March 12, 2019 18:30
Copy link
Contributor Author

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a comment. Waiting on @tbg

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)


pkg/storage/replica.go, line 318 at r4 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

I'm curious what your proposal was for informing the closed timestamp infrastructure of splits instead of introducing this extra state. It doesn't seem like much of a violation to inform a Provider that all of its knowledge about a (node, rangeID) also applies to a different (node, rangeID).

Discussed in github. The problems are twofold, 1 is that it's awkward to modify a ctpb.Entry from a different node and creating an interface to do so feels wrong and 2 is that it's hard to know the right MLAI value to put into that entry. Blindly cloning the current value for the LHS might be wrong. I think a reasonable value could be determined but it feels wrong to have to add logic to figure that out.


pkg/storage/replica_follower_read.go, line 94 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Looks like this wasn't addressed. We can do the same for initialMaxClosed as well.

Done. Much cleaner.

Copy link
Member

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 4 files at r3, 7 of 7 files at r5, 6 of 6 files at r6.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)

@ajwerner
Copy link
Contributor Author

bors r+

@craig
Copy link
Contributor

craig bot commented Mar 14, 2019

Build failed

@ajwerner
Copy link
Contributor Author

bors r+

craig bot pushed a commit that referenced this pull request Mar 14, 2019
35130: storage: use lease start time to determine follower read safety r=ajwerner a=ajwerner

This change obviously needs some testing. I'm putting it out to make sure I
understand the issue and proposed solution.

This change attempts to address a situation where a lease transfer may prevent
a follower read. This happens because after the lease transfer we no longer have
any information in the closedts storage regarding the range from the new
leaseholder.

Fixes #35129.
Fixes #32980.

Release note: None

Co-authored-by: Andrew Werner <[email protected]>
@craig
Copy link
Contributor

craig bot commented Mar 14, 2019

Build succeeded

@craig craig bot merged commit 4c0cbf8 into cockroachdb:master Mar 14, 2019
ajwerner added a commit to ajwerner/cockroach that referenced this pull request Mar 15, 2019
This PR addresses a test flake introduced by enabling follower reads in
conjunction with cockroachdb#35130 which makes follower reads more generally possible
in the face of lease transfer.

Fixes cockroachdb#35758.

Release note: None
ajwerner added a commit to ajwerner/cockroach that referenced this pull request Mar 16, 2019
This PR addresses a test flake introduced by enabling follower reads in
conjunction with cockroachdb#35130 which makes follower reads more generally possible
in the face of lease transfer.

Fixes cockroachdb#35758.

Release note: None
craig bot pushed a commit that referenced this pull request Mar 18, 2019
35284: storage,kv: make transaction deadline exceeded errors retriable r=andreimatei a=andreimatei

Before this patch, they were opaque TransactionStatusErrors.
The belief is that we should only be seeing such errors when a
transaction is pushed by minutes. Shockingly, this seems to hapen enough
in our tests, for example as described here: #18684 (comment)

This patch marks the error as retriable, since it technically is.

This patch also changes the semantics of the
EndTransactionRequest.Deadline field to make it exclusive so that it
matches the nature of SQL leases. No migration needed.

Touches #18684

Release note (sql change): "transaction deadline exceeded" errors are
now returned to the client with a retriable code.

35793: storage: fix TestRangeInfo flake and re-enable follower reads by default r=ajwerner a=ajwerner

This PR addresses a test flake introduced by enabling follower reads in
conjunction with #35130 which makes follower reads more generally possible
in the face of lease transfer.

Fixes #35758.

Release note: None

35865: roachtest: Skip flaky jepsen nemesis r=tbg a=bdarnell

See #35599

Release note: None

Co-authored-by: Andrei Matei <[email protected]>
Co-authored-by: Andrew Werner <[email protected]>
Co-authored-by: Ben Darnell <[email protected]>
aayushshah15 added a commit to aayushshah15/cockroach that referenced this pull request May 28, 2021
This commit reverts the main change introduced in cockroachdb#35130. That change made it
such that each replica could bootstrap its `maxClosed` timestamp value with the
start time of the latest lease that it knew about. However, it doesn't consider
the fact that when the lease transfer occurred, the range may have been in a
subsumed state and thus, is not allowed to serve any requests past the
subsumption time. Bumping a replica's closed timestamp by the lease start time
allows for a bug where the closed timestamp of a replica may be advanced past
the subsumption time of a range, which would allow the range's non-leaseholder
replicas to serve follower reads past its subsumption time.

Such a sequence of events would in turn allow follower reads queries to serve
results that could be invalidated by future writes on the keyspan owned by the
subsumed range.

Release note (bug fix): A rare bug present in 20.2 that allowed `AS OF SYSTEM
TIME` queries to serve inconsistent results has been resolved.
aayushshah15 added a commit to aayushshah15/cockroach that referenced this pull request Jul 9, 2021
This commit reverts the main change introduced in cockroachdb#35130. That change made it
such that each replica could bootstrap its `maxClosed` timestamp value with the
start time of the latest lease that it knew about. However, it doesn't consider
the fact that when the lease transfer occurred, the range may have been in a
subsumed state and thus, is not allowed to serve any requests past the
subsumption time. Bumping a replica's closed timestamp by the lease start time
allows for a bug where the closed timestamp of a replica may be advanced past
the subsumption time of a range, which would allow the range's non-leaseholder
replicas to serve follower reads past its subsumption time.

Such a sequence of events would in turn allow follower reads queries to serve
results that could be invalidated by future writes on the keyspan owned by the
subsumed range.

Release note (bug fix): A rare bug in 20.2 that manifested itself when a lease
change occured during a range merge has been resolved. This bug allowed `AS OF
SYSTEM TIME` queries to serve inconsistent results.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants