storage: use lease start time to determine follower read safety #35130
Conversation
pkg/storage/replica_follower_read.go
Outdated
@@ -66,6 +66,12 @@ func (r *Replica) canServeFollowerRead(
	r.store.cfg.ClosedTimestamp.Storage.(*ctstorage.MultiStorage).StringForNodes(lErr.LeaseHolder.NodeID),
)
}
// We may be able to serve the read based on the lease start time.
// It is safe to serve a read for requests which occur before this
// Replica's view of the lease start time.
This is because anything proposed by a previous leaseholder and yet unobserved won't apply under this new lease, and under this lease nothing can be proposed that would affect timestamps earlier than the start time of the lease (in practice, this is enforced by the timestamp cache on the leaseholder).
The one concern that I have (which is probably not a real one but needs to be sleuthed) is that of lease extensions and how they affect the `Start` timestamp. We basically want `Start` to never move forward during the tenure of a single leaseholder (as measured via the lease sequence number).
It also seems that this works for expiration-based leases as well, so we do seem to have some kind of follower reads for expiration-based ranges, even though it's a pretty weak one since the `Start` timestamp won't move forward.
Yeah, this is basically what I had in mind. Thanks for getting the PR out so quickly.
We're also going to want this logic to help keep rangefeed closed timestamps moving forward on lease transfers as well. Here's one proposal to accomplish this and remove duplicated code:
- get rid of `Provider.CanServe(ts)`; it should be equivalent to `!Provider.MaxClosed().Less(ts)`. Switch over all tests
- create a new `Replica.MaxClosed` method that takes the closed timestamp and the lease into consideration
- use this new method in `Replica.canServeFollowerRead` and `Replica.handleClosedTimestampUpdateRaftMuLocked`
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @nvanbenschoten, and @tbg)
pkg/storage/replica_follower_read.go, line 71 at r1 (raw file):
This is where that logic lives. Your intuition holds for everything except when the min proposed timestamp is set. In that case, we forward the start timestamp to this value even though we're performing an extension. This appears to still work out though, based on the comment:
// Note also that leasePostApply makes sure to update the timestamp cache in
// this case: even though the lease holder does not change, the sequence
// number does and this triggers a low water mark bump.
`Lease.Equivalent` will return `false` if the `Start` timestamp of the lease changes, meaning that it will trigger a timestamp cache bump, so the comment checks out.
pkg/storage/replica_follower_read.go, line 72 at r1 (raw file):
// It is safe to serve a read for requests which occur before this
// Replica's view of the lease start time.
if ba.Timestamp.Less(lease.Start) {

I think this should actually be `!lease.Start.Less(ba.Timestamp)`. A leaseholder shouldn't be able to write exactly at its lease's start timestamp.
(force-pushed 184c63c → f54ba93)
@nvanbenschoten this still needs some more testing, especially targeted around lease transfers, but PTAL and see if this is what you had in mind.
Yes, this is exactly what I had in mind. @tbg does anything stick out to you as obviously wrong about switching from `CanServe` to `MaxClosed`?
Of course, this only makes me more excited for both targeted testing around lease transfers and more general invariant testing with lease movement, splitting, merging, rebalancing, and chaos 😃
Reviewed 6 of 6 files at r2.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @tbg)
pkg/storage/replica_follower_read.go, line 94 at r2 (raw file):
maxClosed := r.store.cfg.ClosedTimestamp.Provider.MaxClosed(
	lease.Replica.NodeID, r.RangeID, ctpb.Epoch(lease.Epoch), ctpb.LAI(lai))
if maxClosed.IsEmpty() || maxClosed.Less(lease.Start) {

We can replace this block with `maxClosed.Forward(lease.Start)`.
Nice simplification, I like it.
Reviewed 6 of 6 files at r2.
Reviewable status: complete! 0 of 0 LGTMs obtained
(force-pushed f54ba93 → bc52593)
(force-pushed bc52593 → fbde377)
I had been avoiding pushing an update to this diff because I was writing more tests on splits and merges, but those tests have turned out not to be passing and need more investigation, so I took the lease-related testing from my testing branch and plopped it over here. Should be RFAL.
(force-pushed fbde377 → 171e033)
I added another commit here which makes follower reads robust to splits. I could split it out into a separate PR if that makes this easier to review, but the testing relied on code from the original PR.
(force-pushed 9935227 → e9ca3ff)
@nvanbenschoten PTAL when you get a chance.
(force-pushed e9ca3ff → ac5ac0d)
Reviewed 4 of 4 files at r3, 6 of 6 files at r4.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @nvanbenschoten)
pkg/storage/closed_timestamp_test.go, line 121 at r3 (raw file):
// that we can always read from this timestamp from all replicas even while
// lease transfers are ongoing. The test launches a goroutine to randomly
// trigger transfers every
Finish this sentence :)
pkg/storage/closed_timestamp_test.go, line 154 at r3 (raw file):
g.Go(transferLeasesRandomlyUntilDeadline)
// Attempt to send read requests to a replica in a tight loop until deadline
Weird spacing.
pkg/storage/replica.go, line 318 at r4 (raw file):
// initialMaxClosed is the initial maxClosed timestamp for the replica as known
// from its left-hand-side upon creation.
initialMaxClosed hlc.Timestamp
I'm curious what your proposal was for informing the closed timestamp infrastructure of splits instead of introducing this extra state. It doesn't seem like much of a violation to inform a Provider that all of its knowledge about a (node, rangeID) also applies to a different (node, rangeID).
pkg/storage/replica_follower_read.go, line 94 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
We can replace this block with `maxClosed.Forward(lease.Start)`

Looks like this wasn't addressed. We can do the same for `initialMaxClosed` as well.
It was sort of gross. Modifying the ctpb.Entry values in storage is weird. The basic idea was to clone the MLAI map value in the entries in storage. This feels weird because these are legitimately messages from a remote store that I'd be modifying. That change to the API then has to propagate through the provider. The bigger issue is picking the right MLAI. Before, I naively just cloned what was in there, but I don't think that works, as there could have been a new update after the split. We sort of need to have the LAI of the split itself, and I don't exactly know where to find that. I'm wavering on whether that's the right path to follow.
I see, we want the largest closed timestamp due to an MLAI at or before the index of the split. Conveniently, the way you have it now gives exactly those semantics. I guess I'm ok with this approach then, although @tbg should also take a look. Also, make sure you add a nice comment about how this works right at the point where we set
This change recognizes that lease transfer start times provide write invariants equivalent to those provided by the closed timestamp and MLAI, and thus the maximum closed timestamp for a replica can be viewed as the maximum of the current replica lease start time and the timestamp provided by the closed timestamp provider. Utilizing this fact makes follower reads as well as rangefeeds more robust to lease transfers. Before this change, lease transfers to different nodes might cause closed timestamp information for a range to become unavailable, leading to an inability for a range to serve a read at a timestamp which was previously determined to be safe. Fixes cockroachdb#35129. Release note: None
This commit prevents discontinuities in the ability to read a value from a follower after a split. The invariant we want to uphold is that values which were safe to read from a follower remain safe to read following the split. In the face of merges, the system provides the guarantee that timestamps which were readable on both the left and right sides of the merge remain readable after the merge. Release note: None
(force-pushed ac5ac0d → 4c0cbf8)
Added a comment. Waiting on @tbg
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)
pkg/storage/replica.go, line 318 at r4 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
I'm curious what your proposal was for informing the closed timestamp infrastructure of splits instead of introducing this extra state. It doesn't seem like much of a violation to inform a Provider that all of its knowledge about a (node, rangeID) also applies to a different (node, rangeID).
Discussed on GitHub. The problems are twofold: (1) it's awkward to modify a ctpb.Entry from a different node, and creating an interface to do so feels wrong, and (2) it's hard to know the right MLAI value to put into that entry. Blindly cloning the current value for the LHS might be wrong. I think a reasonable value could be determined, but it feels wrong to have to add logic to figure that out.
pkg/storage/replica_follower_read.go, line 94 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Looks like this wasn't addressed. We can do the same for `initialMaxClosed` as well.
Done. Much cleaner.
Reviewed 1 of 4 files at r3, 7 of 7 files at r5, 6 of 6 files at r6.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)
bors r+
Build failed
bors r+
35130: storage: use lease start time to determine follower read safety r=ajwerner a=ajwerner

This change obviously needs some testing. I'm putting it out to make sure I understand the issue and proposed solution.

This change attempts to address a situation where a lease transfer may prevent a follower read. This happens because after the lease transfer we no longer have any information in the closedts storage regarding the range from the new leaseholder.

Fixes #35129.
Fixes #32980.

Release note: None

Co-authored-by: Andrew Werner <[email protected]>
Build succeeded
This PR addresses a test flake introduced by enabling follower reads in conjunction with cockroachdb#35130 which makes follower reads more generally possible in the face of lease transfer. Fixes cockroachdb#35758. Release note: None
35284: storage,kv: make transaction deadline exceeded errors retriable r=andreimatei a=andreimatei

Before this patch, they were opaque TransactionStatusErrors. The belief is that we should only be seeing such errors when a transaction is pushed by minutes. Shockingly, this seems to happen enough in our tests, for example as described here: #18684 (comment)
This patch marks the error as retriable, since it technically is. This patch also changes the semantics of the EndTransactionRequest.Deadline field to make it exclusive so that it matches the nature of SQL leases. No migration needed.

Touches #18684

Release note (sql change): "transaction deadline exceeded" errors are now returned to the client with a retriable code.

35793: storage: fix TestRangeInfo flake and re-enable follower reads by default r=ajwerner a=ajwerner

This PR addresses a test flake introduced by enabling follower reads in conjunction with #35130 which makes follower reads more generally possible in the face of lease transfers.

Fixes #35758.

Release note: None

35865: roachtest: Skip flaky jepsen nemesis r=tbg a=bdarnell

See #35599

Release note: None

Co-authored-by: Andrei Matei <[email protected]>
Co-authored-by: Andrew Werner <[email protected]>
Co-authored-by: Ben Darnell <[email protected]>
This commit reverts the main change introduced in cockroachdb#35130. That change made it such that each replica could bootstrap its `maxClosed` timestamp value with the start time of the latest lease that it knew about. However, it doesn't consider the fact that when the lease transfer occurred, the range may have been in a subsumed state and thus, is not allowed to serve any requests past the subsumption time. Bumping a replica's closed timestamp by the lease start time allows for a bug where the closed timestamp of a replica may be advanced past the subsumption time of a range, which would allow the range's non-leaseholder replicas to serve follower reads past its subsumption time. Such a sequence of events would in turn allow follower reads queries to serve results that could be invalidated by future writes on the keyspan owned by the subsumed range. Release note (bug fix): A rare bug present in 20.2 that allowed `AS OF SYSTEM TIME` queries to serve inconsistent results has been resolved.
This commit reverts the main change introduced in cockroachdb#35130. That change made it such that each replica could bootstrap its `maxClosed` timestamp value with the start time of the latest lease that it knew about. However, it doesn't consider the fact that when the lease transfer occurred, the range may have been in a subsumed state and thus, is not allowed to serve any requests past the subsumption time. Bumping a replica's closed timestamp by the lease start time allows for a bug where the closed timestamp of a replica may be advanced past the subsumption time of a range, which would allow the range's non-leaseholder replicas to serve follower reads past its subsumption time. Such a sequence of events would in turn allow follower reads queries to serve results that could be invalidated by future writes on the keyspan owned by the subsumed range. Release note (bug fix): A rare bug in 20.2 that manifested itself when a lease change occurred during a range merge has been resolved. This bug allowed `AS OF SYSTEM TIME` queries to serve inconsistent results.