kvserver: no sync between splits and follower reads can cause invalid reads #67016

Closed
andreimatei opened this issue Jun 29, 2021 · 8 comments · Fixed by #89886 or #91405


andreimatei commented Jun 29, 2021

On leaseholders, latches provide serialization between different classes of operations. On followers, however, latches don't provide the same protection, because requests that are simply applied on a follower don't take latches. Follower reads do take latches, but I think that's accidental and not useful, given that writes don't take latches when they are applied on followers.
At the very least, the following scenario, involving a follower read and a concurrently applying split, seems possible (a standalone sketch of this interleaving follows the list):

  1. a follower read across the whole range begins and checks the range descriptor
  2. the range split applies
  3. the RHS gets removed (deleting its data)
  4. the follower read misses the RHS keys
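
A standalone Go sketch of this interleaving (toy types, not the actual kvserver code): the reader validates the descriptor bounds first and only reads the data later, so a split plus RHS removal that lands in between silently shrinks the result.

```go
package main

import "fmt"

// replica is a toy stand-in for a follower replica: a descriptor bound plus
// the data currently present in the storage engine.
type replica struct {
	endKey string            // descriptor end key: "z" before the split, "m" after
	data   map[string]string // stand-in for the storage engine
}

func main() {
	r := &replica{
		endKey: "z",
		data:   map[string]string{"a": "1", "c": "2", "p": "3", "t": "4"},
	}

	// 1. A follower read over [a, z) begins and checks the descriptor: in bounds.
	readEnd := "z"
	if readEnd > r.endKey {
		fmt.Println("read rejected: outside range bounds")
		return
	}

	// 2.+3. The split applies and the RHS is removed, deleting the keys in
	// [m, z), before the read has pinned an engine snapshot.
	r.endKey = "m"
	for k := range r.data {
		if k >= "m" {
			delete(r.data, k)
		}
	}

	// 4. The read now scans the engine and silently misses the RHS keys.
	var seen []string
	for k := range r.data {
		if k < readEnd {
			seen = append(seen, k)
		}
	}
	fmt.Println("read returned:", seen) // "p" and "t" are missing
}
```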

In addition to latches, reads also take read locks on readOnlyCmdMu. Some range-level operations also take this lock, so the lock acts as a kind of structural synchronization. For example, I think removing a replica takes this lock (since #64324). Applying a split, however, does not. Maybe it should?

Alternatively, if step 1) in the scenario instead read "follower read across all of range begins, binds Pebble snapshot, checks range descriptor", I think the hazard would be avoided, because the "checking of the range descriptor" would provide sufficient synchronization with the split. Currently, reads don't bind their Pebble snapshot early (#55461), but I believe they could easily do so now that we have #66845. If we could bind the snapshot early, we could order it before the range descriptor check.
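
A sketch of that alternative ordering (again with hypothetical types rather than the real Pebble API): the snapshot is pinned before the descriptor check, so a split that applies after the snapshot cannot remove data from the read's view, and one that applied before it is caught by the bounds check.

```go
package main

import "fmt"

type engine struct{ data map[string]string }

// snapshot returns a point-in-time copy of the data, standing in for an
// eagerly bound Pebble snapshot.
func (e *engine) snapshot() map[string]string {
	snap := make(map[string]string, len(e.data))
	for k, v := range e.data {
		snap[k] = v
	}
	return snap
}

// followerRead binds the snapshot first and only then checks the descriptor.
func followerRead(eng *engine, descEndKey, readEnd string) ([]string, error) {
	snap := eng.snapshot() // 1. bind the engine snapshot

	if readEnd > descEndKey { // 2. check the (current) range descriptor
		return nil, fmt.Errorf("read up to %q is outside range ending at %q", readEnd, descEndKey)
	}

	var keys []string // 3. evaluate against the pinned snapshot
	for k := range snap {
		if k < readEnd {
			keys = append(keys, k)
		}
	}
	return keys, nil
}

func main() {
	eng := &engine{data: map[string]string{"a": "1", "p": "2", "t": "3"}}

	// If the split had already applied when the descriptor is checked, the read
	// fails cleanly instead of returning a partial result; if it applies later,
	// the snapshot still contains the RHS keys.
	fmt.Println(followerRead(eng, "m", "z")) // rejected: bounds already shrunk
	fmt.Println(followerRead(eng, "z", "z")) // sees a, p, t (in some order)
}
```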

Besides this particular split/follower-read scenario, it seems to me that the topic of latches on followers should be explored. If the latches taken by follower reads are not useful, we shouldn't take them. The evaluation of requests accesses both MVCC data and in-memory range state. Access to this state is supposed to be made safe by latches, as asserted by SpanSetReplicaEvalContext. On followers, I think SpanSetReplicaEvalContext does not provide the guarantees it thinks it does, which suggests that follower reads should only have access to a much-reduced EvalContext that doesn't allow access to the range's in-memory state. Hopefully this would be sufficient for the limited set of requests that we evaluate on followers.

Jira issue: CRDB-8330

@andreimatei andreimatei added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-kv-replication Relating to Raft, consensus, and coordination. S-0-visible-logical-error Database stores inconsistent data in some cases, or queries return invalid results silently. T-kv KV Team labels Jun 29, 2021
andreimatei (Contributor, Author) commented:

We've discussed this a bit. Taking readOnlyCmdMu in write mode on the application path, below Raft, seems dangerous as long as the lock is held for the whole duration of read evaluation. It seems that the best way forward, at least in the short term, is to reduce the time evaluation holds the lock by dropping it soon after r.checkExecutionCanProceed(), and to couple that with eagerly binding the Pebble snapshot while the lock is held. We've also discussed auditing checkExecutionCanProceed() to make sure that the replica state it accesses is properly synchronized with Pebble state (i.e. we write to Pebble before mutating in-memory state, or otherwise protect against checkExecutionCanProceed prematurely succeeding).
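
A rough sketch of that lock scoping, with made-up names standing in for the real ones: readOnlyCmdMu is held in read mode only across the pre-evaluation checks and the snapshot binding, and is released before evaluation, so below-Raft work that takes the lock in write mode isn't blocked behind long reads.

```go
package main

import (
	"fmt"
	"sync"
)

type replica struct {
	readOnlyCmdMu sync.RWMutex
	destroyed     bool
	endKey        string
	engine        map[string]string // stand-in for the storage engine
}

// checkExecutionCanProceed stands in for the real pre-evaluation checks
// (destroy status, range bounds, lease, and so on).
func (r *replica) checkExecutionCanProceed(readEnd string) error {
	if r.destroyed {
		return fmt.Errorf("replica destroyed")
	}
	if readEnd > r.endKey {
		return fmt.Errorf("read past range bound %q", r.endKey)
	}
	return nil
}

func (r *replica) executeReadOnlyBatch(readEnd string) ([]string, error) {
	r.readOnlyCmdMu.RLock()
	if err := r.checkExecutionCanProceed(readEnd); err != nil {
		r.readOnlyCmdMu.RUnlock()
		return nil, err
	}
	// Eagerly "bind the snapshot" while the lock is still held.
	snap := make(map[string]string, len(r.engine))
	for k, v := range r.engine {
		snap[k] = v
	}
	r.readOnlyCmdMu.RUnlock() // drop the lock before evaluation, not after

	// Evaluation runs against the pinned snapshot, outside the lock.
	var keys []string
	for k := range snap {
		if k < readEnd {
			keys = append(keys, k)
		}
	}
	return keys, nil
}

func main() {
	r := &replica{endKey: "z", engine: map[string]string{"a": "1", "p": "2"}}
	fmt.Println(r.executeReadOnlyBatch("z"))
}
```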

sumeerbhola (Collaborator) commented:

@andreimatei, is fixing this unblocked by the merge of #66845?

andreimatei (Contributor, Author) commented:

#66845 helps. But beyond that, I wouldn't say it's particularly clear what to do here, because I'm not very confident about the consequences (or lack thereof) of dropping readOnlyCmdMu early.

mwang1026 commented:

Related to #55461, which will help unblock this work.

aayushshah15 (Contributor) commented:

Update for posterity: We now (as of #76312) check the range bounds and eagerly capture a storage engine snapshot while holding onto readOnlyCmdMu. However, it still seems like this bug might be possible because nothing in the split application path acquires readOnlyCmdMu. If we changed that (i.e. if splits only updated the LHS range descriptor while holding on to readOnlyCmdMu), we should no longer have this bug.

@nvanbenschoten (or @tbg) do you mind confirming whether the above seems correct to you?

nvanbenschoten (Member) commented:

Yes, this sounds correct to me. We'll need to make below-raft range boundary changes synchronized with respect to above-raft {range desc checks, destroy checks, engine snapshots}. The most straightforward way to do that seems like it would be to grab the readOnlyCmdMu in exclusive mode when applying a raft entry that changes a range's key bounds.

Another option would be to make this optimistic by re-checking the range descriptor (checkSpanInRangeRLocked) after grabbing an engine snapshot. We already determined that we need to grab the r.mu.RLock after this snapshot for #55293. In fact, it seems possible that we could remove readOnlyCmdMu entirely by also rechecking isDestroyedRLocked after grabbing an engine snapshot.

Also, now that we grab the engine snapshot eagerly, we should release readOnlyCmdMu before evaluation instead of after evaluation due to the current defer r.readOnlyCmdMu.RUnlock() in executeReadOnlyBatch. That will be important to avoid blocking Raft on batch evaluation. We'll need to do the same thing in executeWriteBatch.
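
A minimal sketch of the first option, using a toy replica type and hypothetical method names: the below-Raft application of a bounds-changing entry takes readOnlyCmdMu in exclusive mode, so it cannot interleave with a read holding the lock between its descriptor check and its engine snapshot.

```go
package main

import (
	"fmt"
	"sync"
)

type replica struct {
	readOnlyCmdMu sync.RWMutex
	endKey        string
	engine        map[string]string
}

// applyBoundsChange models applying a Raft entry (e.g. a split) that changes
// the range's key bounds, taking readOnlyCmdMu in exclusive mode as proposed
// above.
func (r *replica) applyBoundsChange(newEndKey string) {
	r.readOnlyCmdMu.Lock()
	defer r.readOnlyCmdMu.Unlock()
	r.endKey = newEndKey
}

// read holds readOnlyCmdMu in shared mode across the bounds check and the
// snapshot, so it can never observe pre-split bounds together with engine
// contents from after an RHS removal.
func (r *replica) read(readEnd string) ([]string, error) {
	r.readOnlyCmdMu.RLock()
	if readEnd > r.endKey {
		r.readOnlyCmdMu.RUnlock()
		return nil, fmt.Errorf("read past range bound %q", r.endKey)
	}
	snap := make(map[string]string, len(r.engine))
	for k, v := range r.engine {
		snap[k] = v
	}
	r.readOnlyCmdMu.RUnlock()

	var keys []string
	for k := range snap {
		if k < readEnd {
			keys = append(keys, k)
		}
	}
	return keys, nil
}

func main() {
	r := &replica{endKey: "z", engine: map[string]string{"a": "1", "p": "2"}}
	fmt.Println(r.read("z"))
	r.applyBoundsChange("m")
	fmt.Println(r.read("z")) // now rejected: bounds have shrunk
}
```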

aayushshah15 (Contributor) commented:

> Another option would be to make this optimistic by re-checking the range descriptor (checkSpanInRangeRLocked) after grabbing an engine snapshot. We already determined that we need to grab the r.mu.RLock after this snapshot for #55293. In fact, it seems possible that we could remove readOnlyCmdMu entirely by also rechecking isDestroyedRLocked after grabbing an engine snapshot.

Is there an issue that discusses how we can get rid of readOnlyCmdMu? If not, I can file one to track it separately. I see some discussion about it in #43048.

> Also, now that we grab the engine snapshot eagerly, we should release readOnlyCmdMu before evaluation instead of after evaluation due to the current defer r.readOnlyCmdMu.RUnlock() in executeReadOnlyBatch. That will be important to avoid blocking Raft on batch evaluation. We'll need to do the same thing in executeWriteBatch.

Thanks for confirming this bit too. I can track it separately once we have an issue that talks about removing readOnlyCmdMu.

nvanbenschoten (Member) commented:

> Is there an issue that discusses how we can get rid of readOnlyCmdMu? If not, I can file one to track it separately.

I don't see one, so we should open up a new issue.

arulajmani added a commit to arulajmani/cockroach that referenced this issue Oct 13, 2022
This patch fixes a bug in how follower reads are synchronized with the
application of concurrent split operations. Reads on the leaseholder
are serialized with concurrent split operations by latching. However,
splits are simply applied on the follower, and as such, don't go through
latching like they do on the leaseholder. Previously, this could lead to
invalid reads in cases where the range split and the RHS was removed
after the range descriptor's bounds were checked but before a storage
snapshot was acquired.

This patch fixes this hazard by checking the range bounds after
acquiring the storage snapshot (in addition to before, like we used to
prior to this change). It also adds a couple of tests -- one exercising
the exact scenario described in the associated issue and another that
runs concurrent split/read operations without tightly controlling the
synchronization between them.

Fixes cockroachdb#67016

Release note (bug fix): fixes a rare bug where concurrent follower
read/split operations could lead to invalid read results.
craig bot pushed a commit that referenced this issue Oct 24, 2022
89886: kvserver: ensure follower reads correctly synchronize with splits r=arulajmani a=arulajmani

This patch fixes a bug in how follower reads are synchronized with the application of concurrent split operations. Reads on the leaseholder are serialized with concurrent split operations by latching. However, splits are simply applied on the follower, and as such, don't go through latching like they do on the leaseholder. Previously, this could lead to invalid reads in cases where the range split and the RHS was removed after the range descriptor's bounds were checked but before a storage snapshot was acquired.

This patch fixes this hazard by checking the range bounds after acquiring the storage snapshot (in addition to before, like we used to prior to this change). It also adds a couple of tests -- one exercising the exact scenario described in the associated issue and another that runs concurrent split/read operations without tightly controlling the synchronization between them.

Fixes #67016

Release note (bug fix): fixes a rare bug where concurrent follower read/split operations could lead to invalid read results.

90456: sqlsmith: do not error when UDTs have no members r=mgartner a=mgartner

Fixes #90433

Release note: None

Co-authored-by: Arul Ajmani <[email protected]>
Co-authored-by: Marcus Gartner <[email protected]>
craig bot closed this as completed in e1f85b6 Oct 24, 2022
arulajmani added a commit to arulajmani/cockroach that referenced this issue Nov 7, 2022
Now that we've fixed cockroachdb#67016, and follower reads correctly synchronize
with concurrent splits, it's safe for us to serve ExportRequests from
followers. This patch permits that.

Closes cockroachdb#88804

Release note: None
arulajmani added a commit to arulajmani/cockroach that referenced this issue Nov 8, 2022
This patch fixes a bug in how follower reads are synchronized with the
application of concurrent split operations. Reads on the leaseholder
are serialized with concurrent split operations by latching. However,
splits are simply applied on the follower, and as such, don't go through
latching like they do on the leaseholder. Previously, this could lead to
invalid reads in cases where the range split and the RHS was removed
after the range descriptor's bounds were checked but before a storage
snapshot was acquired.

This patch fixes this hazard by checking the range bounds after
acquiring the storage snapshot (in addition to before, like we used to
prior to this change). It also adds a couple of tests -- one exercising
the exact scenario described in the associated issue and another that
runs concurrent split/read operations without tightly controlling the
synchronization between them.

Fixes cockroachdb#67016

Release note (bug fix): fixes a rare bug where concurrent follower
read/split operations could lead to invalid read results.
arulajmani added a commit to arulajmani/cockroach that referenced this issue Dec 7, 2022
Now that we've fixed cockroachdb#67016, and follower reads correctly synchronize
with concurrent splits, it's safe for us to serve ExportRequests from
followers. This patch permits that.

Closes cockroachdb#88804

Release note: None
arulajmani added a commit to arulajmani/cockroach that referenced this issue Dec 7, 2022
Now that we've fixed cockroachdb#67016, and follower reads correctly synchronize
with concurrent splits, it's safe for us to serve ExportRequests from
followers. This patch permits that.

Closes cockroachdb#88804

Release note: None
craig bot pushed a commit that referenced this issue Dec 7, 2022
91405: kv: permit ExportRequest evaluation on followers r=nvanbenschoten a=arulajmani

Now that we've fixed #67016, and follower reads correctly synchronize with concurrent splits, it's safe for us to serve ExportRequests from followers. This patch permits that.

Closes #88804

Release note: None

93124: roachtest: check logger file before dereference r=smg260 a=smg260

Another nil dereference fix like #92845.

[Error in TC](https://teamcity.cockroachdb.com/viewLog.html?buildId=7824979&buildTypeId=Cockroach_Nightlies_RoachtestNightlyGceBazel)

Release note: none
Epic: none

Co-authored-by: Arul Ajmani <[email protected]>
Co-authored-by: Miral Gadani <[email protected]>