storage: avoid stale read due to observed timestamp after lease change #23749

tbg · 2018-03-12T19:18:45Z

Discovered in #21056 (follower reads).

The fundamental problem is that when we pick up a clock reading on a node that we visit, we treat it as a guarantee that the node has not served any writes at higher timestamps that we could be missing (that is, writes that happened causally before us). This guarantee breaks if the node becomes leaseholder of some range after we first visited it: the data was in fact causally there before our visit, but the node wasn't in charge of it and was thus unaware of it at the time.

For example,

put(k on leaseholder n1), gateway chooses t=1.0
begin; read(unrelated key on n2); gateway chooses t=0.98
pick up observed timestamp for n2 of 0.99
lease of n1 expires (or gets transferred), n2 takes over
read(k) on leaseholder n2 misses the earlier put at t=1.0 (it lies outside the uncertainty window, which now ends at 0.99)
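The arithmetic behind this example can be sketched in a few lines of Go. This is only an illustration, not code from the repository: `isUncertain` is a hypothetical helper, timestamps are written in hundredths of a second so that 98 stands for t=0.98, and the 0.5s maximum clock offset used to derive the transaction's original `MaxTimestamp` is an assumed value.

```go
package main

import "fmt"

// isUncertain reports whether a value written at valueTS falls inside the
// reader's uncertainty window (readTS, limit]. A value inside the window
// forces an uncertainty restart; a value above the limit is treated as
// having been written after the transaction and is skipped.
// Hypothetical helper for illustration only.
func isUncertain(readTS, limit, valueTS int64) bool {
	return valueTS > readTS && valueTS <= limit
}

func main() {
	const (
		putTS      = 100 // put(k) on n1 at t=1.0
		readTS     = 98  // transaction timestamp t=0.98
		observedN2 = 99  // clock reading picked up on n2, t=0.99
		txnMaxTS   = 148 // assumed: readTS plus a 0.5s max clock offset
	)

	// Window capped at the observed timestamp: the put is skipped,
	// which is the stale read described above.
	fmt.Println(isUncertain(readTS, observedN2, putTS))

	// Window left at the original MaxTimestamp: the put is caught.
	fmt.Println(isUncertain(readTS, txnMaxTS, putTS))
}
```

With the window capped at the observed timestamp the check returns `false`, so the read proceeds without seeing the put; with the original upper bound it returns `true` and the transaction would restart.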

@spencerkimball suggests:

Should be fixable by moving the setting of MaxTimestamp from Stores.Send to after we call Replica.redirectOnOrAcquireLease. MaxTimestamp should be set to the greater of the observed timestamp for the node and the lease start time, bounded by Transaction.MaxTimestamp. This will of course result in more uncertainty restarts, but not many, as lease transfers are not common unless the transaction is really long-running, in which case the original Transaction.MaxTimestamp will control.
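A minimal sketch of the rule proposed here, assuming plain integer timestamps in hundredths of a second. `limitMaxTimestamp` and its parameters are hypothetical names chosen for this illustration, and the lease start of 105 (t=1.05) is an assumed value; neither comes from the actual patch.

```go
package main

import "fmt"

// limitMaxTimestamp returns the upper bound of the uncertainty window:
// the greater of the node's observed timestamp and the lease start time,
// never exceeding the transaction's original MaxTimestamp.
// Hypothetical helper for illustration only.
func limitMaxTimestamp(txnMax, observed, leaseStart int64) int64 {
	limit := observed
	if leaseStart > limit {
		limit = leaseStart
	}
	if limit > txnMax {
		limit = txnMax
	}
	return limit
}

func main() {
	// Scenario from the issue: observed timestamp 0.99 on n2, but n2's
	// lease started (by assumption) at 1.05, after the put at 1.0.
	fmt.Println(limitMaxTimestamp(148, 99, 105)) // 105

	// Lease predates the observed timestamp: the observed value applies.
	fmt.Println(limitMaxTimestamp(148, 99, 90)) // 99

	// Lease started beyond the original bound: the original bound controls.
	fmt.Println(limitMaxTimestamp(148, 99, 200)) // 148
}
```

In the first case the window extends to 1.05, so the put at t=1.0 is inside it and triggers an uncertainty restart instead of being missed.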

This change relocates the logic for using observed timestamps to affect a transaction's `MaxTimestamp` field in order to avoid uncertainty restarts. This was previously done in `storage.Stores.Send`, but is now moved to `storage.Replica.executeReadOnlyBatch` and `storage.Replica.tryExecuteWriteBatch`. The move lets the logic use information from the replica's lease, so that a read does not miss a value that was written in absolute time before the txn started, but at a higher timestamp than the one observed at the node which now owns the replica lease.

Fixes cockroachdb#23749

Release note: None

36554: roachpb: formalize observed timestamp preconditions, talk about leaseholders r=nvanbenschoten a=nvanbenschoten

This PR is an initial reaction to #36431 and to `limitTxnMaxTimestamp` being broken for follower reads (PR for that coming soon).

To start, we were missing any real documentation on what makes the observed timestamp mechanism safe and what invariants need to hold in order for a clock reading on a node to imply an absence of causality on any values later read at a higher timestamp on that node. The PR addresses this by discussing the invariants necessary for observed timestamps to be used to bound a transaction's uncertainty window.

In doing so, the PR also attempts to address a second issue, which is that the documentation conflated a node serving reads and writes with leaseholders serving reads and writes. At the time that this was all written, these were equivalent. Now that we have follower reads, the omission of a discussion about leaseholders made this all hard to think about (e.g. "whose observed timestamp do we care about when performing a follower read?"). The PR addresses this by adjusting the wording around nodes and leaseholders.

This would have come in handy in reasoning about #23749.

Release note: None

Co-authored-by: Nathan VanBenschoten

tbg added this to the 2.0 milestone Mar 12, 2018

tbg assigned tbg and spencerkimball Mar 12, 2018

tbg mentioned this issue Mar 13, 2018

storage: implement follower reads using replica closed timestamps #21056

Closed

spencerkimball mentioned this issue Mar 21, 2018

storage: improve use of observed tstamps to limit uncertainty restarts #24127

Merged

spencerkimball closed this as completed in #24127 Mar 22, 2018

nvanbenschoten mentioned this issue Apr 5, 2019

roachpb: formalize observed timestamp preconditions, talk about leaseholders #36554

Merged

nvanbenschoten mentioned this issue Feb 24, 2020

kv: stale read because intent from our past gets committed in our certain future? #36431

Closed
