Skip to content

Commit

Permalink
Merge #56667 #56679
Browse files Browse the repository at this point in the history
56667: sql: normalize age and timestamptz intervals like postgres r=solongordon a=otan

sql: normalize age and timestamptz intervals like postgres

Release note (sql change, bug fix): Previously, `timestamp/timestamptz -
timestamp/timestamptz` operators would normalize the interval into
months, days, H:M:S (in older versions, this may be just H:M:S).

This could result in an incorrect result:
```
> select '2020-01-01'::timestamptz - '2018-01-01';
     ?column?
-------------------
  2 years 10 days
```
This has now been fixed such that it is only normalized into
days/H:M:S. This is now more postgres compatible.

Release note (sql change, bug fix): Previously, the `age` builtin would
incorrectly normalize months and days based on 30 days a month (in older
versions this may be just H:M:S).

This can give an incorrect result, e.g.
```
> select age('2020-01-01'::timestamptz, '2018-01-01');
     ?column?
-------------------
  2 years 10 days
```

This is not the most accurate it could be as `age` can use the given
timestamptz arguments to be more accurate.
This is now more postgres compatible.

Revert "sql: age returns normalized intervals"

This reverts commit 88a5d94.

Release note: None



56679: kv: use leaseholder's observed timestamp even for follower reads r=nvanbenschoten a=nvanbenschoten

This commit addresses a longstanding TODO to rationalize the use of
observed timestamps during follower reads.

Before this change, we used to use observed timestamps pulled from the
follower node to limit a transaction's uncertainty interval, but this
was incorrect. An observed timestamp pulled from the follower node's
clock has no meaning for the purpose of reducing the transaction's
uncertainty interval. This is because there is no guarantee that at the
time of acquiring the observed timestamp from the follower node, the
leaseholder hadn't already served writes at higher timestamps than the
follower node's clock reflected.

However, if the transaction performing a follower read happens to have
an observed timestamp from the current leaseholder, this timestamp can
be used to reduce the transaction's uncertainty interval. Even though
the read is being served from a different replica in the range, the
observed timestamp still places a bound on the values in the range that
may have been written before the transaction began.

In the past, this was mostly innocuous because AOST txns don't have an
uncertainty interval and the follower read duration was so large that
very few "present time" transactions would ever perform follower reads.
Now that the follower read duration is lower and more and more present
time transactions will perform follower reads, this is more critical to
get right.

The change also reworks the observedts package's docs a little bit to
take advantage of section headers:
![localhost_8080_pkg_github com_cockroachdb_cockroach_pkg_kv_kvserver_observedts_](https://user-images.githubusercontent.com/5438456/99130331-0d3cfc00-25de-11eb-9f49-616bfed9fc55.png)


Co-authored-by: Oliver Tan <[email protected]>
Co-authored-by: Nathan VanBenschoten <[email protected]>
  • Loading branch information
3 people committed Nov 16, 2020
3 parents 807755c + abfe0d5 + d7d777b commit 859be50
Show file tree
Hide file tree
Showing 15 changed files with 431 additions and 125 deletions.
8 changes: 6 additions & 2 deletions docs/generated/sql/functions.md
Original file line number Diff line number Diff line change
Expand Up @@ -330,9 +330,13 @@
<table>
<thead><tr><th>Function &rarr; Returns</th><th>Description</th></tr></thead>
<tbody>
<tr><td><a name="age"></a><code>age(end: <a href="timestamp.html">timestamptz</a>, begin: <a href="timestamp.html">timestamptz</a>) &rarr; <a href="interval.html">interval</a></code></td><td><span class="funcdesc"><p>Calculates the interval between <code>begin</code> and <code>end</code>.</p>
<tr><td><a name="age"></a><code>age(end: <a href="timestamp.html">timestamptz</a>, begin: <a href="timestamp.html">timestamptz</a>) &rarr; <a href="interval.html">interval</a></code></td><td><span class="funcdesc"><p>Calculates the interval between <code>begin</code> and <code>end</code>, normalized into years, months and days.</p>
<pre><code> Note this may not be an accurate time span since years and months are normalized from days, and years and months are out of context.
To avoid normalizing days into months and years, use the timestamptz subtraction operator.</code></pre>
</span></td></tr>
<tr><td><a name="age"></a><code>age(val: <a href="timestamp.html">timestamptz</a>) &rarr; <a href="interval.html">interval</a></code></td><td><span class="funcdesc"><p>Calculates the interval between <code>val</code> and the current time.</p>
<tr><td><a name="age"></a><code>age(val: <a href="timestamp.html">timestamptz</a>) &rarr; <a href="interval.html">interval</a></code></td><td><span class="funcdesc"><p>Calculates the interval between <code>val</code> and the current time, normalized into years, months and days.</p>
<pre><code> Note this may not be an accurate time span since years and months are normalized from days, and years and months are out of context.
To avoid normalizing days into months and years, use `now() - timestamptz`.</code></pre>
</span></td></tr>
<tr><td><a name="clock_timestamp"></a><code>clock_timestamp() &rarr; <a href="timestamp.html">timestamp</a></code></td><td><span class="funcdesc"><p>Returns the current system time on one of the cluster nodes.</p>
</span></td></tr>
Expand Down
15 changes: 14 additions & 1 deletion pkg/kv/kvserver/observedts/BUILD.bazel
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
load("@io_bazel_rules_go//go:def.bzl", "go_library")
load("@io_bazel_rules_go//go:def.bzl", "go_library", "go_test")

go_library(
name = "observedts",
Expand All @@ -14,3 +14,16 @@ go_library(
"//pkg/util/log",
],
)

go_test(
name = "observedts_test",
srcs = ["limit_test.go"],
embed = [":observedts"],
deps = [
"//pkg/kv/kvserver/kvserverpb",
"//pkg/roachpb",
"//pkg/util/hlc",
"//pkg/util/leaktest",
"//vendor/github.com/stretchr/testify/require",
],
)
115 changes: 77 additions & 38 deletions pkg/kv/kvserver/observedts/doc.go
Original file line number Diff line number Diff line change
Expand Up @@ -16,16 +16,24 @@ package observedts

import "github.com/cockroachdb/cockroach/pkg/roachpb"

// D1 | A transaction's "max timestamp" is the upper bound of its uncertainty
// D1 ————————————————————————————————————————————————
//
// Transaction.MaxTimestamp
//
// A transaction's "max timestamp" is the upper bound of its uncertainty
// interval. The value is set to the transaction's initial timestamp + the
// cluster's maximum clock skew. Assuming maximum clock skew bounds are
// respected, this maximum timestamp places an upper bound on the commit
// timestamp of any other transaction that committed causally before the new
// transaction.
var D1 = roachpb.Transaction{}.MaxTimestamp

// D2 | While reading, if a transaction encounters a value above its read
// timestamp but equal to or below its max timestamp, it triggers a read within
// D2 ————————————————————————————————————————————————
//
// ReadWithinUncertaintyIntervalError
//
// While reading, if a transaction encounters a value above its read timestamp
// but equal to or below its max timestamp, it triggers a read within
// uncertainty interval error.
//
// This error forces the transaction to increase its read timestamp, either
Expand All @@ -36,7 +44,11 @@ var D1 = roachpb.Transaction{}.MaxTimestamp
// linearizability and avoid stale reads.
var D2 = roachpb.ReadWithinUncertaintyIntervalError{}

// D3 | An observed timestamp is a combination of a NodeID and a Timestamp. The
// D3 ————————————————————————————————————————————————
//
// ObservedTimestamp
//
// An observed timestamp is a combination of a NodeID and a Timestamp. The
// timestamp is said to have been "observed" because it was pulled from the
// corresponding node's HLC clock.
//
Expand All @@ -46,16 +58,23 @@ var D2 = roachpb.ReadWithinUncertaintyIntervalError{}
// given a smaller value.
var D3 = roachpb.ObservedTimestamp{}

// D4 | A transaction collects observed timestamps as it visits nodes in the
// cluster when performing reads and writes.
// D4 ————————————————————————————————————————————————
//
// Transaction.UpdateObservedTimestamp
//
// A transaction collects observed timestamps as it visits nodes in the cluster
// when performing reads and writes.
var D4 = (&roachpb.Transaction{}).UpdateObservedTimestamp

// D5 | The observed timestamps are collected in a list on the transaction
// proto. The purpose of this list is to avoid uncertainty related restarts
// which normally occur when reading a value in the near future as per the
// max_timestamp field.
// D5 ————————————————————————————————————————————————
//
// ### Meaning:
// Transaction.ObservedTimestamps
//
// The observed timestamps are collected in a list on the transaction proto. The
// purpose of this list is to avoid uncertainty related restarts which normally
// occur when reading a value in the near future as per the max_timestamp field.
//
// Meaning
//
// Morally speaking, having an entry for a node in this list means that this
// node has been visited before, and that no more uncertainty restarts are
Expand All @@ -73,7 +92,7 @@ var D4 = (&roachpb.Transaction{}).UpdateObservedTimestamp
// timestamp that is at least high as our entry in the list for node A, so no
// future operation on node A will be uncertain.
//
// ### Correctness:
// Correctness
//
// Thus, expressed properly, we can say that when a node has been read from
// successfully before by a transaction, uncertainty for values written by a
Expand All @@ -90,26 +109,26 @@ var D4 = (&roachpb.Transaction{}).UpdateObservedTimestamp
// purposes of uncertainty.
//
// There are two invariants necessary for this property to hold:
// 1. a leaseholder's clock must always be equal to or greater than the timestamp
// of all writes that it has served. This is trivial to enforce for
// non-transactional writes. It is more complicated for transactional writes
// which may move their commit timestamp forward over their lifetime before
// committing, even after writing intents on remote Ranges. To accommodate
// this situation, transactions ensure that at the time of their commit, any
// leaseholder for a Range that contains one of its intent has an HLC clock
// with an equal or greater timestamp than the transaction's commit timestamp.
// TODO(nvanbenschoten): This is violated by txn refreshes. See #36431.
// 2. a leaseholder's clock must always be equal to or greater than the timestamp
// of all writes that previous leaseholders for its Range have served. We
// enforce that when a Replica acquires a lease it bumps its node's clock to a
// time higher than the previous leaseholder's clock when it stopped serving
// writes. This is accomplished cooperatively for lease transfers and through
// a statis period before lease expiration for lease acquisitions. It then
// follows by induction that, in conjunction with the previous invariant, this
// invariant holds for all leaseholders, given that a Range's initial
// leaseholder assumes responsibility for an empty range with no writes.
//
// ### Usage:
// 1. a leaseholder's clock must always be equal to or greater than the timestamp
// of all writes that it has served. This is trivial to enforce for
// non-transactional writes. It is more complicated for transactional writes
// which may move their commit timestamp forward over their lifetime before
// committing, even after writing intents on remote Ranges. To accommodate
// this situation, transactions ensure that at the time of their commit, any
// leaseholder for a Range that contains one of its intent has an HLC clock
// with an equal or greater timestamp than the transaction's commit timestamp.
// TODO(nvanbenschoten): This is violated by txn refreshes. See #36431.
// 2. a leaseholder's clock must always be equal to or greater than the timestamp
// of all writes that previous leaseholders for its Range have served. We
// enforce that when a Replica acquires a lease it bumps its node's clock to a
// time higher than the previous leaseholder's clock when it stopped serving
// writes. This is accomplished cooperatively for lease transfers and through
// a statis period before lease expiration for lease acquisitions. It then
// follows by induction that, in conjunction with the previous invariant, this
// invariant holds for all leaseholders, given that a Range's initial
// leaseholder assumes responsibility for an empty range with no writes.
//
// Usage
//
// The property ensures that when this list holds a corresponding entry for the
// node who owns the lease that the current request is executing under, we can
Expand All @@ -133,14 +152,34 @@ var D4 = (&roachpb.Transaction{}).UpdateObservedTimestamp
// the list. In particular, if `read_timestamp` is taken from that node's clock,
// we may add that to the list, which eliminates read uncertainty for reads on
// that node.
//
// Follower Reads
//
// If the replica serving a transaction's read is not the leaseholder for its
// range, an observed timestamp pulled from the follower node's clock has no
// meaning for the purpose of reducing the transaction's uncertainty interval.
// This is because there is no guarantee that at the time of acquiring the
// observed timestamp from the follower node, the leaseholder hadn't already
// served writes at higher timestamps than the follower node's clock reflected.
//
// However, if the transaction performing a follower read happens to have an
// observed timestamp from the current leaseholder, this timestamp can be used
// to reduce the transaction's uncertainty interval. Even though the read is
// being served from a different replica in the range, the observed timestamp
// still places a bound on the values in the range that may have been written
// before the transaction began.
var D5 = roachpb.Transaction{}.ObservedTimestamps

// D6 | Observed timestamps allow transactions to avoid uncertainty related
// restarts because they allow transactions to bound their "effective max
// timestamp" when reading on a node which they have previously collected an
// observed timestamp from. Similarly, observed timestamps can also assist a
// transaction even on its first visit to a node in cases where it gets stuck
// waiting on locks for long periods of time.
// D6 ————————————————————————————————————————————————
//
// LimitTxnMaxTimestamp
//
// Observed timestamps allow transactions to avoid uncertainty related restarts
// because they allow transactions to bound their "effective max timestamp" when
// reading on a node which they have previously collected an observed timestamp
// from. Similarly, observed timestamps can also assist a transaction even on
// its first visit to a node in cases where it gets stuck waiting on locks for
// long periods of time.
var D6 = LimitTxnMaxTimestamp

// Ignore unused warnings.
Expand Down
68 changes: 35 additions & 33 deletions pkg/kv/kvserver/observedts/limit.go
Original file line number Diff line number Diff line change
Expand Up @@ -18,14 +18,14 @@ import (
"github.com/cockroachdb/cockroach/pkg/util/log"
)

// LimitTxnMaxTimestamp limits the batch transaction's max timestamp
// so that it respects any timestamp already observed on this node.
// This prevents unnecessary uncertainty interval restarts caused by
// reading a value written at a timestamp between txn.Timestamp and
// txn.MaxTimestamp. The replica lease's start time is also taken into
// consideration to ensure that a lease transfer does not result in
// the observed timestamp for this node being inapplicable to data
// previously written by the former leaseholder. To wit:
// LimitTxnMaxTimestamp limits the transaction's max timestamp so that it
// respects any timestamp already observed on the leaseholder's node. This
// prevents unnecessary uncertainty interval restarts caused by reading a value
// written at a timestamp between txn.Timestamp and txn.MaxTimestamp.
//
// The lease's start time is also taken into consideration to ensure that a
// lease transfer does not result in the observed timestamp for this node being
// inapplicable to data previously written by the former leaseholder. To wit:
//
// 1. put(k on leaseholder n1), gateway chooses t=1.0
// 2. begin; read(unrelated key on n2); gateway chooses t=0.98
Expand All @@ -39,40 +39,42 @@ import (
// the previous leaseholder.
//
func LimitTxnMaxTimestamp(
ctx context.Context, ba *roachpb.BatchRequest, status kvserverpb.LeaseStatus,
) {
if ba.Txn == nil {
return
ctx context.Context, txn *roachpb.Transaction, status kvserverpb.LeaseStatus,
) *roachpb.Transaction {
if txn == nil || status.State != kvserverpb.LeaseState_VALID {
return txn
}
// For calls that read data within a txn, we keep track of timestamps
// observed from the various participating nodes' HLC clocks. If we have
// a timestamp on file for this Node which is smaller than MaxTimestamp,
// we can lower MaxTimestamp accordingly. If MaxTimestamp drops below
// ReadTimestamp, we effectively can't see uncertainty restarts anymore.
// TODO(nvanbenschoten): This should use the lease's node id.
obsTS, ok := ba.Txn.GetObservedTimestamp(ba.Replica.NodeID)
// observed from the various participating nodes' HLC clocks. If we have a
// timestamp on file for the leaseholder's node which is smaller than
// MaxTimestamp, we can lower MaxTimestamp accordingly. If MaxTimestamp
// drops below ReadTimestamp, we effectively can't see uncertainty restarts
// anymore.
//
// Note that we care about an observed timestamp from the leaseholder's
// node, even if this is a follower read on a different node. See the
// comment in doc.go about "Follower Reads" for more.
obsTS, ok := txn.GetObservedTimestamp(status.Lease.Replica.NodeID)
if !ok {
return
}
// If the lease is valid, we use the greater of the observed
// timestamp and the lease start time, up to the max timestamp. This
// ensures we avoid incorrect assumptions about when data was
// written, in absolute time on a different node, which held the
// lease before this replica acquired it.
// TODO(nvanbenschoten): Do we ever need to call this when
// status.State != VALID?
if status.State == kvserverpb.LeaseState_VALID {
obsTS.Forward(status.Lease.Start)
return txn
}
if obsTS.Less(ba.Txn.MaxTimestamp) {
// If the lease is valid, we use the greater of the observed timestamp and
// the lease start time, up to the max timestamp. This ensures we avoid
// incorrect assumptions about when data was written, in absolute time on a
// different node, which held the lease before this replica acquired it.
obsTS.Forward(status.Lease.Start)
// If the observed timestamp reduces the transaction's uncertainty interval,
// update the transacion proto.
if obsTS.Less(txn.MaxTimestamp) {
// Copy-on-write to protect others we might be sharing the Txn with.
txnClone := ba.Txn.Clone()
txnClone := txn.Clone()
// The uncertainty window is [ReadTimestamp, maxTS), so if that window
// is empty, there won't be any uncertainty restarts.
if obsTS.LessEq(ba.Txn.ReadTimestamp) {
if obsTS.LessEq(txn.ReadTimestamp) {
log.Event(ctx, "read has no clock uncertainty")
}
txnClone.MaxTimestamp.Backward(obsTS)
ba.Txn = txnClone
txn = txnClone
}
return txn
}
Loading

0 comments on commit 859be50

Please sign in to comment.