storage: minimize retries with 1PC commits #22315

Conversation

@spencerkimball (Member) commented Feb 1, 2018

Introduce a new NoRefreshSpans field to the EndTransactionRequest
arguments. This specifies that a serializable isolation transaction has
accumulated no refresh spans. On a 1PC commit, this can be used to
avoid serializable restarts by re-executing the 1PC transaction at an
appropriately higher timestamp when the batch's timestamp has been
forwarded by the timestamp cache or by write-too-old errors.

When evaluating a write batch, we now allow a local retry on
write-too-old errors for non-transactional batches, and for serializable
1PC txns where NoRefreshSpans is true.

Release note: None
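
To illustrate the mechanism, here is a minimal, self-contained sketch of the local-retry idea (hypothetical helper and names; the actual logic lives in the replica's write-batch evaluation):

```go
package main

import "fmt"

// evaluateWithLocalRetry is a hypothetical stand-in for the new replica-side
// logic. eval evaluates the batch at the given timestamp and returns the
// newest conflicting write's timestamp across all requests in the batch, or
// 0 if there was no write-too-old condition.
func evaluateWithLocalRetry(
	baTS int64,
	canRetryLocally bool, // non-txn batch, or serializable 1PC txn with NoRefreshSpans
	eval func(ts int64) (conflictTS int64),
) (int64, error) {
	const maxLocalRetries = 1
	for attempt := 0; ; attempt++ {
		conflictTS := eval(baTS)
		if conflictTS == 0 {
			return baTS, nil // success at baTS; no client-side restart needed
		}
		if !canRetryLocally || attempt >= maxLocalRetries {
			return 0, fmt.Errorf("WriteTooOldError: batch ts %d <= existing ts %d", baTS, conflictTS)
		}
		// Forward the batch timestamp one tick past the newest conflicting
		// write and re-evaluate on the replica instead of returning an error.
		baTS = conflictTS + 1
	}
}

func main() {
	// A conflicting write already exists at ts=5; our batch arrives at ts=3.
	eval := func(ts int64) int64 {
		if ts <= 5 {
			return 5
		}
		return 0
	}
	fmt.Println(evaluateWithLocalRetry(3, true, eval)) // 6 <nil>: retried locally
}
```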

@spencerkimball requested review from bdarnell, nvanbenschoten and a team on February 1, 2018 23:20
@cockroach-teamcity (Member)

This change is Reviewable

@spencerkimball force-pushed the avoid-client-retries-on-1pc branch from b1d7e68 to c2a562e on February 2, 2018 20:35
@spencerkimball (Member Author)

@nvanbenschoten this now evaluates the full batch for both transactional and non-transactional batches to find the highest timestamp needed to satisfy write-too-old errors. Local retries on the replica at evaluation time are always allowed for non-transactional batches, and for serializable 1PC transactional batches as long as they have the NoRefreshSpans flag set to true.

@nvanbenschoten (Member)

Thanks @spencerkimball, I'm more comfortable with the change now that we're attempting to minimize the timestamp that these transactions will be pushed to. This eliminates the increase in WriteTooOld errors that you were seeing before, right?

@spencerkimball (Member Author)

@nvanbenschoten it does eliminate those, yes.

@nvanbenschoten (Member)

Reviewed 5 of 7 files at r1.
Review status: 5 of 7 files reviewed at latest revision, all discussions resolved, some commit checks failed.


pkg/roachpb/api.proto, line 458 at r1 (raw file):

  // intents are left in the event of an error.
  bool require_1pc = 6 [(gogoproto.customname) = "Require1PC"];
  // Set to true if this transaction is serializable isolation and has

Does it ever make sense to set this for snapshot txns?


pkg/roachpb/api.proto, line 461 at r1 (raw file):

// If true, one phase commits can use the executing server's
// hlc clock now timestamp to avoid retries.

replace "use the executing server's hlc clock now timestamp" with "increase their timestamp on the executing server"


pkg/roachpb/api.proto, line 462 at r1 (raw file):

  // etc.). If true, one phase commits can use the executing server's
  // hlc clock now timestamp to avoid retries.
  bool no_refresh_spans = 7;

Do you think we should make this flag even more general? Something like committed_to_timestamp or timestamp_negotiable.

As discussed with @andreimatei, we don't want to increase our transaction timestamp if we've done anything that relies on our initial timestamp. This includes having performed reads and writes at the timestamp but can also include leaking the timestamp in other ways like reporting it to SQL functions.


pkg/storage/replica.go, line 5209 at r1 (raw file):

	rec := NewReplicaEvalContext(r, spans)
	// We can retry locally if this is a non-transactional request.

nit: pull ba.Txn == nil into a variable like you did above with retryLocally so that it's easier to see the parallels.


pkg/storage/replica.go, line 5273 at r1 (raw file):

	if retry, reason := isEndTransactionTriggeringRetryError(ba.Txn); retry {
		// We can still attempt the one phase commit if the end
		// transaction request specifies there are no refresh spans.

Is this actually possible? Doesn't isEndTransactionTriggeringRetryError == true imply that the txn has performed previous writes?



@spencerkimball force-pushed the avoid-client-retries-on-1pc branch from c2a562e to 057d20f on February 2, 2018 22:42
@spencerkimball (Member Author)

Review status: 5 of 7 files reviewed at latest revision, 5 unresolved discussions, all commit checks successful.


pkg/roachpb/api.proto, line 458 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Does it ever make sense to set this for snapshot txns?

We don't track refresh spans for snapshot txns because we don't use them for refreshing, so this optimization doesn't apply to SNAPSHOT.


pkg/roachpb/api.proto, line 461 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

// If true, one phase commits can use the executing server's
// hlc clock now timestamp to avoid retries.

replace "use the executing server's hlc clock now timestamp" with "increase their timestamp on the executing server"

Oops, this was a leftover bit.


pkg/roachpb/api.proto, line 462 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Do you think we should make this flag even more general? Something like committed_to_timestamp or timestamp_negotiable.

As discussed with @andreimatei, we don't want to increase our transaction timestamp if we've done anything that relies on our initial timestamp. This includes having performed reads and writes at the timestamp but can also include leaking the timestamp in other ways like reporting it to SQL functions.

OK, I'm fine with that. Changed it to can_forward_commit_timestamp; "negotiable" sounds like it could do something other than move forward.


pkg/storage/replica.go, line 5209 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: pull ba.Txn == nil into a variable like you did above with retryLocally so that it's easier to see the parallels.

Done.


pkg/storage/replica.go, line 5273 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Is this actually possible? Doesn't isEndTransactionTriggeringRetryError == true imply that the txn has performed previous writes?

Yes, this happens all the time if the batch affects keys which have newer timestamps in the timestamp cache. I improved the comment.



@spencerkimball (Member Author)

@nvanbenschoten I realize we can actually avoid retries on non-fast-path (non-1PC) transactions as well with this knowledge about having no refresh spans.

@spencerkimball force-pushed the avoid-client-retries-on-1pc branch from 057d20f to 662e706 on February 2, 2018 23:39
@tbg (Member) commented Feb 2, 2018 via email

@spencerkimball (Member Author)

Fixes #15797

With this change, kv --concurrency=8 --cycle-length=1 exhibits:

_elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
      1s        0         2855.9         2855.9      2.8      3.4      5.0      6.8
      2s        0         2950.1         2903.0      2.8      3.0      4.7      6.8
      3s        0         2886.7         2897.6      2.8      3.3      5.8      7.9
      4s        0         2899.1         2898.0      2.8      3.1      5.8      7.3
      5s        0         2937.1         2905.8      2.6      3.0      6.3      8.9
      6s        0         2915.8         2907.5      2.8      3.0      6.6      7.1
      7s        0         2862.1         2901.0      2.8      3.1      7.1     11.5
      8s        0         2717.8         2878.1      2.8      3.9      8.9     15.2
      9s        0         2750.0         2863.8      2.8      3.5      8.1     11.5
     10s        0         2872.2         2864.7      2.8      3.1      8.4      9.4

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
   10.0s        0          28649         2864.7      2.8      2.8      3.3      6.8     15.2

The previous SHA's performance with the same options:

_elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
      1s        0          462.7          462.7     11.5     48.2     62.9     79.7
      2s        0          449.4          456.0     16.3     46.1     65.0     88.1
      3s        0          441.6          451.2     16.8     48.2     58.7     79.7
      4s        0          458.0          452.9     15.7     46.1     52.4     75.5
      5s        0          438.2          450.0     16.8     48.2     60.8     71.3
      6s        0          451.4          450.2     10.0     46.1     56.6     71.3
      7s        0          361.1          437.5     22.0     56.6     75.5    109.1
      8s        0          416.1          434.8      8.9     52.4     75.5     96.5
      9s        0          440.3          435.5     13.6     50.3     58.7     62.9
     10s        0          451.4          437.0      9.4     46.1     58.7     79.7

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
   10.0s        0           4371          437.0     18.3     14.2     48.2     65.0    109.1

@tbg (Member) commented Feb 3, 2018 via email

@nvanbenschoten (Member)

:lgtm: that's a nice improvement!

I realize we can actually avoid retries on non-fast-path (non-1PC) transactions as well with this knowledge about having no refresh spans.

Yeah, that's a good point. We can always allow transaction timestamps to be pushed forward to avoid retries up until the timestamp is leaked for the first time. To make that change I think we'll need to move the CanForwardCommitTimestamp flag from EndTransactionRequest onto the Transaction proto itself.


Reviewed 4 of 6 files at r2.
Review status: 6 of 7 files reviewed at latest revision, all discussions resolved, all commit checks successful.


pkg/roachpb/data.go, line 1145 at r2 (raw file):

	}
	// Also forward by the existing timestamp.
	ts.Forward(err.ExistingTimestamp.Add(0, 1))

Why was this suddenly caught? Did part of this change cause it to become more important?

s/.Add(0, 1)/.Next()/

Also, it's unfortunate that the discrepancy between ReadWithinUncertaintyIntervalError.ExistingTimestamp and WriteTooOldError.ActualTimestamp caused us to make this mistake. I'd prefer we change them to both reflect either the existing value's timestamp or the existing value's timestamp+1. Either way, let's prevent future confusion. If this isn't easily possible because of the migration concern, let's at least be more explicit about the disagreement.
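
For reference, a simplified model of the two hlc.Timestamp helpers under discussion (field names mirror the hlc package; Next's logical-overflow handling is elided):

```go
package main

import "fmt"

// Timestamp is a simplified stand-in for hlc.Timestamp: wall-clock
// nanoseconds plus a logical counter for ordering within one wall time.
type Timestamp struct {
	WallTime int64
	Logical  int32
}

// Add offsets both components, mirroring hlc.Timestamp.Add.
func (t Timestamp) Add(wallTime int64, logical int32) Timestamp {
	return Timestamp{WallTime: t.WallTime + wallTime, Logical: t.Logical + logical}
}

// Next returns the next-highest timestamp. The real hlc.Timestamp.Next also
// handles logical-counter overflow, elided here.
func (t Timestamp) Next() Timestamp {
	return Timestamp{WallTime: t.WallTime, Logical: t.Logical + 1}
}

func main() {
	ts := Timestamp{WallTime: 100, Logical: 0}
	// In the common case the two spellings agree; Next reads better.
	fmt.Println(ts.Add(0, 1) == ts.Next()) // true
}
```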



@nvanbenschoten (Member)

@spencerkimball, do you mind trying this change out on TPC-C as well? #15797 (comment) implies that TPC-C might also be affected by an approach like this.

@bdarnell (Contributor) commented Feb 5, 2018

Reviewed 1 of 7 files at r1, 6 of 6 files at r2.
Review status: all files reviewed at latest revision, 5 unresolved discussions, all commit checks successful.


pkg/kv/txn_coord_sender.go, line 464 at r2 (raw file):

			tc.mu.intentsSizeBytes = intentsSizeBytes

			if tc.mu.meta.Txn.IsSerializable() && tc.mu.meta.RefreshValid &&

While this is all fresh in your mind, make sure there is commentary somewhere about why we can't do this for snapshot transactions. (is it just because we don't accumulate refresh spans for non-serializable transactions or is there a deeper reason? If we weren't thinking about getting rid of snapshot completely, it might be worth saving refresh spans just to enable this optimization)

Maybe setting the isolation to snapshot should set RefreshValid to false so we don't need the separate IsSerializable check here.
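
A small runnable sketch of that suggestion, with simplified stand-ins for the txn metadata (hypothetical names, not the real types): if SNAPSHOT simply left RefreshValid false, the explicit isolation check would fall out of the condition.

```go
package main

import "fmt"

// txnMeta is a simplified stand-in for the txn coord sender's
// refresh-span metadata.
type txnMeta struct {
	Serializable  bool
	RefreshValid  bool
	RefreshReads  []string
	RefreshWrites []string
}

// canForwardCommitTimestamp mirrors the quoted check: tracked-and-empty
// refresh spans on a serializable txn.
func canForwardCommitTimestamp(m txnMeta) bool {
	return m.Serializable && m.RefreshValid &&
		len(m.RefreshReads) == 0 && len(m.RefreshWrites) == 0
}

func main() {
	// If SNAPSHOT txns initialized RefreshValid to false, the explicit
	// Serializable check above would become redundant.
	fmt.Println(canForwardCommitTimestamp(txnMeta{Serializable: false, RefreshValid: false})) // false
	fmt.Println(canForwardCommitTimestamp(txnMeta{Serializable: true, RefreshValid: true}))   // true
}
```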


pkg/kv/txn_coord_sender.go, line 466 at r2 (raw file):

			if tc.mu.meta.Txn.IsSerializable() && tc.mu.meta.RefreshValid &&
				len(tc.mu.meta.RefreshReads) == 0 && len(tc.mu.meta.RefreshWrites) == 0 {
				et.CanForwardCommitTimestamp = true

Can the DistSender split the batch below this point? We need to make sure to clear the CanForwardCommitTimestamp flag if that happens.


pkg/storage/replica.go, line 5238 at r2 (raw file):

		br, res, pErr = evaluateBatch(ctx, idKey, batch, rec, ms, ba)
		// If we can retry, set a higher batch timestamp and continue.
		if wtoErr, ok := pErr.GetDetail().(*roachpb.WriteTooOldError); ok && canRetry {

If retries is already 1, this will mutate ba.Timestamp but won't actually retry, which seems odd. On the other hand, since any concurrent writes should be blocked in the command queue and we look at all requests in the batch, we really shouldn't see more than one WriteTooOldError here. Maybe we should panic if we get more than one?


pkg/storage/replica.go, line 5487 at r2 (raw file):

					}
				}
				// Return error early on some requests for serializable isolation.

I think some comments about how we define "returning early" here (i.e. the end of the batch) would be helpful. This is a heuristic that may not always be optimal. Finishing the batch helps us choose the right timestamp in the event that there are multiple conflicts and the client retry will end up writing to the same keys the next time around. If the client behavior changes on retries then laying down the intents on this attempt was wasted work. If running the rest of the batch is beneficial, then the same logic might argue for running the entire remainder of the transaction. Returning a WriteTooOldError at the end of the batch is simply choosing a local minimum in the cost/benefit curve, since the cost of continuing the transaction is much less when we're not requiring additional round trips.



@spencerkimball (Member Author)

@nvanbenschoten are there instructions for running TPC-C?


Review status: all files reviewed at latest revision, 5 unresolved discussions, all commit checks successful.


pkg/kv/txn_coord_sender.go, line 464 at r2 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

While this is all fresh in your mind, make sure there is commentary somewhere about why we can't do this for snapshot transactions. (is it just because we don't accumulate refresh spans for non-serializable transactions or is there a deeper reason? If we weren't thinking about getting rid of snapshot completely, it might be worth saving refresh spans just to enable this optimization)

Maybe setting the isolation to snapshot should set RefreshValid to false so we don't need the separate IsSerializable check here.

We could do this for snapshot transactions, but I don't think we should. As you point out, we don't accumulate the refresh spans for SNAPSHOT. I'm not sure it makes sense to add any commentary, considering the reason is so straightforward.


pkg/kv/txn_coord_sender.go, line 466 at r2 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Can the DistSender split the batch below this point? We need to make sure to clear the CanForwardCommitTimestamp flag if that happens.

It can. Originally the name of this flag was NoRefreshSpans, which was specifically meant to remain accurate for that case, but then it was suggested to change the name and I didn't remember this case when agreeing. I think I'm going to change this back to NoRefreshSpans.


pkg/roachpb/data.go, line 1145 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Why was this suddenly caught? Did part of this change cause it to become more important?

s/.Add(0, 1)/.Next()/

Also, It's unfortunate that the discrepancy between ReadWithinUncertaintyIntervalError.ExistingTimestamp and WriteTooOldError.ActualTimestamp caused us to make this mistake. I'd prefer we change them to both reflect either the existing value's timestamp or the existing value's timestamp+1. Either way, let's prevent future confusion. If this isn't easily possible because of the migration concern, let's at least be more explicit about the disagreement.

I added this for some unit test cases; it never fires in real life. But it turns out that when I first added it, just using ExistingTimestamp was incorrect – we need to add a logical tick to make sure we don't get another WriteTooOldError. Nothing prompted this fix except realizing the original change was incorrect.


pkg/storage/replica.go, line 5238 at r2 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

If retries is already 1, this will mutate ba.Timestamp but won't actually retry, which seems odd. On the other hand, since any concurrent writes should be blocked in the command queue and we look at all requests in the batch, we really shouldn't see more than one WriteTooOldError here. Maybe we should panic if we get more than one?

You can get more than one WriteTooOldError in the case where a non-transactional batch updates the same key in more than one request. Obviously this can also, by extension, happen to a transactional batch which is attempting to execute on the fast path. Added a comment.


pkg/storage/replica.go, line 5487 at r2 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

I think some comments about how we define "returning early" here (i.e. the end of the batch) would be helpful. This is a heuristic that may not always be optimal. Finishing the batch helps us choose the right timestamp in the event that there are multiple conflicts and the client retry will end up writing to the same keys the next time around. If the client behavior changes on retries then laying down the intents on this attempt was wasted work. If running the rest of the batch is beneficial, then the same logic might argue for running the entire remainder of the transaction. Returning a WriteTooOldError at the end of the batch is simply choosing a local minimum in the cost/benefit curve, since the cost of continuing the transaction is much less when we're not requiring additional round trips.

Added to the comment.

An important part of the story here is when this returning early situation is triggered by a conditional put (increment is not even really a factor in practice). On a conditional put, the chance that the retry will succeed is pretty slim, and so laying down intents is almost always going to be a bad idea. The reason cputs are likely to fail is that they're often either based on a read of the underlying value which will change on the forwarded retry timestamp, or they're expecting an empty value, which won't be true on the retry.
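
To make the cput point concrete, here is a toy MVCC sketch (illustrative types only, not the real engine): the condition is re-evaluated against whatever value is visible at the forwarded timestamp, not the value the client originally read.

```go
package main

import "fmt"

// A toy MVCC map for illustration only: each key holds a list of
// (timestamp, value) versions.
type version struct {
	ts  int64
	val string
}

type store map[string][]version

// read returns the newest value at or below ts.
func (s store) read(key string, ts int64) (string, bool) {
	val, found, best := "", false, int64(-1)
	for _, v := range s[key] {
		if v.ts <= ts && v.ts > best {
			val, found, best = v.val, true, v.ts
		}
	}
	return val, found
}

// cput writes val at ts only if the value visible at ts equals exp.
func (s store) cput(key, val, exp string, ts int64) error {
	cur, _ := s.read(key, ts)
	if cur != exp {
		return fmt.Errorf("condition failed: have %q, expected %q", cur, exp)
	}
	s[key] = append(s[key], version{ts: ts, val: val})
	return nil
}

func main() {
	s := store{"k": {{ts: 1, val: "a"}}}

	// The client read "a" at ts=1 and issues cput(k, "b", exp="a").
	// A conflicting write lands at ts=2 before the cput evaluates...
	s["k"] = append(s["k"], version{ts: 2, val: "x"})

	// ...so retrying the cput at a forwarded ts=3 still fails: the condition
	// is checked against the newer value, not the value originally read.
	fmt.Println(s.cput("k", "b", "a", 3))
}
```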



@spencerkimball force-pushed the avoid-client-retries-on-1pc branch from 662e706 to c6f9b92 on February 5, 2018 15:41
@nvanbenschoten (Member)

@spencerkimball yes, the load generator is here. You should just be able to run ./tpcc -drop -load -warehouses=1. Try with both -no-wait=false and -no-wait=true.


Review status: 1 of 7 files reviewed at latest revision, 5 unresolved discussions, some commit checks failed.



@jordanlewis (Member)

FWIW no-wait=false won't be very interesting with just 1 warehouse, because the load generator will mostly be sleeping.

@spencerkimball (Member Author)

The results are not appreciably different for the tpcc load generator with this change.

@spencerkimball merged commit bb1ce80 into cockroachdb:master on Feb 5, 2018
@spencerkimball deleted the avoid-client-retries-on-1pc branch on February 5, 2018 21:52
@bdarnell (Contributor) commented Feb 5, 2018

Review status: 1 of 7 files reviewed at latest revision, 5 unresolved discussions, all commit checks successful.


pkg/kv/txn_coord_sender.go, line 466 at r2 (raw file):

It can.

Then don't we need to clear this flag (whatever it's called) when DistSender splits the batch? The sub-batches could create refresh spans but the EndTransaction sent on its own would still have NoRefreshSpans set.


pkg/roachpb/api.proto, line 463 at r2 (raw file):

  // path. For this to be true, the transaction must not have
  // accumulated any refreshable spans (e.g. from reads, cputs, etc.),
  // and must not have "leaked" a commit timestamp to the SQL client.

What happened to the part about leaking a commit timestamp to the SQL client? Don't we still need to do that?



@andreimatei (Contributor)

Review status: 1 of 7 files reviewed at latest revision, 6 unresolved discussions, all commit checks successful.


pkg/roachpb/api.proto, line 463 at r2 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

What happened to the part about leaking a commit timestamp to the SQL client? Don't we still need to do that?

On a related note, I want to check that we're good on other txns that care particularly about their timestamps – the incremental backup roll-up txns. Are those unaffected by this series of changes?



@andreimatei (Contributor)

Review status: 1 of 7 files reviewed at latest revision, 6 unresolved discussions, all commit checks successful.


pkg/roachpb/api.proto, line 463 at r2 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

On a related note, I want to check that we're good on other txns that care about their timestamps particularly - the incremental backup roll-up txns. Are those unaffected by this series of changes?

Cc @danhhz @knz



@spencerkimball (Member Author)

The txn start time was not always the txn commit time, as with SNAPSHOT txns. So even if now() used to "work", it didn't actually work with both isolation levels.



@knz (Contributor) commented Feb 6, 2018

yeah I am not concerned about snapshot. We never pretended to do anything correct with it anyway :)
The jepsen test suite didn't exercise snapshot either (AFAIK)

Meanwhile, I think we can defer working on this until 2.1 work starts, assuming we clearly document what is going on. A user who wishes to preserve the pre-optimization semantics can tweak their SQL code to force the txn to become 2PC, in which case I believe the previous behavior holds (is that correct?)

@petermattis (Collaborator)

It used to be that the txn start time was also the txn commit time, so it worked that way, yes. Now that is no longer the case.

I must have missed context here: why does now() need to be the transaction commit time? You're clearly worried about something that I'm not understanding.


Review status: 1 of 7 files reviewed at latest revision, 6 unresolved discussions, all commit checks successful.



@knz (Contributor) commented Feb 6, 2018

The monotonic inserts test: insert into t(x, y) select max(x)+1, now(), issued from multiple concurrent clients, must see the two columns x and y grow monotonically. This is only true if now() is monotonic with commit order.

There are variants of this that use multiple tables and check this is also true even when the txns do not conflict on writes (just on reads).
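
A minimal sketch of that monotonicity check (hypothetical table t and connection URL, using Go's database/sql):

```go
package main

import (
	"database/sql"
	"log"
	"sync"
	"time"

	_ "github.com/lib/pq" // CockroachDB speaks the PostgreSQL wire protocol.
)

func main() {
	// Hypothetical local single-node cluster; adjust the URL as needed.
	db, err := sql.Open("postgres",
		"postgresql://root@localhost:26257/test?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	if _, err := db.Exec(
		`CREATE TABLE IF NOT EXISTS t (x INT PRIMARY KEY, y TIMESTAMP)`); err != nil {
		log.Fatal(err)
	}

	// Several concurrent clients each append rows one past the current max.
	var wg sync.WaitGroup
	for c := 0; c < 8; c++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < 50; i++ {
				// Conflicting inserts may fail on the duplicate key; that's
				// fine, only committed rows matter for the check below.
				_, _ = db.Exec(
					`INSERT INTO t(x, y) SELECT coalesce(max(x), 0)+1, now() FROM t`)
			}
		}()
	}
	wg.Wait()

	// If now() is monotonic with commit order, y must be non-decreasing in x.
	rows, err := db.Query(`SELECT y FROM t ORDER BY x`)
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()
	var prev time.Time
	for rows.Next() {
		var y time.Time
		if err := rows.Scan(&y); err != nil {
			log.Fatal(err)
		}
		if y.Before(prev) {
			log.Fatal("now() not monotonic with commit order")
		}
		prev = y
	}
}
```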

@knz (Contributor) commented Feb 6, 2018

That monotonic insert test as written above, however, does not use 1PC (because it needs to read the max), so it's not impacted. But I wonder if there are other possible equivalent tests that can use 1PC and that would be impacted.

@andreimatei (Contributor)

This particular PR is about 1PC, but the broader set of changes affects all txns.

@knz (Contributor) commented Feb 6, 2018

And just to be clear, a disclaimer: please take my comments above with a grain of salt; it's been a while since I last checked what these tests were actually testing. I recall we went through multiple iterations back then to properly define which tests were correct (i.e., that the behavior they were testing was strictly required by the most general definition of serializable isolation).

@petermattis (Collaborator)

The monotonic inserts test: insert into t(x, y) select max(x)+1, now(), issued from multiple concurrent clients, must see the two columns x and y grow monotonically. This is only true if now() is monotonic with commit order.

There are variants of this that use multiple tables that check this is also true even when the txns do not conflict on writes (just on reads).

Ok, this is vastly different from saying now() must be monotonic with commit order, because there are other interactions between the two transactions which provide that guarantee. Without those interactions, I don't see how or why we would provide that guarantee about now().

@knz (Contributor) commented Feb 6, 2018

Yes, that is why I said above "only true if the txns have touching keys".

@bdarnell (Contributor) commented Feb 6, 2018

I checked and our jepsen tests do not exercise the 1PC code path. Should we schedule a roadmap item to make it happen in 2.1?

It might be nice, but I wouldn't prioritize it. The current issue about commit timestamps changing applies to both one- and two-phase transactions.


Review status: 1 of 7 files reviewed at latest revision, 6 unresolved discussions, all commit checks successful.


pkg/kv/txn_coord_sender.go, line 466 at r2 (raw file):

Previously, spencerkimball (Spencer Kimball) wrote…

We don't need to clear the flag right now and in fact we still want it. Even when the batch is split, if NoRefreshSpans is true, it means we can commit a serializable txn despite the timestamp having moved forward. I'm going to make a further change to effect that optimization. When we do that, we'll need to reset this flag more precisely at the dist sender to take into consideration any part of the split batch, already executed, which will generate a refresh span (e.g. cputs and incs).

I don't understand. If the batch is split, then the first parts of the batch could create refresh spans, invalidating the NoRefreshSpans flag on the EndTransactionRequest. (I'm talking about the splitting by type of BatchRequest.Split, not the splitting by key of kv.truncate)


pkg/roachpb/api.proto, line 463 at r2 (raw file):

We don't currently leak the commit timestamp to the SQL client before the commit

But unless you're in snapshot isolation, if there's a change in the timestamp the transaction would restart and the SQL layer would see the updated OrigTimestamp.

I'm not too concerned personally about most of the SQL timestamp functions (although I remember there was a lot of consideration given to what we should provide here, so maybe I'm forgetting something important). However, cluster_logical_timestamp does need to see the commit timestamp. I'm pretty sure this could break the monotonic jepsen test.



@spencerkimball (Member Author)

Review status: 1 of 7 files reviewed at latest revision, 6 unresolved discussions, all commit checks successful.


pkg/kv/txn_coord_sender.go, line 466 at r2 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

I don't understand. If the batch is split, then the first parts of the batch could create refresh spans, invalidating the NoRefreshSpans flag on the EndTransactionRequest. (I'm talking about the splitting by type of BatchRequest.Split, not the splitting by key of kv.truncate)

Well, the code in this particular PR only executes on the 1PC path, so the batch could not have been split. The change I'm working on now will allow a replica-side retry optimization, and that change resets the NoRefreshSpans flag if the batch is split and the previously-executed split of the batch had refresh spans.


pkg/roachpb/api.proto, line 463 at r2 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

We don't currently leak the commit timestamp to the SQL client before the commit

But unless you're in snapshot isolation, if there's a change in the timestamp the transaction would restart and the SQL layer would see the updated OrigTimestamp.

I'm not too concerned personally about most of the SQL timestamp functions (although I remember there was a lot of consideration given to what we should provide here, so maybe I'm forgetting something important). However, cluster_logical_timestamp does need to see the commit timestamp. I'm pretty sure this could break the monotonic jepsen test.

Looks like we're having to handle this case after all. The fixes are in an upcoming change.



@bdarnell (Contributor) commented Feb 7, 2018

Review status: 1 of 7 files reviewed at latest revision, 6 unresolved discussions, all commit checks successful.


pkg/kv/txn_coord_sender.go, line 466 at r2 (raw file):

Well the code in this particular PR only executes on the 1PC path

Ah, that's what I was missing.

The change I'm working on now will allow replica-side retry optimization, and that change resets the NoRefreshSpans flag if the batch is split and if the previously-executed split of the batch had refresh spans.

OK. That change is going to cause problems for mixed-version clusters, since nodes that only have this commit will be incorrectly sending NoRefreshSpans on EndTransactionRequests that have been split. An easy way to get around that would be to change the tag number for the NoRefreshSpans proto field.



@spencerkimball (Member Author)

pkg/kv/txn_coord_sender.go, line 466 at r2 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Well the code in this particular PR only executes on the 1PC path

Ah, that's what I was missing.

The change I'm working on now will allow replica-side retry optimization, and that change resets the NoRefreshSpans flag if the batch is split and if the previously-executed split of the batch had refresh spans.

OK. That change is going to cause problems for mixed-version clusters, since nodes that only have this commit will be incorrectly sending NoRefreshSpans on EndTransactionRequests that have been split. An easy way to get around that would be to change the tag number for the NoRefreshSpans proto field.

I didn't realize we needed to handle mixed-version clusters on these unstable point releases. I don't think that should be a requirement, especially for a problem like this.



@bdarnell (Contributor) commented Feb 7, 2018

Review status: 1 of 7 files reviewed at latest revision, 6 unresolved discussions, all commit checks successful.


pkg/kv/txn_coord_sender.go, line 466 at r2 (raw file):

Previously, spencerkimball (Spencer Kimball) wrote…

I didn't realize we needed to handle mixed-version clusters on these unstable point releases. I don't think that should be a requirement, especially for a problem like this.

We do rolling upgrades of our continuous and test clusters, and sometimes tell customers to do the same. We generally don't do this with critical data, but it's still good to get into the habit of compatibility between versions at any time. (In this case, the best solution would have been to clear the NoRefreshSpans on split in the original commit, but now that it's there and not cleared I think we should change the tag number to preserve compatibility)



ts := ba.Timestamp
if br != nil {
	ts = br.Timestamp
}
A Member commented:

@spencerkimball or @bdarnell, do you know what makes this safe in relation to lease transfers? Specifically, a batch goes through the following steps:

1. update the local node's clock with `ba.Timestamp` when passing through the Store
2. under lock, we check whether a lease transfer is in flight
2a. if so, wait for the lease to stabilize
3. if not, evaluate the batch without ever coordinating with possible lease transfers again
4. bump timestamp cache with batch's timestamp (this code)

From the lease transfers point of view:

1. lock
2. grab a local clock reading
3. start transfer with that time, which will be used as the new leaseholder's tscache lower bound
4. unlock

Since this timestamp is larger than the original ba.Timestamp, isn't it possible that its contribution to the timestamp cache is missed? Are we relying on some implicit assumption that this timestamp will never be bumped past clock.Now()? Is that a fair assumption to make in all cases? Are we assuming that this timestamp will always be based on existing data on the Range, which in turn must be at a lower timestamp than the clock on the Store that houses it? If so, I don't see how we're confident about that conclusion, given that we only set transactional batches' timestamps to the txn's orig timestamp (where it reads) and we don't include a timestamp when resolving intents.

A Contributor commented:

Yes, I think we're implicitly relying on an assumption that any KV data that might bump the response timestamp will have been through some path that would have updated the HLC. I don't know whether that's always valid. It sounds like we might be missing some cases to pull timestamps out of the request (like in intent resolution).

The Member Author commented:

It's safe because br.Timestamp is only ever forwarded to the timestamp of a key written to the current range. In other words, the local clock must already be at or past the timestamp of any value written to the range. You might want to augment the comment here.
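
A toy sketch of the assumption under discussion (illustrative only): if every request forwards the node's clock before evaluation, the clock is always an upper bound on the timestamps of local writes, which is what the lease-transfer reasoning relies on.

```go
package main

import "fmt"

// clock is a toy stand-in for a node's HLC: it only ever moves forward.
type clock struct{ wall int64 }

// update forwards the clock by an incoming request timestamp, done before
// the request is evaluated.
func (c *clock) update(ts int64) {
	if ts > c.wall {
		c.wall = ts
	}
}

func main() {
	var c clock
	for _, baTS := range []int64{5, 3, 9} {
		c.update(baTS)
	}
	// The clock is >= every timestamp it has admitted, so a clock reading
	// taken during a lease transfer bounds all prior local writes.
	fmt.Println(c.wall) // 9
}
```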

A Member commented:

This thought experiment blossomed into the beautiful issue we see in #36431. Over there we note that even if we did update the leaseholder's HLC clock on intent resolution, we'd still run into issues with transaction uncertainty when intents are moved forward.

We'll need to address that issue by enforcing clock updates during txn refreshes. In addition, we should do more to ensure that clock updates are never missed, even when BatchRequests don't explicitly set a header timestamp. Trying to address this on a case-by-case basis (e.g. #35297 and #36307) seems like a fraught effort. There has been some discussion about adding a gRPC interceptor to deal with clock updates at a lower-level than in the BatchRequest domain.

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Mar 1, 2019
This new assertion would fire if a bug like the one hypothesized in
the following comment exists:
cockroachdb#22315 (comment)

Release note: None
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Mar 2, 2019
craig bot pushed a commit that referenced this pull request Mar 2, 2019
35297: storage: ensure that PushTxn request's timestamp cache updates are safe r=nvanbenschoten a=nvanbenschoten

This PR's overall goal is to ensure that the timestamp cache updates performed by `PushTxn` requests are always safe. It does this in a series of steps:
1. it makes its `PushTo` argument inclusive, making it easier to use and easier to assert against (e.g. `req.PushTo <= req.Timestamp`).
2. it removes the `Now` argument and begins using the batch's `Timestamp` field instead. This field is used to update the receiving store's clock _before_ the request is evaluated and before the lease transfer check is performed.
3. it adds an assertion that all timestamp cache updates are safe given their local clock. This will also catch hypothesized bugs like #22315 (comment).

I'm planning on getting this in to 19.1.

Co-authored-by: Nathan VanBenschoten <[email protected]>
tbg added a commit to tbg/cockroach that referenced this pull request Oct 29, 2019
This fixes a bug introduced in cockroachdb#22315 with the introduction of a "local
retry" to avoid WriteTooOldError inside of 1PC transactions that were
not carrying out any reads. Such a transaction can simply be
re-evaluated at a higher timestamp, which can allow it to commit.

The bug was that when such a re-evaluation was carried out, we were
not discarding the MVCCStats accrued from the first attempt. In
effect, the request would thus be double-counted in stats, which
would set off the consistency checker.

The fix is easy: discard the delta before re-evaluating.

Fixes cockroachdb#31870.

Release note (bug fix): remove a source of (benign) stats
inconsistencies (i.e. the stats of a range not accurately reflecting its
contents).
craig bot pushed a commit that referenced this pull request Oct 29, 2019
41983: engine: don't create aux dir for read-only engine r=petermattis a=tbg

It was mildly irritating that

    ./cockroach debug raft-log not/there 1

would create `not/there/auxiliary`.

Release note: None

41986: storage: fix stats inconsistency due to WriteTooOld optimization r=irfansharif a=tbg

This fixes a bug introduced in #22315 with the introduction of a "local
retry" to avoid WriteTooOldError inside of 1PC transactions that were
not carrying out any reads. Such a transaction can simply be
re-evaluated at a higher timestamp, which can allow it to commit.

The bug was that when such a re-evaluation was carried out, we were
not discarding the MVCCStats accrued from the first attempt. In
effect, the request would thus be double-counted in stats, which
would set off the consistency checker.

The fix is easy: discard the delta before re-evaluating.

Fixes #31870.

Release note (bug fix): remove a source of (benign) stats
inconsistencies (i.e. the stats of a range not accurately reflecting its
contents).

Co-authored-by: Tobias Schottdorf <[email protected]>
tbg added a commit to tbg/cockroach that referenced this pull request Nov 4, 2019