storage: minimize retries with 1PC commits #22315

Conversation

@spencerkimball (Member) commented Feb 1, 2018

Introduce a new NoRefreshSpans field to the EndTransactionRequest
arguments. This specifies that a serializable isolation transaction has
accumulated no refresh spans. On a 1PC commit, this can be used to
avoid serializable restarts by re-executing the 1PC transaction at an
appropriately higher timestamp when the batch's timestamp has been
forwarded by the timestamp cache or by write-too-old errors.

When evaluating a write batch, we now allow a local retry on
write-too-old errors for non-transactional batches, and for serializable
1PC txns where NoRefreshSpans is true.

Release note: None
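
To illustrate the mechanism, here is a minimal, self-contained sketch of the local-retry idea (hypothetical helper and names; the actual logic lives in the replica's write-batch evaluation):

```go
package main

import "fmt"

// evaluateWithLocalRetry is a hypothetical stand-in for the new replica-side
// logic. eval evaluates the batch at the given timestamp and returns the
// newest conflicting write's timestamp across all requests in the batch, or
// 0 if there was no write-too-old condition.
func evaluateWithLocalRetry(
	baTS int64,
	canRetryLocally bool, // non-txn batch, or serializable 1PC txn with NoRefreshSpans
	eval func(ts int64) (conflictTS int64),
) (int64, error) {
	const maxLocalRetries = 1
	for attempt := 0; ; attempt++ {
		conflictTS := eval(baTS)
		if conflictTS == 0 {
			return baTS, nil // success at baTS; no client-side restart needed
		}
		if !canRetryLocally || attempt >= maxLocalRetries {
			return 0, fmt.Errorf("WriteTooOldError: batch ts %d <= existing ts %d", baTS, conflictTS)
		}
		// Forward the batch timestamp one tick past the newest conflicting
		// write and re-evaluate on the replica instead of returning an error.
		baTS = conflictTS + 1
	}
}

func main() {
	// A conflicting write already exists at ts=5; our batch arrives at ts=3.
	eval := func(ts int64) int64 {
		if ts <= 5 {
			return 5
		}
		return 0
	}
	fmt.Println(evaluateWithLocalRetry(3, true, eval)) // 6 <nil>: retried locally
}
```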

@spencerkimball requested review from bdarnell, nvanbenschoten and a team on February 1, 2018 23:20
@cockroach-teamcity (Member)

This change is Reviewable

@spencerkimball force-pushed the avoid-client-retries-on-1pc branch from b1d7e68 to c2a562e on February 2, 2018 20:35
@spencerkimball (Member Author)

@nvanbenschoten this now evaluates the full batch for both transactional and non-transactional batches to find the highest timestamp needed to satisfy write-too-old errors. Local retries on the replica at evaluation time are always allowed for non-transactional batches, and for serializable 1PC transactional batches as long as they have the NoRefreshSpans flag set to true.

@nvanbenschoten (Member)

Thanks @spencerkimball, I'm more comfortable with the change now that we're attempting to minimize the timestamp that these transactions will be pushed to. This eliminates the increase in WriteTooOld errors that you were seeing before, right?

@spencerkimball (Member Author)

@nvanbenschoten it does eliminate those, yes.

@nvanbenschoten (Member)

Reviewed 5 of 7 files at r1.
Review status: 5 of 7 files reviewed at latest revision, all discussions resolved, some commit checks failed.


pkg/roachpb/api.proto, line 458 at r1 (raw file):

  // intents are left in the event of an error.
  bool require_1pc = 6 [(gogoproto.customname) = "Require1PC"];
  // Set to true if this transaction is serializable isolation and has

Does it ever make sense to set this for snapshot txns?


pkg/roachpb/api.proto, line 461 at r1 (raw file):

// If true, one phase commits can use the executing server's
// hlc clock now timestamp to avoid retries.

replace "use the executing server's hlc clock now timestamp" with "increase their timestamp on the executing server"


pkg/roachpb/api.proto, line 462 at r1 (raw file):

  // etc.). If true, one phase commits can use the executing server's
  // hlc clock now timestamp to avoid retries.
  bool no_refresh_spans = 7;

Do you think we should make this flag even more general? Something like committed_to_timestamp or timestamp_negotiable.

As discussed with @andreimatei, we don't want to increase our transaction timestamp if we've done anything that relies on our initial timestamp. This includes having performed reads and writes at the timestamp but can also include leaking the timestamp in other ways like reporting it to SQL functions.


pkg/storage/replica.go, line 5209 at r1 (raw file):

	rec := NewReplicaEvalContext(r, spans)
	// We can retry locally if this is a non-transactional request.

nit: pull ba.Txn == nil into a variable like you did above with retryLocally so that it's easier to see the parallels.


pkg/storage/replica.go, line 5273 at r1 (raw file):

	if retry, reason := isEndTransactionTriggeringRetryError(ba.Txn); retry {
		// We can still attempt the one phase commit if the end
		// transaction request specifies there are no refresh spans.

Is this actually possible? Doesn't isEndTransactionTriggeringRetryError == true imply that the txn has performed previous writes?



@spencerkimball force-pushed the avoid-client-retries-on-1pc branch from c2a562e to 057d20f on February 2, 2018 22:42
@spencerkimball (Member Author)

Review status: 5 of 7 files reviewed at latest revision, 5 unresolved discussions, all commit checks successful.


pkg/roachpb/api.proto, line 458 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Does it ever make sense to set this for snapshot txns?

We don't track refresh spans for snapshot txns because we don't use them for refreshing, so this optimization doesn't apply to SNAPSHOT.


pkg/roachpb/api.proto, line 461 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

// If true, one phase commits can use the executing server's
// hlc clock now timestamp to avoid retries.

replace "use the executing server's hlc clock now timestamp" with "increase their timestamp on the executing server"

Oops, this was a leftover bit.


pkg/roachpb/api.proto, line 462 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Do you think we should make this flag even more general? Something like committed_to_timestamp or timestamp_negotiable.

As discussed with @andreimatei, we don't want to increase our transaction timestamp if we've done anything that relies on our initial timestamp. This includes having performed reads and writes at the timestamp but can also include leaking the timestamp in other ways like reporting it to SQL functions.

OK, I'm fine with that. Changed it to can_forward_commit_timestamp; "negotiable" sounds like it could do something other than move forward.


pkg/storage/replica.go, line 5209 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: pull ba.Txn == nil into a variable like you did above with retryLocally so that it's easier to see the parallels.

Done.


pkg/storage/replica.go, line 5273 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Is this actually possible? Doesn't isEndTransactionTriggeringRetryError == true imply that the txn has performed previous writes?

Yes, this happens all the time if the batch affects keys which have newer timestamps in the timestamp cache. I improved the comment.



@spencerkimball (Member Author)

@nvanbenschoten I realize we can actually avoid retries on non-fast-path (non-1PC) transactions as well with this knowledge about having no refresh spans.

@spencerkimball force-pushed the avoid-client-retries-on-1pc branch from 057d20f to 662e706 on February 2, 2018 23:39
@tbg (Member) commented Feb 2, 2018 via email

@spencerkimball (Member Author)

Fixes #15797

With this change, kv --concurrency=8 --cycle-length=1 exhibits:

_elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
      1s        0         2855.9         2855.9      2.8      3.4      5.0      6.8
      2s        0         2950.1         2903.0      2.8      3.0      4.7      6.8
      3s        0         2886.7         2897.6      2.8      3.3      5.8      7.9
      4s        0         2899.1         2898.0      2.8      3.1      5.8      7.3
      5s        0         2937.1         2905.8      2.6      3.0      6.3      8.9
      6s        0         2915.8         2907.5      2.8      3.0      6.6      7.1
      7s        0         2862.1         2901.0      2.8      3.1      7.1     11.5
      8s        0         2717.8         2878.1      2.8      3.9      8.9     15.2
      9s        0         2750.0         2863.8      2.8      3.5      8.1     11.5
     10s        0         2872.2         2864.7      2.8      3.1      8.4      9.4

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
   10.0s        0          28649         2864.7      2.8      2.8      3.3      6.8     15.2

The previous SHA's performance with the same options:

_elapsed___errors__ops/sec(inst)___ops/sec(cum)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
      1s        0          462.7          462.7     11.5     48.2     62.9     79.7
      2s        0          449.4          456.0     16.3     46.1     65.0     88.1
      3s        0          441.6          451.2     16.8     48.2     58.7     79.7
      4s        0          458.0          452.9     15.7     46.1     52.4     75.5
      5s        0          438.2          450.0     16.8     48.2     60.8     71.3
      6s        0          451.4          450.2     10.0     46.1     56.6     71.3
      7s        0          361.1          437.5     22.0     56.6     75.5    109.1
      8s        0          416.1          434.8      8.9     52.4     75.5     96.5
      9s        0          440.3          435.5     13.6     50.3     58.7     62.9
     10s        0          451.4          437.0      9.4     46.1     58.7     79.7

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
   10.0s        0           4371          437.0     18.3     14.2     48.2     65.0    109.1

@tbg (Member) commented Feb 3, 2018 via email

@nvanbenschoten (Member)

:lgtm: that's a nice improvement!

I realize we can actually avoid retries on non-fast-path (non-1PC) transactions as well with this knowledge about having no refresh spans.

Yeah, that's a good point. We can always allow transaction timestamps to be pushed forward to avoid retries up until the timestamp is leaked for the first time. To make that change I think we'll need to move the CanForwardCommitTimestamp flag from EndTransactionRequest onto the Transaction proto itself.


Reviewed 4 of 6 files at r2.
Review status: 6 of 7 files reviewed at latest revision, all discussions resolved, all commit checks successful.


pkg/roachpb/data.go, line 1145 at r2 (raw file):

	}
	// Also forward by the existing timestamp.
	ts.Forward(err.ExistingTimestamp.Add(0, 1))

Why was this suddenly caught? Did part of this change cause it to become more important?

s/.Add(0, 1)/.Next()/

Also, it's unfortunate that the discrepancy between ReadWithinUncertaintyIntervalError.ExistingTimestamp and WriteTooOldError.ActualTimestamp caused us to make this mistake. I'd prefer we change them to both reflect either the existing value's timestamp or the existing value's timestamp+1. Either way, let's prevent future confusion. If this isn't easily possible because of the migration concern, let's at least be more explicit about the disagreement.
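
For reference, a simplified model of the two hlc.Timestamp helpers under discussion (field names mirror the hlc package; Next's logical-overflow handling is elided):

```go
package main

import "fmt"

// Timestamp is a simplified stand-in for hlc.Timestamp: wall-clock
// nanoseconds plus a logical counter for ordering within one wall time.
type Timestamp struct {
	WallTime int64
	Logical  int32
}

// Add offsets both components, mirroring hlc.Timestamp.Add.
func (t Timestamp) Add(wallTime int64, logical int32) Timestamp {
	return Timestamp{WallTime: t.WallTime + wallTime, Logical: t.Logical + logical}
}

// Next returns the next-highest timestamp. The real hlc.Timestamp.Next also
// handles logical-counter overflow, elided here.
func (t Timestamp) Next() Timestamp {
	return Timestamp{WallTime: t.WallTime, Logical: t.Logical + 1}
}

func main() {
	ts := Timestamp{WallTime: 100, Logical: 0}
	// In the common case the two spellings agree; Next reads better.
	fmt.Println(ts.Add(0, 1) == ts.Next()) // true
}
```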



@nvanbenschoten (Member)

@spencerkimball, do you mind trying this change out on TPC-C as well? #15797 (comment) implies that TPC-C might also be affected by an approach like this.

@bdarnell (Contributor) commented Feb 5, 2018

Reviewed 1 of 7 files at r1, 6 of 6 files at r2.
Review status: all files reviewed at latest revision, 5 unresolved discussions, all commit checks successful.


pkg/kv/txn_coord_sender.go, line 464 at r2 (raw file):

			tc.mu.intentsSizeBytes = intentsSizeBytes

			if tc.mu.meta.Txn.IsSerializable() && tc.mu.meta.RefreshValid &&

While this is all fresh in your mind, make sure there is commentary somewhere about why we can't do this for snapshot transactions. (is it just because we don't accumulate refresh spans for non-serializable transactions or is there a deeper reason? If we weren't thinking about getting rid of snapshot completely, it might be worth saving refresh spans just to enable this optimization)

Maybe setting the isolation to snapshot should set RefreshValid to false so we don't need the separate IsSerializable check here.
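
A small runnable sketch of that suggestion, with simplified stand-ins for the txn metadata (hypothetical names, not the real types): if SNAPSHOT simply left RefreshValid false, the explicit isolation check would fall out of the condition.

```go
package main

import "fmt"

// txnMeta is a simplified stand-in for the txn coord sender's
// refresh-span metadata.
type txnMeta struct {
	Serializable  bool
	RefreshValid  bool
	RefreshReads  []string
	RefreshWrites []string
}

// canForwardCommitTimestamp mirrors the quoted check: tracked-and-empty
// refresh spans on a serializable txn.
func canForwardCommitTimestamp(m txnMeta) bool {
	return m.Serializable && m.RefreshValid &&
		len(m.RefreshReads) == 0 && len(m.RefreshWrites) == 0
}

func main() {
	// If SNAPSHOT txns initialized RefreshValid to false, the explicit
	// Serializable check above would become redundant.
	fmt.Println(canForwardCommitTimestamp(txnMeta{Serializable: false, RefreshValid: false})) // false
	fmt.Println(canForwardCommitTimestamp(txnMeta{Serializable: true, RefreshValid: true}))   // true
}
```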


pkg/kv/txn_coord_sender.go, line 466 at r2 (raw file):

			if tc.mu.meta.Txn.IsSerializable() && tc.mu.meta.RefreshValid &&
				len(tc.mu.meta.RefreshReads) == 0 && len(tc.mu.meta.RefreshWrites) == 0 {
				et.CanForwardCommitTimestamp = true

Can the DistSender split the batch below this point? We need to make sure to clear the CanForwardCommitTimestamp flag if that happens.


pkg/storage/replica.go, line 5238 at r2 (raw file):

		br, res, pErr = evaluateBatch(ctx, idKey, batch, rec, ms, ba)
		// If we can retry, set a higher batch timestamp and continue.
		if wtoErr, ok := pErr.GetDetail().(*roachpb.WriteTooOldError); ok && canRetry {

If retries is already 1, this will mutate ba.Timestamp but won't actually retry, which seems odd. On the other hand, since any concurrent writes should be blocked in the command queue and we look at all requests in the batch, we really shouldn't see more than one WriteTooOldError here. Maybe we should panic if we get more than one?


pkg/storage/replica.go, line 5487 at r2 (raw file):

					}
				}
				// Return error early on some requests for serializable isolation.

I think some comments about how we define "returning early" here (i.e. the end of the batch) would be helpful. This is a heuristic that may not always be optimal. Finishing the batch helps us choose the right timestamp in the event that there are multiple conflicts and the client retry will end up writing to the same keys the next time around. If the client behavior changes on retries then laying down the intents on this attempt was wasted work. If running the rest of the batch is beneficial, then the same logic might argue for running the entire remainder of the transaction. Returning a WriteTooOldError at the end of the batch is simply choosing a local minimum in the cost/benefit curve, since the cost of continuing the transaction is much less when we're not requiring additional round trips.



@spencerkimball (Member Author)

@nvanbenschoten are there instructions for running TPC-C?


Review status: all files reviewed at latest revision, 5 unresolved discussions, all commit checks successful.


pkg/kv/txn_coord_sender.go, line 464 at r2 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

While this is all fresh in your mind, make sure there is commentary somewhere about why we can't do this for snapshot transactions. (is it just because we don't accumulate refresh spans for non-serializable transactions or is there a deeper reason? If we weren't thinking about getting rid of snapshot completely, it might be worth saving refresh spans just to enable this optimization)

Maybe setting the isolation to snapshot should set RefreshValid to false so we don't need the separate IsSerializable check here.

We could do this for snapshot transactions, but I don't think we should. As you point out, we don't accumulate the refresh spans for SNAPSHOT. I'm not sure it makes sense to add any commentary, considering the reason is so straightforward.


pkg/kv/txn_coord_sender.go, line 466 at r2 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Can the DistSender split the batch below this point? We need to make sure to clear the CanForwardCommitTimestamp flag if that happens.

It can. Originally the name of this flag was NoRefreshSpans, which was specifically meant to remain accurate for that case, but then it was suggested to change the name and I didn't remember this case when agreeing. I think I'm going to change this back to NoRefreshSpans.


pkg/roachpb/data.go, line 1145 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Why was this suddenly caught? Did part of this change cause it to become more important?

s/.Add(0, 1)/.Next()/

Also, It's unfortunate that the discrepancy between ReadWithinUncertaintyIntervalError.ExistingTimestamp and WriteTooOldError.ActualTimestamp caused us to make this mistake. I'd prefer we change them to both reflect either the existing value's timestamp or the existing value's timestamp+1. Either way, let's prevent future confusion. If this isn't easily possible because of the migration concern, let's at least be more explicit about the disagreement.

I added this for some unit test cases; it never fires in real life. But it turns out that when I first added it, just using ExistingTimestamp was incorrect – we need to add a logical tick to make sure we don't get another WriteTooOldError. Nothing prompted this fix except realizing the original change was incorrect.


pkg/storage/replica.go, line 5238 at r2 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

If retries is already 1, this will mutate ba.Timestamp but won't actually retry, which seems odd. On the other hand, since any concurrent writes should be blocked in the command queue and we look at all requests in the batch, we really shouldn't see more than one WriteTooOldError here. Maybe we should panic if we get more than one?

You can get more than one WriteTooOldError in the case where a non-transactional batch updates the same key in more than one request. Obviously this can also, by extension, happen to a transactional batch which is attempting to execute on the fast path. Added a comment.


pkg/storage/replica.go, line 5487 at r2 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

I think some comments about how we define "returning early" here (i.e. the end of the batch) would be helpful. This is a heuristic that may not always be optimal. Finishing the batch helps us choose the right timestamp in the event that there are multiple conflicts and the client retry will end up writing to the same keys the next time around. If the client behavior changes on retries then laying down the intents on this attempt was wasted work. If running the rest of the batch is beneficial, then the same logic might argue for running the entire remainder of the transaction. Returning a WriteTooOldError at the end of the batch is simply choosing a local minimum in the cost/benefit curve, since the cost of continuing the transaction is much less when we're not requiring additional round trips.

Added to the comment.

An important part of the story here is when this returning early situation is triggered by a conditional put (increment is not even really a factor in practice). On a conditional put, the chance that the retry will succeed is pretty slim, and so laying down intents is almost always going to be a bad idea. The reason cputs are likely to fail is that they're often either based on a read of the underlying value which will change on the forwarded retry timestamp, or they're expecting an empty value, which won't be true on the retry.
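
To make the cput point concrete, here is a toy MVCC sketch (illustrative types only, not the real engine): the condition is re-evaluated against whatever value is visible at the forwarded timestamp, not the value the client originally read.

```go
package main

import "fmt"

// A toy MVCC map for illustration only: each key holds a list of
// (timestamp, value) versions.
type version struct {
	ts  int64
	val string
}

type store map[string][]version

// read returns the newest value at or below ts.
func (s store) read(key string, ts int64) (string, bool) {
	val, found, best := "", false, int64(-1)
	for _, v := range s[key] {
		if v.ts <= ts && v.ts > best {
			val, found, best = v.val, true, v.ts
		}
	}
	return val, found
}

// cput writes val at ts only if the value visible at ts equals exp.
func (s store) cput(key, val, exp string, ts int64) error {
	cur, _ := s.read(key, ts)
	if cur != exp {
		return fmt.Errorf("condition failed: have %q, expected %q", cur, exp)
	}
	s[key] = append(s[key], version{ts: ts, val: val})
	return nil
}

func main() {
	s := store{"k": {{ts: 1, val: "a"}}}

	// The client read "a" at ts=1 and issues cput(k, "b", exp="a").
	// A conflicting write lands at ts=2 before the cput evaluates...
	s["k"] = append(s["k"], version{ts: 2, val: "x"})

	// ...so retrying the cput at a forwarded ts=3 still fails: the condition
	// is checked against the newer value, not the value originally read.
	fmt.Println(s.cput("k", "b", "a", 3))
}
```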



@spencerkimball force-pushed the avoid-client-retries-on-1pc branch from 662e706 to c6f9b92 on February 5, 2018 15:41
@nvanbenschoten (Member)

@spencerkimball yes, the load generator is here. You should just be able to run ./tpcc -drop -load -warehouses=1. Try with both -no-wait=false and -no-wait=true.


Review status: 1 of 7 files reviewed at latest revision, 5 unresolved discussions, some commit checks failed.



@jordanlewis (Member)

FWIW no-wait=false won't be very interesting with just 1 warehouse, because the load generator will mostly be sleeping.

@spencerkimball (Member Author)

The results are not appreciably different for the tpcc load generator with this change.

@spencerkimball merged commit bb1ce80 into cockroachdb:master on Feb 5, 2018
@spencerkimball deleted the avoid-client-retries-on-1pc branch on February 5, 2018 21:52
@bdarnell (Contributor) commented Feb 5, 2018

Review status: 1 of 7 files reviewed at latest revision, 5 unresolved discussions, all commit checks successful.


pkg/kv/txn_coord_sender.go, line 466 at r2 (raw file):

It can.

Then don't we need to clear this flag (whatever it's called) when DistSender splits the batch? The sub-batches could create refresh spans but the EndTransaction sent on its own would still have NoRefreshSpans set.


pkg/roachpb/api.proto, line 463 at r2 (raw file):

  // path. For this to be true, the transaction must not have
  // accumulated any refreshable spans (e.g. from reads, cputs, etc.),
  // and must not have "leaked" a commit timestamp to the SQL client.

What happened to the part about leaking a commit timestamp to the SQL client? Don't we still need to do that?



@andreimatei (Contributor)

Review status: 1 of 7 files reviewed at latest revision, 6 unresolved discussions, all commit checks successful.


pkg/roachpb/api.proto, line 463 at r2 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

What happened to the part about leaking a commit timestamp to the SQL client? Don't we still need to do that?

On a related note, I want to check that we're good on other txns that care particularly about their timestamps – the incremental backup roll-up txns. Are those unaffected by this series of changes?



@andreimatei (Contributor)

Review status: 1 of 7 files reviewed at latest revision, 6 unresolved discussions, all commit checks successful.


pkg/roachpb/api.proto, line 463 at r2 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

On a related note, I want to check that we're good on other txns that care about their timestamps particularly - the incremental backup roll-up txns. Are those unaffected by this series of changes?

Cc @danhhz @knz



@spencerkimball (Member Author)

The txn start time was not always the txn commit time, as with SNAPSHOT txns. So even if now() used to "work", it didn't actually work with both isolation levels.



@knz (Contributor) commented Feb 6, 2018

yeah I am not concerned about snapshot. We never pretended to do anything correct with it anyway :)
The jepsen test suite didn't exercise snapshot either (AFAIK)

Meanwhile, I think we can defer working on this until 2.1 work starts, assuming we clearly document what is going on. A user who wishes to preserve the pre-optimization semantics can tweak their SQL code to force the txn to become 2PC, in which case I believe the previous behavior holds (is that correct?)

@petermattis (Collaborator)

It used to be that the txn start time was also the txn commit time, so it worked that way, yes. Now that is no longer the case.

I must have missed context here: why does now() need to be the transaction commit time? You're clearly worried about something that I'm not understanding.


Review status: 1 of 7 files reviewed at latest revision, 6 unresolved discussions, all commit checks successful.



@knz (Contributor) commented Feb 6, 2018

The monotonic inserts test: insert into t(x, y) select max(x)+1, now(), issued from multiple concurrent clients, must see the two columns x and y grow monotonically. This is only true if now() is monotonic with commit order.

There are variants of this that use multiple tables and check this is also true even when the txns do not conflict on writes (just on reads).
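
A minimal sketch of that monotonicity check (hypothetical table t and connection URL, using Go's database/sql):

```go
package main

import (
	"database/sql"
	"log"
	"sync"
	"time"

	_ "github.com/lib/pq" // CockroachDB speaks the PostgreSQL wire protocol.
)

func main() {
	// Hypothetical local single-node cluster; adjust the URL as needed.
	db, err := sql.Open("postgres",
		"postgresql://root@localhost:26257/test?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	if _, err := db.Exec(
		`CREATE TABLE IF NOT EXISTS t (x INT PRIMARY KEY, y TIMESTAMP)`); err != nil {
		log.Fatal(err)
	}

	// Several concurrent clients each append rows one past the current max.
	var wg sync.WaitGroup
	for c := 0; c < 8; c++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < 50; i++ {
				// Conflicting inserts may fail on the duplicate key; that's
				// fine, only committed rows matter for the check below.
				_, _ = db.Exec(
					`INSERT INTO t(x, y) SELECT coalesce(max(x), 0)+1, now() FROM t`)
			}
		}()
	}
	wg.Wait()

	// If now() is monotonic with commit order, y must be non-decreasing in x.
	rows, err := db.Query(`SELECT y FROM t ORDER BY x`)
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()
	var prev time.Time
	for rows.Next() {
		var y time.Time
		if err := rows.Scan(&y); err != nil {
			log.Fatal(err)
		}
		if y.Before(prev) {
			log.Fatal("now() not monotonic with commit order")
		}
		prev = y
	}
}
```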

@knz (Contributor) commented Feb 6, 2018

That monotonic insert test as written above, however, does not use 1PC (because it needs to read the max), so it's not impacted. But I wonder if there are other possible equivalent tests that can use 1PC and that would be impacted.

@andreimatei (Contributor)

This particular PR is about 1PC, but the broader set of changes affects all txns.

@knz (Contributor) commented Feb 6, 2018

And just to be clear, a disclaimer: please take my comments above with a grain of salt; it's been a while since I last checked what these tests were actually testing. I recall we went through multiple iterations back then to properly define which tests were correct (i.e., that the behavior they were testing was strictly required by the most general definition of serializable isolation).

@petermattis (Collaborator)

The monotonic inserts test: insert into t(x, y) select max(x)+1, now(), issued from multiple concurrent clients, must see the two columns x and y grow monotonically. This is only true if now() is monotonic with commit order.

There are variants of this that use multiple tables that check this is also true even when the txns do not conflict on writes (just on reads).

Ok, this is vastly different from saying now() must be monotonic with commit order, because there are other interactions between the two transactions which provide that guarantee. Without those interactions, I don't see how or why we would provide that guarantee about now().

@knz (Contributor) commented Feb 6, 2018

Yes, that is why I said above "only true if the txns have touching keys".

@bdarnell (Contributor) commented Feb 6, 2018

I checked and our jepsen tests do not exercise the 1PC code path. Should we schedule a roadmap item to make it happen in 2.1?

It might be nice, but I wouldn't prioritize it. The current issue about commit timestamps changing applies to both one- and two-phase transactions.


Review status: 1 of 7 files reviewed at latest revision, 6 unresolved discussions, all commit checks successful.


pkg/kv/txn_coord_sender.go, line 466 at r2 (raw file):

Previously, spencerkimball (Spencer Kimball) wrote…

We don't need to clear the flag right now and in fact we still want it. Even when the batch is split, if NoRefreshSpans is true, it means we can commit a serializable txn despite the timestamp having moved forward. I'm going to make a further change to effect that optimization. When we do that, we'll need to reset this flag more precisely at the dist sender to take into consideration any part of the split batch, already executed, which will generate a refresh span (e.g. cputs and incs).

I don't understand. If the batch is split, then the first parts of the batch could create refresh spans, invalidating the NoRefreshSpans flag on the EndTransactionRequest. (I'm talking about the splitting by type of BatchRequest.Split, not the splitting by key of kv.truncate)


pkg/roachpb/api.proto, line 463 at r2 (raw file):

We don't currently leak the commit timestamp to the SQL client before the commit

But unless you're in snapshot isolation, if there's a change in the timestamp the transaction would restart and the SQL layer would see the updated OrigTimestamp.

I'm not too concerned personally about most of the SQL timestamp functions (although I remember there was a lot of consideration given to what we should provide here, so maybe I'm forgetting something important). However, cluster_logical_timestamp does need to see the commit timestamp. I'm pretty sure this could break the monotonic jepsen test.



@spencerkimball (Member Author)

Review status: 1 of 7 files reviewed at latest revision, 6 unresolved discussions, all commit checks successful.


pkg/kv/txn_coord_sender.go, line 466 at r2 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

I don't understand. If the batch is split, then the first parts of the batch could create refresh spans, invalidating the NoRefreshSpans flag on the EndTransactionRequest. (I'm talking about the splitting by type of BatchRequest.Split, not the splitting by key of kv.truncate)

Well, the code in this particular PR only executes on the 1PC path, so the batch could not have been split. The change I'm working on now will allow a replica-side retry optimization, and that change resets the NoRefreshSpans flag if the batch is split and the previously-executed split of the batch had refresh spans.


pkg/roachpb/api.proto, line 463 at r2 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

We don't currently leak the commit timestamp to the SQL client before the commit

But unless you're in snapshot isolation, if there's a change in the timestamp the transaction would restart and the SQL layer would see the updated OrigTimestamp.

I'm not too concerned personally about most of the SQL timestamp functions (although I remember there was a lot of consideration given to what we should provide here, so maybe I'm forgetting something important). However, cluster_logical_timestamp does need to see the commit timestamp. I'm pretty sure this could break the monotonic jepsen test.

Looks like we're having to handle this case after all. The fixes are in an upcoming change.



@bdarnell (Contributor) commented Feb 7, 2018

Review status: 1 of 7 files reviewed at latest revision, 6 unresolved discussions, all commit checks successful.


pkg/kv/txn_coord_sender.go, line 466 at r2 (raw file):

Well the code in this particular PR only executes on the 1PC path

Ah, that's what I was missing.

The change I'm working on now will allow replica-side retry optimization, and that change resets the NoRefreshSpans flag if the batch is split and if the previously-executed split of the batch had refresh spans.

OK. That change is going to cause problems for mixed-version clusters, since nodes that only have this commit will be incorrectly sending NoRefreshSpans on EndTransactionRequests that have been split. An easy way to get around that would be to change the tag number for the NoRefreshSpans proto field.



@spencerkimball (Member Author)

pkg/kv/txn_coord_sender.go, line 466 at r2 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Well the code in this particular PR only executes on the 1PC path

Ah, that's what I was missing.

The change I'm working on now will allow replica-side retry optimization, and that change resets the NoRefreshSpans flag if the batch is split and if the previously-executed split of the batch had refresh spans.

OK. That change is going to cause problems for mixed-version clusters, since nodes that only have this commit will be incorrectly sending NoRefreshSpans on EndTransactionRequests that have been split. An easy way to get around that would be to change the tag number for the NoRefreshSpans proto field.

I didn't realize we needed to handle mixed-version clusters on these unstable point releases. I don't think that should be a requirement, especially for a problem like this.



@bdarnell (Contributor) commented Feb 7, 2018

Review status: 1 of 7 files reviewed at latest revision, 6 unresolved discussions, all commit checks successful.


pkg/kv/txn_coord_sender.go, line 466 at r2 (raw file):

Previously, spencerkimball (Spencer Kimball) wrote…

I didn't realize we needed to handle mixed-version clusters on these unstable point releases. I don't think that should be a requirement, especially for a problem like this.

We do rolling upgrades of our continuous and test clusters, and sometimes tell customers to do the same. We generally don't do this with critical data, but it's still good to get into the habit of compatibility between versions at any time. (In this case, the best solution would have been to clear the NoRefreshSpans on split in the original commit, but now that it's there and not cleared I think we should change the tag number to preserve compatibility)



ts := ba.Timestamp
if br != nil {
	ts = br.Timestamp
}
A Member commented:

@spencerkimball or @bdarnell, do you know what makes this safe in relation to lease transfers? Specifically, a batch goes through the following steps:

1. update the local node's clock with `ba.Timestamp` when passing through the Store
2. under lock, we check whether a lease transfer is in flight
2a. if so, wait for the lease to stabilize
3. if not, evaluate the batch without ever coordinating with possible lease transfers again
4. bump timestamp cache with batch's timestamp (this code)

From the lease transfers point of view:

1. lock
2. grab a local clock reading
3. start transfer with that time, which will be used as the new leaseholder's tscache lower bound
4. unlock

Since this timestamp is larger than the original ba.Timestamp, isn't it possible that its contribution to the timestamp cache is missed? Are we relying on some implicit assumption that this timestamp will never be bumped past clock.Now()? Is that a fair assumption to make in all cases? Are we assuming that this timestamp will always be based on existing data on the Range, which in turn must be at a lower timestamp than the clock on the Store that houses it? If so, I don't see how we're confident about that conclusion, given that we only set transactional batches' timestamps to the txn's orig timestamp (where it reads) and we don't include a timestamp when resolving intents.

A Contributor commented:

Yes, I think we're implicitly relying on an assumption that any KV data that might bump the response timestamp will have been through some path that would have updated the HLC. I don't know whether that's always valid. It sounds like we might be missing some cases to pull timestamps out of the request (like in intent resolution).

The Member Author commented:

It's safe because br.Timestamp is only ever forwarded to the timestamp of a key written to the current range. In other words, the local clock must already be at or past the timestamp of any value written to the range. You might want to augment the comment here.
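
A toy sketch of the assumption under discussion (illustrative only): if every request forwards the node's clock before evaluation, the clock is always an upper bound on the timestamps of local writes, which is what the lease-transfer reasoning relies on.

```go
package main

import "fmt"

// clock is a toy stand-in for a node's HLC: it only ever moves forward.
type clock struct{ wall int64 }

// update forwards the clock by an incoming request timestamp, done before
// the request is evaluated.
func (c *clock) update(ts int64) {
	if ts > c.wall {
		c.wall = ts
	}
}

func main() {
	var c clock
	for _, baTS := range []int64{5, 3, 9} {
		c.update(baTS)
	}
	// The clock is >= every timestamp it has admitted, so a clock reading
	// taken during a lease transfer bounds all prior local writes.
	fmt.Println(c.wall) // 9
}
```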

A Member commented:

This thought experiment blossomed into the beautiful issue we see in #36431. Over there we note that even if we did update the leaseholder's HLC clock on intent resolution, we'd still run into issues with transaction uncertainty when intents are moved forward.

We'll need to address that issue by enforcing clock updates during txn refreshes. In addition, we should do more to ensure that clock updates are never missed, even when BatchRequests don't explicitly set a header timestamp. Trying to address this on a case-by-case basis (e.g. #35297 and #36307) seems like a fraught effort. There has been some discussion about adding a gRPC interceptor to deal with clock updates at a lower-level than in the BatchRequest domain.

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Mar 1, 2019
This new assertion would fire if a bug like the one hypothesized in
the following comment exists:
cockroachdb#22315 (comment)

Release note: None
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Mar 2, 2019
craig bot pushed a commit that referenced this pull request Mar 2, 2019
35297: storage: ensure that PushTxn request's timestamp cache updates are safe r=nvanbenschoten a=nvanbenschoten

This PR's overall goal is to ensure that the timestamp cache updates performed by `PushTxn` requests are always safe. It does this in a series of steps:
1. it makes its `PushTo` argument inclusive, making it easier to use and easier to assert against (e.g. `req.PushTo <= req.Timestamp`).
2. it removes the `Now` argument and begins using the batch's `Timestamp` field instead. This field is used to update the receiving store's clock _before_ the request is evaluated and before the lease transfer check is performed.
3. it adds an assertion that all timestamp cache updates are safe given their local clock. This will also catch hypothesized bugs like #22315 (comment).

I'm planning on getting this in to 19.1.

Co-authored-by: Nathan VanBenschoten <[email protected]>
tbg added a commit to tbg/cockroach that referenced this pull request Oct 29, 2019
This fixes a bug introduced in cockroachdb#22315 with the introduction of a "local
retry" to avoid WriteTooOldError inside of 1PC transactions that were
not carrying out any reads. Such a transaction can simply be
re-evaluated at a higher timestamp, which can allow it to commit.

The bug was that when such a re-evaluation was carried out, we were
not discarding the MVCCStats accrued from the first attempt. In
effect, the request would thus be double-counted in stats, which
would set off the consistency checker.

The fix is easy: discard the delta before re-evaluating.

Fixes cockroachdb#31870.

Release note (bug fix): remove a source of (benign) stats
inconsistencies (i.e. the stats of a range not accurately reflecting its
contents).
craig bot pushed a commit that referenced this pull request Oct 29, 2019
41983: engine: don't create aux dir for read-only engine r=petermattis a=tbg

It was mildly irritating that

    ./cockroach debug raft-log not/there 1

would create `not/there/auxiliary`.

Release note: None

41986: storage: fix stats inconsistency due to WriteTooOld optimization r=irfansharif a=tbg

This fixes a bug introduced in #22315 with the introduction of a "local
retry" to avoid WriteTooOldError inside of 1PC transactions that were
not carrying out any reads. Such a transaction can simply be
re-evaluated at a higher timestamp, which can allow it to commit.

The bug was that when such a re-evaluation was carried out, we were
not discarding the MVCCStats accrued from the first attempt. In
effect, the request would thus be double-counted in stats, which
would set off the consistency checker.

The fix is easy: discard the delta before re-evaluating.

Fixes #31870.

Release note (bug fix): remove a source of (benign) stats
inconsistencies (i.e. the stats of a range not accurately reflecting its
contents).

Co-authored-by: Tobias Schottdorf <[email protected]>
tbg added a commit to tbg/cockroach that referenced this pull request Nov 4, 2019