kv: support STAGING transactions in kv protocol and client #36012

nvanbenschoten · 2019-03-21T00:06:24Z

This PR polishes off the protocol and client level changes from #35165 to allow transactions to be moved to the STAGING state during parallel commits. To do so, the change makes two major contributions.

First, it adds support to the EndTransaction request and response for moving a transaction record to the STAGING status, in addition to the ABORTED and COMMITTED statuses. This is done in a way that interacts elegantly with a transaction's in-flight writes and allows the request to move a transaction record directly to the COMMITTED status when all of the in-flight writes are on the same range as the transaction record.

The second contribution is that the change adds full support to the txnCommitter interceptor to issue parallel commits when possible and then to launch the asynchronous portion of marking a transaction as explicitly committed if it determines that a transaction is implicitly committed. This is likely the area of the change where reviewers will want to look first, as it incorporates the bulk of the parallel commits algorithm and also serves as its primary documentation.

The final remaining piece of work after this change is to adjust DistSender to actually send EndTransaction requests in parallel with writes in the same batch and with QueryIntents for previous writes. This change makes it safe to do so, but doesn't actually add any logic to DistSender. There are a few complications to that piece and it's also the least interesting part of this change, which is why I have saved it for last.
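As a rough illustration of the first contribution, here is a minimal sketch of how a committing EndTransaction might choose between STAGING and COMMITTED. The `TxnStatus` type, the `commitStatus` function, and the `onRecordRange` predicate are hypothetical stand-ins invented for this sketch, not the real roachpb or batcheval code.

```go
package main

// TxnStatus is a simplified stand-in for the transaction status enum.
type TxnStatus int

const (
	PENDING TxnStatus = iota
	STAGING
	COMMITTED
	ABORTED
)

// commitStatus picks the status a committing EndTransaction leaves on the
// transaction record. onRecordRange is a hypothetical predicate reporting
// whether a key lives on the same range as the transaction record.
func commitStatus(inFlightWrites []string, onRecordRange func(key string) bool) TxnStatus {
	for _, key := range inFlightWrites {
		if !onRecordRange(key) {
			// At least one write may still be in flight on another range,
			// so the transaction can only be implicitly committed.
			return STAGING
		}
	}
	// Every in-flight write is evaluated alongside the record, so the
	// record can skip STAGING and be marked COMMITTED directly.
	return COMMITTED
}
```

Under these assumptions, a transaction with no in-flight writes, or with all of them on the record's range, never needs to pass through STAGING.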

nvanbenschoten · 2019-04-02T19:43:55Z

Ok, this is ready for review when possible. The one major question I have, and on which I'd appreciate input from @tbg, @ajwerner, and maybe @andreimatei, is how we should represent a transaction's "intent spans" and "in-flight requests". This comes up both in the EndTransactionRequest and in the Transaction record. The question is a continuation of the thread from #35763 (review). I'll continue to work on this, but it's something to keep in mind when reviewing the PR.

Also, I'm not planning on merging this until after the DistSender changes are also ready to go in because on its own this will cause a performance hit.

ajwerner

Reviewed 4 of 4 files at r1, 2 of 2 files at r2, 28 of 28 files at r3, 17 of 30 files at r4, 2 of 2 files at r5, 28 of 29 files at r6.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @nvanbenschoten, and @tbg)

pkg/kv/txn_interceptor_committer.go, line 147 at r3 (raw file):

		// the status back to PENDING.
		if txn := pErr.GetTxn(); txn != nil && txn.Status == roachpb.STAGING {
			pErr.SetTxn(cloneWithStatus(txn, roachpb.PENDING))

Just to check my understanding, at this point the serialized Txn record will be STAGING but we know here that it's not committed so our in-memory view of the Txn is PENDING?

pkg/kv/txn_interceptor_committer.go, line 151 at r3 (raw file):

		// Same deal with MixedSuccessErrors.
		// TODO(nvanbenschoten): Merge code? Wait for MixedSuccessError to go away?
		if aPSErr, ok := pErr.GetDetail().(*roachpb.MixedSuccessError); ok {

What is aPSErr supposed to mean?

pkg/kv/txn_interceptor_committer.go, line 255 at r3 (raw file):

// committing EndTransaction in parallel with other writes. If so, it returns
// the set of writes that may still be in-flight when the EndTransaction is
// evaluated. These writes should be declared are requirements for the

either should be declared as or are

pkg/kv/txn_interceptor_committer.go, line 405 at r6 (raw file):

	ba.Add(&et)

	const maxAttempts = 5

Comment on why retries might happen and why it is a good idea?

pkg/kv/txn_interceptor_committer_test.go, line 194 at r3 (raw file):

	mockSender.MockSend(func(ba roachpb.BatchRequest) (*roachpb.BatchResponse, *roachpb.Error) {
		require.Equal(t, 3, len(ba.Requests))

require.Len is a thing

pkg/kv/txn_interceptor_pipeliner.go, line 68 at r3 (raw file):

// that have been proposed but not proven to have succeeded are first checked
// before considering the transaction committed. These async writes are referred
// to as "outstanding writes" and this process of proving that an outstanding

should we call them "in-flight writes" for consistency? Can you help elucidate the difference between "outstanding" and "in-flight" for me? Both terms are used in a sentence below. From what I gather one is the set of writes stored in the BTree of the txnPipeliner and another is the map of the EndTransaction populated by the txnCommitter. The mixed terminology is a little bit confusing.

pkg/roachpb/data.go, line 980 at r3 (raw file):

	}
	if !t.Status.IsFinalized() {
		if (t.Epoch < o.Epoch) || (t.Epoch == o.Epoch && o.Status != PENDING) {

putting a pin in this to remember to think harder about this condition

pkg/storage/replica_test.go, line 10672 at r3 (raw file):

				return sendWrappedWithErr(etH, &et)
			},
			expError: "TransactionAbortedError(ABORT_REASON_ALREADY_COMMITTED_OR_ROLLED_BACK_POSSIBLE_REPLAY)",

Is TransactionAbortedError a good name for this error?

pkg/storage/replica_test.go, line 11114 at r3 (raw file):

			},
			run: func(txn *roachpb.Transaction, _ hlc.Timestamp) error {
				rt := recoverTxnArgs(txn, true /* implicitlyCommitted */)

This got renamed to allWritesFound

pkg/storage/replica_write.go, line 506 at r3 (raw file):

				et = etClone

				reqsClone := append([]roachpb.RequestUnion(nil), ba.Requests...)

preallocate this because you know the size

pkg/storage/replica_write.go, line 494 at r6 (raw file):

	}

	origET := et

Can you comment about what's going on here?

pkg/storage/batcheval/cmd_end_transaction.go, line 242 at r3 (raw file):

		case roachpb.PENDING:
			if h.Txn.Epoch < reply.Txn.Epoch {

Given the code duplication you could make this case PENDING, STAGING and put a

if reply.Txn.Status == roachpb.STAGING {
   // Remove the existing in-flight writes. They may be replaced.
   reply.Txn.InFlightWrites = nil
}

ajwerner

My comments are questions more than anything

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @nvanbenschoten, and @tbg)

nvanbenschoten

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @tbg)

pkg/kv/txn_interceptor_committer.go, line 147 at r3 (raw file):

Previously, ajwerner wrote…

Just to check my understanding, at this point the serialized Txn record will be STAGING but we know here that it's not committed so our in-memory view of the Txn is PENDING?

Yes, exactly! This is really subtle so I'll improve the comment.
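To make the subtlety concrete, here is a minimal sketch of the status rewrite being discussed. Only the name `cloneWithStatus` comes from the snippet under review; the `Txn` type and the `clientView` helper are simplified, hypothetical stand-ins.

```go
package main

// Status is a simplified stand-in for the transaction status enum.
type Status int

const (
	PENDING Status = iota
	STAGING
	COMMITTED
	ABORTED
)

// Txn is a pared-down, hypothetical transaction proto.
type Txn struct {
	ID     string
	Status Status
}

// cloneWithStatus returns a copy of txn with the given status, leaving the
// original untouched.
func cloneWithStatus(txn *Txn, s Status) *Txn {
	clone := *txn
	clone.Status = s
	return &clone
}

// clientView returns the transaction as the client should see it after a
// failed parallel commit: the persisted record may be STAGING, but the
// client knows the commit did not succeed, so its view reverts to PENDING.
func clientView(txn *Txn) *Txn {
	if txn != nil && txn.Status == STAGING {
		return cloneWithStatus(txn, PENDING)
	}
	return txn
}
```

The copy matters in this sketch because the transaction attached to the error may be shared, so it is not mutated in place.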

pkg/kv/txn_interceptor_committer.go, line 151 at r3 (raw file):

Previously, ajwerner wrote…

What is aPSErr supposed to mean?

I think it's "a partial success error". I got that from here. I really need to just address the TODO and get rid of MixedSuccessErrors entirely.

pkg/kv/txn_interceptor_committer.go, line 255 at r3 (raw file):

Previously, ajwerner wrote…

either should be declared as or are

Done.

pkg/kv/txn_interceptor_committer.go, line 405 at r6 (raw file):

Previously, ajwerner wrote…

Comment on why retries might happen and why it is a good idea?

I removed this. I don't have a strong enough understanding of when retries would be a good idea. We can add them back in if we see a need for them.

pkg/kv/txn_interceptor_committer_test.go, line 194 at r3 (raw file):

Previously, ajwerner wrote…

require.Len is a thing

Nice! Switched where appropriate.

pkg/kv/txn_interceptor_pipeliner.go, line 68 at r3 (raw file):

Previously, ajwerner wrote…

should we call them "in-flight writes" for consistency? Can you help elucidate the difference between "outstanding" and "in-flight" for me? Both terms are used in a sentence below. From what I gather one is the set of writes stored in the BTree of the txnPipeliner and another is the map of the EndTransaction populated by the txnCommitter. The mixed terminology is a little bit confusing.

Yeah, I think this is a good point. The distinction here is that:
"outstanding" = writes that have not completed by the time an EndTransaction is issued
"in-flight" = writes that are in-flight at the same time as an EndTransaction

This distinction here makes sense given the development progression of txn pipelining and now parallel commits, but as a reader of this code I can see how the distinction would feel artificial. There's definitely a draw to unifying the concepts.

If we were to unify them, which name would you like better? "Outstanding"? "In-flight"? "Concurrent"? Something else? I'm into the idea of making the change, I just don't want to change the 200 instances of "outstanding" to something else before we've settled on something better. @tbg do you have an opinion?

pkg/storage/replica_test.go, line 10672 at r3 (raw file):

Previously, ajwerner wrote…

Is TransactionAbortedError a good name for this error?

TransactionAbortedErrors arise in these situations because the transaction record is committed and already GCed by the time the second request reaches it. At that point, there's no way to tell whether the transaction was committed or aborted (unlike in the "without eager gc" cases below). All we really know is that the transaction record can't be created again in the future.

Fortunately, it doesn't really matter because transaction records can't be GCed until after all of their intents have already been resolved. This is explained in pretty good detail here.

pkg/storage/replica_test.go, line 11114 at r3 (raw file):

Previously, ajwerner wrote…

This got renamed to allWritesFound

Ah good catch, but other way around. I renamed it to implicitlyCommitted in the previous PR and didn't catch all of these during the rebase. Done.

pkg/storage/replica_write.go, line 506 at r3 (raw file):

Previously, ajwerner wrote…

preallocate this because you know the size

That doesn't seem to make a difference:

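package prealloc

import "testing"

// slice is the source data copied on every iteration.
var slice = make([]int, 512)

// BenchmarkPreAlloc copies into a destination allocated with exact capacity.
func BenchmarkPreAlloc(b *testing.B) {
	for i := 0; i < b.N; i++ {
		_ = append(make([]int, 0, len(slice)), slice...)
	}
}

// BenchmarkNoPreAlloc appends to a nil slice and lets append size the copy.
func BenchmarkNoPreAlloc(b *testing.B) {
	for i := 0; i < b.N; i++ {
		_ = append([]int(nil), slice...)
	}
}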

name          time/op
PreAlloc-4     722ns ± 1%
NoPreAlloc-4   723ns ± 4%

name          alloc/op
PreAlloc-4    4.10kB ± 0%
NoPreAlloc-4  4.10kB ± 0%

name          allocs/op
PreAlloc-4      1.00 ± 0%
NoPreAlloc-4    1.00 ± 0%

pkg/storage/replica_write.go, line 494 at r6 (raw file):

Previously, ajwerner wrote…

Can you comment about what's going on here?

Done.

pkg/storage/batcheval/cmd_end_transaction.go, line 242 at r3 (raw file):

Previously, ajwerner wrote…

Given the code duplication you could make this case PENDING, STAGING and put a
if reply.Txn.Status == roachpb.STAGING {
   // Remove the existing in-flight writes. They may be replaced.
   reply.Txn.InFlightWrites = nil
}

Done.
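For reference, a minimal sketch of what the merged case could look like. The `txnRecord` type, the `prepareForEndTxn` function, and both error values are hypothetical; in particular, treating an epoch regression as an error is an assumption of this sketch, since the snippet under review does not show that branch.

```go
package main

import (
	"errors"
	"fmt"
)

// Status is a simplified stand-in for the transaction status enum.
type Status int

const (
	PENDING Status = iota
	STAGING
	COMMITTED
	ABORTED
)

// txnRecord is a pared-down, hypothetical transaction record.
type txnRecord struct {
	Status         Status
	Epoch          int
	InFlightWrites []string
}

var errFinalized = errors.New("transaction record already finalized")

// prepareForEndTxn handles PENDING and STAGING records in a single case, as
// suggested in the review, and clears stale in-flight writes for STAGING.
func prepareForEndTxn(rec *txnRecord, reqEpoch int) error {
	switch rec.Status {
	case COMMITTED, ABORTED:
		return errFinalized
	case PENDING, STAGING:
		if reqEpoch < rec.Epoch {
			// Assumption: an epoch regression is rejected outright.
			return fmt.Errorf("epoch regression: %d < %d", reqEpoch, rec.Epoch)
		}
		if rec.Status == STAGING {
			// Remove the existing in-flight writes. They may be replaced.
			rec.InFlightWrites = nil
		}
		return nil
	default:
		return fmt.Errorf("unknown status %d", rec.Status)
	}
}
```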

This was suggested in cockroachdb#36012. The following commit will make a distinction between "in-flight" writes and a "write footprint", so it makes sense to first perform this global rename. Release note: None

nvanbenschoten · 2019-05-03T21:24:16Z

The final two points to address here before it is ready to review are:

restructuring the in-flight writes in the EndTransactionRequest and Transaction protos.
unifying the concepts of "outstanding" pipelined writes and "in-flight" writes during a parallel commit.

Both of these are addressed in #37302, so that PR is the final precursor to this one.

This commit addresses the impedance mismatch discussed in cockroachdb#35763 (review) between tracking in-flight writes and tracking a transaction's intent "footprint". These are two related but fundamentally different concepts and it's best to be explicit about how they relate and how they differ.

To this end, this commit moves all intent tracking into the txnPipeliner. The pipeliner now tracks two separate sets: the "in-flight writes" and the "write footprint". The former is morally a subset of the latter, but for the sake of avoiding duplication, in-flight writes are not included in the write footprint. Instead, the write footprint includes:

1. writes from the current epoch that have already succeeded (point writes that were pipelined and already proved and ranged writes)
2. writes from the current epoch that may or may not have succeeded
3. all writes from prior epochs

Structuring the two sets this way has a major benefit -- it is exactly how we would like to represent the writes that a transaction has performed on `EndTransactionRequests` and on STAGING `Transaction` records. This will simplify client logic in cockroachdb#36012 significantly.

Release note: None

This allows QueryIntent to behave correctly even as a transaction's provisional commit timestamp grows. Release note: None

Parallel commits will want intents that are found to be missing to be prevented from ever being written. This is true both for the committing transaction itself and other transactions attempting to recover a STAGING transaction. In fact, all use cases of QueryIntent will now want the PREVENT IfMissingBehavior. However, a committing transaction will still also want the existing behavior of returning an error if the intent is missing, which is provided by the RETURN_ERROR IfMissingBehavior. This indicates that the options in the IfMissingBehavior are not sufficiently orthogonal to justify an enum. This change removes the enum and replaces it with an `ErrorIfMissing` bool. This is backwards compatible because up until now, no one has used the PREVENT behavior. Release note: None
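A minimal sketch of the resulting semantics, assuming prevention always happens and the error is opt-in. Only the `ErrorIfMissing` field name comes from the commit message; `intentStore`, `queryIntent`, and the `prevented` map are hypothetical stand-ins for whatever mechanism actually blocks the later write.

```go
package main

import "errors"

// queryIntentRequest is a pared-down, hypothetical request.
type queryIntentRequest struct {
	Key            string
	ErrorIfMissing bool
}

// intentStore is a toy model of a range's intents.
type intentStore struct {
	intents   map[string]bool // intents currently written
	prevented map[string]bool // keys where a missing intent can no longer be written
}

var errIntentMissing = errors.New("intent missing")

// queryIntent reports whether the intent exists. A missing intent is always
// prevented from being written later; the error is a separate, optional
// behavior, which is why a bool suffices where the enum did not.
func (s *intentStore) queryIntent(req queryIntentRequest) (bool, error) {
	if s.intents[req.Key] {
		return true, nil
	}
	s.prevented[req.Key] = true
	if req.ErrorIfMissing {
		return false, errIntentMissing
	}
	return false, nil
}
```

In this sketch, a committing transaction would set `ErrorIfMissing`, while a transaction recovering a STAGING record would leave it unset and inspect the boolean result.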

These checks usually go hand-in-hand, but the output of the error check is almost always more useful when tests fail. Release note: None

Release note: None

roachpb.EndTransactionRequest is 120 bytes large. We shouldn't pass it by value to these leaf functions. Release note: None

This commit adjusts the representation of `InFlightWrites` in Transaction protos in accordance with the thinking from cockroachdb#37302. The structure no longer points into the `IntentSpans`. Instead, it lives alongside the `IntentSpans`, which logically extends to include all of the `InFlightWrites` but doesn't in practice to avoid duplication. Release note: None

nvanbenschoten · 2019-05-15T13:29:55Z

@tbg I decided I didn't want to sit on this until #37469 is ready because this large of a PR tends to rot and I don't intend to finalize that PR until next week. I added in one more commit that temporarily disables parallel commits, which we can revert when #37469 goes in.

Here's a note from that commit message:

The fear of merging #36012 without #37469 and without disabling parallel
commits is that we would then be adding a new transaction state transition
without actually performing implicit commits in parallel with the rest of
the transaction's writes. This appears to slow down nightly benchmarks by
about 5-10%.

With that extra commit, this PR is now eligible to be merged into master.

tbg · 2019-05-15T13:35:24Z

Sounds good, I didn't think that was easy enough to do or I would've suggested so myself during review. Going to give this a look later today.

This commit allows us to merge cockroachdb#36012 without waiting for cockroachdb#37469. We can revert this commit once the changes in cockroachdb#37469 land. The fear of merging cockroachdb#36012 without cockroachdb#37469 and without disabling parallel commits is that we would then be adding a new transaction state transition without actually performing implicit commits in parallel with the rest of the transaction's writes. This appears to slow down nightly benchmarks by about 5-10%. Release note: None

tbg

Reviewed 9 of 9 files at r15, 3 of 3 files at r16.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @nvanbenschoten)

pkg/storage/replica_write.go, line 527 at r15 (raw file):

				// Ranged write or read. See below.
			}
		}

assert len(inflight) >= len(writes)

pkg/storage/replica_write.go, line 572 at r15 (raw file):

		//
		// This reduces the complexity of this loop from O(n*log(m)) down to
		// O(n) where n is the number of requests in the batch and m is the

point writes

#batch + #pointwritesinbatch * log inflight
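The cost accounting in that comment can be sketched as follows: every request in the batch is visited once, but only point writes pay for a logarithmic lookup into the in-flight set. The `request` type and `matchInFlight` function are hypothetical, the in-flight writes are assumed to be a sorted slice of unique keys, and `sort.SearchStrings` stands in for whatever ordered lookup the real code uses.

```go
package main

import "sort"

// request is a pared-down, hypothetical batch request.
type request struct {
	Key        string
	PointWrite bool
}

// matchInFlight counts how many point writes in the batch correspond to a
// declared in-flight write. inFlight must be sorted. It also returns the
// number of binary searches performed, to make the cost visible:
// len(batch) iterations plus one O(log len(inFlight)) lookup per point write.
func matchInFlight(batch []request, inFlight []string) (matched, lookups int) {
	for _, r := range batch {
		if !r.PointWrite {
			// Ranged write or read: no lookup needed.
			continue
		}
		lookups++
		i := sort.SearchStrings(inFlight, r.Key)
		if i < len(inFlight) && inFlight[i] == r.Key {
			matched++
		}
	}
	return matched, lookups
}
```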

Release note: None

Just code movement. Release note: None

Release note: None

nvanbenschoten

TFTRs!

Reviewable status: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @tbg)

pkg/storage/replica_write.go, line 527 at r15 (raw file):

Previously, tbg (Tobias Grieger) wrote…

assert len(inflight) >= len(writes)

Done.

pkg/storage/replica_write.go, line 572 at r15 (raw file):

Previously, tbg (Tobias Grieger) wrote…

point writes

#batch + #pointwritesinbatch * log inflight

Done.

nvanbenschoten · 2019-05-15T21:14:46Z

bors r+

36012: kv: support STAGING transactions in kv protocol and client r=nvanbenschoten a=nvanbenschoten This PR polishes off the protocol and client level changes from #35165 to allow transactions to be moved to the STAGING state during parallel commits. To do so, the change makes two major contributions. First, it adds support to `EndTransaction` request and response for moving a transaction record to the STAGING status in addition to the ABORTED and COMMITTED status. This is done in a way that interacts elegantly with a transaction's in-flight writes and allows the request to move a transaction record directly to the COMMITTED status when all of the in-flight writes are on the same range as the transaction record. The second contribution is that the change adds full support to the `txnCommitter` interceptor to issue parallel commits when possible and then to launch the asynchronous portion of marking a transaction as explicitly committed if it determines that a transaction is implicitly committed. This is likely the area of the change where reviewers will want to look first, as it incorporates the bulk of the parallel commits algorithm and also serves as its primary documentation. The final remaining piece of work after this change is to adjust DistSender to actually send `EndTransaction` requests in parallel with writes in the same batch and with `QueryIntents` for previous writes. This change makes it safe to do so, but doesn't actually add any logic to DistSender. There are a few complications to that piece and it's also the least interesting part of this change, which is why I have saved it for last. Co-authored-by: Nathan VanBenschoten <[email protected]>

craig · 2019-05-15T21:39:41Z

Build succeeded

nvanbenschoten requested review from ajwerner and tbg March 21, 2019 00:06

nvanbenschoten force-pushed the nvanbenschoten/txnParallelCommitter2 branch 4 times, most recently from 742f533 to 07f6a10 Compare March 27, 2019 22:21

nvanbenschoten force-pushed the nvanbenschoten/txnParallelCommitter2 branch from 07f6a10 to 06f74b4 Compare April 2, 2019 19:37

nvanbenschoten marked this pull request as ready for review April 2, 2019 19:43

nvanbenschoten requested review from a team April 2, 2019 19:43

ajwerner reviewed Apr 2, 2019

nvanbenschoten force-pushed the nvanbenschoten/txnParallelCommitter2 branch 2 times, most recently from f9d3fc5 to 2f1d26c Compare April 17, 2019 18:41

nvanbenschoten commented Apr 17, 2019

nvanbenschoten force-pushed the nvanbenschoten/txnParallelCommitter2 branch from 2f1d26c to 5791929 Compare April 17, 2019 18:43

nvanbenschoten force-pushed the nvanbenschoten/txnParallelCommitter2 branch from 5791929 to a635e1c Compare May 2, 2019 19:58

nvanbenschoten mentioned this pull request May 3, 2019

kv: delete txnIntentCollector, track intent spans in txnPipeliner as "write footprint" #37302

Merged

nvanbenschoten force-pushed the nvanbenschoten/txnParallelCommitter2 branch 2 times, most recently from 1811d8e to d7ed002 Compare May 11, 2019 03:27

nvanbenschoten mentioned this pull request May 11, 2019

kv: track withCommit across concurrent batches split by divideAndSendBatchToRanges #37469

Merged

nvanbenschoten added 6 commits May 15, 2019 09:07

roachpb: forward search timestamp for QueryIntent by batch header txn

0cca5b1

This allows QueryIntent to behave correctly even as a transaction's provisional commit timestamp grows. Release note: None

kv: perform nil err check before non-nil resp check

ea5afeb

These checks usually go hand-in-hand, but the output of the error check is almost always more useful when tests fail. Release note: None

kv: use require.Len in txn interceptor tests

ec55ce5

Release note: None

storage/batcheval: s/args roachpb.EndTxnReq/args *roachpb.EndEndTxnReq/g

0e3d81b

roachpb.EndTransactionRequest is 120 bytes large. We shouldn't pass it by value to these leaf functions. Release note: None

nvanbenschoten force-pushed the nvanbenschoten/txnParallelCommitter2 branch from d7ed002 to c7bad2b Compare May 15, 2019 13:26

nvanbenschoten force-pushed the nvanbenschoten/txnParallelCommitter2 branch from c7bad2b to 987cc5b Compare May 15, 2019 14:10

tbg approved these changes May 15, 2019

nvanbenschoten added 5 commits May 15, 2019 14:39

kv: add metric to track parallel commits

7b944e9

Release note: None

storage: reduce indentation in Replica.updateTimestampCache

a010c7d

Just code movement. Release note: None

roachpb: create IsParallelCommit method on EndTransactionRequest

1e58017

Release note: None

nvanbenschoten force-pushed the nvanbenschoten/txnParallelCommitter2 branch from 987cc5b to 9d874cc Compare May 15, 2019 18:41

nvanbenschoten commented May 15, 2019

craig bot merged commit 9d874cc into cockroachdb:master May 15, 2019

nvanbenschoten deleted the nvanbenschoten/txnParallelCommitter2 branch May 20, 2019 21:02

nvanbenschoten mentioned this pull request Jul 19, 2019

kv/storage: support parallel transaction commits #30999

Closed

knz mentioned this pull request Nov 10, 2019

User-facing changes in 19.2 that were not picked up in release notes cockroachdb/docs#5819

Closed
