sql: prioritize retryable errors in synchronizeParallelStmts #23294

nvanbenschoten · 2018-03-01T21:46:24Z

At the moment, parallel statement execution works by sending batches
concurrently through a single client.Txn. This makes the handling of
retryable errors tricky because it's difficult to know when it's safe
to prepare the transaction state for a retry. Our approach to this is
far from optimal, and relies on a mess of locking in both client.Txn
and TxnCoordSender. This works well enough to prevent anything from
seriously going wrong (#17197), but can result in some confounding error
behavior when statements operate in the context of transaction epochs
that they weren't expecting.

The ideal situation would be for all statements with a handle to a txn
to always work under the same txn epoch at a single point in time. Any
retryable error seen by these statements would be propagated up through
client.Txn without changing any state (and without yet being converted
to a HandledRetryableTxnError), and only after the statements have all
been synchronized would the retryable error be used to update the txn
and prepare for the retry attempt. This would require a change like #22615.
I've created a POC for this approach, but it is way too invasive to
cherry-pick.

So with our current state of things, we need to do a better job catching
errors caused by concurrent retries. In the past we've tried to carefully
determine which errors could be a symptom of a concurrent retry and ignore
them. I now think this was a mistake, as this process of inferring which
errors could be caused by a txn retry is fraught with risk. We now
always return retryable errors from synchronizeParallelStmts when they
exist. The reasoning for this is that if an error was a symptom of the
txn retry, it will not be present during the next txn attempt. If it was
not and instead was a legitimate query execution error, we expect to
hit it again on the next txn attempt, and the behavior will be the same
as if the statement throwing the execution error had not even run before
the parallel queue hit the retryable error.
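
As a rough illustration of the prioritization described above, here is a simplified sketch. It is not the code in this PR: `retryableError`, `execError`, `isRetryable`, and `pickParallelErr` are hypothetical names, and the sketch ignores details such as comparing errors against the current txn proto.

```go
package main

import (
	"errors"
	"sort"
)

// retryableError is a hypothetical stand-in for a retryable txn error.
type retryableError struct{ msg string }

func (e *retryableError) Error() string { return e.msg }

// execError is a hypothetical stand-in for an ordinary query execution error.
type execError struct{ msg string }

func (e execError) Error() string { return e.msg }

// isRetryable reports whether err is (or wraps) a retryableError.
func isRetryable(err error) bool {
	var r *retryableError
	return errors.As(err, &r)
}

// pickParallelErr returns the error to surface after all parallel
// statements have been synchronized. Retryable errors always win over
// other errors; otherwise the original order is preserved.
func pickParallelErr(errs []error) error {
	if len(errs) == 0 {
		return nil
	}
	// Sort a copy so the caller's slice is left untouched.
	sorted := append([]error(nil), errs...)
	sort.SliceStable(sorted, func(i, j int) bool {
		return isRetryable(sorted[i]) && !isRetryable(sorted[j])
	})
	return sorted[0]
}
```

With this ordering, a slice holding an execution error followed by a retryable error yields the retryable error, so the client retries the txn. If the execution error was genuine, it resurfaces on the next attempt.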

Note: the duplicated code will go away when session.go is retired next week!

Release note: None

cockroach-teamcity · 2018-03-01T21:46:30Z

This change is Reviewable

bdarnell · 2018-03-01T22:30:14Z

Review status: 0 of 3 files reviewed at latest revision, 1 unresolved discussion, all commit checks successful.

pkg/sql/conn_executor.go, line 1389 at r1 (raw file):

		// Sort the errors according to their importance.
		curTxn := ex.state.mu.txn.Proto()
		sort.Slice(errs, func(i, j int) bool {

Sorting the whole slice is unnecessary if you just want the minimum (may not be worth worrying about, though).

Comments from Reviewable

nvanbenschoten · 2018-03-01T23:26:08Z

Review status: 0 of 3 files reviewed at latest revision, 1 unresolved discussion, all commit checks successful.

pkg/sql/conn_executor.go, line 1389 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Sorting the whole slice is unnecessary if you just want the minimum (may not be worth worrying about, though).

Yeah, that's a good point. This slice will only be on the order of the number of statements run in parallel though, so I'm not too worried about the cost of sorting the whole slice if it makes this easier to understand.
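
For reference, the single-pass alternative suggested above could look roughly like the following. This is only a sketch under assumptions: `rankedErr`, its `rank` field, `simpleErr`, and `mostImportant` are hypothetical and do not appear in conn_executor.go.

```go
package main

// simpleErr is a hypothetical minimal error type used for illustration.
type simpleErr string

func (e simpleErr) Error() string { return string(e) }

// rankedErr pairs an error with a hypothetical importance rank,
// where a lower rank means a more important error.
type rankedErr struct {
	err  error
	rank int
}

// mostImportant scans the slice once and returns the error with the
// lowest rank, keeping the earliest one on ties. It returns nil for
// an empty slice.
func mostImportant(errs []rankedErr) error {
	if len(errs) == 0 {
		return nil
	}
	best := errs[0]
	for _, e := range errs[1:] {
		if e.rank < best.rank {
			best = e
		}
	}
	return best.err
}
```

This is O(n) rather than O(n log n), though with only a handful of parallel statements the difference is unlikely to matter.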

nvanbenschoten requested review from andreimatei and a team March 1, 2018 21:46

nvanbenschoten merged commit e075756 into cockroachdb:master Mar 1, 2018

nvanbenschoten deleted the nvanbenschoten/throwawayErr branch March 1, 2018 23:37

nvanbenschoten mentioned this pull request Mar 6, 2018

sql/logictest: TestParallel/subquery_retry_multinode failed under stress #22926

Closed
