sql: prioritize retryable errors in synchronizeParallelStmts #23294
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
At the moment, parallel statement execution works by sending batches
concurrently through a single
client.Txn
. This make the handling ofretryable errors tricky because it's difficult to know when its safe
to prepare the transaction state for a retry. Our approach to this is
far from optimal, and relies on a mess of locking in both
client.Txn
and
TxnCoordSender
. This works well enough to prevent anything fromseriously going wrong (#17197), but can result in some confounding error
behavior when statements operate in the context of transaction epochs
that they weren't expecting.
The ideal situation would be for all statements with a handle to a txn
to always work under the same txn epoch at a single point in time. Any
retryable error seen by these statements would be propagated up through
client.Txn
without changing any state (and without yet being convertedto a
HandledRetryableTxnError
), and only after the statements have allbeen synchronized would the retryable error be used to update the txn
and prepare for the retry attempt. This would require a change like #22615.
I've created a POC for this approach, but it is way to invasive to
cherry-pick.
So with our current state of things, we need to do a better job catching
errors caused by concurrent retries. In the past we've tried to carefully
determine which errors could be a symptom of a concurrent retry and ignore
them. I now think this was a mistake, as this process of inferring which
errors could be caused by a txn retry is fraught for failure. We now
always return retryable errors from synchronizeParallelStmts when they
exist. The reasoning for this is that if an error was a symptom of the
txn retry, it will not be present during the next txn attempt. If it was
not and instead was a legitimate query execution error, we expect to
hit it again on the next txn attempt and the behavior will mirror that
where the statement throwing the execution error was not even run before
the parallel queue hit the retryable error.
Note: the duplicated code will go away when
session.go
is retired next week!Release note: None