client: fix race between transport draining and new RPCs #2919
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Fixes #2767
Before these fixes, it was possible to see errors on new RPCs after a
connection began draining, and before establishing a new connection. There is
an inherent race between choosing a SubConn and attempting to creating a stream
on it. We should be able to avoid application-visible RPC errors due to this
with transparent retry. However, several bugs were preventing this from
working correctly:
Non-wait-for-ready RPCs were skipping transparent retry, though the retry
design calls for retrying them.
The transport closed itself (and would consequently error new RPCs) before
notifying the SubConn that it was draining.
The SubConn wasn't synchronously updating itself once it was notified about
the closing or draining state.
The SubConn would go into the TRANSIENT_FAILURE state instantaneously,
causing RPCs to fail instead of queue.