client: fix race between transport draining and new RPCs #2919

Merged · 2 commits · Jul 22, 2019

Conversation

dfawley (Member) commented Jul 19, 2019

Fixes #2767

Before these fixes, it was possible to see errors on new RPCs after a
connection began draining and before a new connection was established. There is
an inherent race between choosing a SubConn and attempting to create a stream
on it. We should be able to avoid application-visible RPC errors due to this
with transparent retry. However, several bugs were preventing this from
working correctly (a reproduction sketch follows the list):

  1. Non-wait-for-ready RPCs were skipping transparent retry, though the retry
    design calls for retrying them.

  2. The transport closed itself (and would consequently error new RPCs) before
    notifying the SubConn that it was draining.

  3. The SubConn wasn't synchronously updating itself once it was notified about
    the closing or draining state.

  4. The SubConn would go into the TRANSIENT_FAILURE state instantaneously,
    causing RPCs to fail instead of queue.

@dfawley dfawley added this to the 1.23 Release milestone Jul 19, 2019
@dfawley dfawley requested a review from menghanl July 19, 2019 22:16
Review thread on test/goaway_test.go (outdated, resolved)
gyuho (Contributor) commented Jul 23, 2019

@dfawley Can we confirm the 1.23 release date? https://github.com/grpc/grpc-go/milestone/21 says August 13. etcd is using 1.22, and we would like to use 1.23 with this fix for our 3.4 release. Thanks!

/cc @jpbetz

menghanl (Contributor) commented

@gyuho The milestone has the right date. Release 1.23 is scheduled for August 13.

gyuho (Contributor) commented Jul 23, 2019

@menghanl Thanks for the confirmation!

The lock bot locked this conversation as resolved and limited it to collaborators on Jan 24, 2020.
Successfully merging this pull request may close these issues.

Unavailable error when MaxConnectionAge and MaxConnectionAgeGrace is enabled on the server