-
Notifications
You must be signed in to change notification settings - Fork 5.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
meta/autoid: make autoid client ResetConn operation concurrency-safe (#50522) #50592
meta/autoid: make autoid client ResetConn operation concurrency-safe (#50522) #50592
Conversation
Signed-off-by: ti-chi-bot <[email protected]>
/test unit-test |
@tiancaiamao: No presubmit jobs available for pingcap/[email protected] In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
Codecov Report
Additional details and impacted files@@ Coverage Diff @@
## release-7.1 #50592 +/- ##
================================================
Coverage ? 73.4203%
================================================
Files ? 1207
Lines ? 378703
Branches ? 0
================================================
Hits ? 278045
Misses ? 82922
Partials ? 17736 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: bb7133, xhebox, zimulala The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
This is an automated cherry-pick of #50522
What problem does this PR solve?
Issue Number: close #50519
Problem Summary:
What changed and how does it work?
The old logic is not concurrency-safe. Although some of the fields are protected by mutex, the logic is not correct.
Imagine that there are 7K concurrent Alloc() operation, and one of them meet rpc error.
Then one
ResetConn()
is called, which invokesgrpcConn.Close()
.Note, there are many ongoing calling of
Alloc()
and still using the conn, so close the grpcConn cause this error:This error in turn cause
ResetConn()
calling.Even though
GetClient()
get a new client, the new client might be reset again by theResetConn()
mistakenly!So the client can not recover from the error automatically some times (when the concurrent contention is high enough).
To summarize, there are too things we should avoid:
grpcConn.Close()
inResetConn()
when the grpcConn might still been using by some other sessions.ResetConn()
could be called multiple times and mistakenly reset the new connection.Check List
Tests
Rename this to cmd/autoid/main.go and go run it (modify tidb/pkg/autoid_service/autoid.go and mock 1/1000 percent error rate also).
main.txt
Use the test described in #50519, after the fix, the QPS become stable:
And the error message like this is gone:
Side effects
Documentation
Release note
Please refer to Release Notes Language Style Guide to write a quality release note.