store/tikv: avoid holding write lock for long time #6880
Optimize RegionCache performance on send request failure.
Will this PR improve the performance of sysbench?
@@ -464,8 +464,8 @@ func (r *Region) removePeer(peerID uint64) {
 	r.incConfVer()
 }

-func (r *Region) changeLeader(leaderStoreID uint64) {
-	r.leader = leaderStoreID
+func (r *Region) changeLeader(leaderID uint64) {
Is this the ID of a peer or store?
Peer; it was mistakenly named leaderStoreID.
	if !ok {
		// The failed region is dropped already by another request, we don't need to iterate the regions
		// and find regions on the failed store to drop.
		c.mu.Unlock()
Why not use defer to unlock?
So that we can do some work outside the lock at the end of this function.
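The trade-off in this thread can be sketched as follows. This is a minimal illustration under stated assumptions, not the PR's actual code: regionCache, its regions map, and dropStoreOnSendFail are hypothetical names, and the cache is reduced to a map from region ID to store ID. It shows the early return when another request has already dropped the failed region, and explicit Unlock calls so that logging runs after the lock is released.

```go
package main

import (
	"log"
	"sync"
)

// regionCache is a hypothetical, simplified stand-in for the real RegionCache.
type regionCache struct {
	mu      sync.RWMutex
	regions map[uint64]uint64 // region ID -> ID of the store serving the region
}

// dropStoreOnSendFail drops every cached region located on failedStoreID and
// returns how many regions were dropped. If regionID is no longer cached,
// another request has already done the work, so it returns 0 immediately.
func (c *regionCache) dropStoreOnSendFail(regionID, failedStoreID uint64) int {
	c.mu.Lock()
	if _, ok := c.regions[regionID]; !ok {
		// Early exit: release the write lock without scanning the cache.
		c.mu.Unlock()
		return 0
	}
	dropped := 0
	for id, storeID := range c.regions {
		if storeID == failedStoreID {
			delete(c.regions, id) // deleting during range is safe in Go
			dropped++
		}
	}
	// Explicit Unlock instead of defer, so the logging below runs
	// after the write lock has been released.
	c.mu.Unlock()

	log.Printf("dropped %d regions on store %d", dropped, failedStoreID)
	return dropped
}
```

With defer c.mu.Unlock(), the log call would run while the write lock is still held. Unlocking explicitly keeps the critical section short, at the cost of having to unlock on every return path.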
@@ -581,26 +579,6 @@ func (r *Region) GetContext() *kvrpcpb.Context {
 	}
 }

-// OnRequestFail records unreachable peer and tries to select another valid peer.
-// It returns false if all peers are unreachable.
-func (r *Region) OnRequestFail(storeID uint64) bool {
Why remove this? Is this logic useless or moved to another place?
It's useless.
There are some considerations behind using the unreachable store list.
Suppose a store is down and another peer of the region becomes the leader, but for some reason the new leader cannot send a heartbeat to PD in time.
With the unreachable store list, TiDB can try the other peers automatically. Otherwise, it will keep reconnecting to the down TiKV until it times out.
@disksing
I know, but we drop all the other regions on the store when a request fails to send, so keeping an unreachable list for only one region doesn't make any difference.
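For readers unfamiliar with the removed logic, here is a rough sketch of what a per-region unreachable-store list does. It is based only on the doc comment of OnRequestFail quoted in the diff above. The peer and region types, their fields, and the selection order are assumptions made for illustration, not the removed implementation.

```go
package main

// peer is a hypothetical, simplified replica descriptor.
type peer struct {
	id      uint64
	storeID uint64
}

// region is a hypothetical, simplified region that tracks unreachable stores.
type region struct {
	peers             []peer
	leader            peer
	unreachableStores map[uint64]struct{}
}

// onRequestFail records storeID as unreachable and tries to select another
// peer whose store is not known to be unreachable. It returns false if every
// peer of the region is on an unreachable store.
func (r *region) onRequestFail(storeID uint64) bool {
	if r.unreachableStores == nil {
		r.unreachableStores = make(map[uint64]struct{})
	}
	r.unreachableStores[storeID] = struct{}{}

	for _, p := range r.peers {
		if _, unreachable := r.unreachableStores[p.storeID]; !unreachable {
			r.leader = p // retry the request against this peer
			return true
		}
	}
	return false
}
```

The list only helps while the region stays in the cache. If every region on the failed store is dropped on a send failure, the list is discarded along with the region, which is the point made in the reply above.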
@shenli
LGTM
store/tikv/region_cache.go
Outdated
			c.dropRegionFromCache(id)
		}
	}
	c.mu.Unlock()
	log.Infof("drop regions that on the store %d due to send request fail, err: %v", failedStoreID, err)
Can we also print the "IP:port" address of the failed store?
@zz-jason PTAL
LGTM
What have you changed? (mandatory)
When a TiKV store fails, many concurrent requests call OnRequestFail, and each of them holds the write lock for a while trying to drop the regions on that store. This PR lets only one request iterate over all regions and drop those on the failed store; the others see that the store has already been dropped and quit early.
This PR also removes the unreachableStores property from the Region struct, because
What are the types of the changes (mandatory)?
Optimize RegionCache performance on send request failure
How has this PR been tested (mandatory)?
Unit test and benchmark test
Benchmark result if necessary (optional)
from 28221 ns/op to 125 ns/op