
store/tikv: avoid holding write lock for long time #6880

Merged · 4 commits into pingcap:master · Jun 25, 2018

Conversation

@coocood (Member) commented Jun 22, 2018

What have you changed? (mandatory)

When a TiKV store fails, many concurrent requests call OnRequestFail, and every one of them holds the write lock for a while trying to drop the regions on that store.
This PR lets only one request iterate all regions and drop the ones on the failed store; the other requests see that the store has already been dropped and quit early.
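For illustration, a minimal Go sketch of this pattern (the type and field names below are assumptions for the example, not the actual RegionCache code):

```go
package main

import (
	"fmt"
	"sync"
)

// regionCache is a stand-in for the real RegionCache; the fields here are
// illustrative assumptions, not the actual store/tikv package code.
type regionCache struct {
	mu      sync.RWMutex
	stores  map[uint64]struct{} // known store IDs
	regions map[uint64]uint64   // region ID -> store ID of its leader
}

// onRequestFail drops every cached region on the failed store. Only the
// first caller pays for the O(regions) scan; later concurrent callers see
// the store is already gone and release the write lock immediately.
func (c *regionCache) onRequestFail(failedStoreID uint64) {
	c.mu.Lock()
	if _, ok := c.stores[failedStoreID]; !ok {
		// Another request already dropped this store's regions; quit early.
		c.mu.Unlock()
		return
	}
	delete(c.stores, failedStoreID)
	for id, storeID := range c.regions {
		if storeID == failedStoreID {
			delete(c.regions, id)
		}
	}
	c.mu.Unlock()
	// Slow work (e.g. logging) happens after the lock is released.
	fmt.Printf("drop regions on store %d due to send request fail\n", failedStoreID)
}

func main() {
	c := &regionCache{
		stores:  map[uint64]struct{}{1: {}},
		regions: map[uint64]uint64{100: 1, 101: 1},
	}
	c.onRequestFail(1) // scans and drops regions on store 1
	c.onRequestFail(1) // early exit: store 1 is already gone
}
```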

This PR also removed the unreachableStores property on the Region struct, because:

  1. It makes things much more complex.
  2. We drop all the other regions on the store but keep the failed region whenever not all of its stores are unreachable, which doesn't make sense.

What is the type of the changes? (mandatory)

  • Improvement (non-breaking change which is an improvement to an existing feature)
    Optimize RegionCache performance on send request failure

How has this PR been tested (mandatory)?

Unit test and benchmark test

Benchmark result if necessary (optional)

from 28221 ns/op to 125 ns/op
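For context, this is roughly how such a number could be produced with Go's testing package. The benchmark below is a hypothetical sketch against the regionCache type from the description above, not the PR's actual benchmark code:

```go
// Hypothetical microbenchmark; it lives in the same package as the
// regionCache sketch above. After the first iteration drops store 1,
// every later iteration takes the early-exit fast path.
package main

import "testing"

func BenchmarkOnRequestFail(b *testing.B) {
	c := &regionCache{
		stores:  map[uint64]struct{}{1: {}},
		regions: map[uint64]uint64{100: 1, 101: 1},
	}
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		c.onRequestFail(1)
	}
}
```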

Optimize RegionCache performance on send request failure.
@coocood coocood requested review from disksing and tiancaiamao June 22, 2018 06:38
@coocood (Member, Author) commented Jun 22, 2018

@disksing @tiancaiamao @zz-jason PTAL

@shenli (Member) commented Jun 23, 2018

Will this PR improve the performance of sysbench?

@@ -464,8 +464,8 @@ func (r *Region) removePeer(peerID uint64) {
 	r.incConfVer()
 }
 
-func (r *Region) changeLeader(leaderStoreID uint64) {
-	r.leader = leaderStoreID
+func (r *Region) changeLeader(leaderID uint64) {
A reviewer (Member) commented:
Is this the ID of a peer or store?

@coocood (Member, Author) replied:

Peer. It was mistakenly named leaderStoreID.

	if !ok {
		// The failed region has already been dropped by another request; we don't need
		// to iterate the regions to find and drop the ones on the failed store.
		c.mu.Unlock()
A reviewer (Member) commented:
Why not use defer to unlock?

@coocood (Member, Author) replied:

So we can do something outside the lock at the end of this function.
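A sketch of the difference, reusing the hypothetical regionCache type from the PR description above (not the actual function):

```go
// With `defer c.mu.Unlock()`, the logging at the end would still run while
// holding the write lock; an explicit Unlock releases it first.
func (c *regionCache) dropStoreAndLog(failedStoreID uint64) {
	c.mu.Lock()
	delete(c.stores, failedStoreID) // cache mutation stays in the critical section
	c.mu.Unlock()                   // release before the slow tail work
	fmt.Printf("dropped store %d outside the lock\n", failedStoreID)
}
```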

@@ -581,26 +579,6 @@ func (r *Region) GetContext() *kvrpcpb.Context {
 	}
 }
 
-// OnRequestFail records unreachable peer and tries to select another valid peer.
-// It returns false if all peers are unreachable.
-func (r *Region) OnRequestFail(storeID uint64) bool {
A reviewer (Member) commented:

Why remove this? Is this logic useless or moved to another place?

@coocood (Member, Author) replied:

It's useless.

@disksing (Contributor) commented Jun 25, 2018:
There are some considerations behind using the unreachable store list.

Consider: a store is down and another peer of the region becomes the leader, but somehow the new leader is not able to send its heartbeat to PD in time.

With the unreachable store list, TiDB can try the other peers automatically; otherwise, it will keep reconnecting to the down TiKV until it times out.

@coocood (Member, Author) replied:

@disksing
I know, but since we drop all the other regions on the store on a send-request failure, keeping an unreachable list for only one region doesn't make any difference.

@shenli (Member) commented Jun 23, 2018

@disksing @zhangjinpeng1987 PTAL

@coocood (Member, Author) commented Jun 25, 2018

@shenli
This PR only optimizes the case when a tikv-server is down; it cannot improve the sysbench result.

@tiancaiamao (Contributor) commented:
LGTM

@tiancaiamao tiancaiamao added the status/LGT1 Indicates that a PR has LGTM 1. label Jun 25, 2018
			c.dropRegionFromCache(id)
		}
	}
	c.mu.Unlock()
	log.Infof("drop regions on the store %d due to send request fail, err: %v", failedStoreID, err)
A reviewer (Member) commented:
Can we also print the "IP:port" address of the failed store?

@coocood (Member, Author) commented Jun 25, 2018

@zz-jason PTAL

@zz-jason (Member) left a comment:
LGTM

@zz-jason zz-jason added status/LGT2 Indicates that a PR has LGTM 2. and removed status/LGT1 Indicates that a PR has LGTM 1. labels Jun 25, 2018
@coocood coocood merged commit 3ac6d3a into pingcap:master Jun 25, 2018
@coocood coocood deleted the region-cache-lock branch June 25, 2018 13:31