[grpctmclient] Add support for (bounded) per-host connection reuse #8368
Conversation
@@ -72,13 +76,40 @@ type Client struct {
	rpcClientMap map[string]chan *tmc
}

type dialer interface {
Note that for the cachedConnDialer we don't implement poolDialer, because every call to dial is by definition "pooled", just for a slightly different definition of pooled.
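For context, a minimal sketch of what this interface split might look like, assuming (per the Phase 1 description below) that dial now takes a context and returns the client plus an io.Closer; the signatures are my reconstruction, not the final code:

```go
package grpctmclient

import (
	"context"
	"io"

	tabletmanagerservicepb "vitess.io/vitess/go/vt/proto/tabletmanagerservice"
	topodatapb "vitess.io/vitess/go/vt/proto/topodata"
)

// dialer hands out a TabletManagerClient plus a closer that releases the
// underlying connection (for the oneshot dialer: closes it; for the cached
// dialer: decrements its reference count).
type dialer interface {
	dial(ctx context.Context, tablet *topodatapb.Tablet) (tabletmanagerservicepb.TabletManagerClient, io.Closer, error)
	Close()
}

// poolDialer is the optional pooled path of the oneshot client. The
// cachedConnDialer deliberately does not implement it, since every dial it
// performs is already "pooled" by the cache.
type poolDialer interface {
	dialPool(ctx context.Context, tablet *topodatapb.Tablet) (tabletmanagerservicepb.TabletManagerClient, error)
}
```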
@@ -14,7 +14,7 @@ See the License for the specific language governing permissions and
limitations under the License.
*/

-package grpctmserver
+package grpctmserver_test
Made this change so I can call RegisterForTest and not generate an import cycle between here and the grpctmclient tests.
@@ -45,7 +45,7 @@ import (
// fakeRPCTM implements tabletmanager.RPCTM and fills in all
// possible values in all APIs
type fakeRPCTM struct {
-	t *testing.T
+	t testing.TB
These changes exist to support the one lil' benchmark I wrote for the cached client.
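(A tiny illustration rather than code from this PR: both *testing.T and *testing.B satisfy testing.TB, which is why switching the fake's field to testing.TB lets the same fake serve both tests and benchmarks. The package and helper names below are hypothetical.)

```go
package tmrpctest

import "testing"

// fakeRPCTM mirrors the struct change above: holding a testing.TB instead of
// a *testing.T means both tests and benchmarks can construct it.
type fakeRPCTM struct {
	t testing.TB
}

// newFakeRPCTM is a hypothetical helper for illustration only; it accepts
// either a *testing.T or a *testing.B.
func newFakeRPCTM(tb testing.TB) *fakeRPCTM {
	return &fakeRPCTM{t: tb}
}
```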
Looking at this today! Sorry for the delay, I was busy fighting GRPC itself. 👀
OK! Finished reading through the PR! It looks pretty good, I like the direction this is heading. I have two main concerns:
I've put these two hypotheses to the test by, huh, well, implementing and benchmarking them. You can see my changes in be8fede, and the benchmark before/after:
Would you mind reviewing my commit and cherry-picking it into this PR? I'm hoping you'll agree the code is meaningfully simpler! That's from a perf/correctness point of view. Besides that, the other global refactorings look good to me. Changing the return for the methods into the client +
I left a couple comments on that commit! Basically I think there's one place where we're risking deadlock (so I'm guessing your benchmark happened to not hit that case), and that there are a few places where we're adjusting entries in the queue without re-sorting.

There's both a performance and correctness concern with that last point: correctness because we may end up evicting the wrong connection and getting the cache into a weird state, and performance because there are places where we should be paying the O(n log n) cost of resorting, but we're not. That might also explain the perf difference between your patch and this.

I'm also curious what pool size you're testing on; I was able to run this with a pool size of 1000 and got the following stats after a few rounds of
Oh yeah, those two comments are on point. Thanks for double-checking. I can't push to this branch to fix them though! I'll push a new commit tomorrow morning with proper fixes and benchmark again; I hope fixing the correctness won't degrade performance.
if os.Getenv("VT_PPROF_TEST") != "" {
	file, err := os.Create(fmt.Sprintf("%s.profile.out", t.Name()))
	require.NoError(t, err)
	defer file.Close()
	if err := pprof.StartCPUProfile(file); err != nil {
		t.Errorf("failed to start cpu profile: %v", err)
		return
	}
	defer pprof.StopCPUProfile()
}
By the way, you can get this exact same behavior by simply passing the -cpuprofile flag to go test.
👍
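(For reference, an equivalent flow would be something like `go test -run '^$' -bench BenchmarkCachedConnClient -cpuprofile cpu.out ./go/vt/vttablet/grpctmclient/`, then `go tool pprof cpu.out`; the benchmark name here is illustrative, not necessarily the one in this PR.)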
Unfortunately I didn't have time to look at this, and I'm going to be out on vacation until Tuesday, but I will try this out in some testing environments when I'm back!!
Sounds great. Have a good time!
@@ -88,10 +119,11 @@ func (client *Client) dial(tablet *topodatapb.Tablet) (*grpc.ClientConn, tabletm
	if err != nil {
Now that you've properly wired up a context.Context to this API, we can change the grpcclient.Dial call here to grpcclient.DialContext, which is going to fix an annoying issue, as seen here: #8387 (comment)

Can you update this PR accordingly? That'll unblock the other PR. :)
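To make the suggested swap concrete, a hedged sketch, assuming grpcclient.DialContext mirrors grpcclient.Dial with a leading context argument (check the real signature in go/vt/grpcclient before copying this):

```go
package grpctmclient

import (
	"context"

	"google.golang.org/grpc"

	"vitess.io/vitess/go/vt/grpcclient"
)

// dialConn is a sketch, not the PR's code: by threading the caller's context
// into the dial, cancellations and deadlines bound the connection attempt
// instead of being ignored, which is the issue referenced in #8387.
func dialConn(ctx context.Context, addr string, opts ...grpc.DialOption) (*grpc.ClientConn, error) {
	// Previously: grpcclient.Dial(addr, grpcclient.FailFast(false), opts...)
	return grpcclient.DialContext(ctx, addr, grpcclient.FailFast(false), opts...)
}
```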
Happy to swap to DialContext, but per my comment on the other PR there's more work needed to remove that hack
Alright, back! I'm going to be testing this out todayyyy
Okay!! Very happy to report that this works just as well in practice. I was somewhat concerned that dialing outside of the lock and throwing it away would result in too much connection churn, which was the main thing I was trying to reduce, but I can't make this meaningfully happen in real workloads. I then went ahead and realized that if we evict from the front (it works the same in reverse, it's just easier for me to think about this way 🤷), when we call
Looking forward to your thoughts; I'm thinking tomorrow we can polish this up and get it merged!!
	}
}

// tryFromCache tries to get a connection from the cache, performing a redial
I'll clean up all these comments to reflect the actual state of the code tomorrow
// As a result, it is not safe to reuse a cachedConnDialer after calling Close,
// and you should instead obtain a new one by calling either
// tmclient.TabletManagerClient() with
// TabletManagerProtocol set to "grpc-cached", or by calling
// grpctmclient.NewCachedConnClient directly.
Similarly, this comment is flat-out not true in the simpler implementation, so I'll clean that up as well
Ah, I lied, this is partially true: since the closer func still locks dialer.m, it will still contend with actual dials if you reuse a cachedConnDialer after close, but it won't actually mess with the state of the queue. So it's "safe" but not ideal.
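For readers following along, a minimal sketch (names and fields are assumptions, not the PR's exact code) of why the closer still contends: releasing a cached conn takes the dialer's mutex to update its bookkeeping, so it serializes with concurrent dials even after Close, without touching the queue's membership.

```go
package grpctmclient

import (
	"sync"
	"time"
)

type cachedConn struct {
	refs           int
	lastAccessTime time.Time
}

type cachedConnDialer struct {
	m sync.Mutex
}

// release returns the closer-style func handed back from dial. It locks
// dialer.m to decrement the conn's refcount and bump lastAccessTime, which is
// why it still contends with in-flight dials after the dialer is closed.
func (d *cachedConnDialer) release(conn *cachedConn) func() {
	return func() {
		d.m.Lock()
		defer d.m.Unlock()

		conn.refs--
		conn.lastAccessTime = time.Now()
		// The real implementation would also restore the eviction queue's
		// ordering here (e.g. via heap.Fix on the conn's index).
	}
}
```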
// b.Cleanup(shutdown)
defer shutdown()
reminder for me to make a final pass to clean up these tests as well
This is great! Code looks ready to me. Let's fix the last few outdated comments and merge.
Signed-off-by: Andrew Mason <[email protected]>
- Move grpctmserver tests to a test package to allow grpctmclient tests to import grpctmserver (this prevents an import cycle)
- Change fakeRPCTM to take a testing.TB, which both benchmark and test structs satisfy

Signed-off-by: Andrew Mason <[email protected]>
Signed-off-by: Andrew Mason <[email protected]>
Okay! I've finished cleaning up; 1151c6d and b5af9f6 contain those changes if you want to view them in isolation. I'm going to clean up the commit history because there are a bunch of commits in this branch related to an older implementation that I threw away, so there's no reason to keep it around. (I am keeping a branch locally pointed at the pre-rebase version in case we want to do anything differently.)
Two bugfixes:
- one to prevent leaking open connections (by checking the map again after getting the write lock)
- one to prevent leaving around closed connections, resulting in errors on later uses (by deleting the freed conn's addr and not the addr we're attempting to dial)

Refactor to not duplicate dialing/queue management; I don't want mistaken copy-paste to result in a bug between sections that should be otherwise identical

Add one more missing "another goroutine dialed" check

Add some stats to the cached conn dialer

Remove everything from the old, slower cache implementation, pqueue is the way to go

lots and lots and lots and lots of comments

Signed-off-by: Andrew Mason <[email protected]>
…recation easier

Refactor sections of the main `dial` method to allow all unlocks to be deferrals

Add a test case to exercise evictions and full caches

Refactor heap operations to make timing them easier

Signed-off-by: Andrew Mason <[email protected]>
Signed-off-by: Andrew Mason <[email protected]>
Signed-off-by: Vicent Marti <[email protected]>
Signed-off-by: Andrew Mason <[email protected]>
By putting evictable conns at the front, we know that when we append to the end we don't need to resort, so multiple successive evictions can be faster

Signed-off-by: Andrew Mason <[email protected]>
Signed-off-by: Andrew Mason <[email protected]>
Signed-off-by: Andrew Mason <[email protected]>
Signed-off-by: Andrew Mason <[email protected]>
Signed-off-by: Andrew Mason <[email protected]>
Compare b5af9f6 to b5908a8
Looking great 👍
…t_conns [grpctmclient] Add support for (bounded) per-host connection reuse Signed-off-by: Andrew Mason <[email protected]>
Description
(Issue to come, I mainly want to see how this does in CI, even though I know it works in practice).
tl;dr: the oneshot dialing strategy of the existing grpc implementation of TabletManagerClient results in a lot of connection churn when doing a high volume of tabletmanager RPCs (which happens when doing VReplicationExec calls on workflows for larger keyspaces, for example). In addition, this can cause the number of OS threads to spike as the Go runtime has to park more and more goroutines on the syscall queue in order to wait for those connections, which can result in crashes under high enough load.
This aims to fix that by introducing a per-tablet connection cache. The main reason I don't want to expand the use of the usePool flag in the old implementation is that it opens multiple (tablet_manager_grpc_concurrency, default 8) connections per tablet, rather than using one connection across all RPCs.

Strategy:
Phase 1 - Prepare for a new dialer
I refactored the existing client to separate the dialing of a connection from the actual making of the RPC call, and then defined an interface for both the dial and dialPool methods. These now return both the client interface and an io.Closer, which in the original case just calls Close on the grpc ClientConn, but we're going to use this in the cachedconn case to manage reference counting the connections and updating the eviction queue.
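As a rough usage sketch (not the PR's code; the pared-down Client and the names here are assumptions), each RPC now follows the same dial/use/release shape:

```go
package grpctmclient

import (
	"context"
	"io"

	tabletmanagerdatapb "vitess.io/vitess/go/vt/proto/tabletmanagerdata"
	tabletmanagerservicepb "vitess.io/vitess/go/vt/proto/tabletmanagerservice"
	topodatapb "vitess.io/vitess/go/vt/proto/topodata"
)

// Client here is a pared-down stand-in for the real client, just enough to
// show the per-RPC pattern.
type Client struct {
	dialer interface {
		dial(ctx context.Context, tablet *topodatapb.Tablet) (tabletmanagerservicepb.TabletManagerClient, io.Closer, error)
	}
}

// Ping sketches the shape every RPC takes after the refactor: dial through
// the interface, use the returned client, and always release via the closer.
// For the oneshot dialer, Close tears down the grpc.ClientConn; for the
// cached dialer, it just drops a reference and updates the eviction queue.
func (client *Client) Ping(ctx context.Context, tablet *topodatapb.Tablet) error {
	c, closer, err := client.dialer.dial(ctx, tablet)
	if err != nil {
		return err
	}
	defer closer.Close()

	_, err = c.Ping(ctx, &tabletmanagerdatapb.PingRequest{Payload: "Ping"})
	return err
}
```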
Phase 2 - Adding the priority queue
I went through a few iterations to arrive at this, but it's fairly performant. It's based on the priority queue example found in the container/heap docs, and sorts connections by fewest "refs" (the number of times they have been acquired but not yet released), breaking ties by older lastAccessTime. [Note: I had the idea to include a background process to sweep up connections with refs==0 && time.Since(lastAccessTime) > idleTimeout, but in the first couple of passes it resulted in way too much lock contention, and the current implementation is good enough for my uses.]
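To make that ordering concrete, here is a minimal, hedged sketch modeled on the container/heap PriorityQueue example; the type and field names are assumptions rather than the PR's exact code. Keeping the most evictable conn (fewest refs, then oldest access) at the front is also what makes evicting from the front cheap.

```go
package grpctmclient

import (
	"container/heap"
	"time"

	"google.golang.org/grpc"
)

type cachedConn struct {
	cc             *grpc.ClientConn
	addr           string
	refs           int
	lastAccessTime time.Time
	index          int // maintained by the heap so heap.Fix can be used after refs change
}

// connHeap orders conns so the most evictable one (fewest refs, then oldest
// lastAccessTime) sits at index 0.
type connHeap []*cachedConn

func (h connHeap) Len() int { return len(h) }

func (h connHeap) Less(i, j int) bool {
	if h[i].refs != h[j].refs {
		return h[i].refs < h[j].refs
	}
	return h[i].lastAccessTime.Before(h[j].lastAccessTime)
}

func (h connHeap) Swap(i, j int) {
	h[i], h[j] = h[j], h[i]
	h[i].index = i
	h[j].index = j
}

func (h *connHeap) Push(x interface{}) {
	conn := x.(*cachedConn)
	conn.index = len(*h)
	*h = append(*h, conn)
}

func (h *connHeap) Pop() interface{} {
	old := *h
	n := len(old)
	conn := old[n-1]
	old[n-1] = nil
	*h = old[:n-1]
	return conn
}

// touch is an illustrative helper: whenever a conn's refs or lastAccessTime
// change, the O(log n) heap.Fix restores ordering instead of a full resort.
func touch(h *connHeap, conn *cachedConn) {
	conn.refs++
	conn.lastAccessTime = time.Now()
	heap.Fix(h, conn.index)
}
```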
Related Issue(s)
None yet.
Checklist
Deployment Notes