
[grpctmclient] Add support for (bounded) per-host connection reuse #8368

Merged
merged 13 commits into vitessio:main from am_cached_grpctmclient_conns on Jul 6, 2021

Conversation

@ajm188 ajm188 commented Jun 22, 2021

Description

(Issue to come, I mainly want to see how this does in CI, even though I know it works in practice).

tl;dr: the oneshot dialing strategy of the existing gRPC implementation of TabletManagerClient results in a lot of connection churn when doing a high volume of tabletmanager RPCs (which happens, for example, when doing VReplicationExec calls on workflows for larger keyspaces). In addition, this can cause the number of OS threads to spike, because the Go runtime has to park more and more goroutines on the syscall queue to wait for those connections, which can result in crashes under high enough load.

This aims to fix that by introducing a per-tablet connection cache. The main reasons I don't want to expand the use of the usePool flag in the old implementation are:

  • we still create N (tablet_manager_grpc_concurrency, default 8) connections per tablet, rather than using one connection across all RPCs
  • those connections then exist for the life of the vtctld process and are never closed, so with enough tablets you will eventually end up with a huge number of open connections.

Strategy:

Phase 1 - Prepare for a new dialer

I refactored the existing client to separate the dialing of a connection from the actual making of the RPC call, and then defined interfaces for the dial and dialPool methods. These now return both the client interface and an io.Closer, which in the original case just calls Close on the grpc ClientConn, but which we'll use in the cachedconn case to manage reference counting of the connections and to update the eviction queue.
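
For reference, a rough sketch of the shape of that interface (names and the exact signature are approximations; in particular, a context.Context only gets wired into this API later in the review):

package grpctmclient

import (
	"context"
	"io"

	tabletmanagerservicepb "vitess.io/vitess/go/vt/proto/tabletmanagerservice"
	topodatapb "vitess.io/vitess/go/vt/proto/topodata"
)

// dialer hides how a connection to a tablet's tabletmanager is obtained.
// The returned io.Closer closes the underlying *grpc.ClientConn in the
// oneshot case; in the cachedconn case it instead releases a reference
// back to the cache and updates the eviction queue.
// (A separate poolDialer interface covers the fixed-pool dialPool path.)
type dialer interface {
	dial(ctx context.Context, tablet *topodatapb.Tablet) (tabletmanagerservicepb.TabletManagerClient, io.Closer, error)
}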

Phase 2 - Adding the priority queue

I went through a few iterations to arrive at this, but it's fairly performant. It's based on the priority queue example in the container/heap docs, and sorts connections by fewest "refs" (number of times they have been acquired but not yet released), breaking ties by older lastAccessTime. [Note: I had the idea to include a background process to sweep up connections with refs==0 && time.Since(lastAccessTime) > idleTimeout, but in the first couple of passes it resulted in way too much lock contention, and the current implementation is good enough for my uses.]
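
To illustrate the ordering (this is a sketch with assumed field names, not the exact code in this PR), a container/heap-based queue along those lines looks roughly like this:

package grpctmclient

import (
	"container/heap"
	"time"
)

// cachedConn is a cached connection plus the bookkeeping used to pick
// an eviction candidate.
type cachedConn struct {
	refs           int       // acquired but not yet released
	lastAccessTime time.Time // updated on dial/release
	index          int       // position in the heap, maintained by Swap
}

// connHeap keeps the best eviction candidate (fewest refs, oldest
// access time) at the root.
type connHeap []*cachedConn

var _ heap.Interface = (*connHeap)(nil)

func (h connHeap) Len() int { return len(h) }

func (h connHeap) Less(i, j int) bool {
	if h[i].refs != h[j].refs {
		return h[i].refs < h[j].refs
	}
	return h[i].lastAccessTime.Before(h[j].lastAccessTime)
}

func (h connHeap) Swap(i, j int) {
	h[i], h[j] = h[j], h[i]
	h[i].index, h[j].index = i, j
}

func (h *connHeap) Push(x interface{}) {
	c := x.(*cachedConn)
	c.index = len(*h)
	*h = append(*h, c)
}

func (h *connHeap) Pop() interface{} {
	old := *h
	n := len(old)
	c := old[n-1]
	old[n-1] = nil // avoid keeping a dangling reference
	*h = old[:n-1]
	return c
}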

Related Issue(s)

None yet.

Checklist

  • Tests were added or are not required - I did my best! We might want to mark these as flaky, but they pass consistently for me.
  • Documentation was added or is not required - ohhhhhh yeahhhhhhhhh

Deployment Notes

@@ -72,13 +76,40 @@ type Client struct {
rpcClientMap map[string]chan *tmc
}

type dialer interface {
Contributor Author

Note that for the cachedConnDialer we don't implement poolDialer, because every call to dial is by definition "pooled", just for a slightly different definition of pooled.

@@ -14,7 +14,7 @@ See the License for the specific language governing permissions and
limitations under the License.
*/

package grpctmserver
package grpctmserver_test
Contributor Author

Made this change so I can call RegisterForTest without generating an import cycle between here and the grpctmclient tests.

@@ -45,7 +45,7 @@ import (
// fakeRPCTM implements tabletmanager.RPCTM and fills in all
// possible values in all APIs
type fakeRPCTM struct {
t *testing.T
t testing.TB
Contributor Author

These changes exist to support the one lil' benchmark I wrote for the cached client.

vmg commented Jun 23, 2021

Looking at this today! Sorry for the delay, I was busy fighting GRPC itself. 👀

vmg commented Jun 23, 2021

OK! Finished reading through the PR! It looks pretty good, I like the direction this is heading. I have two main concerns:

  • I think the locking is not meaningful: with 3 recursive locking states (read m, write m and rw qMu), I can't demonstrate that the system doesn't deadlock (mostly because I'm not a very smart guy but regardless), and particularly for qMu, it seems like a sub-lock for m-write where the extra granularity couldn't possibly reduce actual lock contention. I have a hunch that removing all the locks and keeping a single sync.Mutex wouldn't affect performance in a measurable way.

  • I think the usage of a full heap is overkill. container/heap in the standard library is a bit of a trainwreck that requires a lot of boilerplate to provide very poor performance (since everything goes through interface{} anyway). My assumption is that simply keeping the eviction queue in a slice and sorting it lazily will provide the same or better performance than a heap (since fixing the heap on every refcount update is prohibitively expensive) and greatly simplify the whole client.

I've put these two H Y P O T H E S E S to the test by, huh, well, implementing and benchmarking them.

You can see my changes in be8fede

And the benchmark before/after:

name                 old time/op  new time/op  delta
CachedConnClient-16   144µs ± 2%   143µs ± 1%   ~     (p=0.113 n=10+9)

Would you mind reviewing my commit and cherry picking it into this PR? I'm hoping you'll agree the code is meaningfully simpler!

That's from a perf/correctness point of view. Besides that, the other global refactorings look good to me. Changing the return for the methods into the client + io.Closer seems like the best way to support pooled and unpooled at the same time. 👍
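
For illustration, a lazily sorted eviction slice along these lines might look roughly like this (a sketch with assumed names, not the code in be8fede):

package grpctmclient

import (
	"sort"
	"sync"
	"time"
)

type cachedConn struct {
	refs           int
	lastAccessTime time.Time
}

// evictionQueue keeps every cached conn in a plain slice and only sorts
// when an eviction candidate is actually needed, instead of fixing up a
// heap on every refcount change.
type evictionQueue struct {
	mu     sync.Mutex
	conns  []*cachedConn
	sorted bool // false whenever refs/lastAccessTime changed since the last sort
}

// markDirty is called whenever a conn's refs or lastAccessTime changes.
func (q *evictionQueue) markDirty() {
	q.mu.Lock()
	q.sorted = false
	q.mu.Unlock()
}

// evictionCandidate returns the conn with the fewest refs, breaking ties
// by older lastAccessTime, sorting lazily if anything changed.
func (q *evictionQueue) evictionCandidate() *cachedConn {
	q.mu.Lock()
	defer q.mu.Unlock()
	if len(q.conns) == 0 {
		return nil
	}
	if !q.sorted {
		sort.Slice(q.conns, func(i, j int) bool {
			a, b := q.conns[i], q.conns[j]
			if a.refs != b.refs {
				return a.refs < b.refs
			}
			return a.lastAccessTime.Before(b.lastAccessTime)
		})
		q.sorted = true
	}
	return q.conns[0]
}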

ajm188 commented Jun 23, 2021

I left a couple comments on that commit! Basically I think there's one place where we're risking deadlock (so I'm guessing your benchmark happened to not hit that case), and that there are a few places where we're adjusting refs and lastAccessTime on a conn, but not marking the eviction queue as needing a resort.

There's both a performance and correctness concern with that last point: correctness because we may end up evicting the wrong connection and get the cache into a weird state, and performance because there are places where we should be paying the O(n log n) cost of resorting, but we're not. That might also explain the perf difference between your patch and this.

I'm also curious what pool size you're testing with; I was able to run this with a pool size of 1000 and got the following stats after a few rounds of GetWorkflows calls (which make 4 VReplication RPCs to a tablet in each shard of a keyspace):

{
  "tabletmanagerclient_cachedconn_dial_timeouts": 0,
  "tabletmanagerclient_cachedconn_dialtimings": {
    "TotalCount": 17156,
    "TotalTime": 68510690741,
    "Histograms": {
      "rlock_fast": {
        "500000": 14065,
        "1000000": 232,
        "5000000": 745,
        "10000000": 442,
        "50000000": 854,
        "100000000": 0,
        "500000000": 0,
        "1000000000": 0,
        "5000000000": 0,
        "10000000000": 0,
        "inf": 0,
        "Count": 16338,
        "Time": 20583087505
      },
      "sema_fast": {
        "500000": 31,
        "1000000": 4,
        "5000000": 55,
        "10000000": 45,
        "50000000": 282,
        "100000000": 82,
        "500000000": 1,
        "1000000000": 0,
        "5000000000": 0,
        "10000000000": 0,
        "inf": 0,
        "Count": 500,
        "Time": 16451257094
      },
      "sema_poll": {
        "500000": 5,
        "1000000": 0,
        "5000000": 0,
        "10000000": 0,
        "50000000": 0,
        "100000000": 178,
        "500000000": 135,
        "1000000000": 0,
        "5000000000": 0,
        "10000000000": 0,
        "inf": 0,
        "Count": 318,
        "Time": 31476346142
      }
    }
  },
  "tabletmanagerclient_cachedconn_new": 423,
  "tabletmanagerclient_cachedconn_reuse": 16733
}

vmg commented Jun 23, 2021

Oh yeah, those two comments are on point. Thanks for double checking. I can't push to this branch to fix them though! I'll push a new commit tomorrow morning with proper fixes and benchmark again; I hope fixing the correctness won't degrade performance.

vmg commented Jun 24, 2021

@ajm188 I've rebased my patch and fixed your suggestions: 7bc81d8

I think we're now always re-sorting properly and I don't think I can find any actual performance regressions. Would you mind reviewing it again? I would very much prefer to see simplified locking on this connection cache!

Comment on lines 105 to 270
if os.Getenv("VT_PPROF_TEST") != "" {
file, err := os.Create(fmt.Sprintf("%s.profile.out", t.Name()))
require.NoError(t, err)
defer file.Close()
if err := pprof.StartCPUProfile(file); err != nil {
t.Errorf("failed to start cpu profile: %v", err)
return
}
defer pprof.StopCPUProfile()
}
Collaborator

By the way, you can get this exact same behavior by simply passing the -cpuprofile flag to go test 👍
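
For example, something along the lines of the following (benchmark name inferred from the output above, package path assumed) writes the same kind of profile:

go test -bench=CachedConnClient -cpuprofile=cpu.out ./go/vt/vttablet/grpctmclient/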

ajm188 commented Jun 25, 2021

Unfortunately I didn't have time to look at this, and I'm going to be out on vacation until Tuesday, but I will try this out in some testing environments when I'm back!!

vmg commented Jun 25, 2021

Sounds great. Have a good time!

@vmg vmg mentioned this pull request Jun 28, 2021
@@ -88,10 +119,11 @@ func (client *Client) dial(tablet *topodatapb.Tablet) (*grpc.ClientConn, tabletm
if err != nil {
Collaborator

Now that you've properly wired up a context.Context to this API, we can change the grpcclient.Dial call here to grpcclient.DialContext, which is going to fix an annoying issue, as seen here: #8387 (comment)

Can you update this PR accordingly? That'll unblock the other PR. :)

Contributor Author

Happy to swap to DialContext, but per my comment on the other PR there's more work needed to remove that hack

ajm188 commented Jun 29, 2021

Alright, back! I'm going to be testing this out todayyyy

ajm188 commented Jun 29, 2021

Okay!! Very happy to report that this works just as well in practice. I was somewhat concerned that dialing outside of the lock and throwing the connection away would result in too much connection churn, which is the very thing I'm trying to reduce, but I can't make that happen in any meaningful way in real workloads.

I then went ahead and realized that if we evict from the front (it works the same in reverse, it's just easier for me to think about this way 🤷), when we call newdial we know that conn has refs > 0 and therefore would never be sorted to the front of the evict queue, which means that we can just append it to the back and not set the evictSorted=false bit, which will allow us to evict multiple conns in succession without needing to mark the queue as needing a resort. This works about the same in the benchmarks, which I doubt meaningfully capture this special case optimization:

benchstat old.txt new.txt
CachedConnClientSteadyStateRedials-8    56.0ms ± 1%  56.2ms ± 1%   ~     (p=0.400 n=3+3)
CachedConnClientSteadyStateEvictions-8  19.5ms ± 4%  19.4ms ± 3%   ~     (p=0.700 n=3+3)
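
Extending the eviction-slice sketch from the earlier comment (assumed names), the append-to-back path looks roughly like:

// appendHeld is the newdial-style insertion: the freshly dialed conn is
// already held (refs > 0), so it can never beat the refs==0 conns that
// sort to the front of the eviction order. Appending it to the back
// therefore doesn't change which conn gets evicted next, so we can
// leave the sorted flag alone and evict several conns in a row without
// a re-sort.
func (q *evictionQueue) appendHeld(c *cachedConn) {
	q.mu.Lock()
	defer q.mu.Unlock()
	q.conns = append(q.conns, c)
	// Deliberately not clearing q.sorted here.
}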

Looking forward to your thoughts, I'm thinking tomorrow we can polish this up and get it merged!!

}
}

// tryFromCache tries to get a connection from the cache, performing a redial
Contributor Author

I'll clean up all these comments to reflect the actual state of the code tomorrow

Comment on lines 303 to 316
// As a result, it is not safe to reuse a cachedConnDialer after calling Close,
// and you should instead obtain a new one by calling either
// tmclient.TabletManagerClient() with
// TabletManagerProtocol set to "grpc-cached", or by calling
// grpctmclient.NewCachedConnClient directly.
Contributor Author

Similarly, this comment is flat-out not true in the simpler implementation, so I'll clean that up as well

Contributor Author

Ah, I lied, this is partially true: since the closer func still locks dialer.m, it will still contend with actual dials if you reuse a cachedConnDialer after Close, but it won't actually mess with the state of the queue. So it's "safe" but not ideal.

Comment on lines 107 to 71
// b.Cleanup(shutdown)
defer shutdown()

Contributor Author

Reminder for me to make a final pass to clean up these tests as well.

vmg commented Jun 30, 2021

I then went ahead and realized that if we evict from the front (it works the same in reverse, it's just easier for me to think about this way shrug), when we call newdial we know that conn has refs > 0 and therefore would never be sorted to the front of the evict queue, which means that we can just append it to the back and not set the evictSorted=false bit, which will allow us to evict multiple conns in succession without needing to mark the queue as needing a resort.

This is great! Code looks ready to me. Let's fix the last few outdated comments and merge.

ajm188 added 3 commits June 30, 2021 14:42
- Move grpctmserver tests to a testpackage to allow grpctmclient tests to
import grpctmserver (this prevents an import cycle)
- Change fakeRPCTM to take a testing.TB which both benchmark and test
structs satisfy

Signed-off-by: Andrew Mason <[email protected]>

ajm188 commented Jun 30, 2021

Okay! I've finished cleaning up; 1151c6d and b5af9f6 contain those changes if you want to view them in isolation. I'm going to clean up the commit history, because there are a bunch of commits in this branch related to an older implementation that I threw away, so there's no reason to keep them around. (I am keeping a branch locally pointed at the pre-rebase version in case we want to do anything differently.)

ajm188 added 2 commits June 30, 2021 14:46
Two bugfixes:

- one to prevent leaking open connections (by checking the map again
after getting the write lock)
- one to prevent leaving around closed connections, resulting in errors
on later uses (by deleting the freed conn's addr and not the addr we're
attempting to dial)

Refactor to not duplicate dialing/queue management

I don't want mistaken copy-paste to result in a bug between sections
that should be otherwise identical

Add one more missing "another goroutine dialed" check

Add some stats to the cached conn dialer

Remove everything from the old, slower cache implementation, pqueue is the way to go

lots and lots and lots and lots of comments

Signed-off-by: Andrew Mason <[email protected]>
…recation easier

Refactor sections of the main `dial` method to allow all unlocks to be deferrals

Add a test case to exercise evictions and full caches

Refactor heap operations to make timing them easier

Signed-off-by: Andrew Mason <[email protected]>
ajm188 and others added 8 commits June 30, 2021 14:47
Signed-off-by: Andrew Mason <[email protected]>
By putting evictable conns at the front, we know that when we append to
the end we don't need to resort, so multiple successive evictions can be
faster

Signed-off-by: Andrew Mason <[email protected]>
Signed-off-by: Andrew Mason <[email protected]>
Signed-off-by: Andrew Mason <[email protected]>
Signed-off-by: Andrew Mason <[email protected]>
@ajm188 ajm188 force-pushed the am_cached_grpctmclient_conns branch from b5af9f6 to b5908a8 on June 30, 2021 18:48
@vmg vmg left a comment

Looking great 👍

@ajm188 ajm188 merged commit a8679df into vitessio:main Jul 6, 2021
ajm188 added a commit to tinyspeck/vitess that referenced this pull request Jul 6, 2021
…t_conns

[grpctmclient] Add support for (bounded) per-host connection reuse

Signed-off-by: Andrew Mason <[email protected]>
ajm188 added a commit to tinyspeck/vitess that referenced this pull request Aug 25, 2021
…t_conns

[grpctmclient] Add support for (bounded) per-host connection reuse

Signed-off-by: Andrew Mason <[email protected]>
vmogilev pushed a commit to tinyspeck/vitess that referenced this pull request Jun 6, 2022
…t_conns

[grpctmclient] Add support for (bounded) per-host connection reuse