Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

core: experiment with a separate GRPC transport for node liveness #37195

Closed
jordanlewis opened this issue Apr 29, 2019 · 8 comments
Closed

core: experiment with a separate GRPC transport for node liveness #37195

jordanlewis opened this issue Apr 29, 2019 · 8 comments
Assignees
Labels
C-investigation Further steps needed to qualify. C-label will change.

Comments

@jordanlewis
Copy link
Member

... to determine whether GRPC/network has to do with unexpected node liveness fails.

@jordanlewis jordanlewis changed the title core: experiment with a separate GRPC stream for node liveness core: experiment with a separate GRPC transport for node liveness Apr 29, 2019
@andreimatei andreimatei added the C-investigation Further steps needed to qualify. C-label will change. label May 7, 2019
@ajwerner
Copy link
Contributor

Took me a while to get around to this. Preliminary data shows this to be extremely fruitful. On a given workload it's the difference between completely reliable node liveness failures and zero node liveness failures.

That being said, I did notice a much higher rate of failures of query failures which I can't totally explain. Some of those failures are the usual no inbound stream but many are due to failures dial which are generally accompanied by

ajwerner-test-0003> I190722 20:28:58.026165 15943 rpc/nodedialer/nodedialer.go:139  [n3,client=10.142.15.243:58962,user=root,txn=3202a43b] unable to connect to n2: failed to check for ready connection to n2 at ajwerner-test-0002:26257: connection not ready: TRANSIENT_FAILURE

which then trips the breaker.

From here I experimented with increasing the client and server keep-alives in gRPC, which anecdotally did help. I was a little bit frustrated by this situation and realized we hadn't updated gRPC in a while so I upgraded gRPC to 1.21 (the previous release and most recent with a point release, it seems like there may be a bug in the current release grpc/grpc-go#2663 (comment) for the potential issue and commentary on the bug).

After performing the upgrade and running the same experiment with and without the separate connections I no longer see any connections actually break. I observe node liveness failures only on the version without the separate connection.

The code with the separate connection as written is pretty gnarly but I think this is good enough evidence to move forward with cleaning it up. I'll put up a PR to upgrade gRPC first.

@bdarnell
Copy link
Contributor

HTTP/2 has a prioritization feature that is not exposed via GRPC (grpc/grpc-go#1448). I wonder if prioritization would get us the same benefit of a separate connection, and if it would be worth building this into GRPC.

We might also be able to tune the grpc window size parameters. I think our connection window size is low relative to our per-stream window size, increasing the risk that this high-priority traffic gets starved out by low-priority bulk traffic. See also #35161 (cc @asubiotto)

@ajwerner
Copy link
Contributor

I wonder if prioritization would get us the same benefit of a separate connection, and if it would be worth building this into GRPC.

It's certainly possible. Doing it properly feels like a bigger reach for this release than adding a second rpc.Context.

I think our connection window size is low relative to our per-stream window size, increasing the risk that this high-priority traffic gets starved out by low-priority bulk traffic.

I can play with this and see what I see.

@bdarnell
Copy link
Contributor

Yeah, doing prioritization properly in GRPC is definitely a bigger task, but we could at least be a voice of support on that 2-year-old issue.

These liveness connections will also behave a bit differently than our others. We tend towards a fully-connected graph because any node may have to talk to any other, but the liveness connections will only be active to a single node at a time (the liveness range leaseholder). We should make sure that these connections get shut down after a period of idleness so we don't double our connection count.

@ajwerner
Copy link
Contributor

We should make sure that these connections get shut down after a period of idleness so we don't double our connection count.

Good point.

There's also the question of the remote clock tracking. One thing I notice on the primary connection is that the RTT can get artificially gigantic.

ajwerner-test-0003> W190722 22:23:34.590736 680 server/node.go:814  [n3,summaries] health alerts detected: {Alerts:[{StoreID:0 Category:METRICS Description:round-trip-latency-p90 Value:1.006632959e+09}]}       
ajwerner-test-0002> W190722 22:23:34.794345 802 server/node.go:814  [n2,summaries] health alerts detected: {Alerts:[{StoreID:0 Category:METRICS Description:round-trip-latency-p90 Value:2.952790015e+09}]}       
ajwerner-test-0001> W190722 22:23:34.822180 916 server/node.go:814  [n1,summaries] health alerts detected: {Alerts:[{StoreID:0 Category:METRICS Description:round-trip-latency-p90 Value:1.275068415e+09}]} 

@asubiotto
Copy link
Contributor

@ajwerner re the grpc upgrade, just a heads up that client-side keepalive minimums have been bumped up to 10s (grpc/grpc-go#2642)

@ajwerner
Copy link
Contributor

@ajwerner re the grpc upgrade, just a heads up that client-side keepalive minimums have been bumped up to 10s (grpc/grpc-go#2642)

Indeed #39041 (comment)

If only that were the only problem.

@ajwerner
Copy link
Contributor

Closed by #39172.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-investigation Further steps needed to qualify. C-label will change.
Projects
None yet
Development

No branches or pull requests

5 participants