
Switch default load balancer policy from "pick_first" to "round_robin" #1318

Closed
jcferretti opened this issue Feb 15, 2024 · 0 comments · Fixed by #1319

Comments

@jcferretti
Contributor

jcferretti commented Feb 15, 2024

Is your feature request related to a problem? Please describe.

The (default) "pick_first" load balancer policy is "sticky": it tries to send RPCs to the same endpoint as long as that endpoint is connected.

https://github.com/grpc/grpc/blob/master/doc/load-balancing.md#pick_first.

"pick_first" has some desirable properties when things are working, eg, you can pick an order for the endpoints where you privilege a "closer" host, say if the client happens to be running in a machine that also runs one of the etcd servers. However, "pick_first" is problematic in some failure scenarios. Consider a client that is talking to an etcd server that is (becomes) partitioned from the master, and makes an etcd request that requires a master. Eg, a write (put), linearized read (default get without the serialized option set) or watcher with the "required leader" option. In this scenario the RPC will fail with gRPC Status UNAVAILABLE. The issue is, with "pick_first" any retries will be routed to the same etcd server, which most likely is still partitioned. A better strategy is to use "round_robin", which will try to use the next endpoint for the retry. The etcd go client already uses "round_robin".

https://github.com/etcd-io/etcd/blob/840d4869234a94e7ec7b669cc7e9bcb79606bab2/client/v3/internal/resolver/resolver.go#L44
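To make the failure mode concrete, here is a minimal grpc-java sketch (not jetcd code; the `Supplier` wrapper is just an illustration device) of a naive retry loop on UNAVAILABLE. Note that the loop itself does not choose an endpoint: which server each retry reaches is decided entirely by the channel's load balancing policy, so with "pick_first" every attempt goes back to the same, possibly partitioned, server, while with "round_robin" the next attempt can use another already-connected endpoint.

```java
import java.util.function.Supplier;
import io.grpc.Status;
import io.grpc.StatusRuntimeException;

class UnavailableRetry {
    // Naive illustration: retry a blocking RPC a few times when it fails with UNAVAILABLE.
    // Which endpoint each retry reaches is decided by the channel's load balancing policy,
    // not by this loop.
    static <T> T callWithRetry(Supplier<T> rpc, int maxAttempts) {
        StatusRuntimeException last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                return rpc.get();
            } catch (StatusRuntimeException e) {
                if (Status.fromThrowable(e).getCode() != Status.Code.UNAVAILABLE) {
                    throw e; // other status codes are not helped by picking another endpoint
                }
                // With "pick_first" the next attempt hits the same server;
                // with "round_robin" it can land on another connected endpoint.
                last = e;
            }
        }
        throw last;
    }
}
```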

Some of the rationale is described in the documentation for the client balancer. Note that this documentation is written in a way that suggests the etcd client uses a custom load balancer; that doesn't look true to me from the current etcd go client sources: I believe the etcd go client uses the stock "round_robin" gRPC-go load balancing policy.

https://etcd.io/docs/v3.6/learning/design-client/#clientv3-grpc123-balancer-overview

Describe the solution you'd like
Use "round_robin" as the default load balancing policy.

Describe alternatives you've considered
A custom load balancer that is "sticky" ("stable" may be a better word), sending to a single working endpoint like "pick_first", but switching to another already-connected endpoint when an UNAVAILABLE error occurs, would be desirable. However, such a custom load balancer would be additional code to write and maintain, code that interfaces with gRPC APIs that change frequently between versions: many of the features around name resolvers and load balancing in gRPC are still marked experimental. Now is probably not a good time to try to write custom load balancers.

Additional context
Another characteristic of "pick_first" is that when a subchannel (TCP connection) fails, the other subchannels are not already connected, so failing over implies making a new connection and takes longer than with "round_robin" (*). In some circumstances, e.g., trying to reach an IP for a machine or Kubernetes pod that is down, a new connection attempt can take a long time to fail. If the machine is up but the port is not open, the kernel on the server machine rejects the SYN and the TCP connection attempt fails immediately; if the machine is not up, however, TCP SYN retries go on for about 2 minutes (with the Linux defaults) before the connection attempt is failed. In this scenario, "round_robin" is also better in that it does its best to keep connections established to all alternative endpoints.

See https://www.evanjones.ca/tcp-connection-timeouts.html, under the heading "Connecting to a failed process/machine"
(warning: some of the descriptions on that page of gRPC behaviors are out of date and/or inaccurate, because some of those behaviors depend on configuration parameters, e.g., waitForReady, automatic gRPC retries, etc.; the description of TCP SYN retries on that page is relevant, however).

(*) This is actually slightly worse than it sounds at first: with "pick_first", if a subchannel connection drops while no RPC is being attempted, the channel is just marked IDLE; on the next RPC attempt gRPC will (1) first try to re-connect the IDLE channel, and (2) only if that fails, try the next address. All of this eats into the RPC deadline time budget, if one is set.
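Independently of the policy choice, the impact of a slow reconnect can be bounded by always putting a deadline on leader-requiring RPCs, so a connection attempt against an unreachable address cannot consume the full SYN-retry window. A hedged grpc-java sketch (the 2-second value is an arbitrary example, not a recommendation):

```java
import java.util.concurrent.TimeUnit;
import io.grpc.stub.AbstractStub;

class Deadlines {
    // Sketch: bound each attempt so that a reconnect against a dead address
    // (which can otherwise block for the ~2 minute Linux SYN-retry window)
    // fails fast and the caller can retry, ideally on another endpoint
    // under "round_robin".
    static <S extends AbstractStub<S>> S withShortDeadline(S stub) {
        return stub.withDeadlineAfter(2, TimeUnit.SECONDS); // example value only
    }
}
```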
