
Remove the dependency on grpc-go's experimental API (e.g. resolver & balancer) #15145

Open · ahrtr opened this issue Jan 18, 2023 · 30 comments

Labels: help wanted, priority/important-soon, stage/tracked, type/feature

ahrtr (Member) commented Jan 18, 2023

What would you like to be added?

etcd depends on some experimental APIs of grpc-go, such as resolver. If these APIs change in the future, they might break etcd. So we should try to remove the dependency on any experimental APIs and use stable APIs instead.

References:

Why is this needed?

Improve stability

dfawley commented Jan 18, 2023

grpc-go TL here. I'm happy to work with you to find / provide a stable solution, but I would first need to know more about your use cases & requirements.

ahrtr added the priority/important-soon label on Feb 1, 2023
tjungblu (Contributor) commented Feb 2, 2023

just to randomly add what I've found in #14501 - once we reconsider the balancer/resolver, we should also add some circuit breaking. There were some ideas by Easwar in grpc/grpc-go#5672.

ahrtr (Member, Author) commented Feb 20, 2023

Sorry for the late response.

The use cases & requirements seem to be simple. Users may specify multiple endpoints on the client side (see the example below), and etcd clientv3 needs to support client-side load balancing.

$ ./etcdctl --endpoints=http://127.0.0.1:2379,http://127.0.0.1:22379,http://127.0.0.1:32379 get k1 

Currently the target used when calling grpc.DialContext is a string in the format etcd-endpoints://<pointer>/<authority>, e.g. etcd-endpoints://0xc00036e1e0/127.0.0.1:2379. etcd implements a customized resolver, which is passed to gRPC via grpc.WithResolvers. The loadBalancingPolicy is round_robin.
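For illustration, a minimal sketch (not etcd's actual code; the package name, scheme and target are placeholders) of dialing with a caller-supplied resolver.Builder passed via grpc.WithResolvers, with round_robin selected through the default service config:

package clientsketch

import (
	"context"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	"google.golang.org/grpc/resolver"
)

// dial wires a custom resolver.Builder into the ClientConn and asks gRPC to
// balance RPCs across the resolved addresses with round_robin. Insecure
// credentials are used only to keep the sketch short.
func dial(ctx context.Context, b resolver.Builder) (*grpc.ClientConn, error) {
	return grpc.DialContext(ctx,
		"example:///cluster-1", // the scheme must match b.Scheme()
		grpc.WithResolvers(b),
		grpc.WithDefaultServiceConfig(`{"loadBalancingPolicy":"round_robin"}`),
		grpc.WithTransportCredentials(insecure.NewCredentials()),
	)
}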

Some historical background:

  1. etcd 3.4 implemented a custom load balancer; it was replaced with gRPC's resolver and load-balancing support in 3.5.0 ("clientv3: Replace balancer with upstream grpc solution", #12671).
  2. Unfortunately #12671 introduced a critical issue caused by an invalid :authority header ("etcd client v3 v3.5.0 sends invalid :authority header field value", #13192), which was fixed in "Fix http2 authority header in single endpoint scenario" (#13359).

Based on https://github.com/grpc/grpc/blob/master/doc/naming.md, it seems the gRPC resolver plugin (although the API is still experimental) is still the right direction? If not, could you please advise the correct approach? cc @dfawley

ahrtr (Member, Author) commented Mar 12, 2023

@dfawley could you advise how to replace the experimental API mentioned above?

ahrtr (Member, Author) commented Mar 13, 2023

One more dependency on an experimental API: grpcServer.ServeHTTP
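For context, a hedged sketch of the pattern in which that API is typically used, serving gRPC and plain HTTP on the same listener; this is not etcd's actual wiring and the handler name is made up:

package servesketch

import (
	"net/http"
	"strings"

	"google.golang.org/grpc"
)

// mixedHandler dispatches HTTP/2 requests whose Content-Type is
// application/grpc to the gRPC server via the experimental ServeHTTP method,
// and everything else to a regular HTTP handler.
func mixedHandler(grpcServer *grpc.Server, httpHandler http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.ProtoMajor == 2 && strings.HasPrefix(r.Header.Get("Content-Type"), "application/grpc") {
			grpcServer.ServeHTTP(w, r) // the experimental API in question
			return
		}
		httpHandler.ServeHTTP(w, r)
	})
}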

dfawley commented Mar 30, 2023

Also sorry for the late reply.

Based on grpc/grpc@master/doc/naming.md, it seems the gRPC resolver plugin (although the API is still experimental) is still the right direction?

For this you could instead create N gRPC clients, one for each address, and do round-robin dispatch to them by wrapping them in another object that implements its own grpc.ClientConnInterface and holds a pool of the created grpc.ClientConns. This is what Google's cloud client libraries do to spread load across multiple connections.

What would be your concerns with such an approach?
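For illustration, a minimal sketch of the approach described above (the type name roundRobinPool is made up; this is not a grpc-go or google-cloud-go API):

package poolsketch

import (
	"context"
	"sync/atomic"

	"google.golang.org/grpc"
)

// roundRobinPool fans RPCs out over a fixed, non-empty set of ClientConns.
type roundRobinPool struct {
	conns []*grpc.ClientConn
	next  uint64
}

func (p *roundRobinPool) pick() *grpc.ClientConn {
	n := atomic.AddUint64(&p.next, 1)
	return p.conns[n%uint64(len(p.conns))]
}

// Invoke and NewStream implement grpc.ClientConnInterface, so generated
// clients that accept the interface can use the pool transparently.
func (p *roundRobinPool) Invoke(ctx context.Context, method string, args, reply any, opts ...grpc.CallOption) error {
	return p.pick().Invoke(ctx, method, args, reply, opts...)
}

func (p *roundRobinPool) NewStream(ctx context.Context, desc *grpc.StreamDesc, method string, opts ...grpc.CallOption) (grpc.ClientStream, error) {
	return p.pick().NewStream(ctx, desc, method, opts...)
}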

ahrtr (Member, Author) commented Mar 31, 2023

What would be your concerns with such an approach?

Thanks for the response. The only concern is which project (etcd or gRPC?) is the best place to implement it. It looks like a generic client-side load-balancer solution; does it make more sense to have it live in grpc/grpc-go instead of etcd?

Also, some APIs (such as resolver.Resolver, balancer.Balancer, etc.) were added as experimental APIs about 6 years ago, and they are still experimental today. Have you considered graduating them to stable APIs, or deprecating/removing them and providing alternative solutions/APIs?

cc @mitake @ptabor @serathius @spzala @liggitt @wojtek-t @lavalamp for opinions.

dfawley commented Mar 31, 2023

I was just digging around and saw that C-core has the following name resolver (https://github.com/grpc/grpc/blob/master/doc/naming.md#name-syntax):

  • ipv4:address[:port][,address[:port],...] -- IPv4 addresses
    • Can specify multiple comma-delimited addresses of the form address[:port]:
      • address is the IPv4 address to use.
      • port is the port to use. If not specified, 443 is used.

We could also explore making this a first-class name resolver. Would that take care of all of your need for resolver and balancer APIs?

ahrtr (Member, Author) commented Mar 31, 2023

We could also explore making this a first-class name resolver. Would that take care of all of your need for resolver and balancer APIs?

My immediate feeling is it looks good. Could you please provide a rough draft API and usage, so that I can evaluate the change & effort on the etcd side?

ptabor (Contributor) commented Apr 1, 2023

One aspect to be aware of is also support for DNS resolution:

I think that currently such a connection works:

./etcdctl --endpoints=http://etcd0.domain:2379,http://etcd1.domain:2379,http://etcd2.domain:2379 get k1 

And in case of problems the DNS name gets re-resolved to an IP.

Falling back to ipv4:address[:port][,address[:port],...] sounds like doing the DNS binding only once. In some dynamically scheduled environments the etcd instance can be rescheduled and the name rebound. That would be a use case we are explicitly dropping.

An alternative is to build a local xDS policy supplier into the etcd client.

dfawley commented Apr 6, 2023

Yeah I was digging a little deeper and it seems you need more than multiple IP address support; you also need DNS names and multiple unix sockets, and mixed versions of all these things.

We have one project on the horizon that is going to require changes to the resolver API, and then I think we will be able to stabilize it. Unfortunately, that will first require a breaking change, which is unavoidable no matter what. In the meantime, I think the only real option would be to maintain a pool of channels, which I agree isn't trivial but also isn't something we probably would put in the main grpc-go repo, and we don't have time to implement for now. You can find another minimal implementation here: https://github.com/googleapis/google-api-go-client/blob/3f4d679d4dae5de7ce124a7db68ef5147c9a3fa2/transport/grpc/pool.go. Ideally it would keep track of the state of the ClientConns and skip the non-READY ones, but that may not be a strict requirement for you either.
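A rough sketch of the "skip the non-READY ones" idea on top of such a pool (the helper is hypothetical, and note that ClientConn.GetState is itself marked experimental):

package poolsketch

import (
	"sync/atomic"

	"google.golang.org/grpc"
	"google.golang.org/grpc/connectivity"
)

// pickReady prefers a connection whose state is READY and falls back to plain
// round robin when none currently are.
func pickReady(conns []*grpc.ClientConn, next *uint64) *grpc.ClientConn {
	start := atomic.AddUint64(next, 1)
	for i := uint64(0); i < uint64(len(conns)); i++ {
		cc := conns[(start+i)%uint64(len(conns))]
		if cc.GetState() == connectivity.Ready {
			return cc
		}
	}
	return conns[start%uint64(len(conns))]
}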

ahrtr (Member, Author) commented Apr 13, 2023

Thanks @dfawley for the feedback.

We have one project on the horizon that is going to require changes to the resolver API, and then I think we will be able to stabilize it.

Could you provide a rough timeline on when the resolver API can be stabilized, and in which grpc-go version?

I think the only real option would be to maintain a pool of channels, which I agree isn't trivial but also isn't something we probably would put in the main grpc-go repo, and we don't have time to implement for now.

I think we need to do some investigation & a PoC first. I do not have time to dig into it for now, so I labelled this issue with "help wanted".

chaochn47 (Member) commented Jun 7, 2023

Hi @ahrtr, I am interested in this and will provide a minimal working example/PoC soon. After that, I think we can go through any necessary processes like design, collecting must-have / nice-to-have requirements, test verification as success criteria before the implementation, production readiness checks, etc. before the actual migration.

For example:

Must have

  • Supports Unix, DNS, IPv4 & IPv6 endpoints with a round-robin-style load-balancing client; initially insecure.
  • grpc-go supports a DNS resolver (by default), a unix socket resolver (only for testing purposes), and accepts IPv4 and IPv6 targets with the passthrough resolver.
  • Keeps track of the state of the ClientConns and skips the non-READY ones.
  • E2E tests for the above scenarios.
  • Supports secure connections with TLS.
  • The new etcd client should have at least the same performance as the old one, measured by the etcdctl check perf command, etc.
  • Existing e2e test cases like TestAuthority won't fail unexpectedly.
  • Adds new test cases that the current test suite missed.

Nice to have

ahrtr (Member, Author) commented Jun 7, 2023

Thanks @chaochn47 . Assigned to you.

chaochn47 (Member) commented Jun 9, 2023

2023-06-08 update:

I took a deeper look today and found out that the generated gRPC API is outdated, most likely because of #14533:

func NewKVClient(cc *grpc.ClientConn) KVClient {
	return &kVClient{cc}
}

func (c *kVClient) Range(ctx context.Context, in *RangeRequest, opts ...grpc.CallOption) (*RangeResponse, error) {
	out := new(RangeResponse)
	err := c.cc.Invoke(ctx, "/etcdserverpb.KV/Range", in, out, opts...)
	if err != nil {
		return nil, err
	}
	return out, nil
}
NewKVClient(cc *grpc.ClientConn) could have been written as NewKVClient(cc grpc.ClientConnInterface), so that the proposed ConnPool interface and its implementation roundRobinConnPool can be passed into it for customization.

The latter NewKVClient(cc grpc.ClientConnInterface) form is widely used in the google-cloud-go repository, for example in the spanner client.

So I can start looking into rpc.proto and use the latest version of the protoc Go plugin to re-generate rpc.pb.go.
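For comparison, a sketch of the shape newer protoc-gen-go-grpc output takes (illustrative only, not the exact regenerated file; imports and message types are as in the snippet above): the constructor accepts the grpc.ClientConnInterface abstraction, so a pooled implementation can be injected.

func NewKVClient(cc grpc.ClientConnInterface) KVClient {
	return &kVClient{cc}
}

type kVClient struct {
	cc grpc.ClientConnInterface
}

func (c *kVClient) Range(ctx context.Context, in *RangeRequest, opts ...grpc.CallOption) (*RangeResponse, error) {
	out := new(RangeResponse)
	// Invoke is satisfied by *grpc.ClientConn and by any custom pool that
	// implements grpc.ClientConnInterface.
	if err := c.cc.Invoke(ctx, "/etcdserverpb.KV/Range", in, out, opts...); err != nil {
		return nil, err
	}
	return out, nil
}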

chaochn47 (Member) commented Jul 27, 2023

Just discovered that grpccredentials.Bundle is another gRPC experimental API. However, I don't see why we need this API. etcd does not support mode switching from transport security to per-RPC credentials (auth), or back, from my understanding. It should be an easy cleanup.

dfawley commented Jul 28, 2023

Could you provide a rough timeline on when the resolver API can be stabilized, and in which grpc-go version?

grpc/grpc-go#6472

The breaking changes are upcoming (within ~6 months?), but I'm pretty sure we can find a way to enable a migration path that doesn't break suddenly in a single version, and provide a couple releases of overlap. Full details haven't been planned, but some changes are already underway (pending PRs: grpc/grpc-go#6471 and grpc/grpc-go#6481)

ahrtr (Member, Author) commented Aug 2, 2023

I took a little deeper look today and found out that the generated gRPC API is outdated most likely because of #14533

We need to bump both protobuf and the grpc Go plugin eventually. But is this task blocked by it?

EDIT: I just noticed your following comment (pasted below). Indeed, the outdated grpc Go plugin has some impact on this task. But it isn't a big problem; we can also change grpc.ClientConnInterface to grpc.ClientConn for now to work around it.

NewKVClient(cc grpc.ClientConn) could have been written with NewKVClient(cc grpc.ClientConnInterface) so that the proposed ConnPool interface and its implementation roundRobinConnPool can be passed into it for customization.

ahrtr (Member, Author) commented Aug 2, 2023

grpc/grpc-go#6472

The breaking changes are upcoming (within ~6 months?), but I'm pretty sure we can find a way to enable a migration path that doesn't break suddenly in a single version, and provide a couple releases of overlap. Full details haven't been planned, but some changes are already underway (pending PRs: grpc/grpc-go#6471 and grpc/grpc-go#6481)

Thanks @dfawley for the feedback. It seems that the already planned items in grpc/grpc-go#6472 will not impact etcd at all. But as you mentioned, "Full details haven't been planned", so in theory there are still some potential risks.

But etcd's use of grpc-go's resolver package is really very simple and basic; the Go source file has just dozens of lines of code. Please also see #15145 (comment).

Each time there is any change (e.g. adding or removing a server, or updating a server's address) to the backend servers, the etcd client constructs a new resolver.State object containing the latest list of server addresses, and then calls (*Resolver) UpdateState. Do you still think grpc/grpc-go#6472 might break it in the future?
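A sketch of that update path (the resolver type and method names here are illustrative, not etcd's actual implementation):

package resolversketch

import "google.golang.org/grpc/resolver"

// endpointResolver is a stand-in for etcd's customized resolver; cc is the
// resolver.ClientConn handed to the builder's Build method by gRPC.
type endpointResolver struct {
	cc resolver.ClientConn
}

// updateEndpoints pushes the latest endpoint list to gRPC as a fresh
// resolver.State, which is what lets the balancer pick up the new addresses.
func (r *endpointResolver) updateEndpoints(endpoints []string) error {
	addrs := make([]resolver.Address, 0, len(endpoints))
	for _, ep := range endpoints {
		addrs = append(addrs, resolver.Address{Addr: ep})
	}
	return r.cc.UpdateState(resolver.State{Addresses: addrs})
}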

You can find another minimal implementation here: https://github.com/googleapis/google-api-go-client/blob/3f4d679d4dae5de7ce124a7db68ef5147c9a3fa2/transport/grpc/pool.go. Ideally it would keep track of the state of the ClientConns and skip the non-READY ones, but that may not be a strict requirement for you either.

It's a non-trivial effort. If we follow this approach, then we might want to create a separate small project, something like github.com/etcd-io/grpc-client-load-balancer or github.com/etcd-io/go-client-load-balancer, so as not to complicate or pollute the etcd client's implementation. But as I mentioned in #15145 (comment), I really think it's unfortunate if we have to spend too much effort to do it ourselves.

Also reference:

ahrtr (Member, Author) commented Aug 2, 2023

Just discovered that grpccredentials.Bundle is another gRPC experimental API. However, I don't see why we need this API. etcd does not support mode switching from transport security to per-RPC credentials (auth), or back, from my understanding. It should be an easy cleanup.

Good catch. Just raised a PR to remove it. #16358

chaochn47 (Member) commented Aug 2, 2023

Agreed with @ahrtr's point. I think the essential etcd usage of this experimental resolver API is built on top of a manual resolver. It's quite simple; gRPC-go should be able to support it, in my humble opinion (even though it is marked as experimental, indicating it's subject to change in the future).

There might be other experimental usages like WithResolvers(), but etcd should also be able to migrate to registering the resolver builder with resolver.Register, just like the default DNS resolver does.
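A minimal sketch of that alternative, global registration as the built-in DNS resolver does (the builder type here is hypothetical and its Build is stubbed out):

package resolversketch

import "google.golang.org/grpc/resolver"

type etcdEndpointsBuilder struct{}

func (*etcdEndpointsBuilder) Scheme() string { return "etcd-endpoints" }

// Build would normally start watching the endpoint list and push updates via
// cc.UpdateState; it does nothing useful in this sketch.
func (*etcdEndpointsBuilder) Build(target resolver.Target, cc resolver.ClientConn, opts resolver.BuildOptions) (resolver.Resolver, error) {
	return nopResolver{}, nil
}

type nopResolver struct{}

func (nopResolver) ResolveNow(resolver.ResolveNowOptions) {}
func (nopResolver) Close()                                {}

func init() {
	// After global registration, targets using the "etcd-endpoints" scheme
	// resolve without passing grpc.WithResolvers on every dial.
	resolver.Register(&etcdEndpointsBuilder{})
}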

chaochn47 (Member) commented Aug 2, 2023

We need to bump both protobuf and grpc go plugin eventually. But is this task blocked by it?

No. I am just hesitant about going down the route proposed in #15145 (comment): as I am implementing it, it essentially re-invents what gRPC-go already implements between ClientConn and SubConn for state synchronization, health checking, etc.

ahrtr (Member, Author) commented Sep 13, 2023

Endpoints was added to State in grpc/grpc-go#6471, and Addresses will be deprecated and fully replaced by Endpoints, so etcd should also change our customized resolver accordingly.

Since the gRPC resolver package is still experimental, we don't need to rush to make the change for now.
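A sketch of what that change could look like once the Endpoints field is available in the grpc-go version etcd depends on (illustrative; the helper name is made up):

package resolversketch

import "google.golang.org/grpc/resolver"

// stateFromEndpoints builds a resolver.State using the newer Endpoints field
// (one Endpoint per etcd server) instead of the flat Addresses list.
func stateFromEndpoints(endpoints []string) resolver.State {
	eps := make([]resolver.Endpoint, 0, len(endpoints))
	for _, ep := range endpoints {
		eps = append(eps, resolver.Endpoint{
			Addresses: []resolver.Address{{Addr: ep}},
		})
	}
	return resolver.State{Endpoints: eps}
}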

chaochn47 (Member) commented Sep 20, 2023

Hi @ahrtr, based on a recent PoC of gRPC LB health checking and outlier detection balancer integration with the etcd client, I propose not to remove the usage of the gRPC resolver & balancer.

just to randomly add what I've found in #14501 - once we reconsider the balancer/resolver, we should also add some circuit breaking. There were some ideas by Easwar in grpc/grpc-go#5672.

I favor keeping the opportunity to improve the etcd client with better availability while keeping an eye on the gRPC API changes.

Note there is an ongoing project in gRPC-go to Promote Resolver/Balancer API to Stable

ahrtr (Member, Author) commented Sep 20, 2023

I propose not to remove the usage of gRPC resolver & balancer.

Yes, just as I mentioned in #15145 (comment), etcd's usage of grpc-go's resolver package is really very simple and basic, and it's a non-trivial effort to re-invent the wheel. So there is no plan to remove the gRPC resolver & balancer.

But we need to follow up on gRPC's changes, and may need to make a change something like #15145 (comment).

chaochn47 (Member) commented:

But we need to followup the grpc's change, and may need to make change something like #15145 (comment)

+100 Totally agree

chaochn47 (Member) commented:

Another deprecated API usage.

The gRPC v1.44.0 change log shows no behavior changes, but it reveals that we should replace all grpc.WithInsecure() calls with grpc.WithTransportCredentials(insecure.NewCredentials()), based on grpc/grpc-go#5068, in 3.4/3.5/main.
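The replacement in code (behavior is unchanged; the package name and dial target are illustrative):

package dialsketch

import (
	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

// Before: grpc.Dial(target, grpc.WithInsecure())
// After: the deprecated option is swapped for the credentials/insecure package.
func dialInsecure(target string) (*grpc.ClientConn, error) {
	return grpc.Dial(target, grpc.WithTransportCredentials(insecure.NewCredentials()))
}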

dfawley commented Oct 25, 2023

gRPC v1.44.0 change log, no behavior changes. But it reveals that we should replace all grpc.WithInsecure() with grpc.WithTransportCredentials(insecure.NewCredentials()) based on grpc/grpc-go#5068 in 3.4/3.5/main

Also consider that you shouldn't use "insecure" credentials at all. Tests should be okay, though it should also be OK for them to use local credentials. Production code should probably also prefer local credentials (or TLS, etc.), which verify that the address is local and not over a network: https://pkg.go.dev/google.golang.org/[email protected]/credentials/local#NewCredentials

Yes, this is unfortunately also experimental. I believe we can promote it to stable, as I'm not aware of any issues.
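A small sketch of the local-credentials option mentioned above (the socket path is illustrative):

package dialsketch

import (
	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/local"
)

// dialLocal uses local credentials, which only succeed for UDS or loopback
// connections, instead of fully insecure credentials.
func dialLocal() (*grpc.ClientConn, error) {
	return grpc.Dial("unix:///tmp/etcd.sock",
		grpc.WithTransportCredentials(local.NewCredentials()),
	)
}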

stale bot commented Mar 17, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

stale bot added the stale label Mar 17, 2024
dfawley commented Mar 18, 2024

This is still very relevant for us and for your users.

stale bot removed the stale label Mar 18, 2024