New implementation of roundrobin and pickfirst #1506

Merged: 4 commits, Oct 2, 2017

Conversation

menghanl
Contributor

@menghanl menghanl commented Sep 6, 2017

ClientConn

  • monitor resolver and balancer updates
  • notify balancer of new updates

New implementations:

  • roundrobin
  • pickfirst
  • testing resolvers (manual and passthrough)

$new-bar-2$

Fixes #1504

@menghanl menghanl requested a review from dfawley September 6, 2017 01:34
@dfawley dfawley self-assigned this Sep 7, 2017
@menghanl menghanl added the Type: API Change Breaking API changes (experimental APIs only!) label Sep 8, 2017
@menghanl menghanl added this to the 1.7 Release milestone Sep 8, 2017
@menghanl menghanl force-pushed the bar_new_implementation branch 2 times, most recently from c248e34 to 6a94105 Compare September 12, 2017 18:52
@menghanl menghanl force-pushed the bar_new_implementation branch 2 times, most recently from 8906133 to 74a3567 Compare September 20, 2017 22:56
Member

@dfawley dfawley left a comment

Sorry for the delay. It's a big PR, so it was easy to put off. :)

@@ -182,6 +182,10 @@ type Picker interface {
// the connectivity states.
//
// It also generates and updates the Picker used by gRPC to pick SubConns for RPCs.
//
// HandleSubConnectionStateChange, HandleResolvedAddrs and Close are guaranteed
// to be called sequentially by the same goroutine.
Member

"sequentially" means "in order". I think we want "synchronously from the same goroutine" instead.

Contributor Author

Done
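
To illustrate "synchronously from the same goroutine": a minimal sketch, not the code in this PR, in which one goroutine drains a channel of updates and makes every balancer call itself. The `update` and `recorder` types and the `run` function are hypothetical names. Because only one goroutine touches the balancer, it needs no mutex of its own.

```go
package main

import "fmt"

// update is a hypothetical event: either new resolved addresses or a
// SubConn state change.
type update struct {
	addrs []string // non-nil for resolver updates
	state string   // used for SubConn state changes
}

// recorder is a toy balancer. It has no mutex because all of its methods
// are called from a single goroutine.
type recorder struct {
	log []string
}

func (r *recorder) HandleResolvedAddrs(addrs []string) {
	r.log = append(r.log, fmt.Sprintf("addrs=%v", addrs))
}

func (r *recorder) HandleSubConnStateChange(state string) {
	r.log = append(r.log, "state="+state)
}

// run delivers every update to b from the calling goroutine and returns
// when the channel is closed.
func run(b *recorder, updates <-chan update) {
	for u := range updates {
		if u.addrs != nil {
			b.HandleResolvedAddrs(u.addrs)
		} else {
			b.HandleSubConnStateChange(u.state)
		}
	}
}

func main() {
	updates := make(chan update)
	done := make(chan struct{})
	b := &recorder{}
	go func() {
		run(b, updates)
		close(done)
	}()
	updates <- update{addrs: []string{"10.0.0.1:443"}}
	updates <- update{state: "READY"}
	close(updates)
	<-done // after this, reading b.log from main is safe
	fmt.Println(b.log)
}
```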

@@ -196,6 +200,7 @@ type Balancer interface {
// An empty address slice and a non-nil error will be passed if the resolver returns
// non-nil error to gRPC.
HandleResolvedAddrs([]resolver.Address, error)
// Close closes the balancer.
// Close closes the balancer. Balancer is expected to call RemoveSubConn for
Member

Change: "The balancer is not required to call ClientConn.RemoveSubConn for its existing SubConns."?

Contributor Author

Done

*
*/

// Package roundrobin defines a roundrobin balancer.
Member

Document how to use this or when/by whom it should be used?

Should this package register itself with grpc when imported, instead of exporting NewBuilder()?

Contributor Author

Comments updated, PTAL.

Most users don't need to call NewBuilder. But if they want to have a custom balancer on top of roundrobin, they can call this function to create a builder. (I did this in new grpclb at first, but removed it later).

But there's already another way to get a roundrobin builder: balancer.Get("roundrobin"). I will unexport this function.
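
The register-on-import idea can be sketched with a small name-to-builder registry. This is a standard-library-only illustration, not the gRPC `balancer` package itself: the `Builder` interface here is a reduced stand-in, and `registry` is a hypothetical variable name.

```go
package main

import "fmt"

// Builder is a reduced stand-in for a balancer builder; only the name
// matters for this sketch.
type Builder interface {
	Name() string
}

type rrBuilder struct{}

func (rrBuilder) Name() string { return "roundrobin" }

// registry maps a balancer name to its builder. It is not locked, so this
// sketch assumes Register is only called during initialization.
var registry = map[string]Builder{}

// Register stores b under its own name.
func Register(b Builder) { registry[b.Name()] = b }

// Get returns the builder registered under name, or nil if there is none.
func Get(name string) Builder {
	if b, ok := registry[name]; ok {
		return b
	}
	return nil
}

// init registers the builder when the package is loaded, so callers never
// need an exported constructor.
func init() { Register(rrBuilder{}) }

func main() {
	fmt.Println(Get("roundrobin") != nil) // registered at init time
	fmt.Println(Get("unknown") == nil)    // unregistered names yield nil
}
```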

type roundrobinBuilder struct{}

func (*roundrobinBuilder) Build(cc balancer.ClientConn, opt balancer.BuildOptions) balancer.Balancer {
b := &roundrobinBalancer{
Member

Nit: return directly instead of creating b.

Contributor Author

Done

return
}
grpclog.Infoln("roundrobinBalancer: got new resolved addresses: ", addrs)
// addrsSet is the set converted from addrs, it's used to quick lookup for an address.
Member

"to quickly look up an address" or "used for quick lookup of an address".

Contributor Author

Done

// Unregister removes the resolver builder with the given scheme from the
// resolver map.
// This function is for testing only.
func Unregister(scheme string) {
Member

UnregisterForTesting?

Contributor Author

Done.

clientconn.go Outdated
@@ -162,6 +163,14 @@ func WithBalancer(b Balancer) DialOption {
}
}

// WithBalancerBuilder is for testing only and should be removed.
// TODO(bar) remove this or change the comment.
Member

Someone once told me not to put TODOs in docstrings... Seems like a reasonable policy since it's for code maintainers and not users? Maybe move this inside the function or above the docstring comment.

Contributor Author

Done

)
if cc.balancer == nil {
func (cc *ClientConn) getTransport(ctx context.Context, failfast bool) (transport.ClientTransport, func(balancer.DoneInfo), error) {
if cc.balancerWrapper == nil {
// If balancer is nil, there should be only one addrConn available.
cc.mu.RLock()
if cc.conns == nil {
Member

if len(cc.conns) == 0 instead? Then you don't need the "if ac == nil" below.

Why does this return toRPCErr(ErrClientConnClosing) but we return errConnClosing below? We need to clean up this error stuff.

Contributor Author

cc.conns is set to nil in cc.Close, so the error returned when cc.conns == nil is different from the error returned when ac == nil...

Added a TODO for error cleanup.
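
The nil-versus-empty distinction can be shown in isolation. In this sketch (a toy `clientConn`, not the real `ClientConn`; `status` and its return strings are made up for illustration), `len(cc.conns) == 0` is true both before any connection exists and after `Close`, while `cc.conns == nil` is true only after `Close`.

```go
package main

import "fmt"

// clientConn is a toy stand-in. conns is an empty map until a connection
// is added, and is set to nil by Close.
type clientConn struct {
	conns map[string]struct{}
}

func newClientConn() *clientConn {
	return &clientConn{conns: map[string]struct{}{}}
}

// Close marks the connection as closed by dropping the map.
func (cc *clientConn) Close() { cc.conns = nil }

// status separates the two cases that a len(cc.conns) == 0 check would merge.
func (cc *clientConn) status() string {
	switch {
	case cc.conns == nil:
		return "closing"
	case len(cc.conns) == 0:
		return "no connection"
	default:
		return "ok"
	}
}

func main() {
	cc := newClientConn()
	fmt.Println(cc.status(), len(cc.conns) == 0)
	cc.Close()
	fmt.Println(cc.status(), len(cc.conns) == 0)
}
```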

*
*/

// Package passthrough implements a pass-through resolver.
Member

Disclaimer about "for grpc internal use only"?

Contributor Author

Done.


rb := resolver.Get(scheme)
if rb == nil {
// TODO(bar) return error when DNS becomes the default (implemeneted and
Member

*implemented

Contributor Author

Done

todos, comments and prints

remove mutex from roundrobin

move todos and comments

fixes in b wrapper

split tuple to scstate tuple and resolver tuple
add default select for done

blocking picker

add blockingpick test
remove picker version

cleanup in grpc files

cleanup and comments in r and b wrapper
@menghanl menghanl force-pushed the bar_new_implementation branch from 74a3567 to c76f122 Compare September 27, 2017 21:56
@googlebot

Thanks for your pull request. It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please visit https://cla.developers.google.com/ to sign.

Once you've signed, please reply here (e.g. I signed it!) and we'll verify. Thanks.


  • If you've already signed a CLA, it's possible we don't have your GitHub username or you're using a different email address. Check your existing CLA data and verify that your email is set on your git commits.
  • If your company signed a CLA, they designated a Point of Contact who decides which employees are authorized to participate. You may need to contact the Point of Contact for your company and ask to be added to the group of authorized contributors. If you don't know who your Point of Contact is, direct the project maintainer to go/cla#troubleshoot.
  • In order to pass this check, please resolve this problem and have the pull request author add another comment and the bot will run again.

@googlebot

CLAs look good, thanks!

@menghanl menghanl force-pushed the bar_new_implementation branch from cbb4d08 to b2fe11a Compare September 28, 2017 20:08
@dfawley dfawley assigned menghanl and dfawley and unassigned dfawley and menghanl Sep 28, 2017
Contributor Author

@menghanl menghanl left a comment

Thanks a lot for the review! All done. PTAL.

pickfirst.go Outdated
sc balancer.SubConn
}

func (p *picker) Pick(ctx context.Context, opts balancer.PickOptions) (conn balancer.SubConn, put func(balancer.DoneInfo), err error) {
Contributor Author

Names removed.

stream.go Outdated
for {
t, put, err = cc.getTransport(ctx, gopts)
t, put, err = cc.getTransport(ctx, c.failFast)
Contributor Author

It's done.

 - balancer subdir
 - conn wrapper
 - picker
 - resolver
 - put -> done

and make grpclb blocking dial
@menghanl menghanl force-pushed the bar_new_implementation branch from b2fe11a to fd903ae Compare September 29, 2017 21:43
if !ok && failfast {
return nil, nil, Errorf(codes.Unavailable, "there is no connection available")
}
if s, ok := bw.connSt[sc]; failfast && (!ok || s.s != connectivity.Ready) {
Member

I'm fine with this mostly because the zero value isn't ideal (i.e. an Invalid or Unknown). Otherwise, the code ends up simpler because there are fewer conditions involved.

// subConns is the snapshot of the roundrobin balancer when this picker was
// created. The slice is immutable. Each Get() will do a round robin
// selection from it and return the selected SubConn.
size int // size if the size of subConns.
Member

"is"?

Contributor Author

Done
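
The snapshot-based picker described in the comment above can be sketched as follows. This is a simplified illustration under stated assumptions, not the picker in this PR: SubConns are plain strings, `Pick` takes no arguments, and `picker`, `newPicker` and `errNoSubConn` are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var errNoSubConn = errors.New("no SubConn available")

// picker holds an immutable snapshot of the ready SubConns. Only the
// position of the next pick changes, and that is guarded by mu.
type picker struct {
	subConns []string // never modified after newPicker returns

	mu   sync.Mutex
	next int
}

// newPicker copies subConns so later changes by the caller cannot affect
// the snapshot.
func newPicker(subConns []string) *picker {
	return &picker{subConns: append([]string(nil), subConns...)}
}

// Pick returns the SubConns in order, wrapping around at the end.
func (p *picker) Pick() (string, error) {
	if len(p.subConns) == 0 {
		return "", errNoSubConn
	}
	p.mu.Lock()
	sc := p.subConns[p.next]
	p.next = (p.next + 1) % len(p.subConns)
	p.mu.Unlock()
	return sc, nil
}

func main() {
	p := newPicker([]string{"a", "b", "c"})
	for i := 0; i < 4; i++ {
		sc, _ := p.Pick()
		fmt.Print(sc, " ")
	}
	fmt.Println()
}
```

When the set of ready SubConns changes, a balancer built this way would create a new picker rather than mutate the old one, which is why the slice can stay unlocked.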

// Package roundrobin defines a roundrobin balancer. Roundrobin balancer is
// installed as one of the default balancers in gRPC, users don't need to
// explicitly install this balancer.
package roundrobin
Member

s/roundrobin// for package variables/functions/types where possible.

Contributor Author

@menghanl menghanl Sep 29, 2017

type balancer will conflict with the imported balancer package. I renamed them to rrBuilder and rrBalancer...

stream.go Outdated
for {
t, put, err = cc.getTransport(ctx, gopts)
t, put, err = cc.getTransport(ctx, c.failFast)
Member

I see what you did there.

Member

@dfawley dfawley left a comment

🎉 🎆 🍾

@dfawley dfawley merged commit 4bbdf23 into grpc:master Oct 2, 2017
@menghanl menghanl deleted the bar_new_implementation branch October 2, 2017 23:09
@tamird
Contributor

tamird commented Oct 18, 2017

This causes a regression in CockroachDB - we're seeing a number of flaky tests. I haven't yet determined the exact bug, but git bisect points to this merge as the first bad commit.

The symptom appears to be a hanging streaming RPC, though I continue to investigate.

My understanding is that this new implementation was meant to be opt-in, but there's definitely a behaviour change here.

EDIT: to reproduce in CockroachDB run make stressrace TESTS=TestDistSQLRangeCachesIntegrationTest PKG=./pkg/sql; the test fails reliably within 2 minutes.

@menghanl
Contributor Author

menghanl commented Oct 18, 2017

It seems failfast is the reason for the flakiness. Some RPCs fail because the connection is down (or not ready at the beginning).

RPCs are failfast (not waitForReady) by default. I tried setting the default failfast to false in the gRPC code, and the stress test didn't fail after running for 57 minutes.

The related change in this PR is this line, which fixed a bug where failfast RPCs did not fail properly before.

Please try to make all the RPCs non-failfast by using grpc.FailFast(false) call option. (You can set default call option for a ClientConn using WithDefaultCallOptions)

@dfawley
Member

dfawley commented Oct 18, 2017

Please try to make all the RPCs non-failfast by using grpc.FailFast(false) call option. (You can set default call option for a ClientConn using WithDefaultCallOptions)

Note that if you're doing unary RPCs, then this could be dangerous right now if they are not idempotent because of #1532. I revived my branch that fixes this problem and hope to have it checked in this week.

EDIT: added "if they are not idempotent"

@tamird
Contributor

tamird commented Oct 19, 2017

Help me understand the change that was made here. Previously every call to resetTransport would (incorrectly) set the connection state to connectivity.Connecting, which would in effect cause RPCs to behave as non-failfast, since RPCs will wait for transports in that state. After this change, resetTransport sets the state to connectivity.TransientFailure, which has the correct behaviour of making failfast RPCs not wait.

Now, help me understand what would happen in the following (racy) situation:

  • process 1 starts, calls Dial (or DialContext)
  • process 1's transport gets a "connection refused" because process 2 hasn't started yet
  • process 1 attempts to send an RPC on that connection
  • process 2 starts
  • process 1's RPC fails because the initial "connection refused" kicked it into connectivity.TransientFailure

Is that right? I think that's probably the issue we're seeing in cockroach - multiple nodes are started in non-deterministic order by the test harness and all attempt to talk to each other, but now that GRPC tanks failfast RPCs even before the connection is ever established (after this change) we're seeing those races trigger start-up failures.

Would it be possible for the transport to remain in connectivity.Connecting until a connection is established at all? That seems to be the design intent.

Please correct me if I've misunderstood the behaviour.
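
The behaviour described above can be written down as a small decision function. This is a model of the description in this comment, not code from gRPC: the state names mirror the connectivity states mentioned in the thread, while `Outcome` and `outcome` are hypothetical.

```go
package main

import "fmt"

// State models the connectivity states discussed in this thread.
type State int

const (
	Connecting State = iota
	Ready
	TransientFailure
)

// Outcome is what happens to an RPC issued while the connection is in a
// given state.
type Outcome int

const (
	Send Outcome = iota // the RPC goes out on the ready connection
	Wait                // the RPC blocks until the state changes
	Fail                // the RPC is rejected immediately
)

// outcome encodes the rule as described: only failfast RPCs are rejected,
// and only while the connection is in TransientFailure.
func outcome(s State, failfast bool) Outcome {
	switch {
	case s == Ready:
		return Send
	case s == TransientFailure && failfast:
		return Fail
	default:
		return Wait
	}
}

func main() {
	// Process 2 has not started yet, so process 1's first connection
	// attempt was refused and the state is TransientFailure.
	fmt.Println(outcome(TransientFailure, true) == Fail)  // failfast RPC is rejected
	fmt.Println(outcome(TransientFailure, false) == Wait) // non-failfast RPC waits
	fmt.Println(outcome(Connecting, true) == Wait)        // the behaviour before this change
}
```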

@menghanl
Contributor Author

What you observed is correct. The first failfast RPC could fail because the connection is not ready yet.
Would it make sense for you to use a blocking dial instead, so that the connection is ready by the time you make the RPC?

I will also make a change to skip the TransientFailure in transport if it's the first time connecting.
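
One possible reading of that proposed change, sketched as a tiny state machine. This is an assumption about the intent, not the change that was later made: the `conn` type, the `everReady` flag and both method names are hypothetical.

```go
package main

import "fmt"

// State models the connectivity states discussed in this thread.
type State int

const (
	Connecting State = iota
	Ready
	TransientFailure
)

// conn remembers whether a connection has ever been established.
type conn struct {
	state     State
	everReady bool
}

// onConnected records a successful connection.
func (c *conn) onConnected() {
	c.state = Ready
	c.everReady = true
}

// onConnectError reports TransientFailure only after a connection has
// succeeded at least once. Before that it stays in Connecting, so failfast
// RPCs keep waiting during startup.
func (c *conn) onConnectError() {
	if c.everReady {
		c.state = TransientFailure
		return
	}
	c.state = Connecting
}

func main() {
	c := &conn{state: Connecting}
	c.onConnectError() // the peer has not started yet
	fmt.Println(c.state == Connecting)
	c.onConnected()
	c.onConnectError() // an established connection is lost
	fmt.Println(c.state == TransientFailure)
}
```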

if len(targetSplitted) >= 2 {
scheme = targetSplitted[0]
}
grpclog.Infof("dialing to target with scheme: %q", scheme)
@aybabtme aybabtme Oct 19, 2017

This new log line means that all existing code now has grpc emit a log when it dials. Along with line 56, all existing code that dials with a host or IP and no specified scheme gets a second log line about the lack of a resolver for the "" scheme.

This is a problem because grpc is used in CLI clients, and having grpc log these internal, unimportant details makes the UX of these CLIs confusing: random logs show up, suggesting that there are connection errors when there aren't.

It would be good if these two log lines could be removed, or reduced to a debug level.
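
To show how a bare host or IP ends up with the "" scheme, here is a minimal sketch of the check in the snippet above. How the target is actually split is not visible in this diff, so the use of `strings.SplitN` on "://" is an assumption, and `schemeOf` is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"strings"
)

// schemeOf returns the part of target before "://", or "" when the target
// has no scheme (for example a plain host:port).
func schemeOf(target string) string {
	parts := strings.SplitN(target, "://", 2)
	if len(parts) >= 2 {
		return parts[0]
	}
	return ""
}

func main() {
	for _, target := range []string{"dns://8.8.8.8/example.com", "localhost:50051", "10.0.0.1:443"} {
		fmt.Printf("%q has scheme %q\n", target, schemeOf(target))
	}
}
```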

@menghanl menghanl mentioned this pull request Nov 29, 2017
8 tasks
@lock lock bot locked as resolved and limited conversation to collaborators Jan 18, 2019