
received context error while waiting for new LB policy update: context deadline exceeded #7983

Open
mayurkale22 opened this issue Jan 7, 2025 · 10 comments · Fixed by #8035
Assignees
Labels
Area: Resolvers/Balancers Includes LB policy & NR APIs, resolver/balancer/picker wrappers, LB policy impls and utilities. Status: Requires Reporter Clarification Type: Question

Comments

@mayurkale22

We're seeing an intermittent issue. It always happens randomly at app startup time and prevents the app from starting up properly:

rpc error: code = DeadlineExceeded desc = received context error while waiting for new LB policy update: context deadline exceeded

How should we interpret this error? Does it signal an issue with connectivity, gRPC server/client configuration, or something else entirely? We'd appreciate any feedback on this.

@eshitachandwani
Member

Hey @mayurkale22, this error message alone doesn't provide much information about what could be causing it. The problem could stem from a number of issues, from a slow network to an error in the application to a problem with name resolution. To help identify the cause, you can enable debug logs using

$ export GRPC_GO_LOG_VERBOSITY_LEVEL=99
$ export GRPC_GO_LOG_SEVERITY_LEVEL=info

and share the output, which should give us a better idea of the root cause.

@itzmanish

Hi @mayurkale22, are you using the DNS resolver?

I'm also getting the same error on the latest gRPC client (grpc-go v1.67.3).

I can see the resolver resolving 3 endpoints that point to my NLB, and they are correct. The only difference I can see is that it's using the "pick_first" LB policy instead of round_robin.

@purnesh42H purnesh42H added Status: Requires Reporter Clarification Area: Resolvers/Balancers Includes LB policy & NR APIs, resolver/balancer/picker wrappers, LB policy impls and utilities. labels Jan 10, 2025
@itzmanish

I have more information on this issue. I got it working for my use case, though I'm not sure whether this really fixed the problem or just mitigated it for the time being.

In my setup I had a DNS endpoint resolving to multiple A records for an NLB, and my client was using the pick_first load balancer (possibly because it's the default). I changed the load balancer to round_robin and no longer see the context deadline issue. My guess is that one of the NLB endpoints was taking too long, or the gRPC library was not able to update the resolver state for that endpoint.

I am still poking around the library to understand the flow and how it creates a client connection. I will update here if I find anything else.

@purnesh42H
Contributor

purnesh42H commented Jan 15, 2025

@mayurkale22 could you provide more information in the following format https://github.com/grpc/grpc-go/issues/new?template=bug.md with debug logging enabled?

Please specify whether you are using a non-default load balancing policy or name resolver. Also mention the gRPC version you are on and whether you are using grpc.NewClient or grpc.Dial.

@purnesh42H
Contributor

purnesh42H commented Jan 15, 2025

As for your other question of how to interpret rpc error: code = DeadlineExceeded desc = received context error while waiting for new LB policy update: context deadline exceeded:

It can happen when either there is no picker to connect to a backend, or there was a valid picker that has since become invalid (because the balancer detected a change in backend availability). So it's more likely a connectivity issue, since you mentioned it happens intermittently. Do you have a single backend or multiple?

To give some background: the Picker is used by gRPC to pick a SubConn (backend) to send an RPC on. The Balancer is expected to generate a new picker from its snapshot every time its internal state changes. The Balancer takes input from gRPC, manages SubConns, and collects and aggregates the connectivity states. It also generates and updates the Picker that gRPC uses to pick SubConns (backends) for RPCs.
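For reference, the picker contract lives in grpc-go's balancer package and, abridged, looks like this:

```go
// Abridged from google.golang.org/grpc/balancer.
type Picker interface {
	// Pick selects the SubConn (backend connection) for the RPC described
	// by info. Returning balancer.ErrNoSubConnAvailable makes gRPC block
	// the RPC until the balancer produces a new picker — which is what
	// the RPC in this issue was waiting on when its context expired.
	Pick(info PickInfo) (PickResult, error)
}
```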

@dfawley
Member

dfawley commented Jan 16, 2025

Note that there was a recent change to output this message instead of a more generic "deadline exceeded" error.

If this is happening at startup, then it's almost always going to be that we are still waiting for connections to be established. Maybe the RPC has too short of a deadline?

If there were errors connecting, then those errors would be given to the RPC instead.

I wonder if we can further improve this error so that users don't feel confused by it and need to file issues to learn more.


This issue is labeled as requiring an update from the reporter, and no update has been received after 6 days. If no update is provided in the next 7 days, this issue will be automatically closed.

@github-actions github-actions bot added the stale label Jan 22, 2025
@itzmanish

> My guess is for some reason one of the NLB endpoints was taking longer or GRPC lib was not able to update the resolver state for that endpoint.

@purnesh42H is my guess correct? I think in pick_first only one connection is made, and if it gets a deadline exceeded the request fails, while in round_robin multiple connections are made and a request goes over all of them. That would explain why my service works correctly after switching to round_robin.

@purnesh42H
Contributor

> is my guess correct? I think in pick_first only one connection is made, and if it gets a deadline exceeded the request fails.

No. pick_first also handles failover. It will not give up after a single connection failure or a deadline exceeded on a single backend; it will attempt to connect to the next backend in the resolved list.

> While in round_robin multiple connections are made and a request goes over all of them. That would explain why my service works correctly after switching to round_robin.

Not really. round_robin sends each request to a single backend, chosen in round-robin order. It does maintain connections to multiple backends, but it doesn't send the same request to all of them.

@arjan-bal
Contributor

arjan-bal commented Jan 24, 2025

@itzmanish pick_first tries one address at a time until it finds a healthy backend, which minimizes the number of active transports. round_robin tries to create a transport to every backend at the same time and reports ready as soon as a single backend is connected. If there are unhealthy backends at the front of the address list produced by DNS, pick_first will take longer than round_robin to report ready. On the other hand, round_robin will create more transports than pick_first.

If you want to fix the issues you're seeing with pickfirst, consider the following:

  1. Increase the timeout for the context that is used to make RPCs.
  2. If the backends are unreachable, use a custom dialer and set a reasonable timeout for establishing connections, e.g. net.Dialer{Timeout: 2 * time.Second}. This should make pick_first move to the next address in the list faster.

@github-actions github-actions bot removed the stale label Jan 28, 2025