Distinguish between ambiguous and unambiguous failures #1443
Per the gRPC spec, …
Actually, I think the root cause of this issue is that our non-Wait For Ready / FailFast behavior is wrong. We are supposed to block if there is no active connection (i.e. the …). At that time, the error returned should be either ….

If you turn off Fail Fast (i.e. if you enable Wait For Ready), then RPCs will always block until the context is canceled or expires, or the ….

If you are maintaining multiple connections to multiple backends, you should do that within a single ….
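(For concreteness, here is a minimal sketch of how wait-for-ready is requested per call in grpc-go. The helloworld Greeter client is only a stand-in service; in the releases discussed in this thread the option was spelled grpc.FailFast(false), while later releases added the equivalent grpc.WaitForReady(true).)

```go
package example

import (
	"context"
	"time"

	"google.golang.org/grpc"
	pb "google.golang.org/grpc/examples/helloworld/helloworld"
)

// callWithWaitForReady issues one RPC with wait-for-ready enabled: instead of
// failing fast while the channel has no ready connection, the call blocks
// until the connection becomes ready or the context deadline expires.
func callWithWaitForReady(conn *grpc.ClientConn) error {
	client := pb.NewGreeterClient(conn)

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// Older grpc-go releases spell this option grpc.FailFast(false).
	_, err := client.SayHello(ctx, &pb.HelloRequest{Name: "example"},
		grpc.WaitForReady(true))
	return err
}
```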
Let's track the wait for ready bug via #1463. Closing this as a duplicate of that, since it is the root cause.
I don't understand how #1463 is the root cause of this issue. For one thing, it looks like the code is already working as you say it should, and only returning fail-fast errors for the transient-failure state (see lines 1186 to 1190 at commit fa59b77).
But even if fail-fast were changed to return errors less often, the errors still wouldn't let us know whether the request was possibly executed or not.
I don't think that's an option for us - we need to choose which backend(s) to use on an RPC-by-RPC basis. Is there some way I'm not seeing to configure gRPC to do this?
Are you using a load balancer? If so, depending on the implementation (our RoundRobin balancer does this), it will return an error when there are no backends (see line 866 at commit fa59b77).
If you are not using a balancer, then it does seem that we follow the right behavior (though I was fairly sure I observed the opposite). Is your connection actually in the transient failure state, possibly? Note that when retry support is added (ETA ~1 month), gRPC will transparently retry RPCs that didn't leave the client or were not seen by the server application. So this should not be something you need to distinguish externally after that time.
You should be able to implement a custom balancer to do this if round-robin is not suitable for your needs. (Note that the API is going to change around here pretty significantly in the next month or two - see #1388.)
No. In part due to the documented experimental/unstable nature of the interface (as mentioned below).
Yes. We want to distinguish between "connection was transient-failed before this rpc was sent, so we didn't try" and "connection failed while this rpc was in flight".
Interesting. That in conjunction with a custom balancer might do the trick, if the interfaces let us do what we want. But in general our usage is atypical enough that we're more comfortable taking manual control here; my first choice is to just get a distinct error code.
In theory, the errors we return are supposed to match the spec. (We are aware we don't get this right in all instances.) The spec doesn't specify what should be returned in this particular situation. I'm checking with C/Java. This leaves us with either a gRFC to change the error (if it's UNAVAILABLE in the other languages as well), which could be tricky, or a Go-specific feature to help identify whether data was transmitted before any error occurred. I'll reopen this, because it seems it's not related to our "wait for ready" implementation.
OK, so the spec clearly says we were wrong to assume that ….

I'm not sure if an error code is the right way to handle this, though. If I make an RPC to a server that makes another RPC, does the error code from the inner RPC get propagated back to my client, or does it get turned into …?
This is entirely up to the implementation of the middle service. If it does:

```go
func MyHandler(...) (..., error) {
	if _, err := innerSvc.Call(...); err != nil {
		return err
	}
}
```

then the status code will be propagated. This is not typically recommended; the middle service should translate the error into something that makes sense to its client -- usually by switching on the status code returned from the inner service.
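(As a sketch of the recommended alternative: the helper below is hypothetical, but it shows the kind of translation meant here, using status.Code and status.Errorf from google.golang.org/grpc/status to map inner-service codes onto codes that make sense to the middle service's own client. The specific code mappings are illustrative only.)

```go
package example

import (
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// translateInnerError is a hypothetical helper a middle service might use to
// map errors from its backend into codes meaningful to its own client,
// instead of returning the inner status verbatim.
func translateInnerError(err error) error {
	if err == nil {
		return nil
	}
	switch status.Code(err) {
	case codes.NotFound:
		// A missing record in the backend is a caller problem for this API.
		return status.Errorf(codes.InvalidArgument, "unknown entity: %v", err)
	case codes.Unavailable, codes.DeadlineExceeded:
		// Backend trouble is reported as UNAVAILABLE to our own client.
		return status.Errorf(codes.Unavailable, "backend unavailable: %v", err)
	default:
		return status.Errorf(codes.Internal, "inner call failed: %v", err)
	}
}
```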
Is there any update here? We just upgraded from 1.6 to 1.7.1 and it apparently changed some of the implementation details we were relying on to distinguish ambiguous from unambiguous failures (cockroachdb/cockroach#19708). We're probably going to have to revert to 1.6 for now. I see that the transparent retry change just landed, but remember that this is not enough for us - we also need control over routing. Have the balancer APIs mentioned in #1443 (comment) been stabilized yet? (note that #1388 is still open)
This is similar to what grpc does internally for fail-fast RPCs, but lets us control the error returned, to work around grpc/grpc-go#1443.

Fixes cockroachdb#19708.

Release note (performance improvement): Reduced the occurrence of ambiguous errors when a node is down.
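(The referenced CockroachDB change is its own workaround and is not reproduced here. As a rough illustration of the general idea, deciding before sending whether the RPC could possibly have reached the server, here is a hedged sketch using ClientConn.GetState and the connectivity package; GetState is marked experimental, and the check is inherently racy, so it narrows the ambiguous window rather than eliminating it.)

```go
package example

import (
	"context"
	"errors"

	"google.golang.org/grpc"
	"google.golang.org/grpc/connectivity"
)

// errNotAttempted is a caller-defined sentinel meaning "this RPC was never
// sent, so it is safe to try it on another backend".
var errNotAttempted = errors.New("rpc not attempted: connection not ready")

// callIfReady only issues the RPC when the connection is already READY, so a
// failure returned before that point is known to be unambiguous. The state
// can change immediately after the check, so this is best-effort only.
func callIfReady(ctx context.Context, conn *grpc.ClientConn,
	do func(ctx context.Context) error) error {
	if conn.GetState() != connectivity.Ready {
		return errNotAttempted
	}
	return do(ctx)
}
```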
Closing due to lack of activity and priority. |
What version of gRPC are you using?
What version of Go are you using (go version)?
1.8.3
What operating system (Linux, Windows, …) and version?
Linux (Ubuntu 16.04)
What did you do?
I'm trying to send an RPC with a side effect that should happen at most once. I have connections to multiple backends and use the FailFast option; if one backend is down I want to try another.
What did you expect to see?
This requires that I can distinguish the error returned from a backend that is found to be unavailable before the request is sent from one that fails later. (I had previously assumed that codes.Unavailable would be suitable for this, but I was wrong - it's also used when a server is draining, when the request may have already been processed.)
What did you see instead?
The fail-fast error for a down backend has the same error code as the one used for a draining server (and possibly other uses); it is impossible to distinguish them without looking at the text of the error message.
I would like some way to inspect the error to determine whether the request was possibly processed by the server. This could be a distinct error code, string matching on the error message (if the string to use is documented and guaranteed for future versions), or other attributes of the error. It's ok if this mechanism is not perfectly precise (my priority is at-most-once execution instead of maximizing the chances of success), but I would like to at least be able to distinguish the fail-fast error linked above.
(For further background, our incorrect assumption that an error with codes.Unavailable meant that the RPC had not been attempted on the server was the cause of cockroachdb/cockroach#17491.)
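(To make the request above concrete, here is a hedged sketch of the only inspection available under these constraints: reading the status code with status.FromError and then matching on the message text. The message string checked below is an assumption based on the fail-fast error some grpc-go releases return; it is not part of any documented contract, which is exactly the gap this issue asks to close.)

```go
package example

import (
	"strings"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// definitelyNotAttempted reports whether err looks like a fail-fast error
// returned before the request was ever sent. The message match is brittle and
// version-dependent; it only illustrates what callers are reduced to without
// a dedicated signal from grpc-go.
func definitelyNotAttempted(err error) bool {
	st, ok := status.FromError(err)
	if !ok || st.Code() != codes.Unavailable {
		return false
	}
	// Assumed message text from the fail-fast path in some grpc-go releases;
	// not guaranteed across versions.
	return strings.Contains(st.Message(), "the connection is unavailable")
}
```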