xds: ADS stream failure triggers wildcard subscriptions on new stream #7013
Comments
I want to add another problem that I see in the current behavior of grpc-go (and probably other languages?) for channel transition:
If the data plane is disconnected from the management server when this transition happens, when the management server comes back online, it has no way to know that the client has discarded the corresponding API listener. I'm not sure if this qualifies as a bug or just an oddity of the protocol. Consider the following events:
If the management server comes back online after step 3 or step 4, it has no way to know that the client does not already have a valid API listener for the subscribed resource, as the version matches the current resource version. The data plane has discarded the resource, the resource has not changed so the version has not changed, and the management server was not exposed to the unsubscription and re-subscription events because it was offline when they occurred. The channel will only recover from that state when at least one listener resource changes, triggering a full update response. That seems like an inherent problem of the current implementation, and the only fix I could think of would be for the data plane to discard the version any time it has unsubscribed from a resource and has not yet gotten a response with the corresponding nonce for this URL type. This would force the control plane to send all subscribed resources. And reading the protocol spec carefully, this problem is acknowledged:
For context, we use a hash of a sorted list of <resource_name, resource_content> as the resource version for each resource type, for all subscribed resources, with a special value for the content if the resource does not exist. We originally thought that by being clever and including the list of subscriptions inside the version info, we could safely resume a subscription without triggering a full update. @valerian-roche is in the process of making https://github.com/envoyproxy/go-control-plane play better with gRPC by making state of the world stateful, and implemented this trick. So it seems that when the client is a gRPC application, presumably with dynamic subscription per channel, the server should never do this optimization.
This raises the question of why gRPC sends a version string on new ADS streams, even though we know that in most cases it is unsafe for the management server to take this version info into consideration. Wouldn't it have been less error prone to just not include the version?
Note that this problem wouldn't happen with incremental, because un-subscriptions and re-subscriptions are explicit and resource versions are per resource (via the https://github.com/envoyproxy/data-plane-api/blob/main/envoy/api/v2/discovery.proto#L171). It'd be nice to see progress on incremental transport for gRPC at some point, mostly because it is simpler.
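To make the versioning scheme described above concrete, here is a minimal sketch of a per-type version derived from the sorted <resource_name, resource_content> pairs. The function name `typeVersion`, the `missingMarker` sentinel, and the choice of SHA-256 are assumptions made for illustration; the comment does not say which hash or encoding is actually used.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"sort"
)

// missingMarker is a hypothetical sentinel hashed in place of the content
// when a subscribed resource does not exist.
const missingMarker = "\x00missing"

// typeVersion derives a single version string for one resource type from the
// subscribed names and the resources currently known to the server.
// Assumption: SHA-256 stands in for whatever hash is really used.
func typeVersion(subscribed []string, resources map[string][]byte) string {
	// Copy before sorting so the caller's slice is left untouched.
	names := append([]string(nil), subscribed...)
	sort.Strings(names)

	h := sha256.New()
	for _, name := range names {
		content, ok := resources[name]
		if !ok {
			content = []byte(missingMarker)
		}
		// Hash each field on its own so field boundaries stay unambiguous.
		nameSum := sha256.Sum256([]byte(name))
		contentSum := sha256.Sum256(content)
		h.Write(nameSum[:])
		h.Write(contentSum[:])
	}
	return hex.EncodeToString(h.Sum(nil))
}
```

With such a scheme, a client that unsubscribes from a resource and later re-subscribes to the same, unchanged resource presents exactly the version the server would compute again, which is why the version alone cannot reveal that the client discarded the resource in between.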
Yes, this seems like the proper behavior to me. I'll look into this further.
This is very interesting. We'll talk about this a bit more internally, but one possible workaround would be to not ever send uncached resources in the initial request, and then add subscriptions for them after that. Another option: don't set the version with the initial request if we have any uncached resources being requested.
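The second option could look roughly like the sketch below. It is not grpc-go code: `typeState` and `initialVersion` are hypothetical names, and the cache is reduced to a set of resource names so that the decision itself stays visible.

```go
package main

// typeState is a hypothetical per-resource-type view of the client cache.
type typeState struct {
	lastVersion string          // last version the client accepted for this type
	cached      map[string]bool // names of resources still held in the cache
}

// initialVersion returns the version_info to send in the first request for a
// type on a new stream. If any requested resource is not cached, it returns
// the empty string so the server cannot treat the client as up to date.
func initialVersion(st typeState, requested []string) string {
	for _, name := range requested {
		if !st.cached[name] {
			return ""
		}
	}
	return st.lastVersion
}
```

For example, a client that still caches only `a` at version `v7` would send `v7` when requesting `a`, and an empty version when requesting `a` and `b`.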
@atollena if you could confirm the PR fixes these issues for you, that would be great. Also, will you need a patch release with this?
I was able to try your fix today, and it does fix the wildcard issue. Thank you!
Should there be another issue for the protocol problem I pointed out in #7013 (comment)?
I don't need a patch release, we're not using this in production yet.
I am tracking this internally. We need to discuss this cross-language to determine a solution.
We discussed the other issue today, and this is apparently a known problem:
I was aware of this, I actually quoted that same spec section in the issue :) It is imo confusing and error prone that gRPC sends a version number on reconnect when it has discarded resources, because taking it into consideration on the control plane side is definitely going to lead to subtle problems like this one. How do your control planes handle the version on a new ADS stream for non-wildcard subscriptions? I suppose they always discard it? In that case, what is the point of sending it?
Good question. tl;dr is that it might be theoretically possible to make use of this, but in practice it's very unlikely. The only one we've been able to come up with is if the server knows its clients never unsubscribe + resubscribe. We probably wouldn't set that in these scenarios if we were starting from scratch, but because that is the current behavior, we've so far decided to leave it in place in case someone is making use of that.
…when using wildcard watches During testing with grpc-xds, it was noticed that a specific behavior on the client side is not compatible with sotw subscription resumptions. When the last channel is closed, the client disconnects from the control-plane. If the same channel gets reopened later on, the connection is re-established with the same resource subscription and the last version from before is provided. In this case the control-plane currently does not return the response as the version does match, whereas grpc expects the control-plane to reply as it considers it as a "desubscription then resubscription event", which should send the resource again. In the context of wildcard watches this is not an issue, so the behavior is kept. More context on the grpc-xds discussions in this [thread](grpc/grpc-go#7013 (comment)) Signed-off-by: Valerian Roche <[email protected]>
…when using wildcard watches (#16) During testing with grpc-xds, it was noticed that a specific behavior on the client side is not compatible with sotw subscription resumptions. When the last channel is closed, the client disconnects from the control-plane. If the same channel gets reopened later on, the connection is re-established with the same resource subscription and the last version from before is provided. In this case the control-plane currently does not return the response as the version does match, whereas grpc expects the control-plane to reply as it considers it as a "desubscription then resubscription event", which should send the resource again. In the context of wildcard watches this is not an issue, so the behavior is kept. More context on the grpc-xds discussions in this [thread](grpc/grpc-go#7013 (comment)) Signed-off-by: Valerian Roche <[email protected]>
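The behavior described in this commit message can be sketched as a single decision on the first request of a new stream. This is an illustration only, not go-control-plane code: `firstRequest` and `mustRespond` are hypothetical names, and the request is reduced to the two fields that matter here.

```go
package main

// firstRequest is a hypothetical, reduced view of the first DiscoveryRequest
// received for a resource type on a newly opened stream.
type firstRequest struct {
	versionInfo   string
	resourceNames []string
}

// mustRespond reports whether the control plane should answer the first
// request on a new stream even when the client's version matches.
func mustRespond(req firstRequest, currentVersion string) bool {
	// On the first request of a stream, an empty name list means wildcard.
	wildcard := len(req.resourceNames) == 0
	if !wildcard {
		// Explicit subscriptions: the client may have dropped the resources
		// between streams, so the resumed version is not trusted.
		return true
	}
	// Wildcard watches keep the resumption optimization.
	return req.versionInfo != currentVersion
}
```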
What version of gRPC are you using?
master branch/1.61.1
What version of Go are you using (go version)?
1.22.0
What operating system (Linux, Windows, …) and version?
Linux
What did you do?
xds:///dest
target.Close(), or make it enter IDLE state. resource_names is empty in the last DiscoveryRequests. This ADS stream is still open. Note that the server will send no response, so it does not have a chance to send a new version to the client. This empty subscription is not treated as wildcard by the management server because there was at least one explicit subscription on the stream.
xds:///dest as part of it. Note that gRPC discards all resources because it doesn't subscribe to any of them yet.
xds:/// is recreated, or exits idle mode. This causes a new explicit subscription to the corresponding listener, but since the management server already sent this LDS resource as part of the wildcard subscription, it considers the client up to date and does not send a response.
What did you expect to see?
I expected one of the two following behaviours. Either:
What did you see instead?
gRPC sends discovery requests with an empty resource_names field upon reconnecting, triggering a wildcard subscription.
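The difference between the two streams comes down to how a management server interprets an empty resource_names list. The sketch below models the rule as this issue describes it; `streamSubscription` and `onRequest` are hypothetical names, not part of any real control plane.

```go
package main

// streamSubscription is a hypothetical model of how a management server
// tracks one resource type on one ADS stream.
type streamSubscription struct {
	sawExplicit bool            // an earlier request on this stream named resources
	wildcard    bool            // the stream is currently a wildcard subscription
	names       map[string]bool // explicitly subscribed resource names
}

// onRequest applies the resource_names of one DiscoveryRequest.
func (s *streamSubscription) onRequest(resourceNames []string) {
	s.names = make(map[string]bool, len(resourceNames))
	if len(resourceNames) == 0 {
		// An empty list is a wildcard only if the stream never named a
		// resource; otherwise it means "unsubscribe from everything".
		s.wildcard = !s.sawExplicit
		return
	}
	s.sawExplicit = true
	s.wildcard = false
	for _, name := range resourceNames {
		s.names[name] = true
	}
}
```

On the original stream, a request naming the listener followed by an empty request leaves `wildcard` false. On a fresh stream, the same empty request is the first one the server sees, so `wildcard` becomes true, which is the behavior reported here.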