on-demand, upstream: on-demand cluster discovery #15523
Conversation
Hi @krnowak, welcome and thank you for your contribution. We will try to review your Pull Request as quickly as possible. In the meantime, please take a look at the contribution guidelines if you have not done so already.
Thanks, at a very high level this looks like it's directionally correct. Please let me know when this is ready for real review (comments, tests, etc.).
/wait
cluster_manager.cdm_.ensureCallbacksAreInstalled();
auto [handle, discovery_in_progress] = cluster_manager.cdm_.addCallback(name, weak_callback);
This looks like it will miss discovery requests for `name` initiated in other threads - is this intentional? Could you add a comment then?
Will add more comments, sure.
Note that `addCallback` will add a callback anyway. But if the discovery for `name` was already initiated in this thread, then we don't need to do anything else - just add the callback and wait for it to be fired. If the discovery for `name` was initiated in some other worker thread, then we have no way of knowing that in this worker thread, so we try to initiate the discovery anyway. Whether the discovery will actually be initiated is something the main thread will figure out (see the code where we check `pending_cluster_creations_` in the main thread).
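To make that flow concrete, here is a minimal sketch; `addCallback` and `pending_cluster_creations_` are the names used in this thread, while the dispatcher member and the dedup details are my assumptions, not the actual implementation:

```cpp
// Worker thread: register the callback unconditionally. If this worker has
// already initiated discovery for `name`, the callback just waits to fire.
auto [handle, discovery_in_progress] = cdm_.addCallback(name, weak_callback);
if (discovery_in_progress) {
  return handle;
}
// Otherwise this worker cannot know what other workers did, so hop to the
// main thread and let it decide whether a discovery needs to be started.
main_thread_dispatcher_.post([this, name] {
  // Main thread: only the first request for `name` across all workers
  // actually triggers an on-demand CDS update.
  if (pending_cluster_creations_.insert(name).second) {
    odcds_->updateOnDemand(name);
  }
});
return handle;
```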
ProtobufMessage::ValidationVisitor& validation_visitor)
    : cm_(cm) {
  if (proto_config.has_odcds_config()) {
    odcds_ =
This creates an instance of the CDS API/subscription in a worker thread, which is different from the rest of xDS, which runs on the main thread. I think this is an issue - there are concurrency (definitely with ADS configured) and synchronisation issues here. I probably missed it - where is the CDS subscription started?
This is a good question. I'm still in the process of writing the tests, so this will come out soon. But yeah, it looks like creating CDS in a worker thread calls into the cluster manager directly, so I would either need to make it thread safe, or postpone the creation of CDS until I'm in the main thread.
Basically I didn't want to start the subscription, because it would result in periodically requesting the management server, which is not what I wanted. I'll have a look at what VHDS is doing here.
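For context, a rough sketch of the lazy-start idea that the later commits adopt ("starts the subscription on the first update request"); `OdCdsApiImpl` and `updateOnDemand` are names that appear later in this thread, while `subscription_` and the started flag are assumed details:

```cpp
void OdCdsApiImpl::updateOnDemand(const std::string& cluster_name) {
  if (!subscription_started_) {
    // Defer starting the xDS subscription until the first on-demand request,
    // so a configured-but-unused ODCDS never opens a stream to the
    // management server.
    subscription_started_ = true;
    subscription_->start({});
  }
  // Ask the management server for just this one cluster.
  subscription_->requestOnDemandUpdate({cluster_name});
}
```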
We did a bit of tracking of where this is called from, and I have the impression that both `createRouteSpecificFilterConfigTyped` and `createFilterFactoryFromProtoTyped` are called from the main thread. And inside those functions we create the CDS, so it ought to be fine. Maybe the factories that those functions return are called from worker threads, though?
> Basically I didn't want to start the subscription, because it would result in periodically requesting the management server, which is not what I wanted. I'll have a look at what VHDS is doing here.

A subscription is basically a type-specific (i.e. CDS, RDS, VHDS, etc.) gRPC connection to a management server; in the case of ADS, all subscriptions are multiplexed over a single gRPC connection. The connection needs to be initiated (the subscription started) in order to be able to both send and receive xDS updates.
Note that VHDS establishes a separate subscription in addition to the base RDS one and calculates a common state (based on the RDS state and VHDS updates). I encourage you to take a closer look at the various xDS implementations, as well as the muxes. Please also take a look at the docs I linked to earlier.
Would it be possible for you to write a brief explanation of how you plan to handle the configuration and starting of on-demand CDS subscriptions? Did you have a chance to read https://blog.envoyproxy.io/envoy-threading-model-a8d44b922310?
message OnDemand {
  option (udpa.annotations.versioning).previous_message_type =
      "envoy.config.filter.http.on_demand.v2.OnDemand";

  config.core.v3.ConfigSource odcds_config = 1;
If the only knob here is the presence of a `ConfigSource`, then how will we configure ODCDS when we're using new-style resource names and don't need a `ConfigSource` because we're using authority-based routing?
I think the presence of an ODCDS `PerRouteConfig` would indicate that we need to treat the name as on-demand. The meaning of `self` becomes interesting, as it might mean two different things for the filter-level config (received from LDS) and the `PerRouteConfig` (from RDS).
Force-pushed from e2bfaaa to 6a852da.
Hi, updated the PR. I still have a hitch with the integration test failing. I'm not sure whether something is wrong with my changes in the cluster manager or the test itself is wrong. I have asked on Slack how to enable logs when running the integration tests - that should hopefully help me debug the issue.

To answer some questions:

@dmitri-d: I wrote a separate class instead of using CdsApi{,Impl} - it shares most of the functionality with CdsApiImpl, but it starts the subscription on first use instead of up-front.

@markdroth: Unfortunately I'm not very well oriented in matters of new-style cluster naming and authority-based routing. I was pointed to the document at the last community meeting, so I'll try to get through it soon.
Force-pushed from 6a852da to 8ccba60.
Got a bit further (it was a …). I tried deferring the destruction to the main thread (just to be sure), but it crashes at the same place too. Will investigate.
Force-pushed from 8ccba60 to 8e6735c.
Fixed the crash (it was caused by bad member ordering in OdCdsApiImpl). Should be ready for a review.
Force-pushed from 92c5505 to 727c613.
We plan to add an OdCdsApiImpl class next to CdsApiImpl. Its config-update callback functionality would be similar to CdsApiImpl's, but with some extra code. To avoid duplicating the code, we move it into a helper class that will be used by both OdCdsApiImpl and CdsApiImpl. Signed-off-by: Krzesimir Nowak <[email protected]>
Each thread using this functionality will need to instantiate its own OdCdsApi instance and pass it, together with a callback, to ClusterManager's requestOnDemandClusterDiscovery. The cluster manager will use the passed OdCdsApi to perform the discovery. When the discovery is finished, ClusterManager invokes the callbacks in the requesting thread's context. OdCdsApiImpl starts the subscription on the first update request; subsequent requests use the on-demand update functionality of the subscription. That way, subscriptions are only started when actually needed. Signed-off-by: Krzesimir Nowak <[email protected]>
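A hedged sketch of the caller side this commit message describes; the exact types and signatures (`ClusterDiscoveryCallback`, `ClusterDiscoveryStatus`, `OdCdsApiImpl::create`, the handle) are illustrative guesses assembled from the names used in this thread, not the final API:

```cpp
// Worker thread (e.g. the on-demand filter during request handling): build a
// per-thread OdCdsApi and hand it to the cluster manager with a callback.
auto callback = std::make_shared<ClusterDiscoveryCallback>(
    [this](ClusterDiscoveryStatus status) {
      // Invoked later in this thread's context, once discovery finishes
      // (or times out); continue or fail the request accordingly.
      onClusterDiscoveryCompletion(status);
    });
OdCdsApiPtr odcds = OdCdsApiImpl::create(odcds_config_, cluster_manager_);
// The returned handle keeps the callback registration alive; the cluster
// manager only holds a weak reference to the callback.
discovery_handle_ = cluster_manager_.requestOnDemandClusterDiscovery(
    std::move(odcds), cluster_name_, callback);
```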
This extends the extension to start the on-demand cluster discovery when a cluster is missing, if such discovery is enabled by specifying the config source for ODCDS in the filter's configuration. The filter now also supports per-route configuration, allowing ODCDS to be enabled for a specific virtual host or route. Signed-off-by: Krzesimir Nowak <[email protected]>
This sidesteps some of the corner cases in the cluster discovery, and matches the expectations of the ODCDS lifetime. Signed-off-by: Krzesimir Nowak <[email protected]>
This greatly simplifies ODCDS since it does not need to track the discovered names any more. Currently the timeout is set arbitrarily to 5 seconds. Signed-off-by: Krzesimir Nowak <[email protected]>
This is to address an issue raised during a review of the CdsApiHelper refactor. Signed-off-by: Krzesimir Nowak <[email protected]>
Signed-off-by: Krzesimir Nowak <[email protected]>
Now that #15806 is merged, do you expect me to merge the main branch into the PR branch and fix the conflicts?
Yes. I don't know how else we would proceed. :)
Heh, right. The projects I worked on so far allowed rebasing, so I just wanted to be sure. :) Anyway, I updated the PR. I think I have addressed most (or all) of the initial review issues.
@@ -0,0 +1,85 @@
#include "common/upstream/odcds_api_impl.h"
Sorry, I'm a bit late to the party here, but why do we have a different CDS implementation for on-demand vs. regular CDS? In all the other on-demand models (VHDS, SRDS), we reuse the existing machinery and just make it on-demand aware. Here we have a diverged CDS implementation. This will probably have implications down the track when we make bug fixes for regular delta CDS but miss them for ODCDS, and vice versa.
I would like to see better awareness of on-demand as a horizontal concern in Envoy that spans subsystems, rather than doing what is expedient or local to CDS for ODCDS here.
As I said offline, I haven't actually reviewed the code in detail yet, just the interfaces. I will review in more detail today. My expectation is most/all of the code is shared. I'm not sure it is possible to implement on-demand in main CDS due to the SoTW wildcard semantics, but we can see (at a certain point though it may just mean moving the same code split to a different location).
> Sorry, I'm a bit late to the party here, but why do we have a different CDS implementation for on-demand vs. regular CDS?

There is not much of a code difference, really - most of the code is shared through the CdsApiHelper. The differences between CDS and ODCDS are mostly about when to start the subscription (for CDS it is sometime at Envoy startup, for ODCDS on the first request). I suppose I could make CDS on-demand aware, but I'm afraid that would result in awkward code inside it.

> In all the other on-demand models (VHDS, SRDS), we reuse the existing machinery and just make it on-demand aware. Here we have a diverged CDS implementation. This will probably have implications down the track when we make bug fixes for regular delta CDS but miss them for ODCDS, and vice versa.

I tried to move most of the code into the helper class, so such bug fixes would most certainly land in the helper class, not in CDS/ODCDS itself.
> I would like to see better awareness of on-demand as a horizontal concern in Envoy that spans subsystems, rather than doing what is expedient or local to CDS for ODCDS here.
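A rough sketch of the helper-class split described above; `CdsApiHelper` is the name from the commits, everything else is illustrative rather than Envoy's actual declarations:

```cpp
// Shared helper: both CDS flavours delegate their config-update handling
// here, so delta-CDS bug fixes land in a single place.
class CdsApiHelper {
public:
  void onConfigUpdate(/* decoded cluster resources, removed names, version */);
};

// Regular CDS: the subscription is started up-front, during Envoy startup.
class CdsApiImpl : public CdsApi {
  CdsApiHelper helper_;
};

// On-demand CDS: the subscription starts lazily, on the first request.
class OdCdsApiImpl : public OdCdsApi {
  CdsApiHelper helper_;
};
```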
  // this name, so nothing more left to do here.
  return std::move(handle);
}
dispatcher_.post([this, odcds = std::move(odcds), weak_callback, name,
As per my other comment, I think this is tricky cross-thread, lifetime-related code for on-demand that has already been addressed in both VHDS and SRDS. Re-inventing these patterns here is problematic; we are building guaranteed technical debt.
Sorry to be a downer - I appreciate the work being done here, but I know from experience how hard it was to get things like on-demand VHDS right. We need common patterns for deferral and resumption across the on-demand xDS implementations.
Yes, I agree - if we can share this out, that would be good.
I had a look at what VHDS is doing, and I can see the differences (see the sketch after this list):
- I'm "deduplicating" the requests, so two worker threads asking for the cluster "foo" will result in one request instead of two.
- I'm installing a timeout on the discovery, so the callbacks list won't grow indefinitely if the cluster "foo" does not exist.

I guess I can simplify the code, but maybe let's discuss it in the other PR, where I split out the CM changes.
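For the second point, a minimal sketch of the timeout, assuming a timer armed on the main thread when the discovery starts; `pending_cluster_creations_` comes from this thread, the rest is illustrative:

```cpp
// Armed when the first request for `name` starts a discovery, so the callback
// list cannot grow forever waiting for a cluster that never arrives.
Event::TimerPtr timer = dispatcher_.createTimer([this, name] {
  // Timed out: fire all registered callbacks with a "missing" status and
  // forget the pending discovery so a later request can retry.
  notifyDiscoveryTimeout(name);
  pending_cluster_creations_.erase(name);
});
timer->enableTimer(std::chrono::milliseconds(5000)); // arbitrary 5s, per the commits
```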
Hi, thanks for the updates. This PR is still too large. For the next set of changes can we:
1. Split out only the cluster manager changes and the tests around those changes.
2. Save the on-demand filter changes for the next PR.

I think if we do (1) we can more easily evaluate @htuch's concerns as well as my high-level interface concerns. Thank you!
/wait
/**
 * Abstract interface for an on-demand CDS API provider.
 */
class OdCdsApi {
public:
  virtual ~OdCdsApi() = default;

  /**
   * File an on-demand request for a cluster.
   */
  virtual void updateOnDemand(const std::string& cluster_name) PURE;
};

using OdCdsApiPtr = std::unique_ptr<OdCdsApi>;
Somewhat related to @htuch's comments, but why is this public interface needed? Conceptually, something asks the cluster manager to fetch a cluster, and gets a callback when it completes or fails. I don't think any of this should be exposed "externally"; it should be an internal implementation detail.
Since the config source for the on-demand discovery is provided in the per-route config, which makes it a separate object from the `cds_config` that CM has, I need to pass a "CDS" object to CM. It needs to be a public interface, because it is used in the `ClusterManager` interface. I had two options here: either extend the `CdsApi` interface with an `updateOnDemand` function and pass that to CM, or create a new interface (`OdCdsApi`). I went with the latter, because I found the former awkward: `CdsApi` provides `initialize`, `setInitializedCb`, and `versionInfo`, and for the on-demand discovery I would use none of them.
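To illustrate the chosen option: the cluster manager entry point only needs the minimal interface. The `ClusterManager` signature below is a guess assembled from names appearing in this thread (`requestOnDemandClusterDiscovery`, `OdCdsApiPtr`, `weak_callback`), not the real declaration:

```cpp
// Cluster manager accepts the minimal OdCdsApi rather than a full CdsApi,
// whose initialize()/setInitializedCb()/versionInfo() are unused here.
class ClusterManager {
public:
  virtual ClusterDiscoveryCallbackHandlePtr requestOnDemandClusterDiscovery(
      OdCdsApiPtr odcds, const std::string& name,
      ClusterDiscoveryCallbackWeakPtr weak_callback) PURE;
};
```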
  // this name, so nothing more left to do here.
  return std::move(handle);
}
dispatcher_.post([this, odcds = std::move(odcds), weak_callback, name,
> Yes, I agree - if we can share this out, that would be good.
I'd like a bit more info on exactly which problem we're trying to solve here. As I mentioned in this comment thread in the design doc, I think there are two closely related but distinct problems in this space, and it's not clear to me what our strategy is for each of the two. I'm concerned that we may be conflating them here in a way that may limit flexibility.
Hi, please see #15857. I suppose I could reuse this PR to add the on-demand filter changes after #15857 settles.
What I'm trying to solve here is enabling (or disabling) on-demand cluster discovery at the connection manager, virtual host, or route level, especially together with the cluster_header route action. What I'm not covering in this PR is the case where I statically specify a cluster name for some route and expect it to be discovered at request-handling time instead of at Envoy startup.
@krnowak Sorry, I don't quite follow. If this will not work for routes with static cluster names, then isn't it only for the cluster_header route action? The only three options for any route action are a cluster name, a …
I'd imagine that "static" cluster names (or weighted clusters) would work if they were provided by RDS or VHDS, for example. But honestly, I haven't checked that yet. I think the reason I put a boolean flag or a config source inside the route in my first version of the code was to have access to them when checking the static clusters here - I could basically disable the check and discover the static cluster at request time. But now that this setting has moved to the extension config, I have no clue how to check from there whether the route has on-demand discovery enabled.
This pull request has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!
This pull request has been automatically closed because it has not had activity in the last 37 days. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!
I'll open a new one with the extension changes after we settle #15857.
This enables a client to send a special resource named * to a server as a way to express interest in a wildcard subscription. That way the wildcard subscription can coexist with subscriptions to other resources on the same stream. The wildcard resource, like any other resource, can be subscribed to and unsubscribed from at any time. The legacy way of enabling wildcard subscription (by sending an empty initial message) is still supported, but otherwise behaves just like a subscription to *. This behavior is gated behind the enabled-by-default "envoy.restart_features.explicit_wildcard_resource" runtime guard.

Risk Level: High, changes in wildcard mode.
Testing: As a part of #15523. Also done internally.
Docs Changes: Updated version history.
Release Notes: xds: implemented the wildcard resource, which can be used by xDS to express an interest in a wildcard subscription in a way that allows subscriptions to other resources to coexist on the same stream instead of being ignored. This means that the resource name * is reserved. See :ref:`wildcard mode description <xds_protocol_resource_hints>` and :ref:`unsubscribing from resources <xds_protocol_unsubscribing>` for details. This behavior is gated behind the envoy.restart_features.explicit_wildcard_resource runtime guard.
Platform Specific Features: N/A
Runtime guard: envoy.restart_features.explicit_wildcard_resource, enabled by default.
Signed-off-by: Krzesimir Nowak <[email protected]>
I'm filing a work-in-progress PR for the feature to gather some initial feedback on the code. Right now I'm working on writing tests.
Commit Message:
This PR adds an on-demand cluster discovery feature to Envoy. The discovery process is driven through the on-demand filter. The on-demand filter is also updated to use a per-route configuration to enable ODCDS.
Additional Description:
There is a design document for the feature - https://docs.google.com/document/d/1AqhQnrX_7SS6PAkNwoaA2peCtk9htgwSBTRg1gzFPNw/edit#
I haven't yet implemented checking whether CDS is using delta, as ODCDS should not be used together with SotW CDS.
Risk Level:
Medium (new feature)
Testing:
Currently writing tests, will also do some manual testing with cluster_header in RouteAction.
Docs Changes:
TODO (should cover both API docs and some usage documentation for the on-demand filter, I think).
Release Notes:
New feature: on_demand: on-demand discovery of clusters.
Platform Specific Features:
N/A
Partially addresses #2500 (the clusters part).