CDS Updates with many clusters often fail #12138
Comments
Are these clusters changing on a regular basis? CDS updates are known to be somewhat expensive (though I wouldn't expect them to block the main thread for as long as you're describing). If most of the clusters are the same then each update should just be a hash of the proto + compare with existing, which still isn't free but >100s update latency sounds like a lot. I'm not sure if we've been looking much into optimizing CDS cluster load time (maybe @jmarantz knows?), but the biggest improvement you'll get is most likely from using Delta CDS, where only the updated clusters are sent instead of the entire state of the world.
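To make the hash-and-compare idea above concrete, here is a minimal standalone sketch (plain C++, not Envoy's actual code; the names and the use of std::hash over a serialized config are assumptions for illustration) of skipping unchanged clusters during a state-of-the-world update:

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>

// Stand-in for a serialized Cluster proto; Envoy hashes the proto message
// itself rather than a plain string.
using SerializedCluster = std::string;

struct ActiveCluster {
  std::size_t config_hash{0};
  // ... the materialized cluster state would live here ...
};

// Returns true if the cluster was added or changed (i.e. needs the expensive
// re-initialization), false if the update can be skipped as a no-op.
bool addOrUpdateCluster(std::unordered_map<std::string, ActiveCluster>& active,
                        const std::string& name, const SerializedCluster& config) {
  const std::size_t new_hash = std::hash<std::string>{}(config);
  auto it = active.find(name);
  if (it != active.end() && it->second.config_hash == new_hash) {
    return false; // unchanged: skip re-initialization
  }
  active[name] = ActiveCluster{new_hash};
  return true; // new or changed: full cluster (re)load happens here
}
```

The comparison itself is cheap; the cost discussed in this thread comes from hashing every cluster in the update and fully initializing the clusters that did change.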
Yeah, my understanding is that Project Contour doesn't use this API yet and that there would be significant work required to make use of the delta API.
I'm also a bit unclear on the underlying details, but however the xDS APIs are being used in Contour, it seems that some typical/normal changes in the state of the Kubernetes cluster objects prompt a CDS evaluation even when there is no real update, because a cluster that's in steady state as far as objects go (but not necessarily pods/endpoints) can cause Envoy to get deadlocked like I described with 7000+ clusters.
I believe that the expense of CDS evaluation and possibly something in the gRPC sessions are combining to cause the issue.
-Jonathan Huff
CDS cluster initialization is definitely taking time. For 800 clusters, we have seen
This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or other activity occurs. Thank you for your contributions.
This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted". Thank you for your contributions.
A similar behaviour happened with some of our nodes. The node I investigated received 280 clusters and removed 1, as shown in this log:
And, as you can see, the last CDS update time is approximately at that time:
After that, the pending EDS responses were processed and Envoy got stuck. This issue started when we switched from REST to gRPC to communicate with our control plane. The issue is only resolved when we restart Envoy. We are using version 1.14.5. EDIT: This behaviour was detected around 13:00 on October 13th. EDIT 2: The issue was caused by the connection between Envoy and the control plane being closed. We had not enabled TCP keepalive on connections to the control plane, so when the firewall dropped the packets of that TCP connection, Envoy was not able to detect it and waited indefinitely for the xDS response from the control plane. Thanks.
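For reference, here is a hedged sketch of the underlying socket options the comment refers to (plain POSIX/Linux sockets, not Envoy code or configuration; the interval values are arbitrary examples). It shows the mechanism that would have let the client notice a silently dropped connection:

```cpp
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

// Enables keepalive probing on a connected TCP socket so that a peer whose
// packets are silently dropped (e.g. by a firewall) is eventually detected
// as dead instead of the client waiting forever. Option names are Linux-specific.
int enableTcpKeepalive(int fd) {
  const int on = 1;
  const int idle_secs = 60;     // start probing after 60s of silence (example value)
  const int interval_secs = 10; // probe every 10s (example value)
  const int probes = 6;         // declare the peer dead after 6 failed probes (example value)
  if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) != 0) return -1;
  if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle_secs, sizeof(idle_secs)) != 0) return -1;
  if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &interval_secs, sizeof(interval_secs)) != 0) return -1;
  return setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &probes, sizeof(probes));
}
```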
@htuch @mattklein123 can we please reopen this? This is still an issue.
This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.
We have a similar understanding of cluster scalability. I think it would be helpful to have some flamegraphs from folks where possible so we can validate whether these are the same root cause. @pgenera is working on landing #14167 to benchmark. We have known overheads in stats processing that @jmarantz has landed some recent PRs to address; see #14028 and #14312.
See also #14439. Moreover, for any issues with CDS update performance, flame graphs would be most helpful!
I experimented a bit with a fake xDS server and tried to update 10000 routes containing RE2-based matching rules. It took 1-2 secs to compile them all. Then I tried to load a similar number of STRICT_DNS clusters. The flamegraph points to the per-endpoint copy of LocalityLbEndpoints made in the strict DNS cluster's resolve targets. The fix seems to be trivial, but I wonder if the reporters use STRICT_DNS clusters.
Nice catch! Any downside to just making that fix?
The fix I tried is:

```diff
--- a/source/common/upstream/strict_dns_cluster.h
+++ b/source/common/upstream/strict_dns_cluster.h
@@ -38,7 +38,7 @@ private:
   uint32_t port_;
   Event::TimerPtr resolve_timer_;
   HostVector hosts_;
-  const envoy::config::endpoint::v3::LocalityLbEndpoints locality_lb_endpoint_;
+  const envoy::config::endpoint::v3::LocalityLbEndpoints& locality_lb_endpoint_;
   const envoy::config::endpoint::v3::LbEndpoint lb_endpoint_;
   HostMap all_hosts_;
 };
```

With it applied Envoy seems to load the update instantly, but I'm not sure about the lifetime of the referenced message.
Yeah I was wondering about lifetime issues. In a quick scan it looks like there isn't a ton of data pulled from the endpoints structure in this class; maybe it could be pulled out individually in the ctor rather than copying the whole thing?
A deeper fix might be to move away from using the proto representation for endpoints inside DNS clusters. For convenience, I think we've inherited this view of endpoints that mostly makes sense for static or EDS endpoints, but the merge costs are non-trivial at scale in this case.
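A hedged sketch of what the two suggestions above might look like (illustrative names only, not Envoy's actual types; which fields a resolve target really needs is an assumption): copy just the handful of fields used into a plain struct in the constructor instead of storing the whole LocalityLbEndpoints proto per target.

```cpp
#include <cstdint>
#include <string>
#include <utility>

// Stand-in for the proto, kept minimal so the sketch is self-contained.
struct FakeLocalityLbEndpoints {
  std::string zone;
  uint32_t load_balancing_weight{1};
  uint32_t priority{0};
  // ... plus the full list of endpoints, which is what makes copies expensive ...
};

// Fixed-size per-target data extracted once in the constructor.
struct ResolveTargetInfo {
  std::string zone;
  uint32_t load_balancing_weight{1};
  uint32_t priority{0};
  std::string dns_address;
  uint32_t port{0};

  ResolveTargetInfo(const FakeLocalityLbEndpoints& lle, std::string address, uint32_t p)
      : zone(lle.zone), load_balancing_weight(lle.load_balancing_weight),
        priority(lle.priority), dns_address(std::move(address)), port(p) {}
};
```

The per-target cost then stays constant regardless of how many endpoints the locality contains.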
Submitted #15013 as an attempt to fix it.
@rojkov Thanks for that fix. But this also happens when the majority of the clusters are EDS clusters, so your PR will not fully fix this issue.
@ramaraochavali Alright, I'll drop the close tag from the PR's description to keep this issue open for now.
…get (#15013)
Currently Envoy::Upstream::StrictDnsClusterImpl::ResolveTarget, when instantiated for every endpoint, also creates a full copy of the envoy::config::endpoint::v3::LocalityLbEndpoints the endpoint belongs to. Given that the message contains all the endpoints defined for it, this leads to quadratic growth of consumed memory as the number of endpoints increases, even though those copies of endpoints are not used. Instead of creating a copy of envoy::config::endpoint::v3::LocalityLbEndpoints, use a reference to a single copy stored in Envoy::Upstream::StrictDnsClusterImpl and accessible from all resolve targets during their life span.
Risk Level: Low
Testing: unit tests
Docs Changes: N/A
Release Notes: N/A
Platform Specific Features: N/A
May contribute to #12138, #14993
Signed-off-by: Dmitry Rozhkov <[email protected]>
I've played a bit more with CDS and EDS: I loaded 10000 clusters with EDS load assignments. Anyway, it seems the process of a CDS update has three phases, and the CPU profiles for these phases are very different. Here's the first phase: by the end of it Envoy consumes about 500M. Then the second phase starts; during this phase Envoy consumes memory faster until it has allocated about 3800M (here's the heap profile just before the third phase). Then Envoy sends the request to my fake ADS server. At this point I see a burst in network traffic: about 450M is sent from Envoy to ADS. After that Envoy deallocates memory and its consumption drops back to 500M. So, perhaps in congested cluster networks the third phase may take longer. Adding a single cluster to these 10000 in a new CDS update doesn't seem to put any significant load on the CPU (at least compared with the initial loading of 10000).
Thanks for collecting these graphs. I assume from the numbers above that these were collected with "-c dbg", right? Are you reluctant to collect them with "-c opt" because the names in the flame-graphs are less helpful, or do you need line info somewhere? One thing I use sometimes is "OptDebug":
There's some pretty strange stuff in the pprof heap profile.
Maybe the image is rendering "alloc_space" (+X bytes for malloc(x), +0 for free()). @rojkov can you confirm? |
No, that's "inuse". This is what the profiler log looked like:
New findings:

```cpp
// Remove the previous cluster before the cluster object is destroyed.
secondary_init_clusters_.remove_if(
    [name_to_remove = cluster.info()->name()](ClusterManagerCluster* cluster_iter) {
      return cluster_iter->cluster().info()->name() == name_to_remove;
    });
secondary_init_clusters_.push_back(&cm_cluster);
```

Here secondary_init_clusters_ is a std::list, so every cluster added during initialization triggers a scan over all clusters still pending initialization, which makes warming a large CDS update quadratic in the number of clusters.
That’s excellent work! It would be awesome if a data structure change could bring down the operational time by a lot.
You can use the
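A hedged sketch of one possible data-structure change along these lines (illustrative names, not Envoy's actual code): keep a name-to-iterator index next to the pending-initialization list so removal is O(1) instead of a full scan.

```cpp
#include <iterator>
#include <list>
#include <string>
#include <unordered_map>

struct PendingCluster {
  std::string name;
  // ... warming/init state would live here ...
};

// Keeps insertion order (like the existing std::list) but adds an index so a
// cluster can be removed without scanning the whole list.
class PendingInitIndex {
public:
  void add(PendingCluster* c) {
    pending_.push_back(c);
    index_[c->name] = std::prev(pending_.end());
  }
  void remove(const std::string& name) {
    auto it = index_.find(name);
    if (it == index_.end()) {
      return;
    }
    pending_.erase(it->second); // O(1) erase via the stored iterator
    index_.erase(it);
  }
  const std::list<PendingCluster*>& pending() const { return pending_; }

private:
  std::list<PendingCluster*> pending_;
  std::unordered_map<std::string, std::list<PendingCluster*>::iterator> index_;
};
```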
Commit Message: upstream: avoid double hashing of protos in CDS init
Additional Description: Currently Cluster messages are hashed unconditionally upon instantiation of ClusterManagerImpl::ClusterData even if their hashes are known already. Calculate the hashes outside of the ClusterManagerImpl::ClusterData ctor to make use of already calculated ones.
Risk Level: Low
Testing: unit tests, manual tests
Docs Changes: N/A
Release Notes: N/A
Platform Specific Features: N/A
Contributes to #12138
Signed-off-by: Dmitry Rozhkov <[email protected]>
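A hedged sketch of the "hash once, pass it down" idea from the commit message above (illustrative names; std::hash over a serialized config stands in for the real proto hashing):

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <utility>

std::size_t hashSerializedCluster(const std::string& serialized) {
  return std::hash<std::string>{}(serialized);
}

class ClusterData {
public:
  // The caller passes the hash it already computed while diffing the update,
  // so the config is not hashed a second time in the constructor.
  ClusterData(std::string serialized, std::size_t precomputed_hash)
      : serialized_(std::move(serialized)), config_hash_(precomputed_hash) {}

  std::size_t configHash() const { return config_hash_; }

private:
  std::string serialized_;
  std::size_t config_hash_;
};
```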
I was curious why an update with 10k EDS clusters turns into >100M of traffic from Envoy to the server and sniffed it with Wireshark. So, every EDS discovery request for a cluster includes a copy of the Node message. AFAIU the Node message is needed to identify the client. I tried to limit the amount of data included in a request to the Node's id and cluster fields:

```diff
@@ -63,13 +64,20 @@ void GrpcMuxImpl::sendDiscoveryRequest(const std::string& type_url) {
   if (api_state.must_send_node_ || !skip_subsequent_node_ || first_stream_request_) {
     // Node may have been cleared during a previous request.
-    request.mutable_node()->CopyFrom(local_info_.node());
+    envoy::config::core::v3::Node n;
+    n.set_id(local_info_.node().id());
+    n.set_cluster(local_info_.node().cluster());
+    request.mutable_node()->CopyFrom(n);
     api_state.must_send_node_ = false;
   } else {
@@ -93,7 +103,11 @@ GrpcMuxWatchPtr GrpcMuxImpl::addWatch(const std::string& type_url,
   // TODO(gsagula): move TokenBucketImpl params to a config.
   if (!apiStateFor(type_url).subscribed_) {
     apiStateFor(type_url).request_.set_type_url(type_url);
-    apiStateFor(type_url).request_.mutable_node()->MergeFrom(local_info_.node());
+    envoy::config::core::v3::Node n;
+    n.set_id(local_info_.node().id());
+    n.set_cluster(local_info_.node().cluster());
+    apiStateFor(type_url).request_.mutable_node()->MergeFrom(n);
     apiStateFor(type_url).subscribed_ = true;
     subscriptions_.emplace_back(type_url);
     if (enable_type_url_downgrade_and_upgrade_) {
```

Also it helped to slash a couple of seconds off the update time.
With full Node:
With limited Node:
Looks like Envoy is bound to perform slightly worse with every new release as the number of extensions enumerated in the Node message grows.
@rojkov this is already supported for SotW (but not delta) xDS, see
I'm thinking this is something else we would prefer to default to true (but we need to do a deprecation dance to move to it).
@adisuissa this might be another potential cause of buffer bloat in the issue you are looking at.
This is a performance issue, not a bug per se.
When doing CDS updates with many clusters, Envoy will often get "stuck" evaluating the CDS update. This manifests as EDS failing, and in more extreme cases Envoy ceases to receive any xDS updates at all. When this happens, Envoy needs to be restarted to get it updating again.
In our case we're seeing issues with the current implementation of CdsApiImpl::onConfigUpdate (envoy/source/common/upstream/cds_api_impl.cc, lines 52 to 101 at 2966597) with a number of clusters in the 3000-7000 range. If Envoy could speedily evaluate a CDS update with 10000 clusters, this would represent a HUGE improvement in Envoy's behavior for us. Right now, only around 2500 clusters in a CDS update seems to evaluate in a reasonable amount of time.
Because CdsApiImpl::onConfigUpdate pauses EDS while doing CDS evaluation, Envoy's config will drift (a rough sketch of this pause/apply/resume sequence follows the context notes below). With many clusters in CDS, this can mean Envoy is hundreds of seconds behind what is current, which results in 503s.
Some context:
- Contour currently doesn't use the incremental xDS APIs of Envoy, so when K8s Services change in the cluster it sends ALL of the current config to Envoy again, which means a small change like adding or removing a K8s Service that maps to an Envoy cluster results in Envoy having to re-evaluate ALL clusters.
- With enough K8s Services behind an Ingress (7000+), Envoy can spontaneously cease to receive any new updates indefinitely, and will fail to do EDS because it gets stuck in the CDS evaluation.
- Given a high enough number of clusters this would be a Very Hard Problem, but given that we're in the low thousands, I'm hoping there are some things that could be done to improve performance without resorting to exotic methods.
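As referenced above, here is a rough, hedged sketch of the pause-EDS / apply-CDS / resume-EDS sequence (illustrative names only, not Envoy's actual API) showing why a slow cluster walk leaves endpoint data stale:

```cpp
#include <functional>
#include <string>
#include <vector>

struct XdsMux {
  bool eds_paused{false};
  void pauseEds() { eds_paused = true; }   // queued EDS responses wait while paused
  void resumeEds() { eds_paused = false; } // drift accumulated during the pause is applied now
};

// Simplified shape of a state-of-the-world CDS update handler.
void onCdsConfigUpdate(XdsMux& mux, const std::vector<std::string>& cluster_names,
                       const std::function<void(const std::string&)>& addOrUpdateCluster) {
  mux.pauseEds();
  // With thousands of clusters this loop is where the main thread spends its
  // time; while it runs, EDS stays paused and endpoints go stale, which is
  // the drift (and the eventual 503s) described above.
  for (const auto& name : cluster_names) {
    addOrUpdateCluster(name);
  }
  mux.resumeEds();
}
```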
Please let me know if there's any information I can provide that could help!