CDS Updates with many clusters often fail #12138
Comments
Are these clusters changing on a regular basis? CDS updates are known to be somewhat expensive (though I wouldn't expect them to block the main thread for as long as you're describing). If most of the clusters are the same then each update should just be a hash of the proto + compare with existing, which still isn't free but >100s update latency sounds like a lot. I'm not sure if we've been looking much into optimizing CDS cluster load time (maybe @jmarantz knows?), but the biggest improvement you'll get is most likely from using Delta CDS, where only the updated clusters are sent instead of the entire state of the world.
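To make the hash-and-compare idea above concrete, here is a minimal standalone sketch (plain C++, not Envoy's actual code; the names and the use of std::hash over a serialized config are assumptions for illustration) of skipping unchanged clusters during a state-of-the-world update:

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>

// Stand-in for a serialized Cluster proto; Envoy hashes the proto message
// itself rather than a plain string.
using SerializedCluster = std::string;

struct ActiveCluster {
  std::size_t config_hash{0};
  // ... the materialized cluster state would live here ...
};

// Returns true if the cluster was added or changed (i.e. needs the expensive
// re-initialization), false if the update can be skipped as a no-op.
bool addOrUpdateCluster(std::unordered_map<std::string, ActiveCluster>& active,
                        const std::string& name, const SerializedCluster& config) {
  const std::size_t new_hash = std::hash<std::string>{}(config);
  auto it = active.find(name);
  if (it != active.end() && it->second.config_hash == new_hash) {
    return false; // unchanged: skip re-initialization
  }
  active[name] = ActiveCluster{new_hash};
  return true; // new or changed: full cluster (re)load happens here
}
```

The comparison itself is cheap; the cost discussed in this thread comes from hashing every cluster in the update and fully initializing the clusters that did change.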
Yeah, my understanding is that Project Contour doesn't use this API yet and that there would be significant work required to make use of the delta API.
I'm also a bit unclear on the underlying details, but however the xDS APIs are being used in Contour, it seems that some typical/normal changes in the state of the Kubernetes cluster objects prompt a CDS evaluation even when there is no real update, because a cluster that's in steady state as far as objects go (but not necessarily pods/endpoints) can cause Envoy to get deadlocked like I described with 7000+ clusters.
I believe that the expense of CDS evaluation and possibly something in the gRPC sessions are combining to cause the issue.
-Jonathan Huff
CDS cluster initialization is definitely taking time. For 800 clusters, we have seen
This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or other activity occurs. Thank you for your contributions.
This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted". Thank you for your contributions.
A similar behaviour happened with some of our nodes. The node I investigated received 280 clusters and removed 1, as shown in this log:
And, as you can see, the last CDS update time is approximately at that time:
After that, the pending EDS responses were processed and Envoy got stuck. This issue started when we switched from REST to gRPC to communicate with our control plane. The issue is only resolved when we restart Envoy. We are using version 1.14.5. EDIT: This behaviour was detected around 13:00 on October 13th. EDIT 2: The issue was caused by the connection between Envoy and the control plane being closed. We had not enabled TCP keepalive on connections to the control plane, so when the firewall dropped the packets of that TCP connection, Envoy was not able to detect it and waited indefinitely for the xDS response from the control plane. Thanks.
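For reference, here is a hedged sketch of the underlying socket options the comment refers to (plain POSIX/Linux sockets, not Envoy code or configuration; the interval values are arbitrary examples). It shows the mechanism that would have let the client notice a silently dropped connection:

```cpp
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

// Enables keepalive probing on a connected TCP socket so that a peer whose
// packets are silently dropped (e.g. by a firewall) is eventually detected
// as dead instead of the client waiting forever. Option names are Linux-specific.
int enableTcpKeepalive(int fd) {
  const int on = 1;
  const int idle_secs = 60;     // start probing after 60s of silence (example value)
  const int interval_secs = 10; // probe every 10s (example value)
  const int probes = 6;         // declare the peer dead after 6 failed probes (example value)
  if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) != 0) return -1;
  if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle_secs, sizeof(idle_secs)) != 0) return -1;
  if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &interval_secs, sizeof(interval_secs)) != 0) return -1;
  return setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &probes, sizeof(probes));
}
```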
@htuch @mattklein123 can we please reopen this? This is still an issue.
This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.
We have a similar understanding of cluster scalability. I think it would be helpful to have some flamegraphs from folks where possible so we can validate whether these are the same root cause. @pgenera is working on landing #14167 to benchmark. We have known overheads in stats processing that @jmarantz has landed some recent PRs to address; see #14028 and #14312.
See also #14439. Moreover, for any issues with CDS update performance, flame graphs would be most helpful!
I experimented a bit with a fake xDS server and tried to update 10000 routes containing RE2-based matching rules. It took 1-2 secs to compile them all. Then I tried to load a similar number of STRICT_DNS clusters. The flamegraph points to the per-endpoint copy of LocalityLbEndpoints made in the strict DNS cluster's resolve targets. The fix seems to be trivial, but I wonder if the reporters use STRICT_DNS clusters.
Nice catch! Any downside to just making that fix?
The fix I tried is:

```diff
--- a/source/common/upstream/strict_dns_cluster.h
+++ b/source/common/upstream/strict_dns_cluster.h
@@ -38,7 +38,7 @@ private:
   uint32_t port_;
   Event::TimerPtr resolve_timer_;
   HostVector hosts_;
-  const envoy::config::endpoint::v3::LocalityLbEndpoints locality_lb_endpoint_;
+  const envoy::config::endpoint::v3::LocalityLbEndpoints& locality_lb_endpoint_;
   const envoy::config::endpoint::v3::LbEndpoint lb_endpoint_;
   HostMap all_hosts_;
 };
```

With it applied Envoy seems to load the update instantly, but I'm not sure about the lifetime of the referenced message.
Yeah I was wondering about lifetime issues. In a quick scan it looks like there isn't a ton of data pulled from the endpoints structure in this class; maybe it could be pulled out individually in the ctor rather than copying the whole thing?
A deeper fix might be to move away from using the proto representation for endpoints inside DNS clusters. For convenience, I think we've inherited this view of endpoints that mostly makes sense for static or EDS endpoints, but the merge costs are non-trivial at scale in this case.
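A hedged sketch of what the two suggestions above might look like (illustrative names only, not Envoy's actual types; which fields a resolve target really needs is an assumption): copy just the handful of fields used into a plain struct in the constructor instead of storing the whole LocalityLbEndpoints proto per target.

```cpp
#include <cstdint>
#include <string>
#include <utility>

// Stand-in for the proto, kept minimal so the sketch is self-contained.
struct FakeLocalityLbEndpoints {
  std::string zone;
  uint32_t load_balancing_weight{1};
  uint32_t priority{0};
  // ... plus the full list of endpoints, which is what makes copies expensive ...
};

// Fixed-size per-target data extracted once in the constructor.
struct ResolveTargetInfo {
  std::string zone;
  uint32_t load_balancing_weight{1};
  uint32_t priority{0};
  std::string dns_address;
  uint32_t port{0};

  ResolveTargetInfo(const FakeLocalityLbEndpoints& lle, std::string address, uint32_t p)
      : zone(lle.zone), load_balancing_weight(lle.load_balancing_weight),
        priority(lle.priority), dns_address(std::move(address)), port(p) {}
};
```

The per-target cost then stays constant regardless of how many endpoints the locality contains.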
Submitted #15013 as an attempt to fix it.
@rojkov Thanks for that fix. But this also happens when the majority of the clusters are EDS clusters, so your PR will not fully fix this issue.
@ramaraochavali Alright, I'll drop the close tag from the PR's description to keep this issue open for now.
…get (#15013)
Currently Envoy::Upstream::StrictDnsClusterImpl::ResolveTarget, when instantiated for every endpoint, also creates a full copy of the envoy::config::endpoint::v3::LocalityLbEndpoints the endpoint belongs to. Given that the message contains all the endpoints defined for it, this leads to quadratic growth of consumed memory as the number of endpoints increases, even though those copies of endpoints are not used. Instead of creating a copy of envoy::config::endpoint::v3::LocalityLbEndpoints, use a reference to a single copy stored in Envoy::Upstream::StrictDnsClusterImpl and accessible from all resolve targets during their life span.
Risk Level: Low
Testing: unit tests
Docs Changes: N/A
Release Notes: N/A
Platform Specific Features: N/A
May contribute to #12138, #14993
Signed-off-by: Dmitry Rozhkov <[email protected]>
I've played a bit more with CDS and EDS: I loaded 10000 clusters with EDS load assignments. Anyway, it seems the process of a CDS update has three phases, and the CPU profiles for these phases are very different. Here's the first phase: by the end of it Envoy consumes about 500M. Then the second phase starts; during this phase Envoy consumes memory faster until it has allocated about 3800M (here's the heap profile just before the third phase). Then Envoy sends the request to my fake ADS server. At this point I see a burst in network traffic: about 450M is sent from Envoy to ADS. After that Envoy deallocates memory and its consumption drops back to 500M. So, perhaps in congested cluster networks the third phase may take longer. Adding a single cluster to these 10000 in a new CDS update doesn't seem to put any significant load on the CPU (at least compared with the initial loading of 10000).
Thanks for collecting these graphs. I assume from the numbers above that these were collected with "-c dbg", right? Are you reluctant to collect them with "-c opt" because the names in the flame-graphs are less helpful, or do you need line info somewhere? One thing I use sometimes is "OptDebug":
There's some pretty strange stuff in the pprof heap profile.
Maybe the image is rendering "alloc_space" (+X bytes for malloc(x), +0 for free()). @rojkov can you confirm? |
No, that's "inuse". This is what the profiler log looked like:
New findings:

```cpp
// Remove the previous cluster before the cluster object is destroyed.
secondary_init_clusters_.remove_if(
    [name_to_remove = cluster.info()->name()](ClusterManagerCluster* cluster_iter) {
      return cluster_iter->cluster().info()->name() == name_to_remove;
    });
secondary_init_clusters_.push_back(&cm_cluster);
```

Here secondary_init_clusters_ is a std::list, so every cluster added during initialization triggers a scan over all clusters still pending initialization, which makes warming a large CDS update quadratic in the number of clusters.
That’s excellent work! It would be awesome if a data structure change could bring down the operational time by a lot.
You can use the
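A hedged sketch of one possible data-structure change along these lines (illustrative names, not Envoy's actual code): keep a name-to-iterator index next to the pending-initialization list so removal is O(1) instead of a full scan.

```cpp
#include <iterator>
#include <list>
#include <string>
#include <unordered_map>

struct PendingCluster {
  std::string name;
  // ... warming/init state would live here ...
};

// Keeps insertion order (like the existing std::list) but adds an index so a
// cluster can be removed without scanning the whole list.
class PendingInitIndex {
public:
  void add(PendingCluster* c) {
    pending_.push_back(c);
    index_[c->name] = std::prev(pending_.end());
  }
  void remove(const std::string& name) {
    auto it = index_.find(name);
    if (it == index_.end()) {
      return;
    }
    pending_.erase(it->second); // O(1) erase via the stored iterator
    index_.erase(it);
  }
  const std::list<PendingCluster*>& pending() const { return pending_; }

private:
  std::list<PendingCluster*> pending_;
  std::unordered_map<std::string, std::list<PendingCluster*>::iterator> index_;
};
```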
Commit Message: upstream: avoid double hashing of protos in CDS init
Additional Description: Currently Cluster messages are hashed unconditionally upon instantiation of ClusterManagerImpl::ClusterData even if their hashes are known already. Calculate the hashes outside of the ClusterManagerImpl::ClusterData ctor to make use of already calculated ones.
Risk Level: Low
Testing: unit tests, manual tests
Docs Changes: N/A
Release Notes: N/A
Platform Specific Features: N/A
Contributes to #12138
Signed-off-by: Dmitry Rozhkov <[email protected]>
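A hedged sketch of the "hash once, pass it down" idea from the commit message above (illustrative names; std::hash over a serialized config stands in for the real proto hashing):

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <utility>

std::size_t hashSerializedCluster(const std::string& serialized) {
  return std::hash<std::string>{}(serialized);
}

class ClusterData {
public:
  // The caller passes the hash it already computed while diffing the update,
  // so the config is not hashed a second time in the constructor.
  ClusterData(std::string serialized, std::size_t precomputed_hash)
      : serialized_(std::move(serialized)), config_hash_(precomputed_hash) {}

  std::size_t configHash() const { return config_hash_; }

private:
  std::string serialized_;
  std::size_t config_hash_;
};
```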
I was curious why an update with 10k EDS clusters turns into >100M of traffic from Envoy to the server and sniffed it with Wireshark. So, every EDS discovery request for a cluster includes a copy of the Node message. AFAIU the Node message is needed to identify the client. I tried to limit the amount of data included in a request to the Node's id and cluster fields:

```diff
@@ -63,13 +64,20 @@ void GrpcMuxImpl::sendDiscoveryRequest(const std::string& type_url) {
   if (api_state.must_send_node_ || !skip_subsequent_node_ || first_stream_request_) {
     // Node may have been cleared during a previous request.
-    request.mutable_node()->CopyFrom(local_info_.node());
+    envoy::config::core::v3::Node n;
+    n.set_id(local_info_.node().id());
+    n.set_cluster(local_info_.node().cluster());
+    request.mutable_node()->CopyFrom(n);
     api_state.must_send_node_ = false;
   } else {
@@ -93,7 +103,11 @@ GrpcMuxWatchPtr GrpcMuxImpl::addWatch(const std::string& type_url,
   // TODO(gsagula): move TokenBucketImpl params to a config.
   if (!apiStateFor(type_url).subscribed_) {
     apiStateFor(type_url).request_.set_type_url(type_url);
-    apiStateFor(type_url).request_.mutable_node()->MergeFrom(local_info_.node());
+    envoy::config::core::v3::Node n;
+    n.set_id(local_info_.node().id());
+    n.set_cluster(local_info_.node().cluster());
+    apiStateFor(type_url).request_.mutable_node()->MergeFrom(n);
     apiStateFor(type_url).subscribed_ = true;
     subscriptions_.emplace_back(type_url);
     if (enable_type_url_downgrade_and_upgrade_) {
```

Also it helped to slash a couple of seconds off the update time.
With full Node:
With limited Node:
Looks like Envoy is bound to perform slightly worse with every new release as the number of extensions enumerated in the Node message grows.
@rojkov this is already supported for SotW (but not delta) xDS, see
I'm thinking this is something else we would prefer to default to true (but we need to do a deprecation dance to move to it).
@adisuissa this might be another potential cause of buffer bloat in the issue you are looking at.
This is a performance issue, not a bug per se.
When doing CDS updates with many clusters, Envoy will often get "stuck" evaluating the CDS update. This manifests as EDS failing, and in more extreme cases Envoy ceases to receive any xDS updates at all. When this happens, Envoy needs to be restarted to get it updating again.
In our case we're seeing issues with the current implementation of CdsApiImpl::onConfigUpdate (envoy/source/common/upstream/cds_api_impl.cc, lines 52 to 101 at 2966597) with a number of clusters in the 3000-7000 range. If Envoy could speedily evaluate a CDS update with 10000 clusters, this would represent a HUGE improvement in Envoy's behavior for us. Right now, only around 2500 clusters in a CDS update seems to evaluate in a reasonable amount of time.
Because CdsApiImpl::onConfigUpdate pauses EDS while doing CDS evaluation, Envoy's config will drift (a rough sketch of this pause/apply/resume sequence follows the context notes below). With many clusters in CDS, this can mean Envoy is hundreds of seconds behind what is current, which results in 503s.
Some context:
- Contour currently doesn't use the incremental xDS APIs of Envoy, so when K8s Services change in the cluster it sends ALL of the current config to Envoy again, which means a small change like adding or removing a K8s Service that maps to an Envoy cluster results in Envoy having to re-evaluate ALL clusters.
- With enough K8s Services behind an Ingress (7000+), Envoy can spontaneously cease to receive any new updates indefinitely, and will fail to do EDS because it gets stuck in the CDS evaluation.
- Given a high enough number of clusters this would be a Very Hard Problem, but given that we're in the low thousands, I'm hoping there are some things that could be done to improve performance without resorting to exotic methods.
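As referenced above, here is a rough, hedged sketch of the pause-EDS / apply-CDS / resume-EDS sequence (illustrative names only, not Envoy's actual API) showing why a slow cluster walk leaves endpoint data stale:

```cpp
#include <functional>
#include <string>
#include <vector>

struct XdsMux {
  bool eds_paused{false};
  void pauseEds() { eds_paused = true; }   // queued EDS responses wait while paused
  void resumeEds() { eds_paused = false; } // drift accumulated during the pause is applied now
};

// Simplified shape of a state-of-the-world CDS update handler.
void onCdsConfigUpdate(XdsMux& mux, const std::vector<std::string>& cluster_names,
                       const std::function<void(const std::string&)>& addOrUpdateCluster) {
  mux.pauseEds();
  // With thousands of clusters this loop is where the main thread spends its
  // time; while it runs, EDS stays paused and endpoints go stale, which is
  // the drift (and the eventual 503s) described above.
  for (const auto& name : cluster_names) {
    addOrUpdateCluster(name);
  }
  mux.resumeEds();
}
```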
Please let me know if there's any information I can provide that could help!