
cluster manager: implement cluster warming #2774

Merged: 5 commits merged into master from cds_warming on Mar 14, 2018
Conversation

@mattklein123 (Member) commented on Mar 10, 2018:

Once the server has initialized, the cluster manager will begin
warming clusters. This occurs for both new and updated clusters.
This ensures that once a worker sees the new cluster, it has already
undergone DNS resolution, EDS updates, and health checking, as applicable.

Risk Level: Medium/High (high risk, covered by both unit and ADS
integration tests).

Testing: Unit and integration tests.

Docs Changes: envoyproxy/data-plane-api#541

Release Notes: N/A

Fixes #1930
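
A minimal sketch of the add/update flow described above, with illustrative names only (this is not the actual Envoy implementation):

```cpp
#include <functional>
#include <memory>
#include <string>
#include <unordered_map>

// Illustrative stand-in for a cluster that signals when its DNS/EDS/health
// checking (as applicable) has completed.
struct Cluster {
  explicit Cluster(std::string name) : name_(std::move(name)) {}
  void setInitializedCb(std::function<void()> cb) { init_cb_ = std::move(cb); }
  std::string name_;
  std::function<void()> init_cb_; // invoked by the cluster once warmed
};

class ClusterManagerSketch {
public:
  // New and updated clusters land in warming_clusters_ first. Workers only
  // ever see entries that have been promoted into active_clusters_.
  void addOrUpdateCluster(std::unique_ptr<Cluster> cluster) {
    const std::string name = cluster->name_;
    cluster->setInitializedCb([this, name] { promote(name); });
    warming_clusters_[name] = std::move(cluster);
  }

private:
  void promote(const std::string& name) {
    auto it = warming_clusters_.find(name);
    if (it != warming_clusters_.end()) {
      active_clusters_[name] = std::move(it->second);
      warming_clusters_.erase(it);
      // Only now would the cluster be pushed to worker threads.
    }
  }

  std::unordered_map<std::string, std::unique_ptr<Cluster>> warming_clusters_;
  std::unordered_map<std::string, std::unique_ptr<Cluster>> active_clusters_;
};
```

The key property is that a cluster still waiting on DNS, EDS, or health checks is invisible to request processing until it is promoted.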

Commit: cluster manager: implement cluster warming
Signed-off-by: Matt Klein <[email protected]>
@mattklein123 (Member, Author):

@andraxylia @PiotrSikora @kyessenov please test if possible. Thank you!

@htuch (Member) left a comment:

Nice. I'm surprised so few changes were required; it will be interesting to see if this fixes the Istio issues.

@@ -113,6 +113,15 @@ class ClusterManagerImplTest : public testing::Test {
factory_.local_info_, log_manager_, factory_.dispatcher_));
Member:

Do you also want to modify ads_integration_test to force warming behavior?

Member Author:

The ADS integration test already does warming AFAICT because it sends a new CDS response followed by EDS. Do you mean to just verify warming by waiting on warming stats? Yes I can add this.
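
A sketch of what that stat-based verification could look like (the gauge names and helpers here are assumptions for illustration, not necessarily what the test ended up using):

```cpp
// Assumed names, for illustration only:
// - "cluster_manager.warming_clusters" / "cluster_manager.active_clusters"
//   as the warming/active gauges.
// - sendEdsResponse() as a hypothetical helper delivering the EDS assignment.
test_server_->waitForGaugeEq("cluster_manager.warming_clusters", 1);
sendEdsResponse(); // completes warming for the new cluster
test_server_->waitForGaugeEq("cluster_manager.warming_clusters", 0);
test_server_->waitForGaugeGe("cluster_manager.active_clusters", 1);
```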

return;
}
// The init helper is only used during initial server load due to the overall complexity of
// first time initialization. After that point, the cluster manager shifts to managing warming
Member:

I think I had this same question for LDS, but can you remind me why the ClusterManager can't handle both the initial server and dynamic cluster addition cases without the need for explicit initialization state here?

Member Author:

It probably can with a substantial refactor, but the code is so complicated as it is that this seemed like the easier-to-understand way of accomplishing it. Basically, once the server is fully initialized, each cluster is warmed independently, without having to worry about overall server/CM init. I will add more comments and maybe a TODO.
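
Roughly the shape being described, as a sketch with assumed member names (not the merged code):

```cpp
// Sketch only; server_initialized_, init_helper_, and initialize() are
// assumed names standing in for the real members.
void addOrUpdateClusterSketch(Cluster& cluster) {
  if (!server_initialized_) {
    // First-time server load: the init helper sequences cluster
    // initialization as part of overall server startup.
    init_helper_.addCluster(cluster);
  } else {
    // After server init: each added/updated cluster warms independently and
    // is promoted via onClusterInit() when its own init completes, with no
    // dependency on overall server/CM init state.
    cluster.setInitializedCb([this, &cluster]() { onClusterInit(cluster); });
    cluster.initialize();
  }
}
```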

const auto existing_warming_cluster = warming_clusters_.find(cluster_name);
const uint64_t new_hash = MessageUtil::hash(cluster);
if ((existing_primary_cluster != primary_clusters_.end() &&
existing_primary_cluster->second->blockUpdate(new_hash)) ||
Member:

There are at least two definitions of "primary cluster" in the ClusterManager world. The first is the non-TLS cluster definition; the second is the one in the comment "// Cluster loading happens in two phases: first all the primary clusters are loaded, and then all ...".

Since we're referencing different classes of clusters in this PR, this might be a good opportunity to disambiguate to reduce confusion for the reader.

Member Author:

Sure, let me see what I can do.

htuch self-assigned this on Mar 11, 2018
@andraxylia:

Thanks @mattklein123, will pick it up and test it.

void setInitializedCb(std::function<void()> callback) override {
init_helper_.setInitializedCb(callback);
}
ClusterInfoMap clusters() override {
// TODO(mattklein123): Add ability to see warming clusters in admin output.


(adding reference to #2172 for tracking)

@@ -336,28 +338,72 @@ void ClusterManagerImpl::onClusterInit(Cluster& cluster) {
}
}

bool ClusterManagerImpl::addOrUpdatePrimaryCluster(const envoy::api::v2::Cluster& cluster) {
bool ClusterManagerImpl::addOrUpdateCluster(const envoy::api::v2::Cluster& cluster) {
// First we need to see if this new config is new or an update to an existing dynamic cluster.
// We don't allow updates to statically configured clusters in the main configuration.
Member:

This comment probably warrants an update.

init_helper_.removeCluster(*existing_cluster->second.cluster_);
if (existing_active_cluster != active_clusters_.end() ||
existing_warming_cluster != warming_clusters_.end()) {
init_helper_.removeCluster(*existing_active_cluster->second->cluster_);
Member:

How is this valid if we get here due to the warming_cluster_ clause in the conditional?

Member Author:

There is a guard that protects it. I will make it more clear.
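
For instance, a guard of the following shape would make the warming-only branch safe (illustrative only, not necessarily the merged code):

```cpp
if (existing_active_cluster != active_clusters_.end() ||
    existing_warming_cluster != warming_clusters_.end()) {
  if (existing_active_cluster != active_clusters_.end()) {
    // Only clusters in the active map were ever registered with the init
    // helper, so this removal is skipped when we got here via the
    // warming-only half of the outer conditional.
    init_helper_.removeCluster(*existing_active_cluster->second->cluster_);
  }
  // ... replace the existing cluster ...
}
```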

init_helper_.addCluster(*cluster_entry->cluster_);
} else {
auto& cluster_entry = warming_clusters_.at(cluster_name);
ENVOY_LOG(info, "add/update cluster {} starting warming", cluster_name);
Member:

I know this probably also applies to listener warming, but I'm wondering how useful warming really is beyond briefly patching around the issues that Istio has due to not having implemented ADS. Specifically, it seems that a config can be accepted for a set of new clusters but left in a perpetually warming state. It's kind of weird that (1) the management server doesn't know whether the new config is really active yet (except by probing with a route update and validate_clusters), and (2) we don't have any way to cleanly dump the actual active state. This is not really an actionable comment, more food for thought.

Member Author:

For listeners, I think warming is absolutely required for realtime updates while serving traffic. For clusters, I think this is not a real concern in production, but given the small amount of code required I don't mind supporting it.

The "infinite warming" situation is a real concern. We could potentially report status back to the management server in a more coherent way in the future if needed.

Signed-off-by: Matt Klein <[email protected]>
@mattklein123 (Member, Author):

@htuch updated and added doc PR link.

@htuch (Member) left a comment:

Rad.

@mattklein123 (Member, Author):

@andraxylia did you get a chance to test this? I'm going to run a quick smoke test on this tomorrow at Lyft before merging given the risk, but otherwise I plan on merging tomorrow.

@andraxylia:

@mattklein123 I am in the process of testing it, but please merge it; that actually simplifies the testing since I can use the regular build workflow.

@mattklein123 (Member, Author):

@andraxylia OK I'm going to run a small smoke test later today then will merge.

mattklein123 merged commit f2559ba into master on Mar 14, 2018
mattklein123 deleted the cds_warming branch on March 14, 2018 at 21:33
mattklein123 added a commit that referenced this pull request on Mar 14, 2018
@andraxylia:

@mattklein123 @htuch Thanks a lot for this fix! I merged the almost-latest Envoy into Istio, and I was able to verify with Istio that cluster warming works by following these steps:

  • I engineered Pilot to force a 1-minute delay in the EDS response.
  • I created new weighted clusters and verified that Pilot's EDS was called.
  • For the next minute, I checked the committed clusters in Envoy using localhost:15000/clusters. The weighted clusters showed up only after 1 minute, once EDS was complete.

Unfortunately, we are still peeling the onion with the 503 errors when changing configs, due to some other issue with the connection that @lizan will follow up on. The logs look like this:

[2018-03-19 23:17:53.558][31][debug][router] external/envoy/source/common/router/router.cc:250] [C3287][S2947538408704053684] cluster 'out.echosrv.istio-system.svc.cluster.local|http-echo|version=v2' match for URL '/andra'
[2018-03-19 23:17:53.558][31][debug][router] external/envoy/source/common/router/router.cc:298] [C3287][S2947538408704053684] ':authority':'192.168.99.100:32398'
[2018-03-19 23:17:53.558][31][debug][router] external/envoy/source/common/router/router.cc:298] [C3287][S2947538408704053684] 'user-agent':'istio/fortio-0.8.0-pre'
[2018-03-19 23:17:53.558][31][debug][router] external/envoy/source/common/router/router.cc:298] [C3287][S2947538408704053684] ':path':'/andra'
[2018-03-19 23:17:53.558][31][debug][router] external/envoy/source/common/router/router.cc:298] [C3287][S2947538408704053684] ':method':'GET'
[2018-03-19 23:17:53.558][31][debug][router] external/envoy/source/common/router/router.cc:298] [C3287][S2947538408704053684] 'x-forwarded-for':'172.17.0.1'
[2018-03-19 23:17:53.558][31][debug][router] external/envoy/source/common/router/router.cc:298] [C3287][S2947538408704053684] 'x-forwarded-proto':'http'
[2018-03-19 23:17:53.558][31][debug][router] external/envoy/source/common/router/router.cc:298] [C3287][S2947538408704053684] 'x-envoy-internal':'true'
[2018-03-19 23:17:53.558][31][debug][router] external/envoy/source/common/router/router.cc:298] [C3287][S2947538408704053684] 'x-request-id':'05ba10e0-aeb1-94ed-b117-e0005bc1d7b8'
[2018-03-19 23:17:53.558][31][debug][router] external/envoy/source/common/router/router.cc:298] [C3287][S2947538408704053684] 'x-envoy-decorator-operation':'weighted-route-1'
[2018-03-19 23:17:53.558][31][debug][router] external/envoy/source/common/router/router.cc:298] [C3287][S2947538408704053684] 'x-b3-traceid':'b3e6c244a8f65e27'
[2018-03-19 23:17:53.558][31][debug][router] external/envoy/source/common/router/router.cc:298] [C3287][S2947538408704053684] 'x-b3-spanid':'b3e6c244a8f65e27'
[2018-03-19 23:17:53.558][31][debug][router] external/envoy/source/common/router/router.cc:298] [C3287][S2947538408704053684] 'x-b3-sampled':'1'
[2018-03-19 23:17:53.558][31][debug][router] external/envoy/source/common/router/router.cc:298] [C3287][S2947538408704053684] 'x-ot-span-context':'b3e6c244a8f65e27;b3e6c244a8f65e27;0000000000000000'
[2018-03-19 23:17:53.558][31][debug][router] external/envoy/source/common/router/router.cc:298] [C3287][S2947538408704053684] 'x-istio-attributes':'CkgKCnNvdXJjZS51aWQSOhI4a3ViZXJuZXRlczovL2lzdGlvLWluZ3Jlc3MtNmRjNWQ0OGM2Zi1yenRqcy5pc3Rpby1zeXN0ZW0='
[2018-03-19 23:17:53.558][31][debug][router] external/envoy/source/common/router/router.cc:298] [C3287][S2947538408704053684] ':scheme':'http'
[2018-03-19 23:17:53.558][31][debug][pool] external/envoy/source/common/http/http1/conn_pool.cc:73] creating a new connection
[2018-03-19 23:17:53.558][31][debug][client] external/envoy/source/common/http/codec_client.cc:23] [C3289] connecting
[2018-03-19 23:17:53.558][31][debug][connection] external/envoy/source/common/network/connection_impl.cc:568] [C3289] connecting to 172.17.0.10:8080
[2018-03-19 23:17:53.559][31][debug][connection] external/envoy/source/common/network/connection_impl.cc:577] [C3289] connection in progress
[2018-03-19 23:17:53.559][31][debug][pool] external/envoy/source/common/http/http1/conn_pool.cc:99] queueing request due to no available connections
[2018-03-19 23:17:53.559][31][debug][connection] external/envoy/source/common/network/connection_impl.cc:473] [C3289] delayed connection error: 111
[2018-03-19 23:17:53.559][31][debug][connection] external/envoy/source/common/network/connection_impl.cc:134] [C3289] closing socket: 0
[2018-03-19 23:17:53.559][31][debug][client] external/envoy/source/common/http/codec_client.cc:70] [C3289] disconnect. resetting 0 pending requests
[2018-03-19 23:17:53.559][31][debug][pool] external/envoy/source/common/http/http1/conn_pool.cc:115] [C3289] client disconnected
[2018-03-19 23:17:53.559][31][debug][router] external/envoy/source/common/router/router.cc:464] [C3287][S2947538408704053684] upstream reset
[2018-03-19 23:17:53.559][31][debug][http] external/envoy/source/common/http/conn_manager_impl.cc:939] [C3287][S2947538408704053684] encoding headers via codec (end_stream=false):
[2018-03-19 23:17:53.559][31][debug][http] external/envoy/source/common/http/conn_manager_impl.cc:944] [C3287][S2947538408704053684] ':status':'503'

(full log attached as 503error.txt)

Shikugawa pushed a commit to Shikugawa/envoy that referenced this pull request on Mar 28, 2020:

* increase life span for data in tcp cluster rewrite
* fix tests
* test
* typo