
cluster manager: implement cluster warming #2774

Merged: 5 commits merged into master from cds_warming on Mar 14, 2018
Conversation

@mattklein123 (Member) commented on Mar 10, 2018:

Once the server has initialized, the cluster manager will begin
warming clusters. This occurs for both new and updated clusters.
This ensures that once a worker sees the new cluster, it has already
undergone DNS resolution, EDS updates, and health checking, as applicable.

Risk Level: Medium/High (high risk, covered by both unit and ADS
integration tests).

Testing: Unit and integration tests.

Docs Changes: envoyproxy/data-plane-api#541

Release Notes: N/A

Fixes #1930
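
A minimal sketch of the add/update flow described above, with illustrative names only (this is not the actual Envoy implementation):

```cpp
#include <functional>
#include <memory>
#include <string>
#include <unordered_map>

// Illustrative stand-in for a cluster that signals when its DNS/EDS/health
// checking (as applicable) has completed.
struct Cluster {
  explicit Cluster(std::string name) : name_(std::move(name)) {}
  void setInitializedCb(std::function<void()> cb) { init_cb_ = std::move(cb); }
  std::string name_;
  std::function<void()> init_cb_; // invoked by the cluster once warmed
};

class ClusterManagerSketch {
public:
  // New and updated clusters land in warming_clusters_ first. Workers only
  // ever see entries that have been promoted into active_clusters_.
  void addOrUpdateCluster(std::unique_ptr<Cluster> cluster) {
    const std::string name = cluster->name_;
    cluster->setInitializedCb([this, name] { promote(name); });
    warming_clusters_[name] = std::move(cluster);
  }

private:
  void promote(const std::string& name) {
    auto it = warming_clusters_.find(name);
    if (it != warming_clusters_.end()) {
      active_clusters_[name] = std::move(it->second);
      warming_clusters_.erase(it);
      // Only now would the cluster be pushed to worker threads.
    }
  }

  std::unordered_map<std::string, std::unique_ptr<Cluster>> warming_clusters_;
  std::unordered_map<std::string, std::unique_ptr<Cluster>> active_clusters_;
};
```

The key property is that a cluster still waiting on DNS, EDS, or health checks is invisible to request processing until it is promoted.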

Commit: cluster manager: implement cluster warming
Signed-off-by: Matt Klein <[email protected]>
@mattklein123 (Member, Author):

@andraxylia @PiotrSikora @kyessenov please test if possible. Thank you!

@htuch (Member) left a comment:

Nice. I'm surprised so few changes were required; it will be interesting to see if this fixes the Istio issues.

@@ -113,6 +113,15 @@ class ClusterManagerImplTest : public testing::Test {
factory_.local_info_, log_manager_, factory_.dispatcher_));
Member:

Do you also want to modify ads_integration_test to force warming behavior?

Member Author:

The ADS integration test already does warming AFAICT because it sends a new CDS response followed by EDS. Do you mean to just verify warming by waiting on warming stats? Yes I can add this.
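
A sketch of what that stat-based verification could look like (the gauge names and helpers here are assumptions for illustration, not necessarily what the test ended up using):

```cpp
// Assumed names, for illustration only:
// - "cluster_manager.warming_clusters" / "cluster_manager.active_clusters"
//   as the warming/active gauges.
// - sendEdsResponse() as a hypothetical helper delivering the EDS assignment.
test_server_->waitForGaugeEq("cluster_manager.warming_clusters", 1);
sendEdsResponse(); // completes warming for the new cluster
test_server_->waitForGaugeEq("cluster_manager.warming_clusters", 0);
test_server_->waitForGaugeGe("cluster_manager.active_clusters", 1);
```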

return;
}
// The init helper is only used during initial server load due to the overall complexity of
// first time initialization. After that point, the cluster manager shifts to managing warming
Member:

I think I had this same question for LDS, but can you remind me why the ClusterManager can't handle both the initial server and dynamic cluster addition cases without the need for explicit initialization state here?

Member Author:

It probably can with a substantial refactor, but the code is so complicated as it is that this seemed like the easier-to-understand way of accomplishing it. Basically, once the server is fully initialized, each cluster is warmed independently, without having to worry about overall server/CM init. I will add more comments and maybe a TODO.
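
Roughly the shape being described, as a sketch with assumed member names (not the merged code):

```cpp
// Sketch only; server_initialized_, init_helper_, and initialize() are
// assumed names standing in for the real members.
void addOrUpdateClusterSketch(Cluster& cluster) {
  if (!server_initialized_) {
    // First-time server load: the init helper sequences cluster
    // initialization as part of overall server startup.
    init_helper_.addCluster(cluster);
  } else {
    // After server init: each added/updated cluster warms independently and
    // is promoted via onClusterInit() when its own init completes, with no
    // dependency on overall server/CM init state.
    cluster.setInitializedCb([this, &cluster]() { onClusterInit(cluster); });
    cluster.initialize();
  }
}
```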

const auto existing_warming_cluster = warming_clusters_.find(cluster_name);
const uint64_t new_hash = MessageUtil::hash(cluster);
if ((existing_primary_cluster != primary_clusters_.end() &&
existing_primary_cluster->second->blockUpdate(new_hash)) ||
Member:

There are at least two definitions of "primary cluster" in the ClusterManager world. The first is the non-TLS cluster definition; the second is the one in the comment "// Cluster loading happens in two phases: first all the primary clusters are loaded, and then all ...".

Since we're referencing different classes of clusters in this PR, this might be a good opportunity to disambiguate to reduce confusion for the reader.

Member Author:

Sure, let me see what I can do.

htuch self-assigned this on Mar 11, 2018
@andraxylia:

Thanks @mattklein123, will pick it up and test it.

void setInitializedCb(std::function<void()> callback) override {
init_helper_.setInitializedCb(callback);
}
ClusterInfoMap clusters() override {
// TODO(mattklein123): Add ability to see warming clusters in admin output.


(adding reference to #2172 for tracking)

@@ -336,28 +338,72 @@ void ClusterManagerImpl::onClusterInit(Cluster& cluster) {
}
}

bool ClusterManagerImpl::addOrUpdatePrimaryCluster(const envoy::api::v2::Cluster& cluster) {
bool ClusterManagerImpl::addOrUpdateCluster(const envoy::api::v2::Cluster& cluster) {
// First we need to see if this new config is new or an update to an existing dynamic cluster.
// We don't allow updates to statically configured clusters in the main configuration.
Member:

This comment probably warrants an update.

init_helper_.removeCluster(*existing_cluster->second.cluster_);
if (existing_active_cluster != active_clusters_.end() ||
existing_warming_cluster != warming_clusters_.end()) {
init_helper_.removeCluster(*existing_active_cluster->second->cluster_);
Member:

How is this valid if we get here due to the warming_cluster_ clause in the conditional?

Member Author:

There is a guard that protects it. I will make it more clear.
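
For instance, a guard of the following shape would make the warming-only branch safe (illustrative only, not necessarily the merged code):

```cpp
if (existing_active_cluster != active_clusters_.end() ||
    existing_warming_cluster != warming_clusters_.end()) {
  if (existing_active_cluster != active_clusters_.end()) {
    // Only clusters in the active map were ever registered with the init
    // helper, so this removal is skipped when we got here via the
    // warming-only half of the outer conditional.
    init_helper_.removeCluster(*existing_active_cluster->second->cluster_);
  }
  // ... replace the existing cluster ...
}
```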

init_helper_.addCluster(*cluster_entry->cluster_);
} else {
auto& cluster_entry = warming_clusters_.at(cluster_name);
ENVOY_LOG(info, "add/update cluster {} starting warming", cluster_name);
Member:

I know this probably also applies to listener warming, but I'm wondering how useful warming really is beyond briefly patching around the issues that Istio has due to not having implemented ADS. Specifically, it seems that a config can be accepted for a set of new clusters but left in a perpetually warming state. It's kind of weird that (1) the management server doesn't know whether the new config is really active yet (except by probing with a route update and validate_clusters), and (2) we don't have any way to cleanly dump the actual active state. This is not really an actionable comment, more food for thought.

Member Author:

For listeners, I think warming is absolutely required for realtime updates while serving traffic. For clusters, I think this is not a real concern in production, but given the small amount of code required I don't mind supporting it.

The "infinite warming" situation is a real concern. We could potentially report status back to the management server in a more coherent way in the future if needed.

Signed-off-by: Matt Klein <[email protected]>
@mattklein123 (Member, Author):

@htuch updated and added doc PR link.

@htuch (Member) left a comment:

Rad.

@mattklein123 (Member, Author):

@andraxylia did you get a chance to test this? I'm going to run a quick smoke test on this tomorrow at Lyft before merging given the risk, but otherwise I plan on merging tomorrow.

@andraxylia:

@mattklein123 I am in the process of testing it, but please merge it; that actually simplifies the testing since I can use the regular build workflow.

@mattklein123 (Member, Author):

@andraxylia OK I'm going to run a small smoke test later today then will merge.

mattklein123 merged commit f2559ba into master on Mar 14, 2018
mattklein123 deleted the cds_warming branch on March 14, 2018 at 21:33
mattklein123 added a commit that referenced this pull request on Mar 14, 2018
@andraxylia:

@mattklein123 @htuch Thanks a lot for this fix! I merged the almost-latest Envoy into Istio, and I was able to verify with Istio that cluster warming works by following these steps:

  • I engineered Pilot to force a 1-minute delay in the EDS response.
  • I created new weighted clusters and verified that Pilot's EDS was called.
  • For the next minute, I checked the committed clusters in Envoy using localhost:15000/clusters. The weighted clusters showed up only after 1 minute, once EDS was complete.

Unfortunately, we are still peeling the onion with the 503 errors when changing configs, due to some other issue with the connection that @lizan will follow up on. The logs look like this:

[2018-03-19 23:17:53.558][31][debug][router] external/envoy/source/common/router/router.cc:250] [C3287][S2947538408704053684] cluster 'out.echosrv.istio-system.svc.cluster.local|http-echo|version=v2' match for URL '/andra'
[2018-03-19 23:17:53.558][31][debug][router] external/envoy/source/common/router/router.cc:298] [C3287][S2947538408704053684] ':authority':'192.168.99.100:32398'
[2018-03-19 23:17:53.558][31][debug][router] external/envoy/source/common/router/router.cc:298] [C3287][S2947538408704053684] 'user-agent':'istio/fortio-0.8.0-pre'
[2018-03-19 23:17:53.558][31][debug][router] external/envoy/source/common/router/router.cc:298] [C3287][S2947538408704053684] ':path':'/andra'
[2018-03-19 23:17:53.558][31][debug][router] external/envoy/source/common/router/router.cc:298] [C3287][S2947538408704053684] ':method':'GET'
[2018-03-19 23:17:53.558][31][debug][router] external/envoy/source/common/router/router.cc:298] [C3287][S2947538408704053684] 'x-forwarded-for':'172.17.0.1'
[2018-03-19 23:17:53.558][31][debug][router] external/envoy/source/common/router/router.cc:298] [C3287][S2947538408704053684] 'x-forwarded-proto':'http'
[2018-03-19 23:17:53.558][31][debug][router] external/envoy/source/common/router/router.cc:298] [C3287][S2947538408704053684] 'x-envoy-internal':'true'
[2018-03-19 23:17:53.558][31][debug][router] external/envoy/source/common/router/router.cc:298] [C3287][S2947538408704053684] 'x-request-id':'05ba10e0-aeb1-94ed-b117-e0005bc1d7b8'
[2018-03-19 23:17:53.558][31][debug][router] external/envoy/source/common/router/router.cc:298] [C3287][S2947538408704053684] 'x-envoy-decorator-operation':'weighted-route-1'
[2018-03-19 23:17:53.558][31][debug][router] external/envoy/source/common/router/router.cc:298] [C3287][S2947538408704053684] 'x-b3-traceid':'b3e6c244a8f65e27'
[2018-03-19 23:17:53.558][31][debug][router] external/envoy/source/common/router/router.cc:298] [C3287][S2947538408704053684] 'x-b3-spanid':'b3e6c244a8f65e27'
[2018-03-19 23:17:53.558][31][debug][router] external/envoy/source/common/router/router.cc:298] [C3287][S2947538408704053684] 'x-b3-sampled':'1'
[2018-03-19 23:17:53.558][31][debug][router] external/envoy/source/common/router/router.cc:298] [C3287][S2947538408704053684] 'x-ot-span-context':'b3e6c244a8f65e27;b3e6c244a8f65e27;0000000000000000'
[2018-03-19 23:17:53.558][31][debug][router] external/envoy/source/common/router/router.cc:298] [C3287][S2947538408704053684] 'x-istio-attributes':'CkgKCnNvdXJjZS51aWQSOhI4a3ViZXJuZXRlczovL2lzdGlvLWluZ3Jlc3MtNmRjNWQ0OGM2Zi1yenRqcy5pc3Rpby1zeXN0ZW0='
[2018-03-19 23:17:53.558][31][debug][router] external/envoy/source/common/router/router.cc:298] [C3287][S2947538408704053684] ':scheme':'http'
[2018-03-19 23:17:53.558][31][debug][pool] external/envoy/source/common/http/http1/conn_pool.cc:73] creating a new connection
[2018-03-19 23:17:53.558][31][debug][client] external/envoy/source/common/http/codec_client.cc:23] [C3289] connecting
[2018-03-19 23:17:53.558][31][debug][connection] external/envoy/source/common/network/connection_impl.cc:568] [C3289] connecting to 172.17.0.10:8080
[2018-03-19 23:17:53.559][31][debug][connection] external/envoy/source/common/network/connection_impl.cc:577] [C3289] connection in progress
[2018-03-19 23:17:53.559][31][debug][pool] external/envoy/source/common/http/http1/conn_pool.cc:99] queueing request due to no available connections
[2018-03-19 23:17:53.559][31][debug][connection] external/envoy/source/common/network/connection_impl.cc:473] [C3289] delayed connection error: 111
[2018-03-19 23:17:53.559][31][debug][connection] external/envoy/source/common/network/connection_impl.cc:134] [C3289] closing socket: 0
[2018-03-19 23:17:53.559][31][debug][client] external/envoy/source/common/http/codec_client.cc:70] [C3289] disconnect. resetting 0 pending requests
[2018-03-19 23:17:53.559][31][debug][pool] external/envoy/source/common/http/http1/conn_pool.cc:115] [C3289] client disconnected
[2018-03-19 23:17:53.559][31][debug][router] external/envoy/source/common/router/router.cc:464] [C3287][S2947538408704053684] upstream reset
[2018-03-19 23:17:53.559][31][debug][http] external/envoy/source/common/http/conn_manager_impl.cc:939] [C3287][S2947538408704053684] encoding headers via codec (end_stream=false):
[2018-03-19 23:17:53.559][31][debug][http] external/envoy/source/common/http/conn_manager_impl.cc:944] [C3287][S2947538408704053684] ':status':'503'

(full log attached as 503error.txt)

Shikugawa pushed a commit to Shikugawa/envoy that referenced this pull request on Mar 28, 2020:

* increase life span for data in tcp cluster rewrite
* fix tests
* test
* typo