cluster manager: implement cluster warming #2774
Conversation
Once the server has initialized, the cluster manager will begin warming clusters. This occurs for both new and updated clusters. This ensures that once a worker sees the new cluster, it has undergone DNS, EDS, and health checking, as applicable. Signed-off-by: Matt Klein <[email protected]>
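For orientation, the control flow this PR introduces looks roughly like the following. This is a minimal self-contained sketch, not the actual Envoy code: the two-map design (warming_clusters_ / active_clusters_) matches the PR, but all the types here are simplified stand-ins.

#include <functional>
#include <map>
#include <memory>
#include <string>

// Simplified stand-in for an Envoy cluster. In the real code, initialize()
// completes only after DNS/EDS/health checking have finished.
struct Cluster {
  std::string name;
  void initialize(std::function<void()> done) { done(); }
};

class ClusterManagerSketch {
public:
  void addOrUpdateCluster(const std::string& name) {
    auto cluster = std::make_shared<Cluster>(Cluster{name});
    // Park the new/updated cluster in the warming map; workers never see it.
    warming_clusters_[name] = cluster;
    cluster->initialize([this, name]() { onClusterWarmingDone(name); });
  }

private:
  void onClusterWarmingDone(const std::string& name) {
    auto it = warming_clusters_.find(name);
    if (it == warming_clusters_.end()) {
      return;
    }
    // Promote to active only after warming completes; from this point the
    // cluster is pushed out to workers.
    active_clusters_[name] = it->second;
    warming_clusters_.erase(it);
  }

  std::map<std::string, std::shared_ptr<Cluster>> warming_clusters_;
  std::map<std::string, std::shared_ptr<Cluster>> active_clusters_;
};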
@andraxylia @PiotrSikora @kyessenov please test if possible. Thank you!
Nice, surprised so few changes were required; will be interesting to see if this fixes the Istio issues.
@@ -113,6 +113,15 @@ class ClusterManagerImplTest : public testing::Test {
      factory_.local_info_, log_manager_, factory_.dispatcher_));
Do you also want to modify ads_integration_test to force warming behavior?
The ADS integration test already exercises warming AFAICT, because it sends a new CDS response followed by EDS. Do you mean to verify warming explicitly by waiting on warming stats? Yes, I can add this.
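Something along these lines would make the assertion explicit. This is a hypothetical sketch: waitForGaugeEq() is the usual Envoy integration-test helper for blocking on a stat, but treat the exact gauge names here as assumptions rather than the code that was actually added.

// Hypothetical: after pushing the CDS update over ADS, the new cluster should
// show up as warming, and only drain once its EDS response lands.
test_server_->waitForGaugeEq("cluster_manager.warming_clusters", 1);
// ... send the EDS response for the new cluster ...
test_server_->waitForGaugeEq("cluster_manager.warming_clusters", 0);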
  return;
}
// The init helper is only used during initial server load due to the overall complexity of
// first time initialization. After that point, the cluster manager shifts to managing warming
I think I had this same question for LDS, but can you remind me why the ClusterManager can't handle both the initial server load and dynamic cluster addition cases without the need for explicit initialization state here?
It probably could with a substantial refactor, but the code is so complicated as it is that this seemed like the easier-to-understand way of accomplishing it. Basically, once the server is fully initialized, each cluster is warmed independently, without having to worry about overall server/CM init. I will add more comments and maybe a TODO.
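In other words, the split looks roughly like this. A hypothetical fragment reusing the simplified types from the sketch above; server_initialized_ and init_helper_ are assumed members, not the literal Envoy code:

// Hypothetical fragment: during initial server load the init helper owns
// sequencing (primary clusters, then secondary, then CDS); afterwards each
// cluster warms on its own and is promoted when its init callback fires.
void loadCluster(Cluster& cluster) {
  if (!server_initialized_) {
    init_helper_.addCluster(cluster); // Tracked by overall server/CM init.
  } else {
    cluster.initialize([this, &cluster]() { onClusterInit(cluster); });
  }
}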
const auto existing_warming_cluster = warming_clusters_.find(cluster_name);
const uint64_t new_hash = MessageUtil::hash(cluster);
if ((existing_primary_cluster != primary_clusters_.end() &&
     existing_primary_cluster->second->blockUpdate(new_hash)) ||
There are at least two definitions of "primary cluster" in the ClusterManager world. The first is here:
/**
// Cluster loading happens in two phases: first all the primary clusters are loaded, and then all
Since we're referencing different classes of clusters in this PR, this might be a good opportunity to disambiguate to reduce confusion for the reader.
Sure, let me see what I can do.
Signed-off-by: Matt Klein <[email protected]>
Signed-off-by: Matt Klein <[email protected]>
thanks @mattklein123 will pick it up and test it.
void setInitializedCb(std::function<void()> callback) override {
  init_helper_.setInitializedCb(callback);
}
ClusterInfoMap clusters() override {
  // TODO(mattklein123): Add ability to see warming clusters in admin output.
(adding reference to #2172 for tracking)
@@ -336,28 +338,72 @@ void ClusterManagerImpl::onClusterInit(Cluster& cluster) {
   }
 }

-bool ClusterManagerImpl::addOrUpdatePrimaryCluster(const envoy::api::v2::Cluster& cluster) {
+bool ClusterManagerImpl::addOrUpdateCluster(const envoy::api::v2::Cluster& cluster) {
   // First we need to see if this new config is new or an update to an existing dynamic cluster.
   // We don't allow updates to statically configured clusters in the main configuration.
This comment probably warrants an update.
-  init_helper_.removeCluster(*existing_cluster->second.cluster_);
+  if (existing_active_cluster != active_clusters_.end() ||
+      existing_warming_cluster != warming_clusters_.end()) {
+    init_helper_.removeCluster(*existing_active_cluster->second->cluster_);
How is this valid if we get here due to the warming_cluster_ clause in the conditional?
There is a guard that protects it. I will make it clearer.
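One plausible shape of the guard being referred to. This is a hypothetical sketch, not the actual code: the idea is that removing a cluster the init helper never tracked must be a no-op, so reaching removeCluster() via the warming-cluster clause is safe.

// Hypothetical: once the server has fully initialized, warming clusters are
// never registered with the init helper, so removal degrades to a no-op.
void ClusterManagerInitHelper::removeCluster(Cluster& cluster) {
  if (state_ == State::AllClustersInitialized) {
    return; // Post-init: nothing to untrack for warming clusters.
  }
  // ... erase the cluster from the primary/secondary init lists ...
}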
    init_helper_.addCluster(*cluster_entry->cluster_);
  } else {
    auto& cluster_entry = warming_clusters_.at(cluster_name);
    ENVOY_LOG(info, "add/update cluster {} starting warming", cluster_name);
I know this probably also applies to listener warming, but I'm wondering how useful warming really is beyond briefly patching around the issues Istio has due to not having implemented ADS. Specifically, it seems that a config can be accepted for a set of new clusters but left in a perpetually warming state. It's kind of weird that (1) the management server doesn't know whether the new config is really active yet (except by probing with a route update and validate_clusters), and (2) we don't have any way to cleanly dump the actual active state. This is not really an actionable comment, more food for thought.
For listeners, I think warming is absolutely required for real-time updates while serving traffic. For clusters, I don't think this is a real concern in production, but given the small amount of code required I don't mind supporting it.
The "infinite warming" situation is a real concern. We could potentially report status back to the management server in a more coherent way in the future if needed.
Signed-off-by: Matt Klein <[email protected]>
Signed-off-by: Matt Klein <[email protected]>
@htuch updated and added doc PR link.
Rad.
@andraxylia did you get a chance to test this? I'm going to run a quick smoke test tomorrow at Lyft before merging given the risk, but otherwise I plan on merging tomorrow.
@mattklein123 I am in the process of testing it, but please merge it; that actually simplifies the testing since I can use the regular build workflow.
@andraxylia OK, I'm going to run a small smoke test later today, then will merge.
This reverts commit f2559ba.
@mattklein123 @htuch Thanks a lot for this fix! I merged almost the latest Envoy into Istio, and I was able to verify with Istio that cluster warming works, by following these steps:
Unfortunately, we are still peeling the onion with the 503 errors when changing configs, due to some other issue with the connection that @lizan will follow up on. The logs look like this:
[2018-03-19 23:17:53.558][31][debug][router] external/envoy/source/common/router/router.cc:250] [C3287][S2947538408704053684] cluster 'out.echosrv.istio-system.svc.cluster.local|http-echo|version=v2' match for URL '/andra'
[2018-03-19 23:17:53.558][31][debug][router] external/envoy/source/common/router/router.cc:298] [C3287][S2947538408704053684] ':authority':'192.168.99.100:32398'
[2018-03-19 23:17:53.558][31][debug][router] external/envoy/source/common/router/router.cc:298] [C3287][S2947538408704053684] 'user-agent':'istio/fortio-0.8.0-pre'
[2018-03-19 23:17:53.558][31][debug][router] external/envoy/source/common/router/router.cc:298] [C3287][S2947538408704053684] ':path':'/andra'
[2018-03-19 23:17:53.558][31][debug][router] external/envoy/source/common/router/router.cc:298] [C3287][S2947538408704053684] ':method':'GET'
[2018-03-19 23:17:53.558][31][debug][router] external/envoy/source/common/router/router.cc:298] [C3287][S2947538408704053684] 'x-forwarded-for':'172.17.0.1'
[2018-03-19 23:17:53.558][31][debug][router] external/envoy/source/common/router/router.cc:298] [C3287][S2947538408704053684] 'x-forwarded-proto':'http'
[2018-03-19 23:17:53.558][31][debug][router] external/envoy/source/common/router/router.cc:298] [C3287][S2947538408704053684] 'x-envoy-internal':'true'
[2018-03-19 23:17:53.558][31][debug][router] external/envoy/source/common/router/router.cc:298] [C3287][S2947538408704053684] 'x-request-id':'05ba10e0-aeb1-94ed-b117-e0005bc1d7b8'
[2018-03-19 23:17:53.558][31][debug][router] external/envoy/source/common/router/router.cc:298] [C3287][S2947538408704053684] 'x-envoy-decorator-operation':'weighted-route-1'
[2018-03-19 23:17:53.558][31][debug][router] external/envoy/source/common/router/router.cc:298] [C3287][S2947538408704053684] 'x-b3-traceid':'b3e6c244a8f65e27'
[2018-03-19 23:17:53.558][31][debug][router] external/envoy/source/common/router/router.cc:298] [C3287][S2947538408704053684] 'x-b3-spanid':'b3e6c244a8f65e27'
[2018-03-19 23:17:53.558][31][debug][router] external/envoy/source/common/router/router.cc:298] [C3287][S2947538408704053684] 'x-b3-sampled':'1'
[2018-03-19 23:17:53.558][31][debug][router] external/envoy/source/common/router/router.cc:298] [C3287][S2947538408704053684] 'x-ot-span-context':'b3e6c244a8f65e27;b3e6c244a8f65e27;0000000000000000'
[2018-03-19 23:17:53.558][31][debug][router] external/envoy/source/common/router/router.cc:298] [C3287][S2947538408704053684] 'x-istio-attributes':'CkgKCnNvdXJjZS51aWQSOhI4a3ViZXJuZXRlczovL2lzdGlvLWluZ3Jlc3MtNmRjNWQ0OGM2Zi1yenRqcy5pc3Rpby1zeXN0ZW0='
[2018-03-19 23:17:53.558][31][debug][router] external/envoy/source/common/router/router.cc:298] [C3287][S2947538408704053684] ':scheme':'http'
[2018-03-19 23:17:53.558][31][debug][client] external/envoy/source/common/http/codec_client.cc:23] [C3289] connecting
[2018-03-19 23:17:53.558][31][debug][connection] external/envoy/source/common/network/connection_impl.cc:568] [C3289] connecting to 172.17.0.10:8080
[2018-03-19 23:17:53.559][31][debug][connection] external/envoy/source/common/network/connection_impl.cc:577] [C3289] connection in progress
[2018-03-19 23:17:53.559][31][debug][pool] external/envoy/source/common/http/http1/conn_pool.cc:99] queueing request due to no available connections
[2018-03-19 23:17:53.559][31][debug][connection] external/envoy/source/common/network/connection_impl.cc:473] [C3289] delayed connection error: 111
[2018-03-19 23:17:53.559][31][debug][connection] external/envoy/source/common/network/connection_impl.cc:134] [C3289] closing socket: 0
[2018-03-19 23:17:53.559][31][debug][client] external/envoy/source/common/http/codec_client.cc:70] [C3289] disconnect. resetting 0 pending requests
[2018-03-19 23:17:53.559][31][debug][pool] external/envoy/source/common/http/http1/conn_pool.cc:115] [C3289] client disconnected
[2018-03-19 23:17:53.559][31][debug][router] external/envoy/source/common/router/router.cc:464] [C3287][S2947538408704053684] upstream reset
[2018-03-19 23:17:53.559][31][debug][http] external/envoy/source/common/http/conn_manager_impl.cc:939] [C3287][S2947538408704053684] encoding headers via codec (end_stream=false):
[2018-03-19 23:17:53.559][31][debug][http] external/envoy/source/common/http/conn_manager_impl.cc:944] [C3287][S2947538408704053684] ':status':'503'
* increase life span for data in tcp cluster rewrite
* fix tests
* test
* typo
Once the server has initialized, the cluster manager will begin
warming clusters. This occurs for both new and updated clusters.
This will ensure that once a worker sees the new cluster it has
undergone DNS, EDS, and health checking, as applicable.
Risk Level: Medium/High (high risk, covered by both unit and ADS
integration tests).
Testing: Both unit/integration.
Docs Changes: envoyproxy/data-plane-api#541
Release Notes: N/A
Fixes #1930