Do not check for nodes Ready in CSE on masters #2650
Conversation
Per @jackfrancis's request, this is a test that demonstrates the current all-nodes check is flaky in upgrade/scale situations.
This will fail the entire upgrade process.
Thanks for adding that @EPinci. If you're able to walk through the above scenario using this PR and see a different (successful) outcome, that will provide evidence that we should prioritize this work.
Yup, I'll run through it on my test cluster and report back. On second thought, however, instead of totally skipping the node check, could it be beneficial to just check the local node's readiness?
@EPinci That would largely amount to the same thing. The ARM deployment will return CSE success based on the set of all CSE executions having a zero exit code. There is a difference between "waiting a certain amount of time on each master node for all nodes to be ready" and "waiting a certain amount of time on each node (masters + agents) for itself to check in as ready with the control plane", but that difference is not meaningful from the perspective of "validate that all the nodes are Ready". (I'm also eliding over the fact that we don't currently set up …)

Generally speaking, what we're checking for now is "up to a certain period of time after provisioning, do we see a functional k8s cluster w/ the full set of Ready nodes participating behind the apiserver at least once?" Admittedly, checking something like that is arguably a k8s anti-pattern, as clusters are designed to be 100% functional with node participation fluidity; but doing that check exposes valuable behavioral data about the real-world effects of our post-provisioning, app-layer configuration implementation. We want 100% end state outcome achievement for known-working cluster configurations, and so in order to know about and protect against novel edge cases, we need such a check to be operable over time.

However, it is also arguable that such a check does not need to concern a user's experience building a new cluster. Should we rather return control to the user when their cluster has a functioning apiserver and nodes that are able to schedule workloads? In addition to giving users a few minutes of their time back (especially for clusters with high node counts), it is arguably more k8s idiomatic to do so.

However (part deux), I would argue that we don't want to make that change until the project has confidence that CI/CD is continually doing comprehensive end state outcome validation for all code changes going forward. Without that check, the resilience of our cluster provision scripts will decay and leave users with working clusters, but with intermittent non-working nodes (which they may be paying for if we successfully provision a VM but don't successfully join it to the cluster).
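For context, a minimal sketch (not the actual implementation) of the kind of all-nodes gate being discussed, assuming kubectl is reachable from a master node; the function name and retry budget here are hypothetical:

# Sketch only: wait until the full set of nodes has reported Ready at least once.
wait_for_all_nodes_ready() {
    local expected_nodes=$1
    local attempts=180          # ~30 minutes at 10s per attempt (hypothetical budget)
    for ((i = 1; i <= attempts; i++)); do
        local ready
        ready=$(kubectl get nodes --no-headers 2>/dev/null | awk '$2 == "Ready"' | wc -l)
        if [ "${ready}" -ge "${expected_nodes}" ]; then
            return 0            # saw the full set of Ready nodes at least once
        fi
        sleep 10
    done
    return 1                    # non-zero exit fails CSE, and with it the ARM deployment
}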
I ran this PR through a test. It knocked out my test cluster and I could not understand why.
The deployment failure is "subnet full", which is weird because I have a dedicated /24 subnet for three masters with ipAddressCount set to 20.
@@ -233,7 +233,7 @@ func (cli *CLIProvisioner) FetchProvisioningMetrics(path string, cfg *config.Con
	agentFiles := []string{"/var/log/azure/cluster-provision.log", "/var/log/cloud-init.log",
		"/var/log/cloud-init-output.log", "/var/log/syslog", "/var/log/azure/custom-script/handler.log",
		"/opt/m", "/opt/azure/containers/kubelet.sh", "/opt/azure/containers/provision.sh",
		"/opt/azure/provision-ps.log", "/var/log/azure/dnsdump.pcap"}
/var/log/azure/dnsdump.pcap is no longer a file generated by the provision scripts
retrycmd_if_failure 100 1 10 systemctl daemon-reload && systemctl restart kubelet
fi
retrycmd_if_failure 100 1 10 systemctl daemon-reload && systemctl restart kubelet
retrycmd_if_failure 10 1 3 systemctl status kubelet --no-pager -l > /var/log/azure/kubelet-status.log
We should consider doing this for other daemons as well
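For illustration, the same capture-status-on-provision pattern applied to other daemons might look like the sketch below, reusing the existing retrycmd_if_failure helper (retries, wait, timeout, command); the unit names and log paths are just examples, not part of this PR:

retrycmd_if_failure 10 1 3 systemctl status docker --no-pager -l > /var/log/azure/docker-status.log
retrycmd_if_failure 10 1 3 systemctl status etcd --no-pager -l > /var/log/azure/etcd-status.log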
lgtm
What this PR does / why we need it:
Removes the master nodes' "check for all nodes Ready" from CSE. The objective of this change is to fail faster when any agent node does not have a healthy kubelet after provisioning. We introduce a new kubelet status check on all nodes, which will fail CSE if the kubelet is not healthy.
By waiting n minutes for master node provisioning to return CSE status (until the nodes-Ready check times out), we only add unnecessary delay to the eventual non-zero ARM deployment result.
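As a rough illustration (not the exact code in this PR), a per-node kubelet gate along these lines could look like the following sketch; the exit code is hypothetical and the retrycmd_if_failure helper is assumed from the existing provision scripts:

# Sketch only: fail CSE on the node itself if kubelet never becomes active,
# instead of having masters wait for an all-nodes Ready check to time out.
ERR_KUBELET_UNHEALTHY=99   # hypothetical exit code, for illustration
if ! retrycmd_if_failure 10 1 3 systemctl is-active --quiet kubelet; then
    systemctl status kubelet --no-pager -l > /var/log/azure/kubelet-status.log
    exit $ERR_KUBELET_UNHEALTHY   # non-zero CSE exit surfaces the failure to ARM quickly
fi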
This PR also includes a new, automatic "vanilla kubernetes" E2E test (default config, a single small agent pool) as a tactic for speeding up the debug process for problematic code changes and for adding standardized coverage of the acs-engine Kubernetes default config.
Which issue this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close that issue when PR gets merged): fixes #

Special notes for your reviewer:
If applicable:
Release note: