This repository has been archived by the owner on Jan 11, 2023. It is now read-only.

Do not check for nodes Ready in CSE on masters #2650

Merged · 8 commits, Apr 12, 2018

Conversation

@jackfrancis (Member) commented Apr 10, 2018

What this PR does / why we need it:

Remove the master nodes' "check for all nodes Ready" from CSE. The objective of this change is to fail faster when any agent node does not have a healthy kubelet after provisioning. We introduce a new kubelet status check on all nodes, which will fail CSE if kubelet is not healthy.

By waiting n minutes on the master nodes for CSE to return status (until the nodes Ready check times out), we add unnecessary delay to the eventual non-zero ARM deployment result.

This PR also includes a new, automatic "vanilla kubernetes" E2E test (default config; a single, small agent pool) as a tactic for speeding up the debug process for problematic code changes and for adding standardized coverage of the acs-engine kubernetes default config.
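For illustration, a minimal sketch of what the per-node kubelet check amounts to, reusing the retrycmd_if_failure helper already present in the provision scripts (the wrapper function name is hypothetical; this is not the verbatim CSE code):

# hypothetical wrapper around the per-node kubelet health check added by this PR
ensure_kubelet_healthy() {
    # reload systemd units and restart kubelet, retrying on transient failures
    retrycmd_if_failure 100 1 10 systemctl daemon-reload
    retrycmd_if_failure 100 1 10 systemctl restart kubelet
    # capture kubelet status for debugging; per this PR, an unhealthy kubelet fails CSE on this node
    retrycmd_if_failure 10 1 3 systemctl status kubelet --no-pager -l > /var/log/azure/kubelet-status.log
}
ensure_kubelet_healthy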

Which issue this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close that issue when PR gets merged): fixes #

Special notes for your reviewer:

If applicable:

  • documentation
  • unit tests
  • tested backward compatibility (ie. deploy with previous version, upgrade with this branch)

Release note:

Do not check for nodes Ready in CSE on masters

@EPinci (Contributor) commented Apr 10, 2018

Per @jackfrancis' request, this is a test that demonstrates the current all-nodes check is flaky in upgrade/scale situations.

  • Deploy a kubernetes cluster with at least 2 nodes in a single nodepool
    In my case: kubernetes v1.9.3 with 3 masters and 3 nodes in a single nodepool
    The cluster needs to have useInstanceMetadata set to false for the autoscaler to work
  • Deploy the official kubernetes autoscaler v1.2 to the cluster
    The autoscaler needs to be configured to allow the nodepool to shrink to 1
    Since the cluster has no load, the nodepool will be reduced to a single node
  • Use ACS-Engine to upgrade the cluster
    In my case: to v1.9.6
    At the end of the redeployment of the first VM, the script gets stuck and eventually fails with the log below.
    Note that the reported number of Ready nodes is 4 (all of them) but the check is testing against 6 (the original size of the cluster).
+ for i in '{1..1800}'
++ wc -l
++ grep Ready
++ /usr/local/bin/kubectl get nodes
+ nodes=4
+ '[' 4 -eq 6 ']'
+ sleep 1
+ '[' 1 -ne 0 ']'
+ echo 'still waiting for active nodes after 1800 seconds'
still waiting for active nodes after 1800 seconds
+ exit 3

This will fail the entire upgrade process.
This will also impact #2637.
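For reference, an approximate reconstruction of the wait loop that produces the trace above (the hard-coded 6 stands in for the node count baked into the script when the template was generated; this is a sketch, not the verbatim CSE code):

# wait up to 1800 seconds for the expected number of nodes to report Ready
for i in {1..1800}; do
    nodes=$(/usr/local/bin/kubectl get nodes 2>/dev/null | grep Ready | wc -l)
    [ "$nodes" -eq 6 ] && break
    sleep 1
done
if [ "$nodes" -ne 6 ]; then
    echo "still waiting for active nodes after 1800 seconds"
    exit 3
fi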

@jackfrancis (Member Author)

Thanks for adding that @EPinci. If you're able to walk through the above scenario using this PR and see a different (successful) outcome, that will provide evidence that we should prioritize this work.

@EPinci (Contributor) commented Apr 10, 2018

Yup, I'll run through it on my test cluster and report back.

On second thought, however, instead of skipping the node check entirely, could it be beneficial to just check the local node's readiness?
Something like:

nodeready=$(${KUBECTL} get nodes 2>/dev/null | grep "$HOSTNAME" | grep 'Ready' | wc -l)
if [ "$nodeready" -eq 1 ]; then
  : # local node has registered with the apiserver and reports Ready
fi

@jackfrancis (Member Author) commented Apr 10, 2018

@EPinci That would largely amount to the same thing. The ARM deployment will return CSE success based on the set of all CSE executions having a zero exit code. There is a difference between "waiting a certain amount of time on each master node for all nodes to be ready" and "waiting a certain amount of time on each node (masters + agents) for itself to check in as ready with the control plane", but that difference is not meaningful from the perspective of "validate that all the nodes are Ready".

(I'm also glossing over the fact that we don't currently set up kubectl w/ cluster context on the agent nodes, so that alternate check would involve introducing that.)
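For illustration, roughly what the per-node "wait for myself to report Ready" variant could look like; it assumes kubectl is configured with cluster credentials on the node (which, as noted above, is currently only true on masters), and the function name is hypothetical:

local_node_ready_check() {
    # wait up to 10 minutes for this node to register as Ready with the apiserver
    for i in {1..600}; do
        # grep -w avoids counting a NotReady status as Ready
        if ${KUBECTL} get nodes 2>/dev/null | grep "$(hostname)" | grep -qw Ready; then
            return 0
        fi
        sleep 1
    done
    echo "node $(hostname) did not report Ready within 600 seconds"
    return 1
}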

Generally speaking, what we're checking for now is "up to a certain period of time after provisioning, do we see a functional k8s cluster w/ the full set of Ready nodes participating behind the apiserver at least once?" Admittedly, checking something like that is arguably a k8s anti-pattern, as clusters are designed to be 100% functional with node participation fluidity; but doing that check exposes valuable behavioral data about the real-world effects of our post-provisioning, app-layer configuration implementation. We want 100% end state outcome achievement for known-working cluster configurations, and so in order to know and protect against novel edge cases, we need such a check to be operable over time.

However, it is also arguable that such a check does not need to concern a user's experience building a new cluster. Should we rather return control to the user when their cluster has a functioning apiserver and nodes that are able to schedule workloads? In addition to giving users a few minutes of their time back (especially for clusters with high node counts), it is arguably more k8s idiomatic to do so.

However (part deux), I would argue that we don't want to make that change until the project has confidence that CI/CD is continually doing the comprehensive end state outcome validation for all code changes going forward. Without that check, the resilience of our cluster provision scripts will decay and leave users with working clusters, but with intermittent non-working nodes (which they may be paying for if we successfully provision a VM but don't successfully join it to the cluster).

@EPinci (Contributor) commented Apr 10, 2018

I ran this PR through a test. It knocked out my test cluster and I could not understand why.
I redeployed my original setup (3 masters, 3 nodes, kube v1.9.3) and manually destroyed two nodes.
The upgrade ran for master0 and master1 but immediately failed for master2 with the following:

INFO[1873] Starting ARM Deployment (master-18-04-10T23.13.13-781252198). This will take some time...
INFO[1919] Finished ARM Deployment (master-18-04-10T23.13.13-781252198). Error: resources.DeploymentsClient#CreateOrUpdate: Failure sending request: StatusCode=200 -- Original Error: Long running operation terminated with status 'Failed': Code="DeploymentFailed" Message="At least one resource deployment operation failed. Please list deployment operations for details. Please see https://aka.ms/arm-debug for usage details."
INFO[1920] Error creating upgraded master VM: k8s-master-93283987-2
FATA[1920] Error upgrading cluster: resources.DeploymentsClient#CreateOrUpdate: Failure sending request: StatusCode=200 -- Original Error: Long running operation terminated with status 'Failed': Code="DeploymentFailed" Message="At least one resource deployment operation failed. Please list deployment operations for details. Please see https://aka.ms/arm-debug for usage details."

The deployment failure is "subnet full", which is weird because I have a dedicated /24 subnet for three masters with ipAddressCount set to 20.
I checked the subnet and it has, in fact, filled up because master0 and master1 have now allocated 111 addresses each. I take it something else is going really wrong...

@jackfrancis jackfrancis changed the title WIP: CSE all nodes Ready check is optional CSE all nodes Ready check is opt-in (KubernetesDebug) Apr 10, 2018
@jackfrancis jackfrancis changed the title CSE all nodes Ready check is opt-in (KubernetesDebug) Do not check for nodes Ready in CSE on masters Apr 12, 2018
@@ -233,7 +233,7 @@ func (cli *CLIProvisioner) FetchProvisioningMetrics(path string, cfg *config.Con
agentFiles := []string{"/var/log/azure/cluster-provision.log", "/var/log/cloud-init.log",
"/var/log/cloud-init-output.log", "/var/log/syslog", "/var/log/azure/custom-script/handler.log",
"/opt/m", "/opt/azure/containers/kubelet.sh", "/opt/azure/containers/provision.sh",
"/opt/azure/provision-ps.log", "/var/log/azure/dnsdump.pcap"}
@jackfrancis (Member Author) commented on the diff above:
/var/log/azure/dnsdump.pcap is no longer a file generated by provision scripts

retrycmd_if_failure 100 1 10 systemctl daemon-reload && systemctl restart kubelet
fi
retrycmd_if_failure 100 1 10 systemctl daemon-reload && systemctl restart kubelet
retrycmd_if_failure 10 1 3 systemctl status kubelet --no-pager -l > /var/log/azure/kubelet-status.log
A contributor commented on the diff above:

We should consider doing this for other daemons as well
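As a rough sketch of that suggestion, the same status-capture pattern could be reused for other units (docker and etcd are shown purely as examples; the unit names would need to match what actually runs on the node):

# hypothetical extension of the kubelet status capture to other daemons
retrycmd_if_failure 10 1 3 systemctl status docker --no-pager -l > /var/log/azure/docker-status.log
retrycmd_if_failure 10 1 3 systemctl status etcd --no-pager -l > /var/log/azure/etcd-status.log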

@CecileRobertMichon (Contributor) left a comment:

lgtm
