From 610fadb2fc14b816b38798897e73d360f5001bb9 Mon Sep 17 00:00:00 2001
From: CecileRobertMichon
Date: Tue, 29 May 2018 15:14:58 -0700
Subject: [PATCH 1/3] update troubleshooting k8s doc

---
 docs/kubernetes/troubleshooting.md | 58 ++++++++++++++++++++++++++----
 1 file changed, 52 insertions(+), 6 deletions(-)

diff --git a/docs/kubernetes/troubleshooting.md b/docs/kubernetes/troubleshooting.md
index f5367429ec..82ee143246 100644
--- a/docs/kubernetes/troubleshooting.md
+++ b/docs/kubernetes/troubleshooting.md
@@ -1,12 +1,58 @@
-## Troubleshooting
+# Troubleshooting
 
-### Scaling up or down
+## VMExtensionProvisioningError or VMExtensionProvisioningTimeout
 
-Scaling your cluster up or down requires different parameters and template than the create. More details here [Scale up](../../examples/scale-up/README.md)
+The two above VMExtensionProvisioning— errors tell us that a VM in the cluster failed to install the required application prerequisites after CRP provisioned the VM into the resource group. When acs-engine creates a new Kubernetes cluster, a series of shell scripts runs to install prereqs like docker, etcd, the Kubernetes runtime, and various other host OS packages that support the Kubernetes application layer. *Usually* this indicates one of the following:
 
-If your cluster is not reachable, you can run the following command to check for common failures.
+1. Something about the cluster configuration is pathological. For example, perhaps the cluster config includes a custom version of a particular software dependency that doesn't exist. Or, as another example, for a cluster created inside a custom VNET (i.e., a user-provided, pre-existing VNET), perhaps that custom VNET does not have general outbound internet access, and so apt, docker pull, etc., are not able to execute successfully.
+2. A transient Azure environmental error caused the shell script operation to time out, or exceed its retry count. For example, the shell script may attempt to download a required package (e.g., etcd), and if the Azure networking environment for the newly provisioned VM is flaky for a period of time, then the shell script may retry several times, but eventually time out and fail.
 
-### Misconfigured Service Principal
+For classification #1 above, the appropriate strategic response is to figure out what about the cluster configuration is incorrect and fix it. We expect such scenarios to always fail in the above way: cluster deployments will not be successful until the cluster configuration is made correct.
+
+For classification #2 above, the appropriate strategic response is to retry a few times. If a 2nd or 3rd attempt succeeds, it is a hint that a transient environmental condition was the cause of the initial failure.
+
+### What is CSE?
+
+CSE stands for CustomScriptExtension, and is just a way of expressing: "a script that executes as part of the VM provisioning process, and that must exit 0 (i.e., successfully) in order for that VM provisioning process to succeed". Basically it's another way of expressing the VMExtensionProvisioning— concept above.
+
+To summarize, the way that acs-engine implements Kubernetes on Azure is a collection of (1) Azure VM configuration + (2) shell script execution. Both are implemented as a single operational unit, and when #2 fails, we consider the entire VM provisioning operation to be a failure; more importantly, if only one VM in the cluster deployment fails, we consider the entire cluster operation to be a failure.
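+
+To see which VM actually failed provisioning, one option is to query Azure directly (a sketch, not acs-engine tooling; it assumes the Azure CLI `az` is installed and logged in, and `<resource-group>` is a placeholder for your cluster's resource group):
+
+```
+# Print each VM's extensions with their provisioningState;
+# an extension left in the "Failed" state marks the VM whose CSE failed.
+for vm in $(az vm list -g <resource-group> --query "[].name" -o tsv); do
+  echo "== $vm"
+  az vm extension list -g <resource-group> --vm-name "$vm" \
+    --query "[].{name:name, state:provisioningState}" -o table
+done
+```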
+
+### How To Debug CSE errors
+
+In order to troubleshoot a cluster that failed in the above way(s), we need to grab the CSE logs from the host VM itself.
+
+• from a vm node that did not provision successfully:
+ ○ grab the entire file at `/var/log/azure/cluster-provision.log`
+ ○ grab the entire file at `/var/log/cloud-init-output.log`
+
+How to determine the above?
+
+• from a working master: kubectl get nodes
+ ○ are there any missing master or agent nodes?
+ § if so, that node vm probably failed CSE: grab the log file above from that vm
+ ○ are there no working master nodes?
+ § if so, then all node vms probably failed CSE: grab the log file above from any node vm
+
+CSE Exit Codes
+
+```
+"code": "VMExtensionProvisioningError"
+"message": "VM has reported a failure when processing extension 'cse1'. Error message: "Enable failed: failed to
+execute command: command terminated with exit status=20\n[stdout]\n\n[stderr]\n"."
+```
+
+Look for the exit code. In the above example, the exit code is `20`. The list of exit codes and their meaning can be found [here](../../parts/k8s/kubernetescustomscript.sh).
+
+If after following the above you are still unable to troubleshoot your deployment error, please open a GitHub issue with title `CSE error: exit code ` and include the following in the description:
+
+1. The apimodel json used to deploy the cluster (aka your cluster config). **Please make sure you remove all secrets and keys before posting it on GitHub.**
+
+2. The output of `kubectl get nodes`
+
+3. The content of `/var/log/azure/cluster-provision.log` and `/var/log/cloud-init-output.log`
+
+
+## Misconfigured Service Principal
 
 If your Service Principal is misconfigured, none of the Kubernetes components will come up in a healthy manner.
 You can check to see if this the problem:
@@ -21,4 +67,4 @@ read and **write** permissions to the target Subscription.
 
 `Nov 10 16:35:22 k8s-master-43D6F832-0 docker[3177]: E1110 16:35:22.840688 3201 kubelet_node_status.go:69] Unable to construct api.Node object for kubelet: failed to get external ID from cloud provider: autorest#WithErrorUnlessStatusCode: POST https://login.microsoftonline.com/72f988bf-86f1-41af-91ab-2d7cd011db47/oauth2/token?api-version=1.0 failed with 400 Bad Request: StatusCode=400`
 
-3. [Link](../serviceprincipal.md) to documentation on how to create/configure a service principal for an ACS-Engine Kubernetes cluster.
+[This documentation](../serviceprincipal.md) explains how to create/configure a service principal for an ACS-Engine Kubernetes cluster.

From 0343ed68e925c475367a3148d40fbce7f18beb66 Mon Sep 17 00:00:00 2001
From: CecileRobertMichon
Date: Tue, 29 May 2018 15:22:16 -0700
Subject: [PATCH 2/3] format

---
 docs/kubernetes/troubleshooting.md | 23 +++++++++++++----------
 1 file changed, 13 insertions(+), 10 deletions(-)

diff --git a/docs/kubernetes/troubleshooting.md b/docs/kubernetes/troubleshooting.md
index 82ee143246..290fef49e7 100644
--- a/docs/kubernetes/troubleshooting.md
+++ b/docs/kubernetes/troubleshooting.md
@@ -21,19 +21,22 @@ To summarize, the way that acs-engine implements Kubernetes on Azure is a collec
 
 In order to troubleshoot a cluster that failed in the above way(s), we need to grab the CSE logs from the host VM itself.
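+
+For example (a sketch with placeholder values; substitute your own admin username and the node's IP or FQDN):
+
+```
+# Copy both CSE-related logs off the affected VM for inspection
+scp <admin-user>@<node-ip-or-fqdn>:/var/log/azure/cluster-provision.log .
+scp <admin-user>@<node-ip-or-fqdn>:/var/log/cloud-init-output.log .
+```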
 
-• from a vm node that did not provision successfully:
- ○ grab the entire file at `/var/log/azure/cluster-provision.log`
- ○ grab the entire file at `/var/log/cloud-init-output.log`
+From a VM node that did not provision successfully:
+
+- grab the entire file at `/var/log/azure/cluster-provision.log`
+
+- grab the entire file at `/var/log/cloud-init-output.log`
 
 How to determine the above?
 
-• from a working master: kubectl get nodes
- ○ are there any missing master or agent nodes?
- § if so, that node vm probably failed CSE: grab the log file above from that vm
- ○ are there no working master nodes?
- § if so, then all node vms probably failed CSE: grab the log file above from any node vm
+From a working master: `kubectl get nodes`
+
+- Are there any missing master or agent nodes?
+  - if so, that node vm probably failed CSE: grab the log file above from that vm
+- Are there no working master nodes?
+  - if so, then all node vms probably failed CSE: grab the log file above from any node vm
 
-CSE Exit Codes
+#### CSE Exit Codes
 
 ```
 "code": "VMExtensionProvisioningError"
@@ -43,7 +46,7 @@ execute command: command terminated with exit status=20\n[stdout]\n\n[stderr]\n"."
 ```
 
 Look for the exit code. In the above example, the exit code is `20`. The list of exit codes and their meaning can be found [here](../../parts/k8s/kubernetescustomscript.sh).
 
-If after following the above you are still unable to troubleshoot your deployment error, please open a GitHub issue with title `CSE error: exit code ` and include the following in the description:
+If after following the above you are still unable to troubleshoot your deployment error, please open a GitHub issue with title "CSE error: exit code " and include the following in the description:
 
 1. The apimodel json used to deploy the cluster (aka your cluster config). **Please make sure you remove all secrets and keys before posting it on GitHub.**

From 1305674854baec789f003dc36e1bc8f92e7e6ed7 Mon Sep 17 00:00:00 2001
From: CecileRobertMichon
Date: Tue, 29 May 2018 15:27:42 -0700
Subject: [PATCH 3/3] Correct how to get logs from right vm

---
 docs/kubernetes/troubleshooting.md | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/docs/kubernetes/troubleshooting.md b/docs/kubernetes/troubleshooting.md
index 290fef49e7..d09a5e08d2 100644
--- a/docs/kubernetes/troubleshooting.md
+++ b/docs/kubernetes/troubleshooting.md
@@ -29,12 +29,14 @@ From a VM node that did not provision successfully:
 
 How to determine the above?
 
-From a working master: `kubectl get nodes`
+1. Look at the deployment error message. The error should indicate which VM extension failed the deployment. For example, `cse-master-0` means that the CSE extension of VM master 0 failed.
+
+2. From a master node: `kubectl get nodes`
 
 - Are there any missing master or agent nodes?
-  - if so, that node vm probably failed CSE: grab the log file above from that vm
-- Are there no working master nodes?
-  - if so, then all node vms probably failed CSE: grab the log file above from any node vm
+  - if so, that node VM probably failed CSE: grab the log files above from that VM
+- Are there no working nodes?
+  - if so, grab the log files above from the master VM you are on
 
 #### CSE Exit Codes
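+
+To map a reported exit status back to its meaning without opening the script in a browser, one option is to search a local checkout of the acs-engine repo (a sketch; a small number like this may also match unrelated text, so read the surrounding lines):
+
+```
+# Find where the failing code (20 in the example above) appears in the CSE script
+grep -n "20" parts/k8s/kubernetescustomscript.sh
+```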