From 610fadb2fc14b816b38798897e73d360f5001bb9 Mon Sep 17 00:00:00 2001
From: CecileRobertMichon
Date: Tue, 29 May 2018 15:14:58 -0700
Subject: [PATCH 1/3] update troubleshooting k8s doc

---
 docs/kubernetes/troubleshooting.md | 58 ++++++++++++++++++++++++++----
 1 file changed, 52 insertions(+), 6 deletions(-)

diff --git a/docs/kubernetes/troubleshooting.md b/docs/kubernetes/troubleshooting.md
index f5367429ec..82ee143246 100644
--- a/docs/kubernetes/troubleshooting.md
+++ b/docs/kubernetes/troubleshooting.md
@@ -1,12 +1,58 @@
-## Troubleshooting
+# Troubleshooting
 
-### Scaling up or down
+## VMExtensionProvisioningError or VMExtensionProvisioningTimeout
 
-Scaling your cluster up or down requires different parameters and template than the create. More details here [Scale up](../../examples/scale-up/README.md)
+The two above VMExtensionProvisioning— errors tell us that a VM in the cluster failed to install the required application prerequisites after CRP provisioned the VM into the resource group. When acs-engine creates a new Kubernetes cluster, a series of shell scripts runs to install prereqs like docker, etcd, the Kubernetes runtime, and various other host OS packages that support the Kubernetes application layer. *Usually* this indicates one of the following:
 
-If your cluster is not reachable, you can run the following command to check for common failures.
+1. Something about the cluster configuration is pathological. For example, perhaps the cluster config includes a custom version of a particular software dependency that doesn't exist. Or, as another example, for a cluster created inside a custom VNET (i.e., a user-provided, pre-existing VNET), perhaps that custom VNET does not have general outbound internet access, and so apt, docker pull, etc., are not able to execute successfully.
+2. A transient Azure environmental error caused the shell script operation to time out, or exceed its retry count. For example, the shell script may attempt to download a required package (e.g., etcd), and if the Azure networking environment for the newly provisioned VM is flaky for a period of time, then the shell script may retry several times, but eventually time out and fail.
 
-### Misconfigured Service Principal
+For classification #1 above, the appropriate strategic response is to figure out what about the cluster configuration is incorrect and fix it. We expect such scenarios to always fail in the above way: cluster deployments will not be successful until the cluster configuration is made correct.
+
+For classification #2 above, the appropriate strategic response is to retry a few times. If a 2nd or 3rd attempt succeeds, it is a hint that a transient environmental condition was the cause of the initial failure.
+
+### What is CSE?
+
+CSE stands for CustomScriptExtension, and is just a way of expressing: "a script that executes as part of the VM provisioning process, and that must exit 0 (i.e., successfully) in order for that VM provisioning process to succeed". Basically it's another way of expressing the VMExtensionProvisioning— concept above.
+
+To summarize, the way that acs-engine implements Kubernetes on Azure is a collection of (1) Azure VM configuration + (2) shell script execution. Both are implemented as a single operational unit, and when #2 fails, we consider the entire VM provisioning operation to be a failure; more importantly, if only one VM in the cluster deployment fails, we consider the entire cluster operation to be a failure.
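+
+To see which VM actually failed provisioning, one option is to query Azure directly (a sketch, not acs-engine tooling; it assumes the Azure CLI `az` is installed and logged in, and `<resource-group>` is a placeholder for your cluster's resource group):
+
+```
+# Print each VM's extensions with their provisioningState;
+# an extension left in the "Failed" state marks the VM whose CSE failed.
+for vm in $(az vm list -g <resource-group> --query "[].name" -o tsv); do
+  echo "== $vm"
+  az vm extension list -g <resource-group> --vm-name "$vm" \
+    --query "[].{name:name, state:provisioningState}" -o table
+done
+```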
+
+### How To Debug CSE errors
+
+In order to troubleshoot a cluster that failed in the above way(s), we need to grab the CSE logs from the host VM itself.
+
+• from a vm node that did not provision successfully:
+ ○ grab the entire file at `/var/log/azure/cluster-provision.log`
+ ○ grab the entire file at `/var/log/cloud-init-output.log`
+
+How to determine the above?
+
+• from a working master: kubectl get nodes
+ ○ are there any missing master or agent nodes?
+ § if so, that node vm probably failed CSE: grab the log file above from that vm
+ ○ are there no working master nodes?
+ § if so, then all node vms probably failed CSE: grab the log file above from any node vm
+
+CSE Exit Codes
+
+```
+"code": "VMExtensionProvisioningError"
+"message": "VM has reported a failure when processing extension 'cse1'. Error message: "Enable failed: failed to
+execute command: command terminated with exit status=20\n[stdout]\n\n[stderr]\n"."
+```
+
+Look for the exit code. In the above example, the exit code is `20`. The list of exit codes and their meaning can be found [here](../../parts/k8s/kubernetescustomscript.sh).
+
+If after following the above you are still unable to troubleshoot your deployment error, please open a GitHub issue with title `CSE error: exit code ` and include the following in the description:
+
+1. The apimodel json used to deploy the cluster (aka your cluster config). **Please make sure you remove all secrets and keys before posting it on GitHub.**
+
+2. The output of `kubectl get nodes`
+
+3. The content of `/var/log/azure/cluster-provision.log` and `/var/log/cloud-init-output.log`
+
+
+## Misconfigured Service Principal
 
 If your Service Principal is misconfigured, none of the Kubernetes components will come up in a healthy manner.
 You can check to see if this the problem:
@@ -21,4 +67,4 @@ read and **write** permissions to the target Subscription.
 
 `Nov 10 16:35:22 k8s-master-43D6F832-0 docker[3177]: E1110 16:35:22.840688 3201 kubelet_node_status.go:69] Unable to construct api.Node object for kubelet: failed to get external ID from cloud provider: autorest#WithErrorUnlessStatusCode: POST https://login.microsoftonline.com/72f988bf-86f1-41af-91ab-2d7cd011db47/oauth2/token?api-version=1.0 failed with 400 Bad Request: StatusCode=400`
 
-3. [Link](../serviceprincipal.md) to documentation on how to create/configure a service principal for an ACS-Engine Kubernetes cluster.
+[This documentation](../serviceprincipal.md) explains how to create/configure a service principal for an ACS-Engine Kubernetes cluster.

From 0343ed68e925c475367a3148d40fbce7f18beb66 Mon Sep 17 00:00:00 2001
From: CecileRobertMichon
Date: Tue, 29 May 2018 15:22:16 -0700
Subject: [PATCH 2/3] format

---
 docs/kubernetes/troubleshooting.md | 23 +++++++++++++----------
 1 file changed, 13 insertions(+), 10 deletions(-)

diff --git a/docs/kubernetes/troubleshooting.md b/docs/kubernetes/troubleshooting.md
index 82ee143246..290fef49e7 100644
--- a/docs/kubernetes/troubleshooting.md
+++ b/docs/kubernetes/troubleshooting.md
@@ -21,19 +21,22 @@ To summarize, the way that acs-engine implements Kubernetes on Azure is a collec
 
 In order to troubleshoot a cluster that failed in the above way(s), we need to grab the CSE logs from the host VM itself.
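+
+For example (a sketch with placeholder values; substitute your own admin username and the node's IP or FQDN):
+
+```
+# Copy both CSE-related logs off the affected VM for inspection
+scp <admin-user>@<node-ip-or-fqdn>:/var/log/azure/cluster-provision.log .
+scp <admin-user>@<node-ip-or-fqdn>:/var/log/cloud-init-output.log .
+```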
 
-• from a vm node that did not provision successfully:
- ○ grab the entire file at `/var/log/azure/cluster-provision.log`
- ○ grab the entire file at `/var/log/cloud-init-output.log`
+From a VM node that did not provision successfully:
+
+- grab the entire file at `/var/log/azure/cluster-provision.log`
+
+- grab the entire file at `/var/log/cloud-init-output.log`
 
 How to determine the above?
 
-• from a working master: kubectl get nodes
- ○ are there any missing master or agent nodes?
- § if so, that node vm probably failed CSE: grab the log file above from that vm
- ○ are there no working master nodes?
- § if so, then all node vms probably failed CSE: grab the log file above from any node vm
+From a working master: `kubectl get nodes`
+
+- Are there any missing master or agent nodes?
+  - if so, that node vm probably failed CSE: grab the log file above from that vm
+- Are there no working master nodes?
+  - if so, then all node vms probably failed CSE: grab the log file above from any node vm
 
-CSE Exit Codes
+#### CSE Exit Codes
 
 ```
 "code": "VMExtensionProvisioningError"
@@ -43,7 +46,7 @@ execute command: command terminated with exit status=20\n[stdout]\n\n[stderr]\n"."
 ```
 
 Look for the exit code. In the above example, the exit code is `20`. The list of exit codes and their meaning can be found [here](../../parts/k8s/kubernetescustomscript.sh).
 
-If after following the above you are still unable to troubleshoot your deployment error, please open a GitHub issue with title `CSE error: exit code ` and include the following in the description:
+If after following the above you are still unable to troubleshoot your deployment error, please open a GitHub issue with title "CSE error: exit code " and include the following in the description:
 
 1. The apimodel json used to deploy the cluster (aka your cluster config). **Please make sure you remove all secrets and keys before posting it on GitHub.**

From 1305674854baec789f003dc36e1bc8f92e7e6ed7 Mon Sep 17 00:00:00 2001
From: CecileRobertMichon
Date: Tue, 29 May 2018 15:27:42 -0700
Subject: [PATCH 3/3] Correct how to get logs from right vm

---
 docs/kubernetes/troubleshooting.md | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/docs/kubernetes/troubleshooting.md b/docs/kubernetes/troubleshooting.md
index 290fef49e7..d09a5e08d2 100644
--- a/docs/kubernetes/troubleshooting.md
+++ b/docs/kubernetes/troubleshooting.md
@@ -29,12 +29,14 @@ From a VM node that did not provision successfully:
 
 How to determine the above?
 
-From a working master: `kubectl get nodes`
+1. Look at the deployment error message. The error should indicate which VM extension failed the deployment. For example, `cse-master-0` means that the CSE extension of VM master 0 failed.
+
+2. From a master node: `kubectl get nodes`
 
 - Are there any missing master or agent nodes?
-  - if so, that node vm probably failed CSE: grab the log file above from that vm
-- Are there no working master nodes?
-  - if so, then all node vms probably failed CSE: grab the log file above from any node vm
+  - if so, that node VM probably failed CSE: grab the log files above from that VM
+- Are there no working nodes?
+  - if so, grab the log files above from the master VM you are on
 
 #### CSE Exit Codes
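+
+To map a reported exit status back to its meaning without opening the script in a browser, one option is to search a local checkout of the acs-engine repo (a sketch; a small number like this may also match unrelated text, so read the surrounding lines):
+
+```
+# Find where the failing code (20 in the example above) appears in the CSE script
+grep -n "20" parts/k8s/kubernetescustomscript.sh
+```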