From e4d439a7158d34f06696e17321bcf9c83c244334 Mon Sep 17 00:00:00 2001
From: Zihong Zheng <zihongz@google.com>
Date: Fri, 5 May 2017 10:05:44 -0700
Subject: [PATCH] Add proposal for GCE L4 load-balancer health check (#552)

---
 gce-l4-loadbalancer-healthcheck.md | 61 ++++++++++++++++++++++++++++++
 1 file changed, 61 insertions(+)
 create mode 100644 gce-l4-loadbalancer-healthcheck.md

diff --git a/gce-l4-loadbalancer-healthcheck.md b/gce-l4-loadbalancer-healthcheck.md
new file mode 100644
index 00000000000..22e6d17b12b
--- /dev/null
+++ b/gce-l4-loadbalancer-healthcheck.md
@@ -0,0 +1,61 @@
+# GCE L4 load-balancers' health checks for nodes
+
+## Goal
+Set up health checks for GCE L4 load-balancer to ensure it is only
+targeting healthy nodes.
+
+## Motivation
+On cloud providers which support external load balancers, setting the
+type field to "LoadBalancer" will provision a L4 load-balancer for the
+service ([doc](https://kubernetes.io/docs/concepts/services-networking/service/#type-loadbalancer)),
+which load-balances traffic to k8s nodes. As of k8s 1.6, we don't
+create health check for L4 load-balancer by default, which means all
+traffic will be forwarded to any one of the nodes blindly.
+
+This is undesired in cases:
+- k8s components including kubelet dead on nodes. Nodes will be flipped
+to unhealthy after a long propagation (~40s), even if we remove nodes
+from target pool at that point it is too slow.
+- kube-proxy dead on nodes while kubelet is still alive. Requests will
+be continually forwarded to nodes that may not be able to properly route
+traffic.
+
+For now, the only case health check will be created is for
+[OnlyLocal Service](https://kubernetes.io/docs/tutorials/services/source-ip/#source-ip-for-services-with-typeloadbalancer).
+We should have a node-level health check for load balancers that are used
+by non-OnlyLocal services.
+
+## Design
+Healthchecking the kube-proxys seems to be the best choice:
+- kube-proxy runs on every nodes and it is the pivot for service traffic
+routing.
+- Port 10249 on nodes is currently used for both kube-proxy's healthz and
+pprof.
+- We already have a similar mechanism for healthchecking OnlyLocal services
+in kube-proxy.
+
+The plan is to enable health check on all LoadBalancer services (if use GCP
+as cloud provider).
+
+## Implementation
+kube-proxy
+- Separate healthz from pprof (/metrics) to use a different port and bind it
+to 0.0.0.0. As we will only allow traffic from load-balancer source IPs, this
+wouldn't be a big security concern.
+- Make healthz check timestamp in iptables mode while always returns "ok" in
+other modes.
+
+GCE cloud provider (through kube-controller-manager)
+- Manage `k8s-l4-healthcheck` firewall and healthcheck resources.
+These two resources should be shared among all non-OnlyLocal LoadBalancer
+services.
+- Add a new flag to pipe in the healthz port num as it is configurable on
+kube-proxy.
+
+Version skew:
+- Running higher version master (with L4 health check feature enabled) with
+lower version nodes (without kube-proxy exposing healthz port) should fall
+back to the original behavior (no health check).
+- Rollback shouldn't be a big issue. Even if health check is left on Network
+load-balancer, it will fail on all nodes and fall back to blindly forwarding
+traffic.