Merge pull request #1354 from satyasm/iptables-no-cluster-cidr
KEP: kube-proxy detect "local" traffic w/o cluster CIDR
k8s-ci-robot authored Dec 13, 2019
2 parents 5df0ee9 + 8c032f6 commit 1b2b1e9
keps/sig-network/20191104-iptables-no-cluster-cidr.md: 372 additions, 0 deletions
---
title: Remove knowledge of pod cluster CIDR from iptables rules
authors:
- "@satyasm"
owning-sig: sig-network
participating-sigs:
reviewers:
- "@thockin"
- "@caseydavenport"
- "@mikespreitzer"
- "@aojea"
- "@fasaxc"
- "@squeed"
- "@bowei"
- "@dcbw"
- "@darwinship"
approvers:
- "@thockin"
editor: TBD
creation-date: 2019-11-04
last-updated: 2019-11-27
status: implementable
see-also:
replaces:
superseded-by:
---

# Removing Knowledge of pod cluster CIDR from iptables rules

## Table of Contents

<!-- toc -->
- [Summary](#summary)
- [Motivation](#motivation)
- [Goals](#goals)
- [Non-Goals](#non-goals)
- [Proposal](#proposal)
- [iptables - masquerade off cluster traffic to services by node IP](#iptables---masquerade-off-cluster-traffic-to-services-by-node-ip)
- [iptables - redirecting pod traffic to external loadbalancer VIP to cluster IP](#iptables---redirecting-pod-traffic-to-external-loadbalancer-vip-to-cluster-ip)
- [iptables - accepting traffic after first packet, after being accepted by kubernetes rules](#iptables---accepting-traffic-after-first-packet-after-being-accepted-by-kubernetes-rules)
- [ipvs - masquerade off cluster traffic to services by node IP](#ipvs---masquerade-off-cluster-traffic-to-services-by-node-ip)
- [ipvs - accepting traffic after first packet, after being accepted by kubernetes rules](#ipvs---accepting-traffic-after-first-packet-after-being-accepted-by-kubernetes-rules)
- [Risks and Mitigations](#risks-and-mitigations)
- [Design Details](#design-details)
- [Graduation Criteria](#graduation-criteria)
- [Implementation History](#implementation-history)
- [Drawbacks [optional]](#drawbacks-optional)
- [Alternatives [optional]](#alternatives-optional)
- [Multiple cluster CIDR rules](#multiple-cluster-cidr-rules)
- [ip-masq-agent like behavior](#ip-masq-agent-like-behavior)
<!-- /toc -->

## Summary

The iptables implementation of kube-proxy today references the cluster CIDR for pods in three places, for the following purposes:

1. [Masquerade off cluster traffic to services by node IP](https://github.com/kubernetes/kubernetes/blob/v1.17.0/pkg/proxy/iptables/proxier.go#L965-L970)
2. [Redirecting pod traffic to external loadbalancer VIP to cluster IP](https://github.com/kubernetes/kubernetes/blob/v1.17.0/pkg/proxy/iptables/proxier.go#L1327-L1339)
3. [Accepting traffic after first packet, after being accepted by kubernetes rules](https://github.com/kubernetes/kubernetes/blob/v1.17.0/pkg/proxy/iptables/proxier.go#L1468-L1490)

In addition, the ipvs implementation references it in two places for similar purposes:

1. [Masquerade off cluster traffic to services by node IP](https://github.com/kubernetes/kubernetes/blob/master/pkg/proxy/ipvs/proxier.go#L1649-L1654)
2. [Accepting traffic after first packet, after being accepted by kubernetes](https://github.com/kubernetes/kubernetes/blob/master/pkg/proxy/ipvs/proxier.go#L1726-L1745)

This enhancement proposes ways to achieve the same goals without tracking the pod cluster CIDR.

## Motivation

The idea that makes the kubernetes networking model unique and powerful is the concept of each pod having its own IP,
with all pod IPs natively routable within the cluster. The service chains in the iptables rules depend on this
capability: they assume that all endpoints of a cluster can be treated as equivalent, and they load balance service
traffic across all the endpoints by simply translating the destination to a pod IP address.

While this is powerful, it also means pod IP addresses are in many cases the constraining resource for cluster creation
and scale. It would be valuable for implementations to have different strategies for managing pod IP addresses that can
adapt to different environment needs.

Some examples of use cases:

* Creating a cluster out of many disjoint ranges instead of a single range.
* Expanding a cluster with more disjoint ranges after initial creation.

Not having to depend on the cluster pod CIDR for routing service traffic would effectively de-couple pod IP management
and allocation strategies from service management and routing. That, in turn, would make it far cheaper to evolve
IP allocation schemes while sharing the same service implementation, significantly lowering the bar
for adopting alternate schemes.

Alternate implementations that don’t use iptables could adopt the same reasoning to avoid tracking the cluster
CIDR when routing cluster traffic.

### Goals

* Not having to depend on the cluster pod CIDR for iptables rules and cluster traffic routing.

### Non-Goals

* Providing alternate models of IP allocation schemes for pod CIDR.
* Enhancing current allocators to handle disjoint ranges.
* Enhancing current allocators to add additional ranges after cluster creation.
* Changing current assumptions around having a single pod CIDR per node.

## Proposal

As stated above, the goal is to re-implement the functionality called out in the summary, but in a
way that does not depend on a pod cluster CIDR. The essence of the proposal is that, for the
first two cases in the iptables implementation and the first case in ipvs, we can replace the `-s proxier.clusterCIDR`
match with some notion of node-local pod traffic.

The core question in these cases is how to distinguish cluster-originated traffic from non-cluster-originated traffic.
The proposal is that tracking pod traffic generated from within the node is sufficient to identify cluster-originated
traffic. For the first two use cases in iptables and the first use case in ipvs, we can replace
`proxier.clusterCIDR` with one of the following to determine cluster-originated traffic:

1. `-s node.podCIDR` (where the node's podCIDR is used for allocating pod IPs within the node)
2. `--in-interface prefix+` (where all pod interfaces start with the same prefix,
or where all pod traffic appears to come from a single bridge or other interface)
3. `-m physdev --physdev-is-in` (for kubenet if we don’t want to depend on node podCIDR)

Note that the above are equivalent definitions when considering only pod traffic originating from within the node.
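
To make these alternatives concrete, the following is a minimal sketch of the match fragments each option could
produce; the `nodePodCIDR` and `podInterfacePrefix` values shown here are hypothetical placeholders, not existing
kube-proxy configuration.

```go
// Sketch only: three alternative iptables match fragments for "pod traffic
// originating on this node". The concrete values are illustrative.
var (
    nodePodCIDR        = "10.0.1.0/24" // e.g. taken from node.spec.podCIDR
    podInterfacePrefix = "veth"        // e.g. all pod interfaces are named veth*
)

var (
    matchNodeCIDR  = []string{"-s", nodePodCIDR}                  // option 1: node pod CIDR
    matchInterface = []string{"-i", podInterfacePrefix + "+"}     // option 2: interface prefix
    matchBridge    = []string{"-m", "physdev", "--physdev-is-in"} // option 3: bridged (kubenet-style) traffic
)
```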

Given that this KEP only addresses usage of the cluster CIDR (for pods), and that pods with hostNetwork are not
impacted by it, the assumption is that hostNetwork pod behavior will continue to work as-is.

For the last use case noted above, in both iptables and ipvs, the proposal is to drop the reference to the cluster CIDR entirely.

The reasoning behind why this works is as follows.

### iptables - masquerade off cluster traffic to services by node IP

The [rule here currently](
https://github.com/kubernetes/kubernetes/blob/v1.17.0/pkg/proxy/iptables/proxier.go#L965-L970
) looks as follows

```go
// This masquerades off-cluster traffic to a service VIP. The idea
// is that you can establish a static route for your Service range,
// routing to any node, and that node will bridge into the Service
// for you. Since that might bounce off-node, we masquerade here.
// If/when we support "Local" policy for VIPs, we should update this.
writeLine(proxier.natRules, append(args, "! -s", proxier.clusterCIDR, "-j", string(KubeMarkMasqChain))...)
```

The logic is that if the source IP is not part of the cluster CIDR range,
then the traffic must have originated from outside the cluster. Hence we add a rule to masquerade such traffic with
the node IP so that it can be sent to any pod within the cluster.

One key insight about this data path is that the iptables rules run
at _every_ node boundary. So when a pod sends traffic to a service IP, the destination gets translated to
one of the pod IPs _before_ the traffic leaves the node. It is therefore highly unlikely for a node to
receive traffic destined to the service cluster IP that was initiated by pods
within the cluster but not scheduled on that node.

Going by the above reasoning, if we receive traffic destined to a service whose source is not pod traffic generated
within the node, we can say with very high confidence that the traffic originated from outside the cluster.
So we can rewrite the rule in terms of just the pod identity within the node (node CIDR, interface prefix, or bridge).
This is the simplest change with respect to rewriting the rule without any assumptions on how pod
networking is set up.
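
As an illustrative sketch (mirroring the excerpt above), the node-cidr variant of this rule could swap the cluster
CIDR for a hypothetical `nodePodCIDR` value taken from `node.spec.podCIDR`; the interface-prefix and bridge variants
would substitute their own match instead:

```go
// Sketch, not the final implementation: masquerade traffic to a service VIP
// whose source is not a pod on this node. nodePodCIDR is a hypothetical value.
writeLine(proxier.natRules, append(args, "! -s", nodePodCIDR, "-j", string(KubeMarkMasqChain))...)
```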

### iptables - redirecting pod traffic to external loadbalancer VIP to cluster IP

The [rule here currently](
https://github.com/kubernetes/kubernetes/blob/v1.17.0/pkg/proxy/iptables/proxier.go#L1327-L1339
) looks as follows

```go
// First rule in the chain redirects all pod -> external VIP traffic to the
// Service's ClusterIP instead. This happens whether or not we have local
// endpoints; only if clusterCIDR is specified
if len(proxier.clusterCIDR) > 0 {
    args = append(args[:0],
        "-A", string(svcXlbChain),
        "-m", "comment", "--comment",
        `"Redirect pods trying to reach external loadbalancer VIP to clusterIP"`,
        "-s", proxier.clusterCIDR,
        "-j", string(svcChain),
    )
    writeLine(proxier.natRules, args...)
}
```

The logic here is that if the source IP is part of the cluster CIDR and the traffic is being
sent to a load balancer IP for a service, we short-circuit it by jumping directly to the
service chain, instead of having the packet go out of the cluster, get routed back, and then be
translated to one of the backends.

Given that the iptables rules are applied at the node boundary before any traffic from pods within
that node leaves the node, the same arguments as above apply here for replacing the cluster CIDR
with a representation of the node's pod CIDR or its pod interfaces.
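
A sketch of the corresponding change, again mirroring the excerpt above and assuming a hypothetical `nodePodCIDR`
value in place of `proxier.clusterCIDR`:

```go
// Sketch only: redirect pod -> external VIP traffic back to the service chain,
// keyed off the node's own pod CIDR rather than the cluster CIDR.
args = append(args[:0],
    "-A", string(svcXlbChain),
    "-m", "comment", "--comment",
    `"Redirect pods trying to reach external loadbalancer VIP to clusterIP"`,
    "-s", nodePodCIDR, // hypothetical node-local value, e.g. node.spec.podCIDR
    "-j", string(svcChain),
)
writeLine(proxier.natRules, args...)
```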

### iptables - accepting traffic after first packet, after being accepted by kubernetes rules

The [rule here currently](https://github.com/kubernetes/kubernetes/blob/v1.17.0/pkg/proxy/iptables/proxier.go#L1468-L1490)
looks as follows

```go
// The following rules can only be set if clusterCIDR has been defined.
if len(proxier.clusterCIDR) != 0 {
    // The following two rules ensure the traffic after the initial packet
    // accepted by the "kubernetes forwarding rules" rule above will be
    // accepted, to be as specific as possible the traffic must be sourced
    // or destined to the clusterCIDR (to/from a pod).
    writeLine(proxier.filterRules,
        "-A", string(kubeForwardChain),
        "-s", proxier.clusterCIDR,
        "-m", "comment", "--comment", `"kubernetes forwarding conntrack pod source rule"`,
        "-m", "conntrack",
        "--ctstate", "RELATED,ESTABLISHED",
        "-j", "ACCEPT",
    )
    writeLine(proxier.filterRules,
        "-A", string(kubeForwardChain),
        "-m", "comment", "--comment", `"kubernetes forwarding conntrack pod destination rule"`,
        "-d", proxier.clusterCIDR,
        "-m", "conntrack",
        "--ctstate", "RELATED,ESTABLISHED",
        "-j", "ACCEPT",
    )
}
```

The interesting part of this rule is that it already matches the conntrack state "RELATED,ESTABLISHED",
which means that it does not apply to the initial packet, but only after the connection has been set up and accepted.

In this case, dropping the `proxier.clusterCIDR` match from these rules should have minimal impact on their behavior.
We would just be saying that any connection that is already established or related should be accepted.

In addition, since this rule is written after the rule to drop packets marked by `KUBE-MARK-DROP`,
by the time we reach this rule, packets marked to be dropped by kubernetes would already have been dropped.
So it should not break any kubernetes-specific logic.

Unfortunately, in this case it's not possible to replace the cluster CIDR match with a node-local CIDR, as
the traffic could be getting forwarded through this node to another node.
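
A sketch of the simplified rule, keeping only the conntrack state match; this is illustrative, and the final comment
text and rule layout may differ:

```go
// Sketch only: accept any forwarded traffic belonging to a connection that has
// already been accepted, without referencing the cluster CIDR at all.
writeLine(proxier.filterRules,
    "-A", string(kubeForwardChain),
    "-m", "comment", "--comment", `"kubernetes forwarding conntrack rule"`,
    "-m", "conntrack",
    "--ctstate", "RELATED,ESTABLISHED",
    "-j", "ACCEPT",
)
```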

### ipvs - masquerade off cluster traffic to services by node IP

The [rule here currently](https://github.com/kubernetes/kubernetes/blob/master/pkg/proxy/ipvs/proxier.go#L1649-L1654)
looks as follows.

```go
// This masquerades off-cluster traffic to a service VIP. The idea
// is that you can establish a static route for your Service range,
// routing to any node, and that node will bridge into the Service
// for you. Since that might bounce off-node, we masquerade here.
// If/when we support "Local" policy for VIPs, we should update this.
writeLine(proxier.natRules, append(args, "dst,dst", "! -s", proxier.clusterCIDR, "-j", string(KubeMarkMasqChain))...)
```

By the same logic used in the first case for iptables, we can replace references to clusterCIDR with an equivalent
node-specific pod identification (node.podCIDR, interface prefix, or bridge) to determine whether the traffic
originated from within the cluster or not.
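
Mirroring the ipvs excerpt above, a node-cidr sketch (with a hypothetical `nodePodCIDR`) might look like:

```go
// Sketch only: same rewrite as the iptables case, keeping the ipset "dst,dst"
// match but testing against the node's pod CIDR instead of the cluster CIDR.
writeLine(proxier.natRules, append(args, "dst,dst", "! -s", nodePodCIDR, "-j", string(KubeMarkMasqChain))...)
```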

### ipvs - accepting traffic after first packet, after being accepted by kubernetes rules

The [rule here currently](https://github.com/kubernetes/kubernetes/blob/master/pkg/proxy/ipvs/proxier.go#L1726-L1745)
looks as follows

```go
// The following two rules ensure the traffic after the initial packet
// accepted by the "kubernetes forwarding rules" rule above will be
// accepted, to be as specific as possible the traffic must be sourced
// or destined to the clusterCIDR (to/from a pod).
writeLine(proxier.filterRules,
    "-A", string(KubeForwardChain),
    "-s", proxier.clusterCIDR,
    "-m", "comment", "--comment", `"kubernetes forwarding conntrack pod source rule"`,
    "-m", "conntrack",
    "--ctstate", "RELATED,ESTABLISHED",
    "-j", "ACCEPT",
)
writeLine(proxier.filterRules,
    "-A", string(KubeForwardChain),
    "-m", "comment", "--comment", `"kubernetes forwarding conntrack pod destination rule"`,
    "-d", proxier.clusterCIDR,
    "-m", "conntrack",
    "--ctstate", "RELATED,ESTABLISHED",
    "-j", "ACCEPT",
)
```

Again, applying the same logic as for the last iptables rule, the proposal here is to simplify this rule by dropping
the reference to `proxier.clusterCIDR` and just matching on the connection state.

### Risks and Mitigations

The biggest risk we have is that we are expanding the scope of the last rule to potentially include non-kubernetes
traffic. This is considered mostly safe, as it does not break any of the intended drop behavior. Furthermore, once the
initial connection has been accepted, and assuming nodes are used for kubernetes workloads, it's highly unlikely that
we would need to reject it later.

## Design Details

The idea of 'determining cluster-originated traffic' would be captured in a new Go interface type within kube-proxy,
with different implementations of the interface. The proxier implementations themselves would just call a method on
the interface to get the match criteria to write into the rules.
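
A minimal sketch of what such an interface might look like; the name and method set here are illustrative, not a
settled API:

```go
// LocalTrafficDetector is a sketch of the proposed interface; names are illustrative.
type LocalTrafficDetector interface {
    // IsImplemented reports whether local-traffic detection is configured at all.
    // When it is not, rules that need it are skipped, as happens today when
    // --cluster-cidr is left unset.
    IsImplemented() bool
    // IfLocal returns iptables match args for traffic that originated from a pod
    // on this node, e.g. ["-s", "10.0.1.0/24"] or ["-i", "veth+"].
    IfLocal() []string
    // IfNotLocal returns the negated form, e.g. ["!", "-s", "10.0.1.0/24"].
    IfNotLocal() []string
}
```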

The new behavior can be opted into using two flags: the first determines the mode to use for detection, and the
other (optionally) provides the value to use in that mode. This separation of mode and value has the nice property
that if we default the mode to "cluster-cidr", then the current `--cluster-cidr` flag can be used as-is to get
the current behaviour, so upgrades with no flag changes retain the current behavior.

```
--detect-local={cluster-cidr | node-cidr | pod-interface-prefix | bridge}
    The mode to use for detecting local traffic. The default is cluster-cidr (current behavior).
--cluster-cidr="cidr[,cidr,..]"
    The current --cluster-cidr flag. It will be enhanced to read a comma-separated list of CIDRs so that more
    than one can be specified if necessary. kube-proxy considers traffic as local if the source is one
    of the CIDR values. This is only used if `--detect-local=cluster-cidr`.
--node-cidr[="cidr[,cidr,..]"]
    (optional) A comma-separated list of node CIDRs. kube-proxy considers traffic as local if the source is one
    of the CIDR values. If no value is specified, or the flag is omitted, it defaults to the node.podCIDR property
    on the node. This is only used if `--detect-local=node-cidr`.
--pod-interface-prefix="prefix[,prefix,..]"
    kube-proxy considers traffic as local if it originates from an interface that matches one of the given
    prefixes. The argument is a comma-separated list of interface prefix names, without the trailing '+'.
    This is only used if `--detect-local=pod-interface-prefix` or `--detect-local=bridge`. In the case of the
    latter, the prefix is used as the argument to the `--physdev-in` match, instead of the bare `--physdev-is-in`
    match, in iptables.
```
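
A sketch of how the mode flag might select an implementation; the constructor names below are hypothetical and only
illustrate the mapping from flags to detectors:

```go
// Sketch only: choose a detector implementation based on the --detect-local mode.
// The constructor names are hypothetical, not an existing kube-proxy API.
func newLocalDetector(mode, value string) LocalTrafficDetector {
    switch mode {
    case "cluster-cidr":
        return newCIDRDetector(value) // value from --cluster-cidr (default mode, current behavior)
    case "node-cidr":
        return newCIDRDetector(value) // value from --node-cidr, defaulting to node.spec.podCIDR
    case "pod-interface-prefix":
        return newInterfaceDetector(value) // value from --pod-interface-prefix
    case "bridge":
        return newBridgeDetector(value) // physdev-based match using the same prefix value
    default:
        return newNoOpDetector() // unrecognized mode: behave as if detection is not implemented
    }
}
```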

Given that a mode may specify a list of values (and hence produce multiple rules), the direct jump to
`KUBE-MARK-MASQ` will be implemented as a jump to a new chain, `KUBE-MASQ-IF-NOT-LOCAL`, which will then either
return or jump to `KUBE-MARK-MASQ` as appropriate. For example:

```
-A WHEREVER -blah -blah -blah -j KUBE-MASQ-IF-NOT-LOCAL
-A KUBE-MASQ-IF-NOT-LOCAL -s 10.0.1.0/24 -j RETURN
-A KUBE-MASQ-IF-NOT-LOCAL -s 10.0.3.0/24 -j RETURN
-A KUBE-MASQ-IF-NOT-LOCAL -s 10.0.5.0/24 -j RETURN
-A KUBE-MASQ-IF-NOT-LOCAL -j KUBE-MARK-MASQ
```

Future changes to the detection of local traffic (say, using packet marks or similar mechanisms) can be made by
adding more options to the `--detect-local` mode flag, along with any appropriate additional flags.

### Graduation Criteria

These additional flags will go through the usual alpha, beta, and GA graduation stages, as for any feature.

## Implementation History

- 2019-11-04 - Creation of the KEP
- 2019-11-27 - Revision with implementation details

## Drawbacks [optional]

The main caveat in this KEP is the relaxation of the accept rule for "ESTABLISHED,RELATED" packets. The other two rules
have equivalent implementations as long as we continue to guarantee that pod traffic is processed by these rules at the
node boundary on _every_ node that makes up the kubernetes cluster. This would not work if that assumption were to change.

## Alternatives [optional]

### Multiple cluster CIDR rules
One alternative to consider is to explicitly track a list of cluster CIDRs in the iptables rules. If we
wanted to do this, we might want to consider making the cluster CIDR a first-class resource, which we want to avoid.

Instead, in most cases, where the interface prefix is fixed or where we use the `node.spec.podCIDR` attribute,
changes to the cluster CIDR do not require any change to the kube-proxy arguments or a restart, which we believe
is a benefit when managing clusters.

### ip-masq-agent like behavior
The other alternative is to have kube-proxy never track this at all and instead use something like
[ip-masq-agent](https://kubernetes.io/docs/tasks/administer-cluster/ip-masq-agent/) to control what is masqueraded
and what is not. This assumes more knowledge on the part of users, but it does provide a single place to update
these CIDRs using existing tooling.
