Move noEncap/Hybrid/Policy-Only mode to use Antrea-proxy #1015

Merged 1 commit on Aug 7, 2020
8 changes: 4 additions & 4 deletions build/yamls/antrea-eks.yml
@@ -665,7 +665,7 @@ data:
# Enable antrea proxy which provides ServiceLB for in-cluster services in antrea agent.
# It should be enabled on Windows, otherwise NetworkPolicy will not take effect on
# Service traffic.
-# AntreaProxy: false
+AntreaProxy: true
# Enable traceflow which provides packet tracing feature to diagnose network issue.
# Traceflow: false
# Enable flowexporter which exports polled conntrack connections as IPFIX flow records from each agent to a configured collector.
@@ -770,7 +770,7 @@ metadata:
annotations: {}
labels:
app: antrea
-name: antrea-config-hhthk4g2f4
+name: antrea-config-h7cg6t86ht
namespace: kube-system
---
apiVersion: v1
@@ -876,7 +876,7 @@ spec:
key: node-role.kubernetes.io/master
volumes:
- configMap:
-name: antrea-config-hhthk4g2f4
+name: antrea-config-h7cg6t86ht
name: antrea-config
- name: antrea-controller-tls
secret:
@@ -1093,7 +1093,7 @@ spec:
operator: Exists
volumes:
- configMap:
-name: antrea-config-hhthk4g2f4
+name: antrea-config-h7cg6t86ht
name: antrea-config
- hostPath:
path: /etc/cni/net.d
8 changes: 4 additions & 4 deletions build/yamls/antrea-gke.yml
@@ -665,7 +665,7 @@ data:
# Enable antrea proxy which provides ServiceLB for in-cluster services in antrea agent.
# It should be enabled on Windows, otherwise NetworkPolicy will not take effect on
# Service traffic.
-# AntreaProxy: false
+AntreaProxy: true
# Enable traceflow which provides packet tracing feature to diagnose network issue.
# Traceflow: false
# Enable flowexporter which exports polled conntrack connections as IPFIX flow records from each agent to a configured collector.
@@ -770,7 +770,7 @@ metadata:
annotations: {}
labels:
app: antrea
-name: antrea-config-mbkmc9bb22
+name: antrea-config-db6h57cm79
namespace: kube-system
---
apiVersion: v1
@@ -876,7 +876,7 @@ spec:
key: node-role.kubernetes.io/master
volumes:
- configMap:
-name: antrea-config-mbkmc9bb22
+name: antrea-config-db6h57cm79
name: antrea-config
- name: antrea-controller-tls
secret:
@@ -1091,7 +1091,7 @@ spec:
operator: Exists
volumes:
- configMap:
-name: antrea-config-mbkmc9bb22
+name: antrea-config-db6h57cm79
name: antrea-config
- hostPath:
path: /etc/cni/net.d
145 changes: 14 additions & 131 deletions docs/policy-only.md
@@ -6,10 +6,13 @@ primary CNI.

## Design

Antrea is designed to work as a NetworkPolicy plug-in together with routed CNIs.
As long as a CNI implementation fits into this model, Antrea may be inserted to enforce
NetworkPolicy in that CNI's environment using Open vSwitch (OVS).

In addition, running Antrea as a NetworkPolicy plug-in automatically enables Antrea-proxy, because
Antrea-proxy is required to load-balance Pod-to-Service traffic.

<img src="/docs/assets/policy-only-cni.svg" width="600" alt="Antrea Switched CNI">

The above diagram depicts a routed CNI network topology on the left, and what it looks like
@@ -24,7 +27,7 @@ incoming traffic is received on this PtP device. This is a spoke-and-hub model,
where all traffic, even within the same worker Node, must first traverse to the host network and be
routed by it.

When the container runtime instantiates a Pod, it first calls the primary CNI to configure the Pod's
IP, route table, DNS, etc., and then connects the Pod to the host network with a PtP device such as a
veth pair. When Antrea is chained with this primary CNI, the container runtime then calls the
Antrea Agent, and the Antrea Agent attaches the Pod's PtP device to the OVS bridge and moves the host
@@ -34,137 +37,17 @@ illustrated by the diagram on the right.
Antrea needs to satisfy the following requirements:
1. All IP packets, sent on ``antrea-gw0`` in the host network, are received by the Pods exactly the same
as if the OVS bridge had not been inserted.
1. All IP packets, sent by Pods, are received by other Pods or the host network exactly
the same as if the OVS bridge had not been inserted.
1. There are no requirements on Pod MAC addresses, as all MAC addresses stay within the OVS bridge.

To satisfy the above requirements, Antrea needs no knowledge of the Pod's network configuration nor
of the underlying CNI network; it simply needs to program the following OVS flows on the OVS bridge:
1. A default ARP responder flow that answers any ARP request. Its sole purpose is so that a Pod's
neighbor may be resolved, and packets may be sent by that Pod to that neighbor.
1. IP packets are routed based on their destination IP if it matches any local Pod's IP.
1. All other IP packets are routed to host network via ``antrea-gw0`` interface.

These flows together handle all Pod traffic patterns, with the exception of Pod-to-Service traffic,
which we address next.

## Handling Pod-To-Service

The discussion in this section also applies to Pod-to-Service traffic in NoEncap traffic
mode. Antrea applies the same principle to handle Pod-to-Service traffic in all traffic modes where
traffic requires no encapsulation.

Antrea uses kube-proxy for load balancing. At the same time, it also supports Pod-level
NetworkPolicy enforcement.

This means that a Pod-to-Service traffic flow needs to
1. first traverse to the host network for load balancing (DNAT), then
1. come back to the OVS bridge for Pod egress NetworkPolicy processing, and
1. go back to the host network yet again to be forwarded, if the DNATed destination from 1) is an
inter-Node Pod or an external network entity.

We refer to the last traffic pattern as re-entrance traffic because in this pattern, a traffic flow
enters the host network twice: the first time for load balancing, and the second time for forwarding.

Denote
- VIP as cluster IP of a service
- SP_IP/DP_IP as respective client and server Pod IP
- VPort as service port of a service
- TPort as target port of server Pod
- SPort as original source port

The service request's 5-tuples upon first and second entrance to the host network, and
its reply's 5-tuples, would look like this:

```
request/service:
-- Entering Host Network(via antrea-gw0): SP_IP/SPort->VIP/VPort
-- After LB(DNAT): SP_IP/SPort->DP_IP/TPort
-- After Route(to antrea-gw0): SP_IP/SPort->DP_IP/TPort

request/forwarding:
-- Entering Host Network(via antrea-gw0): SP_IP/SPort->DP_IP/TPort
-- After route(to uplink): SP_IP/SPort->DP_IP/TPort

reply:
-- Entering Host Network(via uplink): DP_IP/TPort -> SP_IP/SPort
-- After LB(DNAT): VIP/VPort->SP_IP/SPort
-- After route(to antrea-gw0): VIP/VPort->SP_IP/SPort
```

#### Routing
Note that the request with destination IP DP_IP needs to be routed differently in the LB and
forwarding cases. (This differs from encap traffic, where all traffic flows, including post-LB
service traffic, share the same ``main`` route table.) Antrea creates a customized
``antrea_service`` route table, which is used in conjunction with an ip rule and iptables to handle
service traffic. Together they work as follows:
1. At Antrea initialization, an iptables rule is created in the ``mangle`` table that marks IP packets
that have a Service IP as destination IP and come from ``antrea-gw0``.
1. At Antrea initialization, an ip rule is added to select the ``antrea_service`` route table as the
routing table for traffic marked in 1).
1. At Antrea initialization, a default route entry is added to the ``antrea_service`` route table to
forward all traffic to ``antrea-gw0``.

The outcome may look something like this:
```bash
ip neigh | grep antrea-gw0
169.254.253.1 dev antrea-gw0 lladdr 12:34:56:78:9a:bc PERMANENT

ip route show table 300 #tbl_idx=300 is antrea_service
default via 169.254.253.1 dev antrea-gw0 onlink

ip rule | grep antrea-gw0
300: from all fwmark 0x800/0x800 iif antrea-gw0 lookup 300

iptables -t mangle -L ANTREA-MANGLE
Chain ANTREA-MANGLE (1 references)
target prot opt source destination
MARK all -- anywhere 10.0.0.0/16 /* Antrea: mark service traffic */ MARK or 0x800
MARK all -- anywhere !10.0.0.0/16 /* Antrea: unmark post LB service traffic */ MARK and 0x0
```

The above configuration allows Pod-to-Service traffic to use the ``antrea_service`` route table after
load balancing, and to be steered back to the OVS bridge for Pod NetworkPolicy processing.

#### Conntrack
Note also that with re-entrance traffic, a service request, after being load balanced and routed
back to the OVS bridge via ``antrea-gw0``, has exactly the same 5-tuple as when it re-enters the
host network for forwarding.

When a service request with the same 5-tuple re-enters the host network, it confuses Linux conntrack.
Conntrack treats the re-entering IP packet as belonging to a new connection flow that uses the same
source port already allocated to the DNAT connection. In turn, the re-entering packet triggers
another SNAT connection. The overall effect is that the service's DNAT connection is not
matched by the service reply, and no un-DNAT takes place. As a result, the reply is not
recognized, and is therefore dropped by the source Pod.

Antrea uses the following mechanisms to handle re-entrance of Pod-to-Service traffic into the host
network and to bypass conntrack in the host network:
1. On the OVS bridge, add a flow that marks any re-entrance traffic with a special source MAC.
1. On the OVS bridge, add a flow that makes any re-entrance traffic bypass conntrack in the OVS zone.
1. In the host network's iptables, add a rule to the ``raw`` table that bypasses conntrack in the
host zone for packets matching the special source MAC from 1), as sketched below.

#### NetworkPolicy Considerations
Note that for re-entrance traffic, the original reply packets do not make it into OVS,
as they are un-DNATted in the host network before reaching OVS. This, however, has no
impact on NetworkPolicy enforcement.

Antrea enforces NetworkPolicy by allowing or disallowing initial connection packets (e.g. TCP
SYN) to go through and establish the connection. Once a connection is
established, Antrea relies on conntrack to admit or reject packets for that connection. This still
holds true for re-entrance traffic flows, except that conntrack takes place not within the OVS
conntrack zone but in the host network's default conntrack zone. Hence NetworkPolicy
enforcement is not impacted.

It does have some effect on statistics collection. If original reply traffic reaches the OVS bridge,
as is the case for encap traffic flows, the OVS bridge knows about any reply packets dropped by
OVS-zone conntrack and can record them accordingly. With re-entrance traffic, the reply traffic with
the original server Pod IPs does not reach the OVS bridge, so any traffic dropped by host-network
conntrack is unknown to the OVS bridge.

## Future Work
1. Smoother transition in and out of Antrea in policy mode: a Kubernetes deployment should be easily
scaled up and down after/before Antrea insertion, allowing Pods to be added to Antrea after
installation and reconnected to the old CNI topology after Antrea is uninstalled.
1. NetworkPolicy for external services does not work yet.
See https://github.com/vmware-tanzu/antrea/issues/538.
See https://github.com/vmware-tanzu/antrea/issues/538.
1. A default ARP responder flow that answers any ARP request. Its sole purpose is to let a Pod
resolve its neighbors, so that the Pod can generate traffic to these neighbors.
1. An L3 flow for each local Pod that routes IP packets to that Pod if the packets' destination IP
matches that of the Pod.
1. An L3 flow that routes all other IP packets to the host network via the ``antrea-gw0`` interface.

These flows together handle all Pod traffic patterns.
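To make the three flows concrete, below is a hand-written approximation using ``ovs-ofctl``. It is
a sketch only: apart from the ``aa:bb:cc:dd:ee:ff`` global virtual MAC and the default ``br-int``
bridge name, the table numbers, priorities, IPs, and MACs are invented, and the real pipeline in
``pkg/agent/openflow/pipeline.go`` is more elaborate.

```bash
# 1) The ARP responder answers every request with the global virtual MAC
#    aa:bb:cc:dd:ee:ff (its full action list is omitted here for brevity).
# 2) Route IP packets to a local Pod when the destination IP matches its Pod IP
#    (10.10.0.5 and the Pod MAC are placeholders).
ovs-ofctl -O OpenFlow13 add-flow br-int \
  "table=70,priority=200,ip,nw_dst=10.10.0.5,actions=mod_dl_dst:0a:00:00:00:00:05,goto_table:80"
# 3) Route all other IP packets to the host network via antrea-gw0
#    (0a:00:00:00:00:01 is a placeholder for the antrea-gw0 MAC).
ovs-ofctl -O OpenFlow13 add-flow br-int \
  "table=70,priority=190,ip,actions=mod_dl_dst:0a:00:00:00:00:01,goto_table:80"
```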
5 changes: 5 additions & 0 deletions hack/generate-manifest.sh
@@ -134,6 +134,11 @@ if [ "$MODE" == "release" ] && [ -z "$IMG_TAG" ]; then
exit 1
fi

# noEncap/policy-only mode works with antrea-proxy.
if [[ "$ENCAP_MODE" != "" ]] && [[ "$ENCAP_MODE" != "encap" ]]; then
PROXY=true
fi

THIS_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" >/dev/null 2>&1 && pwd )"

source $THIS_DIR/verify-kustomize.sh
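To see the new behavior, one can regenerate a manifest in a non-encap mode and check the rendered
ConfigMap; the ``--encap-mode`` flag spelling below is an assumption, so consult
``hack/generate-manifest.sh --help`` for the authoritative options:

```bash
# Any mode other than "encap" should now force PROXY=true, so the generated
# ConfigMap is expected to carry an uncommented "AntreaProxy: true" line.
./hack/generate-manifest.sh --encap-mode noEncap > antrea-noencap.yml
grep -n "AntreaProxy" antrea-noencap.yml
```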
3 changes: 2 additions & 1 deletion pkg/agent/agent.go
@@ -453,9 +453,10 @@ func (i *Initializer) configureGatewayInterface(gatewayIface *interfacestore.Int
i.nodeConfig.GatewayConfig = &config.GatewayConfig{Name: i.hostGateway, MAC: gwMAC}
gatewayIface.MAC = gwMAC
if i.networkConfig.TrafficEncapMode.IsNetworkPolicyOnly() {
// In policy-only mode, Node IP is also assigned to local gateway for masquerade.
// Assign IP to gw as required by SpoofGuard.
i.nodeConfig.GatewayConfig.IP = i.nodeConfig.NodeIPAddr.IP
gatewayIface.IP = i.nodeConfig.NodeIPAddr.IP
// No need to assign local CIDR to gw0 because local CIDR is not managed by Antrea
return nil
}

11 changes: 0 additions & 11 deletions pkg/agent/openflow/client.go
@@ -478,10 +478,6 @@ func (c *client) InstallGatewayFlows(gatewayAddr net.IP, gatewayMAC net.Hardware
flows = append(flows, c.l3ToGatewayFlow(gatewayAddr, gatewayMAC, cookie.Default))
}

if c.encapMode.SupportsNoEncap() {
flows = append(flows, c.reEntranceBypassCTFlow(gatewayOFPort, gatewayOFPort, cookie.Default))
}

if err := c.ofEntryOperations.AddAll(flows); err != nil {
return err
}
@@ -524,11 +520,6 @@ func (c *client) initialize() error {
if err := c.ofEntryOperations.AddAll(c.establishedConnectionFlows(cookie.Default)); err != nil {
return fmt.Errorf("failed to install flows to skip established connections: %v", err)
}
if c.encapMode.SupportsNoEncap() {
if err := c.ofEntryOperations.Add(c.l2ForwardOutputReentInPortFlow(c.gatewayPort, cookie.Default)); err != nil {
return fmt.Errorf("failed to install L2 forward same in-port and out-port flow: %v", err)
}
}
if c.encapMode.IsNetworkPolicyOnly() {
if err := c.setupPolicyOnlyFlows(); err != nil {
return fmt.Errorf("failed to setup policy only flows: %w", err)
@@ -644,8 +635,6 @@ func (c *client) DeleteStaleFlows() error {

func (c *client) setupPolicyOnlyFlows() error {
flows := []binding.Flow{
// Bypasses remaining l3forwarding flows if the MAC is set via ctRewriteDstMACFlow.
c.l3BypassMACRewriteFlow(c.nodeConfig.GatewayConfig.MAC, cookie.Default),
// Rewrites MAC to gw port if the packet received is unmatched by local Pod flows.
c.l3ToGWFlow(c.nodeConfig.GatewayConfig.MAC, cookie.Default),
// Replies any ARP request with the same global virtual MAC.
39 changes: 0 additions & 39 deletions pkg/agent/openflow/pipeline.go
@@ -206,7 +206,6 @@ var (
serviceLearnRegRange = binding.Range{16, 18}

globalVirtualMAC, _ = net.ParseMAC("aa:bb:cc:dd:ee:ff")
ReentranceMAC, _ = net.ParseMAC("de:ad:be:ef:de:ad")
hairpinIP = net.ParseIP("169.254.169.252").To4()
)

@@ -521,20 +520,6 @@ func (c *client) traceflowConnectionTrackFlows(dataplaneTag uint8, category cook
Done()
}

// reEntranceBypassCTFlow generates a flow that bypasses CT for traffic re-entering the host network space.
// In the host network space, we disable conntrack for re-entrance traffic so as not to confuse conntrack
// in the host namespace. This, however, has an inverse effect on conntrack in the Antrea conntrack zone
// as well: all subsequent re-entrance traffic becomes invalid.
func (c *client) reEntranceBypassCTFlow(gwPort, reentPort uint32, category cookie.Category) binding.Flow {
conntrackCommitTable := c.pipeline[conntrackCommitTable]
return conntrackCommitTable.BuildFlow(priorityHigh).MatchProtocol(binding.ProtocolIP).
MatchRegRange(int(marksReg), portFoundMark, ofPortMarkRange).
MatchInPort(gwPort).MatchReg(int(portCacheReg), reentPort).
Action().GotoTable(conntrackCommitTable.GetNext()).
Cookie(c.cookieAllocator.Request(category).Raw()).
Done()
}

// ctRewriteDstMACFlow rewrites the destination MAC with the local host gateway MAC if the packet has ct_mark set but was not sent from the host gateway.
func (c *client) ctRewriteDstMACFlow(gatewayMAC net.HardwareAddr, category cookie.Category) binding.Flow {
connectionTrackStateTable := c.pipeline[conntrackStateTable]
@@ -599,18 +584,6 @@ func (c *client) traceflowL2ForwardOutputFlow(dataplaneTag uint8, category cooki
Done()
}

// l2ForwardOutputReentInPortFlow generates the flow that forwards re-entrance peer Node traffic via antrea-gw0.
// This flow supersedes default output flow because ovs by default auto-skips packets with output = input port.
func (c *client) l2ForwardOutputReentInPortFlow(gwPort uint32, category cookie.Category) binding.Flow {
return c.pipeline[L2ForwardingOutTable].BuildFlow(priorityHigh).MatchProtocol(binding.ProtocolIP).
MatchRegRange(int(marksReg), portFoundMark, ofPortMarkRange).
MatchInPort(gwPort).MatchReg(int(portCacheReg), gwPort).
Action().SetSrcMAC(ReentranceMAC).
Action().OutputInPort().
Cookie(c.cookieAllocator.Request(category).Raw()).
Done()
}

// l2ForwardOutputServiceHairpinFlow uses in_port action for Service
// hairpin packets to avoid packets from being dropped by OVS.
func (c *client) l2ForwardOutputServiceHairpinFlow() binding.Flow {
@@ -621,18 +594,6 @@ func (c *client) l2ForwardOutputServiceHairpinFlow() binding.Flow {
Done()
}

// l3BypassMACRewriteFlow bypasses remaining l3forwarding flows if the MAC is set via ctRewriteDstMACFlow in
// conntrackState stage.
func (c *client) l3BypassMACRewriteFlow(gatewayMAC net.HardwareAddr, category cookie.Category) binding.Flow {
l3FwdTable := c.pipeline[l3ForwardingTable]
return l3FwdTable.BuildFlow(priorityNormal).MatchProtocol(binding.ProtocolIP).
MatchCTMark(gatewayCTMark).
MatchDstMAC(gatewayMAC).
Action().GotoTable(l3FwdTable.GetNext()).
Cookie(c.cookieAllocator.Request(category).Raw()).
Done()
}

// l3FlowsToPod generates the flow to rewrite MAC if the packet is received from tunnel port and destined for local Pods.
func (c *client) l3FlowsToPod(localGatewayMAC net.HardwareAddr, podInterfaceIP net.IP, podInterfaceMAC net.HardwareAddr, category cookie.Category) binding.Flow {
l3FwdTable := c.pipeline[l3ForwardingTable]