Fix intra-Node Service access when both Egress and AntreaProxy are enabled #2332

Merged (1 commit, Jul 5, 2021)
14 changes: 7 additions & 7 deletions docs/design/windows-design.md
@@ -142,7 +142,7 @@ addresses. It is implemented using OpenFlow.

To support this feature, two additional marks are introduced:

* The 17th bit of NXM Register0 is set for Pod-to-external traffic. This bit is set in the `L3Forwarding` table,
and is consumed in the `ConntrackCommit` table.
* The SNATed traffic is marked with **0x40** in the ct context. This mark is set in the `ConntrackCommit`
table when committing the new connection into the ct zone, and is consumed in the `ConntrackState` table
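As a rough illustration (the constant names below are hypothetical, not Antrea's actual identifiers), the two marks described above correspond to these bitmask values:

```go
package main

import "fmt"

// Hypothetical constants illustrating the two SNAT-related marks.
const (
	// The 17th bit of NXM Register0 marks Pod-to-external traffic:
	// reg0[17] corresponds to the value 1<<17.
	snatRequiredRegMark uint32 = 1 << 17 // 0x20000

	// SNATed connections carry this mark in the ct context.
	snatCTMark uint32 = 0x40
)

func main() {
	fmt.Printf("reg0 SNAT bit: 0x%x\n", snatRequiredRegMark) // 0x20000
	fmt.Printf("ct SNAT mark:  0x%x\n", snatCTMark)          // 0x40
}
```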
@@ -163,8 +163,8 @@ The following changes in the OVS pipeline are needed:
This is to ensure the reply packets from the external address to the Pod will be "unSNAT'd".
* Ensure that the packet is translated based on the NAT information provided when committing the connection by
using the `nat` argument with the `ct` action.
- * Rewrite the destination MAC with the global virtual MAC (aa:bb:cc:dd:ee:ff) in the `ConntrackState` table if
-   the packet is from the uplink interface and has the SNAT ct_mark.
+ * Set the macRewrite bit in the register (reg0[19]) in the `ConntrackState` table if the packet is from the uplink
+   interface and has the SNAT ct_mark.

The following is an example of the SNAT-related OpenFlow entries.

@@ -179,16 +179,16 @@ Conntrack Table: 30
table=30, priority=200, ip actions=ct(table=31,zone=65520,nat)

ConntrackState Table: 31
-table=31, priority=210,ct_state=-new+trk,ct_mark=0x40,ip,reg0=0x4/0xffff actions=mod_dl_dst:aa:bb:cc:dd:ee:ff,goto_table:40
+table=31, priority=210, ct_state=-new+trk,ct_mark=0x40,ip,reg0=0x4/0xffff actions=load:0x1->NXM_NX_REG0[19],goto_table:40
table=31, priority=200, ip,reg0=0x4/0xffff actions=output:br-int

L3Forwarding Table: 70
-// Forward the packet to L2ForwardingCalculation table if it is traffic to the local Pods.
-table=70, priority=200, ip,reg0=0x2/0xffff,nw_dst=10.10.0.0/24 actions=goto_table:80
+// Forward the packet to L2ForwardingCalculation table if it is traffic to the local Pods and doesn't require MAC rewriting.
+table=70, priority=200, ip,reg0=0/0x80000,nw_dst=10.10.0.0/24 actions=goto_table:80
// Forward the packet to L2ForwardingCalculation table if it is Pod-to-Node traffic.
table=70, priority=200, ip,reg0=0x2/0xffff,nw_dst=$local_nodeIP actions=goto_table:80
-// Forward the packet to L2ForwardingCalculation table if it is return traffic of an external-to-Pod connection.
-table=70, priority=200, ip,reg0=0x2/0xffff,ct_mark=0x20 actions=goto_table:80
+table=70, priority=210, ct_state=+rpl+trk,ct_mark=0x20,ip actions=mod_dl_dst:0e:6d:42:66:92:46,resubmit(,80)
// Add SNAT mark if it is Pod-to-external traffic.
table=70, priority=190, ct_state=+new+trk,ip,reg0=0x2/0xffff actions=load:0x1->NXM_NX_REG0[17], goto_table:80
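The old match `reg0=0x2/0xffff` and the new match `reg0=0/0x80000` key on different things. A minimal Go sketch of OpenFlow value/mask matching (function and constant names are hypothetical) shows why the new flow matches local-subnet traffic from any source, as long as the macRewrite bit (reg0[19]) is clear:

```go
package main

import "fmt"

// matches reports whether a reg0 value satisfies an OpenFlow
// value/mask match such as "reg0=0/0x80000".
func matches(reg0, value, mask uint32) bool {
	return reg0&mask == value
}

func main() {
	const macRewriteBit = uint32(1) << 19 // reg0[19], i.e. mask 0x80000

	// Old match: reg0=0x2/0xffff — only traffic marked as coming
	// from a local Pod (low 16 bits equal to 0x2).
	fmt.Println(matches(0x2, 0x2, 0xffff)) // true

	// New match: reg0=0/0x80000 — any source, as long as the
	// macRewrite bit is not set.
	fmt.Println(matches(0x2, 0, 0x80000))               // true
	fmt.Println(matches(0x2|macRewriteBit, 0, 0x80000)) // false
}
```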

24 changes: 6 additions & 18 deletions pkg/agent/openflow/pipeline.go
@@ -1805,10 +1805,11 @@ func (c *client) snatCommonFlows(nodeIP net.IP, localSubnet net.IPNet, localGate
ipProto := getIPProtocol(localSubnet.IP)
flows := []binding.Flow{
// First install flows for traffic that should bypass SNAT.
-// This flow is for traffic to the local Pod subnet.
+// This flow is for traffic to the local Pod subnet that doesn't need MAC rewriting (L2 forwarding case). Other
+// traffic to the local Pod subnet will be handled by L3 forwarding rules.
l3FwdTable.BuildFlow(priorityNormal).
MatchProtocol(ipProto).
-MatchRegRange(int(marksReg), markTrafficFromLocal, binding.Range{0, 15}).
+MatchRegRange(int(marksReg), 0, macRewriteMarkRange).
MatchDstIPNet(localSubnet).
Action().GotoTable(nextTable).
Cookie(c.cookieAllocator.Request(category).Raw()).
@@ -1821,22 +1822,9 @@
Action().GotoTable(nextTable).
Cookie(c.cookieAllocator.Request(category).Raw()).
Done(),
-// This flow is for the return traffic of connections to a local
-// Pod through the gateway interface (so gatewayCTMark is set).
-// For example, the return traffic of a connection from an IP
-// address (not the Node's management IP or gateway interface IP
-// which are covered by other flows already) of the local Node
-// to a local Pod. It might also catch the Service return
-// traffic from a local server Pod, but this case is also
-// covered by other flows (the flows matching the local and
-// remote Pod subnets) anyway.
-l3FwdTable.BuildFlow(priorityNormal).
Contributor:
Why remove this flow?

See:

  // This flow is for the return traffic of connections to a local
  // Pod through the gateway interface (so gatewayCTMark is set).
  // For example, the return traffic of a connection from an IP
  // address (not the Node's management IP or gateway interface IP
  // which are covered by other flows already) of the local Node
  // to a local Pod.

Member Author:
I mentioned the reason in the commit message "It also removes a flow in L3Forwarding table that specially handles gatewayCT related traffic, which has been taken care of by another more generic flow in same table." It's added in #2318.

// Rewrite the destination MAC address with the local host gateway MAC if the packet is in the reply direction and
// is marked with gatewayCTMark. This is for connections which were initiated through the gateway, to ensure that
// this reply traffic gets forwarded correctly (back to the host network namespace, through the gateway). In
// particular, it is necessary in the following 2 cases:
// 1) reply traffic for connections from a local Pod to a ClusterIP Service (when AntreaProxy is disabled and
// kube-proxy is used). In this case the destination IP address of the reply traffic is the Pod which initiated the
// connection to the Service (no SNAT). We need to make sure that these packets are sent back through the gateway
// so that the source IP can be rewritten (Service backend IP -> Service ClusterIP).
// 2) when hair-pinning is involved, i.e. connections between 2 local Pods, for which NAT is performed. This
// applies regardless of whether AntreaProxy is enabled or not, and thus also applies to Windows Nodes (for which
// AntreaProxy is enabled by default). One example is a Pod accessing a NodePort Service for which
// externalTrafficPolicy is set to Local, using the local Node's IP address.
for _, proto := range c.ipProtocols {
flows = append(flows, l3FwdTable.BuildFlow(priorityHigh).MatchProtocol(proto).
MatchCTMark(gatewayCTMark, nil).
MatchCTStateRpl(true).MatchCTStateTrk(true).
Action().SetDstMAC(localGatewayMAC).
Action().GotoTable(l3FwdTable.GetNext()).
Cookie(c.cookieAllocator.Request(category).Raw()).
Done())
}

I think currently the deleted flow will never be hit:

table=70, priority=210,ct_state=+rpl+trk,ct_mark=0x20,ip actions=mod_dl_dst:e2:e5:a4:9b:1c:b1,goto_table:80
table=70, priority=200,ip,reg0=0x2/0xffff,ct_mark=0x20 actions=goto_table:80
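The shadowing argument above can be checked with a minimal Go model of OpenFlow table lookup (the types and flow names are hypothetical; the point is only that among matching flows, the highest priority wins):

```go
package main

import "fmt"

// packet models the fields the two flows in question match on.
type packet struct {
	reply  bool   // ct_state +rpl
	ctMark uint32 // e.g. 0x20 (gatewayCTMark)
	reg0   uint32
}

// flow models one OpenFlow entry: among all matching flows,
// the one with the highest priority is selected.
type flow struct {
	priority int
	match    func(p packet) bool
	name     string
}

func lookup(flows []flow, p packet) string {
	best := flow{priority: -1, name: "miss"}
	for _, f := range flows {
		if f.priority > best.priority && f.match(p) {
			best = f
		}
	}
	return best.name
}

func main() {
	table70 := []flow{
		{210, func(p packet) bool { return p.reply && p.ctMark == 0x20 }, "generic gatewayCTMark reply flow"},
		{200, func(p packet) bool { return p.reg0&0xffff == 0x2 && p.ctMark == 0x20 }, "deleted flow"},
	}
	// Reply traffic of a gateway connection matches both entries,
	// but always hits the higher-priority one, so the deleted
	// flow was never reachable.
	p := packet{reply: true, ctMark: 0x20, reg0: 0x2}
	fmt.Println(lookup(table70, p)) // generic gatewayCTMark reply flow
}
```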

Contributor:
Ok, I did not see that! But would you update the comments for the new flow to include the removed case too?

Contributor:
Probably better to keep the comments in this func, and say this case is covered by another flow?

Member Author:
Do you think it makes sense to only send non-reply traffic to snatTable, so we don't need special handling for various kinds of return traffic? I see the deleted flow is for gatewayCTMark related return traffic, and the comment also said Service return traffic may be caught by it. I think generally no reply traffic would need to go to snatTable?

Contributor:
Yes, but how do we identify non-reply traffic?

Member Author:
I think "-rpl+trk" should work?

Contributor:
That sounds good to me, but I did not think through all cases to be confident.

Member Author:
When I was trying to ensure there would be no regression if we forwarded only "-rpl+trk" traffic to snatTable, I couldn't determine whether it would really affect the Windows case. As the patch needs to be backported, to minimize risk I'd rather make a minimal change for the local Pod subnet case to fix the specific issue. Maybe let's consider the Windows case together later.
I have added a comment here saying the gatewayCTMark case is covered by another flow; please take another look.

Contributor:
The strategy works for me.
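The "-rpl+trk" match discussed above is a combination of flag tests on the packet's connection-tracking state. A minimal Go illustration (the bit values here are illustrative, not OVS's actual ct_state encoding):

```go
package main

import "fmt"

// Illustrative ct_state flag bits.
const (
	ctTrk uint32 = 1 << 0 // connection is tracked
	ctRpl uint32 = 1 << 1 // packet is in the reply direction
)

// matchCTState implements a "+trk-rpl"-style match: all bits in
// want must be set, and all bits in not must be clear.
func matchCTState(state, want, not uint32) bool {
	return state&want == want && state&not == 0
}

func main() {
	// "-rpl+trk" accepts tracked original-direction packets...
	fmt.Println(matchCTState(ctTrk, ctTrk, ctRpl)) // true
	// ...and rejects reply-direction packets.
	fmt.Println(matchCTState(ctTrk|ctRpl, ctTrk, ctRpl)) // false
}
```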

-MatchProtocol(ipProto).
-MatchRegRange(int(marksReg), markTrafficFromLocal, binding.Range{0, 15}).
-MatchCTMark(gatewayCTMark, nil).
-Action().GotoTable(nextTable).
-Cookie(c.cookieAllocator.Request(category).Raw()).
-Done(),
+// The return traffic of connections to a local Pod through the gateway interface (so gatewayCTMark is set)
+// should bypass SNAT too. But it has been covered by the gatewayCT related flow generated in l3FwdFlowToGateway
+// which forwards all reply traffic for such connections back to the gateway interface with the high priority.

// Send the traffic to external to snatTable.
l3FwdTable.BuildFlow(priorityLow).
6 changes: 1 addition & 5 deletions test/integration/agent/openflow_test.go
@@ -1295,17 +1295,13 @@ func prepareExternalFlows(nodeIP net.IP, localSubnet *net.IPNet, gwMAC net.Hardw
uint8(70),
[]*ofTestUtils.ExpectFlow{
{
-MatchStr: fmt.Sprintf("priority=200,%s,reg0=0x2/0xffff,%s=%s", ipProtoStr, nwDstFieldName, localSubnet.String()),
+MatchStr: fmt.Sprintf("priority=200,%s,reg0=0/0x80000,%s=%s", ipProtoStr, nwDstFieldName, localSubnet.String()),
ActStr: "goto_table:80",
},
{
MatchStr: fmt.Sprintf("priority=200,%s,reg0=0x2/0xffff,%s=%s", ipProtoStr, nwDstFieldName, nodeIP.String()),
ActStr: "goto_table:80",
},
-{
-MatchStr: fmt.Sprintf("priority=200,ct_mark=0x20,%s,reg0=0x2/0xffff", ipProtoStr),
-ActStr: "goto_table:80",
-},
{
MatchStr: fmt.Sprintf("priority=190,%s,reg0=0x2/0xffff", ipProtoStr),
ActStr: "goto_table:71",