use mac flows to filter xde traffic #61 #62

Open · wants to merge 9 commits into master

Conversation

rzezeski (Contributor)

This work is on the back burner at the moment, as there is more pressing work to be done; sticking with promisc isn't a problem for the near-term future.

FelixMcFelix (Collaborator) commented Nov 9, 2023

I haven't yet had the luxury of testing on real NICs, but we have progress (the output below is taken from a running omicron + SoftNPU instance):

kyle@farme:~$ flowadm show-flow
FLOW        LINK        IPADDR                   PROTO  LPORT   RPORT   DSFLD
net0_xde    net0        --                       udp    6081    --      --
net1_xde    net1        --                       udp    6081    --      --
kyle@farme:~$ flowstat
           FLOW    IPKTS   RBYTES    IERRS    OPKTS   OBYTES    OERRS
       net0_xde       92   15.73K        0        0        0        0
       net1_xde        0        0        0        0        0        0
kyle@farme:~$ pfexec opteadm dump-layer gateway -p opte0
Layer gateway
======================================================================
Inbound Flows
----------------------------------------------------------------------
PROTO  SRC IP           SPORT  DST IP           DPORT  HITS     ACTION

Outbound Flows
----------------------------------------------------------------------
PROTO  SRC IP           SPORT  DST IP           DPORT  HITS     ACTION

Inbound Rules
----------------------------------------------------------------------
ID     PRI    HITS   PREDICATES                             ACTION
0      1000   97     inner.ip.dst=172.30.3.5                "Static: ether.src=A8:40:25:FF:77:77"
                     inner.ether.dst=A8:40:25:FF:9F:7B

DEF    --     0      --                                     "deny"

Outbound Rules
----------------------------------------------------------------------
ID     PRI    HITS   PREDICATES                             ACTION
3      1      0      inner.ether.src=A8:40:25:FF:9F:7B      "Hairpin: ICMPv4 Echo Reply (A8:40:25:FF:77:77,172.30.3.1) => (A8:40:25:FF:9F:7B,172.30.3.5)"
                     inner.ether.dst=A8:40:25:FF:77:77
                     inner.ip.src=172.30.3.5
                     inner.ip.dst=172.30.3.1
                     inner.ip.proto=ICMP
                     icmp.msg_type=echo request

2      1      1      inner.ether.dst=FF:FF:FF:FF:FF:FF      "Hairpin: DHCPv4 ACK: 172.30.3.5"
                     inner.ether.src=A8:40:25:FF:9F:7B
                     inner.ip.src=0.0.0.0
                     inner.ip.dst=255.255.255.255
                     inner.ip.proto=UDP
                     inner.ulp.dst=67
                     inner.ulp.src=68
                     dhcp.msg_type=Request

1      1      1      inner.ether.dst=FF:FF:FF:FF:FF:FF      "Hairpin: DHCPv4 OFFER: 172.30.3.5"
                     inner.ether.src=A8:40:25:FF:9F:7B
                     inner.ip.src=0.0.0.0
                     inner.ip.dst=255.255.255.255
                     inner.ip.proto=UDP
                     inner.ulp.dst=67
                     inner.ulp.src=68
                     dhcp.msg_type=Discover

0      1      14     inner.ether.ether_type=ARP             "Handle Packet"
                     inner.ether.dst=FF:FF:FF:FF:FF:FF
                     inner.ether.src=A8:40:25:FF:9F:7B

4      1000   98     inner.ip.src=172.30.3.5                "Meta: vpc-meta"
                     inner.ether.src=A8:40:25:FF:9F:7B

DEF    --     9      --                                     "deny"

This currently relies on utterly abusing the internals of flow_entry_t, mac_soft_ring_set_t, et al.: it places xde_rx in the callback slot that mac_rx_deliver would normally occupy, but only on the SRS attached to a desired flow. Part of the issue is that mac_rx_deliver will call into the rx callback of the parent mcip (net0/1, cxgbe0/1) -- i.e., probably i_dls_link_rx on the NIC itself -- and it's not clear to me whether we want to do this with respect to the parent NIC and/or the global zone.
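
A quick way to confirm which of those rx entry points actually fire for underlay traffic is to count them with fbt. This is only a sketch; the `xde` module name in the last probe is my assumption and may need adjusting:

# Sketch: count entries into the rx paths discussed above, with a short stack for context.
pfexec dtrace -n '
    fbt::mac_rx_deliver:entry,
    fbt::i_dls_link_rx:entry,
    fbt:xde:xde_rx:entry
    { @[probefunc, stack(3)] = count(); }'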

FelixMcFelix (Collaborator) commented Mar 8, 2024

So, the interesting news is that this still works, and works on Intel NICs. Sadly, performance (in the latency sense) is practically identical:

EDIT: Marginal changes are expected in these numbers; my test setup is capped at 2x1GbE, and these latency measurements only cover xde_rx/xde_mc_tx. Covering mac_rx(_ring) would take some trickier dtrace predicates. The impact of removed copies, and the reduced pressure from no longer copying underlay cross-traffic, isn't captured here.
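
(The xde_rx/xde_mc_tx timings below come from the benchmark harness; if you want a rough standalone measurement, something like the following fbt sketch works. The `xde` probe module name is an assumption on my part, and there's no filtering of warm-up traffic.)

# Rough per-call latency sketch for the two entry points measured below.
pfexec dtrace -n '
    fbt:xde:xde_rx:entry     { self->rx = timestamp; }
    fbt:xde:xde_rx:return    /self->rx/ { @["xde_rx (ns)"] = quantize(timestamp - self->rx); self->rx = 0; }
    fbt:xde:xde_mc_tx:entry  { self->tx = timestamp; }
    fbt:xde:xde_mc_tx:return /self->tx/ { @["xde_mc_tx (ns)"] = quantize(timestamp - self->tx); self->tx = 0; }'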

C2S results from #62 after `master`
###----------------------###
:::  Running experiment  :::
:::iperf-tcp/over-nic/c2s:::
###----------------------###
dtrace: description 'profile-201us ' matched 2 probes
Run 1/10...1641.988Mbps
Run 2/10...1742.772Mbps
Run 3/10...1429.135Mbps
Run 4/10...1637.982Mbps
Run 5/10...1583.501Mbps
Run 6/10...1672.459Mbps
Run 7/10...1669.73Mbps
Run 8/10...1516.732Mbps
Run 9/10...1712.84Mbps
Run 10/10...1367.633Mbps
###---------------------###
:::    iPerf done...    :::
:::Awaiting out files...:::
###---------------------###
###-----###
:::done!:::
###-----###
Gnuplot not found, using plotters backend
Benchmarking iperf-tcp/over-nic/c2s/rx
Benchmarking iperf-tcp/over-nic/c2s/rx: Warming up for 3.0000 s
Benchmarking iperf-tcp/over-nic/c2s/rx: Collecting 100 samples in estimated 20.000 s (199M iterations)
Benchmarking iperf-tcp/over-nic/c2s/rx: Analyzing
iperf-tcp/over-nic/c2s/rx
                        time:   [2.2234 µs 2.2235 µs 2.2237 µs]
                        change: [+24.453% +24.474% +24.495%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 5 outliers among 100 measurements (5.00%)
  1 (1.00%) low severe
  1 (1.00%) low mild
  3 (3.00%) high severe

Benchmarking iperf-tcp/over-nic/c2s/tx
Benchmarking iperf-tcp/over-nic/c2s/tx: Warming up for 3.0000 s
Benchmarking iperf-tcp/over-nic/c2s/tx: Collecting 100 samples in estimated 20.000 s (200M iterations)
Benchmarking iperf-tcp/over-nic/c2s/tx: Analyzing
iperf-tcp/over-nic/c2s/tx
                        time:   [3.4678 µs 3.4680 µs 3.4682 µs]
                        change: [+17.321% +17.336% +17.351%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 7 outliers among 100 measurements (7.00%)
  1 (1.00%) low severe
  2 (2.00%) low mild
  3 (3.00%) high mild
  1 (1.00%) high severe
C2S results repeated
###----------------------###
:::  Running experiment  :::
:::iperf-tcp/over-nic/c2s:::
###----------------------###
dtrace: description 'profile-201us ' matched 2 probes
Run 1/10...1739.525Mbps
Run 2/10...1655.766Mbps
Run 3/10...1753.173Mbps
Run 4/10...1730.84Mbps
Run 5/10...1768.663Mbps
Run 6/10...1739.911Mbps
Run 7/10...1708.786Mbps
Run 8/10...1657.681Mbps
Run 9/10...1764.689Mbps
Run 10/10...1711.396Mbps
###---------------------###
:::    iPerf done...    :::
:::Awaiting out files...:::
###---------------------###
###-----###
:::done!:::
###-----###
Gnuplot not found, using plotters backend
Benchmarking iperf-tcp/over-nic/c2s/rx
Benchmarking iperf-tcp/over-nic/c2s/rx: Warming up for 3.0000 s
Benchmarking iperf-tcp/over-nic/c2s/rx: Collecting 100 samples in estimated 20.000 s (212M iterations)
Benchmarking iperf-tcp/over-nic/c2s/rx: Analyzing
iperf-tcp/over-nic/c2s/rx
                        time:   [1.9022 µs 1.9023 µs 1.9025 µs]
                        change: [-14.462% -14.447% -14.431%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  1 (1.00%) low mild
  1 (1.00%) high mild
  2 (2.00%) high severe

Benchmarking iperf-tcp/over-nic/c2s/tx
Benchmarking iperf-tcp/over-nic/c2s/tx: Warming up for 3.0000 s
Benchmarking iperf-tcp/over-nic/c2s/tx: Collecting 100 samples in estimated 20.000 s (190M iterations)
Benchmarking iperf-tcp/over-nic/c2s/tx: Analyzing
iperf-tcp/over-nic/c2s/tx
                        time:   [3.0245 µs 3.0246 µs 3.0248 µs]
                        change: [-12.793% -12.781% -12.768%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  2 (2.00%) low severe
  1 (1.00%) low mild
  3 (3.00%) high mild
  2 (2.00%) high severe

So between a few runs we're basically in the same ballpark, possibly a little worse off since we've now added [mac_rx_flow, mac_rx_classify, mac_srs_subflow_process, mac_srs_process, mac_srs_drain] to the rx call stack. The upshot is that opte-bad-packet.d is absolutely silent, so we're not paying anything for the remainder of the underlay traffic, and we're definitely not in promisc:

# master
kyle@farme:~/gits/opte$ pfexec opteadm set-xde-underlay igb0 igb1
kyle@farme:~/gits/opte$ ifconfig | grep igb
igb0: flags=1000942<BROADCAST,RUNNING,PROMISC,MULTICAST,IPv4> mtu 9000 index 3
igb1: flags=1000942<BROADCAST,RUNNING,PROMISC,MULTICAST,IPv4> mtu 9000 index 4
igb0: flags=20002104941<UP,RUNNING,PROMISC,MULTICAST,DHCP,ROUTER,IPv6> mtu 9000 index 3
igb1: flags=20002104941<UP,RUNNING,PROMISC,MULTICAST,DHCP,ROUTER,IPv6> mtu 9000 index 4

# git switch use-the-flow-luke-61, driver recompile, ...
kyle@farme:~/gits/opte$ pfexec opteadm set-xde-underlay igb0 igb1
kyle@farme:~/gits/opte$ ifconfig | grep igb
igb0: flags=1000842<BROADCAST,RUNNING,MULTICAST,IPv4> mtu 9000 index 3
igb1: flags=1000842<BROADCAST,RUNNING,MULTICAST,IPv4> mtu 9000 index 4
igb0: flags=20002104841<UP,RUNNING,MULTICAST,DHCP,ROUTER,IPv6> mtu 9000 index 3
igb1: flags=20002104841<UP,RUNNING,MULTICAST,DHCP,ROUTER,IPv6> mtu 9000 index 4

I don't yet know why zone-to-zone over simnets is broken on CI -- from what I recall it worked on my local helios box before I acquired a second test node.

rcgoodfellow (Contributor)

What do you get when running a similar traffic flow between the raw IPv6 addresses?

FelixMcFelix (Collaborator) commented Mar 8, 2024

> What do you get when running a similar traffic flow between the raw IPv6 addresses?

While running an iperf session over each underlay link for 100s:

kyle@farme:~/gits/opte$ cargo kbench in-situ have-a-go
    Finished bench [optimized + debuginfo] target(s) in 0.15s
     Running benches/xde.rs (target/release/deps/xde-ae24f11c9169898b)
###----------------------###
:::  DTrace running...   :::
:::Type 'exit' to finish.:::
###----------------------###
dtrace: description 'profile-201us ' matched 2 probes
exit
###---------------------###
:::Awaiting out files...:::
###---------------------###
###-----###
:::done!:::
###-----###
ERROR: No stack counts found
Failed to create flamegraph for xde_rx.
ERROR: No stack counts found
Failed to create flamegraph for xde_mc_tx.

No hits for non-Geneve traffic. The flows themselves are:

kyle@farme:~/gits/opte$ flowadm show-flow
FLOW        LINK        IPADDR                   PROTO  LPORT   RPORT   DSFLD
igb0_xde    igb0        --                       udp    6081    --      --
igb1_xde    igb1        --                       udp    6081    --      --

So far as I can tell we can't jointly specify IP address + family + port; cf. man flowadm (excerpt below, with a sketch of the kind of flow we can add after it):

The following six types of combinations of attributes are supported:
local_ip=address[/prefixlen]
remote_ip=address[/prefixlen]
transport={tcp|udp|sctp|icmp|icmpv6}
transport={tcp|udp|sctp},local_port=port
transport={tcp|udp|sctp},remote_port=port
dsfield=val[:dsfield_mask]
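
The closest we can express with those attribute combinations is transport + local port, which is what produces the flows shown above. Roughly (illustrative invocation, not lifted from the code):

# Illustrative: steer all Geneve (UDP/6081) traffic on each underlay link into a flow.
pfexec flowadm add-flow -l igb0 -a transport=udp,local_port=6081 igb0_xde
pfexec flowadm add-flow -l igb1 -a transport=udp,local_port=6081 igb1_xde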

During setup, we are not doing any work to ensure that Helios has a
valid NDP cache entry ready to use over the simnet link we install for
testing. As a result, XDE selects the right output port, but installs
source and destination MAC addrs of zero.

This worked before; the devices were in promiscuous mode, so the packets
made it into `xde_rx`. In other cases, the underlay traffic in e.g. a
SoftNPU deployment was priming all the necessary NCEs, so we always knew
the target MAC address. Obviously this is an easy fix here, and in
practice we'll always have the NCE for the nexthop (i.e., the sidecar).
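
For the test setup, the fix amounts to priming the neighbor cache over the simnet link before sending, e.g. something like the following (the link name and link-local address are invented placeholders, not the actual CI values):

# Hypothetical: sim0 is the test simnet link, fe80::1 its peer's link-local address.
pfexec ping -A inet6 fe80::1%sim0 1    # a single probe is enough to create the NCE
ndp -a | grep sim0                     # the peer should now appear in the neighbor cache
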
@morlandi7 morlandi7 added this to the 8 milestone Mar 14, 2024
@rcgoodfellow rcgoodfellow modified the milestones: 8, 9 Apr 11, 2024
@FelixMcFelix FelixMcFelix modified the milestones: 9, 10 Jul 1, 2024
@askfongjojo askfongjojo modified the milestones: 10, 12 Aug 15, 2024
FelixMcFelix added a commit that referenced this pull request Aug 20, 2024
Today, we get our TX and RX pathways on underlay devices for XDE by
creating a secondary MAC client on each device. As part of this process
we must attach a unicast MAC address (or specify
`MAC_OPEN_FLAGS_NO_UNICAST_ADDR`) during creation to spin up a valid
datapath; otherwise we can receive packets on our promiscuous-mode
handler, but any sent packets are immediately dropped by MAC. However,
datapath setup then fails to supply a dedicated ring/group for the new
client, and the device is reduced to pure software classification. This
hard-disables any ring polling threads, and so all packet processing
occurs in the interrupt context. This limits throughput and increases
OPTE's blast radius on control plane/crucible traffic between sleds.

This PR places a hold onto the underlay NICs via `dls`, and makes use of
`dls_open`/`dls_close` to acquire a valid transmit pathway onto the
original (primary) MAC client, to which we can also attach a promiscuous
callback. As desired, we are back in hardware classification.

This work is orthogonal to #62 (and related efforts) which will get us
out of promiscuous mode -- both are necessary parts of making optimal
use of the illumos networking stack.

Closes #489.
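
One way to sanity-check the behaviour described in the commit above (whether a MAC client got dedicated hardware rings or fell back to pure software classification) is to look at ring/group assignment on the underlay NIC. Sketch only; the link name is just the one used earlier in this thread, and the exact columns vary by illumos build:

# Show hardware ring/group usage for an underlay device and which clients own them.
pfexec dladm show-phys -H igb0
# A client listed against RX rings of its own group indicates hardware
# classification; a client that only appears under the default group is
# being classified in software.
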
@morlandi7 morlandi7 modified the milestones: 12, 13 Nov 22, 2024