Move underlay NICs back into H/W Classification #504
Conversation
Extra setup details on sn9/14 for TCP tunable adjustments (so we don't saturate at ~20 Gbps), and better presentation of all known speeds:

```
# TCP tunables.
ipadm set-prop -p max_buf=8388608 tcp
ipadm set-prop -p send_buf=8388608 tcp
ipadm set-prop -p recv_buf=8388608 tcp
```

Global zone iperf numbers and status after a
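For intuition on why these buffer sizes matter, a rough bandwidth-delay-product check is sketched below. It is purely illustrative: the ~400 µs sled-to-sled RTT and the 1 MiB stock `max_buf` cap are assumptions, not measurements from sn9/14.

```rust
/// Rough bandwidth-delay-product check: the throughput a given TCP window
/// (buffer size) can sustain at a given round-trip time.
fn max_throughput_gbps(buf_bytes: f64, rtt_secs: f64) -> f64 {
    (buf_bytes * 8.0) / rtt_secs / 1e9
}

fn main() {
    let stock_buf = 1_048_576.0; // assumed stock max_buf cap (1 MiB)
    let tuned_buf = 8_388_608.0; // the 8 MiB cap set by the tunables above
    let rtt = 0.0004; // assumed ~400 µs sled-to-sled RTT (illustrative)

    // Prints roughly 21 Gbps for the stock cap (consistent with saturating
    // around ~20 Gbps) and roughly 168 Gbps for the tuned cap.
    println!("stock: {:.1} Gbps", max_throughput_gbps(stock_buf, rtt));
    println!("tuned: {:.1} Gbps", max_throughput_gbps(tuned_buf, rtt));
}
```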
EDIT: Updated 2024-06-20, stlouis-0-ge5ec805ada.
I spent some time revisiting and fixing this up, and trying to understand the incompatibility with the flows work -- hacky as it is, I wouldn't have expected breakage. There is a fairly consistent way to trigger a kernel panic, roughly in order:
I have some dumps to dig into on Monday; remarkably, only the test harness is causing issues when we put the two features together (i.e., we only get a panic once traffic is sent over). EDIT: One dump puts us at:
EDIT2: Found it --
Force-pushed from f18abb1 to 1da884e.
```rust
// Set up a unicast callback. The MAC address here is a sentinel value with
// nothing real behind it. This is why we picked the zero value in the Oxide
// OUI space for virtual MACs. The reason this is being done is that illumos
// requires that if there is a single mac client on a link, that client must
// have an L2 address. This was not caught until recently, because this is
// only enforced as a debug assert in the kernel.
```
This should never be the case now: being visible via (and getting a stream from) DLS implies the existence of a client on the underlying MAC. This is also the root cause of landing in software classification, as we end up with two clients.
I'm going to try coming/going via `dld_open` and related, though we'll see if that works in quite the same way.
Good thing we don't perform any hairpin responses on inbound traffic, eh?
I think this is good to go now; a quick recap of what's changed:
CI is going to fail on
lib/opte-ioctl/src/lib.rs

```diff
@@ -396,7 +396,7 @@ where
 }

 pub fn fetch_fragile_types() -> Result<FragileInternals, Error> {
-    let dld_ctf = Ctf::from_file("/kernel/drv/amd64/dld")
+    let dld_ctf = Ctf::from_file("/system/object/dld/object")
```
😮 TIL objfs
I learned that trick from Robert! I think it'll be short-lived if we use your suggestion below. I'll preserve a copy of this branch since we could end up needing these tricks again (although I hope we do not).
Sad to see the CTF check machinery go, but it's for the best after all...
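For anyone else meeting objfs for the first time: it exposes each loaded kernel module's ELF object under /system/object/&lt;module&gt;/object, which is why the CTF data can be read from the running kernel rather than from a driver file on disk. A minimal sketch of poking at that path, assuming only the standard library (this is not the opte-ioctl code):

```rust
use std::fs::File;
use std::io::{self, Read};

/// Read the first bytes of a loaded module's object via objfs and check the
/// ELF magic. Purely illustrative; a real consumer (like the CTF machinery
/// above) would go on to parse the .SUNW_ctf section of this object.
fn module_object_is_elf(module: &str) -> io::Result<bool> {
    let path = format!("/system/object/{module}/object");
    let mut magic = [0u8; 4];
    File::open(path)?.read_exact(&mut magic)?;
    Ok(&magic == b"\x7fELF")
}

fn main() -> io::Result<()> {
    println!("dld looks like ELF: {}", module_object_is_elf("dld")?);
    Ok(())
}
```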
* Move underlay NICs back into H/W Classification (oxidecomputer/opte#504): My disposition is to wait until R11 before we merge this -- I've done lengthy testing on `glasgow`, but I would like plenty of soak time on dogfood before this sees a release.
This PR rewrites the core of OPTE's packet model to use zero-copy packet parsing/modification via the [`ingot`](oxidecomputer/ingot#1) library. This enables a few changes which get us just shy of the 3 Gbps mark.

* **[2.36 -> 2.7]** The use of ingot for modifying packets in both the slowpath (UFT miss) and existing fastpath (UFT hit).
  * Parsing is faster -- we no longer copy out all packet header bytes onto the stack, and we do not allocate a vector to decompose an `mblk_t` into individual links.
  * Packet field modifications are applied directly to the `mblk_t` as they happen, and field reads are made from the same source.
  * Non-encap layers are not copied back out.
* **[2.7 -> 2.75]** Packet L4 hashes are cached as part of the UFT, speeding up multipath route selection over the underlay.
* **[2.75 -> 2.8]** Incremental Internet checksum recalculation is only performed when applicable fields change on inner flow headers (e.g., NAT'd packets); a sketch of this style of update follows after this list.
  * VM-to-VM / intra-VPC traffic is the main use case here.
* **[2.8 -> 3.05]** `NetworkParser`s now have the concept of inbound & outbound `LightweightMeta` formats. These support the key operations needed to execute all our UFT flows today (`FlowId` lookup, inner headers modification, encap push/pop, cksum update).
  * This also allows us to pre-serialize any bytes to be pushed in front of a packet, speeding up `EmitSpec`.
  * This is crucial for outbound traffic in particular, which has far smaller (in `struct`-size) metadata.
  * UFT misses or incompatible flows fall back to using the full metadata.
* **[3.05 -> 2.95]** TCP state tracking uses a separate per-flow lock and does not require any lookup from a UFT.
  * I do not have numbers on how large the performance loss would be if we held the `Port` lock for the whole time.
* (Not measured) Packet/UFT L4 hashes are used as the Geneve source port, spreading inbound packets over NIC Rx queues based on the inner flow (also sketched below).
  * This is now possible because of #504 -- software classification would have limited us to the default inbound queue/group.
  * I feel bad for removing one more FF7 reference, but that is the way of these things. RIP port `7777`.
  * Previously, Rx queue affinity was derived solely from `(Src Sled, Dst Sled)`.

There are several other changes here made to how OPTE functions which are needed to support the zero-copy model.

* Each collected set of header transforms is `Arc<>`'d, such that we can apply them outside of the `Port` lock.
* `FlowTable<S>`s now store `Arc<FlowEntry<S>>`, rather than `FlowEntry<S>`.
  * This enables the UFT entry for any flow to store its matched TCP flow, update its hit count and timestamp, and then update the TCP state without reacquiring the `Port` lock.
  * This also drastically simplifies TCP state handling in fast path cases to not rely on post-transformation packets for lookup.
* `Opte::process` returns an `EmitSpec` which is needed to finalise a packet before it can be used.
  * I'm not too happy about the ergonomics, but we have this problem because otherwise we'd need `Packet` to have some self-referential fields when supporting other key parts of XDE (e.g., parse -> use fields -> select port -> process).

Closes #571, closes #481, closes #460. Slightly alleviates #435.
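On the incremental checksum bullet above: the core trick is the standard ones-complement update from RFC 1624, folding in only the 16-bit words a transform actually rewrites instead of re-summing the whole header. A minimal sketch of that update, not OPTE's actual implementation:

```rust
/// Incrementally update a ones-complement checksum (RFC 1624, Eqn. 3) when a
/// single 16-bit word covered by the checksum changes from `old` to `new`.
/// `cksum` is the checksum field exactly as it appears in the header.
fn cksum_update16(cksum: u16, old: u16, new: u16) -> u16 {
    // HC' = ~(~HC + ~m + m'), with end-around carry folding.
    let mut sum = (!cksum as u32) + (!old as u32) + (new as u32);
    while sum >> 16 != 0 {
        sum = (sum & 0xffff) + (sum >> 16);
    }
    !(sum as u16)
}

fn main() {
    // E.g. a NAT rewrite changing one 16-bit half of an IPv4 address: only
    // that word's contribution is folded into the existing checksum.
    let updated = cksum_update16(0xb861, 0xc0a8, 0x0a00);
    println!("updated checksum: {updated:#06x}");
}
```

Doing this only when a transform touches the inner flow headers is what keeps the NAT'd-packet case cheap relative to a full recompute.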
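And on using the L4 hash as the Geneve source port: underlay NICs typically compute RSS over the outer headers, so varying the outer UDP source port per inner flow spreads a sled pair's traffic across Rx queues instead of pinning it to one. A toy illustration of the idea; the hasher and port range here are arbitrary choices, not what OPTE ships:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Inner (pre-encap) flow identity; mirrors the usual 5-tuple.
#[derive(Hash)]
struct FlowId {
    proto: u8,
    src_ip: [u8; 4],
    dst_ip: [u8; 4],
    src_port: u16,
    dst_port: u16,
}

/// Derive a Geneve (outer UDP) source port from the inner flow, so the
/// receiving NIC's RSS spreads distinct inner flows across Rx queues.
fn geneve_src_port(flow: &FlowId) -> u16 {
    let mut h = DefaultHasher::new();
    flow.hash(&mut h);
    // Fold the 64-bit hash into the ephemeral port range 49152..=65535.
    49152 + (h.finish() % 16384) as u16
}

fn main() {
    let flow = FlowId {
        proto: 6,
        src_ip: [10, 0, 0, 1],
        dst_ip: [10, 0, 0, 2],
        src_port: 44331,
        dst_port: 443,
    };
    println!("outer source port: {}", geneve_src_port(&flow));
}
```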
Today, we get our TX and RX pathways on underlay devices for XDE by creating a secondary MAC client on each device. As part of this process we must attach a unicast MAC address (or specify `MAC_OPEN_FLAGS_NO_UNICAST_ADDR`) during creation to spin up a valid datapath; otherwise we can receive packets on our promiscuous-mode handler, but any sent packets are immediately dropped by MAC. However, datapath setup then fails to supply a dedicated ring/group for the new client, and the device is reduced to pure software classification. This hard-disables any ring polling threads, and so all packet processing occurs in the interrupt context. This limits throughput and increases OPTE's blast radius on control plane/crucible traffic between sleds.

This PR places a hold onto the underlay NICs via `dls`, and makes use of `dls_open`/`dls_close` to acquire a valid transmit pathway onto the original (primary) MAC client, to which we can also attach a promiscuous callback. As desired, we are back in hardware classification. And if we zoom into the flamegraph, we now receive packets via `mac_rx_srs_poll_ring` (!!!).

Performance numbers are included below -- on its own, a 7% increase in guest throughput and a 74% increase in underlay max throughput.
This work is orthogonal to #62 (and related efforts) which will get us out of promiscuous mode -- both are necessary parts of making optimal use of the illumos networking stack.
Closes #489.