Initial version #1

FelixMcFelix · 2024-10-04T10:11:15Z

This PR contains the first implementation of this library, which is designed as a faster (zero-copy when possible) parsing and metadata backend for OPTE.

At a high level, this includes:

- A proc macro which generates, for an owned packet header definition:
  - a packed version of that header, and methods to parse these headers from buffers.
  - traits to read and write fields consistently, regardless of whether a packet is owned/borrowed.
- A proc macro which generates choices between several headers, conditioned on a common 'choice' type.
- A proc macro which generates a parser for a chain of headers (with optional control actions interspersed).
  - This is designed to account mainly for platforms which store packets as chains of buffers, and do not straddle individual headers across these buffers (i.e., illumos).
- Where possible, additional traits to emit packets into byte buffers and convert to/from owned versions.
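To make the owned/borrowed split above concrete, here is a hand-written sketch of the pattern for a UDP header. It is not Ingot's generated code: the names `UdpRef`, `UdpMut`, `UdpView` and `ports` are hypothetical, and the real macros may well structure things differently.

```rust
/// Read access shared by owned and borrowed headers (hypothetical trait).
pub trait UdpRef {
    fn source(&self) -> u16;
    fn destination(&self) -> u16;
}

/// Write access shared by owned and borrowed headers (hypothetical trait).
pub trait UdpMut {
    fn set_source(&mut self, val: u16);
}

/// Owned representation: plain fields in host byte order.
pub struct Udp {
    pub source: u16,
    pub destination: u16,
    pub length: u16,
    pub checksum: u16,
}

/// Borrowed representation: a view over the 8 wire bytes, nothing is copied.
pub struct UdpView<'a>(&'a mut [u8; 8]);

impl<'a> UdpView<'a> {
    pub const MINIMUM_LENGTH: usize = 8;

    /// Split `buf` into a header view and the remaining bytes.
    pub fn parse(buf: &'a mut [u8]) -> Option<(Self, &'a mut [u8])> {
        if buf.len() < Self::MINIMUM_LENGTH {
            return None;
        }
        let (hdr, rest) = buf.split_at_mut(Self::MINIMUM_LENGTH);
        let hdr: &'a mut [u8; 8] = hdr.try_into().ok()?;
        Some((UdpView(hdr), rest))
    }
}

impl UdpRef for Udp {
    fn source(&self) -> u16 { self.source }
    fn destination(&self) -> u16 { self.destination }
}

impl UdpMut for Udp {
    fn set_source(&mut self, val: u16) { self.source = val; }
}

impl UdpRef for UdpView<'_> {
    // Wire format is big-endian, so convert on every read.
    fn source(&self) -> u16 { u16::from_be_bytes([self.0[0], self.0[1]]) }
    fn destination(&self) -> u16 { u16::from_be_bytes([self.0[2], self.0[3]]) }
}

impl UdpMut for UdpView<'_> {
    // Writes go straight into the underlying packet buffer.
    fn set_source(&mut self, val: u16) {
        self.0[..2].copy_from_slice(&val.to_be_bytes());
    }
}

/// Works the same way for owned and borrowed headers.
pub fn ports(h: &impl UdpRef) -> (u16, u16) {
    (h.source(), h.destination())
}
```

Code written against `UdpRef`/`UdpMut` does not need to know whether it holds a copy or a view into the packet, which is the property the generated traits are described as providing.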

Testing notes

- Macros (`choice`, `Ingot`, `Parse`) and base types (`ingot_types`) should be quite thoroughly documented on their intended use, at least via `cargo +nightly doc`.
  - Individual packets and newtypes for actual protocols, less so. There isn't yet logic to forward docattrs or build up docstrings for generated fields/methods/items.
- `cargo +nightly miri test` is an essential part of ensuring that all operations here are sane -- my understanding is that the test suite exercises all elements of packet parsing, field access/setting, and packet emit that use unsafe.
- `cargo expand -p ingot > full.rs` is my main sledgehammer for verifying that generated methods are behaving as expected. This is probably most useful for examining how, e.g., packet parsing and bitfields are implemented.
- `cargo bench` includes some ballparks for certain operations: individual header parse, full packet parse in a single buffer/across multiple buffers, full packet parse returning purely borrowed vs. hybrid contents.

Rough TODOs:

Use a separate Parse Error type on full packets, including the name of the affected layer as a &'static str and/or &'static CStr.
Fully document macros.
Forward doc-attrs to generated structs.
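
As a rough illustration of the first TODO, a full-packet error could pair the per-header error with the name of the layer that failed. This is only a sketch: `PacketParseError` and `at_layer` are hypothetical names, and only the `TooSmall` and `Unwanted` variants are taken from this thread.

```rust
use core::fmt;

/// Per-header failure reasons (a subset, for illustration).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ParseError {
    TooSmall,
    Unwanted,
}

/// Hypothetical full-packet error: which layer failed, and why.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct PacketParseError {
    /// Name of the affected layer, e.g. "outer_udp".
    pub layer: &'static str,
    pub kind: ParseError,
}

impl fmt::Display for PacketParseError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{}->{:?}", self.layer, self.kind)
    }
}

/// Attach a layer name to the result of parsing a single header.
pub fn at_layer<T>(
    layer: &'static str,
    res: Result<T, ParseError>,
) -> Result<T, PacketParseError> {
    res.map_err(|kind| PacketParseError { layer, kind })
}
```

Because the layer name is a `&'static str`, building the error needs no allocation, which matters on a packet-processing path.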

FelixMcFelix · 2024-10-08T14:24:11Z

Sanity check within OPTE for the recent error handling changes:

kyle@farme:~/gits/opte/dtrace$ pfexec ./opte-trace opte-bad-packet.d
PORT         DIR MBLK               MSG+DATA

unknown      IN  0xfffffe69f9c1fa20 Parse->IngotError->outer_encap->TooSmall[0, 0]
unknown      IN  0xfffffe69e52a9a00 Parse->IngotError->outer_v6->Unwanted[0, 0]
unknown      IN  0xfffffe69f9da2160 Parse->IngotError->outer_v6->Unwanted[0, 0]
unknown      IN  0xfffffe69e2e871a0 Parse->IngotError->outer_udp->Unwanted[0, 0]
unknown      IN  0xfffffe69f9d42e40 Parse->IngotError->outer_udp->Unwanted[0, 0]
unknown      IN  0xfffffe69fa0c4e80 Parse->IngotError->outer_udp->Unwanted[0, 0]
unknown      IN  0xfffffe69e64bae80 Parse->IngotError->outer_udp->Unwanted[0, 0]
unknown      IN  0xfffffe69fe1ece00 Parse->IngotError->outer_encap->TooSmall[0, 0]

I think this is in a good spot now. Quick round of self review then I shall loop folks in while I push on getting the OPTE side cleaned up.

zeeshanlakhani · 2024-10-09T12:51:08Z

ingot-types/src/field.rs

+    const MINIMUM_LENGTH: usize = T::MINIMUM_LENGTH;
+
+    #[inline]
+    fn packet_length(&self) -> usize {


General Q: Is packet length automatically set when I extend a V6 packet, for example, with an additional extension header?

It is, yep. Header::packet_length is derived to include the current wire lengths of all variable-length and parsed fields:

impl<V: ::ingot::types::ByteSlice> ::ingot::types::Header for ValidIpv6<V> {
    const MINIMUM_LENGTH: usize = 40usize;

    #[inline]
    fn packet_length(&self) -> usize {
        Self::MINIMUM_LENGTH + self.1.packet_length()
    }
}

impl ::ingot::types::Header for Ipv6 {
    const MINIMUM_LENGTH: usize = 40usize;

    #[inline]
    fn packet_length(&self) -> usize {
        Self::MINIMUM_LENGTH + self.v6ext.packet_length()
    }
}

In the borrowed case, this just returns the length of the underlying buffer for said options. In the owned case (for Repeated anyhow), it sums up the packet_length of all elements in the inner vec.
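
A minimal, self-contained sketch of that length rule follows. The `Header` trait here mirrors the shape shown in the expansion above, but everything else (`OwnedExt`, `OwnedExts`, `BorrowedExts`, `Ipv6Like`, and the 2-byte fixed part of an extension) is a hypothetical stand-in rather than Ingot's real types.

```rust
pub trait Header {
    const MINIMUM_LENGTH: usize;
    fn packet_length(&self) -> usize;
}

/// One owned extension header: an assumed 2-byte fixed part plus variable data.
pub struct OwnedExt {
    pub data: Vec<u8>,
}

impl Header for OwnedExt {
    const MINIMUM_LENGTH: usize = 2;
    fn packet_length(&self) -> usize {
        Self::MINIMUM_LENGTH + self.data.len()
    }
}

/// Owned case: the length is the sum over every element in the inner vec.
pub struct OwnedExts(pub Vec<OwnedExt>);

impl Header for OwnedExts {
    const MINIMUM_LENGTH: usize = 0;
    fn packet_length(&self) -> usize {
        self.0.iter().map(|e| e.packet_length()).sum()
    }
}

/// Borrowed case: the still-encoded bytes, so the length is the buffer length.
pub struct BorrowedExts<'a>(pub &'a [u8]);

impl Header for BorrowedExts<'_> {
    const MINIMUM_LENGTH: usize = 0;
    fn packet_length(&self) -> usize {
        self.0.len()
    }
}

/// Fixed 40-byte IPv6 part plus whatever the extensions currently occupy.
pub struct Ipv6Like<E> {
    pub exts: E,
}

impl<E: Header> Header for Ipv6Like<E> {
    const MINIMUM_LENGTH: usize = 40;
    fn packet_length(&self) -> usize {
        Self::MINIMUM_LENGTH + self.exts.packet_length()
    }
}
```

Pushing another element onto the owned vec changes the reported length on the next call, with no separate bookkeeping step.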

pfmooney

I'm still working my way through this, but here's an initial set of comments/questions.

pfmooney · 2024-10-31T00:24:29Z

ingot-types/src/accessor.rs

+            let (r, _): (Ref<&[u8], T>, _) =
+                Ref::from_prefix(buf.as_bytes())
+                    .map_err(|_| ParseError::TooSmall)?;


If I'm reading this (and other bits) correctly, the expectation of the parsing logic is that the buffer from which the item (header, or component) is being parsed must be large enough to contain the entire thing? This would be at odds with the statement in the readme:

Ingot allows packet parsing over split buffers

Am I missing how a packet with its headers split across several mblk_ts would be parsed?

Yes, individual headers do still need to reside in one contiguous buffer. Packets (#[derive(Parse)]) can contain headers in different buffers, however, using the Read trait and the generated parse_read methods. It could be possible in future to extend that functionality all the way down for extension headers or convert a straddled read to the owned version of a header, but we don't quite need that today.

This is the same model we have in OPTE -- a header can't straddle several mblk_ts, but zones do send (ether, ip, icmp), (ether) -> (ip, icmp) and similar permutations so we need to handle those.

Ok. I think this constraint on operation should be called out pretty clearly. While I'm not saying that Ingot and OPTE need to handle parsing of individual items across mblk_t boundaries immediately, that is something that many other things in mac do not take for granted today. (mac_ether_offload_info() does not even assume it can read a u16 without crossing an mblk_t boundary)

Yes, it's true that zones are well behaved, and viona transmissions should be too. I have not yet looked into cxgbe to see if the buffer sizes which are chosen there will guarantee that an incoming packet does not violate the restrictions set forth here.

Have you asked @rmustacc about his position on this?

> Ok. I think this constraint on operation should be called out pretty clearly.

Agreed. I wasn't aware that illumos pessimised so heavily on this front.

> Have you asked @rmustacc about his position on this?

Not yet, I'll reach out.

One thing I'd like to confirm is that my understanding is that say the entire IP or TCP header (including any extensions) must be in a single buffer. Each distinct header can be in a different single buffer. Multiple headers may be in same single contiguous buffer. Is that correct?

> > Ok. I think this constraint on operation should be called out pretty clearly.
>
> Agreed. I wasn't aware that illumos pessimised so heavily on this front.

I was pessimistic in mac_ether_offload_info() in that I didn't want to think about a single mblk and calling msgpullup, which is what a lot of folks did. That said, I think I am probably more pessimistic than required. I wrote that function from the most paranoid position I could be in. Basically I'd rather do what I did than ever call msgpullup(). I know that @pfmooney has told me many times how we could be much smarter in that code.

> Have you asked @rmustacc about his position on this?

In general the way that most device drivers and hardware work is that they'll basically fill data into descriptors linearly. That is, if they get given mixed size buffers they'll fill them that way (though most hardware has constraints on buffer sizes). The only feature that some hardware has (but we don't use) is that such hardware allows for splitting the header and data into different descriptors.

I think there are a lot of places that do end up doing different checks here. But effectively the whole IP fast path relies on the headers being contiguous in the first mblk and other things through the stack.

Pretty much all driver buffer sizes are going to be sized in a way that I don't think we're going to trip over this given a reasonable upper bound on header size (e.g. 256 bytes). I think the current constraint is probably something we can reasonably live with, with the following two constraints:

1. It must not cause a safety issue if it is violated. I believe this is already the case.

2. We need counters and a way to be notified or make this visible if it does happen so that this isn't just silent dropping.

> One thing I'd like to confirm is that my understanding is that say the entire IP or TCP header (including any extensions) must be in a single buffer. Each distinct header can be in a different single buffer. Multiple headers may be in same single contiguous buffer. Is that correct?

That is correct. The Packet macro is basically saying, for a set of headers:

- Each header is backed by a contiguous byteslice (including its extensions/options).
- If there are still bytes in the current segment, then try to parse the next header from that remainder (size permitting). If we fail, then we bail on the whole packet.
- If the current segment is empty, then access the next and keep parsing.

So headers can all be in the same buffer, or all different buffers, or a middleground in the same b_cont chain. They just can't straddle, or you won't have enough bytes left to parse out a whole header -- we're not doing any pullup.
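
The three rules above can be sketched as a cursor over a list of byte segments. This is an illustration of the model only, not the generated parser: `Chain`, `Parsed` and `parse_eth_ip_udp` are hypothetical, the segment list stands in for a `b_cont` chain, and IPv4 options / IPv6 extension headers are ignored to keep it short.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum ParseError {
    TooSmall,
    Unwanted,
}

/// Cursor over a chain of segments (stand-in for an mblk_t b_cont chain).
pub struct Chain<'a> {
    segs: &'a [&'a [u8]],
    cur: &'a [u8],
}

impl<'a> Chain<'a> {
    pub fn new(segs: &'a [&'a [u8]]) -> Self {
        Chain { segs, cur: &[] }
    }

    /// Take `len` contiguous bytes for one header. Segments are never joined.
    pub fn take(&mut self, len: usize) -> Result<&'a [u8], ParseError> {
        // Current segment is empty: move on to the next one and keep parsing.
        while self.cur.is_empty() {
            let (next, rest) = self.segs.split_first().ok_or(ParseError::TooSmall)?;
            self.cur = *next;
            self.segs = rest;
        }
        // Bytes remain but not enough for a whole header: this is the
        // straddle case, and we bail rather than doing any pullup.
        if self.cur.len() < len {
            return Err(ParseError::TooSmall);
        }
        let (hdr, rest) = self.cur.split_at(len);
        self.cur = rest;
        Ok(hdr)
    }
}

pub struct Parsed<'a> {
    pub ether: &'a [u8],
    pub ip: &'a [u8],
    pub udp: &'a [u8],
}

pub fn parse_eth_ip_udp<'a>(chain: &mut Chain<'a>) -> Result<Parsed<'a>, ParseError> {
    let ether = chain.take(14)?;
    // Choose the next header from the ethertype field.
    let ip = match u16::from_be_bytes([ether[12], ether[13]]) {
        0x0800 => chain.take(20)?, // IPv4, assuming no options
        0x86dd => chain.take(40)?, // IPv6, assuming no extension headers
        _ => return Err(ParseError::Unwanted),
    };
    let udp = chain.take(8)?;
    Ok(Parsed { ether, ip, udp })
}
```

Splitting a packet as `(ether) -> (ip, udp)` or `(ether) -> (ip) -> (udp)` parses fine under this model, while a split in the middle of the Ethernet header fails with `TooSmall`.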

> Pretty much all driver buffer sizes are going to be sized in a way that I don't think we're going to trip over this given a reasonable upper bound on header size (e.g. 256 bytes). I think the current constraint is probably something we can reasonably live with, with the following two constraints:
>
> 1. It must not cause a safety issue if it is violated. I believe this is already the case.
> 2. We need counters and a way to be notified or make this visible if it does happen so that this isn't just silent dropping.

1 is the case today, since everything is length-checked. On 2, we do have visibility via OPTE's bad-packet SDT, but the error in case of a straddle is a little cryptic (ParseError::TooSmall -- we can clarify this one). We don't have any KStats outside of ports today, but I think adding one wouldn't be too hard.

EDIT: This is now its own error class as of 1cd3484.
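
For the counter side of point 2, one possible shape is a set of atomic counters keyed by error class, which a kstat could later read. This is a sketch under assumptions: `ParseStats` is hypothetical, and `StraddledHeader` is an assumed name for the new error class; it is not wired to any real KStat interface.

```rust
use core::sync::atomic::{AtomicU64, Ordering};

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ParseError {
    TooSmall,
    Unwanted,
    /// Assumed name for the dedicated "header straddles buffers" class.
    StraddledHeader,
}

/// Hypothetical per-port parse failure counters.
#[derive(Default)]
pub struct ParseStats {
    too_small: AtomicU64,
    unwanted: AtomicU64,
    straddled: AtomicU64,
}

impl ParseStats {
    /// Bump the counter for one failed parse. Relaxed ordering is enough
    /// because each counter is independent and only ever read as a statistic.
    pub fn record(&self, err: ParseError) {
        let ctr = match err {
            ParseError::TooSmall => &self.too_small,
            ParseError::Unwanted => &self.unwanted,
            ParseError::StraddledHeader => &self.straddled,
        };
        ctr.fetch_add(1, Ordering::Relaxed);
    }

    pub fn straddled(&self) -> u64 {
        self.straddled.load(Ordering::Relaxed)
    }

    pub fn total(&self) -> u64 {
        self.too_small.load(Ordering::Relaxed)
            + self.unwanted.load(Ordering::Relaxed)
            + self.straddled.load(Ordering::Relaxed)
    }
}
```

Keeping the straddle count separate from `TooSmall` is what would make a violated buffer-layout assumption visible instead of blending into ordinary short-packet drops.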

ingot/src/ip.rs

No perf cost, woohoo. Should be able to wire this into a KStat...

FelixMcFelix · 2024-11-19T17:40:27Z

From out of band conversations, we're in a good place with regard to stress testing (lengthy iperf sessions, fuzzing via OPTE's test suite), and I think the best move forward is to get these bits onto dogfood to surface any further issues from there. From review in OPTE, we also have the right level of baseline support in Ingot to safely handle body transforms / packet pullups in a way we hadn't accounted for before. There are still some feature gaps around e.g. validity checks that it would be nice to tie to the packet definition itself, but I'll fill those in as issues shortly.

This PR rewrites the core of OPTE's packet model to use zero-copy packet parsing/modification via the [`ingot`](oxidecomputer/ingot#1) library. This enables a few changes which get us just shy of the 3Gbps mark.

* **[2.36 -> 2.7]** The use of ingot for modifying packets in both the slowpath (UFT miss) and existing fastpath (UFT hit).
  * Parsing is faster -- we no longer copy out all packet header bytes onto the stack, and we do not allocate a vector to decompose an `mblk_t` into individual links.
  * Packet field modifications are applied directly to the `mblk_t` as they happen, and field reads are made from the same source.
  * Non-encap layers are not copied back out.
* **[2.7 -> 2.75]** Packet L4 hashes are cached as part of the UFT, speeding up multipath route selection over the underlay.
* **[2.75 -> 2.8]** Incremental Internet checksum recalculation is only performed when applicable fields change on inner flow headers (e.g., NAT'd packets).
  * VM-to-VM / intra-VPC traffic is the main use case here.
* **[2.8 -> 3.05]** `NetworkParser`s now have the concept of inbound & outbound `LightweightMeta` formats. These support the key operations needed to execute all our UFT flows today (`FlowId` lookup, inner headers modification, encap push/pop, cksum update).
  * This also allows us to pre-serialize any bytes to be pushed in front of a packet, speeding up `EmitSpec`.
  * This is crucial for outbound traffic in particular, which has far smaller (in `struct`-size) metadata.
  * UFT misses or incompatible flows fall back to using the full metadata.
* **[3.05 -> 2.95]** TCP state tracking uses a separate per-flow lock and does not require any lookup from a UFT.
  * I do not have numbers on how large the performance loss would be if we held the `Port` lock for the whole time.
* (Not measured) Packet/UFT L4 Hashes are used as the Geneve source port, spreading inbound packets over NIC Rx queues based on the inner flow.
  * This is now possible because of #504 -- software classification would have limited us to the default inbound queue/group.
  * I feel bad for removing one more FF7 reference, but that is the way of these things. RIP port `7777`.
  * Previously, Rx queue affinity was derived solely from `(Src Sled, Dst Sled)`.

There are several other changes here made to how OPTE functions which are needed to support the zero-copy model.

* Each collected set of header transforms is `Arc<>`'d, such that we can apply them outside of the `Port` lock.
* `FlowTable<S>`s now store `Arc<FlowEntry<S>>`, rather than `FlowEntry<S>`.
  * This enables the UFT entry for any flow to store its matched TCP flow, update its hit count and timestamp, and then update the TCP state without reacquiring the `Port` lock.
  * This also drastically simplifies TCP state handling in fast path cases to not rely on post-transformation packets for lookup.
* `Opte::process` returns an `EmitSpec` which is needed to finalise a packet before it can be used.
  * I'm not too happy about the ergonomics, but we have this problem because otherwise we'd need `Packet` to have some self-referential fields when supporting other key parts of XDE (e.g., parse -> use fields -> select port -> process).

Closes #571, closes #481, closes #460. Slightly alleviates #435.
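
For readers unfamiliar with the incremental Internet checksum recalculation mentioned above, the standard technique (RFC 1624, equation 3) updates a checksum from only the old and new value of a changed 16-bit word: HC' = ~(~HC + ~m + m'). The sketch below shows that arithmetic in isolation; the function names are hypothetical and this is not OPTE's implementation.

```rust
/// Fold a 32-bit accumulator into 16 bits with end-around carry.
fn fold(mut sum: u32) -> u16 {
    while sum > 0xffff {
        sum = (sum & 0xffff) + (sum >> 16);
    }
    sum as u16
}

/// Full Internet checksum over big-endian 16-bit words.
/// An odd trailing byte is padded with zero.
pub fn checksum(data: &[u8]) -> u16 {
    let sum: u32 = data
        .chunks(2)
        .map(|c| {
            let hi = c[0] as u32;
            let lo = c.get(1).copied().unwrap_or(0) as u32;
            (hi << 8) | lo
        })
        .sum();
    !fold(sum)
}

/// Incremental update per RFC 1624 eqn. 3: HC' = ~(~HC + ~m + m'),
/// where `m` is the old word and `m'` its replacement.
pub fn update(old_csum: u16, old_word: u16, new_word: u16) -> u16 {
    let sum = (!old_csum) as u32 + (!old_word) as u32 + new_word as u32;
    !fold(sum)
}
```

For example, over the words `0x4500 0x001c 0xc0a8 0x0001` the full checksum is `0xfa39`; rewriting `0xc0a8` to `0x0a00` and calling `update(0xfa39, 0xc0a8, 0x0a00)` gives `0xb0e2`, the same value a full recomputation produces. The cost depends only on the number of changed words, not on the packet length.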

FelixMcFelix added 30 commits September 25, 2024 12:02

Initial testing.

9d9e797

Boy this is all getting a bit P4.

cbff9da

Getting somewhere on the procmacro front.

8b1a6ee

Some renaming.

6437dae

Use libpnet to help prototype some stuff...

df543ec

Now: ingot.

f59d2ad

Enough of a framework to start filling out, methinks.

270642f

Sketch out initial parse trait hierarchy

da04a0c

Iteration on working parse chains

396cbc7

Macro for choice layers in a parser. Cleanup.

fbec6e4

Cleanup.

75dd2cd

Better setup for codegen...

f5c4e1c

Parse chain generation that works with packet types (somewhat)

cb300d7

So horrendously messy, but working for aligned fields

828752c

Break macros up somewhat

cce3513

Ref methods only on non-transformed types

22ccb76

Should work for, e.g., [u8; 16]/

Muddling progress.

d4284c2

Rethink and test the BE cases.

f792c7c

BE field setting (and end-aligned)

7c6fbd2

Hacking away. In a good enough place to work on varlen fields.

eb7bb31

Put them together: one or two things are misnamed.

ac54ec4

Mixed up some names.

5164642

WIP: begin refactor of field types

8336e6f

Better backend representation for varwidth/parsables.

77834a5

Still need to make the swap and convert all existing generators, but it seems like we're doing the right thing when it comes to generics etc.

Header length checks/state incl var width and parsables

7d47964

WIP: further refactor progress.

af358b9

Length fns for varwidth fields.

6fea2a6

Move impl of bitfield setters to gen'd FixedWidth blocks

3259263

Some generics to sort out, but getting there...

0fcb69c

What an irksome refactor!

Well, it compiles now.

8547d0d

FelixMcFelix added 2 commits October 8, 2024 13:42

Hmm.

96bd0be

Help OPTE/DError along a bit.

22ab5b1

Self review, more rigorous impl of MINIMUM_LENGTH and EmitUnsafe

2d2f611

FelixMcFelix marked this pull request as ready for review October 8, 2024 17:49

Clearer test case names.

19be52b

FelixMcFelix requested review from zeeshanlakhani, luqmana and pfmooney October 8, 2024 18:04

Ergonomics: support emit on (&a, &b, &c)

0cfd03c

zeeshanlakhani reviewed Oct 9, 2024

ingot-types/src/packet.rs (outdated, resolved)

zeeshanlakhani reviewed Oct 9, 2024

ingot-types/src/packet.rs (outdated, resolved)

zeeshanlakhani reviewed Oct 9, 2024

ingot/src/tests.rs (outdated, resolved)

FelixMcFelix added 3 commits October 10, 2024 10:47

Review feedback -- heapless & clippy-allow

857ec5c

Feedback: Packet->Header/BoxedHeader/InlineHeader

7d6e29a

Review feedback: break out example_chains to crate

8cdf5c2

Accordingly, we need to move our benchmarks out to go with it. I've taken this chance to throw out a few more focussing on different packet parse types.

FelixMcFelix mentioned this pull request Oct 20, 2024

Datapath overhaul: zero-copy metadata with ingot, 'Compiled' UFTs oxidecomputer/opte#585

Merged

Allow access to inner buf on Repeated.

3b38859

pfmooney reviewed Oct 31, 2024

FelixMcFelix added 6 commits October 31, 2024 10:48

Review feedback.

1746dbd

MPL-ify.

3fd9ca1

Remove NotAnEh.

0b961bc

Attempt to clarify position on split-buffers.

5e19bd2

Explicitly signal straddled packet errors on parse_read.

1cd3484

No perf cost, woohoo. Should be able to wire this into a KStat...

Move some things around to support mblk_t pullup

a45c21c

FelixMcFelix merged commit d4667db into main Nov 19, 2024
5 checks passed

FelixMcFelix deleted the prototype branch November 19, 2024 17:41

FelixMcFelix commented Oct 4, 2024 (edited)

FelixMcFelix commented Oct 8, 2024

zeeshanlakhani Oct 9, 2024

FelixMcFelix Oct 9, 2024

pfmooney left a comment

pfmooney Oct 31, 2024

FelixMcFelix Oct 31, 2024 (edited)

pfmooney Oct 31, 2024

FelixMcFelix Oct 31, 2024

rmustacc Nov 1, 2024 (edited)

FelixMcFelix Nov 2, 2024 (edited)

FelixMcFelix commented Nov 19, 2024
