Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Initial version #1

Merged
merged 138 commits into from
Nov 19, 2024
Merged

Initial version #1

merged 138 commits into from
Nov 19, 2024

Conversation

FelixMcFelix
Copy link
Collaborator

@FelixMcFelix FelixMcFelix commented Oct 4, 2024

This PR contains the first implementation of this library, which is designed as a faster (zero-copy when possible) parsing and metadata backend for OPTE.

At a high level, this includes:

  • A proc macro which generates, for an owned packet header definition:
    • a packed version of that header, and methods to parse these headers from buffers.
    • traits to read and write fields consistently, regardless of whether a packet is owned/borrowed.
  • A proc macro which generates choices between several headers, conditioned on a common 'choice' type.
  • A proc macro which generates a parser for a chain of headers (with optional control actions insterspersed).
    • This is designed to account mainly for platforms which store packets as chains of buffers, and do not straddle individual headers across these buffers (i.e., illumos).
  • Where possible, additional traits to emit packets into byte buffers and convert to/from owned versions.

Testing notes

  • Macros (choice, Ingot, Parse) and base types (ingot_types) should be quite thoroughly documented on their intended use, at least via cargo +nightly doc.
    • Individual packets and newtypes for actual protocols, less so. There isn't yet logic to forward docattrs or build up docstrings for generated fields/methods/items.
  • cargo +nightly miri test is an essential part of ensuring that all operations here are sane -- my understanding is that the test suite exercises all elements of packet parsing, field access/setting, and packet emit that use unsafe.
  • cargo expand -p ingot > full.rs is my main sledgehammer for verifying that generated methods are behaving as expected. This is probably most useful for examining how, e.g., packet parsing and bitfields are implemented.
  • cargo bench includes some ballparks for certain operations: individual header parse, full packet parse in a single buffer/across multiple buffers, full packet parse returning purely borrowed vs. hybrid contents.

Rough TODOs:

  • Use a separate Parse Error type on full packets, including the name of the affected layer as a &'static str and/or &'static CStr.
  • Fully document macros.
  • Forward doc-attrs to generated structs.

Should work for, e.g., [u8; 16]/
Still need to make the swap and convert all existing generators, but it
seems like we're doing the right thing when it comes to generics etc.
@FelixMcFelix
Copy link
Collaborator Author

Sanity check within OPTE for the recent error handling changes:

kyle@farme:~/gits/opte/dtrace$ pfexec ./opte-trace opte-bad-packet.d
PORT         DIR MBLK               MSG+DATA

unknown      IN  0xfffffe69f9c1fa20 Parse->IngotError->outer_encap->TooSmall[0, 0]
unknown      IN  0xfffffe69e52a9a00 Parse->IngotError->outer_v6->Unwanted[0, 0]
unknown      IN  0xfffffe69f9da2160 Parse->IngotError->outer_v6->Unwanted[0, 0]
unknown      IN  0xfffffe69e2e871a0 Parse->IngotError->outer_udp->Unwanted[0, 0]
unknown      IN  0xfffffe69f9d42e40 Parse->IngotError->outer_udp->Unwanted[0, 0]
unknown      IN  0xfffffe69fa0c4e80 Parse->IngotError->outer_udp->Unwanted[0, 0]
unknown      IN  0xfffffe69e64bae80 Parse->IngotError->outer_udp->Unwanted[0, 0]
unknown      IN  0xfffffe69fe1ece00 Parse->IngotError->outer_encap->TooSmall[0, 0]

I think this is in a good spot now. Quick round of self review then I shall loop folks in while I push on getting the OPTE side cleaned up.

@FelixMcFelix FelixMcFelix marked this pull request as ready for review October 8, 2024 17:49
ingot/src/example_chain.rs Outdated Show resolved Hide resolved
const MINIMUM_LENGTH: usize = T::MINIMUM_LENGTH;

#[inline]
fn packet_length(&self) -> usize {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

General Q: Is packet length automatically set when I extend a V6 packet, for example, with an additional extension header?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is, yep. Header::packet_length is derived to include the current wire lengths of all variable-length and parsed fields:

    impl<V: ::ingot::types::ByteSlice> ::ingot::types::Header for ValidIpv6<V> {
        const MINIMUM_LENGTH: usize = 40usize;
        #[inline]
        fn packet_length(&self) -> usize {
            Self::MINIMUM_LENGTH + self.1.packet_length()
        }
    }
    impl ::ingot::types::Header for Ipv6 {
        const MINIMUM_LENGTH: usize = 40usize;
        #[inline]
        fn packet_length(&self) -> usize {
            Self::MINIMUM_LENGTH + self.v6ext.packet_length()
        }
    }

In the borrowed case, this just returns the length of the underlying buffer for said options. In the owned case (for Repeated anyhow), it sums up the packet_length of all elements in the inner vec.

ingot/src/tests.rs Outdated Show resolved Hide resolved
Accordingly, we need to move our benchmarks out to go with it. I've
taken this chance to throw out a few more focussing on different packet
parse types.
Copy link

@pfmooney pfmooney left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still working my way through this, but here's an initial set of comments/questions.

Cargo.toml Outdated Show resolved Hide resolved
ingot/Cargo.toml Outdated Show resolved Hide resolved
ingot/src/lib.rs Outdated Show resolved Hide resolved
Comment on lines +44 to +46
let (r, _): (Ref<&[u8], T>, _) =
Ref::from_prefix(buf.as_bytes())
.map_err(|_| ParseError::TooSmall)?;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I'm reading this (and other bits) correctly, the expectations of the parsing logic is that the buffer from which the item (header, or component) is being parsed must be large enough to contain the entire thing? This would be at odds with the statement in the readme:

Ingot allows packet parsing over split buffers

Am I missing how a packet with its headers split across several mblk_ts would be parsed?

Copy link
Collaborator Author

@FelixMcFelix FelixMcFelix Oct 31, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, individual headers do still need to reside in one contiguous buffer. Packets (#[derive(Parse)]) can contain headers in different buffers, however, using the Read trait and the generated parse_read methods. It could be possible in future to extend that functionality all the way down for extension headers or convert a straddled read to the owned version of a header, but we don't quite need that today.

This is the same model we have in OPTE -- a header can't straddle several mblk_ts, but zones do send (ether, ip, icmp), (ether) -> (ip, icmp) and similar permutations so we need to handle those.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. I think this constraint on operation should be called out pretty clearly. While I'm not saying that Ingot and OPTE need to handle parsing of individual items across mblk_t boundaries immediately, that is something that many other things in mac do not take for granted today. (mac_ether_offload_info() does not even assume it can read a u16 without crossing an mblk_t boundary)

Yes, it's true that zones are well behaved, and viona transmissions should be too. I have not yet looked into cxgbe to see if the buffer sizes which are chosen there will guarantee that an incoming packet does not violate the restrictions set forth here.

Have you asked @rmustacc about his position on this?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. I think this constraint on operation should be called out pretty clearly.

Agreed. I wasn't aware that illumos pessimised so heavily on this front.

Have you asked @rmustacc about his position on this?

Not yet, I'll reach out.

Copy link

@rmustacc rmustacc Nov 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing I'd like to confirm is that my understanding is that say the entire IP or TCP header (including any extensions) must be in a single buffer. Each distinct header can be in a different single buffer. Multiple headers may be in same single contiguous buffer. Is that correct?

Ok. I think this constraint on operation should be called out pretty clearly.

Agreed. I wasn't aware that illumos pessimised so heavily on this front.

I was pessimistic in mac_ether_offload_info() that I didn't want to think about single mblk and calling msgpullup which is what a lot of folks did. That said, I think I am probably more pessimistic then required. I wrote that function from the most paranoid position I could be in. Basically I'd rather do what I did than ever call msgpullup(). I know that @pfmooney has told me many times how we could be much smarter in that code.

Have you asked @rmustacc about his position on this?

In general the way that most device drivers and hardware works is that they'll basically fill data into descriptors linearly. That is, if they get given mixed size buffers they'll fill them that way (though most hardware has constraint on buffer sizes). The only feature that some hardware has (but we don't use) is that such hardware allows for splitting the header and data into different descriptors.

I think there are a lot of places that do end up doing different checks here. But effectively the whole IP fast path relies on the headers being contiguous in the first mblk and other things through the stack.

Pretty much all driver buffer sizes are going to be sized in a way that I don't think we're going to trip over this given a reasonable upper bound on header size (e.g. 256 bytes). I think the current constraint is probably something we can reasonably live with, with the following two constraints:

  1. It must not cause a safety issue if it is violated. I believe this is already the case.
  2. We need counters and a way to be notified or make this visible if it does happen so that this isn't just silent dropping.

Copy link
Collaborator Author

@FelixMcFelix FelixMcFelix Nov 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing I'd like to confirm is that my understanding is that say the entire IP or TCP header (including any extensions) must be in a single buffer. Each distinct header can be in a different single buffer. Multiple headers may be in same single contiguous buffer. Is that correct?

That is correct. The Packet macro is basically saying, for a set of headers:

  • Each header is backed by a contiguous byteslice (including its extensions/options).
  • If there are still bytes in the current segment, then try to parse the next header from that remainder (size permitting). If we fail, then we bail on the whole packet.
  • If the current segment is empty, then access the next and keep parsing.

So headers can all be in the same buffer, or all different buffers, or a middleground in the same b_cont chain. They just can't straddle, or you won't have enough bytes left to parse out a whole header -- we're not doing any pullup.

Pretty much all driver buffer sizes are going to be sized in a way that I don't think we're going to trip over this given a reasonable upper bound on header size (e.g. 256 bytes). I think the current constraint is probably something we can reasonably live with, with the following two constraints:

1. It must not cause a safety issue if it is violated. I believe this is already the case.

2. We need counters and a way to be notified or make this visible if it does happen so that this isn't just silent dropping.

1 is the case today, since everything is length-checked. On 2, we do have visibility via OPTE's bad-packet SDT, but the error in case of a straddle is a little cryptic (ParseError::TooSmall -- we can clarify this one). We don't have any KStats outside of ports today, but I think adding one wouldn't be too hard.

EDIT: This is now its own error class as of 1cd3484.

ingot/src/ip.rs Outdated Show resolved Hide resolved
@FelixMcFelix
Copy link
Collaborator Author

From out of band conversations, we're in a good place with regard to stress testing (lengthy iperf sessions, fuzzing via OPTE's test suite), and I think the best move forward is to get these bits onto dogfood to surface any further issues from there. From review in OPTE, we also have the right level of baseline support in Ingot to safely handle body transforms / packet pullups in a way we hadn't accounted for before. There are still some feature gaps around e.g. validity checks that it would be nice to tie to the packet definition itself, but I'll fill those in as issues shortly.

@FelixMcFelix FelixMcFelix merged commit d4667db into main Nov 19, 2024
5 checks passed
@FelixMcFelix FelixMcFelix deleted the prototype branch November 19, 2024 17:41
FelixMcFelix added a commit to oxidecomputer/opte that referenced this pull request Nov 19, 2024
This PR rewrites the core of OPTE's packet model to use zero-copy packet
parsing/modification via the
[`ingot`](oxidecomputer/ingot#1) library. This
enables a few changes which get us just shy of the 3Gbps mark.

* **[2.36 -> 2.7]** The use of ingot for modifying packets in both the
slowpath (UFT miss) and existing fastpath (UFT hit).
* Parsing is faster -- we no longer copy out all packet header bytes
onto the stack, and we do not allocate a vector to decompose an `mblk_t`
into individual links.
* Packet field modifications are applied directly to the `mblk_t` as
they happen, and field reads are made from the same source.
  * Non-encap layers are not copied back out.
* **[2.7 -> 2.75]** Packet L4 hashes are cached as part of the UFT,
speeding up multipath route selection over the underlay.
* **[2.75 -> 2.8]** Incremental Internet checksum recalculation is only
performed when applicable fields change on inner flow headers (e.g.,
NAT'd packets).
  * VM-to-VM / intra-VPC traffic is the main use case here.
* **[2.8 -> 3.05]** `NetworkParser`s now have the concept of inbound &
outbound `LightweightMeta` formats. These support the key operations
needed to execute all our UFT flows today (`FlowId` lookup, inner
headers modification, encap push/pop, cksum update).
* This also allows us to pre-serialize any bytes to be pushed in front
of a packet, speeding up `EmitSpec`.
* This is crucial for outbound traffic in particular, which has far
smaller (in `struct`-size) metadata.
* UFT misses or incompatible flows fallback to using the full metadata.
* **[3.05 -> 2.95]** TCP state tracking uses a separate per-flow lock
and does not require any lookup from a UFT.
* I do not have numbers on how large the performance loss would be if we
held the `Port` lock for the whole time.
* (Not measured) Packet/UFT L4 Hashes are used as the Geneve source
port, spreading inbound packets over NIC Rx queues based on the inner
flow.
* This is now possible because of #504 -- software classification would
have limited us to the default inbound queue/group.
* I feel bad for removing one more FF7 reference, but that is the way of
these things. RIP port `7777`.
* Previously, Rx queue affinity was derived solely from `(Src Sled, Dst
Sled)`.

There are several other changes here made to how OPTE functions which
are needed to support the zero-copy model.

* Each collected set of header transforms are `Arc<>`'d, such that we
can apply them outside of the `Port` lock.
* `FlowTable<S>`s now store `Arc<FlowEntry<S>>`, rather than
`FlowEntry<S>`.
* This enables the UFT entry for any flow to store its matched TCP flow,
update its hit count and timestamp, and then update the TCP state
without reacquring the `Port` lock.
* This also drastically simplifies TCP state handling in fast path cases
to not rely on post-transformation packets for lookup.
* `Opte::process` returns an `EmitSpec` which is needed to finalise a
packet before it can be used.
* I'm not too happy about the ergonomics, but we have this problem
because otherwise we'd need `Packet` to have some self-referential
fields when supporting other key parts of XDE (e.g., parse -> use fields
-> select port -> process).

Closes #571, closes #481, closes #460.

Slightly alleviates #435.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants