Initial version #1
Conversation
Should work for, e.g., `[u8; 16]`.
Still need to make the swap and convert all existing generators, but it seems like we're doing the right thing when it comes to generics etc.
What an irksome refactor!
Sanity check within OPTE for the recent error handling changes:
I think this is in a good spot now. Quick round of self review, then I shall loop folks in while I push on getting the OPTE side cleaned up.
```rust
const MINIMUM_LENGTH: usize = T::MINIMUM_LENGTH;

#[inline]
fn packet_length(&self) -> usize {
```
General Q: Is packet length automatically set when I extend a V6 packet, for example, with an additional extension header?
It is, yep. `Header::packet_length` is derived to include the current wire lengths of all variable-length and parsed fields:
```rust
impl<V: ::ingot::types::ByteSlice> ::ingot::types::Header for ValidIpv6<V> {
    const MINIMUM_LENGTH: usize = 40usize;
    #[inline]
    fn packet_length(&self) -> usize {
        Self::MINIMUM_LENGTH + self.1.packet_length()
    }
}

impl ::ingot::types::Header for Ipv6 {
    const MINIMUM_LENGTH: usize = 40usize;
    #[inline]
    fn packet_length(&self) -> usize {
        Self::MINIMUM_LENGTH + self.v6ext.packet_length()
    }
}
```
In the borrowed case, this just returns the length of the underlying buffer for said options. In the owned case (for `Repeated`, anyhow), it sums up the `packet_length` of all elements in the inner vec.
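In spirit, that owned-case summation looks something like this (hypothetical, pared-down stand-ins for illustration -- not ingot's actual `Header`/`Repeated` definitions):

```rust
// Hypothetical, pared-down stand-ins; ingot's real types carry more machinery.
trait Header {
    const MINIMUM_LENGTH: usize;
    fn packet_length(&self) -> usize;
}

// Owned form of a repeated, variable-count field (e.g. v6 extension headers).
struct Repeated<T>(Vec<T>);

impl<T: Header> Header for Repeated<T> {
    // A repeated field may legitimately be empty on the wire.
    const MINIMUM_LENGTH: usize = 0;

    #[inline]
    fn packet_length(&self) -> usize {
        // Sum the current wire length of every element in the inner vec.
        self.0.iter().map(|el| el.packet_length()).sum()
    }
}
```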
Accordingly, we need to move our benchmarks out to go with it. I've taken this chance to throw out a few more focussing on different packet parse types.
I'm still working my way through this, but here's an initial set of comments/questions.
```rust
let (r, _): (Ref<&[u8], T>, _) =
    Ref::from_prefix(buf.as_bytes())
        .map_err(|_| ParseError::TooSmall)?;
```
If I'm reading this (and other bits) correctly, the expectation of the parsing logic is that the buffer from which the item (header, or component) is being parsed must be large enough to contain the entire thing? This would be at odds with the statement in the readme:

> Ingot allows packet parsing over split buffers

Am I missing how a packet with its headers split across several `mblk_t`s would be parsed?
Yes, individual headers do still need to reside in one contiguous buffer. Packets (`#[derive(Parse)]`) can contain headers in different buffers, however, using the `Read` trait and the generated `parse_read` methods. It could be possible in future to extend that functionality all the way down for extension headers, or to convert a straddled read to the owned version of a header, but we don't quite need that today.
This is the same model we have in OPTE -- a header can't straddle several `mblk_t`s, but zones do send `(ether, ip, icmp)`, `(ether) -> (ip, icmp)`, and similar permutations, so we need to handle those.
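The per-header contiguity requirement falls out of the zerocopy prefix-cast shown in the diff above: the whole header must fit in the front of a single contiguous buffer. A minimal sketch of that behaviour, assuming zerocopy 0.8 with its derive feature enabled (`Ipv4Stub` is a made-up stand-in type, not part of ingot):

```rust
use zerocopy::{FromBytes, Immutable, KnownLayout, Ref, Unaligned};

// Made-up, truncated fixed-size header purely for illustration.
#[derive(FromBytes, KnownLayout, Immutable, Unaligned)]
#[repr(C)]
struct Ipv4Stub {
    ver_ihl: u8,
    dscp_ecn: u8,
    total_len: [u8; 2],
}

fn main() {
    let big = [0u8; 64];
    let small = [0u8; 2];

    // Succeeds: the buffer's prefix is large enough to hold the whole struct.
    assert!(Ref::<_, Ipv4Stub>::from_prefix(&big[..]).is_ok());

    // Fails: the struct would run past the end of this buffer, i.e. the
    // "too small" / straddle case discussed here.
    assert!(Ref::<_, Ipv4Stub>::from_prefix(&small[..]).is_err());
}
```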
Ok. I think this constraint on operation should be called out pretty clearly. While I'm not saying that Ingot and OPTE need to handle parsing of individual items across `mblk_t` boundaries immediately, that is something that many other things in `mac` do not take for granted today. (`mac_ether_offload_info()` does not even assume it can read a u16 without crossing an `mblk_t` boundary.)
Yes, it's true that zones are well behaved, and viona transmissions should be too. I have not yet looked into `cxgbe` to see if the buffer sizes which are chosen there will guarantee that an incoming packet does not violate the restrictions set forth here.
Have you asked @rmustacc about his position on this?
> Ok. I think this constraint on operation should be called out pretty clearly.
Agreed. I wasn't aware that illumos pessimised so heavily on this front.
> Have you asked @rmustacc about his position on this?
Not yet, I'll reach out.
One thing I'd like to confirm: my understanding is that, say, the entire IP or TCP header (including any extensions) must be in a single buffer. Each distinct header can be in a different single buffer. Multiple headers may be in the same single contiguous buffer. Is that correct?
> Ok. I think this constraint on operation should be called out pretty clearly.
>
> Agreed. I wasn't aware that illumos pessimised so heavily on this front.
I was pessimistic in `mac_ether_offload_info()` in that I didn't want to assume a single mblk or call msgpullup, which is what a lot of folks did. That said, I think I am probably more pessimistic than required. I wrote that function from the most paranoid position I could be in. Basically I'd rather do what I did than ever call msgpullup(). I know that @pfmooney has told me many times how we could be much smarter in that code.
> Have you asked @rmustacc about his position on this?
In general, the way that most device drivers and hardware work is that they'll basically fill data into descriptors linearly. That is, if they are given mixed-size buffers they'll fill them that way (though most hardware has constraints on buffer sizes). The only feature that some hardware has (but we don't use) is that such hardware allows for splitting the header and data into different descriptors.
I think there are a lot of places that do end up doing different checks here. But effectively the whole IP fast path relies on the headers being contiguous in the first mblk and other things through the stack.
Pretty much all driver buffer sizes are going to be sized in a way that I don't think we're going to trip over this given a reasonable upper bound on header size (e.g. 256 bytes). I think the current constraint is probably something we can reasonably live with, with the following two constraints:
- It must not cause a safety issue if it is violated. I believe this is already the case.
- We need counters and a way to be notified or make this visible if it does happen so that this isn't just silent dropping.
> One thing I'd like to confirm: my understanding is that, say, the entire IP or TCP header (including any extensions) must be in a single buffer. Each distinct header can be in a different single buffer. Multiple headers may be in the same single contiguous buffer. Is that correct?
That is correct. The Packet macro is basically saying, for a set of headers:
- Each header is backed by a contiguous byteslice (including its extensions/options).
- If there are still bytes in the current segment, then try to parse the next header from that remainder (size permitting). If we fail, then we bail on the whole packet.
- If the current segment is empty, then access the next and keep parsing.
So headers can all be in the same buffer, or all different buffers, or a middleground in the same `b_cont` chain. They just can't straddle, or you won't have enough bytes left to parse out a whole header -- we're not doing any pullup.
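To make that concrete, here's a minimal sketch of the segment-walking rule (illustrative stand-in names and fixed header lengths only -- not ingot's actual parsing code):

```rust
// Hypothetical illustration of the segment-walking rule above; none of these
// names are ingot's real API, and header lengths are fixed purely for brevity.
#[derive(Debug)]
enum ParseError {
    TooSmall,
}

/// Pull one `len`-byte header from a contiguous slice, returning the header
/// bytes and the remainder of the segment.
fn parse_header(buf: &[u8], len: usize) -> Result<(&[u8], &[u8]), ParseError> {
    if buf.len() < len {
        // The header would straddle into the next segment: reject the packet.
        return Err(ParseError::TooSmall);
    }
    Ok(buf.split_at(len))
}

/// Walk a chain of segments, parsing each header from whichever segment the
/// cursor currently sits in. Headers may share a segment, but never straddle.
fn parse_chain<'a>(
    mut segments: impl Iterator<Item = &'a [u8]>,
    header_lens: &[usize],
) -> Result<(), ParseError> {
    let mut cursor: &'a [u8] = &[];
    for &len in header_lens {
        if cursor.is_empty() {
            // Current segment exhausted: move on to the next and keep parsing.
            cursor = segments.next().ok_or(ParseError::TooSmall)?;
        }
        let (_header, rest) = parse_header(cursor, len)?;
        cursor = rest;
    }
    Ok(())
}

fn main() {
    // (ether) -> (ip, icmp): a 14-byte Ethernet header in one segment, then a
    // 20-byte IPv4 header and an 8-byte ICMP header sharing the next segment.
    let seg_a = [0u8; 14];
    let seg_b = [0u8; 28];
    assert!(parse_chain([&seg_a[..], &seg_b[..]].into_iter(), &[14, 20, 8]).is_ok());

    // A header split across two segments fails: there is no pullup.
    let seg_c = [0u8; 10];
    let seg_d = [0u8; 32];
    assert!(parse_chain([&seg_c[..], &seg_d[..]].into_iter(), &[14, 20, 8]).is_err());
}
```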
> Pretty much all driver buffer sizes are going to be sized in a way that I don't think we're going to trip over this given a reasonable upper bound on header size (e.g. 256 bytes). I think the current constraint is probably something we can reasonably live with, with the following two constraints:
> 1. It must not cause a safety issue if it is violated. I believe this is already the case.
> 2. We need counters and a way to be notified or make this visible if it does happen so that this isn't just silent dropping.
1 is the case today, since everything is length-checked. On 2, we do have visibility via OPTE's `bad-packet` SDT, but the error in case of a straddle is a little cryptic (`ParseError::TooSmall` -- we can clarify this one). We don't have any KStats outside of ports today, but I think adding one wouldn't be too hard.
EDIT: This is now its own error class as of 1cd3484.
No perf cost, woohoo. Should be able to wire this into a KStat...
From out of band conversations, we're in a good place with regard to stress testing (lengthy iperf sessions, fuzzing via OPTE's test suite), and I think the best move forward is to get these bits onto dogfood to surface any further issues from there. From review in OPTE, we also have the right level of baseline support in Ingot to safely handle body transforms / packet pullups in a way we hadn't accounted for before. There are still some feature gaps around e.g. validity checks that it would be nice to tie to the packet definition itself, but I'll fill those in as issues shortly.
This PR rewrites the core of OPTE's packet model to use zero-copy packet parsing/modification via the [`ingot`](oxidecomputer/ingot#1) library. This enables a few changes which get us just shy of the 3Gbps mark.

* **[2.36 -> 2.7]** The use of ingot for modifying packets in both the slowpath (UFT miss) and existing fastpath (UFT hit).
  * Parsing is faster -- we no longer copy out all packet header bytes onto the stack, and we do not allocate a vector to decompose an `mblk_t` into individual links.
  * Packet field modifications are applied directly to the `mblk_t` as they happen, and field reads are made from the same source.
  * Non-encap layers are not copied back out.
* **[2.7 -> 2.75]** Packet L4 hashes are cached as part of the UFT, speeding up multipath route selection over the underlay.
* **[2.75 -> 2.8]** Incremental Internet checksum recalculation is only performed when applicable fields change on inner flow headers (e.g., NAT'd packets).
  * VM-to-VM / intra-VPC traffic is the main use case here.
* **[2.8 -> 3.05]** `NetworkParser`s now have the concept of inbound & outbound `LightweightMeta` formats. These support the key operations needed to execute all our UFT flows today (`FlowId` lookup, inner headers modification, encap push/pop, cksum update).
  * This also allows us to pre-serialize any bytes to be pushed in front of a packet, speeding up `EmitSpec`.
  * This is crucial for outbound traffic in particular, which has far smaller (in `struct`-size) metadata.
  * UFT misses or incompatible flows fall back to using the full metadata.
* **[3.05 -> 2.95]** TCP state tracking uses a separate per-flow lock and does not require any lookup from a UFT.
  * I do not have numbers on how large the performance loss would be if we held the `Port` lock for the whole time.
* (Not measured) Packet/UFT L4 hashes are used as the Geneve source port, spreading inbound packets over NIC Rx queues based on the inner flow.
  * This is now possible because of #504 -- software classification would have limited us to the default inbound queue/group.
  * I feel bad for removing one more FF7 reference, but that is the way of these things. RIP port `7777`.
  * Previously, Rx queue affinity was derived solely from `(Src Sled, Dst Sled)`.

There are several other changes here made to how OPTE functions which are needed to support the zero-copy model.

* Each collected set of header transforms is `Arc<>`'d, such that we can apply them outside of the `Port` lock.
* `FlowTable<S>`s now store `Arc<FlowEntry<S>>`, rather than `FlowEntry<S>`.
  * This enables the UFT entry for any flow to store its matched TCP flow, update its hit count and timestamp, and then update the TCP state without reacquiring the `Port` lock.
  * This also drastically simplifies TCP state handling in fast path cases to not rely on post-transformation packets for lookup.
* `Opte::process` returns an `EmitSpec` which is needed to finalise a packet before it can be used.
  * I'm not too happy about the ergonomics, but we have this problem because otherwise we'd need `Packet` to have some self-referential fields when supporting other key parts of XDE (e.g., parse -> use fields -> select port -> process).

Closes #571, closes #481, closes #460. Slightly alleviates #435.
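As a rough sketch of the `Arc`'d header-transform idea above (hypothetical, heavily simplified types -- not OPTE's actual `Port`/`FlowEntry` definitions), the pattern is:

```rust
use std::sync::{Arc, Mutex};

// Hypothetical, pared-down stand-ins; OPTE's real flow/transform types are
// considerably richer than this.
struct HeaderTransform;

struct FlowEntry {
    transform: Arc<HeaderTransform>,
}

struct Port {
    // The flow table hands out Arc'd entries, so a matched transform can be
    // cloned out and applied after the lock is released.
    flows: Mutex<Vec<Arc<FlowEntry>>>,
}

fn apply(_t: &HeaderTransform, _pkt: &mut [u8]) {
    // Rewrite header fields in place on the packet buffer.
}

fn process(port: &Port, packet: &mut [u8]) {
    // Hold the Port lock only long enough to find the matching entry.
    let entry = {
        let flows = port.flows.lock().unwrap();
        flows.first().cloned()
    };

    // The header transform is applied outside the lock.
    if let Some(entry) = entry {
        apply(&entry.transform, packet);
    }
}
```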
This PR contains the first implementation of this library, which is designed as a faster (zero-copy when possible) parsing and metadata backend for OPTE.
At a high level, this includes:
Testing notes
* The macros (`choice`, `Ingot`, `Parse`) and base types (`ingot_types`) should be quite thoroughly documented on their intended use, at least via `cargo +nightly doc`.
* `cargo +nightly miri test` is an essential part of ensuring that all operations here are sane -- my understanding is that the test suite exercises all elements of packet parsing, field access/setting, and packet emit that use `unsafe`.
* `cargo expand -p ingot > full.rs` is my main sledgehammer for verifying that generated methods are behaving as expected. This is probably most useful for examining how, e.g., packet parsing and bitfields are implemented.
* `cargo bench` includes some ballparks for certain operations: individual header parse, full packet parse in a single buffer/across multiple buffers, full packet parse returning purely borrowed vs. hybrid contents.

Rough TODOs:
* … `&'static str` and/or `&'static CStr`.