
Datapath overhaul: zero-copy metadata with ingot, 'Compiled' UFTs #585

Merged
merged 117 commits into from
Nov 19, 2024

Conversation


@FelixMcFelix FelixMcFelix commented Aug 21, 2024

This PR rewrites the core of OPTE's packet model to use zero-copy packet parsing/modification via the ingot library. This enables a few changes which get us just shy of the 3Gbps mark.

  • [2.36 -> 2.7] The use of ingot for modifying packets in both the slowpath (UFT miss) and existing fastpath (UFT hit).
    • Parsing is faster -- we no longer copy out all packet header bytes onto the stack, and we do not allocate a vector to decompose an mblk_t into individual links.
    • Packet field modifications are applied directly to the mblk_t as they happen, and field reads are made from the same source.
    • Non-encap layers are not copied back out.
  • [2.7 -> 2.75] Packet L4 hashes are cached as part of the UFT, speeding up multipath route selection over the underlay.
  • [2.75 -> 2.8] Incremental Internet checksum recalculation is only performed when applicable fields change on inner flow headers (e.g., NAT'd packets).
    • VM-to-VM / intra-VPC traffic is the main use case here.
  • [2.8 -> 3.05] NetworkParsers now have the concept of inbound & outbound LightweightMeta formats. These support the key operations needed to execute all our UFT flows today (FlowId lookup, inner headers modification, encap push/pop, cksum update).
    • This also allows us to pre-serialize any bytes to be pushed in front of a packet, speeding up EmitSpec.
    • This is crucial for outbound traffic in particular, which has far smaller (in struct-size) metadata.
    • UFT misses or incompatible flows fall back to using the full metadata.
  • [3.05 -> 2.95] TCP state tracking uses a separate per-flow lock and does not require any lookup from a UFT.
    • I do not have numbers on how large the performance loss would be if we held the Port lock for the whole time.
  • (Not measured) Packet/UFT L4 Hashes are used as the Geneve source port, spreading inbound packets over NIC Rx queues based on the inner flow.
    • This is now possible because of "Move underlay NICs back into H/W Classification" (#504) -- software classification would have limited us to the default inbound queue/group.
    • I feel bad for removing one more FF7 reference, but that is the way of these things. RIP port 7777.
    • Previously, Rx queue affinity was derived solely from (Src Sled, Dst Sled).
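The incremental Internet checksum recalculation mentioned above follows RFC 1624's HC' = ~(~HC + ~m + m') identity: when a single header field changes (e.g., a NAT'd address), the checksum is patched rather than recomputed over the whole packet. A minimal standalone sketch (illustrative only, not OPTE's actual API):

```rust
/// Fold a 32-bit accumulator into a 16-bit ones'-complement sum.
fn fold(mut sum: u32) -> u16 {
    while sum > 0xffff {
        sum = (sum & 0xffff) + (sum >> 16);
    }
    sum as u16
}

/// RFC 1624 (eqn. 3): update checksum `hc` when a 16-bit field changes
/// from `old` to `new`, without rescanning the rest of the header/payload.
fn update_csum(hc: u16, old: u16, new: u16) -> u16 {
    let sum = (!hc as u32) + (!old as u32) + (new as u32);
    !fold(sum)
}
```

For a header of words [0x1234, 0xabcd] (checksum 0x41fe), changing 0xabcd to 0xdead yields `update_csum(0x41fe, 0xabcd, 0xdead)` == 0x0f1e, matching a full recompute.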

There are several other changes here made to how OPTE functions which are needed to support the zero-copy model.

  • Each collected set of header transforms is Arc<>'d, such that we can apply them outside of the Port lock.
  • FlowTable<S>s now store Arc<FlowEntry<S>>, rather than FlowEntry<S>.
    • This enables the UFT entry for any flow to store its matched TCP flow, update its hit count and timestamp, and then update the TCP state without reacquiring the Port lock.
    • This also drastically simplifies TCP state handling in fast path cases to not rely on post-transformation packets for lookup.
  • Opte::process returns an EmitSpec which is needed to finalise a packet before it can be used.
    • I'm not too happy about the ergonomics, but we have this problem because otherwise we'd need Packet to have some self-referential fields when supporting other key parts of XDE (e.g., parse -> use fields -> select port -> process).
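The `Arc<FlowEntry<S>>` pattern above can be sketched as follows. Types and names here are invented stand-ins, not OPTE's real definitions; the point is that cloning the Arc under the table's lock lets later per-flow work (hit counts, TCP state) proceed with no further table/Port lock acquisition:

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::{Arc, Mutex};

struct FlowEntry<S> {
    hits: AtomicU64,
    state: Mutex<S>, // separate per-flow lock, e.g. TCP state tracking
}

struct FlowTable<S> {
    map: Mutex<HashMap<u64, Arc<FlowEntry<S>>>>,
}

impl<S> FlowTable<S> {
    /// Briefly take the table lock only to clone out the Arc...
    fn lookup(&self, flow_id: u64) -> Option<Arc<FlowEntry<S>>> {
        self.map.lock().unwrap().get(&flow_id).cloned()
    }
}

/// ...then per-entry updates happen outside the table lock entirely.
fn process<S>(entry: &Arc<FlowEntry<S>>, f: impl FnOnce(&mut S)) {
    entry.hits.fetch_add(1, Ordering::Relaxed);
    f(&mut *entry.state.lock().unwrap());
}
```

Storing `Arc<FlowEntry<S>>` instead of `FlowEntry<S>` is what lets a UFT entry hold its matched TCP flow directly, as described above.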

Closes #571, closes #481, closes #460.

Slightly alleviates #435.

Original testing notes.

This is not exactly a transformative increase, according to testing on glasgow. But it is an increase of around 15--20% zone-to-zone vs #504:

root@a:~# iperf -c 10.0.0.1 -P8
Connecting to host 10.0.0.1, port 5201
[  4] local 10.0.0.2 port 39797 connected to 10.0.0.1 port 5201
[  6] local 10.0.0.2 port 55568 connected to 10.0.0.1 port 5201
[  8] local 10.0.0.2 port 55351 connected to 10.0.0.1 port 5201
[ 10] local 10.0.0.2 port 49474 connected to 10.0.0.1 port 5201
[ 12] local 10.0.0.2 port 61952 connected to 10.0.0.1 port 5201
[ 14] local 10.0.0.2 port 47930 connected to 10.0.0.1 port 5201
[ 16] local 10.0.0.2 port 53057 connected to 10.0.0.1 port 5201
[ 18] local 10.0.0.2 port 63541 connected to 10.0.0.1 port 5201
[ ID] Interval           Transfer     Bandwidth
[  4]   0.00-1.00   sec  38.2 MBytes   320 Mbits/sec
[  6]   0.00-1.00   sec  38.3 MBytes   321 Mbits/sec
[  8]   0.00-1.00   sec  38.3 MBytes   321 Mbits/sec
[ 10]   0.00-1.00   sec  38.0 MBytes   319 Mbits/sec
[ 12]   0.00-1.00   sec  38.0 MBytes   319 Mbits/sec
[ 14]   0.00-1.00   sec  38.0 MBytes   318 Mbits/sec
[ 16]   0.00-1.00   sec  38.1 MBytes   319 Mbits/sec
[ 18]   0.00-1.00   sec  38.0 MBytes   319 Mbits/sec
[SUM]   0.00-1.00   sec   305 MBytes  2.56 Gbits/sec

...

- - - - - - - - - - - - - - - - - - - - - - - - -
[  4]   9.00-10.00  sec  43.0 MBytes   361 Mbits/sec
[  6]   9.00-10.00  sec  42.9 MBytes   359 Mbits/sec
[  8]   9.00-10.00  sec  42.8 MBytes   359 Mbits/sec
[ 10]   9.00-10.00  sec  42.7 MBytes   358 Mbits/sec
[ 12]   9.00-10.00  sec  42.9 MBytes   360 Mbits/sec
[ 14]   9.00-10.00  sec  42.9 MBytes   359 Mbits/sec
[ 16]   9.00-10.00  sec  43.0 MBytes   360 Mbits/sec
[ 18]   9.00-10.00  sec  42.8 MBytes   359 Mbits/sec
[SUM]   9.00-10.00  sec   343 MBytes  2.88 Gbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth
[  4]   0.00-10.00  sec   426 MBytes   357 Mbits/sec                  sender
[  4]   0.00-10.00  sec   425 MBytes   357 Mbits/sec                  receiver
[  6]   0.00-10.00  sec   426 MBytes   357 Mbits/sec                  sender
[  6]   0.00-10.00  sec   426 MBytes   357 Mbits/sec                  receiver
[  8]   0.00-10.00  sec   426 MBytes   357 Mbits/sec                  sender
[  8]   0.00-10.00  sec   426 MBytes   357 Mbits/sec                  receiver
[ 10]   0.00-10.00  sec   425 MBytes   356 Mbits/sec                  sender
[ 10]   0.00-10.00  sec   425 MBytes   356 Mbits/sec                  receiver
[ 12]   0.00-10.00  sec   426 MBytes   357 Mbits/sec                  sender
[ 12]   0.00-10.00  sec   426 MBytes   357 Mbits/sec                  receiver
[ 14]   0.00-10.00  sec   425 MBytes   356 Mbits/sec                  sender
[ 14]   0.00-10.00  sec   425 MBytes   356 Mbits/sec                  receiver
[ 16]   0.00-10.00  sec   426 MBytes   357 Mbits/sec                  sender
[ 16]   0.00-10.00  sec   425 MBytes   357 Mbits/sec                  receiver
[ 18]   0.00-10.00  sec   425 MBytes   357 Mbits/sec                  sender
[ 18]   0.00-10.00  sec   425 MBytes   357 Mbits/sec                  receiver
[SUM]   0.00-10.00  sec  3.32 GBytes  2.85 Gbits/sec                  sender
[SUM]   0.00-10.00  sec  3.32 GBytes  2.85 Gbits/sec                  receiver

The only remaining point is that we have basically cut the time we spend doing non-MAC things down to the bone and, according to lockstat, we are no longer the most contended lock holder.
(flamegraph of the tx path omitted)

Zooming in a little on a representative call (percentages here of CPU time across examined stacks):
(flamegraph image omitted)

For context, xde_mc_tx is listed as taking 39.92% on this path, and str_mdata_fastpath_put at 21.50%. Packet parsing (3.36%) and processing (1.86%) times are nice and low! So we're now spending less time on each packet than MAC and the device driver do.

@FelixMcFelix FelixMcFelix added this to the 11 milestone Aug 22, 2024
@FelixMcFelix (Collaborator Author):
Using an L4-hash-derived source port looks like it is driving Rx traffic onto separate cores, from a quick look in dtrace -- in a one-port scenario this puts us back at being the second most-contended lock during a -P 8 iperf run. (A single-threaded run remains fairly uncontended.)

-------------------------------------------------------------------------------
Count indv cuml rcnt     nsec Lock                   Caller                  
260502  12%  41% 0.00      637 0xfffffcfa34386be8     _ZN4opte6engine4port13Port$LT$N$GT$12thin_process17h316b1c1b8ce14471E+0x7c

      nsec ------ Time Distribution ------ count                             
       256 |@@@@@@@@@@                     90486     
       512 |@@@@@@@@@@@                    97481     
      1024 |@@@@                           37487     
      2048 |@@                             20513     
      4096 |@                              11065     
      8192 |                               2786      
     16384 |                               533       
     32768 |                               116       
     65536 |                               20        
    131072 |                               13        
    262144 |                               1         
    524288 |                               1         

This doesn't really affect speed, but I expect this should mean that different port traffic will at least be able to avoid processing on the same CPU in many cases. E.g., when sled $A$ hosts ports $A_1, A_2$ and sled $B$ hosts ports $B_1, B_2$, all $A \leftrightarrow B$ combinations had the same outer 5-tuple $(A_{\mathit{IP6}},B_{\mathit{IP6}},\mathit{UDP},7777,6081)$ -- so identical Rx queue mapping. There's work to be done to get contention down further but that's beyond this PR's scope.
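Deriving the outer (Geneve) UDP source port from a hash of the inner flow can be sketched like this; the hasher and the exact port range are placeholder choices here, not what OPTE actually uses (OPTE reuses the L4 hash it already caches in the UFT):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hash the inner 5-tuple (src, dst, proto, sport, dport) into a UDP
/// source port, so distinct inner flows between the same pair of sleds
/// map to different NIC Rx queues via RSS on the outer headers.
fn geneve_src_port(inner_flow: &(u32, u32, u8, u16, u16)) -> u16 {
    let mut h = DefaultHasher::new();
    inner_flow.hash(&mut h);
    // Keep the result in the ephemeral range (49152..=65535) rather
    // than a single fixed port such as the old 7777.
    49152 + (h.finish() as u16 % 16384)
}
```

With a fixed source port, every A <-> B sled pair shares one outer 5-tuple and thus one Rx queue; with flow-derived entropy, the queue choice varies per inner flow.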

@twinfees twinfees added the customer For any bug reports or feature requests tied to customer requests label Aug 27, 2024
@FelixMcFelix FelixMcFelix self-assigned this Sep 6, 2024
Packet Rx is apparently 180% more costly now on `glasgow`.
TODO: find where the missing 250 Mbps has gone.
Notes from rough turning-off-and-on of the Old Way:

* Thin process is slower than it was before. I suspect this is due to
  the larger amount of things which have been shoved into the full
  Packet<Parsed> type once again. We're at 2.8--2.9 rather than 2.9--3.
* Thin process has a bigger performance impact on the Rx pathway than
  Tx:
   - Rx-only: 2.8--2.9
   - Tx-only: 2.74
   - None:    2.7
   - Old:   <=2.5

There might be value in first-classing an extra parse state for the
cases that we know we don't need to do arbitrary full-on transforms.
@FelixMcFelix FelixMcFelix left a comment

I think I'm happy with this, barring some open questions I've left in self-review. Some 145 tests are working/passing/rewritten.

As far as reviewability goes, we could cut compiled UFTs into a follow-up PR if need be. I don't believe that this would bring the size of the diff down substantially (perhaps by 1--1.5k lines?), given the nature of a stack rewrite like this.

Comment on lines 1213 to 1221
```rust
pub fn process<'a, M>(
    &self,
    dir: Direction,
    pkt: &mut Packet<Parsed>,
    mut ameta: ActionMeta,
) -> result::Result<ProcessResult, ProcessError> {
    let flow_before = *pkt.flow();
    let epoch = self.epoch.load(SeqCst);
    let mut data = self.data.lock();
    // TODO: might want to pass in a &mut to an enum
    // which can advance to (and hold) light->full-fat metadata.
    // My gutfeel is that there's a perf cost here -- this struct
    // is pretty fat, but expressing the transform on a &mut also sucks.
    mut pkt: Packet<LiteParsed<MsgBlkIterMut<'a>, M>>,
) -> result::Result<ProcessResult, ProcessError>
```
@FelixMcFelix (Collaborator Author):
The new CompiledUft changes (and unification of process_in/process_out) are here. I'm not too happy about pkt being passed by value, given the size of even the lite metadata formats. If there are any ideas, I'd be more than happy to see what we can do here.

```rust
) -> Result<LiteInPkt<MsgBlkIterMut<'_>, NP>, ParseError> {
    let pkt = Packet::new(pkt.iter_mut());
    pkt.parse_inbound(parser)
}
```
@FelixMcFelix (Collaborator Author):

Removed as of cdf1d59.

lib/opte/src/ddi/mblk.rs (outdated, resolved)
lib/opte/src/ddi/time.rs (outdated, resolved)
xde/x86_64-unknown-unknown.json (resolved)
Necessary to safely handle cases where, e.g., viona has pulled up part
of the packet for headers, but anything after this cutoff is guest
memory (thus, unsafe to construct a `&[u8]` or `&mut [u8]` over).

This also ensures that any time we count the bytes in a MsgBlk b_cont
chain, we do so exclusively using rptr and wptr (rather than
constructing a slice).

One piece left TODO is making sure that body transforms on such
packets are properly handled.
Seems to more reliably push us up to >=3.0Gbps, primarily by eliding the
fat `memcpy`s needed to move some of the metadata structs out (>128B).
lib/opte/src/ddi/mblk.rs (outdated, resolved)
```rust
/// * Return [`WrapError::Chain`] if `mp->b_next` or `mp->b_prev` are set.
pub unsafe fn wrap_mblk(ptr: *mut mblk_t) -> Result<Self, WrapError> {
    let inner = NonNull::new(ptr).ok_or(WrapError::NullPtr)?;
    let inner_ref = inner.as_ref();
```
Contributor:
If we're going to be turning the NonNull<mblk_t> into a reference (here, and elsewhere), we should probably verify that it meets the alignment requirements (not that there's any expectation that mblk pointers would fail them)

It seems like the code uses raw pointers in some places and references in others. Switching to raw pointers all the time might avoid some potential for UB relating to this, obviating the need for alignment checks on construction. I'm not sure which is the right approach for this abstraction.
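The suggested alignment check might look something like the sketch below. `MblkLike` is a toy stand-in for `mblk_t`, and the function body is illustrative, not the PR's actual `wrap_mblk`; the point is that a misaligned `&T` is immediate UB, so the check belongs before any reference is formed:

```rust
use std::ptr::NonNull;

#[repr(C)]
struct MblkLike {
    b_next: *mut MblkLike, // stand-in fields only
    len: usize,
}

#[derive(Debug, PartialEq)]
enum WrapError {
    NullPtr,
    Misaligned,
}

/// Validate a raw pointer before it may be turned into a reference:
/// reject null, then reject anything not meeting the type's alignment.
unsafe fn wrap(ptr: *mut MblkLike) -> Result<NonNull<MblkLike>, WrapError> {
    let nn = NonNull::new(ptr).ok_or(WrapError::NullPtr)?;
    if (nn.as_ptr() as usize) % std::mem::align_of::<MblkLike>() != 0 {
        return Err(WrapError::Misaligned);
    }
    Ok(nn)
}
```

(As noted above, sticking to raw-pointer dereferences everywhere sidesteps the need for this check at construction time.)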

@FelixMcFelix (Collaborator Author):
I've ended up going for raw dereferences across the board as of 33137dd, for the sake of consistency.

lib/opte/src/ddi/mblk.rs (outdated, resolved)

```rust
/// Drops all empty mblks from the start of this chain where possible
/// (i.e., any empty mblk is followed by another mblk).
pub fn drop_empty_segments(&mut self) {
```
Contributor:
We should probably be wary about any of the associated metadata when we're dropping mblks from the b_cont chain. If the first mblk is empty, but bears flags regarding checksums or LSO, it would be a bother to lose that info. This applies to basically all operations which are manipulating or copying (including pullup) mblk packets.

@FelixMcFelix (Collaborator Author):
We can copy db_struioun across during these operations. But I'm unsure what we should be doing with db_cksum{start, end, stuff}, and whether there are any flags we should be neutering (HCK_PARTIALCKSUM?). I've caused a few panics in LSO testing by including that one.

@FelixMcFelix (Collaborator Author):
This seems to run okay as of 33137dd.
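As a toy illustration of the metadata-preservation concern discussed above: when dropping empty leading segments from a b_cont-style chain, offload hints on the dropped head (checksum/LSO flags) should be carried to the surviving head. The field names below are invented, not the real mblk layout:

```rust
/// Toy model of a b_cont chain segment.
struct Seg {
    data: Vec<u8>,
    flags: u32, // stand-in for offload metadata (cksum/LSO hints)
    next: Option<Box<Seg>>,
}

/// Drop empty leading segments, OR-ing their flags onto the new head
/// so that offload metadata is not silently lost.
fn drop_empty_head(mut head: Box<Seg>) -> Box<Seg> {
    let mut flags = 0u32;
    while head.data.is_empty() && head.next.is_some() {
        flags |= head.flags; // preserve offload hints before dropping
        head = head.next.take().unwrap();
    }
    head.flags |= flags;
    head
}
```

The same care applies to any operation that copies or rebuilds the chain (including pullup), as the review comment notes.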

```rust
}

#[derive(Debug)]
pub struct MsgBlkNode(mblk_t);
```
Contributor:
Would be nice to have some documentation here about when/why MsgBlkNode should be used in lieu of MsgBlk. In particular, why one isn't implemented in terms of the other.

@FelixMcFelix (Collaborator Author):
I've put in some commentary here as of 27ecc8d, but we could push more methods down if required.

lib/opte/src/ddi/mblk.rs (outdated, resolved)
@FelixMcFelix (Collaborator Author):
See the comment left on ingot -- from conversations, the disposition is to get this onto dogfood and see how it performs and operates there. To date, we are confident of its stability under:

  • Many, many, iperf runs for varying lengths of time on glasgow (at, e.g., -P 128 parallel streams).
  • OPTE's own test suite, including ~2h fuzzing.
  • Successful runs through local/standalone omicron, both with and without "Wire up viona params from illumos#16738" (propolis#814). VM-to-VM, VM-to-external, and Zone-to-external (Nexus) traffic are all happily functional.

I'm merging just now with a view to getting this into R12 testing.

@FelixMcFelix FelixMcFelix merged commit 2e4d475 into master Nov 19, 2024
10 checks passed
@FelixMcFelix FelixMcFelix deleted the ingot branch November 19, 2024 19:42
Labels
customer (for any bug reports or feature requests tied to customer requests), perf