Want smarter use of existing structure in Packet::emit_packet_headers #481

Closed
FelixMcFelix opened this issue Mar 28, 2024 · 1 comment · Fixed by #585
Labels: customer (For any bug reports or feature requests tied to customer requests), perf

@FelixMcFelix (Collaborator) commented Mar 28, 2024

To process a packet, we currently read each layer and then advance the r/wptr fields of the mblk_t containing the packet and its headers, which leaves the original headers in place. We then determine how many bytes the complete header stack would require (Option<encap {L2, L3, L4}> + ULP {L2, L3, L4}), before either using the existing space or allocating a new mblk to be placed at the front of the chain, and copying the entire (post-transform) header set back into place.
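
As a rough illustration of that flow (using made-up `Seg`/`HeaderStack` stand-ins, not OPTE's actual `Packet`/`PacketSeg` API), the current emit amounts to something like:

```rust
// Hypothetical model of the current emit path. `Seg` stands in for an
// mblk_t-backed segment whose read pointer was advanced past the
// original headers during parsing.
struct Seg {
    buf: Vec<u8>,
    rptr: usize, // read offset; parsing advanced this past the old headers
}

struct HeaderStack {
    // The complete, post-transform header set (encap + ULP), fully
    // re-serialized regardless of how little actually changed.
    bytes: Vec<u8>,
}

fn emit_packet_headers(segs: &mut Vec<Seg>, hdrs: &HeaderStack) {
    let needed = hdrs.bytes.len();
    if segs[0].rptr >= needed {
        // Enough room where the old headers used to live: back the read
        // pointer up and overwrite the whole stack in place.
        let head = &mut segs[0];
        head.rptr -= needed;
        let r = head.rptr;
        head.buf[r..r + needed].copy_from_slice(&hdrs.bytes);
    } else {
        // Otherwise, allocate a fresh segment at the front of the chain
        // and copy the entire header stack into it.
        segs.insert(0, Seg { buf: hdrs.bytes.clone(), rptr: 0 });
    }
}
```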

However, when packets are Modified, many of our transforms do not alter the entire header stack, or even the structure of the ULP packet headers. While the need to add or remove the Geneve header group is universal, VPC-internal traffic actually requires no ULP changes at all and in effect only modifies the L2 addrs of transformed traffic. VPC-external packets typically require, in addition, only that we change the L3 src/dest addrs or L4 ports (due to NAT/SNAT) and, accordingly, the checksums.

It would likely be smarter to:

  • Determine whether any parts of the output headers have structurally changed, i.e., whether new layers or extension fields have been added. If the structure of a header group has not changed, then we can update the modified fields – and only the modified fields – in place (a minimal sketch follows this list).
  • Allow the encap packet headers and ULP packet headers to exist within different segments. This would let us allocate and write only the Geneve header stack from scratch, and apply the above optimisation to ULPs on all encap'd traffic.
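
As a minimal sketch of the first point above (illustrative types only, not a proposed API), a transform could hand back a set of field patches rather than a fully re-serialized header stack:

```rust
// Illustrative only: what a transform might record instead of a fully
// re-serialized header stack.
enum Patch {
    /// The structure is unchanged; overwrite `bytes.len()` bytes at
    /// `offset` within the existing headers (e.g. an L2 addr, an L3
    /// src/dst rewrite, or a recomputed checksum).
    Overwrite { offset: usize, bytes: Vec<u8> },
    /// The structure changed (layer added/removed, extension fields
    /// grew): this header group must be re-serialized from scratch.
    Reserialize,
}

/// Apply patches in place; returns false if any structural change
/// forces the slow path for this header group.
fn apply_patches(hdr_bytes: &mut [u8], patches: &[Patch]) -> bool {
    for p in patches {
        match p {
            Patch::Overwrite { offset, bytes } => {
                hdr_bytes[*offset..*offset + bytes.len()].copy_from_slice(bytes);
            }
            Patch::Reserialize => return false,
        }
    }
    true
}
```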

This might in theory be extended to Hairpin packets under limited circumstances, but I believe that through-traffic is the most important occurrence to capture.

@FelixMcFelix (Collaborator, Author) commented Apr 11, 2024

I did some poking around in ubench to see what a reasonable upper limit on time savings is here, before embarking on a redesign of PacketMeta, based on a very rough strawman:

Changes to `userland.rs` (the strawman; `mac`, `int_ip`, `ext_ip`, `encap_len`, and `encap_dummy_bytes` are captured from the enclosing benchmark setup):

```rust
|(mut pkt, dir)| {
    let _ = black_box(|| {
        // I.e., btreemap lookup, lock acquire.
        port.port.pseudo_process(dir, &mut pkt, ActionMeta::new());
        match dir {
            In => {
                // Write fields (ETH dst, IP DST)
                // Recompute Cksum.
                // Chop off encap.
                let inner_len = pkt.meta().inner.hdr_len();
                let mut ipsum: Checksum = HeaderChecksum::wrap(pkt.meta().inner_ip4().unwrap().csum).into();
                let mut tcpsum: Checksum = HeaderChecksum::wrap(pkt.meta().inner_tcp().unwrap().csum).into();
                let old_addr = pkt.meta().inner_ip4().unwrap().dst;

                pkt.segs[0].expand_start(inner_len);
                let mut wtr = pkt.segs[0].get_writer();
                let bytes = wtr.slice_mut(inner_len).unwrap();
                // ETH [0..14]
                bytes[0..6].copy_from_slice(&mac);
                // IP [14..34]
                bytes[30..34].copy_from_slice(&int_ip);
                
                ipsum.sub_bytes(&old_addr);
                ipsum.add_bytes(&int_ip);
                bytes[24..26].copy_from_slice(&ipsum.finalize().to_be_bytes());

                tcpsum.sub_bytes(&old_addr);
                tcpsum.add_bytes(&int_ip);
                // TCP [34..54]
                bytes[50..52].copy_from_slice(&tcpsum.finalize().to_be_bytes());
            },
            Out => {
                // Write fields (ETH src, IP SRC)
                // Recompute Cksum.
                // Push precomputed encap.
                let inner_len = pkt.meta().inner.hdr_len();
                let mut ipsum: Checksum = HeaderChecksum::wrap(pkt.meta().inner_ip4().unwrap().csum).into();
                let mut tcpsum: Checksum = HeaderChecksum::wrap(pkt.meta().inner_tcp().unwrap().csum).into();
                let old_addr = pkt.meta().inner_ip4().unwrap().src;

                pkt.segs[0].expand_start(inner_len);
                let mut wtr = pkt.segs[0].get_writer();
                let bytes = wtr.slice_mut(inner_len).unwrap();
                // ETH [0..14]
                bytes[6..12].copy_from_slice(&mac);
                // IP [14..34]
                bytes[26..30].copy_from_slice(&ext_ip);
                
                ipsum.sub_bytes(&old_addr);
                ipsum.add_bytes(&ext_ip);
                bytes[24..26].copy_from_slice(&ipsum.finalize().to_be_bytes());

                tcpsum.sub_bytes(&old_addr);
                tcpsum.add_bytes(&ext_ip);
                // TCP [34..54]
                bytes[50..52].copy_from_slice(&tcpsum.finalize().to_be_bytes());

                let mut seg = unsafe {
                    let mp = allocb(encap_len);
                    PacketSeg::wrap_mblk(mp)
                };

                // NOTE: encap_dummy_bytes is a prebuilt vector.
                seg.expand_end(encap_len);
                let mut wtr = seg.get_writer();
                let wrt = wtr.slice_mut(encap_len).unwrap();
                wrt.copy_from_slice(&encap_dummy_bytes);

                seg.link(&pkt.segs[0]);
                pkt.segs.insert(0, seg);
            },
        };
    });
},
```

(There are a bunch of extra `pub` modifiers added to existing elements to make this run.)

TL;DR: it captures a reasonable subset of a VPC-external traffic flow: we grab a lock on the port, look up the flowkey in the btreemap, rewrite the source/dest IP, rewrite the source/dest MAC, update and rewrite the checksums, and in the outbound case memcpy a pre-prepared packet header stack of length ETH + v6 + UDP + Geneve + GeneveExt.
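
Aside: the `sub_bytes`/`add_bytes` dance above is the RFC 1624 incremental-update trick, which avoids re-summing the whole header. A self-contained sketch, using a hypothetical `Csum` rather than OPTE's actual `Checksum`/`HeaderChecksum` types:

```rust
/// Hypothetical stand-in for a one's-complement checksum accumulator:
/// it holds the running sum with the final inversion deferred to
/// `finalize`, per RFC 1071/1624.
struct Csum(u32);

impl Csum {
    /// Wrap an existing header checksum field (as it appears on the wire).
    fn from_header(csum: u16) -> Self {
        // Un-invert to recover the running one's-complement sum.
        Csum(u32::from(!csum))
    }

    /// Fold 16-bit big-endian words out of `bytes` into the running sum.
    fn add_bytes(&mut self, bytes: &[u8]) {
        for chunk in bytes.chunks(2) {
            let word = u16::from_be_bytes([chunk[0], *chunk.get(1).unwrap_or(&0)]);
            self.0 += u32::from(word);
        }
    }

    /// Remove words by adding their one's complement (RFC 1624, eqn. 3).
    fn sub_bytes(&mut self, bytes: &[u8]) {
        for chunk in bytes.chunks(2) {
            let word = u16::from_be_bytes([chunk[0], *chunk.get(1).unwrap_or(&0)]);
            self.0 += u32::from(!word);
        }
    }

    /// Fold the carries back in and re-invert for the wire.
    fn finalize(mut self) -> u16 {
        while self.0 > 0xffff {
            self.0 = (self.0 & 0xffff) + (self.0 >> 16);
        }
        !(self.0 as u16)
    }
}

fn main() {
    // Rewrite an IPv4 destination and patch the checksum in place,
    // without summing the whole header again.
    let old_dst = [10, 0, 0, 1];
    let new_dst = [172, 30, 0, 5];
    let csum_on_wire: u16 = 0xb1e6; // example value

    let mut csum = Csum::from_header(csum_on_wire);
    csum.sub_bytes(&old_dst);
    csum.add_bytes(&new_dst);
    println!("patched checksum: {:#06x}", csum.finalize());
}
```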

Roughly, perf looks like (ignoring the vague complaints about regression):

```
#
# before
#
Benchmarking process/ULP-FastPath/wallclock/V4-Tcp-OUT-1400B: Collecting 100 samples in estimated 5.0084 s
process/ULP-FastPath/wallclock/V4-Tcp-OUT-1400B
                        time:   [293.83 ns 295.23 ns 296.70 ns]
                        change: [+1.0641% +1.7977% +2.4187%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe

process/ULP-FastPath/wallclock/V4-Tcp-IN-1400B
                        time:   [216.11 ns 217.07 ns 218.10 ns]
                        change: [+1.1622% +1.6947% +2.2003%] (p = 0.00 < 0.05)
                        Performance has regressed.
#
# after
#
Benchmarking strawman/V4-Tcp-OUT-1400B(hypothetical): Collecting 100 samples in estimated 5.0071 s (2.2M iterations)
strawman/V4-Tcp-OUT-1400B(hypothetical)
                        time:   [73.838 ns 74.196 ns 74.560 ns]
                        change: [-1.2417% -0.5123% +0.2084%] (p = 0.18 > 0.05)
                        No change in performance detected.

strawman/V4-Tcp-IN-1400B(hypothetical)
                        time:   [72.621 ns 73.367 ns 74.216 ns]
                        change: [-0.9970% +0.0679% +1.1228%] (p = 0.91 > 0.05)
                        No change in performance detected.
Found 4 outliers among 100 measurements (4.00%)
  4 (4.00%) high mild
```

So in the best case for VPC-external traffic, Port::process could get around a ~3X speedup on inbound packet processing (~217 ns → ~73 ns) and a ~4X speedup on outbound (~295 ns → ~74 ns) -- assuming we can build up the transforms for UFT-hit packets, at least.

While it's worth knowing that there's a big gap here, the effort needed to enable this (i.e., lazy reads + dirty write tracking, knowledge of structural changes) is probably pretty substantial relative to the gain: ballpark figures from flamegraphs suggest that solving this would give us a ≤1.3X end-to-end speedup, whereas doing some route caching can probably get us a similar speedup with less effort. I'll puzzle this out for a while.

@twinfees added the customer label on Aug 27, 2024