# Want smarter use of existing structure in `Packet::emit_packet_headers` (#481)
I did some poking around in ubench to see what a reasonable upper limit in time savings is here, before embarking on a redesign of `emit_packet_headers`. Changes to `userland.rs`:

```rust
|(mut pkt, dir)| {
    let _ = black_box(|| {
        // I.e., btreemap lookup, lock acquire.
        port.port.pseudo_process(dir, &mut pkt, ActionMeta::new());
        match dir {
            In => {
                // Write fields (ETH dst, IP dst).
                // Recompute cksum.
                // Chop off encap.
                let inner_len = pkt.meta().inner.hdr_len();
                let mut ipsum: Checksum =
                    HeaderChecksum::wrap(pkt.meta().inner_ip4().unwrap().csum).into();
                let mut tcpsum: Checksum =
                    HeaderChecksum::wrap(pkt.meta().inner_tcp().unwrap().csum).into();
                let old_addr = pkt.meta().inner_ip4().unwrap().dst;
                pkt.segs[0].expand_start(inner_len);
                let mut wtr = pkt.segs[0].get_writer();
                let bytes = wtr.slice_mut(inner_len).unwrap();
                // ETH [0..14]
                bytes[0..6].copy_from_slice(&mac);
                // IP [14..34]
                bytes[30..34].copy_from_slice(&int_ip);
                ipsum.sub_bytes(&old_addr);
                ipsum.add_bytes(&int_ip);
                bytes[24..26].copy_from_slice(&ipsum.finalize().to_be_bytes());
                tcpsum.sub_bytes(&old_addr);
                tcpsum.add_bytes(&int_ip);
                // TCP [34..54]
                bytes[50..52].copy_from_slice(&tcpsum.finalize().to_be_bytes());
            }
            Out => {
                // Write fields (ETH src, IP src).
                // Recompute cksum.
                // Push precomputed encap.
                let inner_len = pkt.meta().inner.hdr_len();
                let mut ipsum: Checksum =
                    HeaderChecksum::wrap(pkt.meta().inner_ip4().unwrap().csum).into();
                let mut tcpsum: Checksum =
                    HeaderChecksum::wrap(pkt.meta().inner_tcp().unwrap().csum).into();
                let old_addr = pkt.meta().inner_ip4().unwrap().src;
                pkt.segs[0].expand_start(inner_len);
                let mut wtr = pkt.segs[0].get_writer();
                let bytes = wtr.slice_mut(inner_len).unwrap();
                // ETH [0..14]
                bytes[6..12].copy_from_slice(&mac);
                // IP [14..34]
                bytes[26..30].copy_from_slice(&ext_ip);
                ipsum.sub_bytes(&old_addr);
                ipsum.add_bytes(&ext_ip);
                bytes[24..26].copy_from_slice(&ipsum.finalize().to_be_bytes());
                tcpsum.sub_bytes(&old_addr);
                tcpsum.add_bytes(&ext_ip);
                // TCP [34..54]
                bytes[50..52].copy_from_slice(&tcpsum.finalize().to_be_bytes());

                let mut seg = unsafe {
                    let mp = allocb(encap_len);
                    PacketSeg::wrap_mblk(mp)
                };
                // NOTE: encap_dummy_bytes is a prebuilt vector.
                seg.expand_end(encap_len);
                let mut wtr = seg.get_writer();
                let wrt = wtr.slice_mut(encap_len).unwrap();
                wrt.copy_from_slice(&encap_dummy_bytes);
                seg.link(&pkt.segs[0]);
                pkt.segs.insert(0, seg);
            }
        };
    });
},
```

(There are a bunch of extra details omitted here.) TL;DR: it captures a reasonable subset of a VPC-external traffic flow: we grab a lock on the port, look up the flowkey in the btreemap, source/dest IP is rewritten, source/dest MAC is rewritten, checksums are updated and rewritten, and in the outbound case we memcpy a pre-prepared packet header stack of length `encap_len`.

Roughly, perf looks like (ignoring the vague complaints about regression):
```
#
# before
#
Benchmarking process/ULP-FastPath/wallclock/V4-Tcp-OUT-1400B: Collecting 100 samples in estimated 5.0084 s
process/ULP-FastPath/wallclock/V4-Tcp-OUT-1400B
                        time:   [293.83 ns 295.23 ns 296.70 ns]
                        change: [+1.0641% +1.7977% +2.4187%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe
process/ULP-FastPath/wallclock/V4-Tcp-IN-1400B
                        time:   [216.11 ns 217.07 ns 218.10 ns]
                        change: [+1.1622% +1.6947% +2.2003%] (p = 0.00 < 0.05)
                        Performance has regressed.

#
# after
#
Benchmarking strawman/V4-Tcp-OUT-1400B(hypothetical): Collecting 100 samples in estimated 5.0071 s (2.2M iterations)
strawman/V4-Tcp-OUT-1400B(hypothetical)
                        time:   [73.838 ns 74.196 ns 74.560 ns]
                        change: [-1.2417% -0.5123% +0.2084%] (p = 0.18 > 0.05)
                        No change in performance detected.
strawman/V4-Tcp-IN-1400B(hypothetical)
                        time:   [72.621 ns 73.367 ns 74.216 ns]
                        change: [-0.9970% +0.0679% +1.1228%] (p = 0.91 > 0.05)
                        No change in performance detected.
Found 4 outliers among 100 measurements (4.00%)
  4 (4.00%) high mild
```
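The `sub_bytes`/`add_bytes` calls in the strawman rely on the incremental Internet-checksum update trick (RFC 1624): fold the old field value out of the checksum and the new value in, rather than re-summing the whole header. Here is a minimal standalone sketch of that technique; the function names are illustrative only, not OPTE's actual `Checksum` API:

```rust
/// Fold a u32 one's-complement accumulator back down to 16 bits.
fn fold(mut acc: u32) -> u16 {
    while acc >> 16 != 0 {
        acc = (acc & 0xffff) + (acc >> 16);
    }
    acc as u16
}

/// Full Internet checksum over `bytes` (even length assumed).
fn checksum(bytes: &[u8]) -> u16 {
    let mut acc = 0u32;
    for chunk in bytes.chunks(2) {
        acc += u16::from_be_bytes([chunk[0], chunk[1]]) as u32;
    }
    !fold(acc)
}

/// Incrementally update `csum` when one 16-bit word changes from `old` to
/// `new` (RFC 1624, Eqn. 3): HC' = ~(~HC + ~m + m').
fn update(csum: u16, old: u16, new: u16) -> u16 {
    let mut acc = (!csum) as u32; // recover the one's-complement sum
    acc += (!old) as u32;         // subtract the old word
    acc += new as u32;            // add the new word
    !fold(acc)
}
```

The key property is that the incremental update and a full recompute over the modified header land on the same value, so only the changed words ever need to be touched.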
So in the best case for VPC-external traffic, the strawman runs at roughly a quarter of the current cost (~74 ns vs ~295 ns outbound). While it's worth knowing that there's a big gap here, the effort needed to enable this (i.e., lazy reads + dirty write tracking, knowledge of structural changes) is probably pretty substantial relative to the gain. Going by ballpark figures from flamegraphs, I guess solving this will give us a ≤1.3× speedup overall, whereas doing some route caching can probably get us a similar speedup with less effort. I'll puzzle this out for a while.
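As a back-of-envelope check on that ceiling (my arithmetic, not from the issue): the strawman saves roughly 221 ns per outbound packet (295 ns → 74 ns), so for the end-to-end speedup to cap out at 1.3×, the total per-packet cost would have to be on the order of 1 µs:

```rust
// Amdahl-style estimate: if total per-packet time is T and we shave off s,
// the speedup is T / (T - s). Solving for T given a target speedup S:
//   T / (T - s) = S  =>  T = S * s / (S - 1)
fn implied_total_ns(saved_ns: f64, speedup: f64) -> f64 {
    speedup * saved_ns / (speedup - 1.0)
}

fn main() {
    // ~221 ns saved at a 1.3x ceiling implies a total path of ~958 ns.
    println!("{:.0} ns", implied_total_ns(221.0, 1.3));
}
```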
To process a packet, we currently read each layer and then advance the r/w pointers of the `mblk_t` containing the packet and its headers. This leaves the original headers in place. We then determine how many bytes the complete header stack would require (`Option<encap {L2, L3, L4}>` + `Ulp {L2, L3, L4}`), before using the existing space or allocating a new mblk to be placed at the front of the chain and copying the entire (post-transform) header set back into place.

However, many of our transforms when packets are `Modified` do not alter the entire header stack, or even the structure of the ULP packet headers. While needing to add or remove the Geneve header group is universal, VPC-internal traffic would actually require no changes to the ULP, and would in effect only modify the L2 addrs of transformed traffic. VPC-external packets typically only require that we change L3 src/dest addrs or L4 ports (due to NAT/SNAT) in addition to this (and, accordingly, the checksum). It would likely be more clever to patch the existing headers in place when only field values change, and fall back to a full re-emit only when the header structure does.
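To make the patch-in-place idea concrete, here is a toy model of the outbound fast path, with plain `Vec`s standing in for `mblk_t` segments and hypothetical names throughout (none of this is OPTE's actual API): rewrite only the dirty inner fields, then prepend the precomputed encap as its own segment instead of re-serializing the whole stack.

```rust
// Toy model: segments are Vec<u8>, the packet is a Vec of segments.
// Offsets match the strawman above: Ethernet [0..14], IPv4 [14..34],
// so the IPv4 source address sits at bytes 26..30.
fn emit_outbound(
    mut inner: Vec<u8>,       // first segment: inner headers + payload
    encap_template: &[u8],    // prebuilt outer L2/L3/Geneve bytes
    new_src_ip: [u8; 4],
) -> Vec<Vec<u8>> {
    // Patch only the field that actually changed (checksum updates elided).
    inner[26..30].copy_from_slice(&new_src_ip);
    // One memcpy of a fixed-size template, linked in as a new head segment;
    // the inner headers are never copied or re-serialized.
    vec![encap_template.to_vec(), inner]
}
```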
This might in theory be extended to `Hairpin` packets under limited circumstances, but I believe that through-traffic is the most important occurrence to capture.