Cache routes received from next_hop #499
Conversation
This is not good code. It also appears to give us an extra 200Mbps.
7bbc401 to f313250
I think this is necessary in principle because we're including hash information in the routing key, and I'd rather not have this state balloon out of control.
// The following are wrappers for reference drop functions used in XDE.
It hasn't changed in this PR, but these functions seem iffy to me, since they make it trivial to trigger UB from safe Rust:
let i = 123;
let ptr = i as *const ip::ire_t;
ire_refrele(ptr); // probably bad!
I won't have time to squeeze this in to this PR, but I've opened #515 to track it separately.
I believe this was done to facilitate implementing `DropRef`. If there's a better/safer way to do this, I'm all for it.
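For what it's worth, one safer shape this could take (purely a sketch, not the current OPTE/`DropRef` API -- the `ire_t` stub, `IreGuard` type, and `from_raw` constructor here are all hypothetical) is to keep the raw FFI call `unsafe` and only hand out an RAII guard from the lookup path:

```rust
use core::ptr::NonNull;

// Opaque stand-in for the illumos `ire_t`; the real type comes from the
// generated `ip` bindings.
#[allow(non_camel_case_types)]
#[repr(C)]
pub struct ire_t {
    _opaque: [u8; 0],
}

extern "C" {
    // Drops a reference on an IRE. Calling this with anything other than a
    // pointer obtained from the IRE lookup routines is undefined behaviour,
    // so the declaration stays unsafe-only.
    fn ire_refrele(ire: *const ire_t);
}

/// Hypothetical RAII guard: the only way to build one is from a pointer the
/// caller asserts (via `unsafe`) came from an IRE lookup, and the refrele
/// then happens exactly once, on drop.
pub struct IreGuard(NonNull<ire_t>);

impl IreGuard {
    /// SAFETY: `ptr` must be a live IRE reference owned by the caller.
    pub unsafe fn from_raw(ptr: NonNull<ire_t>) -> Self {
        Self(ptr)
    }
}

impl Drop for IreGuard {
    fn drop(&mut self) {
        // SAFETY: upheld by the `from_raw` contract above.
        unsafe { ire_refrele(self.0.as_ptr()) };
    }
}
```

That way safe code can't conjure an IRE pointer out of an arbitrary integer, and the release can't be forgotten or doubled.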
/// A simple caching layer over `next_hop`.
#[derive(Clone)]
pub struct RouteCache(Arc<KRwLock<BTreeMap<RouteKey, CachedRoute>>>);
Out of curiosity, did you test a `HashMap` vs a `BTreeMap`?
I'll admit that I hadn't -- I was following the OPTE preference for `BTreeMap`s. But I threw something together quickly in criterion with `RouteKey` and `CachedRoute` to insert/lookup one row into an arbitrarily sized table:
Compiling opte-bench v0.1.0 (/Users/kyle/gits/opte/bench)
Finished bench [optimized + debuginfo] target(s) in 2.92s
Running benches/userland.rs (target/release/deps/userland-57e28d11e88fdd6e)
mapquest/btreemap/get/0 time: [287.02 ps 287.61 ps 288.30 ps]
Found 9 outliers among 100 measurements (9.00%)
5 (5.00%) high mild
4 (4.00%) high severe
mapquest/btreemap/get/2 time: [3.8893 ns 3.9047 ns 3.9191 ns]
Found 2 outliers among 100 measurements (2.00%)
2 (2.00%) high mild
mapquest/btreemap/get/8 time: [8.9178 ns 8.9306 ns 8.9465 ns]
Found 19 outliers among 100 measurements (19.00%)
5 (5.00%) high mild
14 (14.00%) high severe
mapquest/btreemap/get/64
time: [8.1967 ns 8.2355 ns 8.2792 ns]
Found 1 outliers among 100 measurements (1.00%)
1 (1.00%) high severe
mapquest/btreemap/get/512
time: [7.2225 ns 7.2445 ns 7.2702 ns]
Found 3 outliers among 100 measurements (3.00%)
3 (3.00%) high severe
mapquest/btreemap/get/8192
time: [2.8060 ns 2.8106 ns 2.8155 ns]
Found 5 outliers among 100 measurements (5.00%)
1 (1.00%) low mild
4 (4.00%) high mild
mapquest/btreemap/insert/0
time: [53.993 ns 54.513 ns 55.530 ns]
Found 5 outliers among 100 measurements (5.00%)
1 (1.00%) low mild
2 (2.00%) high mild
2 (2.00%) high severe
mapquest/btreemap/insert/2
time: [47.196 ns 47.271 ns 47.339 ns]
Found 3 outliers among 100 measurements (3.00%)
2 (2.00%) low mild
1 (1.00%) high mild
mapquest/btreemap/insert/8
time: [57.601 ns 57.679 ns 57.760 ns]
Found 5 outliers among 100 measurements (5.00%)
5 (5.00%) high mild
mapquest/btreemap/insert/64
time: [290.32 ns 290.57 ns 290.83 ns]
Found 7 outliers among 100 measurements (7.00%)
3 (3.00%) low mild
4 (4.00%) high mild
mapquest/btreemap/insert/512
time: [2.4069 µs 2.4108 µs 2.4145 µs]
Found 1 outliers among 100 measurements (1.00%)
1 (1.00%) high mild
mapquest/btreemap/insert/8192
time: [50.270 µs 50.590 µs 50.858 µs]
mapquest/hashmap/get/0 time: [287.28 ps 288.20 ps 289.29 ps]
Found 17 outliers among 100 measurements (17.00%)
5 (5.00%) high mild
12 (12.00%) high severe
mapquest/hashmap/get/2 time: [22.259 ns 23.395 ns 24.472 ns]
mapquest/hashmap/get/8 time: [21.733 ns 22.932 ns 24.172 ns]
mapquest/hashmap/get/64 time: [22.045 ns 23.297 ns 24.493 ns]
mapquest/hashmap/get/512
time: [19.729 ns 20.846 ns 22.009 ns]
mapquest/hashmap/get/8192
time: [19.731 ns 20.825 ns 21.964 ns]
mapquest/hashmap/insert/0
time: [46.374 ns 46.581 ns 46.796 ns]
Found 2 outliers among 100 measurements (2.00%)
1 (1.00%) low mild
1 (1.00%) high mild
mapquest/hashmap/insert/2
time: [43.214 ns 43.935 ns 44.742 ns]
Found 3 outliers among 100 measurements (3.00%)
3 (3.00%) high mild
mapquest/hashmap/insert/8
time: [66.543 ns 66.619 ns 66.698 ns]
Found 6 outliers among 100 measurements (6.00%)
3 (3.00%) low mild
3 (3.00%) high mild
mapquest/hashmap/insert/64
time: [59.119 ns 60.854 ns 62.838 ns]
mapquest/hashmap/insert/512
time: [295.78 ns 301.83 ns 308.87 ns]
Found 2 outliers among 100 measurements (2.00%)
1 (1.00%) high mild
1 (1.00%) high severe
mapquest/hashmap/insert/8192
time: [231.58 ns 232.38 ns 233.19 ns]
Found 4 outliers among 100 measurements (4.00%)
2 (2.00%) low severe
1 (1.00%) high mild
1 (1.00%) high severe
This sort of tracks with my understanding of why we're using `BTreeMap`s in general throughout OPTE: faster lookup at the cost of more expensive inserts. Perhaps a `HashMap` might work better here: the insert cost scaling up into microseconds around 256-512 entries is not pretty.
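For the curious, a harness of roughly this shape would produce the group/benchmark names above -- this is only a sketch with stand-in key/value types rather than the real `RouteKey`/`CachedRoute`, and it shows the lookup half only (the insert benches would follow the same pattern with `map.insert` in the hot loop):

```rust
use std::collections::{BTreeMap, HashMap};

use criterion::{criterion_group, criterion_main, BenchmarkId, Criterion};

// Stand-ins for `RouteKey` and `CachedRoute`.
type Key = (std::net::Ipv6Addr, u32);
type Value = u64;

fn key_for(i: u32) -> Key {
    (std::net::Ipv6Addr::new(0xfd00, 0, 0, 0, 0, 0, 0, i as u16), i)
}

fn bench_maps(c: &mut Criterion) {
    // Lookup of one key in tables of various sizes, BTreeMap first.
    let mut group = c.benchmark_group("mapquest/btreemap");
    for n in [0u32, 2, 8, 64, 512, 8192] {
        let map: BTreeMap<Key, Value> =
            (0..n).map(|i| (key_for(i), u64::from(i))).collect();
        group.bench_with_input(BenchmarkId::new("get", n), &n, |b, &n| {
            b.iter(|| map.get(&key_for(n / 2)))
        });
    }
    group.finish();

    // Same shape again for HashMap.
    let mut group = c.benchmark_group("mapquest/hashmap");
    for n in [0u32, 2, 8, 64, 512, 8192] {
        let map: HashMap<Key, Value> =
            (0..n).map(|i| (key_for(i), u64::from(i))).collect();
        group.bench_with_input(BenchmarkId::new("get", n), &n, |b, &n| {
            b.iter(|| map.get(&key_for(n / 2)))
        });
    }
    group.finish();
}

criterion_group!(benches, bench_maps);
criterion_main!(benches);
```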
Perhaps a more obvious reason why we use `BTree{Map,Set}` that totally slipped my mind: there is no `HashMap` in `alloc`/`no_std`. So we would need to find and measure a suitable replacement if we wanted to go down that road. In the meantime I think the best call might be to limit the cache size.
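A rough sketch of what "limit the cache size" could look like on the existing `BTreeMap` -- the `MAX_ROUTE_CACHE_ENTRIES` constant is a made-up placeholder, and `std::sync::RwLock` stands in here for the kernel-side `KRwLock`:

```rust
use std::collections::BTreeMap;
use std::sync::RwLock;

// Illustrative stand-ins for the real key/value types.
type RouteKey = (std::net::Ipv6Addr, u32);
#[derive(Clone, Copy)]
struct CachedRoute;

/// Hypothetical cap on cache entries; the real number would come from
/// measurement.
const MAX_ROUTE_CACHE_ENTRIES: usize = 512;

struct RouteCache(RwLock<BTreeMap<RouteKey, CachedRoute>>);

impl RouteCache {
    /// Insert or refresh a route, but never let the table grow past the cap.
    fn store(&self, key: RouteKey, route: CachedRoute) {
        let mut map = self.0.write().unwrap();
        // Refreshing an existing (possibly expired) entry is always fine:
        // it does not change the table size.
        if map.contains_key(&key) || map.len() < MAX_ROUTE_CACHE_ENTRIES {
            map.insert(key, route);
        }
        // Otherwise the table is full and there is no slot to reuse:
        // drop the lock and skip the insert.
    }
}
```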
Sorry I don't have a ton of context here, but do we expect the nexthop cache to have a lot of regular churn?
There shouldn't be in practice -- the above is very much worst-case, and I'm thinking about the case where a guest might have many active flows that are actively refreshing their routes and then adds one (/n) more.
In the general case of a steady number of flows:
- An extant flow will call `next_hop` every 100ms in case preferences or reachability change. If there's an existing entry (even if expired), we should only be paying for a lookup rather than a full insert on the `BTreeMap`.
- Expired routes are cleared out from the map every 1s.
The last part might add to the risk of churn; I'll change it to only evict entries which have been expired for longer than the cleanup period, to make refresh more consistent.
My intuition here is that we want to optimize for queries rather than inserts. And if that intuition is not correct I really want to understand why. If this can be modeled as an LPM query, then using poptrie might be a win.
I think I'm in agreement on that front. Summarising the other comment I made, we can rework this as an LPM query on underlay destination alone (basically removing insert cost), but we'd need to redo how we walk and inspect IREs for a destination.
This also includes more details on the relative costs we're trying to offset, and the future ease of implementing/finding a suitable hashmap.
947bef4 to 1b1b5af
This places 'requesting a new IRE from illumos' and 'removing the allocated slot in the `RouteCache`' onto two different timers. This should allow a flow to keep reusing its existing slot without needing to redo the expensive insert that a removal would incur.
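To make the two timers concrete, a sketch (with `std::time::Instant` standing in for whatever clock source the kernel code actually uses, and `CLEANUP_PERIOD` assumed to be the 1s `Periodic` interval discussed above):

```rust
use std::time::{Duration, Instant};

/// How long a cached answer from `next_hop` is trusted before a flow
/// re-asks illumos for a fresh IRE (value taken from the PR).
const EXPIRE_ROUTE_LIFETIME: Duration = Duration::from_millis(100);

/// Assumed interval of the periodic cleanup task.
const CLEANUP_PERIOD: Duration = Duration::from_secs(1);

struct CachedRoute {
    // ... resolved next hop, underlay source/destination, etc.
    last_refreshed: Instant,
}

impl CachedRoute {
    /// Timer 1: does the flow need to go back to illumos for a new IRE?
    fn is_current(&self, now: Instant) -> bool {
        now.duration_since(self.last_refreshed) < EXPIRE_ROUTE_LIFETIME
    }

    /// Timer 2: has this slot sat expired for so long that the cleanup
    /// task should reclaim it? An active flow refreshes `last_refreshed`
    /// well before this fires, so it keeps its slot and avoids a fresh
    /// `BTreeMap` insert.
    fn should_evict(&self, now: Instant) -> bool {
        now.duration_since(self.last_refreshed)
            >= EXPIRE_ROUTE_LIFETIME + CLEANUP_PERIOD
    }
}
```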
Thanks Kyle. Mostly high-level overall direction comments from my end. Code looks good to me, but let's look into the expiry rollover safety that I noted in the comments below.
// Adjacent to xde is the native IPv6 stack along with its routing
// table. This table is routinely updated to indicate the best path to
// any given IPv6 destination that may be specified in the outer IP
// header. As xde is not utilizing the native IPv6 stack to send out
I wonder if we should revisit our tx-path entry point. I'm not sure if there is an intrinsic reason we need to tx directly to mac, and illumos has its own caching mechanisms like destination cache entries (DCE) that are likely a lot quicker than full route lookups.
That's not to say we should drop this work in favor of that direction, as we're getting a clear win here, but something to think about for sure. There is also an argument to be made in the other direction that OPTE should be as self-sufficient as possible, so lots of tradeoffs to consider.
This is part of what #504 is getting at; we do already need to revisit the tx path as part of getting the NICs back into hardware classification. Although, the way I've been going about that so far is by getting another client handle at the DLS layer, so it's not fundamentally different from `mac_tx`.
Coming at it from the IP level is probably safer still. It looks to me like the easiest way of doing this from an NCE/IRE (using public APIs) is `ip_xmit`, but this works on one packet at a time. There could be something worth copying in how it uses the ILL to perform the actual send?
///
/// Naturally, bringing this down to O(ns) is desirable. Usually, illumos
/// holds `ire_t`s per `conn_t`, but we're aiming to be more fine-grained
/// with DDM -- so we need a tradeoff between 'asking about the best route
I would not index too heavily on what will happen once the DDM data plane lands. There are some questions as to where this is going to fit into the overall data plane - and one possibility is that it goes all the way down to the NIC as an offload.
// counts.
// If full and we have no old entry to update, drop the lock and do
// not insert.
// XXX: Want to profile in future to see if LRU expiry is
I think this is worth an issue, and worth measuring cache performance when there is 1) a pool of destination route keys larger than the cap that is 2) churning within the expiry time. One could easily imagine this scenario arising for a busy web server with lots of clients. I think the reason we have the L4 hash in the key is to pin flows to paths, so expressing this as an LPM to reduce table size would probably be tricky?
I'll add some extra probes around cache events so we can observe cases like this.
The L4 hash in the key captures behaviour we have today, which is that OPTE uses multiple sled-to-sled paths depending on flow hash. We could recast that behaviour into an LPM of O(sleds) entries: if we can query and cache all reachability and preference information for each underlay destination (e.g., over both NIC links), then we can handle distribution over paths ourselves. We could possibly walk `ire->ire_bucket` in the same vein as `ire_round_robin` to fill that out?
I think that would be the right call; it'd be follow-up work that would likely replace large parts of this, so it's up to you on whether we use what we have for now and keep an LPM in mind as the next step (until we can offload, that is 😄).
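Roughly what that recast could look like -- a sketch only, with a hypothetical per-destination `UnderlayPaths` entry that an LPM lookup (or an `ire_bucket` walk) would populate:

```rust
use std::net::Ipv6Addr;

/// Hypothetical resolved path: which underlay NIC/link to use and the
/// next-hop router reached over it.
#[derive(Clone, Copy)]
struct Path {
    underlay_port: u8,
    next_hop: Ipv6Addr,
}

/// All viable paths to one underlay destination, as gathered from the
/// routing table (e.g. one per NIC link).
struct UnderlayPaths {
    paths: Vec<Path>,
}

impl UnderlayPaths {
    /// Pin a flow to one of the cached paths using its L4 hash, so the
    /// cache only needs O(sleds) entries rather than one per
    /// (destination, flow-hash) pair.
    fn select(&self, l4_hash: u32) -> Option<Path> {
        if self.paths.is_empty() {
            return None;
        }
        let idx = l4_hash as usize % self.paths.len();
        self.paths.get(idx).copied()
    }
}
```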
Let's take the win in front of us and continue to consider recasting things to fit in an LPM lookup later on.
Thanks again, #539 opened.
///
/// Expired routes will not be removed from the cache, and will leave
/// an entry to enable a quick in-place refresh in the `BTreeMap`.
const EXPIRE_ROUTE_LIFETIME: Duration = Duration::from_millis(100);
For the moment, this is well within the DDM control plane neighbor expiry lifetime. We'll revisit when the DDM data plane lands.
This PR inserts a caching layer in front of `next_hop` to store routes for a given `(IpAddr6, L4Hash)` for around 100ms, and reorganises routing-related code into `xde::route`. This is a necessary step as today we spend around 21% of `xde_mc_tx` in route lookup on a per-outbound-packet basis. As a consequence, we need to install a new `Periodic` to flush expired entries every 1s, as otherwise we can have a large amount of detritus pile up from different L4Hash values.
Currently this is implemented as a per-port cache rather than shared across the driver -- I've done this because it will give us better sharding of concurrent readers/writers at higher port/guest counts. The PR as-is pulls us up from ~1.74Gbps to 2Gbps on a single link between sn9/14.
I'm seeing a similar gain on the `all-perf-in-flight` branch on in-review illumos bits (which includes mac flows + packet chains), where this change brings us from 2.25 Gbps to 2.50 Gbps on this setup.
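For anyone skimming, the cached lookup amounts to roughly the following -- a sketch with illustrative signatures and `std` types standing in for the kernel lock and address types used by the real `RouteCache` in `xde::route`:

```rust
use std::collections::BTreeMap;
use std::net::Ipv6Addr;
use std::sync::{Arc, RwLock};
use std::time::{Duration, Instant};

// Illustrative stand-ins for the types in `xde::route`.
type RouteKey = (Ipv6Addr, u32); // (outer IPv6 destination, L4 hash)

#[derive(Clone, Copy)]
struct CachedRoute {
    expires_at: Instant,
    // ... resolved underlay source/destination, target port, etc.
}

// Placeholder for the expensive IRE/NCE walk that `next_hop` performs today.
fn expensive_next_hop_lookup(_key: RouteKey) -> CachedRoute {
    CachedRoute {
        expires_at: Instant::now() + Duration::from_millis(100),
    }
}

#[derive(Clone)]
struct RouteCache(Arc<RwLock<BTreeMap<RouteKey, CachedRoute>>>);

impl RouteCache {
    fn next_hop(&self, key: RouteKey) -> CachedRoute {
        let now = Instant::now();
        // Fast path: a read lock and a lookup, which is what most packets
        // on an established flow should hit.
        if let Some(route) = self.0.read().unwrap().get(&key) {
            if now < route.expires_at {
                return *route;
            }
        }
        // Slow path: do the expensive walk, then insert or refresh the
        // slot under the write lock.
        let fresh = expensive_next_hop_lookup(key);
        self.0.write().unwrap().insert(key, fresh);
        fresh
    }
}
```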