Cache routes received from next_hop #499
Conversation
This is not good code. It also appears to give us an extra 200Mbps.
7bbc401 to f313250
I think this is necessary in principle because we're including hash information in the routing key, and I'd rather not have this state balloon out of control.
// The following are wrappers for reference drop functions used in XDE.
It hasn't changed in this PR, but these functions seem iffy to me, since they make it trivial to trigger UB from safe Rust:
let i = 123;
let ptr = i as *const ip::ire_t;
ire_refrele(ptr); // probably bad!
I won't have time to squeeze this in to this PR, but I've opened #515 to track it separately.
I believe this was done to facilitate implementing `DropRef`. If there's a better/safer way to do this, I'm all for it.
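For what it's worth, one safer shape this could take (purely a sketch, not the current OPTE/`DropRef` API -- the `ire_t` stub, `IreGuard` type, and `from_raw` constructor here are all hypothetical) is to keep the raw FFI call `unsafe` and only hand out an RAII guard from the lookup path:

```rust
use core::ptr::NonNull;

// Opaque stand-in for the illumos `ire_t`; the real type comes from the
// generated `ip` bindings.
#[allow(non_camel_case_types)]
#[repr(C)]
pub struct ire_t {
    _opaque: [u8; 0],
}

extern "C" {
    // Drops a reference on an IRE. Calling this with anything other than a
    // pointer obtained from the IRE lookup routines is undefined behaviour,
    // so the declaration stays unsafe-only.
    fn ire_refrele(ire: *const ire_t);
}

/// Hypothetical RAII guard: the only way to build one is from a pointer the
/// caller asserts (via `unsafe`) came from an IRE lookup, and the refrele
/// then happens exactly once, on drop.
pub struct IreGuard(NonNull<ire_t>);

impl IreGuard {
    /// SAFETY: `ptr` must be a live IRE reference owned by the caller.
    pub unsafe fn from_raw(ptr: NonNull<ire_t>) -> Self {
        Self(ptr)
    }
}

impl Drop for IreGuard {
    fn drop(&mut self) {
        // SAFETY: upheld by the `from_raw` contract above.
        unsafe { ire_refrele(self.0.as_ptr()) };
    }
}
```

That way safe code can't conjure an IRE pointer out of an arbitrary integer, and the release can't be forgotten or doubled.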
/// A simple caching layer over `next_hop`.
#[derive(Clone)]
pub struct RouteCache(Arc<KRwLock<BTreeMap<RouteKey, CachedRoute>>>);
Out of curiosity, did you test a `HashMap` vs a `BTreeMap`?
I'll admit that I hadn't -- I was following the OPTE preference for `BTreeMap`s. But I threw something together quickly in criterion with `RouteKey` and `CachedRoute` to insert/lookup one row into an arbitrarily sized table:
Compiling opte-bench v0.1.0 (/Users/kyle/gits/opte/bench)
Finished bench [optimized + debuginfo] target(s) in 2.92s
Running benches/userland.rs (target/release/deps/userland-57e28d11e88fdd6e)
mapquest/btreemap/get/0 time: [287.02 ps 287.61 ps 288.30 ps]
Found 9 outliers among 100 measurements (9.00%)
5 (5.00%) high mild
4 (4.00%) high severe
mapquest/btreemap/get/2 time: [3.8893 ns 3.9047 ns 3.9191 ns]
Found 2 outliers among 100 measurements (2.00%)
2 (2.00%) high mild
mapquest/btreemap/get/8 time: [8.9178 ns 8.9306 ns 8.9465 ns]
Found 19 outliers among 100 measurements (19.00%)
5 (5.00%) high mild
14 (14.00%) high severe
mapquest/btreemap/get/64
time: [8.1967 ns 8.2355 ns 8.2792 ns]
Found 1 outliers among 100 measurements (1.00%)
1 (1.00%) high severe
mapquest/btreemap/get/512
time: [7.2225 ns 7.2445 ns 7.2702 ns]
Found 3 outliers among 100 measurements (3.00%)
3 (3.00%) high severe
mapquest/btreemap/get/8192
time: [2.8060 ns 2.8106 ns 2.8155 ns]
Found 5 outliers among 100 measurements (5.00%)
1 (1.00%) low mild
4 (4.00%) high mild
mapquest/btreemap/insert/0
time: [53.993 ns 54.513 ns 55.530 ns]
Found 5 outliers among 100 measurements (5.00%)
1 (1.00%) low mild
2 (2.00%) high mild
2 (2.00%) high severe
mapquest/btreemap/insert/2
time: [47.196 ns 47.271 ns 47.339 ns]
Found 3 outliers among 100 measurements (3.00%)
2 (2.00%) low mild
1 (1.00%) high mild
mapquest/btreemap/insert/8
time: [57.601 ns 57.679 ns 57.760 ns]
Found 5 outliers among 100 measurements (5.00%)
5 (5.00%) high mild
mapquest/btreemap/insert/64
time: [290.32 ns 290.57 ns 290.83 ns]
Found 7 outliers among 100 measurements (7.00%)
3 (3.00%) low mild
4 (4.00%) high mild
mapquest/btreemap/insert/512
time: [2.4069 µs 2.4108 µs 2.4145 µs]
Found 1 outliers among 100 measurements (1.00%)
1 (1.00%) high mild
mapquest/btreemap/insert/8192
time: [50.270 µs 50.590 µs 50.858 µs]
mapquest/hashmap/get/0 time: [287.28 ps 288.20 ps 289.29 ps]
Found 17 outliers among 100 measurements (17.00%)
5 (5.00%) high mild
12 (12.00%) high severe
mapquest/hashmap/get/2 time: [22.259 ns 23.395 ns 24.472 ns]
mapquest/hashmap/get/8 time: [21.733 ns 22.932 ns 24.172 ns]
mapquest/hashmap/get/64 time: [22.045 ns 23.297 ns 24.493 ns]
mapquest/hashmap/get/512
time: [19.729 ns 20.846 ns 22.009 ns]
mapquest/hashmap/get/8192
time: [19.731 ns 20.825 ns 21.964 ns]
mapquest/hashmap/insert/0
time: [46.374 ns 46.581 ns 46.796 ns]
Found 2 outliers among 100 measurements (2.00%)
1 (1.00%) low mild
1 (1.00%) high mild
mapquest/hashmap/insert/2
time: [43.214 ns 43.935 ns 44.742 ns]
Found 3 outliers among 100 measurements (3.00%)
3 (3.00%) high mild
mapquest/hashmap/insert/8
time: [66.543 ns 66.619 ns 66.698 ns]
Found 6 outliers among 100 measurements (6.00%)
3 (3.00%) low mild
3 (3.00%) high mild
mapquest/hashmap/insert/64
time: [59.119 ns 60.854 ns 62.838 ns]
mapquest/hashmap/insert/512
time: [295.78 ns 301.83 ns 308.87 ns]
Found 2 outliers among 100 measurements (2.00%)
1 (1.00%) high mild
1 (1.00%) high severe
mapquest/hashmap/insert/8192
time: [231.58 ns 232.38 ns 233.19 ns]
Found 4 outliers among 100 measurements (4.00%)
2 (2.00%) low severe
1 (1.00%) high mild
1 (1.00%) high severe
This sort of tracks with my understanding of why we're using `BTreeMap`s in general throughout OPTE: faster lookup at the cost of more expensive inserts. Perhaps a `HashMap` might work better here: the insert cost scaling up into microseconds around 256-512 entries is not pretty.
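For the curious, a harness of roughly this shape would produce the group/benchmark names above -- this is only a sketch with stand-in key/value types rather than the real `RouteKey`/`CachedRoute`, and it shows the lookup half only (the insert benches would follow the same pattern with `map.insert` in the hot loop):

```rust
use std::collections::{BTreeMap, HashMap};

use criterion::{criterion_group, criterion_main, BenchmarkId, Criterion};

// Stand-ins for `RouteKey` and `CachedRoute`.
type Key = (std::net::Ipv6Addr, u32);
type Value = u64;

fn key_for(i: u32) -> Key {
    (std::net::Ipv6Addr::new(0xfd00, 0, 0, 0, 0, 0, 0, i as u16), i)
}

fn bench_maps(c: &mut Criterion) {
    // Lookup of one key in tables of various sizes, BTreeMap first.
    let mut group = c.benchmark_group("mapquest/btreemap");
    for n in [0u32, 2, 8, 64, 512, 8192] {
        let map: BTreeMap<Key, Value> =
            (0..n).map(|i| (key_for(i), u64::from(i))).collect();
        group.bench_with_input(BenchmarkId::new("get", n), &n, |b, &n| {
            b.iter(|| map.get(&key_for(n / 2)))
        });
    }
    group.finish();

    // Same shape again for HashMap.
    let mut group = c.benchmark_group("mapquest/hashmap");
    for n in [0u32, 2, 8, 64, 512, 8192] {
        let map: HashMap<Key, Value> =
            (0..n).map(|i| (key_for(i), u64::from(i))).collect();
        group.bench_with_input(BenchmarkId::new("get", n), &n, |b, &n| {
            b.iter(|| map.get(&key_for(n / 2)))
        });
    }
    group.finish();
}

criterion_group!(benches, bench_maps);
criterion_main!(benches);
```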
Perhaps a more obvious reason why we use `BTree{Map,Set}` that totally slipped my mind: there is no `HashMap` in `alloc`/`no_std`. So we would need to find and measure a suitable replacement if we wanted to go down that road. In the meantime I think the best call might be to limit the cache size.
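A rough sketch of what "limit the cache size" could look like on the existing `BTreeMap` -- the `MAX_ROUTE_CACHE_ENTRIES` constant is a made-up placeholder, and `std::sync::RwLock` stands in here for the kernel-side `KRwLock`:

```rust
use std::collections::BTreeMap;
use std::sync::RwLock;

// Illustrative stand-ins for the real key/value types.
type RouteKey = (std::net::Ipv6Addr, u32);
#[derive(Clone, Copy)]
struct CachedRoute;

/// Hypothetical cap on cache entries; the real number would come from
/// measurement.
const MAX_ROUTE_CACHE_ENTRIES: usize = 512;

struct RouteCache(RwLock<BTreeMap<RouteKey, CachedRoute>>);

impl RouteCache {
    /// Insert or refresh a route, but never let the table grow past the cap.
    fn store(&self, key: RouteKey, route: CachedRoute) {
        let mut map = self.0.write().unwrap();
        // Refreshing an existing (possibly expired) entry is always fine:
        // it does not change the table size.
        if map.contains_key(&key) || map.len() < MAX_ROUTE_CACHE_ENTRIES {
            map.insert(key, route);
        }
        // Otherwise the table is full and there is no slot to reuse:
        // drop the lock and skip the insert.
    }
}
```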
Sorry I don't have a ton of context here, but do we expect the nexthop cache to have a lot of regular churn?
There shouldn't be in practice -- the above is very much worst-case, and I'm thinking about the case where a guest might have many active flows that are actively refreshing their routes and then adds one (/n) more.
In the general case of a steady number of flows:
- An extant flow will call `next_hop` every 100ms in case preferences or reachability change. If there's an existing entry (even if expired), we should only be paying for a lookup rather than a full insert on the `BTreeMap`.
- Expired routes are cleared out from the map every 1s.
The last part might add to the risk of churn; I'll change it to only evict entries which have been expired for longer than the cleanup period, to make refresh more consistent.
My intuition here is that we want to optimize for queries rather than inserts. And if that intuition is not correct I really want to understand why. If this can be modeled as an LPM query, then using poptrie might be a win.
I think I'm in agreement on that front. Summarising the other comment I made, we can rework this as an LPM query on underlay destination alone (basically removing insert cost), but we'd need to redo how we walk and inspect IREs for a destination.
This also includes more details on the relative costs we're trying to offset, and the future ease of implementing/finding a suitable hashmap.
947bef4 to 1b1b5af
This places 'requesting a new IRE from illumos' and 'removing the allocated slot in the `RouteCache`' onto two different timers. This should allow a flow to keep reusing its existing slot without needing to redo the expensive insert that a removal would incur.
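To make the two timers concrete, a sketch (with `std::time::Instant` standing in for whatever clock source the kernel code actually uses, and `CLEANUP_PERIOD` assumed to be the 1s `Periodic` interval discussed above):

```rust
use std::time::{Duration, Instant};

/// How long a cached answer from `next_hop` is trusted before a flow
/// re-asks illumos for a fresh IRE (value taken from the PR).
const EXPIRE_ROUTE_LIFETIME: Duration = Duration::from_millis(100);

/// Assumed interval of the periodic cleanup task.
const CLEANUP_PERIOD: Duration = Duration::from_secs(1);

struct CachedRoute {
    // ... resolved next hop, underlay source/destination, etc.
    last_refreshed: Instant,
}

impl CachedRoute {
    /// Timer 1: does the flow need to go back to illumos for a new IRE?
    fn is_current(&self, now: Instant) -> bool {
        now.duration_since(self.last_refreshed) < EXPIRE_ROUTE_LIFETIME
    }

    /// Timer 2: has this slot sat expired for so long that the cleanup
    /// task should reclaim it? An active flow refreshes `last_refreshed`
    /// well before this fires, so it keeps its slot and avoids a fresh
    /// `BTreeMap` insert.
    fn should_evict(&self, now: Instant) -> bool {
        now.duration_since(self.last_refreshed)
            >= EXPIRE_ROUTE_LIFETIME + CLEANUP_PERIOD
    }
}
```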
Thanks Kyle. Mostly high-level overall direction comments from my end. Code looks good to me, but let's look into the expiry rollover safety that I noted in the comments below.
// Adjacent to xde is the native IPv6 stack along with its routing
// table. This table is routinely updated to indicate the best path to
// any given IPv6 destination that may be specified in the outer IP
// header. As xde is not utilizing the native IPv6 stack to send out
I wonder if we should revisit our tx-path entry point. I'm not sure if there is an intrinsic reason we need to tx directly to mac, and illumos has its own caching mechanisms like destination cache entries (DCE) that are likely a lot quicker than full route lookups.
That's not to say we should drop this work in favor of that direction, as we're getting a clear win here, but something to think about for sure. There is also an argument to be made in the other direction that OPTE should be as self-sufficient as possible, so lots of tradeoffs to consider.
This is part of what #504 is getting at; we do already need to revisit the tx path as part of getting the NICs back into hardware classification. Although, the way I've been going about that so far is by getting another client handle at the DLS layer, so it's not fundamentally different from `mac_tx`.
Coming at it from the IP level is probably safer still. It looks to me like the easiest way of doing this from an NCE/IRE (using public APIs) is `ip_xmit`, but this works on one packet at a time. There could be something worth copying in how it uses the ILL to perform the actual send?
///
/// Naturally, bringing this down to O(ns) is desirable. Usually, illumos
/// holds `ire_t`s per `conn_t`, but we're aiming to be more fine-grained
/// with DDM -- so we need a tradeoff between 'asking about the best route
I would not index too heavily on what will happen once the DDM data plane lands. There are some questions as to where this is going to fit into the overall data plane - and one possibility is that it goes all the way down to the NIC as an offload.
// counts.
// If full and we have no old entry to update, drop the lock and do
// not insert.
// XXX: Want to profile in future to see if LRU expiry is
I think this is worth an issue, and worth measuring cache performance when there is 1) a pool of destination route keys larger than the cap that is 2) churning within the expiry time. One could easily imagine this scenario arising for a busy web server with lots of clients. I think the reason we have the L4 hash in the key is to pin flows to paths, so expressing this as an LPM to reduce table size would probably be tricky?
I'll add some extra probes around cache events so we can observe cases like this.
The L4 hash in the key captures behaviour we have today, which is that OPTE uses multiple sled-to-sled paths depending on flow hash. We could recast that behaviour into an LPM of O(sleds) entries: if we can query and cache all reachability and preference information for each underlay destination (e.g., over both NIC links), then we can handle distribution over paths ourselves. We could possibly walk `ire->ire_bucket` in the same vein as `ire_round_robin` to fill that out?
I think that would be the right call; it'd be follow-up work that would likely replace large parts of this, so it's up to you on whether we use what we have for now and keep an LPM in mind as the next step (until we can offload, that is 😄).
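Roughly what that recast could look like -- a sketch only, with a hypothetical per-destination `UnderlayPaths` entry that an LPM lookup (or an `ire_bucket` walk) would populate:

```rust
use std::net::Ipv6Addr;

/// Hypothetical resolved path: which underlay NIC/link to use and the
/// next-hop router reached over it.
#[derive(Clone, Copy)]
struct Path {
    underlay_port: u8,
    next_hop: Ipv6Addr,
}

/// All viable paths to one underlay destination, as gathered from the
/// routing table (e.g. one per NIC link).
struct UnderlayPaths {
    paths: Vec<Path>,
}

impl UnderlayPaths {
    /// Pin a flow to one of the cached paths using its L4 hash, so the
    /// cache only needs O(sleds) entries rather than one per
    /// (destination, flow-hash) pair.
    fn select(&self, l4_hash: u32) -> Option<Path> {
        if self.paths.is_empty() {
            return None;
        }
        let idx = l4_hash as usize % self.paths.len();
        self.paths.get(idx).copied()
    }
}
```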
Let's take the win in front of us and continue to consider recasting things to fit in an LPM lookup later on.
Thanks again, #539 opened.
///
/// Expired routes will not be removed from the cache, and will leave
/// an entry to enable a quick in-place refresh in the `BTreeMap`.
const EXPIRE_ROUTE_LIFETIME: Duration = Duration::from_millis(100);
For the moment, this is well within the DDM control plane neighbor expiry lifetime. We'll revisit when the DDM data plane lands.
This PR inserts a caching layer in front of `next_hop` to store routes for a given `(IpAddr6, L4Hash)` for around 100ms, and reorganises routing-related code into `xde::route`. This is a necessary step as today we spend around 21% of `xde_mc_tx` in route lookup on a per-outbound-packet basis. As a consequence, we need to install a new `Periodic` to flush expired entries every 1s, as otherwise we can have a large amount of detritus pile up from different L4Hash values.
Currently this is implemented as a per-port cache rather than shared across the driver -- I've done this because it will give us better sharding of concurrent readers/writers at higher port/guest counts. The PR as-is pulls us up from ~1.74Gbps to 2Gbps on a single link between sn9/14.
I'm seeing a similar gain on the `all-perf-in-flight` branch on in-review illumos bits (which includes mac flows + packet chains), where this change brings us from 2.25 Gbps to 2.50 Gbps on this setup.
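For anyone skimming, the cached lookup amounts to roughly the following -- a sketch with illustrative signatures and `std` types standing in for the kernel lock and address types used by the real `RouteCache` in `xde::route`:

```rust
use std::collections::BTreeMap;
use std::net::Ipv6Addr;
use std::sync::{Arc, RwLock};
use std::time::{Duration, Instant};

// Illustrative stand-ins for the types in `xde::route`.
type RouteKey = (Ipv6Addr, u32); // (outer IPv6 destination, L4 hash)

#[derive(Clone, Copy)]
struct CachedRoute {
    expires_at: Instant,
    // ... resolved underlay source/destination, target port, etc.
}

// Placeholder for the expensive IRE/NCE walk that `next_hop` performs today.
fn expensive_next_hop_lookup(_key: RouteKey) -> CachedRoute {
    CachedRoute {
        expires_at: Instant::now() + Duration::from_millis(100),
    }
}

#[derive(Clone)]
struct RouteCache(Arc<RwLock<BTreeMap<RouteKey, CachedRoute>>>);

impl RouteCache {
    fn next_hop(&self, key: RouteKey) -> CachedRoute {
        let now = Instant::now();
        // Fast path: a read lock and a lookup, which is what most packets
        // on an established flow should hit.
        if let Some(route) = self.0.read().unwrap().get(&key) {
            if now < route.expires_at {
                return *route;
            }
        }
        // Slow path: do the expensive walk, then insert or refresh the
        // slot under the write lock.
        let fresh = expensive_next_hop_lookup(key);
        self.0.write().unwrap().insert(key, fresh);
        fresh
    }
}
```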