perf: use recvmmsg in addition to GRO #2137
Conversation
This change is best summarized by the `process` function signature. On the `main` branch, `process` looks as follows:

```rust
pub fn process(&mut self, dgram: Option<&Datagram>, now: Instant) -> Output {
```

- It takes as **input** an optional reference to a `Datagram`. That `Datagram` owns an allocation of the UDP payload, i.e. a `Vec<u8>`. Thus for each incoming UDP datagram, its payload is allocated in a new `Vec`.
- It returns as **output** an owned `Output`. Most relevantly, the `Output::Datagram(Datagram)` variant contains a `Datagram` that again owns an allocation of the UDP payload, i.e. a `Vec<u8>`. Thus for each outgoing UDP datagram too, its payload is allocated in a new `Vec`.

This commit changes the `process` function to:

```rust
pub fn process_into<'a>(
    &mut self,
    input: Option<Datagram<&[u8]>>,
    now: Instant,
    write_buffer: &'a mut Vec<u8>,
) -> Output<&'a [u8]> {
```

(Note that the rename to `process_into` is temporary.)

- It takes as **input** an optional `Datagram<&[u8]>`. Contrary to before, `Datagram<&[u8]>` does not own an allocation of the UDP payload, but represents a view into a long-lived receive buffer containing the payload.
- It returns as **output** an `Output<&'a [u8]>` whose `Output::Datagram(Datagram<&'a [u8]>)` variant does not own an allocation of the UDP payload either, but likewise represents a view into a long-lived write buffer the payload is written into. That write buffer lives outside of `neqo_transport::Connection` and is provided to `process_into` as `write_buffer: &'a mut Vec<u8>`. Note that both `write_buffer` and `Output` use the lifetime `'a`, i.e. the latter is a view into the former.

This change to the `process` function enables the following:

1. A user of `neqo_transport` (e.g. `neqo_bin`) has the OS write incoming UDP datagrams into a long-lived receive buffer (e.g. via `recvmmsg`).
2. They pass that receive buffer to `neqo_transport::Connection::process` along with a long-lived write buffer.
3. `process` reads the UDP datagram from the long-lived receive buffer through the `Datagram<&[u8]>` view and writes outgoing datagrams into the provided long-lived `write_buffer`, returning a view into said buffer via a `Datagram<&'a [u8]>`.
4. After calling `process`, the user can pass the write buffer to the OS (e.g. via `sendmsg`).

To summarize: a user can receive and send UDP datagrams without any allocation on the UDP I/O path.

As an aside, the above is compatible with GSO and GRO, where a single receive or send buffer contains a consecutive run of UDP datagram segments.
This reverts commit 995c499.
One can just use `process(None, ...)`.
This reverts commit 3df6660.
No need to play with fire (uninitialized memory). Simply initialize the recv buffer once at startup.
QUIC Interop Runner, client vs. server: failed, succeeded, and unsupported interop tests reported for neqo-latest as client and as server (result lists omitted).
Previously we would only do GRO.
Benchmark results: performance differences relative to 55e3a93.

- `coalesce_acked_from_zero 1+1 entries`: 💚 Performance has improved. time: [98.803 ns 99.130 ns 99.472 ns]; change: [-12.775% -12.368% -11.953%] (p = 0.00 < 0.05)
- `coalesce_acked_from_zero 3+1 entries`: 💚 Performance has improved. time: [116.87 ns 117.20 ns 117.56 ns]; change: [-33.461% -33.093% -32.696%] (p = 0.00 < 0.05)
- `coalesce_acked_from_zero 10+1 entries`: 💚 Performance has improved. time: [116.24 ns 116.64 ns 117.13 ns]; change: [-39.896% -35.657% -33.117%] (p = 0.00 < 0.05)
- `coalesce_acked_from_zero 1000+1 entries`: 💚 Performance has improved. time: [97.407 ns 97.529 ns 97.668 ns]; change: [-31.875% -31.315% -30.596%] (p = 0.00 < 0.05)
- `RxStreamOrderer::inbound_frame()`: Change within noise threshold. time: [111.62 ms 111.67 ms 111.72 ms]; change: [+0.1861% +0.2531% +0.3192%] (p = 0.00 < 0.05)
- `transfer/pacing-false/varying-seeds`: No change in performance detected. time: [26.891 ms 27.987 ms 29.092 ms]; change: [-7.3239% -2.0801% +3.3604%] (p = 0.45 > 0.05)
- `transfer/pacing-true/varying-seeds`: No change in performance detected. time: [36.651 ms 38.242 ms 39.826 ms]; change: [-5.9905% -0.1635% +5.8478%] (p = 0.95 > 0.05)
- `transfer/pacing-false/same-seed`: No change in performance detected. time: [26.723 ms 27.507 ms 28.285 ms]; change: [-3.3295% +0.5804% +4.8142%] (p = 0.78 > 0.05)
- `transfer/pacing-true/same-seed`: No change in performance detected. time: [41.198 ms 43.275 ms 45.393 ms]; change: [-7.2694% -1.0401% +5.9147%] (p = 0.75 > 0.05)
- `1-conn/1-100mb-resp` (aka. Download)/client: 💚 Performance has improved. time: [106.99 ms 107.48 ms 108.11 ms]; thrpt: [924.95 MiB/s 930.44 MiB/s 934.71 MiB/s]; change: time [-7.7356% -7.2870% -6.6739%] (p = 0.00 < 0.05), thrpt [+7.1511% +7.8598% +8.3842%]
- `1-conn/10_000-parallel-1b-resp` (aka. RPS)/client: No change in performance detected. time: [319.80 ms 323.00 ms 326.21 ms]; thrpt: [30.655 Kelem/s 30.960 Kelem/s 31.270 Kelem/s]; change: time [-0.7331% +0.8246% +2.4125%] (p = 0.31 > 0.05), thrpt [-2.3557% -0.8178% +0.7385%]
- `1-conn/1-1b-resp` (aka. HPS)/client: 💔 Performance has regressed. time: [36.454 ms 36.648 ms 36.855 ms]; thrpt: [27.134 elem/s 27.286 elem/s 27.432 elem/s]; change: time [+7.7777% +8.6189% +9.3890%] (p = 0.00 < 0.05), thrpt [-8.5831% -7.9350% -7.2165%]

Client/server transfer results: transfer of 33554432 bytes over loopback.
Work tracked in #1693. Closing here.
Previously we would only do GRO.
Depends on #2093.
Part of #1693.
Draft only. Just testing on the benchmark runner for now.