cleanup: remove Bytes8 abstraction #120

AaronO · 2022-08-26T18:31:19Z

Also improves performance, mainly parse_version, we observe a ~10% improvement on req_short:

req_short/req_short     time:   [29.214 ns 29.296 ns 29.391 ns]
                        thrpt:  [2.1548 GiB/s 2.1617 GiB/s 2.1678 GiB/s]
                 change:
                        time:   [-10.968% -9.6366% -7.9937%] (p = 0.00 < 0.05)
                        thrpt:  [+8.6882% +10.664% +12.319%]
                        Performance has improved.

Also improves performance, mainly `parse_version` ~10% improvement on req_short: ``` req_short/req_short time: [29.214 ns 29.296 ns 29.391 ns] thrpt: [2.1548 GiB/s 2.1617 GiB/s 2.1678 GiB/s] change: time: [-10.968% -9.6366% -7.9937%] (p = 0.00 < 0.05) thrpt: [+8.6882% +10.664% +12.319%] Performance has improved. ```

seanmonstar · 2022-08-26T20:23:05Z

Thanks for the PR! That's cool to see such an improvement! I wonder what it is, is it removing the increment on each access? I intuitively would have assumed this would be slower in the longer requests, since it's copying a lot of the bytes into a stack array...

(I also notice that using const generics increases the MSRV significantly.)

AaronO · 2022-08-26T20:51:53Z

Thanks for the PR! That's cool to see such an improvement! I wonder what it is, is it removing the increment on each access? I intuitively would have assumed this would be slower in the longer requests, since it's copying a lot of the bytes into a stack array...

(I also notice that using const generics increases the MSRV significantly.)

I'll tweak it to avoid bumping MSRV (const generics aren't really core).

Regarding performance, the previous code probably had ~7 instructions per byte (bounds checking, modifying pos for each input byte, etc...), with the new code the 8 bytes of input will essentially get loaded into a single 64 bit register and then a single equality check (I believe rust/llvm actually codegen 3 compares here, 1 for the shared prefix of the match arms and then a 1 byte cmp for each remaining branch).

The current code definitely isn't optimal, the PR started as a "drive by" cleanup, I'll possibly do a performance centric follow up PR

To avoid bumping MSRV

AaronO · 2022-08-27T20:09:11Z

I briefly played around with other perf improvements, and have so far achieved a ~20-25% improvement over master (benched on M1 Air, so without SIMD, though neon might be worth exploring).

## master
test req/req ... bench:                     241 ns/iter (+/- 2)
test req_short/req_short ... bench:         32 ns/iter (+/- 1)
test resp/resp ... bench:                   228 ns/iter (+/- 3)
test resp_short/resp_short ... bench:       40 ns/iter (+/- 0)

## PR #120 + other tweaks
test req/req ... bench:                     181 ns/iter (+/- 2)
test req_short/req_short ... bench:         25 ns/iter (+/- 4)
test resp/resp ... bench:                   183 ns/iter (+/- 6)
test resp_short/resp_short ... bench:       32 ns/iter (+/- 0)

(using --output-format=bencher for brevity)

I have a few other ideas for further improvements, happy to throw up PRs if you're interested

AaronO · 2022-08-28T09:18:37Z

Improved throughput by +50-60% relative to master:

req/req                 time:   [147.23 ns 147.42 ns 147.65 ns]
                        thrpt:  [4.5414 GiB/s 4.5487 GiB/s 4.5543 GiB/s]
                 change:
                        time:   [-38.665% -38.497% -38.343%] (p = 0.00 < 0.05)
                        thrpt:  [+62.187% +62.592% +63.039%]
                        Performance has improved.

resp/resp               time:   [153.49 ns 153.92 ns 154.62 ns]
                        thrpt:  [4.2103 GiB/s 4.2295 GiB/s 4.2413 GiB/s]
                 change:
                        time:   [-32.746% -32.537% -32.220%] (p = 0.00 < 0.05)
                        thrpt:  [+47.535% +48.229% +48.691%]
                        Performance has improved.

Don't have too much time today to explore further, but a 50% reduction in time and thus a 2x in throughput might be possible

seanmonstar · 2022-08-29T17:40:33Z

This is phenomenal work, thank you!

bartlomieju · 2022-08-29T17:53:39Z

@seanmonstar we'd love to try this change in our new server that uses httparse, any chance you could release a new version?

seanmonstar · 2022-08-29T17:56:56Z

Yep!

AaronO · 2022-08-29T17:57:53Z

Improved throughput by +50-60% relative to master:

@bartlomieju This doesn't include all the improvements mentioned in the latest comment, those were additional improvements I've implemented locally.

This PR mainly optimizes parse_version, shaving off ~10% off req_short, my local changes optimize header & url parsing (since all the other parts are roughly constant size) and achieves a -40% reduction in time or +60% in throughput.

I've tested further changes that allow me to achieve over 5 GiB/s and roughly 2x faster than the previous master

bartlomieju · 2022-08-29T18:03:51Z

Ah makes sense, @seanmonstar I'm fine waiting another few days for more changes like this!

seanmonstar · 2022-08-30T18:05:06Z

Though, it doesn't cost much to publish a release. I can publish one now, and merge more improvements later. Or if you nearly have them ready, I can hold off, whichever you prefer.

bartlomieju · 2022-08-30T18:20:13Z

@seanmonstar thanks, if that's not a big deal then I'd kindly ask to release now :)

AaronO added 2 commits August 27, 2022 01:30

fix no_std

7c66be1

AaronO added 2 commits August 27, 2022 21:50

avoid const generics

024b58a

To avoid bumping MSRV

peek_n lifetime tweak

40f357e

seanmonstar approved these changes Aug 29, 2022

View reviewed changes

seanmonstar merged commit 0e1e0b0 into seanmonstar:master Aug 29, 2022

AaronO deleted the cleanup/drop-bytes8 branch March 8, 2023 09:33

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

cleanup: remove Bytes8 abstraction #120

cleanup: remove Bytes8 abstraction #120

AaronO commented Aug 26, 2022

seanmonstar commented Aug 26, 2022

AaronO commented Aug 26, 2022

AaronO commented Aug 27, 2022

AaronO commented Aug 28, 2022

seanmonstar commented Aug 29, 2022

bartlomieju commented Aug 29, 2022

seanmonstar commented Aug 29, 2022

AaronO commented Aug 29, 2022

bartlomieju commented Aug 29, 2022

seanmonstar commented Aug 30, 2022

bartlomieju commented Aug 30, 2022

cleanup: remove Bytes8 abstraction #120

cleanup: remove Bytes8 abstraction #120

Conversation

AaronO commented Aug 26, 2022

seanmonstar commented Aug 26, 2022

AaronO commented Aug 26, 2022

AaronO commented Aug 27, 2022

AaronO commented Aug 28, 2022

seanmonstar commented Aug 29, 2022

bartlomieju commented Aug 29, 2022

seanmonstar commented Aug 29, 2022

AaronO commented Aug 29, 2022

bartlomieju commented Aug 29, 2022

seanmonstar commented Aug 30, 2022

bartlomieju commented Aug 30, 2022