Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cleanup: remove Bytes8 abstraction #120

Merged
merged 4 commits into from
Aug 29, 2022

Conversation

AaronO
Copy link
Contributor

@AaronO AaronO commented Aug 26, 2022

Also improves performance, mainly parse_version, we observe a ~10% improvement on req_short:

req_short/req_short     time:   [29.214 ns 29.296 ns 29.391 ns]
                        thrpt:  [2.1548 GiB/s 2.1617 GiB/s 2.1678 GiB/s]
                 change:
                        time:   [-10.968% -9.6366% -7.9937%] (p = 0.00 < 0.05)
                        thrpt:  [+8.6882% +10.664% +12.319%]
                        Performance has improved.

AaronO added 2 commits August 27, 2022 01:30
Also improves performance, mainly `parse_version`

~10% improvement on req_short:
```
req_short/req_short     time:   [29.214 ns 29.296 ns 29.391 ns]
                        thrpt:  [2.1548 GiB/s 2.1617 GiB/s 2.1678 GiB/s]
                 change:
                        time:   [-10.968% -9.6366% -7.9937%] (p = 0.00 < 0.05)
                        thrpt:  [+8.6882% +10.664% +12.319%]
                        Performance has improved.
```
@seanmonstar
Copy link
Owner

Thanks for the PR! That's cool to see such an improvement! I wonder what it is, is it removing the increment on each access? I intuitively would have assumed this would be slower in the longer requests, since it's copying a lot of the bytes into a stack array...

(I also notice that using const generics increases the MSRV significantly.)

@AaronO
Copy link
Contributor Author

AaronO commented Aug 26, 2022

Thanks for the PR! That's cool to see such an improvement! I wonder what it is, is it removing the increment on each access? I intuitively would have assumed this would be slower in the longer requests, since it's copying a lot of the bytes into a stack array...

(I also notice that using const generics increases the MSRV significantly.)

I'll tweak it to avoid bumping MSRV (const generics aren't really core).

Regarding performance, the previous code probably had ~7 instructions per byte (bounds checking, modifying pos for each input byte, etc...), with the new code the 8 bytes of input will essentially get loaded into a single 64 bit register and then a single equality check (I believe rust/llvm actually codegen 3 compares here, 1 for the shared prefix of the match arms and then a 1 byte cmp for each remaining branch).

The current code definitely isn't optimal, the PR started as a "drive by" cleanup, I'll possibly do a performance centric follow up PR

@AaronO
Copy link
Contributor Author

AaronO commented Aug 27, 2022

I briefly played around with other perf improvements, and have so far achieved a ~20-25% improvement over master (benched on M1 Air, so without SIMD, though neon might be worth exploring).

## master
test req/req ... bench:                     241 ns/iter (+/- 2)
test req_short/req_short ... bench:         32 ns/iter (+/- 1)
test resp/resp ... bench:                   228 ns/iter (+/- 3)
test resp_short/resp_short ... bench:       40 ns/iter (+/- 0)

## PR #120 + other tweaks
test req/req ... bench:                     181 ns/iter (+/- 2)
test req_short/req_short ... bench:         25 ns/iter (+/- 4)
test resp/resp ... bench:                   183 ns/iter (+/- 6)
test resp_short/resp_short ... bench:       32 ns/iter (+/- 0)

(using --output-format=bencher for brevity)

I have a few other ideas for further improvements, happy to throw up PRs if you're interested

@AaronO
Copy link
Contributor Author

AaronO commented Aug 28, 2022

Improved throughput by +50-60% relative to master:

req/req                 time:   [147.23 ns 147.42 ns 147.65 ns]
                        thrpt:  [4.5414 GiB/s 4.5487 GiB/s 4.5543 GiB/s]
                 change:
                        time:   [-38.665% -38.497% -38.343%] (p = 0.00 < 0.05)
                        thrpt:  [+62.187% +62.592% +63.039%]
                        Performance has improved.

resp/resp               time:   [153.49 ns 153.92 ns 154.62 ns]
                        thrpt:  [4.2103 GiB/s 4.2295 GiB/s 4.2413 GiB/s]
                 change:
                        time:   [-32.746% -32.537% -32.220%] (p = 0.00 < 0.05)
                        thrpt:  [+47.535% +48.229% +48.691%]
                        Performance has improved.

Don't have too much time today to explore further, but a 50% reduction in time and thus a 2x in throughput might be possible

@seanmonstar
Copy link
Owner

This is phenomenal work, thank you!

@seanmonstar seanmonstar merged commit 0e1e0b0 into seanmonstar:master Aug 29, 2022
@bartlomieju
Copy link

@seanmonstar we'd love to try this change in our new server that uses httparse, any chance you could release a new version?

@seanmonstar
Copy link
Owner

Yep!

@AaronO
Copy link
Contributor Author

AaronO commented Aug 29, 2022

Improved throughput by +50-60% relative to master:

@bartlomieju This doesn't include all the improvements mentioned in the latest comment, those were additional improvements I've implemented locally.

This PR mainly optimizes parse_version, shaving off ~10% off req_short, my local changes optimize header & url parsing (since all the other parts are roughly constant size) and achieves a -40% reduction in time or +60% in throughput.

I've tested further changes that allow me to achieve over 5 GiB/s and roughly 2x faster than the previous master

@bartlomieju
Copy link

Ah makes sense, @seanmonstar I'm fine waiting another few days for more changes like this!

@seanmonstar
Copy link
Owner

Though, it doesn't cost much to publish a release. I can publish one now, and merge more improvements later. Or if you nearly have them ready, I can hold off, whichever you prefer.

@bartlomieju
Copy link

@seanmonstar thanks, if that's not a big deal then I'd kindly ask to release now :)

@AaronO AaronO deleted the cleanup/drop-bytes8 branch March 8, 2023 09:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants