Remove simd and avx512 bitwise kernels in favor of autovectorization #1830
Conversation
…ly slower than the autovectorized version
Codecov Report
@@            Coverage Diff             @@
##           master    #1830      +/-   ##
==========================================
- Coverage   83.53%   83.49%   -0.05%
==========================================
  Files         200      201       +1
  Lines       56798    56902     +104
==========================================
+ Hits        47449    47511      +62
- Misses       9349     9391      +42
I plan to run these benchmarks later today on a server-class machine and confirm there is no performance delta.
On an Intel Cascade Lake Xeon(R) CPU @ 3.10GHz, specifically a GCP c2-standard-16, I benchmarked the following configurations (a small probe to verify what each flag combination enables is sketched after the list):
Nightly with defaults
Nightly with simd
Nightly with avx512
Nightly with defaults and RUSTFLAGS="-Ctarget-cpu=native"
Nightly with simd and RUSTFLAGS="-Ctarget-cpu=native"
Nightly with avx512 and RUSTFLAGS="-Ctarget-cpu=native"
Nightly with defaults and RUSTFLAGS="-Ctarget-cpu=native -Ctarget-feature=-prefer-256-bit"
Nightly with simd and RUSTFLAGS="-Ctarget-cpu=native -Ctarget-feature=-prefer-256-bit"
Nightly with avx512 and RUSTFLAGS="-Ctarget-cpu=native -Ctarget-feature=-prefer-256-bit"
Stable with defaults and RUSTFLAGS="-Ctarget-cpu=native"
Stable with defaults and RUSTFLAGS="-Ctarget-cpu=native -Ctarget-feature=-prefer-256-bit"
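As a sanity check for the matrix above, a small probe along these lines (a sketch, not part of the PR) can confirm which vector extensions each RUSTFLAGS combination actually enables at compile time, and what the CPU reports at runtime:

```rust
// Sketch: compile this once per RUSTFLAGS combination from the list
// above and compare compile-time feature flags with runtime detection.
fn main() {
    #[cfg(target_arch = "x86_64")]
    {
        println!("compile-time sse2: {}", cfg!(target_feature = "sse2"));
        println!("compile-time avx2: {}", cfg!(target_feature = "avx2"));
        println!("runtime avx2:      {}", std::is_x86_feature_detected!("avx2"));
    }
    #[cfg(target_arch = "aarch64")]
    println!("compile-time neon: {}", cfg!(target_feature = "neon"));
}
```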
Conclusion
Perhaps we should call out the recommended RUSTFLAGS somewhere if we don't already.
The benchmarks look great. I ran some internal benchmarks with simd disabled and the RUSTFLAGS set, and I don't see a performance downgrade.
One question: RUSTFLAGS="-Ctarget-cpu=xxx" optimizes for a specific CPU model, but the existing simd feature doesn't require specifying a CPU. Compared with the existing simd feature, are the RUSTFLAGS less general?
Thanks for reproducing the results! It's interesting that you also seem to see the effect of the second benchmark function being slower. I don't think it's related to downclocking; on my machine the frequency looked constant, and Cascade Lake shouldn't downclock either. Pinning the process to one core also did not make a difference. But I assume this is an effect that will only be visible in a microbenchmark and won't affect larger programs.
The simd feature also benefits from specifying the target CPU. There is probably some additional compiler flag that would make the standard kernel as fast as the simd kernel for the default target as well. We probably can't expect everyone to set several flags, though.
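For context on the generality question: without -Ctarget-cpu, a binary can still use newer instructions via runtime detection. A minimal sketch of that pattern (illustrative names, not arrow-rs APIs):

```rust
// Sketch of runtime dispatch: the AVX2 path is compiled into every
// binary and only taken when the host CPU supports it, so no
// -Ctarget-cpu flag is needed to stay portable.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn and_avx2(left: &[u8], right: &[u8], out: &mut [u8]) {
    // The plain loop autovectorizes with AVX2 because of the
    // #[target_feature] attribute on this function alone.
    for ((o, l), r) in out.iter_mut().zip(left).zip(right) {
        *o = l & r;
    }
}

pub fn and_bytes(left: &[u8], right: &[u8], out: &mut [u8]) {
    #[cfg(target_arch = "x86_64")]
    if std::is_x86_feature_detected!("avx2") {
        // SAFETY: the runtime check above proves AVX2 is available.
        unsafe { and_avx2(left, right, out) };
        return;
    }
    // Portable fallback; still autovectorizes for the baseline target.
    for ((o, l), r) in out.iter_mut().zip(left).zip(right) {
        *o = l & r;
    }
}
```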
Would someone be able to also benchmark this on ARM, for example on an Apple M1 or AWS Graviton, to avoid any regressions there? I'm assuming that someone who cares about performance enough to enable the simd feature would also be willing to set RUSTFLAGS.
I'll run a benchmark and report back in an hour or two.
M1 Pro
A bit hard to interpret the results, but what I can see is that stable with a CPU target flag isn't much better than without. That could make sense, as there shouldn't be (m)any differences between the ARMv8 CPUs that are supported, unlike x64 where there are all sorts of SIMD versions. Of the 6 results, 5 are fastest on stable, but not by a large margin compared to the other options, unlike what we saw with x64 in @tustvold's results. Seeing as there's no regression, I'll merge this. Thanks @jhorstmann!
let mut result = MutableBuffer::new(len).with_bitset(len, false);
let lanes = u8x64::lanes();

let mut left_chunks = left.as_slice()[left_offset..].chunks_exact(lanes);
This was actually also buggy and should have been sliced using [left_offset..left_offset+len]. The effect was that if the buffer was larger than len, the remainders would not line up.
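A minimal standalone repro of that misalignment (illustrative values; `lanes` stands in for `u8x64::lanes()`):

```rust
fn main() {
    // Buffer longer than the logical length `len`, as described above.
    let left = vec![0u8; 70];
    let left_offset = 1;
    let len = 64;
    let lanes = 64; // u8x64::lanes()

    // Buggy: the slice keeps trailing bytes past `len`, so
    // chunks_exact leaves a 5-byte remainder.
    let buggy = left[left_offset..].chunks_exact(lanes);
    // Fixed: the slice is bounded to `len`, so the remainder is empty
    // and lines up with the other side of the kernel.
    let fixed = left[left_offset..left_offset + len].chunks_exact(lanes);

    assert_eq!(buggy.remainder().len(), 5);
    assert_eq!(fixed.remainder().len(), 0);
}
```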
Which issue does this PR close?
Closes #1829.
Rationale for this change
The autovectorized implementation is actually faster, allowing us to simplify the buffer code. Benchmark results are in the linked issue.
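For illustration, a sketch of the autovectorized style this PR favors, assuming plain slices rather than the crate's MutableBuffer (not the exact arrow-rs kernel):

```rust
// A plain byte loop that LLVM vectorizes on its own, with no
// packed_simd or AVX-512 intrinsics required.
pub fn bitwise_and(left: &[u8], right: &[u8]) -> Vec<u8> {
    assert_eq!(left.len(), right.len());
    left.iter().zip(right).map(|(l, r)| l & r).collect()
}
```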
What changes are included in this PR?
Are there any user-facing changes?
Removal of the avx512 feature could be a breaking change if someone had it enabled.