Port SpanHelpers.SequenceCompareTo(ref byte, int, ref byte, int) to Vector128/256 #73475
For Arm64 we have a roughly 20% improvement, mostly because this code had not been optimized for Arm64.
BenchmarkDotNet=v0.13.1.1828-nightly, OS=ubuntu 20.04
Unknown processor
.NET SDK=7.0.100-rc.1.22405.1
[Host] : .NET 7.0.0 (7.0.22.40308), Arm64 RyuJIT AdvSIMD
Job-PHNJLQ : .NET 7.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD
Job-FYBYWI : .NET 7.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD
For x64 the performance is on par for both AVX2 and AVX.
BenchmarkDotNet=v0.13.1.1828-nightly, OS=Windows 11 (10.0.22000.795/21H2)
AMD Ryzen Threadripper PRO 3945WX 12-Cores, 1 CPU, 24 logical and 12 physical cores
.NET SDK=7.0.100-preview.7.22377.5
[Host] : .NET 7.0.0 (7.0.22.37506), X64 RyuJIT AVX2
Job-ZYEZPW : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT AVX
Job-DJUBNM : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT AVX
EnvironmentVariables=COMPlus_EnableAVX2=0
contributes to #64451
@tannergooding @EgorBo sorry to bother you guys, but it would be really nice to get this merged in 7 ;)
     {
         // All matched
         offset += (nuint)Vector128<byte>.Count;
         continue;
     }

    -goto Difference;
    +goto BytewiseCheck;
Isn't this replacing the vectorized detection of which element in the vector differed with a linear walk through all bytes in the vector? If so, did you validate the perf impact of this on inputs smaller than 512 elements? I'm surprised this wouldn't result in regressions.
@stephentoub please excuse me for the delay. That is true, as the `Vector128` code path is also executed for arm64, where `ExtractMostSignificantBits` is expensive. I've used `BytewiseCheck`, which was so far used by the `Vector<T>` code path, and its perf is OK.
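To make the trade-off discussed here concrete, the following is a minimal sketch (not the code from this PR) of two ways to locate the first differing byte once a 16-byte block is known to contain a mismatch. The class and method names (`FirstDifferenceSketch`, `MaskBased`, `Bytewise`) are hypothetical; only `Vector128.Equals`, `ExtractMostSignificantBits` and `BitOperations.TrailingZeroCount` are real .NET 7+ APIs. Both methods assume each span holds at least `Vector128<byte>.Count` bytes.

```csharp
using System;
using System.Numerics;
using System.Runtime.Intrinsics;

// Hypothetical helper, for illustration only.
public static class FirstDifferenceSketch
{
    // Mask-based lookup: compare all 16 lanes at once, collapse the result
    // into a 16-bit mask, then find the lowest bit that marks a mismatch.
    public static int MaskBased(ReadOnlySpan<byte> a, ReadOnlySpan<byte> b)
    {
        Vector128<byte> equal = Vector128.Equals(Vector128.Create(a), Vector128.Create(b));
        uint matches = equal.ExtractMostSignificantBits(); // bit i set => byte i matched
        uint differing = ~matches & 0xFFFFu;               // keep only the 16 lane bits
        return differing == 0 ? -1 : BitOperations.TrailingZeroCount(differing);
    }

    // Bytewise lookup: a linear walk over the same 16 bytes.
    public static int Bytewise(ReadOnlySpan<byte> a, ReadOnlySpan<byte> b)
    {
        for (int i = 0; i < Vector128<byte>.Count; i++)
        {
            if (a[i] != b[i])
            {
                return i;
            }
        }
        return -1;
    }
}
```

Both methods return the same index; which one is faster depends on how cheaply the target hardware can produce the lane mask, which is the point raised about arm64 above.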
I've added benchmarks for smaller collection sizes, synced the fork and re-ran them.
x64 AVX2 (Vector256)
It's more or less on par.
BenchmarkDotNet=v0.13.2.1937-nightly, OS=Windows 11 (10.0.22000.978/21H2)
AMD Ryzen Threadripper PRO 3945WX 12-Cores, 1 CPU, 24 logical and 12 physical cores
.NET SDK=7.0.100-rtm.22506.1
[Host] : .NET 7.0.0 (7.0.22.48010), X64 RyuJIT AVX2
main : .NET 8.0.0 (42.42.42.42424), X64 RyuJIT AVX2
pr : .NET 8.0.0 (42.42.42.42424), X64 RyuJIT AVX2
Method | Job | Toolchain | Size | Mean | Ratio |
---|---|---|---|---|---|
SequenceCompareTo | main | \main\corerun.exe | 8 | 7.522 ns | 1.00 |
SequenceCompareTo | pr | \prSync\corerun.exe | 8 | 7.461 ns | 0.99 |
SequenceCompareToDifferent | main | \main\corerun.exe | 8 | 4.344 ns | 1.00 |
SequenceCompareToDifferent | pr | \prSync\corerun.exe | 8 | 4.230 ns | 0.97 |
SequenceCompareTo | main | \main\corerun.exe | 32 | 3.889 ns | 1.00 |
SequenceCompareTo | pr | \prSync\corerun.exe | 32 | 3.869 ns | 0.99 |
SequenceCompareToDifferent | main | \main\corerun.exe | 32 | 4.651 ns | 1.00 |
SequenceCompareToDifferent | pr | \prSync\corerun.exe | 32 | 4.603 ns | 0.99 |
SequenceCompareTo | main | \main\corerun.exe | 64 | 4.426 ns | 1.00 |
SequenceCompareTo | pr | \prSync\corerun.exe | 64 | 4.446 ns | 1.00 |
SequenceCompareToDifferent | main | \main\corerun.exe | 64 | 4.314 ns | 1.00 |
SequenceCompareToDifferent | pr | \prSync\corerun.exe | 64 | 4.332 ns | 1.00 |
SequenceCompareTo | main | \main\corerun.exe | 128 | 5.506 ns | 1.00 |
SequenceCompareTo | pr | \prSync\corerun.exe | 128 | 5.558 ns | 1.01 |
SequenceCompareToDifferent | main | \main\corerun.exe | 128 | 4.337 ns | 1.00 |
SequenceCompareToDifferent | pr | \prSync\corerun.exe | 128 | 4.321 ns | 1.00 |
SequenceCompareTo | main | \main\corerun.exe | 512 | 11.758 ns | 1.00 |
SequenceCompareTo | pr | \prSync\corerun.exe | 512 | 11.694 ns | 0.99 |
SequenceCompareToDifferent | main | \main\corerun.exe | 512 | 4.324 ns | 1.00 |
SequenceCompareToDifferent | pr | \prSync\corerun.exe | 512 | 4.321 ns | 1.00 |
x64 AVX (Vector128)
It's 2-3% slower, but that translates to only about 0.2 ns.
BenchmarkDotNet=v0.13.2.1937-nightly, OS=Windows 11 (10.0.22000.978/21H2)
AMD Ryzen Threadripper PRO 3945WX 12-Cores, 1 CPU, 24 logical and 12 physical cores
.NET SDK=7.0.100-rtm.22506.1
[Host] : .NET 7.0.0 (7.0.22.48010), X64 RyuJIT AVX2
main : .NET 8.0.0 (42.42.42.42424), X64 RyuJIT AVX
pr : .NET 8.0.0 (42.42.42.42424), X64 RyuJIT AVX
EnvironmentVariables=COMPlus_EnableAVX2=0
Method | Job | Size | Mean | Ratio |
---|---|---|---|---|
SequenceCompareTo | main | 8 | 7.058 ns | 1.00 |
SequenceCompareTo | pr | 8 | 6.696 ns | 0.95 |
SequenceCompareToDifferent | main | 8 | 4.088 ns | 1.00 |
SequenceCompareToDifferent | pr | 8 | 4.099 ns | 1.00 |
SequenceCompareTo | main | 32 | 4.343 ns | 1.00 |
SequenceCompareTo | pr | 32 | 4.273 ns | 0.98 |
SequenceCompareToDifferent | main | 32 | 4.323 ns | 1.00 |
SequenceCompareToDifferent | pr | 32 | 4.414 ns | 1.02 |
SequenceCompareTo | main | 64 | 5.322 ns | 1.00 |
SequenceCompareTo | pr | 64 | 5.383 ns | 1.01 |
SequenceCompareToDifferent | main | 64 | 4.316 ns | 1.00 |
SequenceCompareToDifferent | pr | 64 | 4.444 ns | 1.03 |
SequenceCompareTo | main | 128 | 6.987 ns | 1.00 |
SequenceCompareTo | pr | 128 | 7.229 ns | 1.03 |
SequenceCompareToDifferent | main | 128 | 4.307 ns | 1.00 |
SequenceCompareToDifferent | pr | 128 | 4.493 ns | 1.04 |
SequenceCompareTo | main | 512 | 18.724 ns | 1.00 |
SequenceCompareTo | pr | 512 | 19.032 ns | 1.02 |
SequenceCompareToDifferent | main | 512 | 4.302 ns | 1.00 |
SequenceCompareToDifferent | pr | 512 | 4.451 ns | 1.03 |
Arm64 AdvSimd (Vector128)
For cases where the inputs are different the perf remains the same, but we can observe a nice boost for equal inputs of 32 or more elements (from about 20% up to roughly 3x).
BenchmarkDotNet=v0.13.2.1937-nightly, OS=ubuntu 20.04
Unknown processor
.NET SDK=7.0.100-rtm.22506.1
[Host] : .NET 7.0.0 (7.0.22.48010), Arm64 RyuJIT AdvSIMD
main : .NET 8.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD
pr : .NET 8.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD
LaunchCount=9 MemoryRandomization=True
Method | Job | Size | Median |
---|---|---|---|
SequenceCompareTo | main | 8 | 6.551 ns |
SequenceCompareTo | pr | 8 | 6.547 ns |
SequenceCompareToDifferent | main | 8 | 2.696 ns |
SequenceCompareToDifferent | pr | 8 | 2.696 ns |
SequenceCompareTo | main | 32 | 12.399 ns |
SequenceCompareTo | pr | 32 | 3.851 ns |
SequenceCompareToDifferent | main | 32 | 2.697 ns |
SequenceCompareToDifferent | pr | 32 | 2.696 ns |
SequenceCompareTo | main | 64 | 15.863 ns |
SequenceCompareTo | pr | 64 | 5.776 ns |
SequenceCompareToDifferent | main | 64 | 4.090 ns |
SequenceCompareToDifferent | pr | 64 | 3.274 ns |
SequenceCompareTo | main | 128 | 23.466 ns |
SequenceCompareTo | pr | 128 | 12.219 ns |
SequenceCompareToDifferent | main | 128 | 2.696 ns |
SequenceCompareToDifferent | pr | 128 | 2.696 ns |
SequenceCompareTo | main | 512 | 59.817 ns |
SequenceCompareTo | pr | 512 | 48.528 ns |
SequenceCompareToDifferent | main | 512 | 2.698 ns |
SequenceCompareToDifferent | pr | 512 | 2.696 ns |
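As a quick sanity check on the speedup range for equal inputs, the `main`/`pr` median ratios for `SequenceCompareTo` in the Arm64 table above work out to approximately:

$$
\begin{aligned}
\text{Size } 32:&\quad 12.399 / 3.851 \approx 3.22 \\
\text{Size } 64:&\quad 15.863 / 5.776 \approx 2.75 \\
\text{Size } 128:&\quad 23.466 / 12.219 \approx 1.92 \\
\text{Size } 512:&\quad 59.817 / 48.528 \approx 1.23
\end{aligned}
$$

Size 8 is effectively unchanged (6.551 ns vs 6.547 ns), presumably because such inputs are shorter than one `Vector128<byte>` and never reach the vectorized path.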