Port SpanHelpers.SequenceCompareTo(ref byte, int, ref byte, int) to Vector128/256 #73475

Merged
merged 7 commits into from
Oct 10, 2022

Conversation

adamsitnik
Member

For Arm64 we see roughly a 20% improvement, mostly because this code had not previously been optimized for Arm64.

BenchmarkDotNet=v0.13.1.1828-nightly, OS=ubuntu 20.04
Unknown processor
.NET SDK=7.0.100-rc.1.22405.1
  [Host]     : .NET 7.0.0 (7.0.22.40308), Arm64 RyuJIT AdvSIMD
  Job-PHNJLQ : .NET 7.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD
  Job-FYBYWI : .NET 7.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD
| Method | Toolchain | Size | Mean | Ratio |
|---|---|---:|---:|---:|
| SequenceCompareTo | /PR/corerun | 512 | 43.273 ns | 0.83 |
| SequenceCompareTo | /main/corerun | 512 | 52.073 ns | 1.00 |
| SequenceCompareToDifferent | /PR/corerun | 512 | 5.393 ns | 0.82 |
| SequenceCompareToDifferent | /main/corerun | 512 | 6.547 ns | 1.00 |

For x64 the performance is on par for both AVX2 and AVX.

BenchmarkDotNet=v0.13.1.1828-nightly, OS=Windows 11 (10.0.22000.795/21H2)
AMD Ryzen Threadripper PRO 3945WX 12-Cores, 1 CPU, 24 logical and 12 physical cores
.NET SDK=7.0.100-preview.7.22377.5
  [Host]     : .NET 7.0.0 (7.0.22.37506), X64 RyuJIT AVX2
  Job-ZYEZPW : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT AVX
  Job-DJUBNM : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT AVX

EnvironmentVariables=COMPlus_EnableAVX2=0
| Method | Toolchain | Size | Mean | Ratio |
|---|---|---:|---:|---:|
| SequenceCompareTo | \PR\corerun.exe | 512 | 20.099 ns | 1.03 |
| SequenceCompareTo | \baseline\corerun.exe | 512 | 19.485 ns | 1.00 |
| SequenceCompareToDifferent | \7.0.0\corerun.exe | 512 | 4.565 ns | 0.98 |
| SequenceCompareToDifferent | \baseline\corerun.exe | 512 | 4.671 ns | 1.00 |

contributes to #64451
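
For readers unfamiliar with the API being measured, here is a minimal sketch of what the public `MemoryExtensions.SequenceCompareTo` extension returns for byte spans. The class name `SequenceCompareToSketch`, the helper `CompareSign`, and the buffer contents are hypothetical and chosen only for illustration; they are not the inputs used by the benchmarks above.

```csharp
using System;

public static class SequenceCompareToSketch
{
    // Hypothetical helper: returns -1, 0 or 1 depending on how `left`
    // orders relative to `right` when compared byte by byte.
    public static int CompareSign(byte[] left, byte[] right)
    {
        ReadOnlySpan<byte> a = left;
        ReadOnlySpan<byte> b = right;

        // SequenceCompareTo returns a negative, zero or positive value;
        // only the sign is meaningful for ordering, so normalize it.
        return Math.Sign(a.SequenceCompareTo(b));
    }

    public static void Main()
    {
        byte[] x = new byte[512];
        byte[] y = new byte[512];

        // Identical contents: the whole input has to be scanned.
        Console.WriteLine(CompareSign(x, y));

        // A single differing byte: the scan can stop at that position.
        y[256] = 1;
        Console.WriteLine(CompareSign(x, y));
    }
}
```

The two calls in `Main` loosely mirror the two benchmark shapes (equal inputs versus inputs that differ), though where the real benchmark places the difference is not stated here.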

@adamsitnik adamsitnik added area-System.Memory tenet-performance Performance related issue labels Aug 5, 2022
@adamsitnik adamsitnik added this to the 7.0.0 milestone Aug 5, 2022
@ghost ghost assigned adamsitnik Aug 5, 2022

ghost commented Aug 5, 2022

Tagging subscribers to this area: @dotnet/area-system-memory
See info in area-owners.md if you want to be subscribed.

@adamsitnik
Member Author

@tannergooding @EgorBo sorry to bother you guys, but it would be really nice to get this merged in 7 ;)

  {
      // All matched
      offset += (nuint)Vector128<byte>.Count;
      continue;
  }

- goto Difference;
+ goto BytewiseCheck;
Member

Isn't this replacing the vectorized detection of which element in the vector differed with a linear walk through all bytes in the vector? If so, did you validate the perf impact of this on inputs smaller than 512 elements? I'm surprised this wouldn't result in regressions.

Member Author

@stephentoub please excuse the delay. That is true: the Vector128 code path is also executed on Arm64, where ExtractMostSignificantBits is expensive. I've used BytewiseCheck, which until now was used only by the Vector<T> code path, and its perf is OK.
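
To make the trade-off concrete, here is a rough sketch of the two ways of locating the first differing byte inside a single 16-byte block. It is an illustration written against the public `Vector128` API in .NET 7 or later, not the code from this PR; the class and method names (`FirstDifferenceSketch`, `ViaMask`, `ViaBytewiseWalk`) are hypothetical.

```csharp
using System;
using System.Numerics;
using System.Runtime.Intrinsics;

public static class FirstDifferenceSketch
{
    // Mask-based detection: compare all 16 lanes at once, collapse the
    // result into a bitmask and find the lowest lane that did not match.
    // Returns -1 when the two vectors are identical.
    public static int ViaMask(Vector128<byte> left, Vector128<byte> right)
    {
        uint matches = Vector128.Equals(left, right).ExtractMostSignificantBits();
        uint differences = ~matches & 0xFFFFu; // keep only the 16 lane bits
        return differences == 0 ? -1 : BitOperations.TrailingZeroCount(differences);
    }

    // Bytewise detection: a linear walk over the 16 bytes of the block.
    // Returns -1 when the two vectors are identical.
    public static int ViaBytewiseWalk(Vector128<byte> left, Vector128<byte> right)
    {
        for (int i = 0; i < Vector128<byte>.Count; i++)
        {
            if (left.GetElement(i) != right.GetElement(i))
            {
                return i;
            }
        }

        return -1;
    }

    public static void Main()
    {
        Vector128<byte> a = Vector128.Create((byte)7);
        Vector128<byte> b = a.WithElement(5, (byte)9); // differ in one lane only

        // Both strategies should report the same lane index.
        Console.WriteLine(ViaMask(a, b));
        Console.WriteLine(ViaBytewiseWalk(a, b));
    }
}
```

Both functions give the same answer; they differ only in cost. The mask-based version depends on how cheaply the target hardware can produce the bitmask, which is the concern raised for Arm64 above, while the walk touches at most 16 bytes and only runs once a mismatching block has already been found.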

I've added benchmarks for smaller collection sizes, synced the fork and re-run them.

x64 AVX2 (Vector256)

It's more or less on par.

BenchmarkDotNet=v0.13.2.1937-nightly, OS=Windows 11 (10.0.22000.978/21H2)
AMD Ryzen Threadripper PRO 3945WX 12-Cores, 1 CPU, 24 logical and 12 physical cores
.NET SDK=7.0.100-rtm.22506.1
  [Host]     : .NET 7.0.0 (7.0.22.48010), X64 RyuJIT AVX2
        main : .NET 8.0.0 (42.42.42.42424), X64 RyuJIT AVX2
          pr : .NET 8.0.0 (42.42.42.42424), X64 RyuJIT AVX2
| Method | Job | Toolchain | Size | Mean | Ratio |
|---|---|---|---:|---:|---:|
| SequenceCompareTo | main | \main\corerun.exe | 8 | 7.522 ns | 1.00 |
| SequenceCompareTo | pr | \prSync\corerun.exe | 8 | 7.461 ns | 0.99 |
| SequenceCompareToDifferent | main | \main\corerun.exe | 8 | 4.344 ns | 1.00 |
| SequenceCompareToDifferent | pr | \prSync\corerun.exe | 8 | 4.230 ns | 0.97 |
| SequenceCompareTo | main | \main\corerun.exe | 32 | 3.889 ns | 1.00 |
| SequenceCompareTo | pr | \prSync\corerun.exe | 32 | 3.869 ns | 0.99 |
| SequenceCompareToDifferent | main | \main\corerun.exe | 32 | 4.651 ns | 1.00 |
| SequenceCompareToDifferent | pr | \prSync\corerun.exe | 32 | 4.603 ns | 0.99 |
| SequenceCompareTo | main | \main\corerun.exe | 64 | 4.426 ns | 1.00 |
| SequenceCompareTo | pr | \prSync\corerun.exe | 64 | 4.446 ns | 1.00 |
| SequenceCompareToDifferent | main | \main\corerun.exe | 64 | 4.314 ns | 1.00 |
| SequenceCompareToDifferent | pr | \prSync\corerun.exe | 64 | 4.332 ns | 1.00 |
| SequenceCompareTo | main | \main\corerun.exe | 128 | 5.506 ns | 1.00 |
| SequenceCompareTo | pr | \prSync\corerun.exe | 128 | 5.558 ns | 1.01 |
| SequenceCompareToDifferent | main | \main\corerun.exe | 128 | 4.337 ns | 1.00 |
| SequenceCompareToDifferent | pr | \prSync\corerun.exe | 128 | 4.321 ns | 1.00 |
| SequenceCompareTo | main | \main\corerun.exe | 512 | 11.758 ns | 1.00 |
| SequenceCompareTo | pr | \prSync\corerun.exe | 512 | 11.694 ns | 0.99 |
| SequenceCompareToDifferent | main | \main\corerun.exe | 512 | 4.324 ns | 1.00 |
| SequenceCompareToDifferent | pr | \prSync\corerun.exe | 512 | 4.321 ns | 1.00 |

x64 AVX (Vector128)

It's 2-3% slower, but that translates to only about 0.2 ns.

BenchmarkDotNet=v0.13.2.1937-nightly, OS=Windows 11 (10.0.22000.978/21H2)
AMD Ryzen Threadripper PRO 3945WX 12-Cores, 1 CPU, 24 logical and 12 physical cores
.NET SDK=7.0.100-rtm.22506.1
  [Host]     : .NET 7.0.0 (7.0.22.48010), X64 RyuJIT AVX2
        main : .NET 8.0.0 (42.42.42.42424), X64 RyuJIT AVX
          pr : .NET 8.0.0 (42.42.42.42424), X64 RyuJIT AVX

EnvironmentVariables=COMPlus_EnableAVX2=0
| Method | Job | Size | Mean | Ratio |
|---|---|---:|---:|---:|
| SequenceCompareTo | main | 8 | 7.058 ns | 1.00 |
| SequenceCompareTo | pr | 8 | 6.696 ns | 0.95 |
| SequenceCompareToDifferent | main | 8 | 4.088 ns | 1.00 |
| SequenceCompareToDifferent | pr | 8 | 4.099 ns | 1.00 |
| SequenceCompareTo | main | 32 | 4.343 ns | 1.00 |
| SequenceCompareTo | pr | 32 | 4.273 ns | 0.98 |
| SequenceCompareToDifferent | main | 32 | 4.323 ns | 1.00 |
| SequenceCompareToDifferent | pr | 32 | 4.414 ns | 1.02 |
| SequenceCompareTo | main | 64 | 5.322 ns | 1.00 |
| SequenceCompareTo | pr | 64 | 5.383 ns | 1.01 |
| SequenceCompareToDifferent | main | 64 | 4.316 ns | 1.00 |
| SequenceCompareToDifferent | pr | 64 | 4.444 ns | 1.03 |
| SequenceCompareTo | main | 128 | 6.987 ns | 1.00 |
| SequenceCompareTo | pr | 128 | 7.229 ns | 1.03 |
| SequenceCompareToDifferent | main | 128 | 4.307 ns | 1.00 |
| SequenceCompareToDifferent | pr | 128 | 4.493 ns | 1.04 |
| SequenceCompareTo | main | 512 | 18.724 ns | 1.00 |
| SequenceCompareTo | pr | 512 | 19.032 ns | 1.02 |
| SequenceCompareToDifferent | main | 512 | 4.302 ns | 1.00 |
| SequenceCompareToDifferent | pr | 512 | 4.451 ns | 1.03 |

Arm64 AdvSimd (Vector128)

For cases where the inputs are different the perf remains the same, but we can observe a nice boost for equal inputs of 32 or more elements (from roughly 20% up to even 3x).

BenchmarkDotNet=v0.13.2.1937-nightly, OS=ubuntu 20.04
Unknown processor
.NET SDK=7.0.100-rtm.22506.1
  [Host]     : .NET 7.0.0 (7.0.22.48010), Arm64 RyuJIT AdvSIMD
        main : .NET 8.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD
          pr : .NET 8.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD

LaunchCount=9 MemoryRandomization=True
| Method | Job | Size | Median |
|---|---|---:|---:|
| SequenceCompareTo | main | 8 | 6.551 ns |
| SequenceCompareTo | pr | 8 | 6.547 ns |
| SequenceCompareToDifferent | main | 8 | 2.696 ns |
| SequenceCompareToDifferent | pr | 8 | 2.696 ns |
| SequenceCompareTo | main | 32 | 12.399 ns |
| SequenceCompareTo | pr | 32 | 3.851 ns |
| SequenceCompareToDifferent | main | 32 | 2.697 ns |
| SequenceCompareToDifferent | pr | 32 | 2.696 ns |
| SequenceCompareTo | main | 64 | 15.863 ns |
| SequenceCompareTo | pr | 64 | 5.776 ns |
| SequenceCompareToDifferent | main | 64 | 4.090 ns |
| SequenceCompareToDifferent | pr | 64 | 3.274 ns |
| SequenceCompareTo | main | 128 | 23.466 ns |
| SequenceCompareTo | pr | 128 | 12.219 ns |
| SequenceCompareToDifferent | main | 128 | 2.696 ns |
| SequenceCompareToDifferent | pr | 128 | 2.696 ns |
| SequenceCompareTo | main | 512 | 59.817 ns |
| SequenceCompareTo | pr | 512 | 48.528 ns |
| SequenceCompareToDifferent | main | 512 | 2.698 ns |
| SequenceCompareToDifferent | pr | 512 | 2.696 ns |
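
The Arm64 table reports medians rather than ratios, so the size of the boost for equal inputs has to be worked out by hand. A small sketch of that arithmetic follows, using the `SequenceCompareTo` medians copied from the rows above; the class name `Arm64SpeedupSketch` and the helper `Speedup` are hypothetical.

```csharp
using System;

public static class Arm64SpeedupSketch
{
    // Speedup factor of the PR build over main: values above 1.0 mean
    // the PR build is faster.
    public static double Speedup(double mainNs, double prNs) => mainNs / prNs;

    public static void Main()
    {
        // SequenceCompareTo medians (equal inputs) from the Arm64 run.
        (int Size, double MainNs, double PrNs)[] rows =
        {
            (8, 6.551, 6.547),
            (32, 12.399, 3.851),
            (64, 15.863, 5.776),
            (128, 23.466, 12.219),
            (512, 59.817, 48.528),
        };

        foreach (var row in rows)
        {
            Console.WriteLine($"Size {row.Size}: {Speedup(row.MainNs, row.PrNs):F2}x");
        }
    }
}
```

On these numbers the factor is essentially 1 at size 8, a little above 3 at size 32, and falls to roughly 1.2 at size 512, which is where the "20% to 3x" range comes from.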

@adamsitnik adamsitnik merged commit 91ae19b into dotnet:main Oct 10, 2022
@ghost ghost locked as resolved and limited conversation to collaborators Nov 9, 2022