Port SpanHelpers.SequenceCompareTo(ref byte, int, ref byte, int) to Vector128/256 #73475

adamsitnik · 2022-08-05T17:22:03Z

On Arm64 we see a roughly 20% improvement, mostly because this code had not previously been optimized for Arm64.

BenchmarkDotNet=v0.13.1.1828-nightly, OS=ubuntu 20.04
Unknown processor
.NET SDK=7.0.100-rc.1.22405.1
  [Host]     : .NET 7.0.0 (7.0.22.40308), Arm64 RyuJIT AdvSIMD
  Job-PHNJLQ : .NET 7.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD
  Job-FYBYWI : .NET 7.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD

Method	Toolchain	Size	Mean	Ratio
SequenceCompareTo	/PR/corerun	512	43.273 ns	0.83
SequenceCompareTo	/main/corerun	512	52.073 ns	1.00

SequenceCompareToDifferent	/PR/corerun	512	5.393 ns	0.82
SequenceCompareToDifferent	/main/corerun	512	6.547 ns	1.00

For x64 the performance is on par for both AVX2 and AVX.

BenchmarkDotNet=v0.13.1.1828-nightly, OS=Windows 11 (10.0.22000.795/21H2)
AMD Ryzen Threadripper PRO 3945WX 12-Cores, 1 CPU, 24 logical and 12 physical cores
.NET SDK=7.0.100-preview.7.22377.5
  [Host]     : .NET 7.0.0 (7.0.22.37506), X64 RyuJIT AVX2
  Job-ZYEZPW : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT AVX
  Job-DJUBNM : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT AVX

EnvironmentVariables=COMPlus_EnableAVX2=0

Method	Toolchain	Size	Mean	Ratio
SequenceCompareTo	\PR\corerun.exe	512	20.099 ns	1.03
SequenceCompareTo	\baseline\corerun.exe	512	19.485 ns	1.00

SequenceCompareToDifferent	\7.0.0\corerun.exe	512	4.565 ns	0.98
SequenceCompareToDifferent	\baseline\corerun.exe	512	4.671 ns	1.00

contributes to #64451

ghost · 2022-08-05T17:22:29Z

Tagging subscribers to this area: @dotnet/area-system-memory
See info in area-owners.md if you want to be subscribed.

Issue Details

Author:	adamsitnik
Assignees:	-
Labels:	`area-System.Memory`, `tenet-performance`
Milestone:	7.0.0

adamsitnik · 2022-08-11T10:40:48Z

@tannergooding @EgorBo sorry to bother you guys, but it would be really nice to get this merged in 7 ;)

stephentoub · 2022-08-11T18:18:01Z

src/libraries/System.Private.CoreLib/src/System/SpanHelpers.Byte.cs

                        {
                            // All matched
                            offset += (nuint)Vector128<byte>.Count;
                            continue;
                        }

-                        goto Difference;
+                        goto BytewiseCheck;


Isn't this replacing the vectorized detection of which element in the vector differed with a linear walk through all bytes in the vector? If so, did you validate the perf impact of this on inputs smaller than 512 elements? I'm surprised this wouldn't result in regressions.

adamsitnik · 2022-10-07

@stephentoub please excuse the delay. That is true: the Vector128 code path is also executed on Arm64, where ExtractMostSignificantBits is expensive. I've used BytewiseCheck, which until now was used by the Vector<T> code path, and its perf is OK.
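To make the trade-off under discussion concrete, here is a minimal sketch of the two ways to locate the first differing byte inside a pair of 128-bit vectors. It is not the code from this PR: the type and method names (`FirstDifferenceSketch`, `ViaMask`, `ViaBytewiseWalk`) are hypothetical, and it assumes .NET 7 or later for the cross-platform `Vector128` helpers.

```csharp
using System;
using System.Numerics;
using System.Runtime.Intrinsics;

// Hypothetical illustration only; names are not taken from SpanHelpers.Byte.cs.
public static class FirstDifferenceSketch
{
    // Vectorized detection: turn the per-lane equality result into a bitmask,
    // invert it, and take the index of the lowest set bit.
    // Returns -1 when all 16 lanes match.
    public static int ViaMask(Vector128<byte> left, Vector128<byte> right)
    {
        uint matches = Vector128.Equals(left, right).ExtractMostSignificantBits();
        uint differences = ~matches & 0xFFFFu; // only the low 16 bits map to lanes
        return differences == 0 ? -1 : BitOperations.TrailingZeroCount(differences);
    }

    // Bytewise check: a linear walk over the lanes, with no mask extraction.
    // Returns -1 when all 16 lanes match.
    public static int ViaBytewiseWalk(Vector128<byte> left, Vector128<byte> right)
    {
        for (int i = 0; i < Vector128<byte>.Count; i++)
        {
            if (left.GetElement(i) != right.GetElement(i))
            {
                return i;
            }
        }

        return -1;
    }
}
```

Both methods return the same index for any pair of vectors; for example, with `left = Vector128.Create((byte)7)` and `right = left.WithElement(5, (byte)9)` each returns 5. They differ only in cost, which depends on how cheaply the target hardware can extract the mask.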

I've added benchmarks for smaller collection sizes, synced the fork and re-run them.

x64 AVX2 (Vector256)

It's more or less on par.

BenchmarkDotNet=v0.13.2.1937-nightly, OS=Windows 11 (10.0.22000.978/21H2)
AMD Ryzen Threadripper PRO 3945WX 12-Cores, 1 CPU, 24 logical and 12 physical cores
.NET SDK=7.0.100-rtm.22506.1
  [Host] : .NET 7.0.0 (7.0.22.48010), X64 RyuJIT AVX2
  main   : .NET 8.0.0 (42.42.42.42424), X64 RyuJIT AVX2
  pr     : .NET 8.0.0 (42.42.42.42424), X64 RyuJIT AVX2

Method	Job	Toolchain	Size	Mean	Ratio
SequenceCompareTo	main	\main\corerun.exe	8	7.522 ns	1.00
SequenceCompareTo	pr	\prSync\corerun.exe	8	7.461 ns	0.99

SequenceCompareToDifferent	main	\main\corerun.exe	8	4.344 ns	1.00
SequenceCompareToDifferent	pr	\prSync\corerun.exe	8	4.230 ns	0.97

SequenceCompareTo	main	\main\corerun.exe	32	3.889 ns	1.00
SequenceCompareTo	pr	\prSync\corerun.exe	32	3.869 ns	0.99

SequenceCompareToDifferent	main	\main\corerun.exe	32	4.651 ns	1.00
SequenceCompareToDifferent	pr	\prSync\corerun.exe	32	4.603 ns	0.99

SequenceCompareTo	main	\main\corerun.exe	64	4.426 ns	1.00
SequenceCompareTo	pr	\prSync\corerun.exe	64	4.446 ns	1.00

SequenceCompareToDifferent	main	\main\corerun.exe	64	4.314 ns	1.00
SequenceCompareToDifferent	pr	\prSync\corerun.exe	64	4.332 ns	1.00

SequenceCompareTo	main	\main\corerun.exe	128	5.506 ns	1.00
SequenceCompareTo	pr	\prSync\corerun.exe	128	5.558 ns	1.01

SequenceCompareToDifferent	main	\main\corerun.exe	128	4.337 ns	1.00
SequenceCompareToDifferent	pr	\prSync\corerun.exe	128	4.321 ns	1.00

SequenceCompareTo	main	\main\corerun.exe	512	11.758 ns	1.00
SequenceCompareTo	pr	\prSync\corerun.exe	512	11.694 ns	0.99

SequenceCompareToDifferent	main	\main\corerun.exe	512	4.324 ns	1.00
SequenceCompareToDifferent	pr	\prSync\corerun.exe	512	4.321 ns	1.00

x64 AVX (Vector128)

It's 2-3% slower, but that translates to only about 0.2 ns.

BenchmarkDotNet=v0.13.2.1937-nightly, OS=Windows 11 (10.0.22000.978/21H2)
AMD Ryzen Threadripper PRO 3945WX 12-Cores, 1 CPU, 24 logical and 12 physical cores
.NET SDK=7.0.100-rtm.22506.1
  [Host] : .NET 7.0.0 (7.0.22.48010), X64 RyuJIT AVX2
  main   : .NET 8.0.0 (42.42.42.42424), X64 RyuJIT AVX
  pr     : .NET 8.0.0 (42.42.42.42424), X64 RyuJIT AVX

EnvironmentVariables=COMPlus_EnableAVX2=0

Method	Job	Size	Mean	Ratio
SequenceCompareTo	main	8	7.058 ns	1.00
SequenceCompareTo	pr	8	6.696 ns	0.95

SequenceCompareToDifferent	main	8	4.088 ns	1.00
SequenceCompareToDifferent	pr	8	4.099 ns	1.00

SequenceCompareTo	main	32	4.343 ns	1.00
SequenceCompareTo	pr	32	4.273 ns	0.98

SequenceCompareToDifferent	main	32	4.323 ns	1.00
SequenceCompareToDifferent	pr	32	4.414 ns	1.02

SequenceCompareTo	main	64	5.322 ns	1.00
SequenceCompareTo	pr	64	5.383 ns	1.01

SequenceCompareToDifferent	main	64	4.316 ns	1.00
SequenceCompareToDifferent	pr	64	4.444 ns	1.03

SequenceCompareTo	main	128	6.987 ns	1.00
SequenceCompareTo	pr	128	7.229 ns	1.03

SequenceCompareToDifferent	main	128	4.307 ns	1.00
SequenceCompareToDifferent	pr	128	4.493 ns	1.04

SequenceCompareTo	main	512	18.724 ns	1.00
SequenceCompareTo	pr	512	19.032 ns	1.02

SequenceCompareToDifferent	main	512	4.302 ns	1.00
SequenceCompareToDifferent	pr	512	4.451 ns	1.03

Arm64 AdvSimd (Vector128)

For cases where the inputs are different, the perf remains the same, but we can observe a nice boost for equal inputs of 32 elements and more (from about 20% at 512 elements up to roughly 3x at 32 elements).

BenchmarkDotNet=v0.13.2.1937-nightly, OS=ubuntu 20.04
Unknown processor
.NET SDK=7.0.100-rtm.22506.1
  [Host] : .NET 7.0.0 (7.0.22.48010), Arm64 RyuJIT AdvSIMD
  main   : .NET 8.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD
  pr     : .NET 8.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD

LaunchCount=9 MemoryRandomization=True

Method	Job	Size	Median
SequenceCompareTo	main	8	6.551 ns
SequenceCompareTo	pr	8	6.547 ns

SequenceCompareToDifferent	main	8	2.696 ns
SequenceCompareToDifferent	pr	8	2.696 ns

SequenceCompareTo	main	32	12.399 ns
SequenceCompareTo	pr	32	3.851 ns

SequenceCompareToDifferent	main	32	2.697 ns
SequenceCompareToDifferent	pr	32	2.696 ns

SequenceCompareTo	main	64	15.863 ns
SequenceCompareTo	pr	64	5.776 ns

SequenceCompareToDifferent	main	64	4.090 ns
SequenceCompareToDifferent	pr	64	3.274 ns

SequenceCompareTo	main	128	23.466 ns
SequenceCompareTo	pr	128	12.219 ns

SequenceCompareToDifferent	main	128	2.696 ns
SequenceCompareToDifferent	pr	128	2.696 ns

SequenceCompareTo	main	512	59.817 ns
SequenceCompareTo	pr	512	48.528 ns

SequenceCompareToDifferent	main	512	2.698 ns
SequenceCompareToDifferent	pr	512	2.696 ns

adamsitnik added 5 commits August 5, 2022 18:00

port SpanHelpers.SequenceCompareTo(ref byte, int, ref byte, int)

e2c60f3

try to fix arm64 performance regression by not calling ExtractMostSignificantBits when not needed

55d86e2

perform only one comparison when there is no mismatch

13db03f

how about now?

f565f58

remove the comment

eb22170

adamsitnik added the area-System.Memory and tenet-performance labels Aug 5, 2022

adamsitnik added this to the 7.0.0 milestone Aug 5, 2022

adamsitnik requested review from EgorBo, stephentoub and tannergooding August 5, 2022 17:22

ghost assigned adamsitnik Aug 5, 2022

adamsitnik mentioned this pull request Aug 5, 2022

Switch from direct intrinsics usage to Vector/Vector64/Vector128/Vector256 #64451

Open

75 tasks

This was referenced Aug 5, 2022

Infra improvements for Helix #68176

Closed

GC/API/GC/GetGCMemoryInfo/GetGCMemoryInfo.sh test failing intermittently on CoreCLR Linux ARM32 #73247

Closed

Merge branch 'dotnet:main' into spanSequenceCompareTo

3506efd

stephentoub reviewed Aug 11, 2022

Merge remote-tracking branch 'upstream/main' into spanSequenceCompareTo

1c31e15

adamsitnik modified the milestones: 7.0.0, 8.0.0 Oct 7, 2022

adamsitnik requested a review from stephentoub October 7, 2022 13:08

build-analysis bot mentioned this pull request Oct 7, 2022

Tracking issue for CI build timeouts #76454

Closed

tannergooding approved these changes Oct 7, 2022

adamsitnik merged commit 91ae19b into dotnet:main Oct 10, 2022

This was referenced Oct 11, 2022

[Perf] Alpine/x64: 2 Regressions on 10/10/2022 3:25:05 PM DrewScoggins/performance-2#8281

Closed

[Perf] Regressions in System.Globalization.Tests.StringSearch #76885

Open

ghost locked as resolved and limited conversation to collaborators Nov 9, 2022
