Improving performance in .NET Core App 3.0 #3

briancylui · 2018-08-02T19:36:59Z

On the main progress page, the performance tests originally located in the src\Native\CpuMath\ folder give comparable results for the native and managed implementations of the key SSE intrinsics.

BenchmarkDotNet=v0.10.14, OS=Windows 10.0.15063.1155 (1703/CreatorsUpdate/Redstone2)
Intel Core i7-7700 CPU 3.60GHz (Kaby Lake), 1 CPU, 8 logical and 4 physical cores
Frequency=3515623 Hz, Resolution=284.4446 ns, Timer=TSC
.NET Core SDK=2.1.300
  [Host]     : .NET Core 2.1.0 (CoreCLR 4.6.26515.07, CoreFX 4.6.26515.06), 64bit RyuJIT
  DefaultJob : .NET Core 2.1.0 (CoreCLR 4.6.26515.07, CoreFX 4.6.26515.06), 64bit RyuJIT

Method	Mean	Error	StdDev
NativeDotUPerf	363.2 us	7.7293 us	18.8143 us
MyDotUPerf	340.2 us	6.7218 us	8.0018 us
NativeDotSUPerf	2,178.3 us	43.4641 us	40.6563 us
MyDotSUPerf	2,144.7 us	19.1638 us	16.0027 us
NativeSumSqUPerf	540.6 us	3.0299 us	2.8342 us
MySumSqUPerf	538.8 us	2.5507 us	2.3859 us
NativeAddUPerf	313.9 us	2.5163 us	2.3537 us
MyAddUPerf	303.3 us	4.5125 us	4.2210 us
NativeAddSUPerf	2,691.8 us	29.4588 us	27.5558 us
MyAddSUPerf	2,658.1 us	51.3336 us	64.9206 us
NativeAddScaleUPerf	300.0 us	5.5529 us	5.1941 us
MyAddScaleUPerf	309.8 us	5.3974 us	4.7846 us
NativeAddScaleSUPerf	2,550.9 us	21.8322 us	20.4218 us
MyAddScaleSUPerf	2,805.3 us	20.5171 us	19.1917 us
NativeScaleUPerf	131.4 us	0.6347 us	0.5626 us
MyScaleUPerf	130.7 us	1.2159 us	1.1373 us
NativeDist2Perf	336.4 us	2.0555 us	1.9227 us
MyDist2Perf	335.2 us	8.3427 us	11.4196 us
NativeSumAbsUPerf	258.0 us	1.6470 us	1.5406 us
MySumAbsqUPerf	258.9 us	0.9447 us	0.7889 us
NativeMulElementWiseUPerf	466.4 us	1.9625 us	1.6388 us
MyMulElementWiseUPerf	467.2 us	4.3560 us	4.0747 us

However, once the tests were moved into the test\Microsoft.ML.CpuMath.PerformanceTests\ folder (with multi-targeting, Span<T> inputs, and TargetCount lowered from ~20 to 3 in the ToolChain), the managed DotU, SumSqU, Dist2, and SumAbsU became noticeably slower than their native counterparts on .NET Core App 3.0, by roughly 1.6x to 2.2x in the first table below. Two relevant tables are shown below.

Run in .NET Core App 3.0 (`ManagedXPerf` uses the managed `X`)

BenchmarkDotNet=v0.10.14, OS=Windows 10.0.15063.1209 (1703/CreatorsUpdate/Redstone2)
Intel Core i7-7700 CPU 3.60GHz (Kaby Lake), 1 CPU, 8 logical and 4 physical cores
Frequency=3515626 Hz, Resolution=284.4444 ns, Timer=TSC
.NET Core SDK=3.0.100-alpha1-20180720-2
  [Host] : .NET Core 3.0.0-preview1-26710-03 (CoreCLR 4.6.26710.05, CoreFX 4.6.26708.04), 64bit RyuJIT

Toolchain=InProcessToolchain  LaunchCount=1  TargetCount=3
WarmupCount=3

Method	Mean	Error	StdDev
NativeDotUPerf	346.2 us	63.44 us	3.584 us
ManagedDotUPerf	662.3 us	31.82 us	1.798 us
NativeDotSUPerf	2,291.9 us	280.57 us	15.853 us
ManagedDotSUPerf	2,303.4 us	275.33 us	15.557 us
NativeSumSqUPerf	551.5 us	20.80 us	1.175 us
ManagedSumSqUPerf	882.4 us	385.73 us	21.794 us
NativeAddUPerf	326.1 us	104.46 us	5.902 us
ManagedAddUPerf	324.6 us	70.70 us	3.995 us
NativeAddSUPerf	2,982.1 us	5,531.05 us	312.514 us
ManagedAddSUPerf	2,763.0 us	951.16 us	53.742 us
NativeAddScaleUPerf	327.4 us	90.74 us	5.127 us
ManagedAddScaleUPerf	324.5 us	118.36 us	6.688 us
NativeAddScaleSUPerf	2,675.9 us	590.91 us	33.387 us
ManagedAddScaleSUPerf	2,693.1 us	62.59 us	3.536 us
NativeScaleUPerf	140.2 us	51.56 us	2.913 us
ManagedScaleUPerf	155.3 us	238.58 us	13.480 us
NativeDist2Perf	348.5 us	125.00 us	7.063 us
ManagedDist2Perf	671.6 us	518.96 us	29.322 us
NativeSumAbsUPerf	272.4 us	79.46 us	4.490 us
ManagedSumAbsqUPerf	601.2 us	86.91 us	4.910 us
NativeMulElementWiseUPerf	497.3 us	404.78 us	22.871 us
ManagedMulElementWiseUPerf	493.7 us	145.72 us	8.233 us

Run in .NET Core App 2.1 (`ManagedXPerf` uses the native `X`)

BenchmarkDotNet=v0.10.14, OS=Windows 10.0.15063.1209 (1703/CreatorsUpdate/Redstone2)
Intel Core i7-7700 CPU 3.60GHz (Kaby Lake), 1 CPU, 8 logical and 4 physical cores
Frequency=3515626 Hz, Resolution=284.4444 ns, Timer=TSC
.NET Core SDK=2.2.100-refac-20180613-1
  [Host] : .NET Core 2.1.2 (CoreCLR 4.6.26628.05, CoreFX 4.6.26629.01), 64bit RyuJIT

Toolchain=InProcessToolchain  LaunchCount=1  TargetCount=3
WarmupCount=3

Method	Mean	Error	StdDev
NativeDotUPerf	352.5 us	45.87 us	2.592 us
ManagedDotUPerf	346.7 us	72.99 us	4.124 us
NativeDotSUPerf	2,274.1 us	729.91 us	41.241 us
ManagedDotSUPerf	2,264.7 us	220.67 us	12.468 us
NativeSumSqUPerf	601.9 us	41.43 us	2.341 us
ManagedSumSqUPerf	562.9 us	453.06 us	25.599 us
NativeAddUPerf	333.9 us	140.60 us	7.944 us
ManagedAddUPerf	330.2 us	143.10 us	8.086 us
NativeAddSUPerf	2,839.8 us	4,658.38 us	263.207 us
ManagedAddSUPerf	2,726.4 us	467.48 us	26.413 us
NativeAddScaleUPerf	330.6 us	58.80 us	3.322 us
ManagedAddScaleUPerf	327.8 us	88.76 us	5.015 us
NativeAddScaleSUPerf	2,755.9 us	563.51 us	31.839 us
ManagedAddScaleSUPerf	2,752.0 us	598.46 us	33.814 us
NativeScaleUPerf	141.8 us	29.23 us	1.652 us
ManagedScaleUPerf	150.2 us	202.48 us	11.441 us
NativeDist2Perf	350.6 us	44.27 us	2.501 us
ManagedDist2Perf	350.2 us	23.96 us	1.354 us
NativeSumAbsUPerf	270.0 us	82.27 us	4.648 us
ManagedSumAbsqUPerf	272.9 us	159.45 us	9.009 us
NativeMulElementWiseUPerf	502.1 us	275.94 us	15.591 us
ManagedMulElementWiseUPerf	503.3 us	125.91 us	7.114 us

TODOs

When I ran the performance tests in the first half of the PR review period, the results looked fine, but the most recent run above looks quite different. I will look into what causes this.

Experiments made to find the cause of the issue

Changed the perf tests from the ShortRun job to the Default job, i.e. increasing LaunchCount and other warm-up steps to make the perf measurement more accurate.
Conclusion: Not the main factor.

BenchmarkDotNet=v0.10.14, OS=Windows 10.0.15063.1209 (1703/CreatorsUpdate/Redstone2)
Intel Core i7-7700 CPU 3.60GHz (Kaby Lake), 1 CPU, 8 logical and 4 physical cores
Frequency=3515626 Hz, Resolution=284.4444 ns, Timer=TSC
.NET Core SDK=3.0.100-alpha1-20180720-2
  [Host] : .NET Core 3.0.0-preview1-26710-03 (CoreCLR 4.6.26710.05, CoreFX 4.6.26708.04), 64bit RyuJIT

Toolchain=InProcessToolchain

Method	Mean	Error	StdDev	Median
NativeDotUPerf	351.8 us	5.815 us	5.154 us	350.7 us
ManagedDotUPerf	664.8 us	5.631 us	5.268 us	664.7 us
NativeDotSUPerf	2,416.7 us	74.286 us	207.080 us	2,342.7 us
ManagedDotSUPerf	2,311.7 us	56.298 us	49.907 us	2,308.0 us
NativeSumSqUPerf	547.4 us	2.866 us	2.237 us	546.8 us
ManagedSumSqUPerf	910.6 us	24.005 us	35.186 us	894.9 us
NativeAddUPerf	328.8 us	5.920 us	4.943 us	327.8 us
ManagedAddUPerf	357.8 us	14.013 us	39.524 us	337.0 us
NativeAddSUPerf	2,749.6 us	50.185 us	100.224 us	2,722.3 us
ManagedAddSUPerf	2,873.1 us	25.477 us	22.585 us	2,871.1 us
NativeAddScaleUPerf	334.3 us	8.223 us	8.076 us	331.2 us
ManagedAddScaleUPerf	334.6 us	3.100 us	2.748 us	333.9 us
NativeAddScaleSUPerf	2,729.2 us	32.378 us	30.286 us	2,730.1 us
ManagedAddScaleSUPerf	2,670.1 us	29.478 us	23.014 us	2,662.7 us
NativeScaleUPerf	140.0 us	1.780 us	1.390 us	140.0 us
ManagedScaleUPerf	143.3 us	2.711 us	2.784 us	142.9 us
NativeDist2Perf	350.2 us	3.081 us	2.573 us	349.6 us
ManagedDist2Perf	664.7 us	2.621 us	2.046 us	664.6 us
NativeSumAbsUPerf	271.8 us	2.229 us	1.741 us	271.8 us
ManagedSumAbsUPerf	600.1 us	3.051 us	2.854 us	600.6 us
NativeMulElementWiseUPerf	503.8 us	9.875 us	8.754 us	501.3 us
ManagedMulElementWiseUPerf	518.0 us	25.485 us	39.676 us	498.5 us

Removed the dependency on Span<T> and used plain float arrays as inputs instead.
Conclusion: Not the main factor.
Removed the dependency on the VectorSum function and reverted to the original code instead.
Conclusion: This is the main factor.
Perf results after the fix:

BenchmarkDotNet=v0.10.14, OS=Windows 10.0.15063.1209 (1703/CreatorsUpdate/Redstone2)
Intel Core i7-7700 CPU 3.60GHz (Kaby Lake), 1 CPU, 8 logical and 4 physical cores
Frequency=3515626 Hz, Resolution=284.4444 ns, Timer=TSC
.NET Core SDK=3.0.100-alpha1-20180720-2
  [Host] : .NET Core 3.0.0-preview1-26710-03 (CoreCLR 4.6.26710.05, CoreFX 4.6.26708.04), 64bit RyuJIT

Toolchain=InProcessToolchain  LaunchCount=1  TargetCount=3
WarmupCount=3

Method	Mean	Error	StdDev
NativeDotUPerf	550.7 us	1,839.99 us	103.963 us
ManagedDotUPerf	486.4 us	61.79 us	3.492 us
NativeDotSUPerf	2,446.8 us	405.57 us	22.915 us
ManagedDotSUPerf	2,620.6 us	219.16 us	12.383 us
NativeSumSqUPerf	569.0 us	18.54 us	1.047 us
ManagedSumSqUPerf	579.5 us	68.04 us	3.845 us
NativeAddUPerf	389.9 us	562.21 us	31.766 us
ManagedAddUPerf	368.6 us	48.36 us	2.733 us
NativeAddSUPerf	4,324.0 us	10,768.20 us	608.423 us
ManagedAddSUPerf	3,118.3 us	109.61 us	6.193 us
NativeAddScaleUPerf	512.1 us	1,694.78 us	95.758 us
ManagedAddScaleUPerf	480.3 us	252.98 us	14.294 us
NativeAddScaleSUPerf	3,425.0 us	6,916.49 us	390.795 us
ManagedAddScaleSUPerf	3,161.6 us	808.89 us	45.704 us
NativeScaleUPerf	153.5 us	52.31 us	2.955 us
ManagedScaleUPerf	152.5 us	59.76 us	3.377 us
NativeDist2Perf	394.8 us	126.76 us	7.162 us
ManagedDist2Perf	386.7 us	145.84 us	8.240 us
NativeSumAbsUPerf	304.7 us	610.29 us	34.483 us
ManagedSumAbsqUPerf	291.5 us	277.25 us	15.665 us
NativeMulElementWiseUPerf	563.3 us	124.22 us	7.018 us
ManagedMulElementWiseUPerf	572.0 us	295.43 us	16.692 us


helloguo · 2018-08-03T01:42:34Z

Can you manually inline the VectorSum function and see what happens? I tested with two simple functions, and VectorSum seems to make the difference: test2, which inlines it by hand, has better codegen.

        static unsafe float test1(Span<float> src, Span<float> dst, int count)
        {
            Vector128<float> result = Sse.SetZeroVector128();

            fixed (float* psrc = src)
            fixed (float* pdst = dst)
            {
                float* pSrcCurrent = psrc;
                float* pDstCurrent = pdst;
                float* pEnd = psrc + src.Length;

                while (pSrcCurrent + 4 <= pEnd)
                {
                    Vector128<float> srcVector = Sse.LoadVector128(pSrcCurrent);
                    Vector128<float> dstVector = Sse.LoadVector128(pDstCurrent);

                    result = Sse.Add(result, Sse.Multiply(srcVector, dstVector));

                    pSrcCurrent += 4;
                    pDstCurrent += 4;
                }

                result = VectorSum(result);

                while (pSrcCurrent < pEnd)
                {
                    Vector128<float> srcVector = Sse.LoadScalarVector128(pSrcCurrent);
                    Vector128<float> dstVector = Sse.LoadScalarVector128(pDstCurrent);

                    result = Sse.AddScalar(result, Sse.MultiplyScalar(srcVector, dstVector));

                    pSrcCurrent++;
                    pDstCurrent++;
                }
            }

            return Sse.ConvertToSingle(result);
        }

        static unsafe float test2(Span<float> src, Span<float> dst, int count)
        {
            Vector128<float> result = Sse.SetZeroVector128();

            fixed (float* psrc = src)
            fixed (float* pdst = dst)
            {
                float* pSrcCurrent = psrc;
                float* pDstCurrent = pdst;
                float* pEnd = psrc + src.Length;

                while (pSrcCurrent + 4 <= pEnd)
                {
                    Vector128<float> srcVector = Sse.LoadVector128(pSrcCurrent);
                    Vector128<float> dstVector = Sse.LoadVector128(pDstCurrent);

                    result = Sse.Add(result, Sse.Multiply(srcVector, dstVector));

                    pSrcCurrent += 4;
                    pDstCurrent += 4;
                }


                if (Sse3.IsSupported)
                {
                    Vector128<float> tmp = Sse3.HorizontalAdd(result, result);
                    result = Sse3.HorizontalAdd(tmp, tmp);
                }
                else
                {
                    // SSE3 is not supported.
                    Vector128<float> tmp = Sse.Add(result, Sse.MoveHighToLow(result, result));
                    // The control byte shuffles the four 32-bit floats of tmp: ABCD -> BADC.
                    result = Sse.Add(tmp, Sse.Shuffle(tmp, tmp, 0xb1));
                }

                while (pSrcCurrent < pEnd)
                {
                    Vector128<float> srcVector = Sse.LoadScalarVector128(pSrcCurrent);
                    Vector128<float> dstVector = Sse.LoadScalarVector128(pDstCurrent);

                    result = Sse.AddScalar(result, Sse.MultiplyScalar(srcVector, dstVector));

                    pSrcCurrent++;
                    pDstCurrent++;
                }
            }

            return Sse.ConvertToSingle(result);
        }

codegen of test1:

Address	Source Line	Assembly	CPU Time	Instructions Retired
0x7ff8120e1ad7		vmovups xmm0, xmmword ptr [rdi]	88.272ms	20,800,000
0x7ff8120e1adc		vmovups xmm1, xmmword ptr [rsi]	2.006ms	7,800,000
0x7ff8120e1ae1		vmulps xmm0, xmm0, xmm1	2.006ms	10,400,000
0x7ff8120e1ae6		vaddps xmm0, xmm0, xmmword ptr [rsp+0x40]		
0x7ff8120e1aed		vmovapd xmmword ptr [rsp+0x40], xmm0	673.071ms	1,201,200,000
0x7ff8120e1af4		add rdi, 0x10	93.287ms	1,326,000,000
0x7ff8120e1af8		add rsi, 0x10		
0x7ff8120e1afc		lea rcx, ptr [rdi+0x10]		
0x7ff8120e1b00		cmp rcx, rbx	0ms	2,600,000
0x7ff8120e1b03		jbe 0x7ff8120e1ad7 <Block 10>

codegen of test2:

Address	Source Line	Assembly	CPU Time	Instructions Retired
0x7ff8120e203c		vmovups xmm0, xmmword ptr [rdx]	98.302ms	62,400,000
0x7ff8120e2041		vmovups xmm1, xmmword ptr [rax]		
0x7ff8120e2046		vmulps xmm0, xmm0, xmm1		
0x7ff8120e204b		vaddps xmm6, xmm6, xmm0	11.034ms	135,200,000
0x7ff8120e2050		add rdx, 0x10	255.787ms	2,017,600,000
0x7ff8120e2054		add rax, 0x10		
0x7ff8120e2058		lea r8, ptr [rdx+0x10]		
0x7ff8120e205c		cmp r8, rcx	11.034ms	67,600,000
0x7ff8120e205f		jbe 0x7ff8120e203c <Block 10>
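
In the test1 listing the accumulator is read from and written back to [rsp+0x40] on every iteration, whereas test2 keeps it in xmm6. The following is a minimal sketch of a helper that might keep the test1 shape without that spill. The thread does not show the real VectorSum signature, so the by-value parameter and the AggressiveInlining attribute are assumptions, and whether the JIT then keeps the accumulator in a register depends on the runtime build. The sketch uses Vector128.Create, GetElement, and ToScalar from the released System.Runtime.Intrinsics API, which may differ from the preview build used above.

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

public static class VectorSumSketch
{
    // Hypothetical stand-in for the thread's VectorSum helper (its real signature is not shown).
    // Assumptions: the vector is passed by value and inlining is requested, so the JIT
    // has no reason to take the address of the caller's accumulator.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static Vector128<float> VectorSum(Vector128<float> vector)
    {
        if (Sse3.IsSupported)
        {
            // Two horizontal adds leave the total of the four lanes in every lane.
            Vector128<float> tmp = Sse3.HorizontalAdd(vector, vector);
            return Sse3.HorizontalAdd(tmp, tmp);
        }

        if (Sse.IsSupported)
        {
            // Same fallback as test2: fold the high half onto the low half,
            // then swap adjacent lanes (control byte 0xb1) and add; lane 0 holds the total.
            Vector128<float> tmp = Sse.Add(vector, Sse.MoveHighToLow(vector, vector));
            return Sse.Add(tmp, Sse.Shuffle(tmp, tmp, 0xb1));
        }

        // Non-x86 fallback so the sketch runs anywhere: sum the lanes in scalar code.
        float sum = vector.GetElement(0) + vector.GetElement(1)
                  + vector.GetElement(2) + vector.GetElement(3);
        return Vector128.Create(sum);
    }
}
```

For example, `VectorSumSketch.VectorSum(Vector128.Create(1f, 2f, 3f, 4f)).ToScalar()` evaluates to 10. Comparing the disassembly of a loop that calls this helper against the test2 listing would show whether the store to the stack slot is gone.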

briancylui · 2018-08-03T21:26:56Z

Thank you to the commenters and my mentors for all the help and guidance. The perf results are now back to the expected behavior, so I am closing the issue.

briancylui closed this as completed Aug 3, 2018

briancylui reopened this Aug 3, 2018

briancylui closed this as completed Aug 3, 2018

briancylui self-assigned this Aug 3, 2018

eerhardt mentioned this issue Aug 17, 2018

Optimize some Matrix4x4 operations with SSE dotnet/corefx#31779

Merged
