Improving performance in .NET Core App 3.0 #3

Closed
briancylui opened this issue Aug 2, 2018 · 2 comments

briancylui commented Aug 2, 2018

On the main progress page, the performance tests that originally sat in the src\Native\CpuMath\ folder give comparable results for the native and managed implementations of the key SSE intrinsics.

BenchmarkDotNet=v0.10.14, OS=Windows 10.0.15063.1155 (1703/CreatorsUpdate/Redstone2)
Intel Core i7-7700 CPU 3.60GHz (Kaby Lake), 1 CPU, 8 logical and 4 physical cores
Frequency=3515623 Hz, Resolution=284.4446 ns, Timer=TSC
.NET Core SDK=2.1.300
  [Host]     : .NET Core 2.1.0 (CoreCLR 4.6.26515.07, CoreFX 4.6.26515.06), 64bit RyuJIT
  DefaultJob : .NET Core 2.1.0 (CoreCLR 4.6.26515.07, CoreFX 4.6.26515.06), 64bit RyuJIT

| Method | Mean | Error | StdDev |
|---|---:|---:|---:|
| NativeDotUPerf | 363.2 us | 7.7293 us | 18.8143 us |
| MyDotUPerf | 340.2 us | 6.7218 us | 8.0018 us |
| NativeDotSUPerf | 2,178.3 us | 43.4641 us | 40.6563 us |
| MyDotSUPerf | 2,144.7 us | 19.1638 us | 16.0027 us |
| NativeSumSqUPerf | 540.6 us | 3.0299 us | 2.8342 us |
| MySumSqUPerf | 538.8 us | 2.5507 us | 2.3859 us |
| NativeAddUPerf | 313.9 us | 2.5163 us | 2.3537 us |
| MyAddUPerf | 303.3 us | 4.5125 us | 4.2210 us |
| NativeAddSUPerf | 2,691.8 us | 29.4588 us | 27.5558 us |
| MyAddSUPerf | 2,658.1 us | 51.3336 us | 64.9206 us |
| NativeAddScaleUPerf | 300.0 us | 5.5529 us | 5.1941 us |
| MyAddScaleUPerf | 309.8 us | 5.3974 us | 4.7846 us |
| NativeAddScaleSUPerf | 2,550.9 us | 21.8322 us | 20.4218 us |
| MyAddScaleSUPerf | 2,805.3 us | 20.5171 us | 19.1917 us |
| NativeScaleUPerf | 131.4 us | 0.6347 us | 0.5626 us |
| MyScaleUPerf | 130.7 us | 1.2159 us | 1.1373 us |
| NativeDist2Perf | 336.4 us | 2.0555 us | 1.9227 us |
| MyDist2Perf | 335.2 us | 8.3427 us | 11.4196 us |
| NativeSumAbsUPerf | 258.0 us | 1.6470 us | 1.5406 us |
| MySumAbsqUPerf | 258.9 us | 0.9447 us | 0.7889 us |
| NativeMulElementWiseUPerf | 466.4 us | 1.9625 us | 1.6388 us |
| MyMulElementWiseUPerf | 467.2 us | 4.3560 us | 4.0747 us |

However, once the tests were moved into the test\Microsoft.ML.CpuMath.PerformanceTests\ folder (with multi-targeting, Span<T> inputs, and a TargetCount lowered from ~20 to 3 in the benchmark job), the managed DotU, SumSqU, Dist2, and SumAbsU became noticeably slower than their native counterparts, by roughly 1.6x to 2.2x on .NET Core App 3.0. The two relevant tables are shown below.

Run in .NET Core App 3.0 (ManagedXPerf uses the managed X)

BenchmarkDotNet=v0.10.14, OS=Windows 10.0.15063.1209 (1703/CreatorsUpdate/Redstone2)
Intel Core i7-7700 CPU 3.60GHz (Kaby Lake), 1 CPU, 8 logical and 4 physical cores
Frequency=3515626 Hz, Resolution=284.4444 ns, Timer=TSC
.NET Core SDK=3.0.100-alpha1-20180720-2
  [Host] : .NET Core 3.0.0-preview1-26710-03 (CoreCLR 4.6.26710.05, CoreFX 4.6.26708.04), 64bit RyuJIT

Toolchain=InProcessToolchain  LaunchCount=1  TargetCount=3
WarmupCount=3

| Method | Mean | Error | StdDev |
|---|---:|---:|---:|
| NativeDotUPerf | 346.2 us | 63.44 us | 3.584 us |
| ManagedDotUPerf | 662.3 us | 31.82 us | 1.798 us |
| NativeDotSUPerf | 2,291.9 us | 280.57 us | 15.853 us |
| ManagedDotSUPerf | 2,303.4 us | 275.33 us | 15.557 us |
| NativeSumSqUPerf | 551.5 us | 20.80 us | 1.175 us |
| ManagedSumSqUPerf | 882.4 us | 385.73 us | 21.794 us |
| NativeAddUPerf | 326.1 us | 104.46 us | 5.902 us |
| ManagedAddUPerf | 324.6 us | 70.70 us | 3.995 us |
| NativeAddSUPerf | 2,982.1 us | 5,531.05 us | 312.514 us |
| ManagedAddSUPerf | 2,763.0 us | 951.16 us | 53.742 us |
| NativeAddScaleUPerf | 327.4 us | 90.74 us | 5.127 us |
| ManagedAddScaleUPerf | 324.5 us | 118.36 us | 6.688 us |
| NativeAddScaleSUPerf | 2,675.9 us | 590.91 us | 33.387 us |
| ManagedAddScaleSUPerf | 2,693.1 us | 62.59 us | 3.536 us |
| NativeScaleUPerf | 140.2 us | 51.56 us | 2.913 us |
| ManagedScaleUPerf | 155.3 us | 238.58 us | 13.480 us |
| NativeDist2Perf | 348.5 us | 125.00 us | 7.063 us |
| ManagedDist2Perf | 671.6 us | 518.96 us | 29.322 us |
| NativeSumAbsUPerf | 272.4 us | 79.46 us | 4.490 us |
| ManagedSumAbsqUPerf | 601.2 us | 86.91 us | 4.910 us |
| NativeMulElementWiseUPerf | 497.3 us | 404.78 us | 22.871 us |
| ManagedMulElementWiseUPerf | 493.7 us | 145.72 us | 8.233 us |

Run in .NET Core App 2.1 (ManagedXPerf uses the native X)

BenchmarkDotNet=v0.10.14, OS=Windows 10.0.15063.1209 (1703/CreatorsUpdate/Redstone2)
Intel Core i7-7700 CPU 3.60GHz (Kaby Lake), 1 CPU, 8 logical and 4 physical cores
Frequency=3515626 Hz, Resolution=284.4444 ns, Timer=TSC
.NET Core SDK=2.2.100-refac-20180613-1
  [Host] : .NET Core 2.1.2 (CoreCLR 4.6.26628.05, CoreFX 4.6.26629.01), 64bit RyuJIT

Toolchain=InProcessToolchain  LaunchCount=1  TargetCount=3
WarmupCount=3

| Method | Mean | Error | StdDev |
|---|---:|---:|---:|
| NativeDotUPerf | 352.5 us | 45.87 us | 2.592 us |
| ManagedDotUPerf | 346.7 us | 72.99 us | 4.124 us |
| NativeDotSUPerf | 2,274.1 us | 729.91 us | 41.241 us |
| ManagedDotSUPerf | 2,264.7 us | 220.67 us | 12.468 us |
| NativeSumSqUPerf | 601.9 us | 41.43 us | 2.341 us |
| ManagedSumSqUPerf | 562.9 us | 453.06 us | 25.599 us |
| NativeAddUPerf | 333.9 us | 140.60 us | 7.944 us |
| ManagedAddUPerf | 330.2 us | 143.10 us | 8.086 us |
| NativeAddSUPerf | 2,839.8 us | 4,658.38 us | 263.207 us |
| ManagedAddSUPerf | 2,726.4 us | 467.48 us | 26.413 us |
| NativeAddScaleUPerf | 330.6 us | 58.80 us | 3.322 us |
| ManagedAddScaleUPerf | 327.8 us | 88.76 us | 5.015 us |
| NativeAddScaleSUPerf | 2,755.9 us | 563.51 us | 31.839 us |
| ManagedAddScaleSUPerf | 2,752.0 us | 598.46 us | 33.814 us |
| NativeScaleUPerf | 141.8 us | 29.23 us | 1.652 us |
| ManagedScaleUPerf | 150.2 us | 202.48 us | 11.441 us |
| NativeDist2Perf | 350.6 us | 44.27 us | 2.501 us |
| ManagedDist2Perf | 350.2 us | 23.96 us | 1.354 us |
| NativeSumAbsUPerf | 270.0 us | 82.27 us | 4.648 us |
| ManagedSumAbsqUPerf | 272.9 us | 159.45 us | 9.009 us |
| NativeMulElementWiseUPerf | 502.1 us | 275.94 us | 15.591 us |
| ManagedMulElementWiseUPerf | 503.3 us | 125.91 us | 7.114 us |
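
To put a number on the deviation, the managed/native ratio can be computed from the mean times in the .NET Core App 3.0 table above. The snippet below is only an illustrative sketch: the class name SlowdownRatios and its members are made up for this example, and the values are copied by hand from that table for the four affected kernels.

```csharp
using System;

public static class SlowdownRatios
{
    // (kernel, native mean, managed mean) in microseconds, copied from the
    // .NET Core App 3.0 table above. Only the four affected kernels are listed.
    public static readonly (string Kernel, double NativeUs, double ManagedUs)[] Rows =
    {
        ("DotU",    346.2, 662.3),
        ("SumSqU",  551.5, 882.4),
        ("Dist2",   348.5, 671.6),
        ("SumAbsU", 272.4, 601.2),
    };

    // A ratio above 1.0 means the managed version is slower than the native one.
    public static double Ratio(double nativeUs, double managedUs) => managedUs / nativeUs;

    public static void Main()
    {
        foreach (var row in Rows)
        {
            Console.WriteLine($"{row.Kernel,-8} {Ratio(row.NativeUs, row.ManagedUs):F2}x");
        }
    }
}
```

With these inputs the ratios come out at about 1.9x for DotU and Dist2, 1.6x for SumSqU, and 2.2x for SumAbsU. Since these runs use only three target iterations, the Error column is large, so the ratios are best read as rough indicators.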

TODOs

When I ran the performance tests in the first half of the PR review period, the results looked fine, but the most recent runs above look quite different. I will look into what causes this.

Experiments to find the cause of the issue

  1. Changed the perf test job from ShortRun to Default, i.e. increasing LaunchCount and the other warm-up settings so that the measurements are more accurate.
    Conclusion: Not the main factor.
BenchmarkDotNet=v0.10.14, OS=Windows 10.0.15063.1209 (1703/CreatorsUpdate/Redstone2)
Intel Core i7-7700 CPU 3.60GHz (Kaby Lake), 1 CPU, 8 logical and 4 physical cores
Frequency=3515626 Hz, Resolution=284.4444 ns, Timer=TSC
.NET Core SDK=3.0.100-alpha1-20180720-2
  [Host] : .NET Core 3.0.0-preview1-26710-03 (CoreCLR 4.6.26710.05, CoreFX 4.6.26708.04), 64bit RyuJIT

Toolchain=InProcessToolchain

| Method | Mean | Error | StdDev | Median |
|---|---:|---:|---:|---:|
| NativeDotUPerf | 351.8 us | 5.815 us | 5.154 us | 350.7 us |
| ManagedDotUPerf | 664.8 us | 5.631 us | 5.268 us | 664.7 us |
| NativeDotSUPerf | 2,416.7 us | 74.286 us | 207.080 us | 2,342.7 us |
| ManagedDotSUPerf | 2,311.7 us | 56.298 us | 49.907 us | 2,308.0 us |
| NativeSumSqUPerf | 547.4 us | 2.866 us | 2.237 us | 546.8 us |
| ManagedSumSqUPerf | 910.6 us | 24.005 us | 35.186 us | 894.9 us |
| NativeAddUPerf | 328.8 us | 5.920 us | 4.943 us | 327.8 us |
| ManagedAddUPerf | 357.8 us | 14.013 us | 39.524 us | 337.0 us |
| NativeAddSUPerf | 2,749.6 us | 50.185 us | 100.224 us | 2,722.3 us |
| ManagedAddSUPerf | 2,873.1 us | 25.477 us | 22.585 us | 2,871.1 us |
| NativeAddScaleUPerf | 334.3 us | 8.223 us | 8.076 us | 331.2 us |
| ManagedAddScaleUPerf | 334.6 us | 3.100 us | 2.748 us | 333.9 us |
| NativeAddScaleSUPerf | 2,729.2 us | 32.378 us | 30.286 us | 2,730.1 us |
| ManagedAddScaleSUPerf | 2,670.1 us | 29.478 us | 23.014 us | 2,662.7 us |
| NativeScaleUPerf | 140.0 us | 1.780 us | 1.390 us | 140.0 us |
| ManagedScaleUPerf | 143.3 us | 2.711 us | 2.784 us | 142.9 us |
| NativeDist2Perf | 350.2 us | 3.081 us | 2.573 us | 349.6 us |
| ManagedDist2Perf | 664.7 us | 2.621 us | 2.046 us | 664.6 us |
| NativeSumAbsUPerf | 271.8 us | 2.229 us | 1.741 us | 271.8 us |
| ManagedSumAbsUPerf | 600.1 us | 3.051 us | 2.854 us | 600.6 us |
| NativeMulElementWiseUPerf | 503.8 us | 9.875 us | 8.754 us | 501.3 us |
| ManagedMulElementWiseUPerf | 518.0 us | 25.485 us | 39.676 us | 498.5 us |

  2. Removed the dependency on Span<T> and went back to plain float array inputs.
    Conclusion: Not the main factor.

  3. Removed the dependency on the VectorSum function and went back to the original code.
    Conclusion: This is the main factor.
    Perf results after the fix:

BenchmarkDotNet=v0.10.14, OS=Windows 10.0.15063.1209 (1703/CreatorsUpdate/Redstone2)
Intel Core i7-7700 CPU 3.60GHz (Kaby Lake), 1 CPU, 8 logical and 4 physical cores
Frequency=3515626 Hz, Resolution=284.4444 ns, Timer=TSC
.NET Core SDK=3.0.100-alpha1-20180720-2
  [Host] : .NET Core 3.0.0-preview1-26710-03 (CoreCLR 4.6.26710.05, CoreFX 4.6.26708.04), 64bit RyuJIT

Toolchain=InProcessToolchain  LaunchCount=1  TargetCount=3
WarmupCount=3

| Method | Mean | Error | StdDev |
|---|---:|---:|---:|
| NativeDotUPerf | 550.7 us | 1,839.99 us | 103.963 us |
| ManagedDotUPerf | 486.4 us | 61.79 us | 3.492 us |
| NativeDotSUPerf | 2,446.8 us | 405.57 us | 22.915 us |
| ManagedDotSUPerf | 2,620.6 us | 219.16 us | 12.383 us |
| NativeSumSqUPerf | 569.0 us | 18.54 us | 1.047 us |
| ManagedSumSqUPerf | 579.5 us | 68.04 us | 3.845 us |
| NativeAddUPerf | 389.9 us | 562.21 us | 31.766 us |
| ManagedAddUPerf | 368.6 us | 48.36 us | 2.733 us |
| NativeAddSUPerf | 4,324.0 us | 10,768.20 us | 608.423 us |
| ManagedAddSUPerf | 3,118.3 us | 109.61 us | 6.193 us |
| NativeAddScaleUPerf | 512.1 us | 1,694.78 us | 95.758 us |
| ManagedAddScaleUPerf | 480.3 us | 252.98 us | 14.294 us |
| NativeAddScaleSUPerf | 3,425.0 us | 6,916.49 us | 390.795 us |
| ManagedAddScaleSUPerf | 3,161.6 us | 808.89 us | 45.704 us |
| NativeScaleUPerf | 153.5 us | 52.31 us | 2.955 us |
| ManagedScaleUPerf | 152.5 us | 59.76 us | 3.377 us |
| NativeDist2Perf | 394.8 us | 126.76 us | 7.162 us |
| ManagedDist2Perf | 386.7 us | 145.84 us | 8.240 us |
| NativeSumAbsUPerf | 304.7 us | 610.29 us | 34.483 us |
| ManagedSumAbsqUPerf | 291.5 us | 277.25 us | 15.665 us |
| NativeMulElementWiseUPerf | 563.3 us | 124.22 us | 7.018 us |
| ManagedMulElementWiseUPerf | 572.0 us | 295.43 us | 16.692 us |


helloguo commented Aug 3, 2018

Can you manually inline the VectorSum function and see what happens? I tested with two simple functions, and VectorSum does seem to make the difference: test1 calls VectorSum, test2 has the same reduction inlined by hand, and test2 has better codegen.

        static unsafe float test1(Span<float> src, Span<float> dst, int count)
        {
            Vector128<float> result = Sse.SetZeroVector128();

            fixed (float* psrc = src)
            fixed (float* pdst = dst)
            {
                float* pSrcCurrent = psrc;
                float* pDstCurrent = pdst;
                float* pEnd = psrc + src.Length;

                while (pSrcCurrent + 4 <= pEnd)
                {
                    Vector128<float> srcVector = Sse.LoadVector128(pSrcCurrent);
                    Vector128<float> dstVector = Sse.LoadVector128(pDstCurrent);

                    result = Sse.Add(result, Sse.Multiply(srcVector, dstVector));

                    pSrcCurrent += 4;
                    pDstCurrent += 4;
                }

                result = VectorSum(result);

                while (pSrcCurrent < pEnd)
                {
                    Vector128<float> srcVector = Sse.LoadScalarVector128(pSrcCurrent);
                    Vector128<float> dstVector = Sse.LoadScalarVector128(pDstCurrent);

                    result = Sse.AddScalar(result, Sse.MultiplyScalar(srcVector, dstVector));

                    pSrcCurrent++;
                    pDstCurrent++;
                }
            }

            return Sse.ConvertToSingle(result);
        }

        static unsafe float test2(Span<float> src, Span<float> dst, int count)
        {
            Vector128<float> result = Sse.SetZeroVector128();

            fixed (float* psrc = src)
            fixed (float* pdst = dst)
            {
                float* pSrcCurrent = psrc;
                float* pDstCurrent = pdst;
                float* pEnd = psrc + src.Length;

                while (pSrcCurrent + 4 <= pEnd)
                {
                    Vector128<float> srcVector = Sse.LoadVector128(pSrcCurrent);
                    Vector128<float> dstVector = Sse.LoadVector128(pDstCurrent);

                    result = Sse.Add(result, Sse.Multiply(srcVector, dstVector));

                    pSrcCurrent += 4;
                    pDstCurrent += 4;
                }


                if (Sse3.IsSupported)
                {
                    Vector128<float> tmp = Sse3.HorizontalAdd(result, result);
                    result = Sse3.HorizontalAdd(tmp, tmp);
                }
                else
                {
                    // SSE3 is not supported.
                    Vector128<float> tmp = Sse.Add(result, Sse.MoveHighToLow(result, result));
                    // The control byte shuffles the four 32-bit floats of tmp: ABCD -> BADC.
                    result = Sse.Add(tmp, Sse.Shuffle(tmp, tmp, 0xb1));
                }

                while (pSrcCurrent < pEnd)
                {
                    Vector128<float> srcVector = Sse.LoadScalarVector128(pSrcCurrent);
                    Vector128<float> dstVector = Sse.LoadScalarVector128(pDstCurrent);

                    result = Sse.AddScalar(result, Sse.MultiplyScalar(srcVector, dstVector));

                    pSrcCurrent++;
                    pDstCurrent++;
                }
            }

            return Sse.ConvertToSingle(result);
        }

codegen of test1:

Address	Source Line	Assembly	CPU Time	Instructions Retired
0x7ff8120e1ad7		vmovups xmm0, xmmword ptr [rdi]	88.272ms	20,800,000
0x7ff8120e1adc		vmovups xmm1, xmmword ptr [rsi]	2.006ms	7,800,000
0x7ff8120e1ae1		vmulps xmm0, xmm0, xmm1	2.006ms	10,400,000
0x7ff8120e1ae6		vaddps xmm0, xmm0, xmmword ptr [rsp+0x40]		
0x7ff8120e1aed		vmovapd xmmword ptr [rsp+0x40], xmm0	673.071ms	1,201,200,000
0x7ff8120e1af4		add rdi, 0x10	93.287ms	1,326,000,000
0x7ff8120e1af8		add rsi, 0x10		
0x7ff8120e1afc		lea rcx, ptr [rdi+0x10]		
0x7ff8120e1b00		cmp rcx, rbx	0ms	2,600,000
0x7ff8120e1b03		jbe 0x7ff8120e1ad7 <Block 10>	

codegen of test2:

Address	Source Line	Assembly	CPU Time	Instructions Retired
0x7ff8120e203c		vmovups xmm0, xmmword ptr [rdx]	98.302ms	62,400,000
0x7ff8120e2041		vmovups xmm1, xmmword ptr [rax]		
0x7ff8120e2046		vmulps xmm0, xmm0, xmm1		
0x7ff8120e204b		vaddps xmm6, xmm6, xmm0	11.034ms	135,200,000
0x7ff8120e2050		add rdx, 0x10	255.787ms	2,017,600,000
0x7ff8120e2054		add rax, 0x10		
0x7ff8120e2058		lea r8, ptr [rdx+0x10]		
0x7ff8120e205c		cmp r8, rcx	11.034ms	67,600,000
0x7ff8120e205f		jbe 0x7ff8120e203c <Block 10>	
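
In the two listings, the test1 loop adds from and stores back to [rsp+0x40] on every iteration, while the test2 loop keeps the accumulator in xmm6. One plausible reading is that the call to VectorSum keeps the accumulator from staying in a register. If a shared helper is still wanted, asking the JIT to inline it may be worth trying. The sketch below is an assumption, not the project's code: the body of VectorSum is not shown in this thread, so it is taken to match the reduction inlined by hand in test2, and the class name HorizontalSumSketch and the SumLanes wrapper are hypothetical. It uses names from the released System.Runtime.Intrinsics API (Vector128.Create, ToScalar) rather than the preview names in the snippets above.

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

public static class HorizontalSumSketch
{
    // Assumed body of VectorSum: the same reduction that test2 inlines by hand.
    // After the call, lane 0 of the returned vector holds the sum of all four lanes.
    // AggressiveInlining is only a request; the JIT may still decline to inline.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static Vector128<float> VectorSum(Vector128<float> vector)
    {
        if (Sse3.IsSupported)
        {
            Vector128<float> tmp = Sse3.HorizontalAdd(vector, vector);
            return Sse3.HorizontalAdd(tmp, tmp);
        }
        else
        {
            // SSE-only fallback: add the high half onto the low half, then
            // swap neighbouring lanes (ABCD -> BADC) and add again.
            Vector128<float> tmp = Sse.Add(vector, Sse.MoveHighToLow(vector, vector));
            return Sse.Add(tmp, Sse.Shuffle(tmp, tmp, 0xb1));
        }
    }

    // Hypothetical wrapper for trying the helper out; falls back to scalar
    // addition on hardware without SSE.
    public static float SumLanes(float a, float b, float c, float d)
    {
        if (!Sse.IsSupported)
        {
            return a + b + c + d;
        }

        return VectorSum(Vector128.Create(a, b, c, d)).ToScalar();
    }
}
```

Whether the attribute actually removes the stack traffic on a given runtime would need to be confirmed by looking at the disassembly again, as was done for test1 and test2.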

briancylui reopened this Aug 3, 2018

briancylui commented

Thank you to the commenters and my mentors for all the help and guidance. The perf results are now back to the expected behavior, so I am closing the issue.
