
Array.Sort perf regression with .NET Core 5.0 preview #35175

Closed
damageboy opened this issue Apr 19, 2020 · 11 comments · Fixed by #35297

@damageboy (Contributor) commented Apr 19, 2020

FYI, while preparing my next Array.Sort blog post I decided to run benchmarks against .NET Core 5.0...

I'm seeing a considerable perf drop (roughly 10–30%, depending on N) when comparing 3.1 to 5.0 preview 2.

My guess is that this is related to the Spanification PR, although the original PR may not have included enough benchmark points to exhibit this regression back when it was introduced.

FWIW, these are the results for benchmarking int32 sorting with my benchmark harness:

BenchmarkDotNet=v0.12.0.1329-nightly, OS=clear-linux-os 32870
Intel Core i7-7700HQ CPU 2.80GHz (Kaby Lake), 1 CPU, 4 logical and 4 physical cores
.NET Core SDK=5.0.100-preview.2.20181.10
  [Host]     : .NET Core 5.0.0 (CoreCLR 5.0.20.16006, CoreFX 5.0.20.16006), X64 RyuJIT
  Job-CHKSFG : .NET Core 5.0.0 (CoreCLR 5.0.20.16006, CoreFX 5.0.20.16006), X64 RyuJIT

InvocationCount=3  IterationCount=3  LaunchCount=1
UnrollFactor=1  WarmupCount=3
| Method    | Version                 |        N |      Mean [us] | Time / N [ns] |
|-----------|-------------------------|---------:|---------------:|--------------:|
| ArraySort | 3.1.201                 |      100 |       2.115 us |    21.1517 ns |
| ArraySort | 5.0.1-preview2.20181.10 |      100 |       2.755 us |    27.5539 ns |
| ArraySort | 3.1.201                 |     1000 |      36.746 us |    36.7463 ns |
| ArraySort | 5.0.1-preview2.20181.10 |     1000 |      49.097 us |    49.0969 ns |
| ArraySort | 3.1.201                 |    10000 |     543.148 us |    54.3148 ns |
| ArraySort | 5.0.1-preview2.20181.10 |    10000 |     596.996 us |    59.6996 ns |
| ArraySort | 3.1.201                 |   100000 |   6,669.767 us |    66.6977 ns |
| ArraySort | 5.0.1-preview2.20181.10 |   100000 |   7,519.011 us |    75.1901 ns |
| ArraySort | 3.1.201                 |  1000000 |  73,520.593 us |    73.5206 ns |
| ArraySort | 5.0.1-preview2.20181.10 |  1000000 |  83,483.225 us |    83.4832 ns |
| ArraySort | 3.1.201                 | 10000000 | 842,300.223 us |    84.2300 ns |
| ArraySort | 5.0.1-preview2.20181.10 | 10000000 | 963,871.867 us |    96.3872 ns |
@Dotnet-GitSync-Bot added the untriaged label Apr 19, 2020
@Dotnet-GitSync-Bot (Collaborator)

I couldn't figure out the best area label to add to this issue. Please help me learn by adding exactly one area label.

@EgorBo (Member) commented Apr 19, 2020

Can you re-run the benchmarks with [MemoryDiagnoser]?
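
For reference, the diagnoser is just an attribute on the benchmark class. A minimal, hypothetical sketch follows; the type and member names are placeholders, not the actual VxSort harness:

```csharp
using System;
using BenchmarkDotNet.Attributes;

[MemoryDiagnoser]            // adds the Gen 0/1/2 and Allocated columns to the summary
public class ArraySortBench  // placeholder name, not the real harness type
{
    [Params(100, 1_000, 10_000)]
    public int N;

    private int[] _data = Array.Empty<int>();
    private readonly Random _rng = new Random(42);

    [IterationSetup]
    public void FillWithRandomData()
    {
        // fresh random input per iteration so Array.Sort never sees already-sorted data
        _data = new int[N];
        for (int i = 0; i < N; i++)
            _data[i] = _rng.Next();
    }

    [Benchmark]
    public void ArraySort() => Array.Sort(_data);
}
```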

@AaronRobinsonMSFT (Member)

/cc @stephentoub

@saucecontrol (Member) commented Apr 19, 2020

Looks like the warmup and iteration counts are too low to ever hit Tier-1. The Span methods make fairly heavy use of BMI1 and LZCNT intrinsics, which are forced to a slow fallback path in the R2R corelib image. This should improve once #34550 is done, particularly if crossgen emit of LZCNT is disabled after and the explicit uses of BMI1 and LZCNT are switched to BitOperations calls.
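
To make the distinction concrete, here is a rough sketch of the two call patterns being contrasted (illustrative only, not the actual CoreLib code):

```csharp
using System.Numerics;
using System.Runtime.Intrinsics.X86;

static class LeadingZeroExamples
{
    // Guarded-intrinsic pattern: when JIT-compiled for the local hardware the
    // IsSupported branch folds away, but an R2R image limited to SSE2 keeps the
    // check and the scalar fallback as real code and calls.
    public static int ViaExplicitIntrinsic(uint value)
        => Lzcnt.IsSupported ? (int)Lzcnt.LeadingZeroCount(value)
                             : ScalarFallback(value);

    // The BitOperations form the comment suggests switching to: the hardware
    // dispatch lives in one helper instead of at every call site.
    public static int ViaBitOperations(uint value)
        => BitOperations.LeadingZeroCount(value);

    private static int ScalarFallback(uint value)
    {
        int count = 0;
        while (count < 32 && (value & 0x8000_0000u) == 0) { value <<= 1; count++; }
        return count;
    }
}
```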

@EgorBo (Member) commented Apr 19, 2020

> Looks like the warmup and iteration counts are too low to ever hit Tier-1. The Span methods make fairly heavy use of BMI1 and LZCNT intrinsics, which are forced to a slow fallback path in the R2R corelib image. This should improve once #34550 is done, particularly if crossgen emit of LZCNT is disabled after and the explicit uses of BMI1 and LZCNT are switched to BitOperations calls.

I guess BDN needs some sort of event from the JIT ("method X was promoted to tier 1") to produce better results. There have been a couple of such feature requests somewhere.

@saucecontrol (Member)

Yep, I usually disable tiered JIT in benchmarks when testing my own code, but that's counterproductive when benchmarking CoreLib code since you then just keep the R2R code forever. The only solution I know of for now is to make sure your warmup hits the magical 30 iterations.
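
For anyone who wants to go the long-warmup route, a minimal config sketch (assuming BDN's ManualConfig/Job extension APIs; names may differ slightly across BDN versions):

```csharp
using BenchmarkDotNet.Configs;
using BenchmarkDotNet.Jobs;

public class GenerousWarmupConfig : ManualConfig
{
    public GenerousWarmupConfig()
    {
        // Plenty of warmup iterations so tiered compilation has a chance to
        // promote the CoreLib methods to tier-1 before measurement starts.
        // (The alternative of setting COMPlus_TieredCompilation=0 on the
        // benchmark process pins CoreLib to its R2R code, as noted above.)
        Add(Job.Default.WithWarmupCount(30));
    }
}
```

It would be applied to a benchmark class with [Config(typeof(GenerousWarmupConfig))].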

@damageboy (Contributor, Author) commented Apr 20, 2020

@saucecontrol I don't think that's the case here, since BDN still does a considerable amount of looping to warm up the JIT, and all of my other results (for my own C# code) do not budge that much for the shorter runs; the longer runs are mostly about reducing the number of outliers.

Also, keep in mind that the benchmark harness I linked to does 3 invocations per iteration:
https://github.com/damageboy/VxSort/blob/11e83bfed45d9359d761ce1adb9894fd658d77cd/Bench/QuickSortBench.cs#L104

At any rate, here are the results with a longer setup:

Setup

For 3.1.201:

BenchmarkDotNet=v0.12.0.1329-nightly, OS=clear-linux-os 32870
Intel Core i7-7700HQ CPU 2.80GHz (Kaby Lake), 1 CPU, 4 logical and 4 physical cores
.NET Core SDK=5.0.100-preview.2.20181.10
  [Host]     : .NET Core 3.1.2 (CoreCLR 4.700.20.6602, CoreFX 4.700.20.6702), X64 RyuJIT
  Job-GLGTXW : .NET Core 3.1.2 (CoreCLR 4.700.20.6602, CoreFX 4.700.20.6702), X64 RyuJIT

InvocationCount=3  IterationCount=15  LaunchCount=2
UnrollFactor=1  WarmupCount=10

For 5.0 preview2:

BenchmarkDotNet=v0.12.0.1329-nightly, OS=clear-linux-os 32870
Intel Core i7-7700HQ CPU 2.80GHz (Kaby Lake), 1 CPU, 4 logical and 4 physical cores
.NET Core SDK=5.0.100-preview.2.20181.10
  [Host]     : .NET Core 5.0.0 (CoreCLR 5.0.20.16006, CoreFX 5.0.20.16006), X64 RyuJIT
  Job-OKHTIU : .NET Core 5.0.0 (CoreCLR 5.0.20.16006, CoreFX 5.0.20.16006), X64 RyuJIT

InvocationCount=3  IterationCount=15  LaunchCount=2
UnrollFactor=1  WarmupCount=10
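
(A job with the characteristics listed above could be declared in BDN roughly as follows; this is a sketch only, the actual harness configures its jobs its own way, and extension names can vary a little between BDN versions:)

```csharp
using BenchmarkDotNet.Configs;
using BenchmarkDotNet.Jobs;

public class LongerRunConfig : ManualConfig
{
    public LongerRunConfig()
    {
        Add(Job.Default
            .WithLaunchCount(2)       // LaunchCount=2
            .WithWarmupCount(10)      // WarmupCount=10
            .WithIterationCount(15)   // IterationCount=15
            .WithInvocationCount(3)   // InvocationCount=3
            .WithUnrollFactor(1));    // UnrollFactor=1
    }
}
```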

Results

| Method    | Version                 |        N |      Mean [us] | Time / N [ns] |
|-----------|-------------------------|---------:|---------------:|--------------:|
| ArraySort | 3.1.201                 |      100 |       1.993 us |    19.9271 ns |
| ArraySort | 5.0.1-preview2.20181.10 |      100 |       2.352 us |    23.5182 ns |
| ArraySort | 3.1.201                 |     1000 |      38.034 us |    38.0343 ns |
| ArraySort | 5.0.1-preview2.20181.10 |     1000 |      42.208 us |    42.2077 ns |
| ArraySort | 3.1.201                 |    10000 |     514.081 us |    51.4081 ns |
| ArraySort | 5.0.1-preview2.20181.10 |    10000 |     586.367 us |    58.6367 ns |
| ArraySort | 3.1.201                 |   100000 |   6,406.406 us |    64.0641 ns |
| ArraySort | 5.0.1-preview2.20181.10 |   100000 |   6,751.723 us |    67.5172 ns |
| ArraySort | 3.1.201                 |  1000000 |  69,250.935 us |    69.2509 ns |
| ArraySort | 5.0.1-preview2.20181.10 |  1000000 |  77,663.890 us |    77.6639 ns |
| ArraySort | 3.1.201                 | 10000000 | 795,386.618 us |    79.5387 ns |
| ArraySort | 5.0.1-preview2.20181.10 | 10000000 | 901,314.586 us |    90.1315 ns |

To me, this looks even slightly worse, even though both "targets" look better than in the top comment.
The main difference (and the reason for this, IMO) is that this time I bothered to turn on the performance CPU governor in Linux and drop to console mode, killing almost every other process / L3-cache user in the system for the duration of the run. So while both result sets now look better in absolute terms compared to the original comment/post, they retain the same, if not a slightly worse, relative perf gap.

@damageboy (Contributor, Author) commented Apr 20, 2020

@EgorBo Here are the results for 5.0p2 with [MemoryDiagnoser]:

| Method    |        N |   Mean [us] | Error [us] | StdDev [us] | Time / N [ns] | Median [us] | Gen 0 | Gen 1 | Gen 2 | Allocated [B] |
|-----------|---------:|------------:|-----------:|------------:|--------------:|------------:|------:|------:|------:|--------------:|
| ArraySort |      100 |       3.451 |      1.461 |       2.048 |       34.5057 |       2.316 |     - |     - |     - |             - |
| ArraySort |     1000 |      44.809 |      2.732 |       3.830 |       44.8088 |      42.600 |     - |     - |     - |             - |
| ArraySort |    10000 |     591.393 |      9.126 |      13.660 |       59.1393 |     589.615 |     - |     - |     - |             - |
| ArraySort |   100000 |   6,722.189 |     32.602 |      46.757 |       67.2219 |   6,705.429 |     - |     - |     - |             - |
| ArraySort |  1000000 |  78,358.818 |    754.056 |   1,057.081 |       78.3588 |  77,816.009 |     - |     - |     - |         776 B |
| ArraySort | 10000000 | 913,850.467 |  6,249.496 |   9,353.951 |       91.3850 | 908,485.036 |     - |     - |     - |             - |

And for 3.1.201 (which should really not matter, I think, but why not):

| Method    |        N |   Mean [us] | Error [us] | StdDev [us] | Time / N [ns] | Median [us] | Gen 0 | Gen 1 | Gen 2 | Allocated [B] |
|-----------|---------:|------------:|-----------:|------------:|--------------:|------------:|------:|------:|------:|--------------:|
| ArraySort |      100 |       2.972 |     1.2903 |      1.8912 |       29.7211 |       2.044 |     - |     - |     - |             - |
| ArraySort |     1000 |      38.300 |     2.8583 |      4.1897 |       38.2999 |      36.078 |     - |     - |     - |             - |
| ArraySort |    10000 |     520.251 |     9.6589 |     14.4570 |       52.0251 |     520.268 |     - |     - |     - |             - |
| ArraySort |   100000 |   6,173.007 |   125.6716 |    188.0994 |       61.7301 |   6,091.730 |     - |     - |     - |             - |
| ArraySort |  1000000 |  69,319.559 |   152.7283 |    214.1038 |       69.3196 |  69,238.988 |     - |     - |     - |        2843 B |
| ArraySort | 10000000 | 801,735.753 | 6,431.5074 |  9,626.3774 |       80.1736 | 795,883.073 |     - |     - |     - |        3003 B |

@saucecontrol (Member) commented Apr 20, 2020

Why is the Allocated column in GB? Of course that'll be 0. And where are the Error and StdDev columns? Hard to tell if the run reached steady state without those.

> all of my other results (For my own C# code) do not budge that much for the shorter runs

The really big difference there is that your own HWIntrinsics code gets JIT-compiled the first time it runs, for your own hardware. The CoreLib image is Ready2Run-compiled, which is limited to SSE2 instructions. Any place in CoreLib where BMI1 instructions are used will instead pre-JIT the managed fallback code, and anywhere LZCNT is used, both the IsSupported property and the LeadingZeroCount method are compiled as full method calls rather than a single instruction. I believe Span also uses some SSE4.1 intrinsics, which get the same treatment as LZCNT.

Unless you're certain your benchmark is running enough iterations to trigger a Tier1 recompile before measuring anything, you're not measuring the right thing. It's possible you're reaching that state (in which case my theory is completely wrong), but there's not enough information in your results tables to see that.
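
One quick way to take R2R out of the equation entirely would be to launch the benchmark with COMPlus_ReadyToRun=0 so CoreLib gets JIT-compiled for the local hardware, and compare against a normal run. A hypothetical driver sketch (the Bench project name comes from the repo linked above; the environment switches are the standard CoreCLR knobs):

```csharp
using System.Diagnostics;

// Run the benchmark project once with ReadyToRun images disabled, so CoreLib is
// JIT-compiled for the local CPU instead of using the SSE2-limited R2R code.
var psi = new ProcessStartInfo("dotnet", "run -c Release --project Bench")
{
    UseShellExecute = false, // required so the Environment dictionary is honored
};
psi.Environment["COMPlus_ReadyToRun"] = "0";        // skip the precompiled CoreLib code
psi.Environment["COMPlus_TieredCompilation"] = "0"; // optional: measure only fully optimized code
Process.Start(psi)?.WaitForExit();
```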

@damageboy (Contributor, Author) commented Apr 20, 2020

@saucecontrol
I've forced BDN to use bytes as the unit, updated the tables (above) accordingly, and added back the rest of the columns I originally omitted for brevity.

As for JITing / tiered compilation, here is a log from a full run:

// BeforeAnythingElse

// Benchmark Process Environment Information:
// Runtime=.NET Core 5.0.0 (CoreCLR 5.0.20.16006, CoreFX 5.0.20.16006), X64 RyuJIT
// GC=Concurrent Workstation
// Job: Job-VKJPFJ(InvocationCount=3, IterationCount=15, LaunchCount=2, UnrollFactor=1, WarmupCount=10)

OverheadJitting  1: 1 op, 212143.00 ns, 212.1430 us/op
WorkloadJitting  1: 1 op, 911815580.00 ns, 911.8156 ms/op

OverheadWarmup   1: 3 op, 300.00 ns, 100.0000 ns/op
OverheadWarmup   2: 3 op, 282.00 ns, 94.0000 ns/op
OverheadWarmup   3: 3 op, 251.00 ns, 83.6667 ns/op
OverheadWarmup   4: 3 op, 204.00 ns, 68.0000 ns/op
OverheadWarmup   5: 3 op, 271.00 ns, 90.3333 ns/op
OverheadWarmup   6: 3 op, 171.00 ns, 57.0000 ns/op
OverheadWarmup   7: 3 op, 241.00 ns, 80.3333 ns/op
OverheadWarmup   8: 3 op, 201.00 ns, 67.0000 ns/op

OverheadActual   1: 3 op, 217.00 ns, 72.3333 ns/op
OverheadActual   2: 3 op, 947.00 ns, 315.6667 ns/op
OverheadActual   3: 3 op, 224.00 ns, 74.6667 ns/op
OverheadActual   4: 3 op, 242.00 ns, 80.6667 ns/op
OverheadActual   5: 3 op, 227.00 ns, 75.6667 ns/op
OverheadActual   6: 3 op, 217.00 ns, 72.3333 ns/op
OverheadActual   7: 3 op, 241.00 ns, 80.3333 ns/op
OverheadActual   8: 3 op, 236.00 ns, 78.6667 ns/op
OverheadActual   9: 3 op, 228.00 ns, 76.0000 ns/op
OverheadActual  10: 3 op, 222.00 ns, 74.0000 ns/op
OverheadActual  11: 3 op, 192.00 ns, 64.0000 ns/op
OverheadActual  12: 3 op, 263.00 ns, 87.6667 ns/op
OverheadActual  13: 3 op, 187.00 ns, 62.3333 ns/op
OverheadActual  14: 3 op, 214.00 ns, 71.3333 ns/op
OverheadActual  15: 3 op, 214.00 ns, 71.3333 ns/op
OverheadActual  16: 3 op, 218.00 ns, 72.6667 ns/op
OverheadActual  17: 3 op, 220.00 ns, 73.3333 ns/op
OverheadActual  18: 3 op, 231.00 ns, 77.0000 ns/op
OverheadActual  19: 3 op, 236.00 ns, 78.6667 ns/op
OverheadActual  20: 3 op, 267.00 ns, 89.0000 ns/op

WorkloadWarmup   1: 3 op, 2731248877.00 ns, 910.4163 ms/op
WorkloadWarmup   2: 3 op, 2785353507.00 ns, 928.4512 ms/op
WorkloadWarmup   3: 3 op, 2720062975.00 ns, 906.6877 ms/op
WorkloadWarmup   4: 3 op, 2719255895.00 ns, 906.4186 ms/op
WorkloadWarmup   5: 3 op, 2717183480.00 ns, 905.7278 ms/op
WorkloadWarmup   6: 3 op, 2788945759.00 ns, 929.6486 ms/op
WorkloadWarmup   7: 3 op, 2756700118.00 ns, 918.9000 ms/op
WorkloadWarmup   8: 3 op, 2723493732.00 ns, 907.8312 ms/op
WorkloadWarmup   9: 3 op, 2716591674.00 ns, 905.5306 ms/op
WorkloadWarmup  10: 3 op, 2786646262.00 ns, 928.8821 ms/op

// BeforeActualRun
WorkloadActual   1: 3 op, 2719270008.00 ns, 906.4233 ms/op
WorkloadActual   2: 3 op, 2727867484.00 ns, 909.2892 ms/op
WorkloadActual   3: 3 op, 2752234607.00 ns, 917.4115 ms/op
WorkloadActual   4: 3 op, 2746344962.00 ns, 915.4483 ms/op
WorkloadActual   5: 3 op, 2721176649.00 ns, 907.0589 ms/op
WorkloadActual   6: 3 op, 2716237772.00 ns, 905.4126 ms/op
WorkloadActual   7: 3 op, 2783251202.00 ns, 927.7504 ms/op
WorkloadActual   8: 3 op, 2757663609.00 ns, 919.2212 ms/op
WorkloadActual   9: 3 op, 2723043182.00 ns, 907.6811 ms/op
WorkloadActual  10: 3 op, 2716898888.00 ns, 905.6330 ms/op
WorkloadActual  11: 3 op, 2791437881.00 ns, 930.4793 ms/op
WorkloadActual  12: 3 op, 2718779663.00 ns, 906.2599 ms/op
WorkloadActual  13: 3 op, 2719404023.00 ns, 906.4680 ms/op
WorkloadActual  14: 3 op, 2727891316.00 ns, 909.2971 ms/op
WorkloadActual  15: 3 op, 2793246803.00 ns, 931.0823 ms/op

// AfterActualRun
WorkloadResult   1: 3 op, 2719269782.50 ns, 906.4233 ms/op
WorkloadResult   2: 3 op, 2727867258.50 ns, 909.2891 ms/op
WorkloadResult   3: 3 op, 2752234381.50 ns, 917.4115 ms/op
WorkloadResult   4: 3 op, 2746344736.50 ns, 915.4482 ms/op
WorkloadResult   5: 3 op, 2721176423.50 ns, 907.0588 ms/op
WorkloadResult   6: 3 op, 2716237546.50 ns, 905.4125 ms/op
WorkloadResult   7: 3 op, 2783250976.50 ns, 927.7503 ms/op
WorkloadResult   8: 3 op, 2757663383.50 ns, 919.2211 ms/op
WorkloadResult   9: 3 op, 2723042956.50 ns, 907.6810 ms/op
WorkloadResult  10: 3 op, 2716898662.50 ns, 905.6329 ms/op
WorkloadResult  11: 3 op, 2791437655.50 ns, 930.4792 ms/op
WorkloadResult  12: 3 op, 2718779437.50 ns, 906.2598 ms/op
WorkloadResult  13: 3 op, 2719403797.50 ns, 906.4679 ms/op
WorkloadResult  14: 3 op, 2727891090.50 ns, 909.2970 ms/op
WorkloadResult  15: 3 op, 2793246577.50 ns, 931.0822 ms/op

From what I understand, all the iterations (per launch/process) that occur before the // BeforeActualRun marker are various forms of warmup performed by BDN, and only the results after that marker are recorded and analyzed as part of the results I'm showing here.

I think there are enough iterations in there (I count 30 invocations, 10 iterations × 3 invocations, just in the WorkloadWarmup phase) for tiered compilation to kick in, but you can judge for yourself.

The tests were originally adapted (lifted) from the dotnet/performance repo and contain some random data generation, which IIRC is the main contributor to the error/stddev columns.

But more importantly, you can see that those columns are pretty much in line across the two .NET versions.

@saucecontrol (Member)

Thanks for the update. Yeah, those results look reasonable, which shoots my theory down.

BTW, I read the first 3 parts of your blog series on array sort and enjoyed them immensely. Looking forward to the rest :)

@stephentoub self-assigned this Apr 22, 2020
@stephentoub removed the untriaged label Apr 22, 2020
@stephentoub added this to the 5.0 milestone Apr 22, 2020
@ghost locked as resolved and limited conversation to collaborators Dec 9, 2020