
Array.Sort perf regression with .NET Core 5.0 preview #35175

Closed
damageboy opened this issue Apr 19, 2020 · 11 comments · Fixed by #35297

@damageboy (Contributor) commented Apr 19, 2020

FYI, while preparing my next Array.Sort blog post I decided to run benchmarks against .NET Core 5.0...

I'm seeing a considerable perf drop (roughly 10–30%, depending on N) when comparing 3.1 to 5.0 preview 2.

My guess is that this is related to the Spanification PR, although the original PR may not have included enough benchmark points to exhibit this regression back when it was introduced.

FWIW, these are the results for benchmarking int32 sorting with my benchmark harness:

BenchmarkDotNet=v0.12.0.1329-nightly, OS=clear-linux-os 32870
Intel Core i7-7700HQ CPU 2.80GHz (Kaby Lake), 1 CPU, 4 logical and 4 physical cores
.NET Core SDK=5.0.100-preview.2.20181.10
  [Host]     : .NET Core 5.0.0 (CoreCLR 5.0.20.16006, CoreFX 5.0.20.16006), X64 RyuJIT
  Job-CHKSFG : .NET Core 5.0.0 (CoreCLR 5.0.20.16006, CoreFX 5.0.20.16006), X64 RyuJIT

InvocationCount=3  IterationCount=3  LaunchCount=1
UnrollFactor=1  WarmupCount=3
| Method    | Version                 |        N |      Mean [us] | Time / N [ns] |
|-----------|-------------------------|---------:|---------------:|--------------:|
| ArraySort | 3.1.201                 |      100 |       2.115 us |    21.1517 ns |
| ArraySort | 5.0.1-preview2.20181.10 |      100 |       2.755 us |    27.5539 ns |
| ArraySort | 3.1.201                 |     1000 |      36.746 us |    36.7463 ns |
| ArraySort | 5.0.1-preview2.20181.10 |     1000 |      49.097 us |    49.0969 ns |
| ArraySort | 3.1.201                 |    10000 |     543.148 us |    54.3148 ns |
| ArraySort | 5.0.1-preview2.20181.10 |    10000 |     596.996 us |    59.6996 ns |
| ArraySort | 3.1.201                 |   100000 |   6,669.767 us |    66.6977 ns |
| ArraySort | 5.0.1-preview2.20181.10 |   100000 |   7,519.011 us |    75.1901 ns |
| ArraySort | 3.1.201                 |  1000000 |  73,520.593 us |    73.5206 ns |
| ArraySort | 5.0.1-preview2.20181.10 |  1000000 |  83,483.225 us |    83.4832 ns |
| ArraySort | 3.1.201                 | 10000000 | 842,300.223 us |    84.2300 ns |
| ArraySort | 5.0.1-preview2.20181.10 | 10000000 | 963,871.867 us |    96.3872 ns |
@Dotnet-GitSync-Bot added the untriaged label Apr 19, 2020
@Dotnet-GitSync-Bot (Collaborator)

I couldn't figure out the best area label to add to this issue. Please help me learn by adding exactly one area label.

@EgorBo (Member) commented Apr 19, 2020

Can you re-run the benchmarks with [MemoryDiagnoser]?
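
For reference, the diagnoser is just an attribute on the benchmark class. A minimal, hypothetical sketch follows; the type and member names are placeholders, not the actual VxSort harness:

```csharp
using System;
using BenchmarkDotNet.Attributes;

[MemoryDiagnoser]            // adds the Gen 0/1/2 and Allocated columns to the summary
public class ArraySortBench  // placeholder name, not the real harness type
{
    [Params(100, 1_000, 10_000)]
    public int N;

    private int[] _data = Array.Empty<int>();
    private readonly Random _rng = new Random(42);

    [IterationSetup]
    public void FillWithRandomData()
    {
        // fresh random input per iteration so Array.Sort never sees already-sorted data
        _data = new int[N];
        for (int i = 0; i < N; i++)
            _data[i] = _rng.Next();
    }

    [Benchmark]
    public void ArraySort() => Array.Sort(_data);
}
```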

@AaronRobinsonMSFT (Member)

/cc @stephentoub

@saucecontrol (Member) commented Apr 19, 2020

Looks like the warmup and iteration counts are too low to ever hit Tier-1. The Span methods make fairly heavy use of BMI1 and LZCNT intrinsics, which are forced to a slow fallback path in the R2R corelib image. This should improve once #34550 is done, particularly if crossgen emit of LZCNT is disabled after and the explicit uses of BMI1 and LZCNT are switched to BitOperations calls.
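
To make the distinction concrete, here is a rough sketch of the two call patterns being contrasted (illustrative only, not the actual CoreLib code):

```csharp
using System.Numerics;
using System.Runtime.Intrinsics.X86;

static class LeadingZeroExamples
{
    // Guarded-intrinsic pattern: when JIT-compiled for the local hardware the
    // IsSupported branch folds away, but an R2R image limited to SSE2 keeps the
    // check and the scalar fallback as real code and calls.
    public static int ViaExplicitIntrinsic(uint value)
        => Lzcnt.IsSupported ? (int)Lzcnt.LeadingZeroCount(value)
                             : ScalarFallback(value);

    // The BitOperations form the comment suggests switching to: the hardware
    // dispatch lives in one helper instead of at every call site.
    public static int ViaBitOperations(uint value)
        => BitOperations.LeadingZeroCount(value);

    private static int ScalarFallback(uint value)
    {
        int count = 0;
        while (count < 32 && (value & 0x8000_0000u) == 0) { value <<= 1; count++; }
        return count;
    }
}
```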

@EgorBo (Member) commented Apr 19, 2020

> Looks like the warmup and iteration counts are too low to ever hit Tier-1. The Span methods make fairly heavy use of BMI1 and LZCNT intrinsics, which are forced to a slow fallback path in the R2R corelib image. This should improve once #34550 is done, particularly if crossgen emit of LZCNT is disabled after and the explicit uses of BMI1 and LZCNT are switched to BitOperations calls.

I guess BDN needs some sort of event from the JIT ("method X was promoted to tier 1") to produce better results. There have been a couple of such feature requests somewhere.

@saucecontrol (Member)

Yep, I usually disable tiered JIT in benchmarks when testing my own code, but that's counterproductive when benchmarking CoreLib code since you then just keep the R2R code forever. The only solution I know of for now is to make sure your warmup hits the magical 30 iterations.
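
For anyone who wants to go the long-warmup route, a minimal config sketch (assuming BDN's ManualConfig/Job extension APIs; names may differ slightly across BDN versions):

```csharp
using BenchmarkDotNet.Configs;
using BenchmarkDotNet.Jobs;

public class GenerousWarmupConfig : ManualConfig
{
    public GenerousWarmupConfig()
    {
        // Plenty of warmup iterations so tiered compilation has a chance to
        // promote the CoreLib methods to tier-1 before measurement starts.
        // (The alternative of setting COMPlus_TieredCompilation=0 on the
        // benchmark process pins CoreLib to its R2R code, as noted above.)
        Add(Job.Default.WithWarmupCount(30));
    }
}
```

It would be applied to a benchmark class with [Config(typeof(GenerousWarmupConfig))].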

@damageboy (Contributor, Author) commented Apr 20, 2020

@saucecontrol I don't think that's the case here, since BDN still does a considerable amount of looping to warm up the JIT, and all of my other results (for my own C# code) do not budge that much for the shorter runs; the longer runs are mostly about reducing the number of outliers.

Also, keep in mind that the benchmark harness I linked to does 3 invocations per iteration:
https://github.com/damageboy/VxSort/blob/11e83bfed45d9359d761ce1adb9894fd658d77cd/Bench/QuickSortBench.cs#L104

At any rate, here are the results with a longer setup:

Setup

For 3.1.201:

BenchmarkDotNet=v0.12.0.1329-nightly, OS=clear-linux-os 32870
Intel Core i7-7700HQ CPU 2.80GHz (Kaby Lake), 1 CPU, 4 logical and 4 physical cores
.NET Core SDK=5.0.100-preview.2.20181.10
  [Host]     : .NET Core 3.1.2 (CoreCLR 4.700.20.6602, CoreFX 4.700.20.6702), X64 RyuJIT
  Job-GLGTXW : .NET Core 3.1.2 (CoreCLR 4.700.20.6602, CoreFX 4.700.20.6702), X64 RyuJIT

InvocationCount=3  IterationCount=15  LaunchCount=2
UnrollFactor=1  WarmupCount=10

For 5.0 preview2:

BenchmarkDotNet=v0.12.0.1329-nightly, OS=clear-linux-os 32870
Intel Core i7-7700HQ CPU 2.80GHz (Kaby Lake), 1 CPU, 4 logical and 4 physical cores
.NET Core SDK=5.0.100-preview.2.20181.10
  [Host]     : .NET Core 5.0.0 (CoreCLR 5.0.20.16006, CoreFX 5.0.20.16006), X64 RyuJIT
  Job-OKHTIU : .NET Core 5.0.0 (CoreCLR 5.0.20.16006, CoreFX 5.0.20.16006), X64 RyuJIT

InvocationCount=3  IterationCount=15  LaunchCount=2
UnrollFactor=1  WarmupCount=10
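
(A job with the characteristics listed above could be declared in BDN roughly as follows; this is a sketch only, the actual harness configures its jobs its own way, and extension names can vary a little between BDN versions:)

```csharp
using BenchmarkDotNet.Configs;
using BenchmarkDotNet.Jobs;

public class LongerRunConfig : ManualConfig
{
    public LongerRunConfig()
    {
        Add(Job.Default
            .WithLaunchCount(2)       // LaunchCount=2
            .WithWarmupCount(10)      // WarmupCount=10
            .WithIterationCount(15)   // IterationCount=15
            .WithInvocationCount(3)   // InvocationCount=3
            .WithUnrollFactor(1));    // UnrollFactor=1
    }
}
```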

Results

| Method    | Version                 |        N |      Mean [us] | Time / N [ns] |
|-----------|-------------------------|---------:|---------------:|--------------:|
| ArraySort | 3.1.201                 |      100 |       1.993 us |    19.9271 ns |
| ArraySort | 5.0.1-preview2.20181.10 |      100 |       2.352 us |    23.5182 ns |
| ArraySort | 3.1.201                 |     1000 |      38.034 us |    38.0343 ns |
| ArraySort | 5.0.1-preview2.20181.10 |     1000 |      42.208 us |    42.2077 ns |
| ArraySort | 3.1.201                 |    10000 |     514.081 us |    51.4081 ns |
| ArraySort | 5.0.1-preview2.20181.10 |    10000 |     586.367 us |    58.6367 ns |
| ArraySort | 3.1.201                 |   100000 |   6,406.406 us |    64.0641 ns |
| ArraySort | 5.0.1-preview2.20181.10 |   100000 |   6,751.723 us |    67.5172 ns |
| ArraySort | 3.1.201                 |  1000000 |  69,250.935 us |    69.2509 ns |
| ArraySort | 5.0.1-preview2.20181.10 |  1000000 |  77,663.890 us |    77.6639 ns |
| ArraySort | 3.1.201                 | 10000000 | 795,386.618 us |    79.5387 ns |
| ArraySort | 5.0.1-preview2.20181.10 | 10000000 | 901,314.586 us |    90.1315 ns |

To me, this looks even slightly worse, even though both "targets" look better than in the top comment.
The main difference (and the reason for this, IMO) is that this time I bothered to turn on the performance CPU governor in Linux and drop to console mode, killing almost every other process / L3-cache user in the system for the duration of the run. So while both result sets now look better in absolute terms compared to the original comment/post, they retain the same, if not a slightly worse, relative perf gap.

@damageboy (Contributor, Author) commented Apr 20, 2020

@EgorBo Here are the results for 5.0p2 with [MemoryDiagnoser]:

| Method    |        N |   Mean [us] | Error [us] | StdDev [us] | Time / N [ns] | Median [us] | Gen 0 | Gen 1 | Gen 2 | Allocated [B] |
|-----------|---------:|------------:|-----------:|------------:|--------------:|------------:|------:|------:|------:|--------------:|
| ArraySort |      100 |       3.451 |      1.461 |       2.048 |       34.5057 |       2.316 |     - |     - |     - |             - |
| ArraySort |     1000 |      44.809 |      2.732 |       3.830 |       44.8088 |      42.600 |     - |     - |     - |             - |
| ArraySort |    10000 |     591.393 |      9.126 |      13.660 |       59.1393 |     589.615 |     - |     - |     - |             - |
| ArraySort |   100000 |   6,722.189 |     32.602 |      46.757 |       67.2219 |   6,705.429 |     - |     - |     - |             - |
| ArraySort |  1000000 |  78,358.818 |    754.056 |   1,057.081 |       78.3588 |  77,816.009 |     - |     - |     - |         776 B |
| ArraySort | 10000000 | 913,850.467 |  6,249.496 |   9,353.951 |       91.3850 | 908,485.036 |     - |     - |     - |             - |

And for 3.1.201 (which should really not matter, I think, but why not):

| Method    |        N |   Mean [us] | Error [us] | StdDev [us] | Time / N [ns] | Median [us] | Gen 0 | Gen 1 | Gen 2 | Allocated [B] |
|-----------|---------:|------------:|-----------:|------------:|--------------:|------------:|------:|------:|------:|--------------:|
| ArraySort |      100 |       2.972 |     1.2903 |      1.8912 |       29.7211 |       2.044 |     - |     - |     - |             - |
| ArraySort |     1000 |      38.300 |     2.8583 |      4.1897 |       38.2999 |      36.078 |     - |     - |     - |             - |
| ArraySort |    10000 |     520.251 |     9.6589 |     14.4570 |       52.0251 |     520.268 |     - |     - |     - |             - |
| ArraySort |   100000 |   6,173.007 |   125.6716 |    188.0994 |       61.7301 |   6,091.730 |     - |     - |     - |             - |
| ArraySort |  1000000 |  69,319.559 |   152.7283 |    214.1038 |       69.3196 |  69,238.988 |     - |     - |     - |        2843 B |
| ArraySort | 10000000 | 801,735.753 | 6,431.5074 |  9,626.3774 |       80.1736 | 795,883.073 |     - |     - |     - |        3003 B |

@saucecontrol (Member) commented Apr 20, 2020

Why is the Allocated column in GB? Of course that'll be 0. And where are the Error and StdDev columns? Hard to tell if the run reached steady state without those.

> all of my other results (For my own C# code) do not budge that much for the shorter runs

The really big difference there is that your own HWIntrinsics code gets JIT-compiled the first time it runs, for your own hardware. The CoreLib image is Ready2Run-compiled, which is limited to SSE2 instructions. Any place in CoreLib where BMI1 instructions are used will instead pre-JIT the managed fallback code, and anywhere LZCNT is used, both the IsSupported property and the LeadingZeroCount method are compiled as full method calls rather than a single instruction. I believe Span also uses some SSE4.1 intrinsics, which get the same treatment as LZCNT.

Unless you're certain your benchmark is running enough iterations to trigger a Tier1 recompile before measuring anything, you're not measuring the right thing. It's possible you're reaching that state (in which case my theory is completely wrong), but there's not enough information in your results tables to see that.
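
One quick way to take R2R out of the equation entirely would be to launch the benchmark with COMPlus_ReadyToRun=0 so CoreLib gets JIT-compiled for the local hardware, and compare against a normal run. A hypothetical driver sketch (the Bench project name comes from the repo linked above; the environment switches are the standard CoreCLR knobs):

```csharp
using System.Diagnostics;

// Run the benchmark project once with ReadyToRun images disabled, so CoreLib is
// JIT-compiled for the local CPU instead of using the SSE2-limited R2R code.
var psi = new ProcessStartInfo("dotnet", "run -c Release --project Bench")
{
    UseShellExecute = false, // required so the Environment dictionary is honored
};
psi.Environment["COMPlus_ReadyToRun"] = "0";        // skip the precompiled CoreLib code
psi.Environment["COMPlus_TieredCompilation"] = "0"; // optional: measure only fully optimized code
Process.Start(psi)?.WaitForExit();
```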

@damageboy (Contributor, Author) commented Apr 20, 2020

@saucecontrol
I've forced BDN to use bytes as the unit, updated the tables (above) accordingly, and added back the rest of the columns I originally omitted for brevity.

As for JITing / tiered compilation, here is a log from a full run:

// BeforeAnythingElse

// Benchmark Process Environment Information:
// Runtime=.NET Core 5.0.0 (CoreCLR 5.0.20.16006, CoreFX 5.0.20.16006), X64 RyuJIT
// GC=Concurrent Workstation
// Job: Job-VKJPFJ(InvocationCount=3, IterationCount=15, LaunchCount=2, UnrollFactor=1, WarmupCount=10)

OverheadJitting  1: 1 op, 212143.00 ns, 212.1430 us/op
WorkloadJitting  1: 1 op, 911815580.00 ns, 911.8156 ms/op

OverheadWarmup   1: 3 op, 300.00 ns, 100.0000 ns/op
OverheadWarmup   2: 3 op, 282.00 ns, 94.0000 ns/op
OverheadWarmup   3: 3 op, 251.00 ns, 83.6667 ns/op
OverheadWarmup   4: 3 op, 204.00 ns, 68.0000 ns/op
OverheadWarmup   5: 3 op, 271.00 ns, 90.3333 ns/op
OverheadWarmup   6: 3 op, 171.00 ns, 57.0000 ns/op
OverheadWarmup   7: 3 op, 241.00 ns, 80.3333 ns/op
OverheadWarmup   8: 3 op, 201.00 ns, 67.0000 ns/op

OverheadActual   1: 3 op, 217.00 ns, 72.3333 ns/op
OverheadActual   2: 3 op, 947.00 ns, 315.6667 ns/op
OverheadActual   3: 3 op, 224.00 ns, 74.6667 ns/op
OverheadActual   4: 3 op, 242.00 ns, 80.6667 ns/op
OverheadActual   5: 3 op, 227.00 ns, 75.6667 ns/op
OverheadActual   6: 3 op, 217.00 ns, 72.3333 ns/op
OverheadActual   7: 3 op, 241.00 ns, 80.3333 ns/op
OverheadActual   8: 3 op, 236.00 ns, 78.6667 ns/op
OverheadActual   9: 3 op, 228.00 ns, 76.0000 ns/op
OverheadActual  10: 3 op, 222.00 ns, 74.0000 ns/op
OverheadActual  11: 3 op, 192.00 ns, 64.0000 ns/op
OverheadActual  12: 3 op, 263.00 ns, 87.6667 ns/op
OverheadActual  13: 3 op, 187.00 ns, 62.3333 ns/op
OverheadActual  14: 3 op, 214.00 ns, 71.3333 ns/op
OverheadActual  15: 3 op, 214.00 ns, 71.3333 ns/op
OverheadActual  16: 3 op, 218.00 ns, 72.6667 ns/op
OverheadActual  17: 3 op, 220.00 ns, 73.3333 ns/op
OverheadActual  18: 3 op, 231.00 ns, 77.0000 ns/op
OverheadActual  19: 3 op, 236.00 ns, 78.6667 ns/op
OverheadActual  20: 3 op, 267.00 ns, 89.0000 ns/op

WorkloadWarmup   1: 3 op, 2731248877.00 ns, 910.4163 ms/op
WorkloadWarmup   2: 3 op, 2785353507.00 ns, 928.4512 ms/op
WorkloadWarmup   3: 3 op, 2720062975.00 ns, 906.6877 ms/op
WorkloadWarmup   4: 3 op, 2719255895.00 ns, 906.4186 ms/op
WorkloadWarmup   5: 3 op, 2717183480.00 ns, 905.7278 ms/op
WorkloadWarmup   6: 3 op, 2788945759.00 ns, 929.6486 ms/op
WorkloadWarmup   7: 3 op, 2756700118.00 ns, 918.9000 ms/op
WorkloadWarmup   8: 3 op, 2723493732.00 ns, 907.8312 ms/op
WorkloadWarmup   9: 3 op, 2716591674.00 ns, 905.5306 ms/op
WorkloadWarmup  10: 3 op, 2786646262.00 ns, 928.8821 ms/op

// BeforeActualRun
WorkloadActual   1: 3 op, 2719270008.00 ns, 906.4233 ms/op
WorkloadActual   2: 3 op, 2727867484.00 ns, 909.2892 ms/op
WorkloadActual   3: 3 op, 2752234607.00 ns, 917.4115 ms/op
WorkloadActual   4: 3 op, 2746344962.00 ns, 915.4483 ms/op
WorkloadActual   5: 3 op, 2721176649.00 ns, 907.0589 ms/op
WorkloadActual   6: 3 op, 2716237772.00 ns, 905.4126 ms/op
WorkloadActual   7: 3 op, 2783251202.00 ns, 927.7504 ms/op
WorkloadActual   8: 3 op, 2757663609.00 ns, 919.2212 ms/op
WorkloadActual   9: 3 op, 2723043182.00 ns, 907.6811 ms/op
WorkloadActual  10: 3 op, 2716898888.00 ns, 905.6330 ms/op
WorkloadActual  11: 3 op, 2791437881.00 ns, 930.4793 ms/op
WorkloadActual  12: 3 op, 2718779663.00 ns, 906.2599 ms/op
WorkloadActual  13: 3 op, 2719404023.00 ns, 906.4680 ms/op
WorkloadActual  14: 3 op, 2727891316.00 ns, 909.2971 ms/op
WorkloadActual  15: 3 op, 2793246803.00 ns, 931.0823 ms/op

// AfterActualRun
WorkloadResult   1: 3 op, 2719269782.50 ns, 906.4233 ms/op
WorkloadResult   2: 3 op, 2727867258.50 ns, 909.2891 ms/op
WorkloadResult   3: 3 op, 2752234381.50 ns, 917.4115 ms/op
WorkloadResult   4: 3 op, 2746344736.50 ns, 915.4482 ms/op
WorkloadResult   5: 3 op, 2721176423.50 ns, 907.0588 ms/op
WorkloadResult   6: 3 op, 2716237546.50 ns, 905.4125 ms/op
WorkloadResult   7: 3 op, 2783250976.50 ns, 927.7503 ms/op
WorkloadResult   8: 3 op, 2757663383.50 ns, 919.2211 ms/op
WorkloadResult   9: 3 op, 2723042956.50 ns, 907.6810 ms/op
WorkloadResult  10: 3 op, 2716898662.50 ns, 905.6329 ms/op
WorkloadResult  11: 3 op, 2791437655.50 ns, 930.4792 ms/op
WorkloadResult  12: 3 op, 2718779437.50 ns, 906.2598 ms/op
WorkloadResult  13: 3 op, 2719403797.50 ns, 906.4679 ms/op
WorkloadResult  14: 3 op, 2727891090.50 ns, 909.2970 ms/op
WorkloadResult  15: 3 op, 2793246577.50 ns, 931.0822 ms/op

From what I understand, all the iterations (per launch/process) that occur before the // BeforeActualRun marker are various forms of warmup performed by BDN, and only the results after that marker are recorded and analyzed as part of the results I'm showing here.

I think there are enough iterations in there (I count 30 invocations, 10 iterations × 3 invocations, just in the WorkloadWarmup phase) for tiered compilation to kick in, but you can judge for yourself.

The tests were originally adapted (lifted) from the dotnet/performance repo and contain some random data generation, which IIRC is the main contributor to the error/stddev columns.

But more importantly, you can see that those columns are pretty much in line across the two .NET versions.

@saucecontrol (Member)

Thanks for the update. Yeah, those results look reasonable, which shoots my theory down.

BTW, I read the first 3 parts of your blog series on array sort and enjoyed them immensely. Looking forward to the rest :)

@stephentoub self-assigned this Apr 22, 2020
@stephentoub removed the untriaged label Apr 22, 2020
@stephentoub added this to the 5.0 milestone Apr 22, 2020
@ghost locked as resolved and limited conversation to collaborators Dec 9, 2020