-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Array.Sort perf regression with .NET Core 5.0 preview #35175
Comments
I couldn't figure out the best area label to add to this issue. Please help me learn by adding exactly one area label. |
Can you re-run the benchmarks with |
/cc @stephentoub |
Looks like the warmup and iteration counts are too low to ever hit Tier-1. The Span methods make fairly heavy use of BMI1 and LZCNT intrinsics, which are forced to a slow fallback path in the R2R corelib image. This should improve once #34550 is done, particularly if crossgen emit of LZCNT is disabled after and the explicit uses of BMI1 and LZCNT are switched to |
I guess BDN needs some sort of events from JIT (method X was promoted to tier1) for better results. There were a couple of such requests somewhere. |
Yep, I usually disable tiered JIT in benchmarks when testing my own code, but that's counterproductive when benchmarking CoreLib code since you then just keep the R2R code forever. The only solution I know of for now is to make sure your warmup hits the magical 30 iterations. |
@saucecontrol I don't think this is the case here since BDN still does a considerable about of loops to warm up the JIT, and also all of my other results (For my own C# code) do not budge that much for the shorter runs, it's mostly about reducing the number of outliers. Also, keep in mind the benchmark harness I linked to does 3 invocations per iterations: At any rate, here are the results with a longer setup: SetupFor 3.1.201: BenchmarkDotNet=v0.12.0.1329-nightly, OS=clear-linux-os 32870
Intel Core i7-7700HQ CPU 2.80GHz (Kaby Lake), 1 CPU, 4 logical and 4 physical cores
.NET Core SDK=5.0.100-preview.2.20181.10
[Host] : .NET Core 3.1.2 (CoreCLR 4.700.20.6602, CoreFX 4.700.20.6702), X64 RyuJIT
Job-GLGTXW : .NET Core 3.1.2 (CoreCLR 4.700.20.6602, CoreFX 4.700.20.6702), X64 RyuJIT
InvocationCount=3 IterationCount=15 LaunchCount=2
UnrollFactor=1 WarmupCount=10 For 5.0 preview2: BenchmarkDotNet=v0.12.0.1329-nightly, OS=clear-linux-os 32870
Intel Core i7-7700HQ CPU 2.80GHz (Kaby Lake), 1 CPU, 4 logical and 4 physical cores
.NET Core SDK=5.0.100-preview.2.20181.10
[Host] : .NET Core 5.0.0 (CoreCLR 5.0.20.16006, CoreFX 5.0.20.16006), X64 RyuJIT
Job-OKHTIU : .NET Core 5.0.0 (CoreCLR 5.0.20.16006, CoreFX 5.0.20.16006), X64 RyuJIT
InvocationCount=3 IterationCount=15 LaunchCount=2
UnrollFactor=1 WarmupCount=10 Results
To me, this looks even slightly worse. Even though both "targets" look better than the top comment. |
@EgorBo Here are the results for 5.0p2 with
And for 3.1.201 (Which should really not matter, I think, but why not anyway):
|
Why is the Allocated column in GB? Of course that'll be 0. And where are the Error and StdDev columns? Hard to tell if the run reached steady state without those.
The really big difference there is that your own HWIntrinsics code will be JIT compiled the first time for your own hardware. The CoreLib image is Ready2Run compiled, which is limited to SSE2 instructions. Any place in CoreLib where BMI1 instructions are used will instead pre-JIT the managed fallback code, and any place LZCNT is used, both the Unless you're certain your benchmark is running enough iterations to trigger a Tier1 recompile before measuring anything, you're not measuring the right thing. It's possible you're reaching that state (in which case my theory is completely wrong), but there's not enough information in your results tables to see that. |
@saucecontrol As for JITing / tiered compilation, Here is a log for a full run: // BeforeAnythingElse
// Benchmark Process Environment Information:
// Runtime=.NET Core 5.0.0 (CoreCLR 5.0.20.16006, CoreFX 5.0.20.16006), X64 RyuJIT
// GC=Concurrent Workstation
// Job: Job-VKJPFJ(InvocationCount=3, IterationCount=15, LaunchCount=2, UnrollFactor=1, WarmupCount=10)
OverheadJitting 1: 1 op, 212143.00 ns, 212.1430 us/op
WorkloadJitting 1: 1 op, 911815580.00 ns, 911.8156 ms/op
OverheadWarmup 1: 3 op, 300.00 ns, 100.0000 ns/op
OverheadWarmup 2: 3 op, 282.00 ns, 94.0000 ns/op
OverheadWarmup 3: 3 op, 251.00 ns, 83.6667 ns/op
OverheadWarmup 4: 3 op, 204.00 ns, 68.0000 ns/op
OverheadWarmup 5: 3 op, 271.00 ns, 90.3333 ns/op
OverheadWarmup 6: 3 op, 171.00 ns, 57.0000 ns/op
OverheadWarmup 7: 3 op, 241.00 ns, 80.3333 ns/op
OverheadWarmup 8: 3 op, 201.00 ns, 67.0000 ns/op
OverheadActual 1: 3 op, 217.00 ns, 72.3333 ns/op
OverheadActual 2: 3 op, 947.00 ns, 315.6667 ns/op
OverheadActual 3: 3 op, 224.00 ns, 74.6667 ns/op
OverheadActual 4: 3 op, 242.00 ns, 80.6667 ns/op
OverheadActual 5: 3 op, 227.00 ns, 75.6667 ns/op
OverheadActual 6: 3 op, 217.00 ns, 72.3333 ns/op
OverheadActual 7: 3 op, 241.00 ns, 80.3333 ns/op
OverheadActual 8: 3 op, 236.00 ns, 78.6667 ns/op
OverheadActual 9: 3 op, 228.00 ns, 76.0000 ns/op
OverheadActual 10: 3 op, 222.00 ns, 74.0000 ns/op
OverheadActual 11: 3 op, 192.00 ns, 64.0000 ns/op
OverheadActual 12: 3 op, 263.00 ns, 87.6667 ns/op
OverheadActual 13: 3 op, 187.00 ns, 62.3333 ns/op
OverheadActual 14: 3 op, 214.00 ns, 71.3333 ns/op
OverheadActual 15: 3 op, 214.00 ns, 71.3333 ns/op
OverheadActual 16: 3 op, 218.00 ns, 72.6667 ns/op
OverheadActual 17: 3 op, 220.00 ns, 73.3333 ns/op
OverheadActual 18: 3 op, 231.00 ns, 77.0000 ns/op
OverheadActual 19: 3 op, 236.00 ns, 78.6667 ns/op
OverheadActual 20: 3 op, 267.00 ns, 89.0000 ns/op
WorkloadWarmup 1: 3 op, 2731248877.00 ns, 910.4163 ms/op
WorkloadWarmup 2: 3 op, 2785353507.00 ns, 928.4512 ms/op
WorkloadWarmup 3: 3 op, 2720062975.00 ns, 906.6877 ms/op
WorkloadWarmup 4: 3 op, 2719255895.00 ns, 906.4186 ms/op
WorkloadWarmup 5: 3 op, 2717183480.00 ns, 905.7278 ms/op
WorkloadWarmup 6: 3 op, 2788945759.00 ns, 929.6486 ms/op
WorkloadWarmup 7: 3 op, 2756700118.00 ns, 918.9000 ms/op
WorkloadWarmup 8: 3 op, 2723493732.00 ns, 907.8312 ms/op
WorkloadWarmup 9: 3 op, 2716591674.00 ns, 905.5306 ms/op
WorkloadWarmup 10: 3 op, 2786646262.00 ns, 928.8821 ms/op
// BeforeActualRun
WorkloadActual 1: 3 op, 2719270008.00 ns, 906.4233 ms/op
WorkloadActual 2: 3 op, 2727867484.00 ns, 909.2892 ms/op
WorkloadActual 3: 3 op, 2752234607.00 ns, 917.4115 ms/op
WorkloadActual 4: 3 op, 2746344962.00 ns, 915.4483 ms/op
WorkloadActual 5: 3 op, 2721176649.00 ns, 907.0589 ms/op
WorkloadActual 6: 3 op, 2716237772.00 ns, 905.4126 ms/op
WorkloadActual 7: 3 op, 2783251202.00 ns, 927.7504 ms/op
WorkloadActual 8: 3 op, 2757663609.00 ns, 919.2212 ms/op
WorkloadActual 9: 3 op, 2723043182.00 ns, 907.6811 ms/op
WorkloadActual 10: 3 op, 2716898888.00 ns, 905.6330 ms/op
WorkloadActual 11: 3 op, 2791437881.00 ns, 930.4793 ms/op
WorkloadActual 12: 3 op, 2718779663.00 ns, 906.2599 ms/op
WorkloadActual 13: 3 op, 2719404023.00 ns, 906.4680 ms/op
WorkloadActual 14: 3 op, 2727891316.00 ns, 909.2971 ms/op
WorkloadActual 15: 3 op, 2793246803.00 ns, 931.0823 ms/op
// AfterActualRun
WorkloadResult 1: 3 op, 2719269782.50 ns, 906.4233 ms/op
WorkloadResult 2: 3 op, 2727867258.50 ns, 909.2891 ms/op
WorkloadResult 3: 3 op, 2752234381.50 ns, 917.4115 ms/op
WorkloadResult 4: 3 op, 2746344736.50 ns, 915.4482 ms/op
WorkloadResult 5: 3 op, 2721176423.50 ns, 907.0588 ms/op
WorkloadResult 6: 3 op, 2716237546.50 ns, 905.4125 ms/op
WorkloadResult 7: 3 op, 2783250976.50 ns, 927.7503 ms/op
WorkloadResult 8: 3 op, 2757663383.50 ns, 919.2211 ms/op
WorkloadResult 9: 3 op, 2723042956.50 ns, 907.6810 ms/op
WorkloadResult 10: 3 op, 2716898662.50 ns, 905.6329 ms/op
WorkloadResult 11: 3 op, 2791437655.50 ns, 930.4792 ms/op
WorkloadResult 12: 3 op, 2718779437.50 ns, 906.2598 ms/op
WorkloadResult 13: 3 op, 2719403797.50 ns, 906.4679 ms/op
WorkloadResult 14: 3 op, 2727891090.50 ns, 909.2970 ms/op
WorkloadResult 15: 3 op, 2793246577.50 ns, 931.0822 ms/op From what I understand all the iterations (per launch/process) that occur before the I think there are enough iterations in there (I count 30 just for the The tests were originally adapted (lifted) from the dotnet/performance repo and contain some random data generation which IIRC is the main reason for the error/stddev columns. But more importantly, you can see that those columns are pretty much in line with respect to the different .NET versions. |
Thanks for the update. Yeah, those results look reasonable, which shoots my theory down. BTW, I read the first 3 parts of your blog series on array sort and enjoyed them immensely. Looking forward to the rest :) |
FYI, While preparing my next Array.Sort blog post I decided to run benchmarks against .NET Core 5.0...
I'm seeing quite a considerable perf drop (~10%) when comparing 3.1 / 5.0p2.
My guess is this is related to the Spanification PR although the original PR did not include enough benchmark points to (maybe) exhibit this regression back when it was introduced:
FWIW, These are the results for benchmarking int32 sorting with my benchmark harness:
The text was updated successfully, but these errors were encountered: