Improve Array.Sort(T[]) performance #35297

stephentoub · 2020-04-22T17:38:53Z

#35175 reported a roughly 10% regression in sorting throughput (looking specifically at Int32[]) from .NET Core 3.1 to .NET 5.

On my machine:

BenchmarkDotNet=v0.12.1, OS=Windows 10.0.19041.207 (2004/?/20H1)
Intel Core i7-8700 CPU 3.20GHz (Coffee Lake), 1 CPU, 12 logical and 6 physical cores
.NET Core SDK=5.0.100-preview.4.20211.5

our dotnet/performance tests don't show a regression anywhere near that, e.g.

Type	Method	Toolchain	Size	Mean	Error	StdDev	Median	Min	Max	Ratio	RatioSD
Sort<BigStruct>	Array_Comparison	\master\corerun.exe	128	4.028 us	0.1816 us	0.1783 us	4.014 us	3.805 us	4.529 us	0.93	0.07
Sort<BigStruct>	Array_Comparison	\netcore31\corerun.exe	128	4.333 us	0.2711 us	0.2663 us	4.267 us	4.089 us	4.945 us	1.00	0.00

Sort<Int32>	Array_Comparison	\master\corerun.exe	128	3.277 us	0.7013 us	0.8076 us	2.822 us	2.649 us	4.951 us	0.99	0.11
Sort<Int32>	Array_Comparison	\netcore31\corerun.exe	128	3.327 us	0.6240 us	0.7186 us	2.848 us	2.668 us	4.293 us	1.00	0.00

Sort<IntClass>	Array_Comparison	\master\corerun.exe	128	6.481 us	0.1237 us	0.1033 us	6.454 us	6.336 us	6.645 us	1.05	0.02
Sort<IntClass>	Array_Comparison	\netcore31\corerun.exe	128	6.167 us	0.0816 us	0.0637 us	6.156 us	6.078 us	6.272 us	1.00	0.00

Sort<IntStruct>	Array_Comparison	\master\corerun.exe	128	3.967 us	1.4690 us	1.6917 us	3.002 us	2.671 us	6.633 us	0.98	0.09
Sort<IntStruct>	Array_Comparison	\netcore31\corerun.exe	128	4.039 us	1.3931 us	1.6043 us	3.026 us	2.692 us	6.397 us	1.00	0.00

Sort<String>	Array_Comparison	\master\corerun.exe	128	41.962 us	0.4130 us	0.3449 us	41.896 us	41.473 us	42.598 us	0.78	0.01
Sort<String>	Array_Comparison	\netcore31\corerun.exe	128	53.607 us	0.2796 us	0.2335 us	53.682 us	53.180 us	53.989 us	1.00	0.00

Sort<BigStruct>	Array_Comparison	\master\corerun.exe	256	10.402 us	0.2142 us	0.2200 us	10.431 us	10.121 us	10.930 us	0.91	0.04
Sort<BigStruct>	Array_Comparison	\netcore31\corerun.exe	256	11.442 us	0.2745 us	0.2819 us	11.352 us	11.102 us	11.946 us	1.00	0.00

Sort<Int32>	Array_Comparison	\master\corerun.exe	256	7.440 us	0.0629 us	0.0525 us	7.430 us	7.367 us	7.532 us	1.00	0.01
Sort<Int32>	Array_Comparison	\netcore31\corerun.exe	256	7.449 us	0.0787 us	0.0658 us	7.448 us	7.349 us	7.573 us	1.00	0.00

Sort<IntClass>	Array_Comparison	\master\corerun.exe	256	16.090 us	0.0999 us	0.0835 us	16.108 us	15.961 us	16.187 us	1.04	0.01
Sort<IntClass>	Array_Comparison	\netcore31\corerun.exe	256	15.492 us	0.1317 us	0.1100 us	15.461 us	15.367 us	15.746 us	1.00	0.00

Sort<IntStruct>	Array_Comparison	\master\corerun.exe	256	7.741 us	0.0896 us	0.0699 us	7.722 us	7.659 us	7.892 us	1.03	0.01
Sort<IntStruct>	Array_Comparison	\netcore31\corerun.exe	256	7.528 us	0.1079 us	0.0843 us	7.524 us	7.405 us	7.679 us	1.00	0.00

Sort<String>	Array_Comparison	\master\corerun.exe	256	95.722 us	0.7744 us	0.6467 us	95.536 us	94.978 us	96.726 us	0.79	0.01
Sort<String>	Array_Comparison	\netcore31\corerun.exe	256	121.650 us	0.5643 us	0.5002 us	121.748 us	120.844 us	122.655 us	1.00	0.00

Sort<BigStruct>	Array_Comparison	\master\corerun.exe	1024	69.300 us	0.7271 us	0.6802 us	69.153 us	68.452 us	70.507 us	0.97	0.01
Sort<BigStruct>	Array_Comparison	\netcore31\corerun.exe	1024	71.688 us	0.3637 us	0.3224 us	71.633 us	71.268 us	72.251 us	1.00	0.00

Sort<Int32>	Array_Comparison	\master\corerun.exe	1024	48.195 us	0.3197 us	0.2991 us	48.131 us	47.767 us	48.726 us	1.00	0.01
Sort<Int32>	Array_Comparison	\netcore31\corerun.exe	1024	48.023 us	0.4813 us	0.4266 us	47.822 us	47.539 us	48.785 us	1.00	0.00

Sort<IntClass>	Array_Comparison	\master\corerun.exe	1024	86.818 us	0.4345 us	0.4064 us	86.950 us	86.163 us	87.711 us	1.03	0.01
Sort<IntClass>	Array_Comparison	\netcore31\corerun.exe	1024	84.631 us	0.3439 us	0.3048 us	84.565 us	84.260 us	85.095 us	1.00	0.00

Sort<IntStruct>	Array_Comparison	\master\corerun.exe	1024	49.627 us	0.8809 us	0.7356 us	49.472 us	48.783 us	51.387 us	0.96	0.02
Sort<IntStruct>	Array_Comparison	\netcore31\corerun.exe	1024	51.501 us	0.6825 us	0.6384 us	51.372 us	50.403 us	52.402 us	1.00	0.00

Sort<String>	Array_Comparison	\master\corerun.exe	1024	551.786 us	4.9619 us	4.3986 us	550.369 us	546.377 us	561.409 us	0.82	0.01
Sort<String>	Array_Comparison	\netcore31\corerun.exe	1024	670.227 us	5.4861 us	4.5811 us	669.726 us	662.498 us	680.227 us	1.00	0.00

However, the benchmarks shared in that issue do show a regression in some cases for Int32 on my machine, albeit not quite as large as what was cited:

Method	Job	Toolchain	N	Mean[us]	Error[us]	StdDev[us]	Time / N[ns]	Ratio	RatioSD	SpeedupMedian	Code Size[B]
ArraySort	Job-RWQZQA	Default	100	2.123 us	0.0490 us	0.0733 us	21.2333 ns	1.13	0.07	0.89	336 B
ArraySort	Job-WOYWHX	\master\corerun.exe	100	2.050 us	0.0421 us	0.0887 us	20.5000 ns	1.10	0.07	0.91	191 B

ArraySort	Job-RWQZQA	Default	1000	30.437 us	0.5058 us	0.7091 us	30.4370 ns	1.04	0.02	0.96	336 B
ArraySort	Job-WOYWHX	\master\corerun.exe	1000	31.331 us	0.2103 us	0.1642 us	31.3306 ns	1.08	0.01	0.93	191 B

ArraySort	Job-RWQZQA	Default	10000	404.177 us	0.6468 us	0.9277 us	40.4177 ns	1.02	0.00	0.98	336 B
ArraySort	Job-WOYWHX	\master\corerun.exe	10000	425.490 us	3.3336 us	2.9551 us	42.5490 ns	1.08	0.01	0.93	191 B

ArraySort	Job-RWQZQA	Default	100000	5,079.667 us	4.6692 us	6.6964 us	50.7967 ns	1.02	0.01	0.98	336 B
ArraySort	Job-WOYWHX	\master\corerun.exe	100000	5,338.014 us	20.5143 us	16.0162 us	53.3801 ns	1.07	0.01	0.93	191 B

ArraySort	Job-RWQZQA	Default	1000000	60,104.120 us	32.8821 us	47.1586 us	60.1041 ns	1.01	0.00	0.99	306 B
ArraySort	Job-WOYWHX	\master\corerun.exe	1000000	63,220.262 us	93.8403 us	87.7783 us	63.2203 ns	1.06	0.00	0.94	97 B

ArraySort	Job-RWQZQA	Default	10000000	699,312.886 us	755.2464 us	1,107.0303 us	69.9313 ns	1.01	0.00	0.99	306 B
ArraySort	Job-WOYWHX	\master\corerun.exe	10000000	736,870.178 us	1,700.3041 us	1,590.4655 us	73.6870 ns	1.07	0.00	0.94	150 B

Even so, a simple console app does sufficiently demonstrate a meaningful throughput regression:

using System;
using System.Diagnostics;
using System.Linq;

class Program
{
    static void Main()
    {
        const int Size = 1_000;
        int[][] arrays = Enumerable.Range(0, 100_000).Select(_ => new int[Size]).ToArray();
        var sw = new Stopwatch();

        var r = new Random(42);
        var unsorted = new int[Size];
        for (int i = 0; i < unsorted.Length; i++) unsorted[i] = r.Next();

        while (true)
        {
            foreach (int[] array in arrays) Array.Copy(unsorted, array, unsorted.Length);

            sw.Restart();
            foreach (int[] array in arrays) Array.Sort(array);
            sw.Stop();

            Console.WriteLine(sw.Elapsed.TotalSeconds);
        }
    }
}

On .NET Core 3.1 I get results like:

and with master I get results like:

which is closer to a 20% regression.

Even though our actual benchmarks aren't showing anything close to that (and in some cases .NET 5 is meaningfully faster, especially with strings), this PR addresses the gap. It includes a variety of tweaks to improve Array.Sort<T>(T[]) performance; the two most impactful are using Unsafe.* in PickPivotAndPartition to avoid bounds checks and aggressive inlining on SwapIfGreater. A few other small improvements to codegen round it out. I only made the unsafe changes in the Sort<T>(T[]) implementation, and not in the more complicated implementations, such as for Sort<T>(T[], Comparer<T>) and Sort<TKey, TValue>(TKey[], TValue[]), but I did make some of the smaller changes for consistency across the file.
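For readers who want to see the shape of the two changes described above, here is a minimal sketch, not the code from this PR. It assumes value-type elements (so null handling is omitted) and at least three elements; the class name `PartitionSketch` and the simplified loop structure are illustrative assumptions. It shows a median-of-three partition that walks `ref` locals with `Unsafe.Add` instead of indexing the array, plus an aggressively inlined `SwapIfGreater`.

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;

// Illustrative sketch only; names and structure are assumptions, not the PR's code.
static class PartitionSketch
{
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    private static void Swap<T>(ref T a, ref T b)
    {
        T tmp = a;
        a = b;
        b = tmp;
    }

    // Small hot helper: asking the JIT to inline it avoids a call per comparison.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    private static void SwapIfGreater<T>(ref T a, ref T b) where T : struct, IComparable<T>
    {
        if (a.CompareTo(b) > 0) Swap(ref a, ref b);
    }

    // Partitions keys around a median-of-three pivot and returns the pivot's final index.
    public static int PickPivotAndPartition<T>(Span<T> keys) where T : struct, IComparable<T>
    {
        if (keys.Length < 3) throw new ArgumentException("Need at least three elements.", nameof(keys));

        // Sort the first, middle, and last elements so the middle one is the median.
        ref T zero = ref MemoryMarshal.GetReference(keys);
        ref T last = ref Unsafe.Add(ref zero, keys.Length - 1);
        ref T middle = ref Unsafe.Add(ref zero, (keys.Length - 1) >> 1);
        SwapIfGreater(ref zero, ref middle);
        SwapIfGreater(ref zero, ref last);
        SwapIfGreater(ref middle, ref last);

        // Park the pivot just before the last element.
        ref T nextToLast = ref Unsafe.Add(ref zero, keys.Length - 2);
        T pivot = middle;
        Swap(ref middle, ref nextToLast);

        // Walk two references toward each other; the explicit address checks
        // replace the bounds checks that array indexing would have performed.
        ref T left = ref zero;
        ref T right = ref nextToLast;
        while (Unsafe.IsAddressLessThan(ref left, ref right))
        {
            while (Unsafe.IsAddressLessThan(ref left, ref nextToLast))
            {
                left = ref Unsafe.Add(ref left, 1);
                if (pivot.CompareTo(left) <= 0) break;
            }
            while (Unsafe.IsAddressGreaterThan(ref right, ref zero))
            {
                right = ref Unsafe.Add(ref right, -1);
                if (pivot.CompareTo(right) >= 0) break;
            }
            if (!Unsafe.IsAddressLessThan(ref left, ref right)) break;
            Swap(ref left, ref right);
        }

        // Move the pivot into its final slot and convert the reference back to an index.
        if (!Unsafe.AreSame(ref left, ref nextToLast)) Swap(ref left, ref nextToLast);
        return (int)((long)Unsafe.ByteOffset(ref zero, ref left) / Unsafe.SizeOf<T>());
    }
}
```

Calling `PartitionSketch.PickPivotAndPartition(data.AsSpan())` on an `int[]` returns an index `p` such that every element before `p` is less than or equal to `data[p]` and every element after it is greater than or equal to `data[p]`.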

Fixes #35175
@jkotas, any visceral reaction to the Unsafe.* code here? 😄
cc: @damageboy, @adamsitnik

In terms of impact, here are my results from running the benchmarks shared in #35175:

Method	Job	Toolchain	N	Mean[us]	Error[us]	StdDev[us]	Time / N[ns]	Ratio	RatioSD	SpeedupMedian	Code Size[B]
ArraySort	Job-RWQZQA	Default	100	2.123 us	0.0490 us	0.0733 us	21.2333 ns	1.13	0.07	0.89	336 B
ArraySort	Job-WOYWHX	\master\corerun.exe	100	2.050 us	0.0421 us	0.0887 us	20.5000 ns	1.10	0.07	0.91	191 B
ArraySort	Job-KOZKWL	\pr\corerun.exe	100	1.868 us	0.0386 us	0.0933 us	18.6812 ns	1.00	0.00	1.00	191 B

ArraySort	Job-RWQZQA	Default	1000	30.437 us	0.5058 us	0.7091 us	30.4370 ns	1.04	0.02	0.96	336 B
ArraySort	Job-WOYWHX	\master\corerun.exe	1000	31.331 us	0.2103 us	0.1642 us	31.3306 ns	1.08	0.01	0.93	191 B
ArraySort	Job-KOZKWL	\pr\corerun.exe	1000	28.990 us	0.1958 us	0.1635 us	28.9897 ns	1.00	0.00	1.00	191 B

ArraySort	Job-RWQZQA	Default	10000	404.177 us	0.6468 us	0.9277 us	40.4177 ns	1.02	0.00	0.98	336 B
ArraySort	Job-WOYWHX	\master\corerun.exe	10000	425.490 us	3.3336 us	2.9551 us	42.5490 ns	1.08	0.01	0.93	191 B
ArraySort	Job-KOZKWL	\pr\corerun.exe	10000	394.917 us	2.1902 us	1.9415 us	39.4917 ns	1.00	0.00	1.00	191 B

ArraySort	Job-RWQZQA	Default	100000	5,079.667 us	4.6692 us	6.6964 us	50.7967 ns	1.02	0.01	0.98	336 B
ArraySort	Job-WOYWHX	\master\corerun.exe	100000	5,338.014 us	20.5143 us	16.0162 us	53.3801 ns	1.07	0.01	0.93	191 B
ArraySort	Job-KOZKWL	\pr\corerun.exe	100000	4,975.340 us	33.2037 us	31.0588 us	49.7534 ns	1.00	0.00	1.00	191 B

ArraySort	Job-RWQZQA	Default	1000000	60,104.120 us	32.8821 us	47.1586 us	60.1041 ns	1.01	0.00	0.99	306 B
ArraySort	Job-WOYWHX	\master\corerun.exe	1000000	63,220.262 us	93.8403 us	87.7783 us	63.2203 ns	1.06	0.00	0.94	97 B
ArraySort	Job-KOZKWL	\pr\corerun.exe	1000000	59,504.806 us	102.7170 us	80.1947 us	59.5048 ns	1.00	0.00	1.00	97 B

ArraySort	Job-RWQZQA	Default	10000000	699,312.886 us	755.2464 us	1,107.0303 us	69.9313 ns	1.01	0.00	0.99	306 B
ArraySort	Job-WOYWHX	\master\corerun.exe	10000000	736,870.178 us	1,700.3041 us	1,590.4655 us	73.6870 ns	1.07	0.00	0.94	150 B
ArraySort	Job-KOZKWL	\pr\corerun.exe	10000000	691,165.742 us	442.2233 us	413.6559 us	69.1166 ns	1.00	0.00	1.00	150 B

For the simple command-line app, we now get results like:

I also tweaked the above to remove the line that copies the unsorted data over each array, such that we're then sorting an already sorted array each time. With that, on .NET Core 3.1 I get:

and on master I get:

and with this PR I get:

Finally, I checked GC pause times similar to what @jkotas did in dotnet/coreclr#27642 (comment). With .NET Core 3.1, we get average GC pause times around 500ms, and with .NET 5 (both master and this PR), we get average GC pause times around 15ms. This highlights one of the benefits of moving the sorting into managed code, separate from all the other benefits.

src/libraries/System.Private.CoreLib/src/System/Collections/Generic/ArraySortHelper.cs

A variety of tweaks to improve `Array.Sort<T>(T[])` performance and address a regression left over from moving the array sorting implementation from native to managed. The two most impactful are using `Unsafe.*` in `PickPivotAndPartition` to avoid bounds checks and aggressive inlining on `SwapIfGreater`. A few other small improvements to codegen round it out. I only made the unsafe changes in the `Sort<T>(T[])` implementation, and not in the more complicated implementations, such as for `Sort<T>(T[], Comparer<T>)` and `Sort<TKey, TValue>(TKey[], TValue[])`, but I did make some of the smaller changes for consistency across the file.

damageboy · 2020-04-24T15:57:14Z

FWIW, I've double-checked on my side and it is exactly 10% (before these changes), on both my Kaby Lake and AMD Ryzen machines. I will mention that my Intel machine is clean of the Intel JCC microcode update, so that may be a factor in this, depending on @stephentoub's machine configuration.

stephentoub · 2020-04-24T15:58:16Z

Thanks. Are you able to test with this PR?

damageboy · 2020-04-24T16:52:02Z

Issue seems resolved.
Compared 3.1.201, master without the PR, and the PR branch.
Perf now seems better than 3.1.201 at the larger problem sizes:

Method	Toolchain	N	Mean [us]	Error [us]	StdDev [us]	Time / N [ns]
ArraySort	`3.1.201`	100	0.9818	0.0152	0.0213	9.8177
ArraySort	`master/08285b1`	100	1.1140	0.0230	0.0384	11.1398
ArraySort	`pr/35297/2eef7dd`	100	1.0903	0.0215	0.0287	10.9032
ArraySort	`3.1.201`	1000	16.9130	1.0379	1.5213	16.9130
ArraySort	`master/08285b1`	1000	21.9628	0.4283	0.6143	21.9628
ArraySort	`pr/35297/2eef7dd`	1000	17.2827	0.3448	0.5665	17.2827
ArraySort	`3.1.201`	10000	440.7663	1.1282	1.5816	44.0766
ArraySort	`master/08285b1`	10000	507.5048	1.4836	1.2388	50.7505
ArraySort	`pr/35297/2eef7dd`	10000	457.2838	9.0489	8.0216	45.7284
ArraySort	`3.1.201`	100000	5,734.5160	5.6595	8.2957	57.3452
ArraySort	`master/08285b1`	100000	6,474.4154	57.1762	53.4827	64.7442
ArraySort	`pr/35297/2eef7dd`	100000	5,648.4342	64.5477	60.3780	56.4843
ArraySort	`3.1.201`	1000000	67,353.2352	52.7210	72.1651	67.3532
ArraySort	`master/08285b1`	1000000	77,408.7551	580.1940	542.7138	77.4088
ArraySort	`pr/35297/2eef7dd`	1000000	65,691.5563	173.6212	162.4054	65.6916
ArraySort	`3.1.201`	10000000	775,940.1604	693.8563	949.7583	77.5940
ArraySort	`master/08285b1`	10000000	898,762.7196	2,351.3387	2,199.4437	89.8763
ArraySort	`pr/35297/2eef7dd`	10000000	749,980.9434	2,472.2225	2,064.4178	74.9981

stephentoub · 2020-04-24T16:58:38Z

Great. Thanks for confirming!

nietras · 2020-04-27T14:38:24Z

src/libraries/System.Private.CoreLib/src/System/Collections/Generic/ArraySortHelper.cs

-                    while (pivot.CompareTo(keys[++left]) > 0) ;
-                    while (pivot.CompareTo(keys[--right]) < 0) ;
+                    while (Unsafe.IsAddressLessThan(ref leftRef, ref nextToLastRef) && pivot.CompareTo(leftRef = ref Unsafe.Add(ref leftRef, 1)) > 0) ;
+                    while (Unsafe.IsAddressGreaterThan(ref rightRef, ref zeroRef) && pivot.CompareTo(rightRef = ref Unsafe.Add(ref rightRef, -1)) < 0) ;


@stephentoub this will no longer result in an index-out-of-range exception being thrown for a bogus comparable; instead, it now silently swallows that case.

Yes. Incorrect comparables wouldn't have always done so, only for some forms of incorrectness.

Not sure I understand, does that mean you think this is not a problem?

Yes. You disagree?

I would have thought that it would be a breaking change, since you will no longer get an exception. Behavior is different from earlier versions. Is it not?

I have no hard feelings about this, clearly my bar for acceptable changes, perf regressions etc. was set too high in some regards 😅 When I worked on this I was aiming for 100% fidelity with existing output.

My thinking when I chose to make this change was that the comparer is busted, and there are a multitude of possible behaviors that could result from a busted comparer... it could throw arbitrary exceptions, it could sporadically yield an incorrect comparison in a way no one would know, it could crash the process, etc. We could even be changing behavior in that regard just by changing how many times we invoked the comparer, or by changing the order in which we invoked it on the input data, or any number of other things. So I'm not concerned with taking a failure we may have only sometimes been able to detect (and for which a goal was never true detection) and making it into something which sometimes fails differently, and does so by not throwing instead of throwing.

@jkotas, any concerns?

I do not have concerns about the subtle behavior change with broken comparers. My primary concerns around the unsafe code are buffer overruns. It is easy to see that the bounds are checked in this case.

aim for 100% fidelity with existing output.

That was a good place to start and the initial PR that got merged maintained this fidelity.

Alright, good to know. Perf/code size will be better without the check 👍 I have tests that will fail due to this change, since I test against built-in Sort, but that is my concern then.
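To make the behavior difference discussed in this thread concrete, here is a small standalone sketch. It does not call `Array.Sort`; it only contrasts the two loop shapes from the diff above on a deliberately broken comparable. The type `AlwaysGreater` and both method names are hypothetical, and the bounded loop uses plain indexing rather than `Unsafe.*` for simplicity.

```csharp
using System;

// Hypothetical broken comparable: claims to be greater than everything, itself included.
readonly struct AlwaysGreater : IComparable<AlwaysGreater>
{
    public int CompareTo(AlwaysGreater other) => 1;
}

static class BrokenComparableSketch
{
    // Shape of the removed loop: nothing stops the scan except the array bounds check,
    // so a comparable that never yields <= 0 runs off the end and throws.
    public static bool UncheckedScanThrows(AlwaysGreater[] keys)
    {
        AlwaysGreater pivot = keys[0];
        int left = 0;
        try
        {
            while (pivot.CompareTo(keys[++left]) > 0) { }
            return false;
        }
        catch (IndexOutOfRangeException)
        {
            return true;
        }
    }

    // Shape of the added loop: an explicit bound ends the scan quietly,
    // so the same broken comparable no longer surfaces as an exception.
    public static int BoundedScan(AlwaysGreater[] keys)
    {
        AlwaysGreater pivot = keys[0];
        int left = 0;
        int nextToLast = keys.Length - 2;
        while (left < nextToLast && pivot.CompareTo(keys[++left]) > 0) { }
        return left;
    }
}
```

With a four-element array, `UncheckedScanThrows` returns `true` because the scan indexes past the end, while `BoundedScan` stops at index 2 without throwing. This only illustrates the loop-level difference; what `Array.Sort` itself reports for a broken comparable depends on the runtime version.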

nietras · 2020-04-27T14:39:42Z

@stephentoub @jkotas guess this means dotnet/corefx#26859 could have been merged anyhow :)

stephentoub · 2020-04-27T14:48:42Z

guess this means dotnet/corefx#26859 could have been merged anyhow :)

There's a ton more unsafe code in that PR than in this one. The "unsafety" in this PR is scoped to just two functions, with extra scrutiny as called out in dotnet/corefx#26859 (comment).

nietras · 2020-04-27T15:50:49Z

a ton more unsafe code in that PR than in this one.

True, but mainly because the code has to be repeated so many times. The code is the same across versions; if one is safe, is the other not? :) Guess source generators might "solve" this.

Dotnet-GitSync-Bot added the area-System.Runtime label Apr 22, 2020

stephentoub changed the title ~~Improve Array,Sort(array) performance~~ Improve Array.Sort(T[]) performance Apr 22, 2020

stephentoub force-pushed the arraysortperf branch from c9c9ccd to 5482eac April 22, 2020 17:41

tannergooding reviewed Apr 22, 2020

tannergooding reviewed Apr 22, 2020

Gnbrkm41 reviewed Apr 22, 2020

stephentoub force-pushed the arraysortperf branch from 5482eac to dc739b0 April 22, 2020 18:20

jkotas reviewed Apr 22, 2020

adamsitnik added the tenet-performance Performance related issue label Apr 23, 2020

stephentoub force-pushed the arraysortperf branch from dc739b0 to 78c521b April 23, 2020 15:11

jkotas reviewed Apr 23, 2020

jkotas approved these changes Apr 23, 2020

stephentoub force-pushed the arraysortperf branch from 78c521b to 13c854c April 23, 2020 21:26

stephentoub added 2 commits April 23, 2020 21:13

Address PR feedback, and more tweaks

2eef7dd

stephentoub force-pushed the arraysortperf branch from 13c854c to 2eef7dd April 24, 2020 01:14

jaredpar mentioned this pull request Apr 24, 2020

OSX machines are de-provisioned during CI / PR runs leading to failures #34472

Closed

stephentoub merged commit f73ceee into dotnet:master Apr 24, 2020

stephentoub deleted the arraysortperf branch April 24, 2020 18:12

nietras reviewed Apr 27, 2020

richlander mentioned this pull request Jun 6, 2020

Improving P95+ latency #37534

Closed

AndyAyersMS mentioned this pull request Sep 3, 2020

[ARM64] Performance regression: Sorting arrays of primitive types #41741

Closed

ghost locked as resolved and limited conversation to collaborators Dec 9, 2020
