[ARM64] Performance regression: Sorting arrays of primitive types #41741
Comments
What's the difference between the two tables? Is it just two different sets of runs? Never mind -- List vs Array. |
I'll investigate. |
I can repro this on an Rpi4:
|
Here's a bit more data for the array case (from an extracted test, via perf, doing the same amount of work for each runtime). As you can see, the cycles moved from native methods to jitted methods, so comparative analysis will be a bit trickier than normal.
Total cycle difference per perf is about 1.17B, or around 30%; this compares well with the BDN ratios above. It seems likely the top two methods in each case perform similar functions, so we'll compare them.
Both methods are slower in 5.0; let's focus first on InsertionSort. |
[edit: pasted in the right version of the C#]

Here's the codegen for the innermost loop of InsertionSort.

We see the jitted version emits quite a few more instructions, and a number of things stand out. For reference, here's the C# source:

private static void InsertionSort(Span<T> keys)
{
    for (int i = 0; i < keys.Length - 1; i++)
    {
        T t = Unsafe.Add(ref MemoryMarshal.GetReference(keys), i + 1);
        int j = i;
        while (j >= 0 && (t == null || LessThan(ref t, ref Unsafe.Add(ref MemoryMarshal.GetReference(keys), j))))
        {
            Unsafe.Add(ref MemoryMarshal.GetReference(keys), j + 1) = Unsafe.Add(ref MemoryMarshal.GetReference(keys), j);
            j--;
        }
        Unsafe.Add(ref MemoryMarshal.GetReference(keys), j + 1) = t;
    }
}

and here's the native source (simplified; the items handling isn't needed here and is commented out):

static void InsertionSort(KIND keys[], KIND items[], int lo, int hi)
{
    int i, j;
    KIND t, ti = NULL;
    for (i = lo; i < hi; i++)
    {
        j = i;
        t = keys[i + 1];
        // if(items != NULL)
        //     ti = items[i + 1];
        while (j >= lo && t < keys[j])
        {
            keys[j + 1] = keys[j];
            // if(items != NULL)
            //     items[j + 1] = items[j];
            j--;
        }
        keys[j + 1] = t;
        // if(items != NULL)
        //     items[j + 1] = ti;
    }
}

Analysis of jitted code:

The net result is that the jitted code does 3 loads in the upper block, compared to 1 for the native code, and 1 load in the lower block, compared to none for the native code. So for the most part we see known issues, and unfortunately none of this is actionable for 5.0. We are considering enhancing the loop optimizer for .NET 6.0, and this loop might make a good study candidate. cc @dotnet/jit-contrib

Next I'll look at PickPivotAndPartition. |
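To make the "extra loads" concrete, here is a hypothetical source-level variant of my own (not something proposed in this thread; `Comparer<T>.Default` stands in for the BCL's internal LessThan helper and the null handling). Hoisting the span's base reference into a ref local means the base address only has to be materialized once instead of being re-derived from the unpromoted span on every access; whether the JIT then keeps it in a register is exactly the promotion question discussed further down.

```csharp
using System;
using System.Collections.Generic;        // Comparer<T>
using System.Runtime.CompilerServices;   // Unsafe
using System.Runtime.InteropServices;    // MemoryMarshal

static class InsertionSortSketch
{
    // Hypothetical variant, not the change made for this issue: hoist the span's
    // base reference into a ref local so each element access is one add + load,
    // rather than re-reading the (unpromoted) span fields from the stack each time.
    public static void Sort<T>(Span<T> keys)
    {
        ref T r = ref MemoryMarshal.GetReference(keys);  // base address computed once
        for (int i = 0; i < keys.Length - 1; i++)
        {
            T t = Unsafe.Add(ref r, i + 1);
            int j = i;
            while (j >= 0 && Comparer<T>.Default.Compare(t, Unsafe.Add(ref r, j)) < 0)
            {
                Unsafe.Add(ref r, j + 1) = Unsafe.Add(ref r, j);
                j--;
            }
            Unsafe.Add(ref r, j + 1) = t;
        }
    }
}
```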
Here's the codegen for the inner loops of PickPivotAndPartition:

;; 3.1 (via perf; Clang)
;; 0x1a0
2.20 40d118: cmp w11, w12
40d11c: ? b.ge 40d16c <ArrayHelpers<int>::PickPivotAndPartition(int*, int*, int, int)+0x1f4> // b.tcont
3.58 40d120: sbfiz x13, x11, #2, #32
40d124: sbfiz x14, x12, #2, #32
4.48 40d128: ldr w15, [x0, x14]
1.67 40d12c: ldr w16, [x0, x13]
0.62 40d130: str w15, [x0, x13]
1.76 40d134: str w16, [x0, x14]
;; 0x1c0
12.01 40d138: cmp w11, w9
40d13c: ? b.ge 40d150 <ArrayHelpers<int>::PickPivotAndPartition(int*, int*, int, int)+0x1d8> // b.tcont
3.77 40d140: add w11, w11, #0x1
7.81 40d144: ldr w13, [x0, w11, sxtw #2]
40d148: cmp w13, w10
40d14c: ? b.lt 40d138 <ArrayHelpers<int>::PickPivotAndPartition(int*, int*, int, int)+0x1c0> // b.tstop
;; 0x1d8
23.09 40d150: cmp w12, w2
40d154: ? b.le 40d118 <ArrayHelpers<int>::PickPivotAndPartition(int*, int*, int, int)+0x1a0>
6.31 40d158: sub w12, w12, #0x1
14.98 40d15c: ldr w13, [x0, w12, sxtw #2]
40d160: cmp w10, w13
40d164: ? b.lt 40d150 <ArrayHelpers<int>::PickPivotAndPartition(int*, int*, int, int)+0x1d8> // b.tstop
10.99 40d168: ? b 40d118 <ArrayHelpers<int>::PickPivotAndPartition(int*, int*, int, int)+0x1a0>
;; 5.0
G_M55484_IG09:
EB01005F cmp x2, x1
540000A2 bhs G_M55484_IG10
91001042 add x2, x2, #4
B9400045 ldr w5, [x2]
6B05009F cmp w4, w5
54FFFF6C bgt G_M55484_IG09
;; bbWeight=0.50 PerfScore 3.25
G_M55484_IG10:
EB00007F cmp x3, x0
540000A9 bls G_M55484_IG11
D1001063 sub x3, x3, #4
B9400065 ldr w5, [x3]
6B05009F cmp w4, w5
54FFFF6B blt G_M55484_IG10
;; bbWeight=4 PerfScore 26.00
G_M55484_IG11:
EB03005F cmp x2, x3
540000E2 bhs G_M55484_IG13
;; bbWeight=4 PerfScore 6.00
G_M55484_IG12:
B9400045 ldr w5, [x2]
B9400066 ldr w6, [x3]
B9000046 str w6, [x2]
B9000065 str w5, [x3]
EB03005F cmp x2, x3
54FFFDA3 blo G_M55484_IG09

Here the codegen is pretty comparable. It's not obvious why the jitted version appears to be slower. It could be an artifact of running perf over the entire process rather than isolating the part where the process is executing the Tier1 codegen. Repeating the perf measurements with |
So the likely suspect is the codegen in InsertionSort. I don't see anything we can do here for 5.0 in the jit. We might be able to rewrite the C# for InsertionSort instead. |
Thinking about this a bit more, it seems odd that we can't promote the span argument to InsertionSort.
This is a jit limitation. For the ARM64 ABI the struct is passed by value, and the first thing we end up doing is spilling it to the stack:

IN002e: 000008 str x0, [fp,#16] // [V00 arg0]
IN002f: 00000C str x1, [fp,#24] // [V00 arg0+0x08]

So this likely explains why we didn't see this regression on xArch; for those machines this structure is passed by reference and gets promoted, and so the code becomes more optimizable. A potential fix / workaround is to manually promote the span. Let me try this. |
This was something I was hoping to fix for 5.0 (#39326) but was unable to complete in time. |
@adamsitnik here is a link to an index page with links to every test's full history. |
Looking at the history for the array case, the lone point in Nov is the 3.1 baseline; we regressed and then recovered most of it. Can't readily pin down what caused the recovery, as the historical data uses SDK build versions. Current runs show fairly extreme bimodality. There may be some aspect of ARM64 microarchitecture that we need to understand better. The List case does not appear to have a 3.1-equivalent data point but has similar bimodality. |
Possible workaround:

@@ -541,18 +543,20 @@ private static void DownHeap(Span<T> keys, int i, int n, int lo)
 private static void InsertionSort(Span<T> keys)
 {
-    for (int i = 0; i < keys.Length - 1; i++)
+    // Workaround for 41741. Copy span to local to allow promotion on ARM64
+    Span<T> _keys = keys;
+    for (int i = 0; i < _keys.Length - 1; i++)
     {
-        T t = Unsafe.Add(ref MemoryMarshal.GetReference(keys), i + 1);
+        T t = Unsafe.Add(ref MemoryMarshal.GetReference(_keys), i + 1);
         int j = i;
-        while (j >= 0 && (t == null || LessThan(ref t, ref Unsafe.Add(ref MemoryMarshal.GetReference(keys), j))))
+        while (j >= 0 && (t == null || LessThan(ref t, ref Unsafe.Add(ref MemoryMarshal.GetReference(_keys), j))))
         {
-            Unsafe.Add(ref MemoryMarshal.GetReference(keys), j + 1) = Unsafe.Add(ref MemoryMarshal.GetReference(keys), j);
+            Unsafe.Add(ref MemoryMarshal.GetReference(_keys), j + 1) = Unsafe.Add(ref MemoryMarshal.GetReference(_keys), j);
             j--;
         }
-        Unsafe.Add(ref MemoryMarshal.GetReference(keys), j + 1) = t!;
+        Unsafe.Add(ref MemoryMarshal.GetReference(_keys), j + 1) = t!;
     }
 }

Inner loop codegen for ARM64 becomes:

G_M6103_IG04:
11000446 add w6, w2, #1
93407CC6 sxtw x6, w6
D37EF4C6 lsl x6, x6, #2
8B0100C6 add x6, x6, x1
B90000C5 str w5, [x6]
51000442 sub w2, w2, #1
;; bbWeight=8 PerfScore 32.00
G_M6103_IG05:
7100005F cmp w2, #0
540000CB blt G_M6103_IG07
;; bbWeight=32 PerfScore 48.00
G_M6103_IG06:
93407C45 sxtw x5, w2
D37EF4A5 lsl x5, x5, #2
B8656825 ldr w5, [x1, x5]
6B0400BF cmp w5, w4
54FFFE8C bgt G_M6103_IG04

Which, modulo IV widening and address-mode formation, is not too far off from native. On a local standalone test, perf is now:
The RC1 runs are not very stable, though, and sometimes drop down to 3400, so there are still things here that aren't well understood. |
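For readers unfamiliar with the term above: "IV widening" means widening a 32-bit induction variable to pointer size so the per-iteration sign extension (the sxtw before the shift/load) can go away or fold into the address mode. A rough source-level analogy, purely illustrative and of my own making:

```csharp
// Purely illustrative sketch of the "IV widening" idea; names are mine.
// With a 32-bit index the element address is base + sign_extend(i) * 4, which
// shows up on ARM64 as a per-iteration sxtw. A pointer-sized index removes the
// need for that per-element widening.
static class IvWideningSketch
{
    static int SumWithIntIndex(int[] a)
    {
        int sum = 0;
        for (int i = 0; i < a.Length; i++)   // 32-bit induction variable
            sum += a[i];
        return sum;
    }

    static int SumWithNativeIndex(int[] a)
    {
        int sum = 0;
        for (nint i = 0; i < a.Length; i++)  // pointer-sized induction variable
            sum += a[i];
        return sum;
    }
}
```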
BDN reports the workaround above gets back about half the perf; dovetails pretty well with my standalone data. Next steps:
|
so you would think differential profiling should be able to pin down the hotspots. Branch misses are also up but don't seem to impact IPC. Not sure what to make of this. |
Have been looking at method and loop alignment impact. Here's a chart showing what happens as we vary the mod32 alignment of the top of the Tier1 InsertionSort code.

The slight upslope in instructions (grey line) is the impact of the extra nop padding (here placed just after the prolog); as we pad more we execute more instructions overall, but the total instruction count is pretty well constant. The cycles vary more widely, and the fluctuation in cycles is moderately well correlated with branch misses; the impact of bad alignment varies. Note this isn't a completely pure experiment, as modifying the alignment of one method may well impact the alignment of others. But we can see that the fluctuations here are in the 3-8% range. |
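For context, the padding arithmetic behind that kind of experiment is simple; here is a quick sketch of my own (the function name and shape are mine, not JIT code), assuming ARM64's fixed 4-byte instruction size:

```csharp
// Sketch (mine, not JIT code): how many 4-byte ARM64 nops to insert after the
// prolog so the code that follows starts at a chosen offset mod 32.
static int NopCountForAlignment(ulong codeStart, uint desiredOffsetMod32)
{
    // ARM64 code is always 4-byte aligned, so the byte distance below is a multiple of 4.
    ulong current = codeStart % 32;
    ulong target = desiredOffsetMod32 % 32;
    ulong padBytes = (target + 32 - current) % 32;
    return (int)(padBytes / 4);
}
```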
For the remainder of the perf loss, I suspect time may be going into one of the unmapped stubs. Per dotnet/coreclr#26897 I am going to build perf so I can look at annotations on jitted code. |
Haven't had any luck yet with a locally built version of perf. |
Marking as blocking release, at least until I can get a clearer picture of where the perf is going in 5.0. Currently having trouble getting accurate profiles. Using |
Took another look now that I can get annotated profiles of the jitted code.

;; 3.1 "insertion sort" inner loop (native)
94328597 ¦40cc90: ldr w13, [x19,w12,sxtw #2]
49260763 ¦40cc94: cmp w11, w13
57762715 ¦40cc98: ? b.ge 40ccb0
77756830 ¦40cc9c: add w14, w12, #0x1
47159165 ¦40cca0: sub w12, w12, #0x1
44848974 ¦40cca4: cmp w12, w21
25660062 ¦40cca8: str w13, [x19,w14,sxtw #2]
17110607 ¦40ccac: ? b.ge 40cc90
442575592 ¦40ccb0: add w12, w12, #0x1 // branch latency skid?
;; 5.0 "insertion sort" inner loop (jitted)
497165872 80: add x3, x29, #0x10
32905197 84: ldr x3, [x3]
88: add w5, w0, #0x1
23609205 8c: sxtw x5, w5
67146280 90: lsl x5, x5, #2
67408728 94: add x3, x5, x3
98: add x5, x29, #0x10
30088023 9c: ldr x5, [x5]
133792275 a0: ldr w4, [x5,x4]
a4: str w4, [x3]
67686545 a8: sub w0, w0, #0x1
51829740 ac: cmp w0, #0x0
63269723 b0: ? b.lt d0
25938857 b4: add x3, x29, #0x10
116122893 b8: ldr x3, [x3]
58047416 bc: sxtw x4, w0
4154869 c0: lsl x4, x4, #2
35269291 c4: ldr w3, [x3,x4]
162843745 c8: cmp w3, w1
11409506 cc: ? b.gt 80
1566344914 d0: add x3, x29, #0x10 // branch latency skid?

Direct comparison is a bit tricky as the loops are rotated differently: each is basically two parts, and the top of the native version matches the bottom of the jitted version, and vice versa. Short of the source hack above to copy the span, there is not much we can do for 5.0. Going to move this to 6.0. |
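A quick source-level illustration of the "rotated" point (my own sketch, not code from the runtime): the rotated shape tests the loop condition at the bottom, which flips the order of the compare block and the body block, and that is why the two listings line up top-to-bottom in opposite orders.

```csharp
// Illustrative only (names are mine): the same inner loop written top-tested
// and bottom-tested ("rotated"); both shift larger elements right by one slot.
static class LoopRotationSketch
{
    static void ShiftTopTested(int[] keys, int j, int t)
    {
        while (j >= 0 && keys[j] > t)   // compare block sits above the body block
        {
            keys[j + 1] = keys[j];
            j--;
        }
        keys[j + 1] = t;
    }

    static void ShiftRotated(int[] keys, int j, int t)
    {
        if (j >= 0 && keys[j] > t)      // one-time guard
        {
            do                          // body block now sits above the compare block
            {
                keys[j + 1] = keys[j];
                j--;
            } while (j >= 0 && keys[j] > t);
        }
        keys[j + 1] = t;
    }
}
```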
Does this happen mainly on Linux? |
Linux arm64. Probably Windows arm64 too, but we don't track that. |
It looks like we regained the missing perf with #43870. 6.0 data now shows about 16 us, which is roughly where we were in the 3.1 timeframe. The List case shows similar improvements. Going to close this one as fixed. |
Those graphs also appear to show two distinct improvements in reducing noise - presumably the loop alignment work? If so, that's nice to see. |
This is ARM64 -- I don't think we do anything special for loops on this platform yet. Not really sure why the noise level dropped here. |
I see. @adamsitnik could it be BDN or test changes to randomize data alignment perhaps? |
We have the "randomize data alignment" ability but haven't enabled it for any benchmark yet. Also as Andy pointed out, loop alignment techniques are only done for xarch. |
Interesting, unexplained then. Thanks. |
Once I have Arm64 historical data, I will try to dig and narrow down changes. |
After running benchmarks for 3.1 vs 5.0 using the "Ubuntu arm64 Qualcomm Machines" owned by the JIT Team, I've found a few regressions related to sorting.
It may be related to the fact that we have moved the sorting implementation from native to managed code (cc @stephentoub).
@DrewScoggins is there any way to see the full historical data for ARM64?
Repro
System.Collections.Sort<Int32>.Array(Size: 512)
System.Collections.Sort<Int32>.List(Size: 512)
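If it helps anyone reproduce this outside of the dotnet/performance harness, here is a minimal BenchmarkDotNet sketch of my own that approximates the Array case above (class and method names are mine; only the Size = 512 shape matches the benchmark listed):

```csharp
using System;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

public class SortInt32Sketch
{
    private int[] _unsorted;
    private int[] _work;

    [Params(512)]
    public int Size;

    [GlobalSetup]
    public void Setup()
    {
        var rng = new Random(42);
        _unsorted = new int[Size];
        for (int i = 0; i < Size; i++)
            _unsorted[i] = rng.Next();
        _work = new int[Size];
    }

    [Benchmark]
    public void SortArray()
    {
        // Refill with unsorted data on each invocation; note the copy cost is
        // part of the measurement in this simplified sketch.
        _unsorted.AsSpan().CopyTo(_work);
        Array.Sort(_work);
    }

    public static void Main(string[] args) => BenchmarkRunner.Run<SortInt32Sketch>();
}
```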
/cc @JulieLeeMSFT
category:cq
theme:needs-triage
skill-level:expert
cost:large