Jit can generate pointless movs #10315
Comments
Those aren't exactly pointless - they're zero-extending casts from int to long. But yes, they're not necessary, because the loads that produced the values already zeroed out the upper 32 bits. I have a PR in progress that's supposed to deal with some similar cases, but I'm not sure it will have any effect here.
Aside from the codegen issue: current CPUs (at least the Intel ones) will "optimize" these `mov`s away (zero-latency moves).
AFAIK Intel CPUs optimize register-to-register moves only if the source and destination are different. So unfortunately in this case these moves may end up wasting a cycle.
dotnet/coreclr#12676 should take care of the moves if the casts are done early, something like `ulong ushort0 = Unsafe.As<byte, ushort>(ref current);`. Otherwise it gets complicated, and at the moment I don't know what needs to be done to fix this.
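For illustration, here is a minimal sketch of the "cast early" shape that comment refers to, assuming a byte buffer is being read. Only the `ulong ushort0 = Unsafe.As<byte, ushort>(ref current);` line comes from the comment above; the surrounding method name and the constant are made up for the example:

```csharp
using System.Runtime.CompilerServices;

static class EarlyCastSketch
{
    // Hypothetical helper: widening to ulong right at the load site attaches the
    // zero-extension to the load itself, instead of leaving it to show up later
    // as a separate "mov r32, r32" in the middle of 64-bit arithmetic.
    static ulong ReadAndMix(ref byte current)
    {
        // cast done early, as suggested in the comment above
        ulong ushort0 = Unsafe.As<byte, ushort>(ref current);

        // subsequent 64-bit arithmetic can use the value directly
        return ushort0 * 0x9E3779B1UL;
    }
}
```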
If I read the Intel® 64 and IA-32 Architectures Optimization Reference Manual -- Section 3.5.1.12 Zero-Latency MOV Instructions correctly, these moves should be handled as zero-latency moves here. Anyway, it would be better to avoid these `mov`s in the first place.
Yep, in rather typical Intel optimization manual fashion they have left out some details. Agner Fog's optimization manual tells a bit more - including that same-register moves are not eliminated. Of course, Agner Fog could be wrong, but he has put so much effort into gathering and writing down all this information that there's a good chance he is right.
Since the JIT has a habit of generating useless moves for various reasons, it would be useful to get some numbers. Measuring some simple assembly code:

```asm
    mov ecx, 1 << 31
L1:
    dec ecx
    ; case 1
    ; mov ebx, ecx
    ; mov ecx, ebx
    ; case 2
    ; mov ecx, ecx
    test ecx, ecx
    jnz L1
```

On my Haswell I get noticeably different timings depending on which case is enabled. So extra moves aren't without consequences, and even a single `mov ecx, ecx` is not free.
`MOV r32, r32` is not exactly a no-op in 64-bit mode - it zeroes the upper 32 bits of the destination register.
Yes, as already mentioned, they're used to implement zero-extending casts. However, in this case they're useless because the upper bits are already 0. In fact, they're probably useless most of the time, because all 32-bit instructions (well, at least all "normal" ones; there may be some exceptions) zero out the upper 32 bits.
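As a concrete illustration of that point (a hypothetical example, not code from this issue): the 32-bit `add` below already clears the upper half of the destination register, so the implicit widening to `ulong` needs no extra instruction, yet a same-register `mov` may still be emitted for the cast:

```csharp
static class ZeroExtendSketch
{
    // 'a + b' is a 32-bit add, which on x64 already zeroes the upper 32 bits of
    // the destination register, so the uint -> ulong conversion on return is
    // effectively free - but the JIT may still emit a "mov r32, r32" for it.
    static ulong WidenAfterAdd(uint a, uint b)
    {
        uint sum = a + b;
        return sum; // zero-extending cast to ulong
    }
}
```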
What causes the generation of these movs? I'm getting quite a few in my hot loop here https://github.com/Zhentar/xxHash3.NET/blob/39961ae0e5f617216efc64797549de75a2b37dfa/xxHash3/xxHash3.cs#L292 and trying to figure out a workaround.

P.S. I am very disappointed this issue already exists and I don't get to open one titled "Jit likes to mov it, mov it"
😄
@Zhentar Can you post the generated assembly code? I'm pretty sure the moves you're seeing are zero-extending casts too, but I can't see how it all fits together and tell if there's any simple solution for this. And please do yourself and everyone else a favour and do not use …

As for the JIT, he definitely liked to mov it, mov it, but a recent change calmed it down a bit. However, it's still a greedy bastard and wants a paycheck with many zeroes, so it keeps zeroing in the hope it will get it.
Sure, here:

```asm
acc.B = AccumulateOnePair(acc.B, data.B, theKeys.B);
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
mov rax,qword ptr [rcx+8]
mov r9,qword ptr [rdx+8]
mov qword ptr [rsp+68h],r9
mov r9,qword ptr [r8+8]
mov qword ptr [rsp+60h],r9
mov r9d,dword ptr [rsp+68h]
mov r10d,dword ptr [rsp+6Ch]
mov r11d,r9d
add r11d,dword ptr [rsp+60h]
mov r11d,r11d
mov esi,r10d
add esi,dword ptr [rsp+64h]
mov esi,esi
imul r11,rsi
add rax,r11
mov r9d,r9d
add rax,r9
mov r9d,r10d
shl r9,20h
add rax,r9
```
So, I'm more worried about the stack spills at `[rsp+60h]`/`[rsp+68h]`. However, it appears that you already changed accumulate to

```csharp
[MethodImpl(MethodImplOptions.AggressiveInlining)]
static ulong AccumulateOnePair(uint valueLeft, uint valueRight, uint keyLeft, uint keyRight)
{
    return valueLeft + ((ulong)valueRight << 32) + Multiply32to64(valueLeft + keyLeft, valueRight + keyRight);
}
```

and got rid of the structs. I'd go one step further and "hoist" the casts:

```csharp
[MethodImpl(MethodImplOptions.AggressiveInlining)]
static ulong AccumulateOnePair(ulong valueLeft, ulong valueRight, ulong keyLeft, ulong keyRight)
{
    return valueLeft + (valueRight << 32) + (ulong)(uint)(valueLeft + keyLeft) * (ulong)(uint)(valueRight + keyRight);
}
```

With this I'm seeing assembly code like

```asm
00007FFA1EDD37C4 45 8B 52 04          mov    r10d,dword ptr [r10+4]
00007FFA1EDD37C8 4D 8D 98 80 00 00 00 lea    r11,[r8+80h]
00007FFA1EDD37CF 45 8B 5B 10          mov    r11d,dword ptr [r11+10h]
00007FFA1EDD37D3 49 8D B0 80 00 00 00 lea    rsi,[r8+80h]
00007FFA1EDD37DA 8B 76 14             mov    esi,dword ptr [rsi+14h]
00007FFA1EDD37DD 45 03 D9             add    r11d,r9d
00007FFA1EDD37E0 41 03 F2             add    esi,r10d
00007FFA1EDD37E3 4C 0F AF DE          imul   r11,rsi
00007FFA1EDD37E7 4C 03 C8             add    r9,rax
00007FFA1EDD37EA 49 C1 E2 20          shl    r10,20h
```

which should be better. Structs are still causing problems - a bunch of pointless `lea`s are still being generated.
Woah! That slashed execution time for me by 20%!
I'm getting a bunch of pointless moves when using `Unsafe.As`:

```asm
; Benchmarks.Program.UnsafeCopyTest()
sub rsp,38
xor eax,eax
mov rdx,[rcx+8]
cmp dword ptr [rdx+8],0
jle short M00_L01
M00_L00:
mov rdx,[rcx+8]
mov r8,rdx
mov dword ptr [rsp+30],0F3F7A0C0 // loading constant here
mov r9d,[rsp+30]                 // ... moving
mov [rsp+28],r9d                 // ... stuff
mov r9d,[rsp+28]                 // ... around
cmp eax,[r8+8]
jae short M00_L02
movsxd r10,eax
mov [r8+r10*4+10],r9d            // ideally writing a constant here
inc eax
cmp [rdx+8],eax
jg short M00_L00
M00_L01:
add rsp,38
ret
M00_L02:
call CORINFO_HELP_RNGCHKFAIL
int 3
; Total bytes of code 78
```

Source and further info:

Produced by nightly build: …

Motivation is moving away from … Generally it seems to work really well. Code generated for … Note that it is possible to work around the problem by casting the LHS of the assignment (a sketch of what this might look like follows the listing below).

```csharp
using System;
using System.Runtime.CompilerServices;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

namespace Benchmarks
{
    [DisassemblyDiagnoser(printSource: true)]
    public class Program
    {
        private Color[] ColorMemory;

        public Program()
        {
            ColorMemory = new Color[1 << 20];
        }

        [Benchmark]
        public void UnsafeCopyTest()
        {
            for (int i = 0; i < ColorMemory.Length; i++)
                ColorMemory[i] = ColorARGB.SomeColor;
        }

        public static void Main()
        {
            BenchmarkRunner.Run<Program>();
        }
    }

    public struct Color
    {
        public int Value;
    }

    public struct ColorARGB
    {
        public static explicit operator ColorARGB(uint color) => Unsafe.As<uint, ColorARGB>(ref color);
        public static implicit operator Color(ColorARGB color) => Unsafe.As<ColorARGB, Color>(ref color);

        public static ColorARGB SomeColor => (ColorARGB)0xF3F7A0C0;

        public byte B;
        public byte G;
        public byte R;
        public byte A;
    }
}
```
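A minimal sketch of what "casting the LHS of the assignment" might look like for the benchmark above - the exact workaround isn't spelled out in the comment, so treat this as an assumption. The hypothetical method below would be added to the `Program` class and reuses `ColorMemory`, `Color`, and `ColorARGB` from the listing:

```csharp
// Hypothetical variant of UnsafeCopyTest: reinterpret the destination element
// instead of converting the source value, so the intermediate Color temp (and
// the moves that shuffle it through the stack) should not be needed.
[Benchmark]
public void UnsafeCopyTestLhsCast()
{
    for (int i = 0; i < ColorMemory.Length; i++)
        Unsafe.As<Color, ColorARGB>(ref ColorMemory[i]) = ColorARGB.SomeColor;
}
```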
Most likely the root cause is residual "address exposure" that is blocking promotion of some of the intermediate temp structs. Will update once I've had a chance to confirm/refute.
@AndyAyersMS did you get a chance to look at it? Otherwise, your guess sounds like this has a separate root cause and should probably get its own issue?
@weltkante Thanks for the reminder. I haven't looked yet. Let me see if I can find time in the next few days.
Eugh https://twitter.com/uops_info/status/1367961302845386752

The ICL065 erratum says: …
Will look into this next, as I've been doing work to remove unnecessary moves.
The emitter already has a mechanism for eliminating unnecessary zero-extending `mov`s:

```cpp
//------------------------------------------------------------------------
// AreUpper32BitsZero: check if some previously emitted
//   instruction set the upper 32 bits of reg to zero.
//
// Arguments:
//    reg - register of interest
//
// Return Value:
//    true if previous instruction zeroed reg's upper 32 bits.
//    false if it did not, or if we can't safely determine.
//
// Notes:
//    Currently only looks back one instruction.
//
//    movsx eax, ... might seem viable but we always encode this
//    instruction with a 64 bit destination. See TakesRexWPrefix.
bool emitter::AreUpper32BitsZero(regNumber reg)
```

Its current implementation only looks at the last emitted instruction to see if its destination register holds a zero-extended 32-bit value. I did an experiment allowing it to look back at up to 4 previously emitted instructions to determine whether the given register holds a zero-extended 32-bit value, and that eliminated the redundant `mov`s.

I don't think it's trivial to try to resolve this issue in the earlier phases. In morph, it would either require removing … As of now, I think simply extending the existing logic in `AreUpper32BitsZero` is the most practical approach.
e.g. …

Example: The Checksum routine Magma.Common/Checksum.cs generates asm with 3 pointless movs.
category:cq
theme:basic-cq
skill-level:intermediate
cost:medium
impact:medium