Jit can generate pointless movs #10315

Closed
benaadams opened this issue May 11, 2018 · 22 comments · Fixed by #79381
Labels: area-CodeGen-coreclr (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI), optimization
Assignee: TIHan
Milestone: 8.0.0
benaadams (Member) commented May 11, 2018

e.g.

 mov    r8d,dword ptr [rax]  
 add    rax,4  
 mov    r8d,r8d                 ; pointless mov
 ...
 movzx  eax,byte ptr [rax]  
 mov    eax,eax                 ; pointless mov   

Example

The Checksum routine in Magma.Common/Checksum.cs:

// Internet Checksum as defined by RFC 791, RFC 793, RFC 1071, RFC 1141, RFC 1624
public static ushort Calcuate(ref byte buffer, int length)
{
    ref var current = ref buffer;
    ulong sum = 0;

    while (length >= sizeof(ulong))
    {
        length -= sizeof(ulong);

        var ulong0 = Unsafe.As<byte, ulong>(ref current);
        current = ref Unsafe.Add(ref current, sizeof(ulong));

        // Add with carry
        sum += ulong0;
        if (sum < ulong0)
        {
            sum++;
        }
    }

    if ((length & sizeof(uint)) != 0)
    {
        var uint0 = Unsafe.As<byte, uint>(ref current);
        current = ref Unsafe.Add(ref current, sizeof(uint));

        // Add with carry
        sum += uint0;
        if (sum < uint0)
        {
            sum++;
        }
    }

    if ((length & sizeof(ushort)) != 0)
    {
        var ushort0 = Unsafe.As<byte, ushort>(ref current);
        current = ref Unsafe.Add(ref current, sizeof(ushort));

        // Add with carry
        sum += ushort0;
        if (sum < ushort0)
        {
            sum++;
        }
    }

    if ((length & sizeof(byte)) != 0)
    {
        var byte0 = current;

        // Add with carry
        sum += byte0;
        if (sum < byte0)
        {
            sum++;
        }
    }

    // Fold down to 16 bits

    var uint1 = (uint)(sum >> 32);
    var uint2 = (uint)sum;

    // Add with carry
    uint1 += uint2;
    if (uint1 < uint2)
    {
        uint1++;
    }

    var ushort2 = (ushort)uint1;
    var ushort1 = (ushort)(uint1 >> 16);

    // Add with carry
    ushort1 = (ushort)(ushort1 + ushort2);
    if (ushort1 < ushort2)
    {
        ushort1++;
    }

    // Invert to get ones-complement result 
    return (ushort)~ushort1;
}
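
For context, a call site for this ref-based signature might look like the following sketch (hypothetical usage, not from the repo; it assumes MemoryMarshal from System.Runtime.InteropServices):

// Hypothetical usage sketch: checksum a packet held in a span.
// MemoryMarshal.GetReference yields the ref byte the method expects.
ReadOnlySpan<byte> packet = stackalloc byte[20]; // e.g. an IPv4 header
ushort checksum = Checksum.Calcuate(ref MemoryMarshal.GetReference(packet), packet.Length);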

The routine generates asm with three pointless movs:

 mov    rax,rcx  
 xor    ecx,ecx  
 cmp    edx,8  
 jl     00007FFB767741B4  
 add    edx,0FFFFFFF8h  
 mov    r8,qword ptr [rax]  
 add    rax,8  
 add    rcx,r8  
 cmp    rcx,r8  
 jae    00007FFB767741AF  
 inc    rcx  
 cmp    edx,8  
 jge    00007FFB7677419A  
 test   dl,4  
 je     00007FFB767741CE  
 mov    r8d,dword ptr [rax]  
 add    rax,4  
 mov    r8d,r8d                 ; pointless mov 
 add    rcx,r8  
 cmp    rcx,r8  
 jae    00007FFB767741CE  
 inc    rcx  
 test   dl,2  
 je     00007FFB767741E9  
 movzx  r8d,word ptr [rax]  
 add    rax,2  
 mov    r8d,r8d                 ; pointless mov   
 add    rcx,r8  
 cmp    rcx,r8  
 jae    00007FFB767741E9  
 inc    rcx  
 test   dl,1  
 je     00007FFB767741FE  
 movzx  eax,byte ptr [rax]  
 mov    eax,eax                 ; pointless mov   
 add    rcx,rax  
 cmp    rcx,rax  
 jae    00007FFB767741FE  
 inc    rcx  
 mov    rax,rcx  
 shr    rax,20h  
 add    eax,ecx  
 cmp    eax,ecx  
 jae    00007FFB7677420D  
 inc    eax  
 movzx  edx,ax  
 shr    eax,10h  
 movzx  eax,ax  
 add    eax,edx  
 movzx  eax,ax  
 cmp    eax,edx  
 jge    00007FFB76774224  
 inc    eax  
 movzx  eax,ax  
 not    eax  
 movzx  eax,ax  
 ret    

category:cq
theme:basic-cq
skill-level:intermediate
cost:medium
impact:medium

mikedn (Contributor) commented May 11, 2018

Those aren't exactly pointless; they're zero-extending casts from int to long. But yes, they're unnecessary, because the loads that produced the values have already zeroed out the upper 32 bits.

I have a PR in progress that's supposed to deal with some similar cases, but I'm not sure it will have any effect here; the Unsafe.Add between the def and the use might cause problems. I'll check later today.
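
As a standalone illustration of where such a cast comes from (a minimal sketch, not code from this issue):

// Minimal repro sketch: the implicit uint -> ulong widening becomes a
// zero-extending CAST node. The 32-bit load into `value` already cleared
// the upper half of the register, yet the JIT may still emit `mov eax, eax`.
static ulong Widen(uint[] data, int i)
{
    uint value = data[i]; // 32-bit load: upper 32 bits are already zero
    return value;         // uint -> ulong cast: candidate for a redundant mov
}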

gfoidl (Member) commented May 11, 2018

Aside from the codegen issue: won't current CPUs (at least the Intel ones) "optimize" these movs away?

mikedn (Contributor) commented May 11, 2018

Aside from the codegen issue: won't current CPUs (at least the Intel ones) "optimize" these movs away?

AFAIK Intel CPUs optimize register-to-register moves only if the source and destination are different. So unfortunately, in this case these moves may end up wasting a cycle.

mikedn (Contributor) commented May 11, 2018

dotnet/coreclr#12676 should take care of the moves if the casts are done early, something like:

ulong ushort0 = Unsafe.As<byte, ushort>(ref current);
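
Expanded to the full ushort branch of the checksum above, the suggestion would read something like this sketch:

if ((length & sizeof(ushort)) != 0)
{
    // Widen at the load so no separate CAST node survives to codegen.
    ulong ushort0 = Unsafe.As<byte, ushort>(ref current);
    current = ref Unsafe.Add(ref current, sizeof(ushort));

    // Add with carry
    sum += ushort0;
    if (sum < ushort0)
    {
        sum++;
    }
}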

Otherwise it gets complicated and at the moment I don't know what needs to be done to fix this.

gfoidl (Member) commented May 11, 2018

only if the source and destination are different.

If I read the Intel® 64 and IA-32 Architectures Optimization Reference Manual, Section 3.5.1.12 (Zero-Latency MOV Instructions), correctly, the mov here can be eliminated. It doesn't state whether it is definitely eliminated, though.

Anyway, it would be better to avoid these movs in the asm altogether.

mikedn (Contributor) commented May 11, 2018

It doesn't state whether it is definitely eliminated.

Yep, in rather typical Intel optimization manual fashion, they have left out some details. Agner Fog's optimization manual says a bit more: same-register moves are not eliminated, movzx eax, bl is no longer eliminated on Haswell, and move elimination succeeds in more than 80% of cases.

Of course, Agner Fog could be wrong, but he has put so much effort into gathering and writing down all this information that there's a good chance he is right.

mikedn (Contributor) commented May 11, 2018

Since the JIT has a habit of generating useless moves for various reasons, it would be useful to get some numbers. Measuring some simple assembly code:

    mov ecx, 1 << 31        ; loop counter
L1:
    dec ecx

    ; case 1: move via a separate register
    ; mov ebx, ecx
    ; mov ecx, ebx

    ; case 2: same-register move
    ; mov ecx, ecx

    test ecx, ecx
    jnz L1

On my Haswell I get

Case        Time
No moves    ~700 ms
Case 1      ~1000 ms
Case 2      ~1300 ms

So extra moves aren't without consequences, and a single mov ecx, ecx is more costly than two moves through a separate register.

voinokin commented:

MOV r32, r32 is not exactly a no-op in 64-bit mode; it zeroes the upper 32 bits of the register.

mikedn (Contributor) commented May 21, 2018

MOV r32, r32 is not exactly a no-op in 64-bit mode; it zeroes the upper 32 bits of the register.

Yes, as already mentioned, they're used to implement zero-extending casts. However, in this case they're useless because the upper bits are already 0.

In fact, they're probably useless most of the time, because all 32-bit instructions (well, at least all "normal" ones; there may be some exceptions) zero out the upper 32 bits.

Zhentar commented Mar 23, 2019

What causes the generation of these movs? I'm getting quite a few in my hot loop here https://github.com/Zhentar/xxHash3.NET/blob/39961ae0e5f617216efc64797549de75a2b37dfa/xxHash3/xxHash3.cs#L292 and I'm trying to figure out a workaround.

p.s. I am very disappointed this issue already exists and I don't get to open one titled "Jit likes to mov it, mov it"

stephentoub (Member) commented Mar 23, 2019

I am very disappointed this issue already exists and I don't get to open one titled "Jit likes to mov it, mov it"

😄

mikedn (Contributor) commented Mar 23, 2019

@Zhentar Can you post the generated assembly code? I'm pretty sure the moves you're seeing are zero-extending casts too, but I can't see how it all fits together or tell whether there's any simple solution. And please do yourself and everyone else a favour and don't use var in that kind of code 😛

As for the JIT, it definitely liked to mov it, mov it, but a recent change calmed it down a bit. However, it's still a greedy bastard that wants a paycheck with many zeroes, so it keeps zeroing in the hope it will get one.

Zhentar commented Mar 23, 2019

Sure, here:

   acc.B = AccumulateOnePair(acc.B, data.B, theKeys.B);
   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
       mov     rax,qword ptr [rcx+8]
       mov     r9,qword ptr [rdx+8]
       mov     qword ptr [rsp+68h],r9
       mov     r9,qword ptr [r8+8]
       mov     qword ptr [rsp+60h],r9
       mov     r9d,dword ptr [rsp+68h]
       mov     r10d,dword ptr [rsp+6Ch]
       mov     r11d,r9d
       add     r11d,dword ptr [rsp+60h]
       mov     r11d,r11d
       mov     esi,r10d
       add     esi,dword ptr [rsp+64h]
       mov     esi,esi
       imul    r11,rsi
       add     rax,r11
       mov     r9d,r9d
       add     rax,r9
       mov     r9d,r10d
       shl     r9,20h
       add     rax,r9

mikedn (Contributor) commented Mar 23, 2019

So, mov r11d, r11d, mov esi, esi, and mov r9d, r9d are definitely uint -> ulong casts. The mov r9d, r10d before the shift is likely a cast too (it may be a real move if r10d is used afterwards). They are unnecessary because previous operations have already zeroed out the upper 32 bits of those registers, but the JIT doesn't know that. I think at least some of these cases could be fixed, but I'm not sure yet. And I'm not sure it's strictly necessary, at least in this case.

I'm more worried about the stack spills at [rsp+68h] and [rsp+60h]. They look like bad struct handling; unfortunately, the JIT has lots of that.

However, it appears that you have already changed AccumulateOnePair to

[MethodImpl(MethodImplOptions.AggressiveInlining)]
static ulong AccumulateOnePair(uint valueLeft, uint valueRight, uint keyLeft, uint keyRight)
{
    return valueLeft + ((ulong)valueRight << 32) + Multiply32to64(valueLeft + keyLeft, valueRight + keyRight);
}

and got rid of the structs. I'd go one step further and "hoist" the casts:

[MethodImpl(MethodImplOptions.AggressiveInlining)]
static ulong AccumulateOnePair(ulong valueLeft, ulong valueRight, ulong keyLeft, ulong keyRight)
{
    return valueLeft + (valueRight << 32) + (ulong)(uint)(valueLeft + keyLeft) * (ulong)(uint)(valueRight + keyRight);
}

With this I'm seeing assembly code like

00007FFA1EDD37C4 45 8B 52 04          mov         r10d,dword ptr [r10+4]  
00007FFA1EDD37C8 4D 8D 98 80 00 00 00 lea         r11,[r8+80h]  
00007FFA1EDD37CF 45 8B 5B 10          mov         r11d,dword ptr [r11+10h]  
00007FFA1EDD37D3 49 8D B0 80 00 00 00 lea         rsi,[r8+80h]  
00007FFA1EDD37DA 8B 76 14             mov         esi,dword ptr [rsi+14h]  
00007FFA1EDD37DD 45 03 D9             add         r11d,r9d  
00007FFA1EDD37E0 41 03 F2             add         esi,r10d  
00007FFA1EDD37E3 4C 0F AF DE          imul        r11,rsi  
00007FFA1EDD37E7 4C 03 C8             add         r9,rax  
00007FFA1EDD37EA 49 C1 E2 20          shl         r10,20h

which should be better.
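
For reference, Multiply32to64 from the earlier version isn't shown in the thread; presumably it is a 32x32 -> 64-bit multiply along these lines (my assumption):

// Presumed shape of the helper: widen one operand so the multiply is done in 64 bits.
[MethodImpl(MethodImplOptions.AggressiveInlining)]
static ulong Multiply32to64(uint left, uint right) => (ulong)left * right;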

Structs are still causing problems: a bunch of pointless leas are generated. I need to take a closer look; I have a bone to pick with the JIT about those leas.

Zhentar commented Mar 23, 2019

Woah! That slashed execution time for me by 20%!

msftgits transferred this issue from dotnet/coreclr on Jan 31, 2020
msftgits added this to the Future milestone on Jan 31, 2020
weltkante (Contributor) commented May 27, 2020

I'm getting a bunch of pointless moves when using Unsafe.As, but it's slightly different from what has been reported here so far: it seems the JIT doesn't do copy propagation after Unsafe.As. Should I open a separate issue for that?

; Benchmarks.Program.UnsafeCopyTest()
       sub       rsp,38
       xor       eax,eax
       mov       rdx,[rcx+8]
       cmp       dword ptr [rdx+8],0
       jle       short M00_L01
M00_L00:
       mov       rdx,[rcx+8]
       mov       r8,rdx
       mov       dword ptr [rsp+30],0F3F7A0C0 // loading constant here
       mov       r9d,[rsp+30]                 // ... moving
       mov       [rsp+28],r9d                 // ... stuff
       mov       r9d,[rsp+28]                 // ... around
       cmp       eax,[r8+8]
       jae       short M00_L02
       movsxd    r10,eax
       mov       [r8+r10*4+10],r9d            // ideally writing a constant here
       inc       eax
       cmp       [rdx+8],eax
       jg        short M00_L00
M00_L01:
       add       rsp,38
       ret
M00_L02:
       call      CORINFO_HELP_RNGCHKFAIL
       int       3
; Total bytes of code 78
Source and further info:

Produced by nightly build:
.NET Core SDK=5.0.100-preview.6.20273.3
.NET Core 5.0.0 (CoreCLR 5.0.20.27110, CoreFX 5.0.20.27110), X64 RyuJIT
(compiled as Release and run without a debugger, of course)

The motivation is moving away from ExplicitLayout/FieldOffset unions to a more programmer-friendly API. The idea is to declare storage as a byte-order-agnostic Color array/span, while the actual code is written against structs with a concrete byte order.

Generally it seems to work really well. Code generated for MemoryMarshal.Cast(Span) and ref Unsafe.As optimizes really well; only by-value Unsafe.As seems to generate a lot of redundant moves.

Note that it is possible to work around the problem by casting the LHS of the assignment ColorMemory[i] = ... via ref Unsafe.As to the correct type, but that's not possible in all situations; sometimes you do want a by-value conversion, and having it generate significantly slower code is annoying (measured 2x-4x slower in different code snippets).
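
For reference, the ref-based workaround described above would look something like this sketch (using the types from the benchmark below):

// Sketch of the workaround: reinterpret the destination element via ref,
// so no by-value Unsafe.As conversion of the right-hand side is involved.
for (int i = 0; i < ColorMemory.Length; i++)
    Unsafe.As<Color, ColorARGB>(ref ColorMemory[i]) = ColorARGB.SomeColor;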

using System;
using System.Runtime.CompilerServices;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

namespace Benchmarks
{
    [DisassemblyDiagnoser(printSource: true)]
    public class Program
    {
        private Color[] ColorMemory;

        public Program()
        {
            ColorMemory = new Color[1 << 20];
        }

        [Benchmark]
        public void UnsafeCopyTest()
        {
            for (int i = 0; i < ColorMemory.Length; i++)
                ColorMemory[i] = ColorARGB.SomeColor;
        }

        public static void Main()
        {
            BenchmarkRunner.Run<Program>();
        }
    }

    public struct Color
    {
        public int Value;
    }

    public struct ColorARGB
    {
        public static explicit operator ColorARGB(uint color) => Unsafe.As<uint, ColorARGB>(ref color);
        public static implicit operator Color(ColorARGB color) => Unsafe.As<ColorARGB, Color>(ref color);

        public static ColorARGB SomeColor => (ColorARGB)0xF3F7A0C0;

        public byte B;
        public byte G;
        public byte R;
        public byte A;
    }
}

AndyAyersMS (Member) commented:

Most likely the root cause is residual "address exposure" that is blocking promotion of some of the intermediate temp structs. I will update once I've had a chance to confirm or refute that.
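
To illustrate the hypothesis (a sketch of what address exposure means here, not a confirmed diagnosis; FromUInt is a hypothetical stand-in for the conversion operator):

// `color` is a by-value parameter, but Unsafe.As takes its address. If the
// JIT can't prove the resulting ref doesn't escape, `color` is marked
// "address exposed" and forced to a stack slot instead of a register,
// which would explain the [rsp+...] traffic in the disassembly above.
static ColorARGB FromUInt(uint color) => Unsafe.As<uint, ColorARGB>(ref color);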

weltkante (Contributor) commented:

@AndyAyersMS, have you had a chance to look at it? Otherwise, your guess suggests a separate root cause, and it should probably get its own issue?

AndyAyersMS (Member) commented:

@weltkante Thanks for the reminder.

I haven't looked yet. Let me see if I can find time in the next few days.

BruceForstall added the JitUntriaged (CLR JIT issues needing additional triage) label on Oct 28, 2020
benaadams (Member, Author) commented:

Eugh: https://twitter.com/uops_info/status/1367961302845386752

I just noticed that with a recent microcode update, #Intel Ice Lake CPUs don't perform 'move elimination' any more for GPR registers. This is probably related to the ICL065 erratum

The ICL065 erratum says:

[screenshot of the erratum text, omitted]

TIHan removed the JitUntriaged (CLR JIT issues needing additional triage) label on Oct 31, 2022
TIHan self-assigned this on Oct 31, 2022
TIHan (Contributor) commented Nov 29, 2022

I'll look into this next, as I've been doing work lately to remove unnecessary movsx/movzx instructions.

TIHan (Contributor) commented Dec 7, 2022

The emitter already has a mechanism for eliminating unnecessary mov instructions:

//------------------------------------------------------------------------
// AreUpper32BitsZero: check if some previously emitted
//     instruction set the upper 32 bits of reg to zero.
//
// Arguments:
//    reg - register of interest
//
// Return Value:
//    true if previous instruction zeroed reg's upper 32 bits.
//    false if it did not, or if we can't safely determine.
//
// Notes:
//    Currently only looks back one instruction.
//
//    movsx eax, ... might seem viable but we always encode this
//    instruction with a 64 bit destination. See TakesRexWPrefix.

bool emitter::AreUpper32BitsZero(regNumber reg)

Its current implementation only looks at the last emitted instruction to see whether it zero-extends the upper 32 bits of its destination register.

As an experiment, I allowed it to look back at up to 4 previously emitted instructions to determine whether the given register already has its upper 32 bits zeroed. This eliminated the mov instructions mentioned in this issue's description.

I don't think it's trivial to resolve this issue in the earlier phases. In morph, it would either require removing CAST nodes or adding a new flag to force them to be a no-op; removing CAST nodes in morph is troublesome and can lead to bad HIR, while adding a flag adds more complexity. Lowering could be possible, but we would need a way to track the upper bits.

As of now, I think simply extending the existing logic in AreUpper32BitsZero would be the way to go.
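
A conceptual sketch of that extended lookback, written here in C# over hypothetical model types (the real logic lives in the C++ emitter and its details differ):

using System.Collections.Generic;

static class PeepholeSketch
{
    // Hypothetical model, not the JIT's actual data structures: scan back up
    // to 4 emitted instructions; give up at control flow, and let the most
    // recent write to `reg` decide whether its upper 32 bits are known zero.
    public sealed record Instr(int DestReg, bool IsControlFlowOrCall, bool ZeroExtends32Bits);

    public static bool AreUpper32BitsZero(int reg, IReadOnlyList<Instr> emitted)
    {
        const int MaxLookback = 4;
        for (int i = emitted.Count - 1, n = 0; i >= 0 && n < MaxLookback; i--, n++)
        {
            Instr ins = emitted[i];
            if (ins.IsControlFlowOrCall)
                return false;             // can't reason across labels or calls
            if (ins.DestReg != reg)
                continue;                 // writes some other register; keep scanning
            return ins.ZeroExtends32Bits; // the defining write decides
        }
        return false; // no defining write found within the window
    }
}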

ghost added the in-pr (There is an active PR which will close this issue when it is merged) label on Dec 8, 2022
TIHan modified the milestone from Future to 8.0.0 on Jan 17, 2023
ghost removed the in-pr label on Feb 25, 2023
ghost locked as resolved and limited conversation to collaborators on Mar 27, 2023