Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Loop Unrolling is not Enabled in Release Build #41063

Closed
dogac00 opened this issue Aug 19, 2020 · 9 comments
Closed

Loop Unrolling is not Enabled in Release Build #41063

dogac00 opened this issue Aug 19, 2020 · 9 comments
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI tenet-performance Performance related issue
Milestone

Comments

@dogac00
Copy link

dogac00 commented Aug 19, 2020

I'm trying the following code in SharpLab. In GCC this loop is unrolled and there are no jump statements in generated assembly.

public class Sample {
        public int SumOfFirstThreeElements(int [] arr) {
            int sum = 0;
            for (int i = 0; i < 3; i++)
                sum += arr[i];
            return sum;
        }
    }

// The generated IL is this:

.class private auto ansi '<Module>'
{
} // end of class <Module>

.class public auto ansi beforefieldinit Sample
    extends [System.Private.CoreLib]System.Object
{
    // Methods
    .method public hidebysig 
        instance int32 SumOfFirstThreeElements (
            int32[] arr
        ) cil managed 
    {
        // Method begins at RVA 0x2050
        // Code size 22 (0x16)
        .maxstack 3
        .locals init (
            [0] int32 sum,
            [1] int32 i
        )

        IL_0000: ldc.i4.0
        IL_0001: stloc.0
        IL_0002: ldc.i4.0
        IL_0003: stloc.1
        // sequence point: hidden
        IL_0004: br.s IL_0010
        // loop start (head: IL_0010)
            IL_0006: ldloc.0
            IL_0007: ldarg.1
            IL_0008: ldloc.1
            IL_0009: ldelem.i4
            IL_000a: add
            IL_000b: stloc.0
            IL_000c: ldloc.1
            IL_000d: ldc.i4.1
            IL_000e: add
            IL_000f: stloc.1

            IL_0010: ldloc.1
            IL_0011: ldc.i4.3
            IL_0012: blt.s IL_0006
        // end loop

        IL_0014: ldloc.0
        IL_0015: ret
    } // end of method Sample::SumOfFirstThreeElements

    .method public hidebysig specialname rtspecialname 
        instance void .ctor () cil managed 
    {
        // Method begins at RVA 0x2072
        // Code size 7 (0x7)
        .maxstack 8

        IL_0000: ldarg.0
        IL_0001: call instance void [System.Private.CoreLib]System.Object::.ctor()
        IL_0006: ret
    } // end of method Sample::.ctor

} // end of class Sample

As we can see a loop code is generated with jump statements. I think the performance would be better if the loop was unrolled.

SharpLab link.

Thanks.

@dogac00 dogac00 added the tenet-performance Performance related issue label Aug 19, 2020
@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added the untriaged New issue has not been triaged by the area owner label Aug 19, 2020
@Dotnet-GitSync-Bot
Copy link
Collaborator

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

@SingleAccretion
Copy link
Contributor

SingleAccretion commented Aug 19, 2020

I am not sure why are you looking at the IL here though: it is produced by Roslyn, not an optimizing compiler. The JIT output here is perhaps more interesting in that it appears to have two (?) loops:

Sample.SumOfFirstThreeElements(Int32[])
    L0000: sub rsp, 0x28
    L0004: xor eax, eax
    L0006: xor edx, edx
    L0008: test rcx, rcx
    L000b: je short L0024
    L000d: cmp dword ptr [rcx+8], 3
    L0011: jl short L0024
    L0013: movsxd r8, edx
    L0016: add eax, [rcx+r8*4+0x10]
    L001b: inc edx
    L001d: cmp edx, 3
    L0020: jl short L0013
    L0022: jmp short L0038
    L0024: cmp edx, [rcx+8]
    L0027: jae short L003d
    L0029: movsxd r8, edx
    L002c: add eax, [rcx+r8*4+0x10]
    L0031: inc edx
    L0033: cmp edx, 3
    L0036: jl short L0024
    L0038: add rsp, 0x28
    L003c: ret
    L003d: call 0x00007ffe9888fc00
    L0042: int3

I also confirmed this is still the codegen on the recent-ish (month old) version of the runtime.

@dogac00
Copy link
Author

dogac00 commented Aug 19, 2020

@SingleAccretion Good point. I now looked at the JIT output too. GCC 9.2 produces an assembly output like this:

sum_of_first_three_elements(int*):
        mov     eax, DWORD PTR [rdi+4]
        add     eax, DWORD PTR [rdi]
        add     eax, DWORD PTR [rdi+8]
        ret

I hope that in .NET we can produce a similar output to this with bounds checking.

@SingleAccretion
Copy link
Contributor

SingleAccretion commented Aug 19, 2020

I am more or less sure we can try and bend JIT to do out bidding in this particular case. E. g., this version produces better ASM:

int sum = 0;
for (int i = 0; i <= 2; i++)
    sum += arr[i];

return sum;
G_M22200_IG01:
       sub      rsp, 40
						;; bbWeight=1    PerfScore 0.25
G_M22200_IG02:
       xor      eax, eax
       xor      edx, edx
       mov      r8d, dword ptr [rcx+8]
						;; bbWeight=1    PerfScore 2.50
G_M22200_IG03:
       cmp      edx, r8d
       jae      SHORT G_M22200_IG05
       movsxd   r9, edx
       add      eax, dword ptr [rcx+4*r9+16]
       inc      edx
       cmp      edx, 2
       jle      SHORT G_M22200_IG03
						;; bbWeight=4    PerfScore 20.00
G_M22200_IG04:
       add      rsp, 40
       ret      
						;; bbWeight=1    PerfScore 1.25
G_M22200_IG05:
       call     CORINFO_HELP_RNGCHKFAIL
       int3   

@dogac00
Copy link
Author

dogac00 commented Aug 19, 2020

Interesting. But most of code I've seen in loops compare with < than <=.

I tried the unsafe version of this code. And it also does not do loop unrolling optimization.

public class Sample {
    public unsafe int SumOfFirstThreeElements(int* arr) {
        int sum = 0;
        for (int i = 0; i < 3; i++)
            sum += arr[i];

        return sum;
    }
}

x86 JIT Assembly.

Sample.SumOfFirstThreeElements(Int32*)
    L0000: push ebp
    L0001: mov ebp, esp
    L0003: xor eax, eax
    L0005: xor ecx, ecx
    L0007: add eax, [edx+ecx*4]
    L000a: inc ecx
    L000b: cmp ecx, 3
    L000e: jl short L0007
    L0010: pop ebp
    L0011: ret

@AndyAyersMS
Copy link
Member

Thanks for raising this issue. You are observing a number of quirks in the jit's loop optimization strategy. We hope to address some of these in the not-too-distant future.

I think the performance would be better if the loop was unrolled.

Probably so. The jit can unroll loops but it's not clear if the criteria it uses is well-tuned. See eg #4248, #8107.

The JIT output here is perhaps more interesting in that it appears to have two (?) loops

This is a result of loop cloning -- because of the potential for exceptions the jit needs to produce a version of the loop that bounds checks each iteration. But an up-front test can determine that such a bounds check will always pass. So the jit generates a second loop with no bounds checking, and runs one version of the loop or the other, depending.

this version produces better ASM:

As with unrolling, the cloning heuristics are not well tuned. On this two-element case, I would expect cloning to kick in and the version without bounds checks to then get unrolled, but no... see #4929, #8558.

I tried the unsafe version of this code. And it also does not do loop unrolling optimization.

The jit's loop analysis doesn't understand unsafe accesses very well, so unsafe codegen won't trigger some of the optimizations that one sees with normal array accesses.

@AndyAyersMS AndyAyersMS added area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI and removed untriaged New issue has not been triaged by the area owner labels Aug 20, 2020
@AndyAyersMS AndyAyersMS added this to the Future milestone Aug 20, 2020
@EgorBo
Copy link
Member

EgorBo commented Aug 20, 2020

@AndyAyersMS

This is a result of loop cloning -- because of the potential for exceptions the jit needs to produce a version of the loop that bounds checks each iteration. But an up-front test can determine that such a bounds check will always pass. So the jit generates a second loop with no bounds checking, and runs one version of the loop or the other, depending.

very nice trick! so as far as I understand it looks like this:

public int SumOfFirstThreeElements(int[] arr)
{
    int sum = 0;
    for (int i = 0; i < 3; i++)
        sum += arr[i]; // bounds check each iteration
    return sum;
}

// is optimized to:

public int SumOfFirstThreeElements(int[] arr)
{
    int sum = 0;
    if (arr.Length < 3)
        goto unsafe_loop;

    for (int i = 0; i < 3; i++)
        sum += arr[i]
    return sum;

unsafe_loop:
    for (int i = 0; i < 3; i++)
        sum += arr[i]; // bound check each iteration
    return sum;
}

But does it make sense to do it for "3" iterations? (I mean any small constant).
Also I wonder if it should do something like this instead:

public int SumOfFirstThreeElements(int[] arr)
{
    int sum = 0;
    int i = 0;

    for (; i < Math.Min(3, arr.Length); i++)
        sum += arr[i]; // no bounds check 

    for (; i < 3; i++)
        sum += arr[i]; // unsafe area, bounds checks

    return sum;
}

(inspired by https://github.com/llvm/llvm-project/blob/master/llvm/lib/Transforms/Scalar/InductiveRangeCheckElimination.cpp#L24-L41)

@AndyAyersMS
Copy link
Member

I have seen quite a few examples where I believe the loop cloner's profitability heuristic is suspect. #8558 is the master issue for this.

@BruceForstall
Copy link
Member

Given that we have several existing issues covering loop unrolling and cloning, I'm going to close this one.

@ghost ghost locked as resolved and limited conversation to collaborators Dec 7, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI tenet-performance Performance related issue
Projects
None yet
Development

No branches or pull requests

6 participants