Loop Unrolling is not Enabled in Release Build #41063

dogac00 · 2020-08-19T22:44:31Z

I'm trying the following code in SharpLab. In GCC this loop is unrolled and there are no jump statements in generated assembly.

public class Sample {
        public int SumOfFirstThreeElements(int [] arr) {
            int sum = 0;
            for (int i = 0; i < 3; i++)
                sum += arr[i];
            return sum;
        }
    }

// The generated IL is this:

.class private auto ansi '<Module>'
{
} // end of class <Module>

.class public auto ansi beforefieldinit Sample
    extends [System.Private.CoreLib]System.Object
{
    // Methods
    .method public hidebysig 
        instance int32 SumOfFirstThreeElements (
            int32[] arr
        ) cil managed 
    {
        // Method begins at RVA 0x2050
        // Code size 22 (0x16)
        .maxstack 3
        .locals init (
            [0] int32 sum,
            [1] int32 i
        )

        IL_0000: ldc.i4.0
        IL_0001: stloc.0
        IL_0002: ldc.i4.0
        IL_0003: stloc.1
        // sequence point: hidden
        IL_0004: br.s IL_0010
        // loop start (head: IL_0010)
            IL_0006: ldloc.0
            IL_0007: ldarg.1
            IL_0008: ldloc.1
            IL_0009: ldelem.i4
            IL_000a: add
            IL_000b: stloc.0
            IL_000c: ldloc.1
            IL_000d: ldc.i4.1
            IL_000e: add
            IL_000f: stloc.1

            IL_0010: ldloc.1
            IL_0011: ldc.i4.3
            IL_0012: blt.s IL_0006
        // end loop

        IL_0014: ldloc.0
        IL_0015: ret
    } // end of method Sample::SumOfFirstThreeElements

    .method public hidebysig specialname rtspecialname 
        instance void .ctor () cil managed 
    {
        // Method begins at RVA 0x2072
        // Code size 7 (0x7)
        .maxstack 8

        IL_0000: ldarg.0
        IL_0001: call instance void [System.Private.CoreLib]System.Object::.ctor()
        IL_0006: ret
    } // end of method Sample::.ctor

} // end of class Sample

As we can see a loop code is generated with jump statements. I think the performance would be better if the loop was unrolled.

SharpLab link.

Thanks.

The text was updated successfully, but these errors were encountered:

Dotnet-GitSync-Bot · 2020-08-19T22:44:34Z

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

SingleAccretion · 2020-08-19T23:22:17Z

I am not sure why are you looking at the IL here though: it is produced by Roslyn, not an optimizing compiler. The JIT output here is perhaps more interesting in that it appears to have two (?) loops:

Sample.SumOfFirstThreeElements(Int32[])
    L0000: sub rsp, 0x28
    L0004: xor eax, eax
    L0006: xor edx, edx
    L0008: test rcx, rcx
    L000b: je short L0024
    L000d: cmp dword ptr [rcx+8], 3
    L0011: jl short L0024
    L0013: movsxd r8, edx
    L0016: add eax, [rcx+r8*4+0x10]
    L001b: inc edx
    L001d: cmp edx, 3
    L0020: jl short L0013
    L0022: jmp short L0038
    L0024: cmp edx, [rcx+8]
    L0027: jae short L003d
    L0029: movsxd r8, edx
    L002c: add eax, [rcx+r8*4+0x10]
    L0031: inc edx
    L0033: cmp edx, 3
    L0036: jl short L0024
    L0038: add rsp, 0x28
    L003c: ret
    L003d: call 0x00007ffe9888fc00
    L0042: int3

I also confirmed this is still the codegen on the recent-ish (month old) version of the runtime.

dogac00 · 2020-08-19T23:31:32Z

@SingleAccretion Good point. I now looked at the JIT output too. GCC 9.2 produces an assembly output like this:

sum_of_first_three_elements(int*):
        mov     eax, DWORD PTR [rdi+4]
        add     eax, DWORD PTR [rdi]
        add     eax, DWORD PTR [rdi+8]
        ret

I hope that in .NET we can produce a similar output to this with bounds checking.

SingleAccretion · 2020-08-19T23:34:16Z

I am more or less sure we can try and bend JIT to do out bidding in this particular case. E. g., this version produces better ASM:

int sum = 0;
for (int i = 0; i <= 2; i++)
    sum += arr[i];

return sum;

G_M22200_IG01:
       sub      rsp, 40
						;; bbWeight=1    PerfScore 0.25
G_M22200_IG02:
       xor      eax, eax
       xor      edx, edx
       mov      r8d, dword ptr [rcx+8]
						;; bbWeight=1    PerfScore 2.50
G_M22200_IG03:
       cmp      edx, r8d
       jae      SHORT G_M22200_IG05
       movsxd   r9, edx
       add      eax, dword ptr [rcx+4*r9+16]
       inc      edx
       cmp      edx, 2
       jle      SHORT G_M22200_IG03
						;; bbWeight=4    PerfScore 20.00
G_M22200_IG04:
       add      rsp, 40
       ret      
						;; bbWeight=1    PerfScore 1.25
G_M22200_IG05:
       call     CORINFO_HELP_RNGCHKFAIL
       int3

dogac00 · 2020-08-19T23:47:57Z

Interesting. But most of code I've seen in loops compare with < than <=.

I tried the unsafe version of this code. And it also does not do loop unrolling optimization.

public class Sample {
    public unsafe int SumOfFirstThreeElements(int* arr) {
        int sum = 0;
        for (int i = 0; i < 3; i++)
            sum += arr[i];

        return sum;
    }
}

x86 JIT Assembly.

Sample.SumOfFirstThreeElements(Int32*)
    L0000: push ebp
    L0001: mov ebp, esp
    L0003: xor eax, eax
    L0005: xor ecx, ecx
    L0007: add eax, [edx+ecx*4]
    L000a: inc ecx
    L000b: cmp ecx, 3
    L000e: jl short L0007
    L0010: pop ebp
    L0011: ret

AndyAyersMS · 2020-08-20T01:19:37Z

Thanks for raising this issue. You are observing a number of quirks in the jit's loop optimization strategy. We hope to address some of these in the not-too-distant future.

I think the performance would be better if the loop was unrolled.

Probably so. The jit can unroll loops but it's not clear if the criteria it uses is well-tuned. See eg #4248, #8107.

The JIT output here is perhaps more interesting in that it appears to have two (?) loops

This is a result of loop cloning -- because of the potential for exceptions the jit needs to produce a version of the loop that bounds checks each iteration. But an up-front test can determine that such a bounds check will always pass. So the jit generates a second loop with no bounds checking, and runs one version of the loop or the other, depending.

this version produces better ASM:

As with unrolling, the cloning heuristics are not well tuned. On this two-element case, I would expect cloning to kick in and the version without bounds checks to then get unrolled, but no... see #4929, #8558.

I tried the unsafe version of this code. And it also does not do loop unrolling optimization.

The jit's loop analysis doesn't understand unsafe accesses very well, so unsafe codegen won't trigger some of the optimizations that one sees with normal array accesses.

EgorBo · 2020-08-20T11:28:53Z

@AndyAyersMS

This is a result of loop cloning -- because of the potential for exceptions the jit needs to produce a version of the loop that bounds checks each iteration. But an up-front test can determine that such a bounds check will always pass. So the jit generates a second loop with no bounds checking, and runs one version of the loop or the other, depending.

very nice trick! so as far as I understand it looks like this:

public int SumOfFirstThreeElements(int[] arr)
{
    int sum = 0;
    for (int i = 0; i < 3; i++)
        sum += arr[i]; // bounds check each iteration
    return sum;
}

// is optimized to:

public int SumOfFirstThreeElements(int[] arr)
{
    int sum = 0;
    if (arr.Length < 3)
        goto unsafe_loop;

    for (int i = 0; i < 3; i++)
        sum += arr[i]
    return sum;

unsafe_loop:
    for (int i = 0; i < 3; i++)
        sum += arr[i]; // bound check each iteration
    return sum;
}

But does it make sense to do it for "3" iterations? (I mean any small constant).
Also I wonder if it should do something like this instead:

public int SumOfFirstThreeElements(int[] arr)
{
    int sum = 0;
    int i = 0;

    for (; i < Math.Min(3, arr.Length); i++)
        sum += arr[i]; // no bounds check 

    for (; i < 3; i++)
        sum += arr[i]; // unsafe area, bounds checks

    return sum;
}

(inspired by https://github.com/llvm/llvm-project/blob/master/llvm/lib/Transforms/Scalar/InductiveRangeCheckElimination.cpp#L24-L41)

AndyAyersMS · 2020-08-20T16:07:49Z

I have seen quite a few examples where I believe the loop cloner's profitability heuristic is suspect. #8558 is the master issue for this.

BruceForstall · 2020-10-10T00:42:37Z

Given that we have several existing issues covering loop unrolling and cloning, I'm going to close this one.

dogac00 added the tenet-performance Performance related issue label Aug 19, 2020

Dotnet-GitSync-Bot added the untriaged New issue has not been triaged by the area owner label Aug 19, 2020

AndyAyersMS added area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI and removed untriaged New issue has not been triaged by the area owner labels Aug 20, 2020

AndyAyersMS added this to the Future milestone Aug 20, 2020

BruceForstall closed this as completed Oct 10, 2020

BruceForstall mentioned this issue Oct 17, 2020

Improve JIT loop optimizations (.NET 6) #43549

Closed

25 tasks

ghost locked as resolved and limited conversation to collaborators Dec 7, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Loop Unrolling is not Enabled in Release Build #41063

Loop Unrolling is not Enabled in Release Build #41063

dogac00 commented Aug 19, 2020 •

edited

Loading

Dotnet-GitSync-Bot commented Aug 19, 2020

SingleAccretion commented Aug 19, 2020 •

edited

Loading

dogac00 commented Aug 19, 2020

SingleAccretion commented Aug 19, 2020 •

edited

Loading

dogac00 commented Aug 19, 2020 •

edited

Loading

AndyAyersMS commented Aug 20, 2020

EgorBo commented Aug 20, 2020 •

edited

Loading

AndyAyersMS commented Aug 20, 2020

BruceForstall commented Oct 10, 2020

Loop Unrolling is not Enabled in Release Build #41063

Loop Unrolling is not Enabled in Release Build #41063

Comments

dogac00 commented Aug 19, 2020 • edited Loading

Dotnet-GitSync-Bot commented Aug 19, 2020

SingleAccretion commented Aug 19, 2020 • edited Loading

dogac00 commented Aug 19, 2020

SingleAccretion commented Aug 19, 2020 • edited Loading

dogac00 commented Aug 19, 2020 • edited Loading

AndyAyersMS commented Aug 20, 2020

EgorBo commented Aug 20, 2020 • edited Loading

AndyAyersMS commented Aug 20, 2020

BruceForstall commented Oct 10, 2020

dogac00 commented Aug 19, 2020 •

edited

Loading

SingleAccretion commented Aug 19, 2020 •

edited

Loading

SingleAccretion commented Aug 19, 2020 •

edited

Loading

dogac00 commented Aug 19, 2020 •

edited

Loading

EgorBo commented Aug 20, 2020 •

edited

Loading