Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JIT should propagate assertions and elide custom (bounds) checks #44040

Closed
Sergio0694 opened this issue Oct 30, 2020 · 16 comments
Closed

JIT should propagate assertions and elide custom (bounds) checks #44040

Sergio0694 opened this issue Oct 30, 2020 · 16 comments
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI Priority:2 Work that is important, but not critical for the release tenet-performance Performance related issue
Milestone

Comments

@Sergio0694
Copy link
Contributor

Sergio0694 commented Oct 30, 2020

Description

Hi, I've noticed that the JIT currently doesn't seem to track what assertions are done on the state of registers/locals between different jump labels in the generated code, resulting in completely redundant conditional branches being repeated even though by looking at the assembly one can actually see the variables being tested couldn't possibly have changed. I've discovered this issue while working on the Span2D<T> type in the Microsoft.Toolkit.HighPerformance package (here), which in some cases it's actually hurt quite a bit by the lack of this optimization (unless users just manually switch to using unsafe code to work around it).

As suggested by @tannergooding - first here's a minimal repro that illustrates the issue:

C# repro code (click to expand)
public readonly struct Wrapper
{
    private readonly int length;

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public int Test(int i)
    {
        static void Throw() => throw new ArgumentException();

        if ((uint)i >= (uint)this.length)
        {
            Throw();
        }

        return 1;
    }
}

public static int Repro(Wrapper wrapper)
{
    int result = 0;

    for (int i = 0; i < 100; i++)
    {
        // We're calling Test twice here, but the same issue happens for any number
        // of times we repeat the call: each generates one extra redundant check
        result += wrapper.Test(i);
        result += wrapper.Test(i);
    }

    return result;
}

This produces the following assembly code:

asm x86-64 codegen (click to expand)
; Assembly listing for method Span2DBenchmark.Program:Foo(Span2DBenchmark.Wrapper):int
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; Tier-1 compilation
; optimized code
; rsp based frame
; fully interruptible
; Final local variable assignments
;
;* V00 arg0         [V00    ] (  0,  0   )  struct ( 8) zero-ref    ld-addr-op
;  V01 loc0         [V01,T01] (  6, 18   )     int  ->  rax        
;  V02 loc1         [V02,T00] (  6, 21   )     int  ->  rdx        
;  V03 OutArgs      [V03    ] (  1,  1   )  lclBlk (32) [rsp+0x00]   "OutgoingArgSpace"
;  V04 tmp1         [V04,T02] (  3,  9   )     int  ->  rcx         V00.length(offs=0x00) P-INDEP "field V00.length (fldOffset=0x0)"
;
; Lcl frame size = 40

G_M50921_IG01:
       sub      rsp, 40
						;; bbWeight=1    PerfScore 0.25
G_M50921_IG02:
       xor      eax, eax
       xor      edx, edx
						;; bbWeight=1    PerfScore 0.50
G_M50921_IG03:
       cmp      edx, ecx
       jae      SHORT G_M50921_IG07
						;; bbWeight=4    PerfScore 5.00
G_M50921_IG04:
       inc      eax
       cmp      edx, ecx
       jae      SHORT G_M50921_IG07
						;; bbWeight=4    PerfScore 6.00
G_M50921_IG05:
       inc      eax
       inc      edx
       cmp      edx, 100
       jl       SHORT G_M50921_IG03
						;; bbWeight=4    PerfScore 7.00
G_M50921_IG06:
       add      rsp, 40
       ret      
						;; bbWeight=1    PerfScore 1.25
G_M50921_IG07:
       call     Span2DBenchmark.Wrapper:<Test>g__Throw|1_0()
       int3     
						;; bbWeight=0    PerfScore 0.00

; Total bytes of code 38, prolog size 4, PerfScore 23.80, (MethodHash=665a3916) for method Span2DBenchmark.Program:Foo(Span2DBenchmark.Wrapper):int
; ============================================================

You can see that each call to Wrapper.Test is inline and produces that inc eax as expected. But, each call also adds one extra conditional jump that checks the current index in edx. After the first one (G_M50921_IG03), the others are just not needed, as the test is only working on edx and ecx, which are not modified by any of the following instructions until the end of the loop. The JIT doesn't seem to currently be aware of the state of these registers, so it just doesn't see that no instructions are writing to those registers to be able to know that those repeated conditional branches are just not necessary.

My original example

C# repro code (click to expand)
private static int Sum(Span2D<int> span)
{
    int height = span.Height;
    int width = span.Width;
    int sum = 0;

    for (int i = 0; i < height; i++)
    {
        for (int j = 0; j < width; j++)
        {
            sum += span[i, j];
        }
    }

    return sum;
}

This produces the following assembly code:

asm x86-64 codegen (click to expand)
; Assembly listing for method Span2DBenchmark.Benchmark:Anujfrewf(Microsoft.Toolkit.HighPerformance.Memory.Span2D`1[Int32]):int
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; optimized code
; rsp based frame
; fully interruptible
; Final local variable assignments
;
;  V00 arg0         [V00,T01] (  6, 72   )   byref  ->  rcx         ld-addr-op
;  V01 loc0         [V01,T09] (  3,  6   )     int  ->   r8        
;  V02 loc1         [V02,T06] (  3, 21   )     int  ->  r10        
;  V03 loc2         [V03,T03] (  4, 34   )     int  ->  rax        
;  V04 loc3         [V04,T02] (  6, 45   )     int  ->  r11        
;  V05 loc4         [V05,T00] (  6, 76   )     int  ->  rsi        
;  V06 OutArgs      [V06    ] (  1,  1   )  lclBlk (32) [rsp+0x00]   "OutgoingArgSpace"
;  V07 tmp1         [V07,T05] (  2, 32   )    long  ->  rbx         "Inline stloc first use temp"
;* V08 tmp2         [V08    ] (  0,  0   )   byref  ->  zero-ref    "impAppendStmt"
;* V09 tmp3         [V09    ] (  0,  0   )  struct (16) zero-ref    ld-addr-op "Inlining Arg"
;  V10 tmp4         [V10,T04] (  2, 32   )   byref  ->  rdi         V09._pointer(offs=0x00) P-INDEP "field V09._pointer (fldOffset=0x0)"
;* V11 tmp5         [V11    ] (  0,  0   )     int  ->  zero-ref    V09._length(offs=0x08) P-INDEP "field V09._length (fldOffset=0x8)"
;  V12 cse0         [V12,T07] (  3, 18   )     int  ->  rdx         "CSE - aggressive"
;  V13 cse1         [V13,T08] (  3, 10   )     int  ->   r9         "CSE - aggressive"
;
; Lcl frame size = 40

G_M41662_IG01:
       push     rdi
       push     rsi
       push     rbp
       push     rbx
       sub      rsp, 40
						;; bbWeight=1    PerfScore 4.25
G_M41662_IG02:
       mov      edx, dword ptr [rcx+8]
       mov      r8d, edx
       mov      r9d, dword ptr [rcx+16]
       mov      r10d, r9d
       xor      eax, eax
       xor      r11d, r11d
       test     r8d, r8d
       jle      SHORT G_M41662_IG08
						;; bbWeight=1    PerfScore 6.25
G_M41662_IG03:
       xor      esi, esi
       test     r10d, r10d
       jle      SHORT G_M41662_IG07
						;; bbWeight=4    PerfScore 6.00
G_M41662_IG04:
       cmp      r11d, edx
       jae      SHORT G_M41662_IG09
						;; bbWeight=16    PerfScore 20.00
G_M41662_IG05:
       cmp      esi, r9d
       jae      SHORT G_M41662_IG09
						;; bbWeight=8    PerfScore 10.00
G_M41662_IG06:
       mov      rdi, bword ptr [rcx]
       mov      ebx, r11d
       mov      ebp, dword ptr [rcx+20]
       imul     rbx, rbp
       mov      ebp, esi
       add      rbx, rbp
       add      eax, dword ptr [rdi+4*rbx]
       inc      esi
       cmp      esi, r10d
       jl       SHORT G_M41662_IG04
						;; bbWeight=16    PerfScore 164.00
G_M41662_IG07:
       inc      r11d
       cmp      r11d, r8d
       jl       SHORT G_M41662_IG03
						;; bbWeight=4    PerfScore 6.00
G_M41662_IG08:
       add      rsp, 40
       pop      rbx
       pop      rbp
       pop      rsi
       pop      rdi
       ret      
						;; bbWeight=1    PerfScore 3.25
G_M41662_IG09:
       call     Microsoft.Toolkit.HighPerformance.Memory.Internals.ThrowHelper:ThrowIndexOutOfRangeException()
       int3     
						;; bbWeight=0    PerfScore 0.00

; Total bytes of code 99, prolog size 8, PerfScore 229.65, (MethodHash=8b1c5d41) for method Span2DBenchmark.Benchmark:Anujfrewf(Microsoft.Toolkit.HighPerformance.Memory.Span2D`1[Int32]):int
; ============================================================

Here the codegen is repeating the bounds check for i every single time the indexer is used, even though none of the registers being checked is every updated by the following code. Ideally the check should only be done once, and then that jl SHORT G_M41662_IG04 should instead jump to G_M41662_IG05 and only repeat the check for j, which is updated in that block of code. Like, that jump label to directly go back to the check for j is in fact generated, it's just not being used.

Benchmarks

I run a quick benchmark for this method against a version where I manually skipped the bounds check for i after the first iteration of the loop, and got about a 12% speed improvement due to this change alone. I think it might be worth it for the JIT to be able to track situations such as this one automatically, though admittedly I'm not sure how difficult it would be to enable this.

image

Configuration

  • .NET Core 5.0.0 (CoreCLR 5.0.20.47505, CoreFX 5.0.20.47505), X64 RyuJIT
  • Codegen done with disasmo on a local checked build of .NET 5 from master, cloned a couple weeks ago
@Sergio0694 Sergio0694 added the tenet-performance Performance related issue label Oct 30, 2020
@Dotnet-GitSync-Bot
Copy link
Collaborator

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added the untriaged New issue has not been triaged by the area owner label Oct 30, 2020
@jeffschwMSFT jeffschwMSFT added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Oct 30, 2020
@JulieLeeMSFT JulieLeeMSFT added needs-further-triage Issue has been initially triaged, but needs deeper consideration or reconsideration and removed untriaged New issue has not been triaged by the area owner labels Nov 30, 2020
@JulieLeeMSFT JulieLeeMSFT added this to the 6.0.0 milestone Nov 30, 2020
@JulieLeeMSFT JulieLeeMSFT assigned briansull and unassigned echesakov Nov 30, 2020
@JulieLeeMSFT
Copy link
Member

@briansull PTAL at bound check issues.

@briansull
Copy link
Contributor

I will investigate

@briansull
Copy link
Contributor

In your first test case the register edx holds the loop counter and changes from 0 to 99
this.length is not set and is zero (the default value), so the comparison is necessary as the value of edx is changing
and in fact the second time it will execute Throw();

@briansull
Copy link
Contributor

Here is my slightly modified variant of your first test case:

using System;
using System.Runtime.CompilerServices;

public class Issue_44040
{
    public readonly struct Wrapper
    {
        private readonly int length;

        public Wrapper(int _len)
        {
            length = _len;
        }
    
        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        public int Test(int i)
        {
            static void Throw() => throw new ArgumentException();
    
            if ((uint)i >= (uint)this.length)
            {
                Throw();
            }
    
            return 1;
        }
    }
    
    [MethodImpl(MethodImplOptions.NoInlining)]
    public static int Repro(Wrapper wrapper, int max)
    {
        int result = 0;
    
        for (int i = 0; i < max; i++)
        {
            // We're calling Test three times here, but the same issue happens for any
            // number of times we repeat
            result += wrapper.Test(i);
            result += wrapper.Test(i);
        }
    
        return result;
    }
    
    public static int Main(string[] args)
    {
        int max = 100;
        Wrapper w = new Wrapper(max);

        int res = Repro(w, max);

        Console.Write("Test Passed");
        return 100;
    }
}

@briansull
Copy link
Contributor

briansull commented Dec 6, 2020

And here is the assembly code for Repro, with Test(int) inlined three times:

; Assembly listing for method Issue_44040:Repro(Wrapper,int):int
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; optimized code
; rsp based frame
; fully interruptible
; Final local variable assignments
;
;* V00 arg0         [V00    ] (  0,  0   )  struct ( 8) zero-ref    ld-addr-op
;  V01 arg1         [V01,T02] (  4,  7   )     int  ->  rdx        
;  V02 loc0         [V02,T00] (  8, 26   )     int  ->  rax        
;  V03 loc1         [V03,T01] (  5, 17   )     int  ->   r8        
;  V04 OutArgs      [V04    ] (  1,  1   )  lclBlk (32) [rsp+0x00]   "OutgoingArgSpace"
;  V05 tmp1         [V05,T03] (  2,  5   )     int  ->  rcx         V00.length(offs=0x00) P-INDEP "field V00.length (fldOffset=0x0)"
;
; Lcl frame size = 40

G_M36428_IG01:              ;; offset=0000H
       4883EC28             sub      rsp, 40
						;; bbWeight=1    PerfScore 0.25
G_M36428_IG02:              ;; offset=0004H
       33C0                 xor      eax, eax
       4533C0               xor      r8d, r8d
       85D2                 test     edx, edx
       7E13                 jle      SHORT G_M36428_IG05
						;; bbWeight=1    PerfScore 1.75
G_M36428_IG03:              ;; offset=000DH
       443BC1               cmp      r8d, ecx
       7313                 jae      SHORT G_M36428_IG06
						;; bbWeight=4    PerfScore 5.00
G_M36428_IG04:              ;; offset=0012H
       FFC0                 inc      eax
       FFC0                 inc      eax
       FFC0                 inc      eax
       41FFC0               inc      r8d
       443BC2               cmp      r8d, edx
       7CED                 jl       SHORT G_M36428_IG03
						;; bbWeight=4    PerfScore 9.00
G_M36428_IG05:              ;; offset=0020H
       4883C428             add      rsp, 40
       C3                   ret      
						;; bbWeight=1    PerfScore 1.25
G_M36428_IG06:              ;; offset=0025H
       E846FBFFFF           call     Wrapper:<Test>g__Throw|2_0()
       CC                   int3     
						;; bbWeight=0    PerfScore 0.00

; Total bytes of code 43, prolog size 4, PerfScore 21.55, instruction count 17 (MethodHash=bb8471b3) for method Issue_44040:Repro(Wrapper,int):int
; ============================================================

@briansull
Copy link
Contributor

Looks optimal to me:

Only one compare and jump to Throw()
And three increments of eax

@briansull
Copy link
Contributor

briansull commented Dec 6, 2020

For your second test case we have many known issue with optimizations of multi-dimensional arrays.
I would suggest that you always use Jagged arrays if you want the best performance.

sum += span[i][j];

@Sergio0694
Copy link
Contributor Author

Sergio0694 commented Dec 6, 2020

@briansull Thank you for taking a look!
The first test case does seem to work correctly now, was that addressed in a recent commit?
Because that codegen I posted was taken from master from a couple months ago.
Regardless, glad that first bit is solved at least 😄

As for the second case, that's not actually an ND array, but rather a custom type that lets users map arbitrary 2D memory regions (which might or might not be contiguous). It's the Span2D<T> type from the Microsoft.Toolkit.HighPerformance package. Here's a very small version of that type that reproduces the issue mentioned above:

Span2D type (click to expand):
public readonly ref struct Span2D<T>
{
    private readonly Span<T> span;
    private readonly int width;
    internal readonly int Stride;

    public int Height
    {
        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        get => this.span.Length;
    }

    public int Width
    {
        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        get => this.width;
    }

    public ref T this[int row, int column]
    {
        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        get
        {
            if ((uint)row >= (uint)span.Length ||
                (uint)column >= (uint)this.width)
            {
                static void Throw() => throw new IndexOutOfRangeException();

                Throw();
            }

            return ref DangerousGetReferenceAt(row, column);
        }
    }

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public ref T DangerousGetReferenceAt(int i, int j)
    {
        ref T r0 = ref MemoryMarshal.GetReference(this.span);
        nint index = ((nint)(uint)i * (nint)(uint)this.Stride) + (nint)(uint)j;

        return ref Unsafe.Add(ref r0, index);
    }
}

With this and the sample code in this issue, I do get that unnecessary bounds checks in the resulting codegen. So the issue was not really about ND arrays per se, but rather about the possibility of the JIT to just be able to track these kind of assertions in general, and not just for a specific type. As in, ideally a fix for this would not just improve the codegen for ND arrays in the future, but to all similar cases such as this one, so that types outside of the BCL such as Span2D<T> could benefit from that as well.

Right now this has quite the performance penalty unless users switch to completely unsafe code by using Span2D<T>.DangerousGetReferenceAt directly (which has its obvious downsides to it of course). Or, they could grab a span for a row and then just index that, which would be pretty much the same as using a jagged array as you suggested, but the resulting code would be more verbose. It would certainly be nicer if the JIT could just see this and produce the correct codegen on its own.

Do you think such an optimization could be added to the JIT in the future, our would something like this be technically too expensive to implement, at least for now? Thanks again for your time! 😊

@briansull
Copy link
Contributor

briansull commented Dec 7, 2020

Do you have a compile ready version of that demonstrates the issue, that would help me investigate this?

In your last post you code uses DangerousGetReferenceAt, which is the completely unsafe code that you want to avoid writing.

In the test case I would expect to see a double nested loop, where you want the JIT to somehow hoist out one of the bounds checks.
But I am not exactly sure what you want the JIT to be able to do.

@briansull
Copy link
Contributor

Also The JIT has to be fairly conservative in this area. We could have different thread that could change the values of width and/or Stride. readonly fields don't give you 100% protection from that as Reflection could be used to change these values as well.

@Sergio0694
Copy link
Contributor Author

Sergio0694 commented Dec 7, 2020

We could have different thread that could change the values of width and/or Stride. readonly fields don't give you 100% protection from that as Reflection could be used to change these values as well.

You're absolutely right, but what I wanted to highlight here was that the code was already not safe against that scenario anyway, since the JIT is just caching the value of that field into a register for the entire time and never checking it again. So if another thread even used reflection to change that field, that method would never be able to see the change anyway, no? 🤔

EDIT: also in this case that's a ref struct passed by copy, so I don't think there would be a way for anyone else to really change any of those fields at all?

Do you have a compile ready version of that demonstrates the issue, that would help me investigate this?

Sure thing! Consider the following full repro:

C# repro code (click to expand)
public static int Sum(Span2D<int> span)
{
    int height = span.Height;
    int width = span.Width;
    int sum = 0;

    for (int i = 0; i < height; i++)
    {
        for (int j = 0; j < width; j++)
        {
            sum += span[i, j];
        }
    }

    return sum;
}
public readonly ref struct Span2D<T>
{
    private readonly Span<T> span;
    private readonly int width;
    internal readonly int Stride;

    public int Height
    {
        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        get => this.span.Length;
    }

    public int Width
    {
        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        get => this.width;
    }

    public ref T this[int row, int column]
    {
        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        get
        {
            if ((uint)row >= (uint)span.Length ||
                (uint)column >= (uint)this.width)
            {
                static void Throw() => throw new IndexOutOfRangeException();

                Throw();
            }

            return ref DangerousGetReferenceAt(row, column);
        }
    }

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public ref T DangerousGetReferenceAt(int i, int j)
    {
        ref T r0 = ref MemoryMarshal.GetReference(this.span);
        nint index = ((nint)(uint)i * (nint)(uint)this.Stride) + (nint)(uint)j;

        return ref Unsafe.Add(ref r0, index);
    }
}
Resulting x64 codegen (click to expand)
; Assembly listing for method RefToArrayTest.Program:Sum(RefToArrayTest.Span2D`1[Int32]):int
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; Tier-1 compilation
; optimized code
; rsp based frame
; fully interruptible
; Final local variable assignments
;
;  V00 arg0         [V00,T01] (  6, 72   )   byref  ->  rcx         ld-addr-op
;  V01 loc0         [V01,T10] (  3,  6   )     int  ->   r8        
;  V02 loc1         [V02,T07] (  3, 21   )     int  ->  r10        
;  V03 loc2         [V03,T05] (  4, 34   )     int  ->  rax        
;  V04 loc3         [V04,T04] (  6, 45   )     int  ->  r11        
;  V05 loc4         [V05,T00] (  6, 76   )     int  ->  rsi        
;  V06 OutArgs      [V06    ] (  1,  1   )  lclBlk (32) [rsp+0x00]   "OutgoingArgSpace"
;  V07 tmp1         [V07,T06] (  2, 32   )    long  ->  rbx         "Inline stloc first use temp"
;  V08 tmp2         [V08,T02] (  2, 64   )   byref  ->  rdi         "impAppendStmt"
;  V09 tmp3         [V09,T03] (  2, 64   )  struct (16) [rsp+0x28]   do-not-enreg[SFB] must-init ld-addr-op "Inlining Arg"
;  V10 cse0         [V10,T08] (  3, 18   )     int  ->  rdx         "CSE - aggressive"
;  V11 cse1         [V11,T09] (  3, 10   )     int  ->   r9         "CSE - aggressive"
;
; Lcl frame size = 56

G_M54696_IG01:
       push     rdi
       push     rsi
       push     rbp
       push     rbx
       sub      rsp, 56
       vzeroupper 
       xor      rax, rax
       mov      qword ptr [rsp+28H], rax
       mov      qword ptr [rsp+30H], rax
						;; bbWeight=1    PerfScore 7.50
G_M54696_IG02:
       mov      edx, dword ptr [rcx+8]
       mov      r8d, edx
       mov      r9d, dword ptr [rcx+16]
       mov      r10d, r9d
       xor      eax, eax
       xor      r11d, r11d
       test     r8d, r8d
       jle      SHORT G_M54696_IG09
						;; bbWeight=1    PerfScore 6.25
G_M54696_IG03:
       xor      esi, esi
       test     r10d, r10d
       jle      SHORT G_M54696_IG08
						;; bbWeight=4    PerfScore 6.00
G_M54696_IG04:
       cmp      r11d, edx
       jae      SHORT G_M54696_IG10
						;; bbWeight=16    PerfScore 20.00
G_M54696_IG05:
       cmp      esi, r9d
       jae      SHORT G_M54696_IG10
						;; bbWeight=8    PerfScore 10.00
G_M54696_IG06:
       vmovdqu  xmm0, xmmword ptr [rcx]
       vmovdqu  xmmword ptr [rsp+28H], xmm0
						;; bbWeight=16    PerfScore 48.00
G_M54696_IG07:
       mov      rdi, bword ptr [rsp+28H]
       mov      ebx, r11d
       mov      ebp, dword ptr [rcx+20]
       imul     rbx, rbp
       mov      ebp, esi
       add      rbx, rbp
       add      eax, dword ptr [rdi+4*rbx]
       inc      esi
       cmp      esi, r10d
       jl       SHORT G_M54696_IG04
						;; bbWeight=16    PerfScore 148.00
G_M54696_IG08:
       inc      r11d
       cmp      r11d, r8d
       jl       SHORT G_M54696_IG03
						;; bbWeight=4    PerfScore 6.00
G_M54696_IG09:
       add      rsp, 56
       pop      rbx
       pop      rbp
       pop      rsi
       pop      rdi
       ret      
						;; bbWeight=1    PerfScore 3.25
G_M54696_IG10:
       call     RefToArrayTest.Span2D`1[Int32][System.Int32]:<get_Item>g__Throw|8_0()
       int3     
						;; bbWeight=0    PerfScore 0.00

; Total bytes of code 126, prolog size 23, PerfScore 267.80, instruction count 47 (MethodHash=f1042a57) for method RefToArrayTest.Program:Sum(RefToArrayTest.Span2D`1[Int32]):int
; ============================================================

My issue is specifically with the following bit from the middle of the codegen:

G_M54696_IG04:
       cmp      r11d, edx
       jae      SHORT G_M54696_IG10
G_M54696_IG05:
       cmp      esi, r9d
       jae      SHORT G_M54696_IG10
G_M54696_IG06:
       vmovdqu  xmm0, xmmword ptr [rcx]
       vmovdqu  xmmword ptr [rsp+28H], xmm0
G_M54696_IG07:
       mov      rdi, bword ptr [rsp+28H]
       mov      ebx, r11d
       mov      ebp, dword ptr [rcx+20]
       imul     rbx, rbp
       mov      ebp, esi
       add      rbx, rbp
       add      eax, dword ptr [rdi+4*rbx]
       inc      esi
       cmp      esi, r10d
       jl       SHORT G_M54696_IG04

From what I can see (please correct me if I'm mistaking!), the JIT is storing the following:

  • Loop counter i ---> r11d
  • Loop counter j ---> esi
  • Local height ---> r8d and edx
  • Local width ---> r9d and r10d

Now, the critical bit for this issue is that you can see the following:

  • G_M54696_IG04 ---> checks i < height
  • G_M54696_IG05 ---> checks j < width
  • G_M54696_IG06 ---> copy Span<T>.ByRef (also, stack spill?) ✅
  • G_M54696_IG07 --->
    • actual indexing ✅
    • increment only j, and jump back to G_M54696_IG04

Specifically, at the end of G_M54696_IG07:

G_M54696_IG07:
       ; [...]
       inc      esi
       cmp      esi, r10d
       jl       SHORT G_M54696_IG04

My point is that all the code between G_M54696_IG04 and G_M54696_IG07 never modifies r11d nor rdx, hence jumping back to G_M54696_IG04 here is unnecessary and inserts a redundant conditional branch again that's not needed. I'm thinking the JIT should be able to see that the only two variables involved in the check at G_M54696_IG04 are never touched in this code, conclude the first check can be inferred to still be true, and therefore skip it entirely and just jump back to G_M54696_IG05 instead.

This way, (i < height) would only be done once per each inner loop iteration, and (j < width) would still always be done, which is correct. All the variables involved in this code are just cached in registers, I'm not seeing how such an optimization could possibly break the code, even assuming eg. devs using reflection to access private fields (which is already technically be broken?).

Hope this makes sense, thank you again for taking the time to look into this! 😊

@briansull briansull removed the needs-further-triage Issue has been initially triaged, but needs deeper consideration or reconsideration label Dec 7, 2020
@briansull
Copy link
Contributor

@AndyAyersMS

Would the code that you have added to do tail duplication of conditional blocks possibly kick in here?

If this block below were duplicated so that it had only one predecessor for each copy we could use assertion prop to fold away the conditional branch for the inner loop.

G_M54696_IG04:
       cmp      r11d, edx
       jae      SHORT G_M54696_IG10

@AndyAyersMS
Copy link
Member

Not currently, as there's also an assignment in the block:

------------ BB02 [016..01B) -> BB06 (cond), preds={BB01,BB06} succs={BB03,BB06}

***** BB02
STMT00007 (IL 0x016...0x017)
               [000024] -A---+------              *  ASG       int   
               [000023] D----+-N----              +--*  LCL_VAR   int    V05 loc4         
               [000022] -----+------              \--*  CNS_INT   int    0

***** BB02
STMT00027 (IL 0x02F...  ???)
               [000166] ------------              *  JTRUE     void  
     (  7,  5) [000163] J------N----              \--*  GE        int   
     (  3,  2) [000164] ------------                 +--*  LCL_VAR   int    V05 loc4         
     (  3,  2) [000165] ------------                 \--*  LCL_VAR   int    V02 loc1         

We could try and generalize what's there to allow a bit more code to be duplicated. We might also need other tweaks.

@briansull
Copy link
Contributor

OK, I will open an issue about this and link it to this issue.

@JulieLeeMSFT JulieLeeMSFT added the Priority:2 Work that is important, but not critical for the release label Feb 8, 2021
@JulieLeeMSFT JulieLeeMSFT modified the milestones: 6.0.0, Future Mar 29, 2021
@jakobbotsch
Copy link
Member

This looks fixed, the codegen for Repro is now:

; Method C:Repro(C+Wrapper):int
G_M58570_IG01:
       sub      rsp, 40
						;; size=4 bbWeight=1    PerfScore 0.25

G_M58570_IG02:
       xor      eax, eax
       xor      edx, edx
       align    [0 bytes for IG03]
						;; size=4 bbWeight=1    PerfScore 0.50

G_M58570_IG03:
       cmp      edx, ecx
       jae      SHORT G_M58570_IG05
       inc      eax
       inc      eax
       inc      eax
       inc      edx
       cmp      edx, 100
       jl       SHORT G_M58570_IG03
						;; size=17 bbWeight=4    PerfScore 14.00

G_M58570_IG04:
       add      rsp, 40
       ret      
						;; size=5 bbWeight=1    PerfScore 1.25

G_M58570_IG05:
       call     [C+Wrapper:<Test>g__Throw|1_0()]
       int3     
						;; size=7 bbWeight=0    PerfScore 0.00
; Total bytes of code: 37

Probably fixed by various enhancements around jump threading/RBO.

@ghost ghost locked as resolved and limited conversation to collaborators Jan 6, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI Priority:2 Work that is important, but not critical for the release tenet-performance Performance related issue
Projects
Archived in project
Development

No branches or pull requests

8 participants