[ARM64] Possible perf regression: slicing #41704

Open
adamsitnik opened this issue Sep 1, 2020 · 11 comments
Labels
arch-arm64 area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI tenet-performance Performance related issue

@adamsitnik
Member

adamsitnik commented Sep 1, 2020

After running benchmarks for 3.1 vs 5.0 using "Ubuntu arm64 Qualcomm Machines" owned by the JIT Team, I found a few regressions related to slicing.

These look like ARM64-specific regressions: I was not able to reproduce them on ARM (the 32-bit variant).

Repro

git clone https://github.com/dotnet/performance.git
py ./performance/scripts/benchmarks_ci.py -f netcoreapp3.1 netcoreapp5.0 --architecture arm64 --filter 'System.Memory.Slice*'
BenchmarkDotNet=v0.12.1.1405-nightly, OS=ubuntu 16.04
Unknown processor
  [Host]     : .NET Core 3.1.8 (CoreCLR 4.700.20.41105, CoreFX 4.700.20.41903), Arm64 RyuJIT
  Job-PVNQZA : .NET Core 3.1.8 (CoreCLR 4.700.20.41105, CoreFX 4.700.20.41903), Arm64 RyuJIT
  Job-PXIHWO : .NET Core 5.0.0 (CoreCLR 5.0.20.41714, CoreFX 5.0.20.41714), Arm64 RyuJIT
| Type | Method | Toolchain | Mean | Ratio |
|---|---|---|---|---|
| Slice<Byte> | SpanStart | netcoreapp3.1 | 3.831 ns | 1.00 |
| Slice<Byte> | SpanStart | netcoreapp5.0 | 2.550 ns | 0.67 |
| Slice<String> | SpanStart | netcoreapp3.1 | 11.526 ns | 1.00 |
| Slice<String> | SpanStart | netcoreapp5.0 | 16.482 ns | 1.43 |
| Slice<Byte> | SpanStartLength | netcoreapp3.1 | 3.782 ns | 1.00 |
| Slice<Byte> | SpanStartLength | netcoreapp5.0 | 3.202 ns | 0.85 |
| Slice<String> | SpanStartLength | netcoreapp3.1 | 11.720 ns | 1.00 |
| Slice<String> | SpanStartLength | netcoreapp5.0 | 16.823 ns | 1.44 |
| Slice<Byte> | ReadOnlySpanStart | netcoreapp3.1 | 3.801 ns | 1.00 |
| Slice<Byte> | ReadOnlySpanStart | netcoreapp5.0 | 2.867 ns | 0.75 |
| Slice<String> | ReadOnlySpanStart | netcoreapp3.1 | 9.144 ns | 1.00 |
| Slice<String> | ReadOnlySpanStart | netcoreapp5.0 | 16.039 ns | 1.75 |
| Slice<Byte> | ReadOnlySpanStartLength | netcoreapp3.1 | 3.779 ns | 1.00 |
| Slice<Byte> | ReadOnlySpanStartLength | netcoreapp5.0 | 3.156 ns | 0.83 |
| Slice<String> | ReadOnlySpanStartLength | netcoreapp3.1 | 9.279 ns | 1.00 |
| Slice<String> | ReadOnlySpanStartLength | netcoreapp5.0 | 16.418 ns | 1.77 |
| Slice<Byte> | MemoryStart | netcoreapp3.1 | 3.779 ns | 1.00 |
| Slice<Byte> | MemoryStart | netcoreapp5.0 | 6.418 ns | 1.70 |
| Slice<String> | MemoryStart | netcoreapp3.1 | 12.952 ns | 1.00 |
| Slice<String> | MemoryStart | netcoreapp5.0 | 25.550 ns | 1.97 |
| Slice<Byte> | MemoryStartSpan | netcoreapp3.1 | 6.416 ns | 1.00 |
| Slice<Byte> | MemoryStartSpan | netcoreapp5.0 | 10.515 ns | 1.64 |
| Slice<String> | MemoryStartSpan | netcoreapp3.1 | 24.265 ns | 1.00 |
| Slice<String> | MemoryStartSpan | netcoreapp5.0 | 33.036 ns | 1.36 |
| Slice<Byte> | MemoryStartLength | netcoreapp3.1 | 3.805 ns | 1.00 |
| Slice<Byte> | MemoryStartLength | netcoreapp5.0 | 5.899 ns | 1.55 |
| Slice<String> | MemoryStartLength | netcoreapp3.1 | 12.285 ns | 1.00 |
| Slice<String> | MemoryStartLength | netcoreapp5.0 | 18.409 ns | 1.50 |
| Slice<Byte> | MemoryStartLengthSpan | netcoreapp3.1 | 6.245 ns | 1.00 |
| Slice<Byte> | MemoryStartLengthSpan | netcoreapp5.0 | 9.975 ns | 1.60 |
| Slice<String> | MemoryStartLengthSpan | netcoreapp3.1 | 23.963 ns | 1.00 |
| Slice<String> | MemoryStartLengthSpan | netcoreapp5.0 | 31.040 ns | 1.30 |
| Slice<Byte> | ReadOnlyMemoryStart | netcoreapp3.1 | 3.807 ns | 1.00 |
| Slice<Byte> | ReadOnlyMemoryStart | netcoreapp5.0 | 6.394 ns | 1.68 |
| Slice<String> | ReadOnlyMemoryStart | netcoreapp3.1 | 9.371 ns | 1.00 |
| Slice<String> | ReadOnlyMemoryStart | netcoreapp5.0 | 22.140 ns | 2.36 |
| Slice<Byte> | ReadOnlyMemoryStartSpan | netcoreapp3.1 | 6.379 ns | 1.00 |
| Slice<Byte> | ReadOnlyMemoryStartSpan | netcoreapp5.0 | 9.150 ns | 1.43 |
| Slice<String> | ReadOnlyMemoryStartSpan | netcoreapp3.1 | 22.247 ns | 1.00 |
| Slice<String> | ReadOnlyMemoryStartSpan | netcoreapp5.0 | 32.229 ns | 1.45 |
| Slice<Byte> | ReadOnlyMemoryStartLength | netcoreapp3.1 | 3.733 ns | 1.00 |
| Slice<Byte> | ReadOnlyMemoryStartLength | netcoreapp5.0 | 6.005 ns | 1.61 |
| Slice<String> | ReadOnlyMemoryStartLength | netcoreapp3.1 | 8.833 ns | 1.00 |
| Slice<String> | ReadOnlyMemoryStartLength | netcoreapp5.0 | 16.008 ns | 1.81 |
| Slice<Byte> | ReadOnlyMemoryStartLengthSpan | netcoreapp3.1 | 6.087 ns | 1.00 |
| Slice<Byte> | ReadOnlyMemoryStartLengthSpan | netcoreapp5.0 | 9.430 ns | 1.55 |
| Slice<String> | ReadOnlyMemoryStartLengthSpan | netcoreapp3.1 | 22.489 ns | 1.00 |
| Slice<String> | ReadOnlyMemoryStartLengthSpan | netcoreapp5.0 | 29.502 ns | 1.31 |
| Slice<Byte> | MemorySpanStart | netcoreapp3.1 | 8.985 ns | 1.00 |
| Slice<Byte> | MemorySpanStart | netcoreapp5.0 | 11.703 ns | 1.31 |
| Slice<String> | MemorySpanStart | netcoreapp3.1 | 23.013 ns | 1.00 |
| Slice<String> | MemorySpanStart | netcoreapp5.0 | 23.544 ns | 1.02 |
| Slice<Byte> | MemorySpanStartLength | netcoreapp3.1 | 8.289 ns | 1.00 |
| Slice<Byte> | MemorySpanStartLength | netcoreapp5.0 | 9.989 ns | 1.21 |
| Slice<String> | MemorySpanStartLength | netcoreapp3.1 | 23.611 ns | 1.00 |
| Slice<String> | MemorySpanStartLength | netcoreapp5.0 | 23.401 ns | 0.99 |
| Slice<Byte> | ReadOnlyMemorySpanStart | netcoreapp3.1 | 8.519 ns | 1.00 |
| Slice<Byte> | ReadOnlyMemorySpanStart | netcoreapp5.0 | 11.698 ns | 1.37 |
| Slice<String> | ReadOnlyMemorySpanStart | netcoreapp3.1 | 19.716 ns | 1.00 |
| Slice<String> | ReadOnlyMemorySpanStart | netcoreapp5.0 | 22.038 ns | 1.12 |
| Slice<Byte> | ReadOnlyMemorySpanStartLength | netcoreapp3.1 | 6.770 ns | 1.00 |
| Slice<Byte> | ReadOnlyMemorySpanStartLength | netcoreapp5.0 | 10.912 ns | 1.61 |
| Slice<String> | ReadOnlyMemorySpanStartLength | netcoreapp3.1 | 21.624 ns | 1.00 |
| Slice<String> | ReadOnlyMemorySpanStartLength | netcoreapp5.0 | 22.292 ns | 1.03 |
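
As a quick cross-check of the Ratio column: it appears to be simply the 5.0 mean divided by the 3.1 mean for the same benchmark. Below is a minimal Python sketch using a few rows copied from the table above. The `means` dictionary and the `ratio` helper are names made up for this illustration and are not part of BenchmarkDotNet.

```python
# Mean times in ns, copied from the table above:
# (type, method) -> (netcoreapp3.1 mean, netcoreapp5.0 mean).
means = {
    ("Slice<Byte>", "SpanStart"): (3.831, 2.550),
    ("Slice<String>", "SpanStart"): (11.526, 16.482),
    ("Slice<String>", "MemoryStart"): (12.952, 25.550),
    ("Slice<String>", "ReadOnlyMemoryStart"): (9.371, 22.140),
}


def ratio(baseline_ns, current_ns):
    """Return current/baseline rounded to two decimals, like the Ratio column."""
    return round(current_ns / baseline_ns, 2)


ratios = {key: ratio(old, new) for key, (old, new) in means.items()}

# Anything above 1.00 is slower on 5.0 than on 3.1; list the worst first.
regressed = sorted((k for k, r in ratios.items() if r > 1.0),
                   key=lambda k: -ratios[k])

for key in regressed:
    print(key, ratios[key])
```

A ratio below 1.00 (for example Slice<Byte> SpanStart) is an improvement and is filtered out by the `r > 1.0` test.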

@kunalspathak is there any chance you could take a look at the produced assembly code and verify if this is an actual regression in code gen or not?

category:cq
theme:ssa
skill-level:expert
cost:large

@adamsitnik adamsitnik added arch-arm64 tenet-performance Performance related issue area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI labels Sep 1, 2020
@adamsitnik adamsitnik added this to the 5.0.0 milestone Sep 1, 2020
@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added the untriaged New issue has not been triaged by the area owner label Sep 1, 2020
@JulieLeeMSFT JulieLeeMSFT removed the untriaged New issue has not been triaged by the area owner label Sep 1, 2020
@JulieLeeMSFT
Member

@kunalspathak please look into this.

CC @dotnet/jit-contrib

@kunalspathak
Member

Noting down some observations, not necessarily related to the regression.
For Slice.ReadOnlySpanStartLength(), we don't inline calls to Slice() because of the following code:

// Runtime does not support inlining of all shapes of runtime lookups

Another observation is that in .NET 3.1 we didn't JIT SequenceEqual(), but I see that we do JIT that method in .NET 5. This is at odds with the fact that it was marked with AggressiveOptimization in the past, and that was removed in #32371 so that it would be compiled during R2R.

@kunalspathak
Member

kunalspathak commented Sep 4, 2020

The benchmarks that regressed operate on Slice<string> and Slice<byte>. The calls to Slice<string> don't get inlined as expected, but the ones for Slice<byte> do get inlined. The analysis below is for the Slice<string> benchmarks, specifically ReadOnlyMemoryStart(), but it is most likely the cause for the other Slice<string> benchmarks too. I will share my findings for the Slice<byte> benchmarks separately, once I do that analysis.

The JIT assembly for Slice() is unchanged; the regression is inside the benchmark code, ReadOnlyMemoryStart(). The underlying issue is in how the null check for obj is laid out. In .NET 3.1, we do a null check for obj and, if it is null, call HELPER_RUNTIMELOOKUP. If not, we just operate on that object. This is how we generated code in .NET 3.1 for such checks:

G_M18749_IG07:
            add     x0, fp, #32	// [V01 loc0]
            mov     w2, #5
            bl      ReadOnlyMemory`1:Slice(int):struct:this
            str     x0, [fp,#16]	// [V02 loc1]
            str     x1, [fp,#24]	// [V02 loc1+0x08]
            mov     x0, x20
            ldr     x2, [x21,#16]
            cbnz    x2, G_M18749_IG08    ; <----------- This condition checks if obj != null
            movz    x1, #0xd1ffab1e
            movk    x1, #0xd1ffab1e LSL #16
            movk    x1, #0xd1ffab1e LSL #32
            bl      CORINFO_HELP_RUNTIMEHANDLE_CLASS
            mov     x2, x0

G_M18749_IG08:
            add     x1, fp, #16	// [V02 loc1]
            mov     x0, x2
            bl      Slice`1:Consume(byref)
            mov     x0, x20
            ldr     x1, [x21,#24]
            cbnz    x1, G_M18749_IG09
            movz    x1, #0xd1ffab1e
            movk    x1, #0xd1ffab1e LSL #16
            movk    x1, #0xd1ffab1e LSL #32
            bl      CORINFO_HELP_RUNTIMEHANDLE_CLASS
            mov     x1, x0

On the happy path (where obj != null), we jump to G_M18749_IG08 and use the obj. Otherwise, we call CORINFO_HELP_RUNTIMEHANDLE_CLASS. However, in .NET 5, we have flipped the condition of this check:

G_M6359_IG14:
            add     x0, fp, #64	// [V01 loc0]
            mov     w2, #5
            bl      ReadOnlyMemory`1:Slice(int):ReadOnlyMemory`1:this
            str     x0, [fp,#48]	// [V02 loc1]
            str     x1, [fp,#56]	// [V02 loc1+0x08]
            mov     x0, x20
            ldr     x1, [x21,#24]
            cbz     x1, G_M6359_IG16    ; <----------- This condition now checks if obj == null
						;; bbWeight=1    PerfScore 8.50
G_M6359_IG15:
            str     x1, [fp,#32]	// [V14 tmp11]
            b       G_M6359_IG17
						;; bbWeight=0.25 PerfScore 0.50
G_M6359_IG16:
            movz    x1, #0xd1ffab1e
            movk    x1, #0xd1ffab1e LSL #16
            movk    x1, #0xd1ffab1e LSL #32
            bl      CORINFO_HELP_RUNTIMEHANDLE_CLASS
            str     x0, [fp,#32]	// [V14 tmp11]
						;; bbWeight=0.25 PerfScore 0.88
G_M6359_IG17:
            add     x1, fp, #48	// [V02 loc1]
            ldr     x0, [fp,#32]	// [V14 tmp11]
            bl      Slice`1:Consume(byref)
            mov     x0, x20
            ldr     x1, [x21,#32]
            cbz     x1, G_M6359_IG19

Now, on the happy path, we check for obj == null and, if true, call CORINFO_HELP_RUNTIMEHANDLE_CLASS; otherwise, we continue to G_M6359_IG15 and jump to G_M6359_IG17. Not only do we introduce extra jumps in .NET 5, we also have to spill values to the stack. The local frame size in .NET 3.1 is just 40 vs. 168 for .NET 5.0.

The branches in G_M6359_IG15 and G_M6359_IG16 are both weighted 0.25. As per @AndyAyersMS, I tried bumping up the weight of the non-call branch, but that doesn't change the condition. I will look further into what else we can do here.

Edit:

In .NET 3.1, the condition has just 1 arm, and so it gets flipped. In .NET 5, however, the condition has 2 arms, one coming from the "expandable generic dictionaries" work that we did in .NET 5, so most likely, given that, bumping the weight of the non-call branch won't make a difference.

.NET 3.1 tree:

               [000163] ------------              *  STMT      void  (IL   ???...  ???)
               [000162] -AC-G-------              \--*  ASG       long  
               [000161] D------N----                 +--*  LCL_VAR   long   V13 tmp9         
               [000160] --C-G-------                 \--*  QMARK     long  
               [000157] Q-----------    if              +--*  NE        int   
               [000150] ------------                    |  +--*  LCL_VAR   long   V13 tmp9         
               [000156] ------------                    |  \--*  CNS_INT   long   0
               [000159] --C-G-------    if              \--*  COLON     long  
               [000155] --C-G------- else                  +--*  CALL help long   HELPER.CORINFO_HELP_RUNTIMEHANDLE_CLASS
               [000138] ------------ arg0                  |  +--*  LCL_VAR   long   V12 tmp8         
               [000152] ------------ arg1                  |  \--*  CNS_INT(h) long   0xd1ffab1e token
               [000158] ------------ then                  \--*  NOP       void  

.NET 5.0 tree:

               [000109] -AC-G+------              *  ASG       long  
               [000108] D----+-N----              +--*  LCL_VAR   long   V10 tmp7         
               [000107] --C-G+------              \--*  QMARK     long  
               [000097] J----+-N----    if           +--*  EQ        int   
               [000093] n----+------                 |  +--*  IND       long  
               [000092] -----+------                 |  |  \--*  ADD       long  
               [000090] #----+------                 |  |     +--*  IND       long  
               [000089] #----+------                 |  |     |  \--*  IND       long  
               [000088] -----+------                 |  |     |     \--*  ADD       long  
               [000086] -----+------                 |  |     |        +--*  LCL_VAR   long   V09 tmp6         
               [000087] -----+------                 |  |     |        \--*  CNS_INT   long   48
               [000091] -----+------                 |  |     \--*  CNS_INT   long   24
               [000096] -----+------                 |  \--*  CNS_INT   long   0
               [000106] --C-G+?-----    if           \--*  COLON     long  
               [000095] --C-G+?----- else               +--*  CALL help long   HELPER.CORINFO_HELP_RUNTIMEHANDLE_CLASS
               [000085] -----+?----- arg0 in x0         |  +--*  LCL_VAR   long   V09 tmp6         
               [000094] -----+?----- arg1 in x1         |  \--*  CNS_INT(h) long   0x7ffb0671e8e8 token
               [000098] n----+?----- then               \--*  IND       long  
               [000099] -----+?-----                       \--*  ADD       long  
               [000100] #----+?-----                          +--*  IND       long  
               [000101] #----+?-----                          |  \--*  IND       long  
               [000102] -----+?-----                          |     \--*  ADD       long  
               [000103] -----+?-----                          |        +--*  LCL_VAR   long   V09 tmp6         
               [000104] -----+?-----                          |        \--*  CNS_INT   long   48
               [000105] -----+?-----                          \--*  CNS_INT   long   24
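
To make the two QMARK shapes easier to compare, here is a rough Python model of what each tree computes. It is only an illustration under assumptions: `slow_helper` stands in for CORINFO_HELP_RUNTIMEHANDLE_CLASS, the generic dictionary is modeled as a plain Python list, and `lookup_31` / `lookup_50` are hypothetical names, not runtime APIs.

```python
def lookup_31(cached_handle, slow_helper):
    """.NET 3.1 shape: the value already sits in a temp, so only the
    null case has any work to do (the other arm is a NOP)."""
    if cached_handle is not None:       # NE(tmp9, 0)
        return cached_handle            # nothing to compute
    return slow_helper()                # call the runtime lookup helper


def lookup_50(dictionary, slot, slow_helper):
    """.NET 5.0 shape: both arms produce a value, either a second load
    from the dictionary slot or the helper call."""
    if dictionary[slot] is None:        # EQ(<load of the slot>, 0)
        return slow_helper()            # call the runtime lookup helper
    return dictionary[slot]             # the slot load appears again here


helper_calls = []


def slow_helper():
    # Record each slow-path call so the fast and slow paths can be told apart.
    helper_calls.append("called")
    return "handle-from-helper"


print(lookup_31("cached-handle", slow_helper))         # fast path, no call
print(lookup_50(["cached-handle"], 0, slow_helper))    # fast path, no call
print(lookup_50([None], 0, slow_helper))               # slow path, one call
print(len(helper_calls))
```

Both functions return the same result for the same inputs. The difference this sketch tries to highlight is structural: in the second shape each arm has to materialize a value, which matches the store to V14 tmp11 seen in both G_M6359_IG15 and G_M6359_IG16.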

@kunalspathak
Member

Here is my analysis for the Slice<byte> benchmarks. The regression comes from the inlined Slice<byte>() and the ctor for ReadOnlyMemory<byte> inside Slice(), as seen here.

Below is the assembly code for

Consume(memory.Slice(Size / 2)); // private const int Size = 10;

In .NET 3.1, here is the disassembly. IG06 does this check and, if within bounds, proceeds to create the ReadOnlyMemory() object in IG07, as seen here.

G_M45388_IG06:
        B94037B3          ldr     w19, [fp,#52]	// [V01 loc0+0x0c]
        7100167F          cmp     w19, #5
        540005A3          blo     G_M45388_IG12

G_M45388_IG07:
        F94017B4          ldr     x20, [fp,#40]	// [V01 loc0]
        AA1403E0          mov     x0, x20
        51001675          sub     w21, w19, #5
        2A1503E1          mov     w1, w21
        528000A2          mov     w2, #5
        F9000FA0          str     x0, [fp,#24]	// [V21 tmp18] ; this._object = _object
        B90023A2          str     w2, [fp,#32]	// [V22 tmp19] ; this._index = _index
        B90027A1          str     w1, [fp,#36]	// [V23 tmp20] ; this._length = _length
        910063A0          add     x0, fp, #24	// [V02 loc1]
        94000000          bl      Slice`1:Consume(byref)
        7100167F          cmp     w19, #5
        54000483          blo     G_M45388_IG13

In .NET 5, here is the disassembly. There are more memory accesses in IG06 (7, vs. 4 in the corresponding .NET 3.1 block, IG07).

G_M6359_IG05:
        B9402FA0          ldr     w0, [fp,#44]	// [V23 tmp20]
        7100141F          cmp     w0, #5
        540006C3          blo     G_M6359_IG11
						;; bbWeight=0.50 PerfScore 1.75

G_M6359_IG06:
        B9402BA0          ldr     w0, [fp,#40]	// [V22 tmp19]
        11001400          add     w0, w0, #5  ; <-- In .NET 3.1, we constant prop value of _index and turns this to " = 5"
        B9402FA1          ldr     w1, [fp,#44]	// [V23 tmp20] ; <-- could have skipped because we already load it in IG05, the way it is skipped in .NET 3.1
        51001421          sub     w1, w1, #5
        F94013A2          ldr     x2, [fp,#32]	// [V21 tmp18]
        F9000BA2          str     x2, [fp,#16]	// [V24 tmp21] ; this._object = _object
        B9001BA0          str     w0, [fp,#24]	// [V25 tmp22] ; this._index = _index
        B9001FA1          str     w1, [fp,#28]	// [V26 tmp23] ; this._length = _length
        910043A0          add     x0, fp, #16	// [V02 loc1]
        94000000          bl      Slice`1:Consume(byref)
        B9402FA0          ldr     w0, [fp,#44]	// [V23 tmp20] ; <-- could have been skipped
        7100141F          cmp     w0, #5
        54000523          blo     G_M6359_IG11

I believe some of them happen because we fail to constant-propagate the value of _index, which is zero. Instead, we load the value from the stack and add 5 to it. In .NET 3.1, we detected this and converted it to an assignment.

.NET 3.1 dump:


***** BB05, stmt 11 (before)
N005 (  5,  6) [000118] -A------R---              *  ASG       int   
N004 (  1,  1) [000117] D------N----              +--*  LCL_VAR   int    V07 tmp4         d:2
N003 (  5,  6) [000073] ------------              \--*  ADD       int   
N001 (  3,  4) [000070] ------------                 +--*  LCL_FLD   int    V01 loc0         u:6[+8] Fseq[_index]
N002 (  1,  1) [000072] ------------                 \--*  CNS_INT   int    5

  VNApplySelectors:
    VNForHandle(_index) is $142, fieldType is int
      AX2: $142 != $143 ==> select([$283]store($281, $143, $2c1), $142) ==> select($281, $142).
      AX1: select([$241]store($281, $142, $40), $142) ==> $40.
    VNForMapSelect($380, $142):int returns $40 {IntCns 0}
  VNApplySelectors:
    VNForHandle(_index) is $142, fieldType is int
    VNForMapSelect($380, $142):int returns $40 {IntCns 0}
N001 [000070]   LCL_FLD   V01 loc0         u:6[+8] Fseq[_index] => $40 {IntCns 0}
N002 [000072]   CNS_INT   5 => $42 {IntCns 5}
N003 [000073]   ADD       => $42 {IntCns 5}
N004 [000117]   LCL_VAR   V07 tmp4         d:2 => $42 {IntCns 5}
N005 [000118]   ASG       => $42 {IntCns 5}

***** BB05, stmt 11 (after)
N005 (  5,  6) [000118] -A------R---              *  ASG       int    $42
N004 (  1,  1) [000117] D------N----              +--*  LCL_VAR   int    V07 tmp4         d:2 $42
N003 (  5,  6) [000073] ------------              \--*  ADD       int    $42
N001 (  3,  4) [000070] ------------                 +--*  LCL_FLD   int    V01 loc0         u:6[+8] Fseq[_index] $40
N002 (  1,  1) [000072] ------------                 \--*  CNS_INT   int    5 $42

.NET 5.0 dump:


***** BB04, STMT00019(before)
N005 (  3,  4) [000092] -A------R---              *  ASG       int   
N004 (  1,  1) [000091] D------N----              +--*  LCL_VAR   int    V07 tmp4         d:2
N003 (  3,  4) [000058] ------------              \--*  ADD       int   
N001 (  1,  1) [000056] ------------                 +--*  LCL_VAR   int    V10 tmp7         
N002 (  1,  2) [000057] ------------                 \--*  CNS_INT   int    5

N001 [000056]   LCL_VAR   V10 tmp7          => $282 {282}
N002 [000057]   CNS_INT   5 => $42 {IntCns 5}
N003 [000058]   ADD       => $206 {ADD($42, $282)}
N004 [000091]   LCL_VAR   V07 tmp4         d:2 => $206 {ADD($42, $282)}
N005 [000092]   ASG       => $206 {ADD($42, $282)}

***** BB04, STMT00019(after)
N005 (  3,  4) [000092] -A------R---              *  ASG       int    $206
N004 (  1,  1) [000091] D------N----              +--*  LCL_VAR   int    V07 tmp4         d:2 $206
N003 (  3,  4) [000058] ------------              \--*  ADD       int    $206
N001 (  1,  1) [000056] ------------                 +--*  LCL_VAR   int    V10 tmp7          $282
N002 (  1,  2) [000057] ------------                 \--*  CNS_INT   int    5 $42

I am still trying to see why we fail to detect _index being constant.
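
The difference between the two dumps can be mimicked with a toy value-numbering scheme. This is a sketch only and not how the JIT's value numbering is implemented: `const_vn`, `opaque_vn` and `vn_add` are hypothetical helpers, and value numbers are modeled as plain tuples.

```python
from itertools import count

_fresh_ids = count(1)


def const_vn(value):
    """Value number for a known integer constant (like `IntCns 5`)."""
    return ("const", value)


def opaque_vn():
    """Fresh value number for an operand the analysis knows nothing about."""
    return ("opaque", next(_fresh_ids))


def vn_add(a, b):
    """Value-number an ADD: fold it when both operands are constants."""
    if a[0] == "const" and b[0] == "const":
        return const_vn(a[1] + b[1])
    return ("add", a, b)


# .NET 3.1: the _index field read resolves to IntCns 0, so the ADD folds.
index_31 = const_vn(0)
print(vn_add(index_31, const_vn(5)))   # ('const', 5)

# .NET 5.0: the promoted field local only has an opaque VN, so the ADD
# stays symbolic and nothing can be folded.
index_50 = opaque_vn()
print(vn_add(index_50, const_vn(5)))
```

The first result corresponds to `ADD => $42 {IntCns 5}` in the .NET 3.1 dump, the second to `ADD => $206 {ADD($42, $282)}` in the .NET 5.0 dump.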

@kunalspathak
Member

@CarolEidt pointed out that in .NET 3.1, this._index remains a LCL_FLD and gets into SSA form, while in .NET 5, we do struct promotion but, because of this condition, we don't convert it to SSA form. Because of this, we don't do constant propagation. Fixing it requires supporting multi-reg defs in SSA, which might not be something we want to do as part of a perf regression fix. We should move this to .NET 6 (at least for benchmarks under Slice<byte>()).

@kunalspathak
Member

Also, it turns out that the fix for Slice<string>() needs more thinking given the introduction of the runtime lookup in .NET 5. We need a broader fix for this, which should be done in .NET 6.

@kunalspathak
Member

Here are the benchmarks that regressed (taken from the description, filtered to just the ones that regressed):

| Class | Method | .NET 5 / .NET 3.1 ratio |
|---|---|---|
| Slice<String> | ReadOnlyMemoryStart | 2.36 |
| Slice<String> | MemoryStart | 1.97 |
| Slice<String> | ReadOnlyMemoryStartLength | 1.81 |
| Slice<String> | ReadOnlySpanStartLength | 1.77 |
| Slice<String> | ReadOnlySpanStart | 1.75 |
| Slice<Byte> | MemoryStart | 1.7 |
| Slice<Byte> | ReadOnlyMemoryStart | 1.68 |
| Slice<Byte> | MemoryStartSpan | 1.64 |
| Slice<Byte> | ReadOnlyMemoryStartLength | 1.61 |
| Slice<Byte> | ReadOnlyMemorySpanStartLength | 1.61 |
| Slice<Byte> | MemoryStartLengthSpan | 1.6 |
| Slice<Byte> | MemoryStartLength | 1.55 |
| Slice<Byte> | ReadOnlyMemoryStartLengthSpan | 1.55 |
| Slice<String> | MemoryStartLength | 1.5 |
| Slice<String> | ReadOnlyMemoryStartSpan | 1.45 |
| Slice<String> | SpanStartLength | 1.44 |
| Slice<String> | SpanStart | 1.43 |
| Slice<Byte> | ReadOnlyMemoryStartSpan | 1.43 |
| Slice<Byte> | ReadOnlyMemorySpanStart | 1.37 |
| Slice<String> | MemoryStartSpan | 1.36 |
| Slice<String> | ReadOnlyMemoryStartLengthSpan | 1.31 |
| Slice<Byte> | MemorySpanStart | 1.31 |
| Slice<String> | MemoryStartLengthSpan | 1.3 |
| Slice<Byte> | MemorySpanStartLength | 1.21 |
| Slice<String> | ReadOnlyMemorySpanStart | 1.12 |
| Slice<String> | ReadOnlyMemorySpanStartLength | 1.03 |
| Slice<String> | MemorySpanStart | 1.02 |

@BruceForstall BruceForstall added the JitUntriaged CLR JIT issues needing additional triage label Oct 28, 2020
@BruceForstall BruceForstall removed the JitUntriaged CLR JIT issues needing additional triage label Nov 10, 2020
@JulieLeeMSFT JulieLeeMSFT added the needs-further-triage Issue has been initially triaged, but needs deeper consideration or reconsideration label Mar 23, 2021
@kunalspathak
Member

Action item: double-check the numbers to see whether anything we did in .NET 6.0 addresses #41704 (comment).

@kunalspathak kunalspathak removed the needs-further-triage Issue has been initially triaged, but needs deeper consideration or reconsideration label Apr 1, 2021
@kunalspathak
Member

kunalspathak commented Apr 30, 2021

Here is my analysis after comparing the assembly of .NET 6 vs. .NET 3.1. Here is the diff for ReadOnlyMemory.Slice<byte> benchmark.

No zero register used

            mov     x1, #0
            str     x1, [fp,#48]	// [V133 tmp130]
            str     w1, [fp,#56]	// [V134 tmp131]
            str     w1, [fp,#60]	// [V135 tmp132]
            b       G_M31903_IG05

Update: This PR is out - #52269

Redundant ldr from same location

I noticed a redundant ldr from [fp, #48] that could have been replaced with mov x1, x19. This can vary from case to case depending on register availability, but clearly in the case below, CSE could have avoided the extra load from memory.

            blo     G_M31903_IG88
            ldr     x19, [fp,#48]	// [V133 tmp130]
            ldr     w1, [fp,#56]	// [V134 tmp131]
            add     w20, w1, #5
            ldr     w1, [fp,#60]	// [V135 tmp132]
            sub     w21, w1, #5
            ldr     x1, [fp,#48]	// [V133 tmp130]
            cbz     x1, G_M31903_IG13

Update: Related issue: #6761
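
For illustration, the redundant load in the listing above can be spotted mechanically with a small scan over one straight-line block. This is a toy sketch under assumptions, not the JIT's CSE or register allocator: instructions are hand-encoded as `(op, register number, stack slot)` tuples, w/x register widths are ignored, branches are left out of the block, and `find_redundant_loads` is a hypothetical helper.

```python
def find_redundant_loads(block):
    """Find loads of a stack slot whose value is already held in a register.

    Each instruction is (op, reg, slot): `reg` is the register number written
    (for "str", the register being stored) and `slot` is the [fp,#slot]
    offset, or None. Returns (instruction index, register holding the value).
    """
    slot_in_reg = {}    # stack slot -> register currently holding its value
    redundant = []
    for i, (op, reg, slot) in enumerate(block):
        if op == "str":
            slot_in_reg[slot] = reg     # the slot now mirrors this register
            continue
        if op == "ldr" and slot in slot_in_reg:
            redundant.append((i, slot_in_reg[slot]))
        # Writing `reg` invalidates whatever slot value it used to hold.
        slot_in_reg = {s: r for s, r in slot_in_reg.items() if r != reg}
        if op == "ldr":
            slot_in_reg.setdefault(slot, reg)
    return redundant


# The block from the listing above, without the leading blo and trailing cbz.
block = [
    ("ldr", 19, 48),    # ldr x19, [fp,#48]
    ("ldr", 1, 56),     # ldr w1,  [fp,#56]
    ("add", 20, None),  # add w20, w1, #5
    ("ldr", 1, 60),     # ldr w1,  [fp,#60]
    ("sub", 21, None),  # sub w21, w1, #5
    ("ldr", 1, 48),     # ldr x1,  [fp,#48]  <- same slot as the first load
]

print(find_redundant_loads(block))   # [(5, 19)]
```

The single hit says that instruction 5 reloads a slot whose value is still live in register 19, which is the `mov x1, x19` replacement suggested above.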

Repeated loading of ._object field

We load the value of ._object again and again in .NET 6:

            ldr     x19, [fp,#48]	// [V133 tmp130]
            ...
            ...
            str     x19, [fp,#32]	// [V136 tmp133]
            ...
            ; occurs 16 times

But in .NET 3.1, we load it into x20 once and do mov x21, x20 to store it in the final destination, although instead of the mov, we could have just done str x20, [fp, #24].

            ldr     x20, [fp,#40]	// [V01 loc0]
            mov     x21, x20
            ...
            str     x21, [fp,#24]	// [V149 tmp146]
            ...
            ...
            mov     x21, x20
            ...
            str     x21, [fp,#24]	// [V149 tmp146]
            ...
            occurs 16 times.

Constant propagation

Next, we still have the problem of not const-propagating 5 in .NET 6, as discussed in #41704 (comment).

            ldr     w1, [fp,#56]	// [V134 tmp131]
            add     w20, w1, #5
            
            ...
            occurs 16 times.

But in .NET 3.1, we const-prop and directly move 5:

            mov     w0, #5
            str     x21, [fp,#24]	// [V149 tmp146]  ; ._object
            str     w0, [fp,#32]	// [V150 tmp147]

Repetitive sub operation

Lastly, in .NET 6, we do not CSE _length - start and do the sub every time.

            ldr     w1, [fp,#60]	// [V135 tmp132]
            sub     w21, w1, #5
            ...
            str     w21, [fp,#44]	// [V138 tmp135]
            ...
            ...
            sub     w21, w1, #5
            ...
            str     w21, [fp,#44]	// [V138 tmp135]
            ... 
            occurs 15 times

But in .NET 3.1, we do the subtraction once and CSE the result.

            sub     w22, w19, #5
            mov     w23, w22
            ...
            str     w23, [fp,#36]	// [V151 tmp148]
            ...
            ...
            mov     w23, w22
            ...
            str     w23, [fp,#36]	// [V151 tmp148]
            ...
            15 times

I will try to investigate a little more and open separate issues for each of them.

@kunalspathak
Member

So most of the regression comes from CSE not working for add and sub. During value numbering, each use of V22 gets a new unique VN ($242 in BB05, $24d in BB12 below), which makes it impossible to CSE the ADD(V22, 5) operation.


***** BB05, STMT00027(before)
N005 (  3,  4) [000138] -A------R---              *  ASG       int   
N004 (  1,  1) [000137] D------N----              +--*  LCL_VAR   int    V08 tmp5         d:2
N003 (  3,  4) [000069] ------------              \--*  ADD       int   
N001 (  1,  1) [000067] ------------                 +--*  LCL_VAR   int    V22 tmp19        
N002 (  1,  2) [000068] ------------                 \--*  CNS_INT   int    5

N001 [000067]   LCL_VAR   V22 tmp19         => $242 {242}
N002 [000068]   CNS_INT   5 => $42 {IntCns 5}
N003 [000069]   ADD       => $206 {ADD($42, $242)}
N004 [000137]   LCL_VAR   V08 tmp5         d:2 => $206 {ADD($42, $242)}
N005 [000138]   ASG       => $206 {ADD($42, $242)}

***** BB05, STMT00027(after)
N005 (  3,  4) [000138] -A------R---              *  ASG       int    $206
N004 (  1,  1) [000137] D------N----              +--*  LCL_VAR   int    V08 tmp5         d:2 $206
N003 (  3,  4) [000069] ------------              \--*  ADD       int    $206
N001 (  1,  1) [000067] ------------                 +--*  LCL_VAR   int    V22 tmp19         $242
N002 (  1,  2) [000068] ------------                 \--*  CNS_INT   int    5 $42

====================================================================================================================================================================

***** BB12, STMT00048(before)
N005 (  3,  4) [000250] -A------R---              *  ASG       int   
N004 (  1,  1) [000249] D------N----              +--*  LCL_VAR   int    V16 tmp13        d:2
N003 (  3,  4) [000181] ------------              \--*  ADD       int   
N001 (  1,  1) [000179] ------------                 +--*  LCL_VAR   int    V22 tmp19        
N002 (  1,  2) [000180] ------------                 \--*  CNS_INT   int    5

N001 [000179]   LCL_VAR   V22 tmp19         => $24d {24d}
N002 [000180]   CNS_INT   5 => $42 {IntCns 5}
N003 [000181]   ADD       => $20f {ADD($42, $24d)}
N004 [000249]   LCL_VAR   V16 tmp13        d:2 => $20f {ADD($42, $24d)}
N005 [000250]   ASG       => $20f {ADD($42, $24d)}

***** BB12, STMT00048(after)
N005 (  3,  4) [000250] -A------R---              *  ASG       int    $20f
N004 (  1,  1) [000249] D------N----              +--*  LCL_VAR   int    V16 tmp13        d:2 $20f
N003 (  3,  4) [000181] ------------              \--*  ADD       int    $20f
N001 (  1,  1) [000179] ------------                 +--*  LCL_VAR   int    V22 tmp19         $24d
N002 (  1,  2) [000180] ------------                 \--*  CNS_INT   int    5 $42

The reason is what I mentioned in #41704 (comment). Since it is related to the struct multireg return, I would assign this to @sandreenko .
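
To see why distinct VNs block CSE, here is a toy sketch of candidate selection: expressions are grouped by value number, and only a VN that occurs more than once is a candidate. This is an assumption-laden simplification, not the JIT's CSE phase, and `cse_candidates` is a hypothetical helper.

```python
from collections import defaultdict


def cse_candidates(exprs):
    """Group expressions by value number and keep VNs seen more than once.

    `exprs` is a list of (tree_id, value_number) pairs.
    """
    by_vn = defaultdict(list)
    for tree_id, vn in exprs:
        by_vn[vn].append(tree_id)
    return {vn: trees for vn, trees in by_vn.items() if len(trees) > 1}


# Value numbers as they appear in the dump above: each ADD gets its own VN.
observed = [("[000069]", "$206"), ("[000181]", "$20f")]
print(cse_candidates(observed))       # {}

# Hypothetical: if both reads of V22 shared one VN, both ADDs would too.
hypothetical = [("[000069]", "$206"), ("[000181]", "$206")]
print(cse_candidates(hypothetical))   # {'$206': ['[000069]', '[000181]']}
```

With the observed value numbers there is nothing to share, so every ADD(V22, 5) is recomputed. The hypothetical case shows what a single VN for V22 would allow.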

@sandreenko sandreenko modified the milestones: 6.0.0, Future Jul 9, 2021
@sandreenko
Contributor

Multi-def SSA is a large work item that won't happen in 6.0. Marking it as Future for now.
