Unsafe.As resulting in lots of unnecessary moves. #55357

MichalPetryka · 2021-07-08T18:26:53Z

Description

Using Unsafe.As on a struct results in a lot of unnecessary asm move instructions, especially when on method return and causes even more of them after inlining.

Configuration

Sharplab Core CLR v5.0.721.25508

Regression?

No idea

Data

Sharplab

using System.Runtime.Intrinsics.X86;
using System.Runtime.Intrinsics;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;

public static unsafe class Test
{
    public static UInt128 A(UInt128 v)
    {
        return ShiftA(v);
    }
    
    public static Vector128<byte> B(Vector128<byte> v)
    {
        return ShiftB(v);
    }
    
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static UInt128 ShiftA(UInt128 c)
    {
        if (Sse2.IsSupported)
		{
            Vector128<byte> val1 = Sse2.ShiftLeftLogical128BitLane(
                Unsafe.As<UInt128, Vector128<byte>>(ref c), 1);
            return Unsafe.As<Vector128<byte>, UInt128>(ref val1);
		}
        
        // not important in this case
        return default;
    }
    
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static Vector128<byte> ShiftB(Vector128<byte> c)
    {
        if (Sse2.IsSupported)
		{
            return Sse2.ShiftLeftLogical128BitLane(c, 1);
		}
        
        // not important in this case
		return default;
    }
    
    [StructLayout(LayoutKind.Explicit, Size = 16)]
    public readonly struct UInt128
    {
        [FieldOffset(0)]
        public readonly ulong long1;
        [FieldOffset(8)]
        public readonly ulong long2;
        
        public UInt128(ulong a, ulong b)
        {
            long1 = a;
            long2 = b;
        }
    }
}

Output:

; Core CLR 5.0.921.35908 on amd64

Test.A(UInt128)
    L0000: sub rsp, 0x28
    L0004: vzeroupper
    L0007: mov rax, [rdx]
    L000a: mov rax, [rdx+8]
    L000e: vmovupd xmm0, [rsp]
    L0013: vpslldq xmm0, xmm0, 1
    L0018: vmovapd [rsp], xmm0
    L001d: vmovdqu xmm0, [rsp]
    L0022: vmovdqu [rsp+0x18], xmm0
    L0028: vmovdqu xmm0, [rsp+0x18]
    L002e: vmovdqu [rcx], xmm0
    L0032: mov rax, rcx
    L0035: add rsp, 0x28
    L0039: ret

Test.B(System.Runtime.Intrinsics.Vector128`1<Byte>)
    L0000: vzeroupper
    L0003: vmovupd xmm0, [rdx]
    L0007: vpslldq xmm0, xmm0, 1
    L000c: vmovupd [rcx], xmm0
    L0010: mov rax, rcx
    L0013: ret

Test.ShiftA(UInt128)
    L0000: sub rsp, 0x18
    L0004: vzeroupper
    L0007: vmovupd xmm0, [rdx]
    L000b: vpslldq xmm0, xmm0, 1
    L0010: vmovapd [rsp], xmm0
    L0015: vmovdqu xmm0, [rsp]
    L001a: vmovdqu [rcx], xmm0
    L001e: mov rax, rcx
    L0021: add rsp, 0x18
    L0025: ret

Test.ShiftB(System.Runtime.Intrinsics.Vector128`1<Byte>)
    L0000: vzeroupper
    L0003: vmovupd xmm0, [rdx]
    L0007: vpslldq xmm0, xmm0, 1
    L000c: vmovupd [rcx], xmm0
    L0010: mov rax, rcx
    L0013: ret

Test+UInt128..ctor(UInt64, UInt64)
    L0000: mov [rcx], rdx
    L0003: mov [rcx+8], r8
    L0007: ret

category:cq
theme:structs
skill-level:expert
cost:medium
impact:medium

The text was updated successfully, but these errors were encountered:

ghost · 2021-07-08T18:26:57Z

Tagging subscribers to this area: @dotnet/area-system-threading-tasks
See info in area-owners.md if you want to be subscribed.

Issue Details

Description

Using Unsafe.As on a struct results in a lot of unnecessary asm move instructions, especially when on method return and causes even more of them after inlining.

Configuration

Sharplab Core CLR v5.0.721.25508

Regression?

No idea

Data

Sharplab

Author:	MichalPetryka
Assignees:	-
Labels:	`area-System.Threading.Tasks`, `tenet-performance`, `untriaged`
Milestone:	-

MichalPetryka · 2021-07-08T18:29:47Z

The area is wrong, should be area-System.Runtime.CompilerServices or area-CodeGen-coreclr.

GrabYourPitchforks · 2021-07-08T20:02:31Z

Related: #55064, #10315

sandreenko · 2021-07-08T20:29:26Z

We still don't have a good way to represent a struct cast without taking its address, it is a known problem that I was planning to fix for 6.0 but did not have time to finish.

There is no easy fix, but we are aware of the issue and will improve this scenario in 7.0.

JulieLeeMSFT · 2022-06-03T23:34:38Z

Moving to .NET 8 as we don't have enough time to work on this.

tannergooding · 2022-06-03T23:59:25Z

This one actually might've been fixed with #68739.

@MichalPetryka is that something you'd be willing to validate looks fixed on your end?

ghost · 2022-10-05T18:07:38Z

This issue has been marked needs-author-action and may be missing some important information.

JulieLeeMSFT · 2022-10-05T18:07:59Z

This one actually might've been fixed with #68739.

@MichalPetryka is that something you'd be willing to validate looks fixed on your end?

Pining @MichalPetryka.

MichalPetryka · 2022-10-18T18:41:10Z

Checked with RC2, the issue is still there @tannergooding.

tannergooding · 2022-10-25T15:25:43Z

That is likely not due to Unsafe.As then. The API is now a "nop" in importation: https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/importercalls.cpp#L3752-L3760

MichalPetryka · 2023-04-13T00:35:51Z

With .Net 8 and Unsafe.BitCast, the codegen is now:

G_M000_IG01:                ;; offset=0000H
       4883EC18             sub      rsp, 24
       C5F877               vzeroupper 
 
G_M000_IG02:                ;; offset=0007H
       C5F81002             vmovups  xmm0, xmmword ptr [rdx]
       C5F811442408         vmovups  xmmword ptr [rsp+08H], xmm0
       C5F810442408         vmovups  xmm0, xmmword ptr [rsp+08H]
       C5F973F801           vpslldq  xmm0, xmm0, 1
       C5F81101             vmovups  xmmword ptr [rcx], xmm0
       488BC1               mov      rax, rcx
 
G_M000_IG03:                ;; offset=0023H
       4883C418             add      rsp, 24
       C3                   ret

Just 2 redundant moves from a stack spill left.

MichalPetryka · 2023-05-01T23:36:29Z

#85562 gives the desired codegen here with BitCast.

EDIT: actually seems like it's still not ideal on Linux due to SysV ABI:

; Method Test:A(Test+UInt128):Test+UInt128
G_M20876_IG01:  ;; offset=0000H
       sub      rsp, 40
       vzeroupper 
       mov      qword ptr [rsp+18H], rdi
       mov      qword ptr [rsp+20H], rsi
                        ;; size=17 bbWeight=1 PerfScore 3.25

G_M20876_IG02:  ;; offset=0011H
       vmovups  xmm0, xmmword ptr [rsp+18H]
       vpslldq  xmm0, xmm0, 1
       vmovaps  xmmword ptr [rsp], xmm0
       mov      rax, qword ptr [rsp]
       mov      rdx, qword ptr [rsp+08H]
                        ;; size=25 bbWeight=1 PerfScore 7.00

G_M20876_IG03:  ;; offset=002AH
       add      rsp, 40
       ret      
                        ;; size=5 bbWeight=1 PerfScore 1.25
; Total bytes of code: 47

jakobbotsch · 2023-07-24T09:06:11Z

@MichalPetryka What is the impact of #85562 expected to be here? I see no diffs in A with and without that PR, regardless of using Unsafe.As and Unsafe.BitCast. The codegen on my machine is

; Assembly listing for method Test:A(System.UInt128):System.UInt128 (FullOpts)
; Emitting BLENDED_CODE for X64 with AVX - Windows
; FullOpts code
; optimized code
; rsp based frame
; partially interruptible
; No PGO data
; 0 inlinees with PGO data; 0 single block inlinees; 3 inlinees without PGO data
; Final local variable assignments
;
;  V00 RetBuf       [V00,T01] (  4,  4   )   byref  ->  rcx         single-def
;  V01 arg0         [V01,T00] (  3,  6   )   byref  ->  rdx         single-def
;# V02 OutArgs      [V02    ] (  1,  1   )  struct ( 0) [rsp+00H]   do-not-enreg[XS] addr-exposed "OutgoingArgSpace"
;* V03 tmp1         [V03    ] (  0,  0   )  struct (16) zero-ref    "Inline return value spill temp" <System.UInt128>
;* V04 tmp2         [V04    ] (  0,  0   )  simd16  ->  zero-ref    "spilled call-like call argument"
;  V05 tmp3         [V05    ] (  2,  4   )  struct (16) [rsp+08H]   do-not-enreg[SF] ld-addr-op "Inlining Arg" <System.UInt128>
;  V06 tmp4         [V06,T04] (  2,  4   )  simd16  ->  mm0         ld-addr-op "Inlining Arg" <System.Runtime.Intrinsics.Vector128`1[ubyte]>
;* V07 tmp5         [V07    ] (  0,  0   )    long  ->  zero-ref    "field V01._lower (fldOffset=0x0)" P-INDEP
;* V08 tmp6         [V08    ] (  0,  0   )    long  ->  zero-ref    "field V01._upper (fldOffset=0x8)" P-INDEP
;* V09 tmp7         [V09    ] (  0,  0   )    long  ->  zero-ref    "field V03._lower (fldOffset=0x0)" P-INDEP
;* V10 tmp8         [V10    ] (  0,  0   )    long  ->  zero-ref    "field V03._upper (fldOffset=0x8)" P-INDEP
;  V11 tmp9         [V11,T02] (  2,  4   )    long  ->  [rsp+08H]   do-not-enreg[] "field V05._lower (fldOffset=0x0)" P-DEP
;  V12 tmp10        [V12,T03] (  2,  4   )    long  ->  [rsp+10H]   do-not-enreg[] "field V05._upper (fldOffset=0x8)" P-DEP
;* V13 tmp11        [V13    ] (  0,  0   )  struct (16) zero-ref    "Promoted implicit byref" <System.UInt128>
;
; Lcl frame size = 24
; BEGIN METHOD Test:A(System.UInt128):System.UInt128

G_M34252_IG01:  ;; offset=0000H
       sub      rsp, 24
       vzeroupper
                                                ;; size=7 bbWeight=1 PerfScore 1.25
G_M34252_IG02:  ;; offset=0007H
       vmovups  xmm0, xmmword ptr [rdx]
       vmovups  xmmword ptr [rsp+08H], xmm0
       vmovups  xmm0, xmmword ptr [rsp+08H]
       vpslldq  xmm0, xmm0, 1
       vmovups  xmmword ptr [rcx], xmm0
       mov      rax, rcx
                                                ;; size=28 bbWeight=1 PerfScore 11.25
G_M34252_IG03:  ;; offset=0023H
       add      rsp, 24
       ret
                                                ;; size=5 bbWeight=1 PerfScore 1.25
; END METHOD Test:A(System.UInt128):System.UInt128

; Total bytes of code 40, prolog size 7, PerfScore 17.75, instruction count 10, allocated bytes for code 40 (MethodHash=4c3b7a33) for method Test:A(System.UInt128):System.UInt128 (FullOpts)
; ============================================================

with and without that change.

jakobbotsch · 2023-07-24T09:08:30Z

Ah hang on, user error :-) I see it now.

jakobbotsch · 2023-07-24T09:25:26Z

The reason the original Unsafe.As example is bad is just because of dependent promotion. If we allow physical promotion to handle it there is no issue in the codegen even when using Unsafe.As (i.e. set DOTNET_JitStressModeNames=STRESS_NO_OLD_PROMOTION):

; Assembly listing for method Test:A(System.UInt128):System.UInt128 (FullOpts)
; Emitting BLENDED_CODE for X64 with AVX - Windows
; FullOpts code
; optimized code
; rsp based frame
; partially interruptible
; No PGO data
; 0 inlinees with PGO data; 0 single block inlinees; 1 inlinees without PGO data
; Final local variable assignments
;
;  V00 RetBuf       [V00,T01] (  4,  4   )   byref  ->  rcx         single-def
;  V01 arg0         [V01,T00] (  3,  6   )   byref  ->  rdx         single-def
;# V02 OutArgs      [V02    ] (  1,  1   )  struct ( 0) [rsp+00H]   do-not-enreg[XS] addr-exposed "OutgoingArgSpace"
;* V03 tmp1         [V03    ] (  0,  0   )  struct (16) zero-ref    "Inline return value spill temp" <System.UInt128>
;* V04 tmp2         [V04    ] (  0,  0   )  struct (16) zero-ref    do-not-enreg[SF] ld-addr-op "Inlining Arg" <System.UInt128>
;  V05 tmp3         [V05,T02] (  2,  2   )  simd16  ->  mm0         ld-addr-op "Inline stloc first use temp" <System.Runtime.Intrinsics.Vector128`1[ubyte]>
;* V06 tmp4         [V06    ] (  0,  0   )  simd16  ->  zero-ref    "V04.[000..016)"
;
; Lcl frame size = 0
; BEGIN METHOD Test:A(System.UInt128):System.UInt128

G_M34252_IG01:  ;; offset=0000H
       vzeroupper
                                                ;; size=3 bbWeight=1 PerfScore 1.00
G_M34252_IG02:  ;; offset=0003H
       vmovups  xmm0, xmmword ptr [rdx]
       vpslldq  xmm0, xmm0, 1
       vmovups  xmmword ptr [rcx], xmm0
       mov      rax, rcx
                                                ;; size=16 bbWeight=1 PerfScore 7.25
G_M34252_IG03:  ;; offset=0013H
       ret
                                                ;; size=1 bbWeight=1 PerfScore 1.00
; END METHOD Test:A(System.UInt128):System.UInt128

; Total bytes of code 20, prolog size 3, PerfScore 11.25, instruction count 6, allocated bytes for code 20 (MethodHash=4c3b7a33) for method Test:A(System.UInt128):System.UInt128 (FullOpts)
; ============================================================

Hopefully we can improve this more generally in .NET 9 as part of reducing the scope of dependent promotion.

MichalPetryka added the tenet-performance Performance related issue label Jul 8, 2021

dotnet-issue-labeler bot added area-System.Threading.Tasks untriaged New issue has not been triaged by the area owner labels Jul 8, 2021

jkotas added area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI and removed area-System.Threading.Tasks labels Jul 8, 2021

JulieLeeMSFT added needs-further-triage Issue has been initially triaged, but needs deeper consideration or reconsideration and removed untriaged New issue has not been triaged by the area owner labels Jul 8, 2021

JulieLeeMSFT assigned sandreenko Jul 8, 2021

JulieLeeMSFT added this to the 6.0.0 milestone Jul 8, 2021

sandreenko removed the needs-further-triage Issue has been initially triaged, but needs deeper consideration or reconsideration label Jul 8, 2021

sandreenko modified the milestones: 6.0.0, 7.0.0 Jul 8, 2021

MineCake147E mentioned this issue Dec 23, 2021

Store System.Numerics.Complex values in vector registers #63067

Open

JulieLeeMSFT modified the milestones: 7.0.0, 8.0.0 Jun 3, 2022

JulieLeeMSFT unassigned sandreenko Jun 3, 2022

JulieLeeMSFT assigned MichalPetryka Oct 5, 2022

JulieLeeMSFT added the needs-author-action An issue or pull request that requires more info or actions from the author. label Oct 5, 2022

ghost added needs-further-triage Issue has been initially triaged, but needs deeper consideration or reconsideration and removed needs-author-action An issue or pull request that requires more info or actions from the author. labels Oct 18, 2022

TIHan self-assigned this Nov 29, 2022

TIHan removed the needs-further-triage Issue has been initially triaged, but needs deeper consideration or reconsideration label Nov 29, 2022

SingleAccretion mentioned this issue Nov 30, 2022

Improve "one asg morphing" and move it to morphblock.cpp #79037

Merged

SingleAccretion unassigned MichalPetryka Jan 11, 2023

MichalPetryka mentioned this issue May 1, 2023

Intrinsify Unsafe.Read/Write/Copy, handle struct BitCast #85562

Merged

ghost added the in-pr There is an active PR which will close this issue when it is merged label May 1, 2023

EgorBo closed this as completed in #85562 Jul 23, 2023

ghost removed the in-pr There is an active PR which will close this issue when it is merged label Jul 23, 2023

MichalPetryka mentioned this issue Jul 24, 2023

CoreCLR generates suboptimal codegen with structs passed via multiple registers #89374

Open

ghost locked as resolved and limited conversation to collaborators Aug 23, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Unsafe.As resulting in lots of unnecessary moves. #55357

Unsafe.As resulting in lots of unnecessary moves. #55357

MichalPetryka commented Jul 8, 2021 •

edited by TIHan

Loading

ghost commented Jul 8, 2021

Description

Configuration

Regression?

Data

MichalPetryka commented Jul 8, 2021 •

edited

Loading

GrabYourPitchforks commented Jul 8, 2021

sandreenko commented Jul 8, 2021

JulieLeeMSFT commented Jun 3, 2022

tannergooding commented Jun 3, 2022

ghost commented Oct 5, 2022

JulieLeeMSFT commented Oct 5, 2022

MichalPetryka commented Oct 18, 2022

tannergooding commented Oct 25, 2022

MichalPetryka commented Apr 13, 2023 •

edited

Loading

MichalPetryka commented May 1, 2023 •

edited

Loading

jakobbotsch commented Jul 24, 2023

jakobbotsch commented Jul 24, 2023

jakobbotsch commented Jul 24, 2023

Unsafe.As resulting in lots of unnecessary moves. #55357

Unsafe.As resulting in lots of unnecessary moves. #55357

Comments

MichalPetryka commented Jul 8, 2021 • edited by TIHan Loading

Description

Configuration

Regression?

Data

ghost commented Jul 8, 2021

Description

Configuration

Regression?

Data

MichalPetryka commented Jul 8, 2021 • edited Loading

GrabYourPitchforks commented Jul 8, 2021

sandreenko commented Jul 8, 2021

JulieLeeMSFT commented Jun 3, 2022

tannergooding commented Jun 3, 2022

ghost commented Oct 5, 2022

JulieLeeMSFT commented Oct 5, 2022

MichalPetryka commented Oct 18, 2022

tannergooding commented Oct 25, 2022

MichalPetryka commented Apr 13, 2023 • edited Loading

MichalPetryka commented May 1, 2023 • edited Loading

jakobbotsch commented Jul 24, 2023

jakobbotsch commented Jul 24, 2023

jakobbotsch commented Jul 24, 2023

MichalPetryka commented Jul 8, 2021 •

edited by TIHan

Loading

MichalPetryka commented Jul 8, 2021 •

edited

Loading

MichalPetryka commented Apr 13, 2023 •

edited

Loading

MichalPetryka commented May 1, 2023 •

edited

Loading