Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unsafe.As resulting in lots of unnecessary moves. #55357

Closed
MichalPetryka opened this issue Jul 8, 2021 · 15 comments · Fixed by #85562
Closed

Unsafe.As resulting in lots of unnecessary moves. #55357

MichalPetryka opened this issue Jul 8, 2021 · 15 comments · Fixed by #85562
Assignees
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI tenet-performance Performance related issue
Milestone

Comments

@MichalPetryka
Copy link
Contributor

MichalPetryka commented Jul 8, 2021

Description

Using Unsafe.As on a struct results in a lot of unnecessary asm move instructions, especially when on method return and causes even more of them after inlining.

Configuration

Sharplab Core CLR v5.0.721.25508

Regression?

No idea

Data

Sharplab

using System.Runtime.Intrinsics.X86;
using System.Runtime.Intrinsics;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;

public static unsafe class Test
{
    public static UInt128 A(UInt128 v)
    {
        return ShiftA(v);
    }
    
    public static Vector128<byte> B(Vector128<byte> v)
    {
        return ShiftB(v);
    }
    
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static UInt128 ShiftA(UInt128 c)
    {
        if (Sse2.IsSupported)
		{
            Vector128<byte> val1 = Sse2.ShiftLeftLogical128BitLane(
                Unsafe.As<UInt128, Vector128<byte>>(ref c), 1);
            return Unsafe.As<Vector128<byte>, UInt128>(ref val1);
		}
        
        // not important in this case
        return default;
    }
    
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static Vector128<byte> ShiftB(Vector128<byte> c)
    {
        if (Sse2.IsSupported)
		{
            return Sse2.ShiftLeftLogical128BitLane(c, 1);
		}
        
        // not important in this case
		return default;
    }
    
    [StructLayout(LayoutKind.Explicit, Size = 16)]
    public readonly struct UInt128
    {
        [FieldOffset(0)]
        public readonly ulong long1;
        [FieldOffset(8)]
        public readonly ulong long2;
        
        public UInt128(ulong a, ulong b)
        {
            long1 = a;
            long2 = b;
        }
    }
}

Output:

; Core CLR 5.0.921.35908 on amd64

Test.A(UInt128)
    L0000: sub rsp, 0x28
    L0004: vzeroupper
    L0007: mov rax, [rdx]
    L000a: mov rax, [rdx+8]
    L000e: vmovupd xmm0, [rsp]
    L0013: vpslldq xmm0, xmm0, 1
    L0018: vmovapd [rsp], xmm0
    L001d: vmovdqu xmm0, [rsp]
    L0022: vmovdqu [rsp+0x18], xmm0
    L0028: vmovdqu xmm0, [rsp+0x18]
    L002e: vmovdqu [rcx], xmm0
    L0032: mov rax, rcx
    L0035: add rsp, 0x28
    L0039: ret

Test.B(System.Runtime.Intrinsics.Vector128`1<Byte>)
    L0000: vzeroupper
    L0003: vmovupd xmm0, [rdx]
    L0007: vpslldq xmm0, xmm0, 1
    L000c: vmovupd [rcx], xmm0
    L0010: mov rax, rcx
    L0013: ret

Test.ShiftA(UInt128)
    L0000: sub rsp, 0x18
    L0004: vzeroupper
    L0007: vmovupd xmm0, [rdx]
    L000b: vpslldq xmm0, xmm0, 1
    L0010: vmovapd [rsp], xmm0
    L0015: vmovdqu xmm0, [rsp]
    L001a: vmovdqu [rcx], xmm0
    L001e: mov rax, rcx
    L0021: add rsp, 0x18
    L0025: ret

Test.ShiftB(System.Runtime.Intrinsics.Vector128`1<Byte>)
    L0000: vzeroupper
    L0003: vmovupd xmm0, [rdx]
    L0007: vpslldq xmm0, xmm0, 1
    L000c: vmovupd [rcx], xmm0
    L0010: mov rax, rcx
    L0013: ret

Test+UInt128..ctor(UInt64, UInt64)
    L0000: mov [rcx], rdx
    L0003: mov [rcx+8], r8
    L0007: ret

category:cq
theme:structs
skill-level:expert
cost:medium
impact:medium

@MichalPetryka MichalPetryka added the tenet-performance Performance related issue label Jul 8, 2021
@dotnet-issue-labeler dotnet-issue-labeler bot added area-System.Threading.Tasks untriaged New issue has not been triaged by the area owner labels Jul 8, 2021
@ghost
Copy link

ghost commented Jul 8, 2021

Tagging subscribers to this area: @dotnet/area-system-threading-tasks
See info in area-owners.md if you want to be subscribed.

Issue Details

Description

Using Unsafe.As on a struct results in a lot of unnecessary asm move instructions, especially when on method return and causes even more of them after inlining.

Configuration

Sharplab Core CLR v5.0.721.25508

Regression?

No idea

Data

Sharplab

Author: MichalPetryka
Assignees: -
Labels:

area-System.Threading.Tasks, tenet-performance, untriaged

Milestone: -

@MichalPetryka
Copy link
Contributor Author

MichalPetryka commented Jul 8, 2021

The area is wrong, should be area-System.Runtime.CompilerServices or area-CodeGen-coreclr.

@jkotas jkotas added area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI and removed area-System.Threading.Tasks labels Jul 8, 2021
@GrabYourPitchforks
Copy link
Member

Related: #55064, #10315

@JulieLeeMSFT JulieLeeMSFT added needs-further-triage Issue has been initially triaged, but needs deeper consideration or reconsideration and removed untriaged New issue has not been triaged by the area owner labels Jul 8, 2021
@JulieLeeMSFT JulieLeeMSFT added this to the 6.0.0 milestone Jul 8, 2021
@sandreenko sandreenko removed the needs-further-triage Issue has been initially triaged, but needs deeper consideration or reconsideration label Jul 8, 2021
@sandreenko sandreenko modified the milestones: 6.0.0, 7.0.0 Jul 8, 2021
@sandreenko
Copy link
Contributor

We still don't have a good way to represent a struct cast without taking its address, it is a known problem that I was planning to fix for 6.0 but did not have time to finish.

There is no easy fix, but we are aware of the issue and will improve this scenario in 7.0.

@JulieLeeMSFT
Copy link
Member

Moving to .NET 8 as we don't have enough time to work on this.

@JulieLeeMSFT JulieLeeMSFT modified the milestones: 7.0.0, 8.0.0 Jun 3, 2022
@tannergooding
Copy link
Member

This one actually might've been fixed with #68739.

@MichalPetryka is that something you'd be willing to validate looks fixed on your end?

@JulieLeeMSFT JulieLeeMSFT added the needs-author-action An issue or pull request that requires more info or actions from the author. label Oct 5, 2022
@ghost
Copy link

ghost commented Oct 5, 2022

This issue has been marked needs-author-action and may be missing some important information.

@JulieLeeMSFT
Copy link
Member

This one actually might've been fixed with #68739.

@MichalPetryka is that something you'd be willing to validate looks fixed on your end?

Pining @MichalPetryka.

@MichalPetryka
Copy link
Contributor Author

Checked with RC2, the issue is still there @tannergooding.

@ghost ghost added needs-further-triage Issue has been initially triaged, but needs deeper consideration or reconsideration and removed needs-author-action An issue or pull request that requires more info or actions from the author. labels Oct 18, 2022
@tannergooding
Copy link
Member

That is likely not due to Unsafe.As then. The API is now a "nop" in importation: https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/importercalls.cpp#L3752-L3760

@TIHan TIHan self-assigned this Nov 29, 2022
@TIHan TIHan removed the needs-further-triage Issue has been initially triaged, but needs deeper consideration or reconsideration label Nov 29, 2022
@MichalPetryka
Copy link
Contributor Author

MichalPetryka commented Apr 13, 2023

With .Net 8 and Unsafe.BitCast, the codegen is now:

G_M000_IG01:                ;; offset=0000H
       4883EC18             sub      rsp, 24
       C5F877               vzeroupper 
 
G_M000_IG02:                ;; offset=0007H
       C5F81002             vmovups  xmm0, xmmword ptr [rdx]
       C5F811442408         vmovups  xmmword ptr [rsp+08H], xmm0
       C5F810442408         vmovups  xmm0, xmmword ptr [rsp+08H]
       C5F973F801           vpslldq  xmm0, xmm0, 1
       C5F81101             vmovups  xmmword ptr [rcx], xmm0
       488BC1               mov      rax, rcx
 
G_M000_IG03:                ;; offset=0023H
       4883C418             add      rsp, 24
       C3                   ret      

Just 2 redundant moves from a stack spill left.

@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label May 1, 2023
@MichalPetryka
Copy link
Contributor Author

MichalPetryka commented May 1, 2023

#85562 gives the desired codegen here with BitCast.

EDIT: actually seems like it's still not ideal on Linux due to SysV ABI:

; Method Test:A(Test+UInt128):Test+UInt128
G_M20876_IG01:  ;; offset=0000H
       sub      rsp, 40
       vzeroupper 
       mov      qword ptr [rsp+18H], rdi
       mov      qword ptr [rsp+20H], rsi
                        ;; size=17 bbWeight=1 PerfScore 3.25

G_M20876_IG02:  ;; offset=0011H
       vmovups  xmm0, xmmword ptr [rsp+18H]
       vpslldq  xmm0, xmm0, 1
       vmovaps  xmmword ptr [rsp], xmm0
       mov      rax, qword ptr [rsp]
       mov      rdx, qword ptr [rsp+08H]
                        ;; size=25 bbWeight=1 PerfScore 7.00

G_M20876_IG03:  ;; offset=002AH
       add      rsp, 40
       ret      
                        ;; size=5 bbWeight=1 PerfScore 1.25
; Total bytes of code: 47

@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label Jul 23, 2023
@jakobbotsch
Copy link
Member

@MichalPetryka What is the impact of #85562 expected to be here? I see no diffs in A with and without that PR, regardless of using Unsafe.As and Unsafe.BitCast. The codegen on my machine is

; Assembly listing for method Test:A(System.UInt128):System.UInt128 (FullOpts)
; Emitting BLENDED_CODE for X64 with AVX - Windows
; FullOpts code
; optimized code
; rsp based frame
; partially interruptible
; No PGO data
; 0 inlinees with PGO data; 0 single block inlinees; 3 inlinees without PGO data
; Final local variable assignments
;
;  V00 RetBuf       [V00,T01] (  4,  4   )   byref  ->  rcx         single-def
;  V01 arg0         [V01,T00] (  3,  6   )   byref  ->  rdx         single-def
;# V02 OutArgs      [V02    ] (  1,  1   )  struct ( 0) [rsp+00H]   do-not-enreg[XS] addr-exposed "OutgoingArgSpace"
;* V03 tmp1         [V03    ] (  0,  0   )  struct (16) zero-ref    "Inline return value spill temp" <System.UInt128>
;* V04 tmp2         [V04    ] (  0,  0   )  simd16  ->  zero-ref    "spilled call-like call argument"
;  V05 tmp3         [V05    ] (  2,  4   )  struct (16) [rsp+08H]   do-not-enreg[SF] ld-addr-op "Inlining Arg" <System.UInt128>
;  V06 tmp4         [V06,T04] (  2,  4   )  simd16  ->  mm0         ld-addr-op "Inlining Arg" <System.Runtime.Intrinsics.Vector128`1[ubyte]>
;* V07 tmp5         [V07    ] (  0,  0   )    long  ->  zero-ref    "field V01._lower (fldOffset=0x0)" P-INDEP
;* V08 tmp6         [V08    ] (  0,  0   )    long  ->  zero-ref    "field V01._upper (fldOffset=0x8)" P-INDEP
;* V09 tmp7         [V09    ] (  0,  0   )    long  ->  zero-ref    "field V03._lower (fldOffset=0x0)" P-INDEP
;* V10 tmp8         [V10    ] (  0,  0   )    long  ->  zero-ref    "field V03._upper (fldOffset=0x8)" P-INDEP
;  V11 tmp9         [V11,T02] (  2,  4   )    long  ->  [rsp+08H]   do-not-enreg[] "field V05._lower (fldOffset=0x0)" P-DEP
;  V12 tmp10        [V12,T03] (  2,  4   )    long  ->  [rsp+10H]   do-not-enreg[] "field V05._upper (fldOffset=0x8)" P-DEP
;* V13 tmp11        [V13    ] (  0,  0   )  struct (16) zero-ref    "Promoted implicit byref" <System.UInt128>
;
; Lcl frame size = 24
; BEGIN METHOD Test:A(System.UInt128):System.UInt128

G_M34252_IG01:  ;; offset=0000H
       sub      rsp, 24
       vzeroupper
                                                ;; size=7 bbWeight=1 PerfScore 1.25
G_M34252_IG02:  ;; offset=0007H
       vmovups  xmm0, xmmword ptr [rdx]
       vmovups  xmmword ptr [rsp+08H], xmm0
       vmovups  xmm0, xmmword ptr [rsp+08H]
       vpslldq  xmm0, xmm0, 1
       vmovups  xmmword ptr [rcx], xmm0
       mov      rax, rcx
                                                ;; size=28 bbWeight=1 PerfScore 11.25
G_M34252_IG03:  ;; offset=0023H
       add      rsp, 24
       ret
                                                ;; size=5 bbWeight=1 PerfScore 1.25
; END METHOD Test:A(System.UInt128):System.UInt128

; Total bytes of code 40, prolog size 7, PerfScore 17.75, instruction count 10, allocated bytes for code 40 (MethodHash=4c3b7a33) for method Test:A(System.UInt128):System.UInt128 (FullOpts)
; ============================================================

with and without that change.

@jakobbotsch
Copy link
Member

Ah hang on, user error :-) I see it now.

@jakobbotsch
Copy link
Member

The reason the original Unsafe.As example is bad is just because of dependent promotion. If we allow physical promotion to handle it there is no issue in the codegen even when using Unsafe.As (i.e. set DOTNET_JitStressModeNames=STRESS_NO_OLD_PROMOTION):

; Assembly listing for method Test:A(System.UInt128):System.UInt128 (FullOpts)
; Emitting BLENDED_CODE for X64 with AVX - Windows
; FullOpts code
; optimized code
; rsp based frame
; partially interruptible
; No PGO data
; 0 inlinees with PGO data; 0 single block inlinees; 1 inlinees without PGO data
; Final local variable assignments
;
;  V00 RetBuf       [V00,T01] (  4,  4   )   byref  ->  rcx         single-def
;  V01 arg0         [V01,T00] (  3,  6   )   byref  ->  rdx         single-def
;# V02 OutArgs      [V02    ] (  1,  1   )  struct ( 0) [rsp+00H]   do-not-enreg[XS] addr-exposed "OutgoingArgSpace"
;* V03 tmp1         [V03    ] (  0,  0   )  struct (16) zero-ref    "Inline return value spill temp" <System.UInt128>
;* V04 tmp2         [V04    ] (  0,  0   )  struct (16) zero-ref    do-not-enreg[SF] ld-addr-op "Inlining Arg" <System.UInt128>
;  V05 tmp3         [V05,T02] (  2,  2   )  simd16  ->  mm0         ld-addr-op "Inline stloc first use temp" <System.Runtime.Intrinsics.Vector128`1[ubyte]>
;* V06 tmp4         [V06    ] (  0,  0   )  simd16  ->  zero-ref    "V04.[000..016)"
;
; Lcl frame size = 0
; BEGIN METHOD Test:A(System.UInt128):System.UInt128

G_M34252_IG01:  ;; offset=0000H
       vzeroupper
                                                ;; size=3 bbWeight=1 PerfScore 1.00
G_M34252_IG02:  ;; offset=0003H
       vmovups  xmm0, xmmword ptr [rdx]
       vpslldq  xmm0, xmm0, 1
       vmovups  xmmword ptr [rcx], xmm0
       mov      rax, rcx
                                                ;; size=16 bbWeight=1 PerfScore 7.25
G_M34252_IG03:  ;; offset=0013H
       ret
                                                ;; size=1 bbWeight=1 PerfScore 1.00
; END METHOD Test:A(System.UInt128):System.UInt128

; Total bytes of code 20, prolog size 3, PerfScore 11.25, instruction count 6, allocated bytes for code 20 (MethodHash=4c3b7a33) for method Test:A(System.UInt128):System.UInt128 (FullOpts)
; ============================================================

Hopefully we can improve this more generally in .NET 9 as part of reducing the scope of dependent promotion.

@ghost ghost locked as resolved and limited conversation to collaborators Aug 23, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI tenet-performance Performance related issue
Projects
None yet
Development

Successfully merging a pull request may close this issue.

8 participants