Add xarch andn
#64350
Tagging subscribers to this area: @JulieLeeMSFT
Issue Details
This adds a lowering for the patterns AND(x, NOT(y)) and AND(NOT(x), y), using the existing arm specific logic as a base.
The arm specific version is moved into arm specific lowering, where it functions unchanged: it generates a GT_AND_NOT node which is later emitted as a BitClear instruction.
The xarch specific version has been added. It generates a new HWIntrinsic node for the andn instruction, taking advantage of the support for the instruction already in place in the emitter. The instruction format needed to be added to the whitelist of formats that are recognised as setting the ZF flag to allow the best optimization.
Diffs are improvements. In all cases where the node is used the instruction count is reduced. Where the code size has increased, it is because the three-operand andn encoding takes one more byte than the separate not and and instructions it replaces.
aspnet.run.windows.x64.checked.mch:
Detail diffs
benchmarks.run.windows.x64.checked.mch:
Detail diffs
coreclr_tests.pmi.windows.x64.checked.mch:
Detail diffs
libraries.pmi.windows.x64.checked.mch:
Detail diffs
Example:
; Assembly listing for method Microsoft.CodeAnalysis.CSharp.Symbols.SymbolCompletionState:get_NextIncompletePart():int:this
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; optimized code
; rsp based frame
; partially interruptible
; No matching PGO data
; 0 inlinees with PGO data; 1 single block inlinees; 0 inlinees without PGO data
; Final local variable assignments
;
; V00 this [V00,T01] ( 3, 3 ) byref -> rcx this single-def
;# V01 OutArgs [V01 ] ( 1, 1 ) lclBlk ( 0) [rsp+00H] "OutgoingArgSpace"
; V02 tmp1 [V02,T00] ( 3, 6 ) int -> rax "dup spill"
;
; Lcl frame size = 0
G_M56083_IG01: ; gcrefRegs=00000000 {}, byrefRegs=00000000 {}, byref, nogc <-- Prolog IG
;; bbWeight=1 PerfScore 0.00
G_M56083_IG02: ; gcrefRegs=00000000 {}, byrefRegs=00000002 {rcx}, byref
; byrRegs +[rcx]
mov eax, dword ptr [rcx]
not eax
and eax, 0x1FFFF
lea edx, [rax-1]
- not edx
- and eax, edx
+ andn eax, edx, eax
;; bbWeight=1 PerfScore 3.50
G_M56083_IG03: ; , epilog, nogc, extend
ret
;; bbWeight=1 PerfScore 1.00
-; Total bytes of code 17, prolog size 0, PerfScore 6.20, instruction count 7, allocated bytes for code 17 (MethodHash=781a24ec) for method Microsoft.CodeAnalysis.CSharp.Symbols.SymbolCompletionState:get_NextIncompletePart():int:this
+; Total bytes of code 18, prolog size 0, PerfScore 6.30, instruction count 6, allocated bytes for code 18 (MethodHash=781a24ec) for method Microsoft.CodeAnalysis.CSharp.Symbols.SymbolCompletionState:get_NextIncompletePart():int:this
; ============================================================
Unwind Info:
>> Start offset : 0x000000 (not in unwind data)
>> End offset : 0xd1ffab1e (not in unwind data)
Version : 1
Flags : 0x00
SizeOfProlog : 0x00
CountOfUnwindCodes: 0
FrameRegister : none (0)
FrameOffset : N/A (no FrameRegister) (Value=0)
UnwindCodes :
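To make the diff above easier to follow, here is a small C++ model of the two instruction sequences. It is only a sketch: the function names (andn, sequence_before, sequence_after, check_sequences_agree) are made up for illustration, and `state` stands in for the dword loaded from [rcx]. The one fact it relies on is the operand order of the BMI1 instruction, where andn dst, src1, src2 computes dst = ~src1 & src2.

```cpp
#include <cassert>
#include <cstdint>

// Models the BMI1 andn instruction: dst = ~src1 & src2.
constexpr std::uint32_t andn(std::uint32_t src1, std::uint32_t src2)
{
    return ~src1 & src2;
}

// Sequence before the change (hypothetical name; mirrors the '-' lines).
constexpr std::uint32_t sequence_before(std::uint32_t state)
{
    std::uint32_t eax = ~state & 0x1FFFFu; // not eax / and eax, 0x1FFFF
    std::uint32_t edx = eax - 1;           // lea edx, [rax-1]
    edx = ~edx;                            // not edx
    return eax & edx;                      // and eax, edx
}

// Sequence after the change (hypothetical name; mirrors the '+' line).
constexpr std::uint32_t sequence_after(std::uint32_t state)
{
    std::uint32_t eax = ~state & 0x1FFFFu; // not eax / and eax, 0x1FFFF
    std::uint32_t edx = eax - 1;           // lea edx, [rax-1]
    return andn(edx, eax);                 // andn eax, edx, eax
}

// Checks that both sequences agree for every value of the 17 masked bits.
inline void check_sequences_agree()
{
    for (std::uint32_t state = 0; state <= 0x1FFFFu; ++state)
    {
        assert(sequence_before(state) == sequence_after(state));
    }
}
```

Both versions isolate the lowest set bit of the masked, inverted value; for instance, a state of 0x5 yields 0x2 either way. The only difference is that the not and and pair collapses into one andn.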
/cc @dotnet/jit-contrib
|
Could you point to the place in the manual where it describes which flags are set with |
@kunalspathak, The actual instruction pages call out AMD
Intel
and so are in general agreement, other than |
@tannergooding - Thanks for the pointers. This is what I see in AMD manual, so perhaps we need to change from runtime/src/coreclr/jit/instrsxarch.h Line 74 in 519fc25
|
@kunalspathak could you review when you get chance please? |
Definitely pumped for this, good stuff @Wraith2 👍🏼 This sort of thing will help my ARM64 builds where I haven't done much arch-specific fine-tuning (and can't really justify the cost of doing so yet) |
LGTM! Thanks for your contribution.
@kunalspathak it's the number one gets enabling |
This adds a lowering for the patterns AND(x, NOT(y)) and AND(NOT(x), y), using the existing arm specific logic as a base.
The arm specific version is moved into arm specific lowering, where it functions unchanged: it generates a GT_AND_NOT node which is later emitted as a BitClear instruction.
The xarch specific version has been added. It generates a new HWIntrinsic node for the andn instruction, taking advantage of the support for the instruction already in place in the emitter. The instruction format needed to be added to the whitelist of formats that are recognised as setting the ZF flag to allow the best optimization.
Diffs are improvements. In all cases where the node is used the instruction count is reduced. Where the code size has increased, it is because the three-operand andn encoding takes one more byte than the separate not and and instructions it replaces.
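The pattern match described above can be sketched on a toy expression tree. This is not the JIT's code: Op, Node, leaf, unary, binary and lower_and are all hypothetical names, and the AndNot node here simply stands in for whatever the real lowering produces (GT_AND_NOT on arm, a HWIntrinsic node on xarch). The operand order chosen for AndNot (first operand kept, second operand inverted) is an assumption made for this example only.

```cpp
#include <memory>
#include <utility>

// Hypothetical miniature IR; these are NOT the JIT's GenTree types.
enum class Op { Leaf, Not, And, AndNot };

struct Node
{
    Op op = Op::Leaf;
    int id = 0;                      // identifies a leaf operand
    std::unique_ptr<Node> lhs, rhs;  // Not uses only lhs
};

using NodePtr = std::unique_ptr<Node>;

inline NodePtr leaf(int id)
{
    auto n = std::make_unique<Node>();
    n->id = id;
    return n;
}

inline NodePtr unary(Op op, NodePtr a)
{
    auto n = std::make_unique<Node>();
    n->op = op;
    n->lhs = std::move(a);
    return n;
}

inline NodePtr binary(Op op, NodePtr a, NodePtr b)
{
    auto n = std::make_unique<Node>();
    n->op = op;
    n->lhs = std::move(a);
    n->rhs = std::move(b);
    return n;
}

// Rewrites AND(x, NOT(y)) and AND(NOT(x), y) into AndNot(kept, inverted),
// where AndNot(a, b) means a & ~b. Any other node is returned unchanged.
inline NodePtr lower_and(NodePtr node)
{
    if (node->op != Op::And)
    {
        return node;
    }
    if (node->rhs->op == Op::Not)
    {
        // AND(x, NOT(y)) -> AndNot(x, y)
        NodePtr kept = std::move(node->lhs);
        NodePtr inverted = std::move(node->rhs->lhs);
        return binary(Op::AndNot, std::move(kept), std::move(inverted));
    }
    if (node->lhs->op == Op::Not)
    {
        // AND(NOT(x), y) -> AndNot(y, x)
        NodePtr kept = std::move(node->rhs);
        NodePtr inverted = std::move(node->lhs->lhs);
        return binary(Op::AndNot, std::move(kept), std::move(inverted));
    }
    return node;
}
```

Handling both operand positions matters because AND is commutative, so the NOT can appear on either side of the tree. A plain AND with no NOT child is left alone.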