-
Notifications
You must be signed in to change notification settings - Fork 4.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARM64-SVE: Add ShiftRightArithmeticForDivide #104279
Conversation
Note regarding the
|
1 similar comment
Note regarding the
|
Tagging subscribers to this area: @dotnet/area-system-runtime-intrinsics |
@@ -201,6 +201,7 @@ HARDWARE_INTRINSIC(Sve, SaturatingIncrementByActiveElementCount, | |||
HARDWARE_INTRINSIC(Sve, Scale, -1, -1, false, {INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_sve_fscale, INS_sve_fscale}, HW_Category_SIMD, HW_Flag_Scalable|HW_Flag_BaseTypeFromFirstArg|HW_Flag_EmbeddedMaskedOperation|HW_Flag_LowMaskedOperation|HW_Flag_HasRMWSemantics) | |||
HARDWARE_INTRINSIC(Sve, ShiftLeftLogical, -1, -1, false, {INS_sve_lsl, INS_sve_lsl, INS_sve_lsl, INS_sve_lsl, INS_sve_lsl, INS_sve_lsl, INS_sve_lsl, INS_sve_lsl, INS_invalid, INS_invalid}, HW_Category_SIMD, HW_Flag_Scalable|HW_Flag_BaseTypeFromFirstArg|HW_Flag_EmbeddedMaskedOperation|HW_Flag_LowMaskedOperation|HW_Flag_HasRMWSemantics) | |||
HARDWARE_INTRINSIC(Sve, ShiftRightArithmetic, -1, -1, false, {INS_sve_asr, INS_invalid, INS_sve_asr, INS_invalid, INS_sve_asr, INS_invalid, INS_sve_asr, INS_invalid, INS_invalid, INS_invalid}, HW_Category_SIMD, HW_Flag_Scalable|HW_Flag_BaseTypeFromFirstArg|HW_Flag_EmbeddedMaskedOperation|HW_Flag_LowMaskedOperation|HW_Flag_HasRMWSemantics) | |||
HARDWARE_INTRINSIC(Sve, ShiftRightArithmeticForDivide, -1, -1, false, {INS_sve_asrd, INS_invalid, INS_sve_asrd, INS_invalid, INS_sve_asrd, INS_invalid, INS_sve_asrd, INS_invalid, INS_invalid, INS_invalid}, HW_Category_ShiftRightByImmediate, HW_Flag_Scalable|HW_Flag_EmbeddedMaskedOperation|HW_Flag_LowMaskedOperation|HW_Flag_HasRMWSemantics|HW_Flag_HasImmediateOperand) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you need to also add handling in lookupImmBounds
in hwintrinsicarm64.cpp
and in ContainCheckHWIntrinsic()
in lowerarmarch.cpp
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a special case for this API to ContainCheckHWIntrinsic
. I'm running into issues with lookupImmBounds
where if we are passed a ConditionalSelect
intrinsic, we need to be able to unwrap the inner intrinsic, but this method is only passed the name of the intrinsic. There are a few places where lookupImmBounds
is called and we don't have rich enough intrinsic data yet to do this unwrapping (such as in the importer), though this probably happens before we wrap the intrinsic with ConditionalSelect
, right? I did a quick audit of all the other places where we call lookupImmBounds
, and at the other call sites, we are either doing the lookup for a specific intrinsic anyway, or we are taking care to look up in the unwrapped intrinsic thanks to other changes in this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kunalspathak do you think it's fine to leave lookupImmBounds
as-is for now, since we're handling the embedded mask case in the caller? Or should I tweak lookupImmBounds
to take the intrinsic's GenTree
node so we can do the embedded mask check in the method itself?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes
src/coreclr/jit/lsraarm64.cpp
Outdated
// operand and the intrinsic does not have an alternative non-const fallback form. | ||
// However, for a case when the operand can take only two possible values - zero and one | ||
// the codegen can use cbnz to do conditional branch, so such register is not needed. | ||
auto needsBranchTargetReg = [this](const HWIntrinsic& intrinsic, const GenTreeHWIntrinsic* tree) -> bool { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you monitor the TP diffs because of this change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For this intrinsic, I think we should just special case it instead of wrapping common code paths in lamba.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not seeing any diffs, though I can undo this if you'd prefer less churn.
HWIntrinsicImmOpHelper helper(this, intrinEmbMask.op2, op2->AsHWIntrinsic()); | ||
for (helper.EmitBegin(); !helper.Done(); helper.EmitCaseEnd()) | ||
{ | ||
GetEmitter()->emitInsSve_R_R_I(insEmbMask, emitSize, reg1, reg2, helper.ImmValue(), opt, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For table generated variant, can you verify using JitStress
what code looks like for special cases below (where we use movprfx
, etc.)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is what I'm getting for the reflection scenario, without any JitStress:
G_M55611_IG01: ;; offset=0x0000
stp fp, lr, [sp, #-0x10]!
mov fp, sp
;; size=8 bbWeight=1 PerfScore 1.50
G_M55611_IG02: ;; offset=0x0008
uxtb w0, w0
sub w1, w0, #1
cmp w1, #8
bhs G_M55611_IG12
ptrue p0.b
movprfx z0.b, p0/z, z0.b
adr x1, [G_M55611_IG03]
add x1, x1, x0, LSL #3
sub x1, x1, #8
br x1
;; size=40 bbWeight=1 PerfScore 9.50
G_M55611_IG03: ;; offset=0x0030
asrd z0.b, p0/m, z0.b, #1
b G_M55611_IG11
;; size=8 bbWeight=1 PerfScore 4.00
G_M55611_IG04: ;; offset=0x0038
asrd z0.b, p0/m, z0.b, #2
b G_M55611_IG11
;; size=8 bbWeight=1 PerfScore 4.00
G_M55611_IG05: ;; offset=0x0040
asrd z0.b, p0/m, z0.b, #3
b G_M55611_IG11
;; size=8 bbWeight=1 PerfScore 4.00
G_M55611_IG06: ;; offset=0x0048
asrd z0.b, p0/m, z0.b, #4
b G_M55611_IG11
;; size=8 bbWeight=1 PerfScore 4.00
G_M55611_IG07: ;; offset=0x0050
asrd z0.b, p0/m, z0.b, #5
b G_M55611_IG11
;; size=8 bbWeight=1 PerfScore 4.00
G_M55611_IG08: ;; offset=0x0058
asrd z0.b, p0/m, z0.b, #6
b G_M55611_IG11
;; size=8 bbWeight=1 PerfScore 4.00
G_M55611_IG09: ;; offset=0x0060
asrd z0.b, p0/m, z0.b, #7
b G_M55611_IG11
;; size=8 bbWeight=1 PerfScore 4.00
G_M55611_IG10: ;; offset=0x0068
asrd z0.b, p0/m, z0.b, #8
;; size=4 bbWeight=1 PerfScore 3.00
G_M55611_IG11: ;; offset=0x006C
ldp fp, lr, [sp], #0x10
ret lr
;; size=8 bbWeight=1 PerfScore 2.00
G_M55611_IG12: ;; offset=0x0074
bl CORINFO_HELP_THROW_ARGUMENTOUTOFRANGEEXCEPTION
brk_windows #0
;; size=8 bbWeight=0 PerfScore 0.00
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And with JitStress=2
:
G_M55611_IG01: ;; offset=0x0000
stp fp, lr, [sp, #-0x20]!
mov fp, sp
movz x1, #0x5678
movk x1, #0x1234 LSL #16
movk x1, #0xDEF0 LSL #32
movk x1, #0x9ABC LSL #48
str x1, [fp, #0x18] // [V04 GsCookie]
;; size=28 bbWeight=1 PerfScore 4.50
G_M55611_IG02: ;; offset=0x001C
uxtb w0, w0
;; size=4 bbWeight=1 PerfScore 0.50
G_M55611_IG03: ;; offset=0x0020
sub w1, w0, #1
;; size=4 bbWeight=1 PerfScore 0.50
G_M55611_IG04: ;; offset=0x0024
cmp w1, #8
;; size=4 bbWeight=1 PerfScore 0.50
G_M55611_IG05: ;; offset=0x0028
bhs G_M55611_IG36
;; size=4 bbWeight=1 PerfScore 1.00
G_M55611_IG06: ;; offset=0x002C
ptrue p0.b
;; size=4 bbWeight=1 PerfScore 2.00
G_M55611_IG07: ;; offset=0x0030
movprfx z0.b, p0/z, z0.b
;; size=4 bbWeight=1 PerfScore 2.00
G_M55611_IG08: ;; offset=0x0034
adr x1, [G_M55611_IG12]
;; size=4 bbWeight=1 PerfScore 0.50
G_M55611_IG09: ;; offset=0x0038
add x1, x1, x0, LSL #3
;; size=4 bbWeight=1 PerfScore 1.00
G_M55611_IG10: ;; offset=0x003C
sub x1, x1, #8
;; size=4 bbWeight=1 PerfScore 0.50
G_M55611_IG11: ;; offset=0x0040
br x1
;; size=4 bbWeight=1 PerfScore 1.00
G_M55611_IG12: ;; offset=0x0044
asrd z0.b, p0/m, z0.b, #1
;; size=4 bbWeight=1 PerfScore 3.00
G_M55611_IG13: ;; offset=0x0048
b G_M55611_IG27
;; size=4 bbWeight=1 PerfScore 1.00
G_M55611_IG14: ;; offset=0x004C
asrd z0.b, p0/m, z0.b, #2
;; size=4 bbWeight=1 PerfScore 3.00
G_M55611_IG15: ;; offset=0x0050
b G_M55611_IG27
;; size=4 bbWeight=1 PerfScore 1.00
G_M55611_IG16: ;; offset=0x0054
asrd z0.b, p0/m, z0.b, #3
;; size=4 bbWeight=1 PerfScore 3.00
G_M55611_IG17: ;; offset=0x0058
b G_M55611_IG27
;; size=4 bbWeight=1 PerfScore 1.00
G_M55611_IG18: ;; offset=0x005C
asrd z0.b, p0/m, z0.b, #4
;; size=4 bbWeight=1 PerfScore 3.00
G_M55611_IG19: ;; offset=0x0060
b G_M55611_IG27
;; size=4 bbWeight=1 PerfScore 1.00
G_M55611_IG20: ;; offset=0x0064
asrd z0.b, p0/m, z0.b, #5
;; size=4 bbWeight=1 PerfScore 3.00
G_M55611_IG21: ;; offset=0x0068
b G_M55611_IG27
;; size=4 bbWeight=1 PerfScore 1.00
G_M55611_IG22: ;; offset=0x006C
asrd z0.b, p0/m, z0.b, #6
;; size=4 bbWeight=1 PerfScore 3.00
G_M55611_IG23: ;; offset=0x0070
b G_M55611_IG27
;; size=4 bbWeight=1 PerfScore 1.00
G_M55611_IG24: ;; offset=0x0074
asrd z0.b, p0/m, z0.b, #7
;; size=4 bbWeight=1 PerfScore 3.00
G_M55611_IG25: ;; offset=0x0078
b G_M55611_IG27
;; size=4 bbWeight=1 PerfScore 1.00
G_M55611_IG26: ;; offset=0x007C
asrd z0.b, p0/m, z0.b, #8
;; size=4 bbWeight=1 PerfScore 3.00
G_M55611_IG27: ;; offset=0x0080
movz xip0, #0x5678
;; size=4 bbWeight=1 PerfScore 0.50
G_M55611_IG28: ;; offset=0x0084
movk xip0, #0x1234 LSL #16
;; size=4 bbWeight=1 PerfScore 0.50
G_M55611_IG29: ;; offset=0x0088
movk xip0, #0xDEF0 LSL #32
;; size=4 bbWeight=1 PerfScore 0.50
G_M55611_IG30: ;; offset=0x008C
movk xip0, #0x9ABC LSL #48
;; size=4 bbWeight=1 PerfScore 0.50
G_M55611_IG31: ;; offset=0x0090
ldr xip1, [fp, #0x18] // [V04 GsCookie]
;; size=4 bbWeight=1 PerfScore 2.00
G_M55611_IG32: ;; offset=0x0094
cmp xip0, xip1
;; size=4 bbWeight=1 PerfScore 0.50
G_M55611_IG33: ;; offset=0x0098
beq G_M55611_IG35
;; size=4 bbWeight=1 PerfScore 1.00
G_M55611_IG34: ;; offset=0x009C
bl CORINFO_HELP_FAIL_FAST
;; size=4 bbWeight=1 PerfScore 1.00
G_M55611_IG35: ;; offset=0x00A0
ldp fp, lr, [sp], #0x20
ret lr
;; size=8 bbWeight=1 PerfScore 2.00
G_M55611_IG36: ;; offset=0x00A8
bl CORINFO_HELP_THROW_ARGUMENTOUTOFRANGEEXCEPTION
;; size=4 bbWeight=0 PerfScore 0.00
G_M55611_IG37: ;; offset=0x00AC
brk_windows #0
;; size=4 bbWeight=0 PerfScore 0.00
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Part of #99957.
Note that the changes I made toLinearScan::BuildHWIntrinsic
are a trivial refactor of existing logic to handle intrinsics with both immediate operands and embedded masks.Stress test output:
@dotnet/arm64-contrib PTAL, thanks!