Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARM64-SVE: Add ShiftRightArithmeticForDivide #104279

Merged
merged 4 commits into from
Jul 3, 2024

Conversation

amanasifkhalid
Copy link
Member

@amanasifkhalid amanasifkhalid commented Jul 2, 2024

Part of #99957. Note that the changes I made to LinearScan::BuildHWIntrinsic are a trivial refactor of existing logic to handle intrinsics with both immediate operands and embedded masks.

Stress test output:

===================Running default===================
------------------- {} -------------------
Passed test: _Sve_r::JIT.HardwareIntrinsics.Arm._Sve.Program.Sve_ShiftRightArithmeticForDivide_sbyte() : 17
Passed test: _Sve_r::JIT.HardwareIntrinsics.Arm._Sve.Program.Sve_ShiftRightArithmeticForDivide_short() : 17
Passed test: _Sve_r::JIT.HardwareIntrinsics.Arm._Sve.Program.Sve_ShiftRightArithmeticForDivide_int() : 17
Passed test: _Sve_r::JIT.HardwareIntrinsics.Arm._Sve.Program.Sve_ShiftRightArithmeticForDivide_long() : 17
===================Running jitstress===================
------------------- {'JitMinOpts': '1'} -------------------
------------------- {'JitStress': '1'} -------------------
------------------- {'JitStress': '2'} -------------------
------------------- {'JitStress': '1', 'TieredCompilation': '1'} -------------------
------------------- {'JitStress': '2', 'TieredCompilation': '1'} -------------------
------------------- {'TailcallStress': '1'} -------------------
------------------- {'ReadyToRun': '0'} -------------------
===================Running jitstressregs===================
------------------- {'JitStressRegs': '1'} -------------------
------------------- {'JitStressRegs': '2'} -------------------
------------------- {'JitStressRegs': '3'} -------------------
------------------- {'JitStressRegs': '4'} -------------------
------------------- {'JitStressRegs': '8'} -------------------
------------------- {'JitStressRegs': '0x10'} -------------------
------------------- {'JitStressRegs': '0x80'} -------------------
------------------- {'JitStressRegs': '0x1000'} -------------------
------------------- {'JitStressRegs': '0x2000'} -------------------
===================Running jitstress2-jitstressregs===================
------------------- {'JitStress': '2', 'JitStressRegs': '1'} -------------------
------------------- {'JitStress': '2', 'JitStressRegs': '2'} -------------------
------------------- {'JitStress': '2', 'JitStressRegs': '3'} -------------------
------------------- {'JitStress': '2', 'JitStressRegs': '4'} -------------------
------------------- {'JitStress': '2', 'JitStressRegs': '8'} -------------------
------------------- {'JitStress': '2', 'JitStressRegs': '0x10'} -------------------
------------------- {'JitStress': '2', 'JitStressRegs': '0x80'} -------------------
------------------- {'JitStress': '2', 'JitStressRegs': '0x1000'} -------------------
------------------- {'JitStress': '2', 'JitStressRegs': '0x2000'} -------------------

===================Running default===================
------------------- {} -------------------
Passed test: _Sve_ro::JIT.HardwareIntrinsics.Arm._Sve.Program.Sve_ShiftRightArithmeticForDivide_sbyte() : 17
Passed test: _Sve_ro::JIT.HardwareIntrinsics.Arm._Sve.Program.Sve_ShiftRightArithmeticForDivide_short() : 17
Passed test: _Sve_ro::JIT.HardwareIntrinsics.Arm._Sve.Program.Sve_ShiftRightArithmeticForDivide_int() : 17
Passed test: _Sve_ro::JIT.HardwareIntrinsics.Arm._Sve.Program.Sve_ShiftRightArithmeticForDivide_long() : 17
===================Running jitstress===================
------------------- {'JitMinOpts': '1'} -------------------
------------------- {'JitStress': '1'} -------------------
------------------- {'JitStress': '2'} -------------------
------------------- {'JitStress': '1', 'TieredCompilation': '1'} -------------------
------------------- {'JitStress': '2', 'TieredCompilation': '1'} -------------------
------------------- {'TailcallStress': '1'} -------------------
------------------- {'ReadyToRun': '0'} -------------------
===================Running jitstressregs===================
------------------- {'JitStressRegs': '1'} -------------------
------------------- {'JitStressRegs': '2'} -------------------
------------------- {'JitStressRegs': '3'} -------------------
------------------- {'JitStressRegs': '4'} -------------------
------------------- {'JitStressRegs': '8'} -------------------
------------------- {'JitStressRegs': '0x10'} -------------------
------------------- {'JitStressRegs': '0x80'} -------------------
------------------- {'JitStressRegs': '0x1000'} -------------------
------------------- {'JitStressRegs': '0x2000'} -------------------
===================Running jitstress2-jitstressregs===================
------------------- {'JitStress': '2', 'JitStressRegs': '1'} -------------------
------------------- {'JitStress': '2', 'JitStressRegs': '2'} -------------------
------------------- {'JitStress': '2', 'JitStressRegs': '3'} -------------------
------------------- {'JitStress': '2', 'JitStressRegs': '4'} -------------------
------------------- {'JitStress': '2', 'JitStressRegs': '8'} -------------------
------------------- {'JitStress': '2', 'JitStressRegs': '0x10'} -------------------
------------------- {'JitStress': '2', 'JitStressRegs': '0x80'} -------------------
------------------- {'JitStress': '2', 'JitStressRegs': '0x1000'} -------------------
------------------- {'JitStress': '2', 'JitStressRegs': '0x2000'} -------------------

@dotnet/arm64-contrib PTAL, thanks!

Copy link

Note regarding the new-api-needs-documentation label:

This serves as a reminder for when your PR is modifying a ref *.cs file and adding/modifying public APIs, please make sure the API implementation in the src *.cs file is documented with triple slash comments, so the PR reviewers can sign off that change.

1 similar comment
Copy link

Note regarding the new-api-needs-documentation label:

This serves as a reminder for when your PR is modifying a ref *.cs file and adding/modifying public APIs, please make sure the API implementation in the src *.cs file is documented with triple slash comments, so the PR reviewers can sign off that change.

Copy link
Contributor

Tagging subscribers to this area: @dotnet/area-system-runtime-intrinsics
See info in area-owners.md if you want to be subscribed.

@amanasifkhalid amanasifkhalid added the arm-sve Work related to arm64 SVE/SVE2 support label Jul 2, 2024
@@ -201,6 +201,7 @@ HARDWARE_INTRINSIC(Sve, SaturatingIncrementByActiveElementCount,
HARDWARE_INTRINSIC(Sve, Scale, -1, -1, false, {INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_sve_fscale, INS_sve_fscale}, HW_Category_SIMD, HW_Flag_Scalable|HW_Flag_BaseTypeFromFirstArg|HW_Flag_EmbeddedMaskedOperation|HW_Flag_LowMaskedOperation|HW_Flag_HasRMWSemantics)
HARDWARE_INTRINSIC(Sve, ShiftLeftLogical, -1, -1, false, {INS_sve_lsl, INS_sve_lsl, INS_sve_lsl, INS_sve_lsl, INS_sve_lsl, INS_sve_lsl, INS_sve_lsl, INS_sve_lsl, INS_invalid, INS_invalid}, HW_Category_SIMD, HW_Flag_Scalable|HW_Flag_BaseTypeFromFirstArg|HW_Flag_EmbeddedMaskedOperation|HW_Flag_LowMaskedOperation|HW_Flag_HasRMWSemantics)
HARDWARE_INTRINSIC(Sve, ShiftRightArithmetic, -1, -1, false, {INS_sve_asr, INS_invalid, INS_sve_asr, INS_invalid, INS_sve_asr, INS_invalid, INS_sve_asr, INS_invalid, INS_invalid, INS_invalid}, HW_Category_SIMD, HW_Flag_Scalable|HW_Flag_BaseTypeFromFirstArg|HW_Flag_EmbeddedMaskedOperation|HW_Flag_LowMaskedOperation|HW_Flag_HasRMWSemantics)
HARDWARE_INTRINSIC(Sve, ShiftRightArithmeticForDivide, -1, -1, false, {INS_sve_asrd, INS_invalid, INS_sve_asrd, INS_invalid, INS_sve_asrd, INS_invalid, INS_sve_asrd, INS_invalid, INS_invalid, INS_invalid}, HW_Category_ShiftRightByImmediate, HW_Flag_Scalable|HW_Flag_EmbeddedMaskedOperation|HW_Flag_LowMaskedOperation|HW_Flag_HasRMWSemantics|HW_Flag_HasImmediateOperand)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you need to also add handling in lookupImmBounds in hwintrinsicarm64.cpp and in ContainCheckHWIntrinsic() in lowerarmarch.cpp

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a special case for this API to ContainCheckHWIntrinsic. I'm running into issues with lookupImmBounds where if we are passed a ConditionalSelect intrinsic, we need to be able to unwrap the inner intrinsic, but this method is only passed the name of the intrinsic. There are a few places where lookupImmBounds is called and we don't have rich enough intrinsic data yet to do this unwrapping (such as in the importer), though this probably happens before we wrap the intrinsic with ConditionalSelect, right? I did a quick audit of all the other places where we call lookupImmBounds, and at the other call sites, we are either doing the lookup for a specific intrinsic anyway, or we are taking care to look up in the unwrapped intrinsic thanks to other changes in this PR.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kunalspathak do you think it's fine to leave lookupImmBounds as-is for now, since we're handling the embedded mask case in the caller? Or should I tweak lookupImmBounds to take the intrinsic's GenTree node so we can do the embedded mask check in the method itself?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes

// operand and the intrinsic does not have an alternative non-const fallback form.
// However, for a case when the operand can take only two possible values - zero and one
// the codegen can use cbnz to do conditional branch, so such register is not needed.
auto needsBranchTargetReg = [this](const HWIntrinsic& intrinsic, const GenTreeHWIntrinsic* tree) -> bool {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you monitor the TP diffs because of this change?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For this intrinsic, I think we should just special case it instead of wrapping common code paths in lamba.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not seeing any diffs, though I can undo this if you'd prefer less churn.

HWIntrinsicImmOpHelper helper(this, intrinEmbMask.op2, op2->AsHWIntrinsic());
for (helper.EmitBegin(); !helper.Done(); helper.EmitCaseEnd())
{
GetEmitter()->emitInsSve_R_R_I(insEmbMask, emitSize, reg1, reg2, helper.ImmValue(), opt,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For table generated variant, can you verify using JitStress what code looks like for special cases below (where we use movprfx, etc.)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is what I'm getting for the reflection scenario, without any JitStress:

G_M55611_IG01:  ;; offset=0x0000
            stp     fp, lr, [sp, #-0x10]!
            mov     fp, sp
                                                ;; size=8 bbWeight=1 PerfScore 1.50
G_M55611_IG02:  ;; offset=0x0008
            uxtb    w0, w0
            sub     w1, w0, #1
            cmp     w1, #8
            bhs     G_M55611_IG12
            ptrue   p0.b
            movprfx z0.b, p0/z, z0.b
            adr     x1, [G_M55611_IG03]
            add     x1, x1, x0,  LSL #3
            sub     x1, x1, #8
            br      x1
                                                ;; size=40 bbWeight=1 PerfScore 9.50
G_M55611_IG03:  ;; offset=0x0030
            asrd    z0.b, p0/m, z0.b, #1
            b       G_M55611_IG11
                                                ;; size=8 bbWeight=1 PerfScore 4.00
G_M55611_IG04:  ;; offset=0x0038
            asrd    z0.b, p0/m, z0.b, #2
            b       G_M55611_IG11
                                                ;; size=8 bbWeight=1 PerfScore 4.00
G_M55611_IG05:  ;; offset=0x0040
            asrd    z0.b, p0/m, z0.b, #3
            b       G_M55611_IG11
                                                ;; size=8 bbWeight=1 PerfScore 4.00
G_M55611_IG06:  ;; offset=0x0048
            asrd    z0.b, p0/m, z0.b, #4
            b       G_M55611_IG11
                                                ;; size=8 bbWeight=1 PerfScore 4.00
G_M55611_IG07:  ;; offset=0x0050
            asrd    z0.b, p0/m, z0.b, #5
            b       G_M55611_IG11
                                                ;; size=8 bbWeight=1 PerfScore 4.00
G_M55611_IG08:  ;; offset=0x0058
            asrd    z0.b, p0/m, z0.b, #6
            b       G_M55611_IG11
                                                ;; size=8 bbWeight=1 PerfScore 4.00
G_M55611_IG09:  ;; offset=0x0060
            asrd    z0.b, p0/m, z0.b, #7
            b       G_M55611_IG11
                                                ;; size=8 bbWeight=1 PerfScore 4.00
G_M55611_IG10:  ;; offset=0x0068
            asrd    z0.b, p0/m, z0.b, #8
                                                ;; size=4 bbWeight=1 PerfScore 3.00
G_M55611_IG11:  ;; offset=0x006C
            ldp     fp, lr, [sp], #0x10
            ret     lr
                                                ;; size=8 bbWeight=1 PerfScore 2.00
G_M55611_IG12:  ;; offset=0x0074
            bl      CORINFO_HELP_THROW_ARGUMENTOUTOFRANGEEXCEPTION
            brk_windows #0
                                                ;; size=8 bbWeight=0 PerfScore 0.00

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And with JitStress=2:

G_M55611_IG01:  ;; offset=0x0000
            stp     fp, lr, [sp, #-0x20]!
            mov     fp, sp
            movz    x1, #0x5678
            movk    x1, #0x1234 LSL #16
            movk    x1, #0xDEF0 LSL #32
            movk    x1, #0x9ABC LSL #48
            str     x1, [fp, #0x18]     // [V04 GsCookie]
                                                ;; size=28 bbWeight=1 PerfScore 4.50
G_M55611_IG02:  ;; offset=0x001C
            uxtb    w0, w0
                                                ;; size=4 bbWeight=1 PerfScore 0.50
G_M55611_IG03:  ;; offset=0x0020
            sub     w1, w0, #1
                                                ;; size=4 bbWeight=1 PerfScore 0.50
G_M55611_IG04:  ;; offset=0x0024
            cmp     w1, #8
                                                ;; size=4 bbWeight=1 PerfScore 0.50
G_M55611_IG05:  ;; offset=0x0028
            bhs     G_M55611_IG36
                                                ;; size=4 bbWeight=1 PerfScore 1.00
G_M55611_IG06:  ;; offset=0x002C
            ptrue   p0.b
                                                ;; size=4 bbWeight=1 PerfScore 2.00
G_M55611_IG07:  ;; offset=0x0030
            movprfx z0.b, p0/z, z0.b
                                                ;; size=4 bbWeight=1 PerfScore 2.00
G_M55611_IG08:  ;; offset=0x0034
            adr     x1, [G_M55611_IG12]
                                                ;; size=4 bbWeight=1 PerfScore 0.50
G_M55611_IG09:  ;; offset=0x0038
            add     x1, x1, x0,  LSL #3
                                                ;; size=4 bbWeight=1 PerfScore 1.00
G_M55611_IG10:  ;; offset=0x003C
            sub     x1, x1, #8
                                                ;; size=4 bbWeight=1 PerfScore 0.50
G_M55611_IG11:  ;; offset=0x0040
            br      x1
                                                ;; size=4 bbWeight=1 PerfScore 1.00
G_M55611_IG12:  ;; offset=0x0044
            asrd    z0.b, p0/m, z0.b, #1
                                                ;; size=4 bbWeight=1 PerfScore 3.00
G_M55611_IG13:  ;; offset=0x0048
            b       G_M55611_IG27
                                                ;; size=4 bbWeight=1 PerfScore 1.00
G_M55611_IG14:  ;; offset=0x004C
            asrd    z0.b, p0/m, z0.b, #2
                                                ;; size=4 bbWeight=1 PerfScore 3.00
G_M55611_IG15:  ;; offset=0x0050
            b       G_M55611_IG27
                                                ;; size=4 bbWeight=1 PerfScore 1.00
G_M55611_IG16:  ;; offset=0x0054
            asrd    z0.b, p0/m, z0.b, #3
                                                ;; size=4 bbWeight=1 PerfScore 3.00
G_M55611_IG17:  ;; offset=0x0058
            b       G_M55611_IG27
                                                ;; size=4 bbWeight=1 PerfScore 1.00
G_M55611_IG18:  ;; offset=0x005C
            asrd    z0.b, p0/m, z0.b, #4
                                                ;; size=4 bbWeight=1 PerfScore 3.00
G_M55611_IG19:  ;; offset=0x0060
            b       G_M55611_IG27
                                                ;; size=4 bbWeight=1 PerfScore 1.00
G_M55611_IG20:  ;; offset=0x0064
            asrd    z0.b, p0/m, z0.b, #5
                                                ;; size=4 bbWeight=1 PerfScore 3.00
G_M55611_IG21:  ;; offset=0x0068
            b       G_M55611_IG27
                                                ;; size=4 bbWeight=1 PerfScore 1.00
G_M55611_IG22:  ;; offset=0x006C
            asrd    z0.b, p0/m, z0.b, #6
                                                ;; size=4 bbWeight=1 PerfScore 3.00
G_M55611_IG23:  ;; offset=0x0070
            b       G_M55611_IG27
                                                ;; size=4 bbWeight=1 PerfScore 1.00
G_M55611_IG24:  ;; offset=0x0074
            asrd    z0.b, p0/m, z0.b, #7
                                                ;; size=4 bbWeight=1 PerfScore 3.00
G_M55611_IG25:  ;; offset=0x0078
            b       G_M55611_IG27
                                                ;; size=4 bbWeight=1 PerfScore 1.00
G_M55611_IG26:  ;; offset=0x007C
            asrd    z0.b, p0/m, z0.b, #8
                                                ;; size=4 bbWeight=1 PerfScore 3.00
G_M55611_IG27:  ;; offset=0x0080
            movz    xip0, #0x5678
                                                ;; size=4 bbWeight=1 PerfScore 0.50
G_M55611_IG28:  ;; offset=0x0084
            movk    xip0, #0x1234 LSL #16
                                                ;; size=4 bbWeight=1 PerfScore 0.50
G_M55611_IG29:  ;; offset=0x0088
            movk    xip0, #0xDEF0 LSL #32
                                                ;; size=4 bbWeight=1 PerfScore 0.50
G_M55611_IG30:  ;; offset=0x008C
            movk    xip0, #0x9ABC LSL #48
                                                ;; size=4 bbWeight=1 PerfScore 0.50
G_M55611_IG31:  ;; offset=0x0090
            ldr     xip1, [fp, #0x18]   // [V04 GsCookie]
                                                ;; size=4 bbWeight=1 PerfScore 2.00
G_M55611_IG32:  ;; offset=0x0094
            cmp     xip0, xip1
                                                ;; size=4 bbWeight=1 PerfScore 0.50
G_M55611_IG33:  ;; offset=0x0098
            beq     G_M55611_IG35
                                                ;; size=4 bbWeight=1 PerfScore 1.00
G_M55611_IG34:  ;; offset=0x009C
            bl      CORINFO_HELP_FAIL_FAST
                                                ;; size=4 bbWeight=1 PerfScore 1.00
G_M55611_IG35:  ;; offset=0x00A0
            ldp     fp, lr, [sp], #0x20
            ret     lr
                                                ;; size=8 bbWeight=1 PerfScore 2.00
G_M55611_IG36:  ;; offset=0x00A8
            bl      CORINFO_HELP_THROW_ARGUMENTOUTOFRANGEEXCEPTION
                                                ;; size=4 bbWeight=0 PerfScore 0.00
G_M55611_IG37:  ;; offset=0x00AC
            brk_windows #0
                                                ;; size=4 bbWeight=0 PerfScore 0.00

Copy link
Member

@kunalspathak kunalspathak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@amanasifkhalid amanasifkhalid merged commit 89d63f9 into dotnet:main Jul 3, 2024
162 of 167 checks passed
@github-actions github-actions bot locked and limited conversation to collaborators Aug 3, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants