Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARM64-SVE: Add MultiplyAddRotateComplexBySelectedScalar #105002

Merged
merged 3 commits into from
Jul 17, 2024

Conversation

amanasifkhalid
Copy link
Member

@amanasifkhalid amanasifkhalid commented Jul 17, 2024

Part of #99957. I had to add a new test template to cover APIs with three vector args, and two immediates; the new template is identical to _SveImmTernOpTestTemplate.template, except that I added additional Imm and InvalidImm arguments. JIT changes are a bit more involved for this API because it is the first ARM64 intrinsic with five operands. This means the HWIntrinsic struct now has an additional GenTree* member. If the TP impact of this is too high, I can get rid of the new member; the code for handling the fifth operand will look a bit less elegant, but it's only for this API, for now.

With @kunalspathak's fix in #104998 merged in locally, all stress tests pass. The SVE encoding unit tests passed, too (I had to tweak the emitter and tests to take the rotation value in its encoded form to match the API-level behavior). Since the jump table generation is a bit odd for this API, here's what the codegen looks like:

G_M19160_IG01:  ;; offset=0x0000
            stp     fp, lr, [sp, #-0x10]!
            mov     fp, sp
                                                ;; size=8 bbWeight=1 PerfScore 1.50
G_M19160_IG02:  ;; offset=0x0008
            uxtb    w0, w0
            cmp     w0, #2
            bhs     G_M19160_IG13
            uxtb    w1, w1
            cmp     w1, #4
            bhs     G_M19160_IG13
            lsl     w1, w1, #1
            orr     w0, w0, w1
            adr     x2, [G_M19160_IG03]
            add     x2, x2, x0,  LSL #3
            br      x2
                                                ;; size=44 bbWeight=1 PerfScore 8.00
G_M19160_IG03:  ;; offset=0x0034
            fcmla   z0.s, z1.s, z2.s[0], #0
            b       G_M19160_IG11
                                                ;; size=8 bbWeight=1 PerfScore 5.00
G_M19160_IG04:  ;; offset=0x003C
            fcmla   z0.s, z1.s, z2.s[1], #0
            b       G_M19160_IG11
                                                ;; size=8 bbWeight=1 PerfScore 5.00
G_M19160_IG05:  ;; offset=0x0044
            fcmla   z0.s, z1.s, z2.s[0], #90
            b       G_M19160_IG11
                                                ;; size=8 bbWeight=1 PerfScore 5.00
G_M19160_IG06:  ;; offset=0x004C
            fcmla   z0.s, z1.s, z2.s[1], #90
            b       G_M19160_IG11
                                                ;; size=8 bbWeight=1 PerfScore 5.00
G_M19160_IG07:  ;; offset=0x0054
            fcmla   z0.s, z1.s, z2.s[0], #180
            b       G_M19160_IG11
                                                ;; size=8 bbWeight=1 PerfScore 5.00
G_M19160_IG08:  ;; offset=0x005C
            fcmla   z0.s, z1.s, z2.s[1], #180
            b       G_M19160_IG11
                                                ;; size=8 bbWeight=1 PerfScore 5.00
G_M19160_IG09:  ;; offset=0x0064
            fcmla   z0.s, z1.s, z2.s[0], #270
            b       G_M19160_IG11
                                                ;; size=8 bbWeight=1 PerfScore 5.00
G_M19160_IG10:  ;; offset=0x006C
            fcmla   z0.s, z1.s, z2.s[1], #270
                                                ;; size=4 bbWeight=1 PerfScore 4.00
G_M19160_IG11:  ;; offset=0x0070
            and     w0, w0, #1
            lsr     w1, w1, #1
                                                ;; size=8 bbWeight=1 PerfScore 1.50
G_M19160_IG12:  ;; offset=0x0078
            ldp     fp, lr, [sp], #0x10
            ret     lr
                                                ;; size=8 bbWeight=1 PerfScore 2.00
G_M19160_IG13:  ;; offset=0x0080
            bl      CORINFO_HELP_THROW_ARGUMENTOUTOFRANGEEXCEPTION
            brk_windows #0
                                                ;; size=8 bbWeight=0 PerfScore 0.00

@dotnet/arm64-contrib PTAL, thanks!

@amanasifkhalid amanasifkhalid added the arm-sve Work related to arm64 SVE/SVE2 support label Jul 17, 2024
Copy link

Note regarding the new-api-needs-documentation label:

This serves as a reminder for when your PR is modifying a ref *.cs file and adding/modifying public APIs, please make sure the API implementation in the src *.cs file is documented with triple slash comments, so the PR reviewers can sign off that change.

@dotnet-issue-labeler dotnet-issue-labeler bot added area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI new-api-needs-documentation labels Jul 17, 2024
Copy link

Note regarding the new-api-needs-documentation label:

This serves as a reminder for when your PR is modifying a ref *.cs file and adding/modifying public APIs, please make sure the API implementation in the src *.cs file is documented with triple slash comments, so the PR reviewers can sign off that change.

Copy link
Contributor

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

@@ -190,6 +190,7 @@ HARDWARE_INTRINSIC(Sve, MinNumberAcross,
HARDWARE_INTRINSIC(Sve, Multiply, -1, 2, true, {INS_sve_mul, INS_sve_mul, INS_sve_mul, INS_sve_mul, INS_sve_mul, INS_sve_mul, INS_sve_mul, INS_sve_mul, INS_sve_fmul, INS_sve_fmul}, HW_Category_SIMD, HW_Flag_Scalable|HW_Flag_OptionalEmbeddedMaskedOperation|HW_Flag_HasRMWSemantics|HW_Flag_LowMaskedOperation)
HARDWARE_INTRINSIC(Sve, MultiplyAdd, -1, -1, false, {INS_sve_mla, INS_sve_mla, INS_sve_mla, INS_sve_mla, INS_sve_mla, INS_sve_mla, INS_sve_mla, INS_sve_mla, INS_invalid, INS_invalid}, HW_Category_SIMD, HW_Flag_Scalable|HW_Flag_EmbeddedMaskedOperation|HW_Flag_HasRMWSemantics|HW_Flag_LowMaskedOperation|HW_Flag_FmaIntrinsic|HW_Flag_SpecialCodeGen)
HARDWARE_INTRINSIC(Sve, MultiplyAddRotateComplex, -1, -1, false, {INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_sve_fcmla, INS_sve_fcmla}, HW_Category_SIMD, HW_Flag_Scalable|HW_Flag_EmbeddedMaskedOperation|HW_Flag_LowMaskedOperation|HW_Flag_HasRMWSemantics|HW_Flag_HasImmediateOperand)
HARDWARE_INTRINSIC(Sve, MultiplyAddRotateComplexBySelectedScalar, -1, 5, false, {INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_sve_fcmla, INS_invalid}, HW_Category_SIMD, HW_Flag_Scalable|HW_Flag_HasImmediateOperand|HW_Flag_LowVectorOperation|HW_Flag_HasRMWSemantics|HW_Flag_SpecialCodeGen|HW_Flag_SpecialImport)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should have HW_Category_SIMDByIndexedElement.

Copy link
Member Author

@amanasifkhalid amanasifkhalid Jul 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that category makes sense, though this API seems different enough from other HW_Category_SIMDByIndexedElement APIs that we would have to add several special cases on that category's path (for example, all the 5-operand handling, plus the fact that we index the third vector register in pairs of elements, so the default bounds calculation for the index doesn't work). Do we want to move this logic over?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

synced offline...ok to skip this but to add a comment in lsraarm64.cpp explaining why we do not mark it such and why we just need 1 internal register although we have 2 immediates.

Copy link
Member

@kunalspathak kunalspathak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you please monitor the TP cost and also check the memory cost by adding op5?

@amanasifkhalid
Copy link
Member Author

can you please monitor the TP cost and also check the memory cost by adding op5?

Sure: Looks like the TP cost is trivial. Is there a tool we use to measure diffs in memory usage?

@kunalspathak
Copy link
Member

Is there a tool we use to measure diffs in memory usage?

ignore that. we have updated the struct which gets allocated on the stack and not from arena allocator.

Copy link
Member

@kunalspathak kunalspathak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks!

@amanasifkhalid
Copy link
Member Author

/ba-g blocking failures are timeouts

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI arm-sve Work related to arm64 SVE/SVE2 support new-api-needs-documentation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants