Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[AArch64][SVE] Detect MOV (imm, pred, zeroing/merging) #116032

Merged
merged 4 commits into from
Nov 15, 2024

Conversation

rj-jesus
Copy link
Contributor

@rj-jesus rj-jesus commented Nov 13, 2024

Add patterns to fold MOV (scalar, predicated) to MOV (imm, pred,
merging) or MOV (imm, pred, zeroing) as appropriate.

This affects the @llvm.aarch64.sve.dup intrinsics, which currently
generate the MOV (scalar, predicated) instructions, even when the
immediate forms are possible. For example:

svuint8_t mov_z_b(svbool_t p) {
  return svdup_u8_z(p, 1);
}

Currently generates:

mov_z_b(__SVBool_t):
        mov     z0.b, #0
        mov     w8, #1
        mov     z0.b, p0/m, w8
        ret

Instead of:

mov_z_b(__SVBool_t):
        mov     z0.b, p0/z, #1
        ret

https://godbolt.org/z/GPzrW1nY8

Add patterns to fold MOV (scalar, predicated) to MOV (imm, pred,
merging) or MOV (imm, pred, zeroing) as appropriate.
@llvmbot
Copy link
Member

llvmbot commented Nov 13, 2024

@llvm/pr-subscribers-backend-aarch64

Author: Ricardo Jesus (rj-jesus)

Changes

Add patterns to fold MOV (scalar, predicated) to MOV (imm, pred,
merging) or MOV (imm, pred, zeroing) as appropriate.


Full diff: https://github.com/llvm/llvm-project/pull/116032.diff

2 Files Affected:

  • (modified) llvm/lib/Target/AArch64/AArch64SVEInstrInfo.td (+20)
  • (added) llvm/test/CodeGen/AArch64/sve-mov-imm-pred.ll (+83)
diff --git a/llvm/lib/Target/AArch64/AArch64SVEInstrInfo.td b/llvm/lib/Target/AArch64/AArch64SVEInstrInfo.td
index c10653e05841cd..d0b4b71a93f641 100644
--- a/llvm/lib/Target/AArch64/AArch64SVEInstrInfo.td
+++ b/llvm/lib/Target/AArch64/AArch64SVEInstrInfo.td
@@ -892,6 +892,26 @@ let Predicates = [HasSVEorSME] in {
   def : Pat<(nxv2i64 (splat_vector (i64 (SVECpyDupImm64Pat i32:$a, i32:$b)))),
             (DUP_ZI_D $a, $b)>;
 
+  // Duplicate Int immediate to active vector elements (zeroing).
+  def : Pat<(nxv16i8 (AArch64dup_mt PPR:$pg, (i32 (SVECpyDupImm8Pat i32:$a, i32:$b)), (SVEDup0Undef))),
+            (CPY_ZPzI_B $pg, $a, $b)>;
+  def : Pat<(nxv8i16 (AArch64dup_mt PPR:$pg, (i32 (SVECpyDupImm16Pat i32:$a, i32:$b)), (SVEDup0Undef))),
+            (CPY_ZPzI_H $pg, $a, $b)>;
+  def : Pat<(nxv4i32 (AArch64dup_mt PPR:$pg, (i32 (SVECpyDupImm32Pat i32:$a, i32:$b)), (SVEDup0Undef))),
+            (CPY_ZPzI_S $pg, $a, $b)>;
+  def : Pat<(nxv2i64 (AArch64dup_mt PPR:$pg, (i64 (SVECpyDupImm64Pat i32:$a, i32:$b)), (SVEDup0Undef))),
+            (CPY_ZPzI_D $pg, $a, $b)>;
+
+  // Duplicate Int immediate to active vector elements (merging).
+  def : Pat<(nxv16i8 (AArch64dup_mt PPR:$pg, (i32 (SVECpyDupImm8Pat i32:$a, i32:$b)), (nxv16i8 ZPR:$z))),
+            (CPY_ZPmI_B $z, $pg, $a, $b)>;
+  def : Pat<(nxv8i16 (AArch64dup_mt PPR:$pg, (i32 (SVECpyDupImm16Pat i32:$a, i32:$b)), (nxv8i16 ZPR:$z))),
+            (CPY_ZPmI_H $z, $pg, $a, $b)>;
+  def : Pat<(nxv4i32 (AArch64dup_mt PPR:$pg, (i32 (SVECpyDupImm32Pat i32:$a, i32:$b)), (nxv4i32 ZPR:$z))),
+            (CPY_ZPmI_S $z, $pg, $a, $b)>;
+  def : Pat<(nxv2i64 (AArch64dup_mt PPR:$pg, (i64 (SVECpyDupImm64Pat i32:$a, i32:$b)), (nxv2i64 ZPR:$z))),
+            (CPY_ZPmI_D $z, $pg, $a, $b)>;
+
   // Duplicate immediate FP into all vector elements.
   def : Pat<(nxv2f16 (splat_vector (f16 fpimm:$val))),
             (DUP_ZR_H (MOVi32imm (bitcast_fpimm_to_i32 f16:$val)))>;
diff --git a/llvm/test/CodeGen/AArch64/sve-mov-imm-pred.ll b/llvm/test/CodeGen/AArch64/sve-mov-imm-pred.ll
new file mode 100644
index 00000000000000..43be70c9590fb0
--- /dev/null
+++ b/llvm/test/CodeGen/AArch64/sve-mov-imm-pred.ll
@@ -0,0 +1,83 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
+; RUN: llc -mtriple=aarch64-linux-gnu -mattr=+sve < %s | FileCheck %s
+
+; Zeroing.
+
+define dso_local <vscale x 16 x i8> @mov_z_b(<vscale x 16 x i1> %pg) {
+; CHECK-LABEL: mov_z_b:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    mov z0.b, p0/z, #1 // =0x1
+; CHECK-NEXT:    ret
+  %r = tail call <vscale x 16 x i8> @llvm.aarch64.sve.dup.nxv16i8(<vscale x 16 x i8> zeroinitializer, <vscale x 16 x i1> %pg, i8 1)
+  ret <vscale x 16 x i8> %r
+}
+
+define dso_local <vscale x 8 x i16> @mov_z_h(<vscale x 8 x i1> %pg) {
+; CHECK-LABEL: mov_z_h:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    mov z0.h, p0/z, #1 // =0x1
+; CHECK-NEXT:    ret
+  %r = tail call <vscale x 8 x i16> @llvm.aarch64.sve.dup.nxv8i16(<vscale x 8 x i16> zeroinitializer, <vscale x 8 x i1> %pg, i16 1)
+  ret <vscale x 8 x i16> %r
+}
+
+define dso_local <vscale x 4 x i32> @mov_z_s(<vscale x 4 x i1> %pg) {
+; CHECK-LABEL: mov_z_s:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    mov z0.s, p0/z, #1 // =0x1
+; CHECK-NEXT:    ret
+  %r = tail call <vscale x 4 x i32> @llvm.aarch64.sve.dup.nxv4i32(<vscale x 4 x i32> zeroinitializer, <vscale x 4 x i1> %pg, i32 1)
+  ret <vscale x 4 x i32> %r
+}
+
+define dso_local <vscale x 2 x i64> @mov_z_d(<vscale x 2 x i1> %pg) {
+; CHECK-LABEL: mov_z_d:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    mov z0.d, p0/z, #1 // =0x1
+; CHECK-NEXT:    ret
+  %r = tail call <vscale x 2 x i64> @llvm.aarch64.sve.dup.nxv2i64(<vscale x 2 x i64> zeroinitializer, <vscale x 2 x i1> %pg, i64 1)
+  ret <vscale x 2 x i64> %r
+}
+
+; Merging.
+
+define dso_local <vscale x 16 x i8> @mov_m_b(<vscale x 16 x i8> %zd, <vscale x 16 x i1> %pg) {
+; CHECK-LABEL: mov_m_b:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    mov z0.b, p0/m, #1 // =0x1
+; CHECK-NEXT:    ret
+  %r = tail call <vscale x 16 x i8> @llvm.aarch64.sve.dup.nxv16i8(<vscale x 16 x i8> %zd, <vscale x 16 x i1> %pg, i8 1)
+  ret <vscale x 16 x i8> %r
+}
+
+define dso_local <vscale x 8 x i16> @mov_m_h(<vscale x 8 x i16> %zd, <vscale x 8 x i1> %pg) {
+; CHECK-LABEL: mov_m_h:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    mov z0.h, p0/m, #1 // =0x1
+; CHECK-NEXT:    ret
+  %r = tail call <vscale x 8 x i16> @llvm.aarch64.sve.dup.nxv8i16(<vscale x 8 x i16> %zd, <vscale x 8 x i1> %pg, i16 1)
+  ret <vscale x 8 x i16> %r
+}
+
+define dso_local <vscale x 4 x i32> @mov_m_s(<vscale x 4 x i32> %zd, <vscale x 4 x i1> %pg) {
+; CHECK-LABEL: mov_m_s:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    mov z0.s, p0/m, #1 // =0x1
+; CHECK-NEXT:    ret
+  %r = tail call <vscale x 4 x i32> @llvm.aarch64.sve.dup.nxv4i32(<vscale x 4 x i32> %zd, <vscale x 4 x i1> %pg, i32 1)
+  ret <vscale x 4 x i32> %r
+}
+
+define dso_local <vscale x 2 x i64> @mov_m_d(<vscale x 2 x i64> %zd, <vscale x 2 x i1> %pg) {
+; CHECK-LABEL: mov_m_d:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    mov z0.d, p0/m, #1 // =0x1
+; CHECK-NEXT:    ret
+  %r = tail call <vscale x 2 x i64> @llvm.aarch64.sve.dup.nxv2i64(<vscale x 2 x i64> %zd, <vscale x 2 x i1> %pg, i64 1)
+  ret <vscale x 2 x i64> %r
+}
+
+declare <vscale x 16 x i8> @llvm.aarch64.sve.dup.nxv16i8(<vscale x 16 x i8>, <vscale x 16 x i1>, i8)
+declare <vscale x 8 x i16> @llvm.aarch64.sve.dup.nxv8i16(<vscale x 8 x i16>, <vscale x 8 x i1>, i16)
+declare <vscale x 4 x i32> @llvm.aarch64.sve.dup.nxv4i32(<vscale x 4 x i32>, <vscale x 4 x i1>, i32)
+declare <vscale x 2 x i64> @llvm.aarch64.sve.dup.nxv2i64(<vscale x 2 x i64>, <vscale x 2 x i1>, i64)

Copy link
Collaborator

@paulwalker-arm paulwalker-arm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you say where the uses are coming from? I ask because I'm wondering if we should instead be lowering the intrinsics to select instructions to allow generic combines to do their thing. Plus we already have patterns to spot the select based sequences.

llvm/lib/Target/AArch64/AArch64SVEInstrInfo.td Outdated Show resolved Hide resolved
llvm/lib/Target/AArch64/AArch64SVEInstrInfo.td Outdated Show resolved Hide resolved
@rj-jesus
Copy link
Contributor Author

Can you say where the uses are coming from? I ask because I'm wondering if we should instead be lowering the intrinsics to select instructions to allow generic combines to do their thing. Plus we already have patterns to spot the select based sequences.

The uses are coming from ACLE intrinsics as in the description above. Would you rather see it done with a select in LowerSVEIntrinsicDUP? I think that should also work well. After sending this patch through I've realised we are also missing patterns for the MOV (immediate, unpredicated) case, e.g.:

svuint64_t foo() {
  svbool_t pg = svptrue_b64();
  return svdup_u64_z(pg, 1);
}

which generates:

foo:
        ptrue   p0.d
        mov     z0.d, #0
        mov     w8, #1
        mov     z0.d, p0/m, x8
        ret

instead of:

foo:
        mov     z0.d, #1
        ret

So, we could probably address all these cases by going the select way. What do you think?

@paulwalker-arm
Copy link
Collaborator

Thanks for the clarification. I just wasn't sure if the example was a real reproducer or just a means to demonstrate the problem. Given this is ACLE related, let's park the decision to replace DUP_MERGE_PASSTHRU for now and stick with adding the new patterns.

The missed optimisation of the all active case isn't too surprising given there's already an unpredicated builtin that should be used instead. That said, I see no harm in updating instCombineSVEDup to replace such instances with a vector splat.

@rj-jesus
Copy link
Contributor Author

Thanks for the feedback, I've now pushed the patterns into sve_int_dup_imm_pred_zero and sve_int_dup_imm_pred_merge as appropriate - is what you had in mind?

As for the unpredicated case, does it make sense to make that change here as well?
As you say, there are unpredicated variants that can already be used, though it might still be a nice fold to have.

Copy link
Collaborator

@paulwalker-arm paulwalker-arm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A minor quibble with the tests but otherwise this looks good to me.

llvm/test/CodeGen/AArch64/sve-mov-imm-pred.ll Outdated Show resolved Hide resolved
@paulwalker-arm
Copy link
Collaborator

As for the unpredicated case, does it make sense to make that change here as well?

I recommend fixing it as a separate PR.

@rj-jesus
Copy link
Contributor Author

I recommend fixing it as a separate PR.

Sounds good, thanks, and many thanks for the feedback!

@rj-jesus rj-jesus merged commit e543650 into llvm:main Nov 15, 2024
8 checks passed
@rj-jesus rj-jesus deleted the rjj/aarch64-sve-mov-imm-pred branch November 15, 2024 09:11
akshayrdeodhar pushed a commit to akshayrdeodhar/llvm-project that referenced this pull request Nov 18, 2024
Add patterns to fold MOV (scalar, predicated) to MOV (imm, pred,
merging) or MOV (imm, pred, zeroing) as appropriate.

This affects the `@llvm.aarch64.sve.dup` intrinsics, which currently
generate MOV (scalar, predicated) instructions even when the
immediate forms are possible. For example:
```
svuint8_t mov_z_b(svbool_t p) {
  return svdup_u8_z(p, 1);
}
```
Currently generates:
```
mov_z_b(__SVBool_t):
        mov     z0.b, #0
        mov     w8, llvm#1
        mov     z0.b, p0/m, w8
        ret
```
Instead of:
```
mov_z_b(__SVBool_t):
        mov     z0.b, p0/z, llvm#1
        ret
```
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants