[RISCV] Shrink vslideup's LMUL when lowering fixed insert_subvector #65997
Conversation
@llvm/pr-subscribers-backend-risc-v

Changes

Similar to #65598, if we're using a vslideup to insert a fixed length vector into another vector, then we can work out the minimum number of registers it will need to slide up across given the minimum VLEN, and shrink the type operated on to reduce LMUL accordingly.

This is somewhat dependent on #65916, since it introduces a subregister copy that triggers a crash with -early-live-intervals in one of the tests.

Patch is 47.04 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/65997.diff

6 Files Affected:
```diff
diff --git a/llvm/lib/CodeGen/TwoAddressInstructionPass.cpp b/llvm/lib/CodeGen/TwoAddressInstructionPass.cpp
index 45f61262faf9391..e36bffc91b91d95 100644
--- a/llvm/lib/CodeGen/TwoAddressInstructionPass.cpp
+++ b/llvm/lib/CodeGen/TwoAddressInstructionPass.cpp
@@ -1871,11 +1871,30 @@ bool TwoAddressInstructionPass::runOnMachineFunction(MachineFunction &Func) {
       SlotIndex Idx = LIS->getInstructionIndex(*mi);
       for (auto &S : LI.subranges()) {
         if ((S.LaneMask & LaneMask).none()) {
+          // If Idx is 160B, and we have a subrange that isn't in
+          // %reg.subidx like so:
+          //
+          // [152r,160r)[160r,256r)
+          //
+          // Merge the two segments together so the subrange becomes:
+          //
+          // [152r,256r)
           LiveRange::iterator UseSeg = S.FindSegmentContaining(Idx);
-          LiveRange::iterator DefSeg = std::next(UseSeg);
-          S.MergeValueNumberInto(DefSeg->valno, UseSeg->valno);
+          if (UseSeg != S.end()) {
+            LiveRange::iterator DefSeg = std::next(UseSeg);
+            assert(DefSeg != S.end());
+            S.MergeValueNumberInto(DefSeg->valno, UseSeg->valno);
+          }
+          // Otherwise, it should have only one segment that starts at
+          // 160r which we should remove.
+          else {
+            assert(S.containsOneValue());
+            assert(S.begin()->start == Idx.getRegSlot());
+            S.removeSegment(S.begin());
+          }
         }
       }
+      LI.removeEmptySubRanges();
 
       // The COPY no longer has a use of %reg.
       LIS->shrinkToUses(&LI);
diff --git a/llvm/lib/Target/RISCV/RISCVISelLowering.cpp b/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
index 4ff264635cda248..3a81179d1cfc120 100644
--- a/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
+++ b/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
@@ -8606,6 +8606,18 @@ SDValue RISCVTargetLowering::lowerINSERT_SUBVECTOR(SDValue Op,
       ContainerVT = getContainerForFixedLengthVector(VecVT);
       Vec = convertToScalableVector(ContainerVT, Vec, DAG, Subtarget);
     }
+
+    // Shrink down Vec so we're performing the slideup on a smaller LMUL.
+    unsigned LastIdx = OrigIdx + SubVecVT.getVectorNumElements() - 1;
+    MVT OrigContainerVT = ContainerVT;
+    SDValue OrigVec = Vec;
+    if (auto ShrunkVT =
+            getSmallestVTForIndex(ContainerVT, LastIdx, DL, DAG, Subtarget)) {
+      ContainerVT = *ShrunkVT;
+      Vec = DAG.getNode(ISD::EXTRACT_SUBVECTOR, DL, ContainerVT, Vec,
+                        DAG.getVectorIdxConstant(0, DL));
+    }
+
     SubVec = DAG.getNode(ISD::INSERT_SUBVECTOR, DL, ContainerVT,
                          DAG.getUNDEF(ContainerVT), SubVec,
                          DAG.getConstant(0, DL, XLenVT));
@@ -8636,6 +8648,12 @@ SDValue RISCVTargetLowering::lowerINSERT_SUBVECTOR(SDValue Op,
                           SlideupAmt, Mask, VL, Policy);
     }
 
+    // If we performed the slideup on a smaller LMUL, insert the result back
+    // into the rest of the vector.
+    if (ContainerVT != OrigContainerVT)
+      SubVec = DAG.getNode(ISD::INSERT_SUBVECTOR, DL, OrigContainerVT, OrigVec,
+                           SubVec, DAG.getVectorIdxConstant(0, DL));
+
     if (VecVT.isFixedLengthVector())
       SubVec = convertFromScalableVector(VecVT, SubVec, DAG, Subtarget);
     return DAG.getBitcast(Op.getValueType(), SubVec);
diff --git a/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-insert-subvector.ll b/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-insert-subvector.ll
index 1d6a45ed36f335c..6a9212ed309a8ef 100644
--- a/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-insert-subvector.ll
+++ b/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-insert-subvector.ll
@@ -14,7 +14,7 @@ define <vscale x 8 x i32> @insert_nxv8i32_v2i32_0(<vscale x 8 x i32> %vec, ptr %
 ; CHECK:       # %bb.0:
 ; CHECK-NEXT:    vsetivli zero, 2, e32, mf2, ta, ma
 ; CHECK-NEXT:    vle32.v v12, (a0)
-; CHECK-NEXT:    vsetivli zero, 2, e32, m4, tu, ma
+; CHECK-NEXT:    vsetivli zero, 2, e32, m1, tu, ma
 ; CHECK-NEXT:    vmv.v.v v8, v12
 ; CHECK-NEXT:    ret
   %sv = load <2 x i32>, ptr %svp
@@ -27,7 +27,7 @@ define <vscale x 8 x i32> @insert_nxv8i32_v2i32_2(<vscale x 8 x i32> %vec, ptr %
 ; CHECK:       # %bb.0:
 ; CHECK-NEXT:    vsetivli zero, 2, e32, mf2, ta, ma
 ; CHECK-NEXT:    vle32.v v12, (a0)
-; CHECK-NEXT:    vsetivli zero, 4, e32, m4, tu, ma
+; CHECK-NEXT:    vsetivli zero, 4, e32, m1, tu, ma
 ; CHECK-NEXT:    vslideup.vi v8, v12, 2
 ; CHECK-NEXT:    ret
   %sv = load <2 x i32>, ptr %svp
@@ -40,7 +40,7 @@ define <vscale x 8 x i32> @insert_nxv8i32_v2i32_6(<vscale x 8 x i32> %vec, ptr %
 ; CHECK:       # %bb.0:
 ; CHECK-NEXT:    vsetivli zero, 2, e32, mf2, ta, ma
 ; CHECK-NEXT:    vle32.v v12, (a0)
-; CHECK-NEXT:    vsetivli zero, 8, e32, m4, tu, ma
+; CHECK-NEXT:    vsetivli zero, 8, e32, m2, tu, ma
 ; CHECK-NEXT:    vslideup.vi v8, v12, 6
 ; CHECK-NEXT:    ret
   %sv = load <2 x i32>, ptr %svp
@@ -51,22 +51,19 @@ define <vscale x 8 x i32> @insert_nxv8i32_v2i32_6(<vscale x 8 x i32> %vec, ptr %
 define <vscale x 8 x i32> @insert_nxv8i32_v8i32_0(<vscale x 8 x i32> %vec, ptr %svp) {
 ; LMULMAX2-LABEL: insert_nxv8i32_v8i32_0:
 ; LMULMAX2:       # %bb.0:
-; LMULMAX2-NEXT:    vsetivli zero, 8, e32, m2, ta, ma
-; LMULMAX2-NEXT:    vle32.v v12, (a0)
-; LMULMAX2-NEXT:    vsetivli zero, 8, e32, m4, tu, ma
-; LMULMAX2-NEXT:    vmv.v.v v8, v12
+; LMULMAX2-NEXT:    vsetivli zero, 8, e32, m2, tu, ma
+; LMULMAX2-NEXT:    vle32.v v8, (a0)
 ; LMULMAX2-NEXT:    ret
 ;
 ; LMULMAX1-LABEL: insert_nxv8i32_v8i32_0:
 ; LMULMAX1:       # %bb.0:
+; LMULMAX1-NEXT:    addi a1, a0, 16
 ; LMULMAX1-NEXT:    vsetivli zero, 4, e32, m1, ta, ma
-; LMULMAX1-NEXT:    vle32.v v12, (a0)
-; LMULMAX1-NEXT:    addi a0, a0, 16
-; LMULMAX1-NEXT:    vle32.v v16, (a0)
-; LMULMAX1-NEXT:    vsetivli zero, 4, e32, m4, tu, ma
-; LMULMAX1-NEXT:    vmv.v.v v8, v12
-; LMULMAX1-NEXT:    vsetivli zero, 8, e32, m4, tu, ma
-; LMULMAX1-NEXT:    vslideup.vi v8, v16, 4
+; LMULMAX1-NEXT:    vle32.v v12, (a1)
+; LMULMAX1-NEXT:    vsetvli zero, zero, e32, m1, tu, ma
+; LMULMAX1-NEXT:    vle32.v v8, (a0)
+; LMULMAX1-NEXT:    vsetivli zero, 8, e32, m2, tu, ma
+; LMULMAX1-NEXT:    vslideup.vi v8, v12, 4
 ; LMULMAX1-NEXT:    ret
   %sv = load <8 x i32>, ptr %svp
   %v = call <vscale x 8 x i32> @llvm.vector.insert.v8i32.nxv8i32(<vscale x 8 x i32> %vec, <8 x i32> %sv, i64 0)
@@ -84,14 +81,14 @@ define <vscale x 8 x i32> @insert_nxv8i32_v8i32_8(<vscale x 8 x i32> %vec, ptr %
 ;
 ; LMULMAX1-LABEL: insert_nxv8i32_v8i32_8:
 ; LMULMAX1:       # %bb.0:
-; LMULMAX1-NEXT:    addi a1, a0, 16
 ; LMULMAX1-NEXT:    vsetivli zero, 4, e32, m1, ta, ma
-; LMULMAX1-NEXT:    vle32.v v12, (a1)
+; LMULMAX1-NEXT:    vle32.v v12, (a0)
+; LMULMAX1-NEXT:    addi a0, a0, 16
 ; LMULMAX1-NEXT:    vle32.v v16, (a0)
 ; LMULMAX1-NEXT:    vsetivli zero, 12, e32, m4, tu, ma
-; LMULMAX1-NEXT:    vslideup.vi v8, v16, 8
+; LMULMAX1-NEXT:    vslideup.vi v8, v12, 8
 ; LMULMAX1-NEXT:    vsetivli zero, 16, e32, m4, tu, ma
-; LMULMAX1-NEXT:    vslideup.vi v8, v12, 12
+; LMULMAX1-NEXT:    vslideup.vi v8, v16, 12
 ; LMULMAX1-NEXT:    ret
   %sv = load <8 x i32>, ptr %svp
   %v = call <vscale x 8 x i32> @llvm.vector.insert.v8i32.nxv8i32(<vscale x 8 x i32> %vec, <8 x i32> %sv, i64 8)
@@ -166,7 +163,7 @@ define void @insert_v8i32_v2i32_0(ptr %vp, ptr %svp) {
 ; LMULMAX2-NEXT:    vle32.v v8, (a1)
 ; LMULMAX2-NEXT:    vsetivli zero, 8, e32, m2, ta, ma
 ; LMULMAX2-NEXT:    vle32.v v10, (a0)
-; LMULMAX2-NEXT:    vsetivli zero, 2, e32, m2, tu, ma
+; LMULMAX2-NEXT:    vsetivli zero, 2, e32, m1, tu, ma
 ; LMULMAX2-NEXT:    vmv.v.v v10, v8
 ; LMULMAX2-NEXT:    vsetivli zero, 8, e32, m2, ta, ma
 ; LMULMAX2-NEXT:    vse32.v v10, (a0)
@@ -197,7 +194,7 @@ define void @insert_v8i32_v2i32_2(ptr %vp, ptr %svp) {
 ; LMULMAX2-NEXT:    vle32.v v8, (a1)
 ; LMULMAX2-NEXT:    vsetivli zero, 8, e32, m2, ta, ma
 ; LMULMAX2-NEXT:    vle32.v v10, (a0)
-; LMULMAX2-NEXT:    vsetivli zero, 4, e32, m2, tu, ma
+; LMULMAX2-NEXT:    vsetivli zero, 4, e32, m1, tu, ma
 ; LMULMAX2-NEXT:    vslideup.vi v10, v8, 2
 ; LMULMAX2-NEXT:    vsetivli zero, 8, e32, m2, ta, ma
 ; LMULMAX2-NEXT:    vse32.v v10, (a0)
@@ -508,9 +505,9 @@ define void @insert_v2i64_nxv16i64(ptr %psv0, ptr %psv1, <vscale x 16 x i64>* %o
 ; CHECK:       # %bb.0:
 ; CHECK-NEXT:    vsetivli zero, 2, e64, m1, ta, ma
 ; CHECK-NEXT:    vle64.v v8, (a0)
-; CHECK-NEXT:    vle64.v v16, (a1)
-; CHECK-NEXT:    vsetivli zero, 6, e64, m8, tu, ma
-; CHECK-NEXT:    vslideup.vi v8, v16, 4
+; CHECK-NEXT:    vle64.v v12, (a1)
+; CHECK-NEXT:    vsetivli zero, 6, e64, m4, tu, ma
+; CHECK-NEXT:    vslideup.vi v8, v12, 4
 ; CHECK-NEXT:    vs8r.v v8, (a2)
 ; CHECK-NEXT:    ret
   %sv0 = load <2 x i64>, ptr %psv0
@@ -539,7 +536,7 @@ define void @insert_v2i64_nxv16i64_lo2(ptr %psv, <vscale x 16 x i64>* %out) {
 ; CHECK:       # %bb.0:
 ; CHECK-NEXT:    vsetivli zero, 2, e64, m1, ta, ma
 ; CHECK-NEXT:    vle64.v v8, (a0)
-; CHECK-NEXT:    vsetivli zero, 4, e64, m8, ta, ma
+; CHECK-NEXT:    vsetivli zero, 4, e64, m2, ta, ma
 ; CHECK-NEXT:    vslideup.vi v16, v8, 2
 ; CHECK-NEXT:    vs8r.v v16, (a1)
 ; CHECK-NEXT:    ret
diff --git a/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-strided-load-combine.ll b/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-strided-load-combine.ll
index f52ba6f51d5c897..805557905117add 100644
--- a/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-strided-load-combine.ll
+++ b/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-strided-load-combine.ll
@@ -27,13 +27,13 @@ define void @widen_3xv4i16(ptr %x, ptr %z) {
 ; CHECK-NEXT:    vsetivli zero, 4, e16, mf2, ta, ma
 ; CHECK-NEXT:    vle16.v v8, (a0)
 ; CHECK-NEXT:    addi a2, a0, 8
-; CHECK-NEXT:    vle16.v v10, (a2)
+; CHECK-NEXT:    vle16.v v9, (a2)
 ; CHECK-NEXT:    addi a0, a0, 16
-; CHECK-NEXT:    vle16.v v12, (a0)
-; CHECK-NEXT:    vsetivli zero, 8, e16, m2, tu, ma
-; CHECK-NEXT:    vslideup.vi v8, v10, 4
+; CHECK-NEXT:    vle16.v v10, (a0)
+; CHECK-NEXT:    vsetivli zero, 8, e16, m1, tu, ma
+; CHECK-NEXT:    vslideup.vi v8, v9, 4
 ; CHECK-NEXT:    vsetivli zero, 12, e16, m2, tu, ma
-; CHECK-NEXT:    vslideup.vi v8, v12, 8
+; CHECK-NEXT:    vslideup.vi v8, v10, 8
 ; CHECK-NEXT:    vse16.v v8, (a1)
 ; CHECK-NEXT:    ret
   %a = load <4 x i16>, ptr %x
@@ -75,17 +75,17 @@ define void @widen_4xv4i16_unaligned(ptr %x, ptr %z) {
 ; CHECK-NO-MISALIGN-NEXT:    vsetivli zero, 8, e8, mf2, ta, ma
 ; CHECK-NO-MISALIGN-NEXT:    vle8.v v8, (a0)
 ; CHECK-NO-MISALIGN-NEXT:    addi a2, a0, 8
-; CHECK-NO-MISALIGN-NEXT:    vle8.v v10, (a2)
+; CHECK-NO-MISALIGN-NEXT:    vle8.v v9, (a2)
 ; CHECK-NO-MISALIGN-NEXT:    addi a2, a0, 16
-; CHECK-NO-MISALIGN-NEXT:    vle8.v v12, (a2)
+; CHECK-NO-MISALIGN-NEXT:    vle8.v v10, (a2)
 ; CHECK-NO-MISALIGN-NEXT:    addi a0, a0, 24
-; CHECK-NO-MISALIGN-NEXT:    vle8.v v14, (a0)
-; CHECK-NO-MISALIGN-NEXT:    vsetivli zero, 8, e16, m2, tu, ma
-; CHECK-NO-MISALIGN-NEXT:    vslideup.vi v8, v10, 4
+; CHECK-NO-MISALIGN-NEXT:    vle8.v v12, (a0)
+; CHECK-NO-MISALIGN-NEXT:    vsetvli zero, zero, e16, m1, tu, ma
+; CHECK-NO-MISALIGN-NEXT:    vslideup.vi v8, v9, 4
 ; CHECK-NO-MISALIGN-NEXT:    vsetivli zero, 12, e16, m2, tu, ma
-; CHECK-NO-MISALIGN-NEXT:    vslideup.vi v8, v12, 8
+; CHECK-NO-MISALIGN-NEXT:    vslideup.vi v8, v10, 8
 ; CHECK-NO-MISALIGN-NEXT:    vsetivli zero, 16, e16, m2, ta, ma
-; CHECK-NO-MISALIGN-NEXT:    vslideup.vi v8, v14, 12
+; CHECK-NO-MISALIGN-NEXT:    vslideup.vi v8, v12, 12
 ; CHECK-NO-MISALIGN-NEXT:    vse16.v v8, (a1)
 ; CHECK-NO-MISALIGN-NEXT:    ret
 ;
@@ -188,17 +188,17 @@ define void @strided_constant_mismatch_4xv4i16(ptr %x, ptr %z) {
 ; CHECK-NEXT:    vsetivli zero, 4, e16, mf2, ta, ma
 ; CHECK-NEXT:    vle16.v v8, (a0)
 ; CHECK-NEXT:    addi a2, a0, 2
-; CHECK-NEXT:    vle16.v v10, (a2)
+; CHECK-NEXT:    vle16.v v9, (a2)
 ; CHECK-NEXT:    addi a2, a0, 6
-; CHECK-NEXT:    vle16.v v12, (a2)
+; CHECK-NEXT:    vle16.v v10, (a2)
 ; CHECK-NEXT:    addi a0, a0, 8
-; CHECK-NEXT:    vle16.v v14, (a0)
-; CHECK-NEXT:    vsetivli zero, 8, e16, m2, tu, ma
-; CHECK-NEXT:    vslideup.vi v8, v10, 4
+; CHECK-NEXT:    vle16.v v12, (a0)
+; CHECK-NEXT:    vsetivli zero, 8, e16, m1, tu, ma
+; CHECK-NEXT:    vslideup.vi v8, v9, 4
 ; CHECK-NEXT:    vsetivli zero, 12, e16, m2, tu, ma
-; CHECK-NEXT:    vslideup.vi v8, v12, 8
+; CHECK-NEXT:    vslideup.vi v8, v10, 8
 ; CHECK-NEXT:    vsetivli zero, 16, e16, m2, ta, ma
-; CHECK-NEXT:    vslideup.vi v8, v14, 12
+; CHECK-NEXT:    vslideup.vi v8, v12, 12
 ; CHECK-NEXT:    vse16.v v8, (a1)
 ; CHECK-NEXT:    ret
   %a = load <4 x i16>, ptr %x
@@ -258,17 +258,17 @@ define void @strided_runtime_mismatch_4xv4i16(ptr %x, ptr %z, i64 %s, i64 %t) {
 ; RV32-NEXT:    vsetivli zero, 4, e16, mf2, ta, ma
 ; RV32-NEXT:    vle16.v v8, (a0)
 ; RV32-NEXT:    add a0, a0, a2
-; RV32-NEXT:    vle16.v v10, (a0)
+; RV32-NEXT:    vle16.v v9, (a0)
 ; RV32-NEXT:    add a0, a0, a4
-; RV32-NEXT:    vle16.v v12, (a0)
+; RV32-NEXT:    vle16.v v10, (a0)
 ; RV32-NEXT:    add a0, a0, a2
-; RV32-NEXT:    vle16.v v14, (a0)
-; RV32-NEXT:    vsetivli zero, 8, e16, m2, tu, ma
-; RV32-NEXT:    vslideup.vi v8, v10, 4
+; RV32-NEXT:    vle16.v v12, (a0)
+; RV32-NEXT:    vsetivli zero, 8, e16, m1, tu, ma
+; RV32-NEXT:    vslideup.vi v8, v9, 4
 ; RV32-NEXT:    vsetivli zero, 12, e16, m2, tu, ma
-; RV32-NEXT:    vslideup.vi v8, v12, 8
+; RV32-NEXT:    vslideup.vi v8, v10, 8
 ; RV32-NEXT:    vsetivli zero, 16, e16, m2, ta, ma
-; RV32-NEXT:    vslideup.vi v8, v14, 12
+; RV32-NEXT:    vslideup.vi v8, v12, 12
 ; RV32-NEXT:    vse16.v v8, (a1)
 ; RV32-NEXT:    ret
 ;
@@ -277,17 +277,17 @@ define void @strided_runtime_mismatch_4xv4i16(ptr %x, ptr %z, i64 %s, i64 %t) {
 ; RV64-NEXT:    vsetivli zero, 4, e16, mf2, ta, ma
 ; RV64-NEXT:    vle16.v v8, (a0)
 ; RV64-NEXT:    add a0, a0, a2
-; RV64-NEXT:    vle16.v v10, (a0)
+; RV64-NEXT:    vle16.v v9, (a0)
 ; RV64-NEXT:    add a0, a0, a3
-; RV64-NEXT:    vle16.v v12, (a0)
+; RV64-NEXT:    vle16.v v10, (a0)
 ; RV64-NEXT:    add a0, a0, a2
-; RV64-NEXT:    vle16.v v14, (a0)
-; RV64-NEXT:    vsetivli zero, 8, e16, m2, tu, ma
-; RV64-NEXT:    vslideup.vi v8, v10, 4
+; RV64-NEXT:    vle16.v v12, (a0)
+; RV64-NEXT:    vsetivli zero, 8, e16, m1, tu, ma
+; RV64-NEXT:    vslideup.vi v8, v9, 4
 ; RV64-NEXT:    vsetivli zero, 12, e16, m2, tu, ma
-; RV64-NEXT:    vslideup.vi v8, v12, 8
+; RV64-NEXT:    vslideup.vi v8, v10, 8
 ; RV64-NEXT:    vsetivli zero, 16, e16, m2, ta, ma
-; RV64-NEXT:    vslideup.vi v8, v14, 12
+; RV64-NEXT:    vslideup.vi v8, v12, 12
 ; RV64-NEXT:    vse16.v v8, (a1)
 ; RV64-NEXT:    ret
 ;
@@ -296,17 +296,17 @@ define void @strided_runtime_mismatch_4xv4i16(ptr %x, ptr %z, i64 %s, i64 %t) {
 ; ZVE64F-NEXT:    vsetivli zero, 4, e16, mf2, ta, ma
 ; ZVE64F-NEXT:    vle16.v v8, (a0)
 ; ZVE64F-NEXT:    add a0, a0, a2
-; ZVE64F-NEXT:    vle16.v v10, (a0)
+; ZVE64F-NEXT:    vle16.v v9, (a0)
 ; ZVE64F-NEXT:    add a0, a0, a3
-; ZVE64F-NEXT:    vle16.v v12, (a0)
+; ZVE64F-NEXT:    vle16.v v10, (a0)
 ; ZVE64F-NEXT:    add a0, a0, a2
-; ZVE64F-NEXT:    vle16.v v14, (a0)
-; ZVE64F-NEXT:    vsetivli zero, 8, e16, m2, tu, ma
-; ZVE64F-NEXT:    vslideup.vi v8, v10, 4
+; ZVE64F-NEXT:    vle16.v v12, (a0)
+; ZVE64F-NEXT:    vsetivli zero, 8, e16, m1, tu, ma
+; ZVE64F-NEXT:    vslideup.vi v8, v9, 4
 ; ZVE64F-NEXT:    vsetivli zero, 12, e16, m2, tu, ma
-; ZVE64F-NEXT:    vslideup.vi v8, v12, 8
+; ZVE64F-NEXT:    vslideup.vi v8, v10, 8
 ; ZVE64F-NEXT:    vsetivli zero, 16, e16, m2, ta, ma
-; ZVE64F-NEXT:    vslideup.vi v8, v14, 12
+; ZVE64F-NEXT:    vslideup.vi v8, v12, 12
 ; ZVE64F-NEXT:    vse16.v v8, (a1)
 ; ZVE64F-NEXT:    ret
   %a = load <4 x i16>, ptr %x
diff --git a/llvm/test/CodeGen/RISCV/rvv/fpclamptosat_vec.ll b/llvm/test/CodeGen/RISCV/rvv/fpclamptosat_vec.ll
index 31e7e7be76c89b1..f598118c18aff9f 100644
--- a/llvm/test/CodeGen/RISCV/rvv/fpclamptosat_vec.ll
+++ b/llvm/test/CodeGen/RISCV/rvv/fpclamptosat_vec.ll
@@ -460,54 +460,49 @@ define <4 x i32> @stest_f16i32(<4 x half> %x) {
 ; CHECK-V-NEXT:    sub sp, sp, a1
 ; CHECK-V-NEXT:    .cfi_escape 0x0f, 0x0d, 0x72, 0x00, 0x11, 0x30, 0x22, 0x11, 0x04, 0x92, 0xa2, 0x38, 0x00, 0x1e, 0x22 # sp + 48 + 4 * vlenb
 ; CHECK-V-NEXT:    lhu s0, 24(a0)
-; CHECK-V-NEXT:    lhu s1, 16(a0)
-; CHECK-V-NEXT:    lhu s2, 0(a0)
-; CHECK-V-NEXT:    lhu a0, 8(a0)
+; CHECK-V-NEXT:    lhu s1, 0(a0)
+; CHECK-V-NEXT:    lhu s2, 8(a0)
+; CHECK-V-NEXT:    lhu a0, 16(a0)
 ; CHECK-V-NEXT:    fmv.w.x fa0, a0
 ; CHECK-V-NEXT:    call __extendhfsf2@plt
 ; CHECK-V-NEXT:    fcvt.l.s a0, fa0, rtz
 ; CHECK-V-NEXT:    vsetivli zero, 1, e64, m1, ta, ma
 ; CHECK-V-NEXT:    vmv.s.x v8, a0
-; CHECK-V-NEXT:    addi a0, sp, 16
-; CHECK-V-NEXT:    vs2r.v v8, (a0) # Unknown-size Folded Spill
-; CHECK-V-NEXT:    fmv.w.x fa0, s2
-; CHECK-V-NEXT:    call __extendhfsf2@plt
-; CHECK-V-NEXT:    fcvt.l.s a0, fa0, rtz
-; CHECK-V-NEXT:    vsetivli zero, 2, e64, m2, tu, ma
-; CHECK-V-NEXT:    vmv.s.x v8, a0
-; CHECK-V-NEXT:    addi a0, sp, 16
-; CHECK-V-NEXT:    vl2r.v v10, (a0) # Unknown-size Folded Reload
-; CHECK-V-NEXT:    vslideup.vi v8, v10, 1
 ; CHECK-V-NEXT:    csrr a0, vlenb
 ; CHECK-V-NEXT:    slli a0, a0, 1
 ; CHECK-V-NEXT:    add a0, sp, a0
 ; CHECK-V-NEXT:    addi a0, a0, 16
 ; CHECK-V-NEXT:    vs2r.v v8, (a0) # Unknown-size Folded Spill
+; CHECK-V-NEXT:    fmv.w.x fa0, s2
+; CHECK-V-NEXT:    call __extendhfsf2@plt
+; CHECK-V-NEXT:    fcvt.l.s a0, fa0, rtz
+; CHECK-V-NEXT:    vsetivli zero, 1, e64, m1, ta, ma
 ; CHECK-V-NEXT:    fmv.w.x fa0, s1
+; CHECK-V-NEXT:    vmv.s.x v8, a0
+; CHECK-V-NEXT:    addi a0, sp, 16
+; CHECK-V-NEXT:    vs1r.v v8, (a0) # Unknown-size Folded Spill
 ; CHECK-V-NEXT:    call __extendhfsf2@plt
 ; CHECK-V-NEXT:    fcvt.l.s a0, fa0, rtz
+; CHECK-V-NEXT:    vsetivli zero, 2, e64, m1, tu, ma
+; CHECK-V-NEXT:    vmv.s.x v10, a0
+; CHECK-V-NEXT:    addi a0, sp, 16
+; CHECK-V-NEXT:    vl1r.v v8, (a0) # Unknown-size Folded Reload
+; CHECK-V-NEXT:    vslideup.vi v10, v8, 1
 ; CHECK-V-NEXT:    vsetivli zero, 3, e64, m2, tu, ma
-; CHECK-V-NEXT:    vmv.s.x v8, a0
 ; CHECK-V-NEXT:    csrr a0, vlenb
 ; CHECK-V-NEXT:    slli a0, a0, 1
 ; CHECK-V-NEXT:    add a0, sp, a0
 ; CHECK-V-NEXT:    addi a0, a0, 16
-; CHECK-V-NEXT:    vl2r.v v10, (a0) # Unknown-size Folded Reload
+; CHECK-V-NEXT:    vl2r.v v8, (a0) # Unknown-size Folded Reload
 ; CHECK-V-NEXT:    vslideup.vi v10, v8, 2
-; CHECK-V-NEXT:    csrr a0, vlenb
-; CHECK-V-NEXT:    slli a0, a0, 1
-; CHECK-V-NEXT:    add a0, sp, a0
-; CHECK-V-NEXT:    addi a0, a0, 16
+; CHECK-V-NEXT:    addi a0, sp, 16
 ; CHECK-V-NEXT:    vs2r.v v10, (a0) # Unknown-size Folded Spill
 ; CHECK-V-NEXT:    fmv.w.x fa0, s0
 ; CHECK-V-NEXT:    call __extendhfsf2@plt
 ; CHECK-V-NEXT:    fcvt.l.s a0, fa0, rtz
 ; CHECK-V-NEXT:    vsetivli zero, 4, e64, m2, ta, ma
 ; CHECK-V-NEXT:    vmv.s.x v8, a0
-; CHECK-V-NEXT:    csrr a0, vlenb
-; CHECK-V-NEXT:    slli a0, a0, 1
-; CHECK-V-NEXT:    add a0, sp, a0
-; CHECK-V-NEXT:    addi a0, a0, 16
+; CHECK-V-NEXT:    addi a0, sp, 16
 ; CHECK-V-NEXT:    vl2r.v v10, (a0) # Unknown-size Folded Reload
 ; CHECK-V-NEXT:    vslideup.vi v10, v8, 3
 ; CHECK-V-NEXT:    lui a0, 524288
@@ -632,54 +627,49 @@ define <4 x i32> @utesth_f16i32(<4 x half> %x) {
 ; CHECK-V-NEXT:    sub sp, sp, a1
 ; CHECK-V-NEXT:    .cfi_escape 0x0f, 0x0d, 0x72, 0x00, 0x11, 0x30, 0x22, 0x11, 0x04, 0x92, 0xa2, 0x38, 0x00, 0x1e, 0x22 # sp + 48 + 4 * vlenb
 ; CHECK-V-NEXT:    lhu s0, 24(a0)
-; CHECK-V-NEXT:    lhu s1, 16(a0)
-; CHECK-V-NEXT:    lhu s2, 0(a0)
-; CHECK-V-NEXT:    lhu a0, 8(a0)
+; CHECK-V-NEXT:    lhu s1, 0(a0)
+; CHECK-V-NEXT:    lhu s2, 8(a0)
+; CHECK-V-NEXT:    lhu a0, 16(a0)
 ; CHECK-V-NEXT:    fmv.w.x fa0, a0
 ; CHECK-V-NEXT:    call __extendhfsf2@plt
 ; CHECK-V-NEXT:    fcvt.lu.s a0, fa0, rtz
 ; CHECK-V-NEXT:    vsetivli zero, 1, e64, m1, ta, ma
 ; CHECK-V-NEXT:    vmv.s....
```
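To make the size calculation in the lowering change concrete, here is a small standalone sketch of the arithmetic involved. It is only an illustration: the helper name, the integer-only LMUL search, and the example values are assumptions for this comment, not the actual getSmallestVTForIndex implementation in RISCVISelLowering.cpp. The two asserts mirror the m1 and m2 results visible in the insert_nxv8i32_v2i32_2 and insert_nxv8i32_v2i32_6 test updates above.

```cpp
#include <cassert>

// Illustrative sketch only (not the LLVM implementation): given the minimum
// VLEN in bits, the element width in bits, and the highest lane index the
// vslideup will write, return the smallest integer LMUL whose register group
// is guaranteed to contain that lane at the minimum VLEN.
static unsigned smallestLMULForLastIndex(unsigned MinVLenBits,
                                         unsigned EltBits,
                                         unsigned LastIdx) {
  for (unsigned LMUL = 1; LMUL <= 8; LMUL *= 2) {
    unsigned GuaranteedElts = (MinVLenBits * LMUL) / EltBits;
    if (LastIdx < GuaranteedElts)
      return LMUL;
  }
  return 8; // Fall back to the full register group.
}

int main() {
  // Inserting <2 x i32> at index 2 into an m4 vector with VLEN >= 128:
  // the last lane written is 3, so the slideup can run at m1.
  assert(smallestLMULForLastIndex(128, 32, 3) == 1);
  // Inserting <2 x i32> at index 6 writes lane 7, which needs m2.
  assert(smallestLMULForLastIndex(128, 32, 7) == 2);
  return 0;
}
```

Once the smaller type is chosen, the lowering extracts the low part of the vector at that type, performs the slideup there, and inserts the result back into the original register group, which is what the EXTRACT_SUBVECTOR/INSERT_SUBVECTOR pair in the RISCVISelLowering.cpp hunk does.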
Force-pushed from 282f29d to dd0dd7a
Force-pushed from 8b637f0 to dd0dd7a
Force-pushed from dd0dd7a to 5621ded
Both fix the crash, but I closed #65916 in favour of #66211 since my understanding of LiveIntervals is limited.
Similar to llvm#66267, we can perform a vslideup_vl on a smaller type if we know the highest lane that will be written to, which can be determined from VL. This is an alternative to llvm#65997 and llvm#66087.
LGTM
Warning: Given #66211 has landed, I am only reviewing the second change in this stack. Make sure you rebase!
Similar to llvm#65598, if we're using a vslideup to insert a fixed length vector into another vector, then we can work out the minimum number of registers it will need to slide up across given the minimum VLEN, and shrink the type operated on to reduce LMUL accordingly. This is somewhat dependent on llvm#65916, since it introduces a subregister copy that triggers a crash with -early-live-intervals in one of the tests.
Force-pushed from 5621ded to 0645568
I think we're starting to see failures from this change. This fixes the failures I've seen, but I think there may be some ordering issues with when we decide to shrink. If we enter this particular
Here's one failure that I think is related to this: iree-org/iree#15038. I also have an internal build error, but I don't have a reduced reproducer yet. Here's a fragment of the debug output from my internal reproducer.
MinVLen is 128. The issue is with the
Continuing on from llvm#65997, if the index of insert_vector_elt is a constant then we can work out the minimum number of registers that will be needed for the slideup and choose a smaller type to operate on. This reduces the LMUL for not just the slideup but also for the scalar insert.
…lt (llvm#66087) Continuing on from llvm#65997, if the index of insert_vector_elt is a constant then we can work out the minimum number of registers that will be needed for the slideup and choose a smaller type to operate on. This reduces the LMUL for not just the slideup but also for the scalar insert.
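As a rough illustration of the insert_vector_elt case described in these commits, the same last-lane arithmetic bounds the LMUL when the index is a constant. This is a hedged sketch that reuses the hypothetical helper from the earlier example, not the real LLVM code.

```cpp
#include <cassert>

// Same illustrative helper as in the earlier sketch (not LLVM code).
static unsigned smallestLMULForLastIndex(unsigned MinVLenBits,
                                         unsigned EltBits,
                                         unsigned LastIdx) {
  for (unsigned LMUL = 1; LMUL <= 8; LMUL *= 2)
    if (LastIdx < (MinVLenBits * LMUL) / EltBits)
      return LMUL;
  return 8;
}

int main() {
  // Inserting an e64 element at constant index 2 with VLEN >= 128 only
  // touches lanes 0..2, so m2 is enough even if the source vector lives
  // in an m8 register group.
  assert(smallestLMULForLastIndex(128, 64, 2) == 2);
  return 0;
}
```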
…sert_subvector (llvm#65997)"" This reverts commit 3a6cc52.
Similar to #65598, if we're using a vslideup to insert a fixed-length vector into another vector, then we can work out the minimum number of registers it will need to slide up across given the minimum VLEN, and shrink the type operated on to reduce LMUL accordingly.
This is somewhat dependent on #66211, since it introduces a subregister copy that triggers a crash with -early-live-intervals in one of the tests.
Stacked upon #66211.