[RISCV] Shrink vslideup's LMUL when lowering fixed insert_subvector #65997
Conversation
@llvm/pr-subscribers-backend-risc-v

Changes

Similar to #65598, if we're using a vslideup to insert a fixed length vector into another vector, then we can work out the minimum number of registers it will need to slide up across given the minimum VLEN, and shrink the type operated on to reduce LMUL accordingly.

This is somewhat dependent on #65916, since it introduces a subregister copy that triggers a crash with -early-live-intervals in one of the tests.

Patch is 47.04 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/65997.diff

6 Files Affected:
```diff
diff --git a/llvm/lib/CodeGen/TwoAddressInstructionPass.cpp b/llvm/lib/CodeGen/TwoAddressInstructionPass.cpp
index 45f61262faf9391..e36bffc91b91d95 100644
--- a/llvm/lib/CodeGen/TwoAddressInstructionPass.cpp
+++ b/llvm/lib/CodeGen/TwoAddressInstructionPass.cpp
@@ -1871,11 +1871,30 @@ bool TwoAddressInstructionPass::runOnMachineFunction(MachineFunction &Func) {
       SlotIndex Idx = LIS->getInstructionIndex(*mi);
       for (auto &S : LI.subranges()) {
         if ((S.LaneMask & LaneMask).none()) {
+          // If Idx is 160B, and we have a subrange that isn't in
+          // %reg.subidx like so:
+          //
+          // [152r,160r)[160r,256r)
+          //
+          // Merge the two segments together so the subrange becomes:
+          //
+          // [152r,256r)
           LiveRange::iterator UseSeg = S.FindSegmentContaining(Idx);
-          LiveRange::iterator DefSeg = std::next(UseSeg);
-          S.MergeValueNumberInto(DefSeg->valno, UseSeg->valno);
+          if (UseSeg != S.end()) {
+            LiveRange::iterator DefSeg = std::next(UseSeg);
+            assert(DefSeg != S.end());
+            S.MergeValueNumberInto(DefSeg->valno, UseSeg->valno);
+          }
+          // Otherwise, it should have only one segment that starts at
+          // 160r which we should remove.
+          else {
+            assert(S.containsOneValue());
+            assert(S.begin()->start == Idx.getRegSlot());
+            S.removeSegment(S.begin());
+          }
         }
       }
+      LI.removeEmptySubRanges();
 
       // The COPY no longer has a use of %reg.
       LIS->shrinkToUses(&LI);
diff --git a/llvm/lib/Target/RISCV/RISCVISelLowering.cpp b/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
index 4ff264635cda248..3a81179d1cfc120 100644
--- a/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
+++ b/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
@@ -8606,6 +8606,18 @@ SDValue RISCVTargetLowering::lowerINSERT_SUBVECTOR(SDValue Op,
       ContainerVT = getContainerForFixedLengthVector(VecVT);
       Vec = convertToScalableVector(ContainerVT, Vec, DAG, Subtarget);
     }
+
+    // Shrink down Vec so we're performing the slideup on a smaller LMUL.
+    unsigned LastIdx = OrigIdx + SubVecVT.getVectorNumElements() - 1;
+    MVT OrigContainerVT = ContainerVT;
+    SDValue OrigVec = Vec;
+    if (auto ShrunkVT =
+            getSmallestVTForIndex(ContainerVT, LastIdx, DL, DAG, Subtarget)) {
+      ContainerVT = *ShrunkVT;
+      Vec = DAG.getNode(ISD::EXTRACT_SUBVECTOR, DL, ContainerVT, Vec,
+                        DAG.getVectorIdxConstant(0, DL));
+    }
+
     SubVec = DAG.getNode(ISD::INSERT_SUBVECTOR, DL, ContainerVT,
                          DAG.getUNDEF(ContainerVT), SubVec,
                          DAG.getConstant(0, DL, XLenVT));
@@ -8636,6 +8648,12 @@ SDValue RISCVTargetLowering::lowerINSERT_SUBVECTOR(SDValue Op,
                           SlideupAmt, Mask, VL, Policy);
     }
 
+    // If we performed the slideup on a smaller LMUL, insert the result back
+    // into the rest of the vector.
+    if (ContainerVT != OrigContainerVT)
+      SubVec = DAG.getNode(ISD::INSERT_SUBVECTOR, DL, OrigContainerVT, OrigVec,
+                           SubVec, DAG.getVectorIdxConstant(0, DL));
+
     if (VecVT.isFixedLengthVector())
       SubVec = convertFromScalableVector(VecVT, SubVec, DAG, Subtarget);
     return DAG.getBitcast(Op.getValueType(), SubVec);
diff --git a/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-insert-subvector.ll b/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-insert-subvector.ll
index 1d6a45ed36f335c..6a9212ed309a8ef 100644
--- a/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-insert-subvector.ll
+++ b/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-insert-subvector.ll
@@ -14,7 +14,7 @@ define <vscale x 8 x i32> @insert_nxv8i32_v2i32_0(<vscale x 8 x i32> %vec, ptr %
 ; CHECK:       # %bb.0:
 ; CHECK-NEXT:    vsetivli zero, 2, e32, mf2, ta, ma
 ; CHECK-NEXT:    vle32.v v12, (a0)
-; CHECK-NEXT:    vsetivli zero, 2, e32, m4, tu, ma
+; CHECK-NEXT:    vsetivli zero, 2, e32, m1, tu, ma
 ; CHECK-NEXT:    vmv.v.v v8, v12
 ; CHECK-NEXT:    ret
   %sv = load <2 x i32>, ptr %svp
@@ -27,7 +27,7 @@ define <vscale x 8 x i32> @insert_nxv8i32_v2i32_2(<vscale x 8 x i32> %vec, ptr %
 ; CHECK:       # %bb.0:
 ; CHECK-NEXT:    vsetivli zero, 2, e32, mf2, ta, ma
 ; CHECK-NEXT:    vle32.v v12, (a0)
-; CHECK-NEXT:    vsetivli zero, 4, e32, m4, tu, ma
+; CHECK-NEXT:    vsetivli zero, 4, e32, m1, tu, ma
 ; CHECK-NEXT:    vslideup.vi v8, v12, 2
 ; CHECK-NEXT:    ret
   %sv = load <2 x i32>, ptr %svp
@@ -40,7 +40,7 @@ define <vscale x 8 x i32> @insert_nxv8i32_v2i32_6(<vscale x 8 x i32> %vec, ptr %
 ; CHECK:       # %bb.0:
 ; CHECK-NEXT:    vsetivli zero, 2, e32, mf2, ta, ma
 ; CHECK-NEXT:    vle32.v v12, (a0)
-; CHECK-NEXT:    vsetivli zero, 8, e32, m4, tu, ma
+; CHECK-NEXT:    vsetivli zero, 8, e32, m2, tu, ma
 ; CHECK-NEXT:    vslideup.vi v8, v12, 6
 ; CHECK-NEXT:    ret
   %sv = load <2 x i32>, ptr %svp
@@ -51,22 +51,19 @@ define <vscale x 8 x i32> @insert_nxv8i32_v2i32_6(<vscale x 8 x i32> %vec, ptr %
 define <vscale x 8 x i32> @insert_nxv8i32_v8i32_0(<vscale x 8 x i32> %vec, ptr %svp) {
 ; LMULMAX2-LABEL: insert_nxv8i32_v8i32_0:
 ; LMULMAX2:       # %bb.0:
-; LMULMAX2-NEXT:    vsetivli zero, 8, e32, m2, ta, ma
-; LMULMAX2-NEXT:    vle32.v v12, (a0)
-; LMULMAX2-NEXT:    vsetivli zero, 8, e32, m4, tu, ma
-; LMULMAX2-NEXT:    vmv.v.v v8, v12
+; LMULMAX2-NEXT:    vsetivli zero, 8, e32, m2, tu, ma
+; LMULMAX2-NEXT:    vle32.v v8, (a0)
 ; LMULMAX2-NEXT:    ret
 ;
 ; LMULMAX1-LABEL: insert_nxv8i32_v8i32_0:
 ; LMULMAX1:       # %bb.0:
+; LMULMAX1-NEXT:    addi a1, a0, 16
 ; LMULMAX1-NEXT:    vsetivli zero, 4, e32, m1, ta, ma
-; LMULMAX1-NEXT:    vle32.v v12, (a0)
-; LMULMAX1-NEXT:    addi a0, a0, 16
-; LMULMAX1-NEXT:    vle32.v v16, (a0)
-; LMULMAX1-NEXT:    vsetivli zero, 4, e32, m4, tu, ma
-; LMULMAX1-NEXT:    vmv.v.v v8, v12
-; LMULMAX1-NEXT:    vsetivli zero, 8, e32, m4, tu, ma
-; LMULMAX1-NEXT:    vslideup.vi v8, v16, 4
+; LMULMAX1-NEXT:    vle32.v v12, (a1)
+; LMULMAX1-NEXT:    vsetvli zero, zero, e32, m1, tu, ma
+; LMULMAX1-NEXT:    vle32.v v8, (a0)
+; LMULMAX1-NEXT:    vsetivli zero, 8, e32, m2, tu, ma
+; LMULMAX1-NEXT:    vslideup.vi v8, v12, 4
 ; LMULMAX1-NEXT:    ret
   %sv = load <8 x i32>, ptr %svp
   %v = call <vscale x 8 x i32> @llvm.vector.insert.v8i32.nxv8i32(<vscale x 8 x i32> %vec, <8 x i32> %sv, i64 0)
@@ -84,14 +81,14 @@ define <vscale x 8 x i32> @insert_nxv8i32_v8i32_8(<vscale x 8 x i32> %vec, ptr %
 ;
 ; LMULMAX1-LABEL: insert_nxv8i32_v8i32_8:
 ; LMULMAX1:       # %bb.0:
-; LMULMAX1-NEXT:    addi a1, a0, 16
 ; LMULMAX1-NEXT:    vsetivli zero, 4, e32, m1, ta, ma
-; LMULMAX1-NEXT:    vle32.v v12, (a1)
+; LMULMAX1-NEXT:    vle32.v v12, (a0)
+; LMULMAX1-NEXT:    addi a0, a0, 16
 ; LMULMAX1-NEXT:    vle32.v v16, (a0)
 ; LMULMAX1-NEXT:    vsetivli zero, 12, e32, m4, tu, ma
-; LMULMAX1-NEXT:    vslideup.vi v8, v16, 8
+; LMULMAX1-NEXT:    vslideup.vi v8, v12, 8
 ; LMULMAX1-NEXT:    vsetivli zero, 16, e32, m4, tu, ma
-; LMULMAX1-NEXT:    vslideup.vi v8, v12, 12
+; LMULMAX1-NEXT:    vslideup.vi v8, v16, 12
 ; LMULMAX1-NEXT:    ret
   %sv = load <8 x i32>, ptr %svp
   %v = call <vscale x 8 x i32> @llvm.vector.insert.v8i32.nxv8i32(<vscale x 8 x i32> %vec, <8 x i32> %sv, i64 8)
@@ -166,7 +163,7 @@ define void @insert_v8i32_v2i32_0(ptr %vp, ptr %svp) {
 ; LMULMAX2-NEXT:    vle32.v v8, (a1)
 ; LMULMAX2-NEXT:    vsetivli zero, 8, e32, m2, ta, ma
 ; LMULMAX2-NEXT:    vle32.v v10, (a0)
-; LMULMAX2-NEXT:    vsetivli zero, 2, e32, m2, tu, ma
+; LMULMAX2-NEXT:    vsetivli zero, 2, e32, m1, tu, ma
 ; LMULMAX2-NEXT:    vmv.v.v v10, v8
 ; LMULMAX2-NEXT:    vsetivli zero, 8, e32, m2, ta, ma
 ; LMULMAX2-NEXT:    vse32.v v10, (a0)
@@ -197,7 +194,7 @@ define void @insert_v8i32_v2i32_2(ptr %vp, ptr %svp) {
 ; LMULMAX2-NEXT:    vle32.v v8, (a1)
 ; LMULMAX2-NEXT:    vsetivli zero, 8, e32, m2, ta, ma
 ; LMULMAX2-NEXT:    vle32.v v10, (a0)
-; LMULMAX2-NEXT:    vsetivli zero, 4, e32, m2, tu, ma
+; LMULMAX2-NEXT:    vsetivli zero, 4, e32, m1, tu, ma
 ; LMULMAX2-NEXT:    vslideup.vi v10, v8, 2
 ; LMULMAX2-NEXT:    vsetivli zero, 8, e32, m2, ta, ma
 ; LMULMAX2-NEXT:    vse32.v v10, (a0)
@@ -508,9 +505,9 @@ define void @insert_v2i64_nxv16i64(ptr %psv0, ptr %psv1, <vscale x 16 x i64>* %o
 ; CHECK:       # %bb.0:
 ; CHECK-NEXT:    vsetivli zero, 2, e64, m1, ta, ma
 ; CHECK-NEXT:    vle64.v v8, (a0)
-; CHECK-NEXT:    vle64.v v16, (a1)
-; CHECK-NEXT:    vsetivli zero, 6, e64, m8, tu, ma
-; CHECK-NEXT:    vslideup.vi v8, v16, 4
+; CHECK-NEXT:    vle64.v v12, (a1)
+; CHECK-NEXT:    vsetivli zero, 6, e64, m4, tu, ma
+; CHECK-NEXT:    vslideup.vi v8, v12, 4
 ; CHECK-NEXT:    vs8r.v v8, (a2)
 ; CHECK-NEXT:    ret
   %sv0 = load <2 x i64>, ptr %psv0
@@ -539,7 +536,7 @@ define void @insert_v2i64_nxv16i64_lo2(ptr %psv, <vscale x 16 x i64>* %out) {
 ; CHECK:       # %bb.0:
 ; CHECK-NEXT:    vsetivli zero, 2, e64, m1, ta, ma
 ; CHECK-NEXT:    vle64.v v8, (a0)
-; CHECK-NEXT:    vsetivli zero, 4, e64, m8, ta, ma
+; CHECK-NEXT:    vsetivli zero, 4, e64, m2, ta, ma
 ; CHECK-NEXT:    vslideup.vi v16, v8, 2
 ; CHECK-NEXT:    vs8r.v v16, (a1)
 ; CHECK-NEXT:    ret
diff --git a/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-strided-load-combine.ll b/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-strided-load-combine.ll
index f52ba6f51d5c897..805557905117add 100644
--- a/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-strided-load-combine.ll
+++ b/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-strided-load-combine.ll
@@ -27,13 +27,13 @@ define void @widen_3xv4i16(ptr %x, ptr %z) {
 ; CHECK-NEXT:    vsetivli zero, 4, e16, mf2, ta, ma
 ; CHECK-NEXT:    vle16.v v8, (a0)
 ; CHECK-NEXT:    addi a2, a0, 8
-; CHECK-NEXT:    vle16.v v10, (a2)
+; CHECK-NEXT:    vle16.v v9, (a2)
 ; CHECK-NEXT:    addi a0, a0, 16
-; CHECK-NEXT:    vle16.v v12, (a0)
-; CHECK-NEXT:    vsetivli zero, 8, e16, m2, tu, ma
-; CHECK-NEXT:    vslideup.vi v8, v10, 4
+; CHECK-NEXT:    vle16.v v10, (a0)
+; CHECK-NEXT:    vsetivli zero, 8, e16, m1, tu, ma
+; CHECK-NEXT:    vslideup.vi v8, v9, 4
 ; CHECK-NEXT:    vsetivli zero, 12, e16, m2, tu, ma
-; CHECK-NEXT:    vslideup.vi v8, v12, 8
+; CHECK-NEXT:    vslideup.vi v8, v10, 8
 ; CHECK-NEXT:    vse16.v v8, (a1)
 ; CHECK-NEXT:    ret
   %a = load <4 x i16>, ptr %x
@@ -75,17 +75,17 @@ define void @widen_4xv4i16_unaligned(ptr %x, ptr %z) {
 ; CHECK-NO-MISALIGN-NEXT:    vsetivli zero, 8, e8, mf2, ta, ma
 ; CHECK-NO-MISALIGN-NEXT:    vle8.v v8, (a0)
 ; CHECK-NO-MISALIGN-NEXT:    addi a2, a0, 8
-; CHECK-NO-MISALIGN-NEXT:    vle8.v v10, (a2)
+; CHECK-NO-MISALIGN-NEXT:    vle8.v v9, (a2)
 ; CHECK-NO-MISALIGN-NEXT:    addi a2, a0, 16
-; CHECK-NO-MISALIGN-NEXT:    vle8.v v12, (a2)
+; CHECK-NO-MISALIGN-NEXT:    vle8.v v10, (a2)
 ; CHECK-NO-MISALIGN-NEXT:    addi a0, a0, 24
-; CHECK-NO-MISALIGN-NEXT:    vle8.v v14, (a0)
-; CHECK-NO-MISALIGN-NEXT:    vsetivli zero, 8, e16, m2, tu, ma
-; CHECK-NO-MISALIGN-NEXT:    vslideup.vi v8, v10, 4
+; CHECK-NO-MISALIGN-NEXT:    vle8.v v12, (a0)
+; CHECK-NO-MISALIGN-NEXT:    vsetvli zero, zero, e16, m1, tu, ma
+; CHECK-NO-MISALIGN-NEXT:    vslideup.vi v8, v9, 4
 ; CHECK-NO-MISALIGN-NEXT:    vsetivli zero, 12, e16, m2, tu, ma
-; CHECK-NO-MISALIGN-NEXT:    vslideup.vi v8, v12, 8
+; CHECK-NO-MISALIGN-NEXT:    vslideup.vi v8, v10, 8
 ; CHECK-NO-MISALIGN-NEXT:    vsetivli zero, 16, e16, m2, ta, ma
-; CHECK-NO-MISALIGN-NEXT:    vslideup.vi v8, v14, 12
+; CHECK-NO-MISALIGN-NEXT:    vslideup.vi v8, v12, 12
 ; CHECK-NO-MISALIGN-NEXT:    vse16.v v8, (a1)
 ; CHECK-NO-MISALIGN-NEXT:    ret
 ;
@@ -188,17 +188,17 @@ define void @strided_constant_mismatch_4xv4i16(ptr %x, ptr %z) {
 ; CHECK-NEXT:    vsetivli zero, 4, e16, mf2, ta, ma
 ; CHECK-NEXT:    vle16.v v8, (a0)
 ; CHECK-NEXT:    addi a2, a0, 2
-; CHECK-NEXT:    vle16.v v10, (a2)
+; CHECK-NEXT:    vle16.v v9, (a2)
 ; CHECK-NEXT:    addi a2, a0, 6
-; CHECK-NEXT:    vle16.v v12, (a2)
+; CHECK-NEXT:    vle16.v v10, (a2)
 ; CHECK-NEXT:    addi a0, a0, 8
-; CHECK-NEXT:    vle16.v v14, (a0)
-; CHECK-NEXT:    vsetivli zero, 8, e16, m2, tu, ma
-; CHECK-NEXT:    vslideup.vi v8, v10, 4
+; CHECK-NEXT:    vle16.v v12, (a0)
+; CHECK-NEXT:    vsetivli zero, 8, e16, m1, tu, ma
+; CHECK-NEXT:    vslideup.vi v8, v9, 4
 ; CHECK-NEXT:    vsetivli zero, 12, e16, m2, tu, ma
-; CHECK-NEXT:    vslideup.vi v8, v12, 8
+; CHECK-NEXT:    vslideup.vi v8, v10, 8
 ; CHECK-NEXT:    vsetivli zero, 16, e16, m2, ta, ma
-; CHECK-NEXT:    vslideup.vi v8, v14, 12
+; CHECK-NEXT:    vslideup.vi v8, v12, 12
 ; CHECK-NEXT:    vse16.v v8, (a1)
 ; CHECK-NEXT:    ret
   %a = load <4 x i16>, ptr %x
@@ -258,17 +258,17 @@ define void @strided_runtime_mismatch_4xv4i16(ptr %x, ptr %z, i64 %s, i64 %t) {
 ; RV32-NEXT:    vsetivli zero, 4, e16, mf2, ta, ma
 ; RV32-NEXT:    vle16.v v8, (a0)
 ; RV32-NEXT:    add a0, a0, a2
-; RV32-NEXT:    vle16.v v10, (a0)
+; RV32-NEXT:    vle16.v v9, (a0)
 ; RV32-NEXT:    add a0, a0, a4
-; RV32-NEXT:    vle16.v v12, (a0)
+; RV32-NEXT:    vle16.v v10, (a0)
 ; RV32-NEXT:    add a0, a0, a2
-; RV32-NEXT:    vle16.v v14, (a0)
-; RV32-NEXT:    vsetivli zero, 8, e16, m2, tu, ma
-; RV32-NEXT:    vslideup.vi v8, v10, 4
+; RV32-NEXT:    vle16.v v12, (a0)
+; RV32-NEXT:    vsetivli zero, 8, e16, m1, tu, ma
+; RV32-NEXT:    vslideup.vi v8, v9, 4
 ; RV32-NEXT:    vsetivli zero, 12, e16, m2, tu, ma
-; RV32-NEXT:    vslideup.vi v8, v12, 8
+; RV32-NEXT:    vslideup.vi v8, v10, 8
 ; RV32-NEXT:    vsetivli zero, 16, e16, m2, ta, ma
-; RV32-NEXT:    vslideup.vi v8, v14, 12
+; RV32-NEXT:    vslideup.vi v8, v12, 12
 ; RV32-NEXT:    vse16.v v8, (a1)
 ; RV32-NEXT:    ret
 ;
@@ -277,17 +277,17 @@ define void @strided_runtime_mismatch_4xv4i16(ptr %x, ptr %z, i64 %s, i64 %t) {
 ; RV64-NEXT:    vsetivli zero, 4, e16, mf2, ta, ma
 ; RV64-NEXT:    vle16.v v8, (a0)
 ; RV64-NEXT:    add a0, a0, a2
-; RV64-NEXT:    vle16.v v10, (a0)
+; RV64-NEXT:    vle16.v v9, (a0)
 ; RV64-NEXT:    add a0, a0, a3
-; RV64-NEXT:    vle16.v v12, (a0)
+; RV64-NEXT:    vle16.v v10, (a0)
 ; RV64-NEXT:    add a0, a0, a2
-; RV64-NEXT:    vle16.v v14, (a0)
-; RV64-NEXT:    vsetivli zero, 8, e16, m2, tu, ma
-; RV64-NEXT:    vslideup.vi v8, v10, 4
+; RV64-NEXT:    vle16.v v12, (a0)
+; RV64-NEXT:    vsetivli zero, 8, e16, m1, tu, ma
+; RV64-NEXT:    vslideup.vi v8, v9, 4
 ; RV64-NEXT:    vsetivli zero, 12, e16, m2, tu, ma
-; RV64-NEXT:    vslideup.vi v8, v12, 8
+; RV64-NEXT:    vslideup.vi v8, v10, 8
 ; RV64-NEXT:    vsetivli zero, 16, e16, m2, ta, ma
-; RV64-NEXT:    vslideup.vi v8, v14, 12
+; RV64-NEXT:    vslideup.vi v8, v12, 12
 ; RV64-NEXT:    vse16.v v8, (a1)
 ; RV64-NEXT:    ret
 ;
@@ -296,17 +296,17 @@ define void @strided_runtime_mismatch_4xv4i16(ptr %x, ptr %z, i64 %s, i64 %t) {
 ; ZVE64F-NEXT:    vsetivli zero, 4, e16, mf2, ta, ma
 ; ZVE64F-NEXT:    vle16.v v8, (a0)
 ; ZVE64F-NEXT:    add a0, a0, a2
-; ZVE64F-NEXT:    vle16.v v10, (a0)
+; ZVE64F-NEXT:    vle16.v v9, (a0)
 ; ZVE64F-NEXT:    add a0, a0, a3
-; ZVE64F-NEXT:    vle16.v v12, (a0)
+; ZVE64F-NEXT:    vle16.v v10, (a0)
 ; ZVE64F-NEXT:    add a0, a0, a2
-; ZVE64F-NEXT:    vle16.v v14, (a0)
-; ZVE64F-NEXT:    vsetivli zero, 8, e16, m2, tu, ma
-; ZVE64F-NEXT:    vslideup.vi v8, v10, 4
+; ZVE64F-NEXT:    vle16.v v12, (a0)
+; ZVE64F-NEXT:    vsetivli zero, 8, e16, m1, tu, ma
+; ZVE64F-NEXT:    vslideup.vi v8, v9, 4
 ; ZVE64F-NEXT:    vsetivli zero, 12, e16, m2, tu, ma
-; ZVE64F-NEXT:    vslideup.vi v8, v12, 8
+; ZVE64F-NEXT:    vslideup.vi v8, v10, 8
 ; ZVE64F-NEXT:    vsetivli zero, 16, e16, m2, ta, ma
-; ZVE64F-NEXT:    vslideup.vi v8, v14, 12
+; ZVE64F-NEXT:    vslideup.vi v8, v12, 12
 ; ZVE64F-NEXT:    vse16.v v8, (a1)
 ; ZVE64F-NEXT:    ret
   %a = load <4 x i16>, ptr %x
diff --git a/llvm/test/CodeGen/RISCV/rvv/fpclamptosat_vec.ll b/llvm/test/CodeGen/RISCV/rvv/fpclamptosat_vec.ll
index 31e7e7be76c89b1..f598118c18aff9f 100644
--- a/llvm/test/CodeGen/RISCV/rvv/fpclamptosat_vec.ll
+++ b/llvm/test/CodeGen/RISCV/rvv/fpclamptosat_vec.ll
@@ -460,54 +460,49 @@ define <4 x i32> @stest_f16i32(<4 x half> %x) {
 ; CHECK-V-NEXT:    sub sp, sp, a1
 ; CHECK-V-NEXT:    .cfi_escape 0x0f, 0x0d, 0x72, 0x00, 0x11, 0x30, 0x22, 0x11, 0x04, 0x92, 0xa2, 0x38, 0x00, 0x1e, 0x22 # sp + 48 + 4 * vlenb
 ; CHECK-V-NEXT:    lhu s0, 24(a0)
-; CHECK-V-NEXT:    lhu s1, 16(a0)
-; CHECK-V-NEXT:    lhu s2, 0(a0)
-; CHECK-V-NEXT:    lhu a0, 8(a0)
+; CHECK-V-NEXT:    lhu s1, 0(a0)
+; CHECK-V-NEXT:    lhu s2, 8(a0)
+; CHECK-V-NEXT:    lhu a0, 16(a0)
 ; CHECK-V-NEXT:    fmv.w.x fa0, a0
 ; CHECK-V-NEXT:    call __extendhfsf2@plt
 ; CHECK-V-NEXT:    fcvt.l.s a0, fa0, rtz
 ; CHECK-V-NEXT:    vsetivli zero, 1, e64, m1, ta, ma
 ; CHECK-V-NEXT:    vmv.s.x v8, a0
-; CHECK-V-NEXT:    addi a0, sp, 16
-; CHECK-V-NEXT:    vs2r.v v8, (a0) # Unknown-size Folded Spill
-; CHECK-V-NEXT:    fmv.w.x fa0, s2
-; CHECK-V-NEXT:    call __extendhfsf2@plt
-; CHECK-V-NEXT:    fcvt.l.s a0, fa0, rtz
-; CHECK-V-NEXT:    vsetivli zero, 2, e64, m2, tu, ma
-; CHECK-V-NEXT:    vmv.s.x v8, a0
-; CHECK-V-NEXT:    addi a0, sp, 16
-; CHECK-V-NEXT:    vl2r.v v10, (a0) # Unknown-size Folded Reload
-; CHECK-V-NEXT:    vslideup.vi v8, v10, 1
 ; CHECK-V-NEXT:    csrr a0, vlenb
 ; CHECK-V-NEXT:    slli a0, a0, 1
 ; CHECK-V-NEXT:    add a0, sp, a0
 ; CHECK-V-NEXT:    addi a0, a0, 16
 ; CHECK-V-NEXT:    vs2r.v v8, (a0) # Unknown-size Folded Spill
+; CHECK-V-NEXT:    fmv.w.x fa0, s2
+; CHECK-V-NEXT:    call __extendhfsf2@plt
+; CHECK-V-NEXT:    fcvt.l.s a0, fa0, rtz
+; CHECK-V-NEXT:    vsetivli zero, 1, e64, m1, ta, ma
 ; CHECK-V-NEXT:    fmv.w.x fa0, s1
+; CHECK-V-NEXT:    vmv.s.x v8, a0
+; CHECK-V-NEXT:    addi a0, sp, 16
+; CHECK-V-NEXT:    vs1r.v v8, (a0) # Unknown-size Folded Spill
 ; CHECK-V-NEXT:    call __extendhfsf2@plt
 ; CHECK-V-NEXT:    fcvt.l.s a0, fa0, rtz
+; CHECK-V-NEXT:    vsetivli zero, 2, e64, m1, tu, ma
+; CHECK-V-NEXT:    vmv.s.x v10, a0
+; CHECK-V-NEXT:    addi a0, sp, 16
+; CHECK-V-NEXT:    vl1r.v v8, (a0) # Unknown-size Folded Reload
+; CHECK-V-NEXT:    vslideup.vi v10, v8, 1
 ; CHECK-V-NEXT:    vsetivli zero, 3, e64, m2, tu, ma
-; CHECK-V-NEXT:    vmv.s.x v8, a0
 ; CHECK-V-NEXT:    csrr a0, vlenb
 ; CHECK-V-NEXT:    slli a0, a0, 1
 ; CHECK-V-NEXT:    add a0, sp, a0
 ; CHECK-V-NEXT:    addi a0, a0, 16
-; CHECK-V-NEXT:    vl2r.v v10, (a0) # Unknown-size Folded Reload
+; CHECK-V-NEXT:    vl2r.v v8, (a0) # Unknown-size Folded Reload
 ; CHECK-V-NEXT:    vslideup.vi v10, v8, 2
-; CHECK-V-NEXT:    csrr a0, vlenb
-; CHECK-V-NEXT:    slli a0, a0, 1
-; CHECK-V-NEXT:    add a0, sp, a0
-; CHECK-V-NEXT:    addi a0, a0, 16
+; CHECK-V-NEXT:    addi a0, sp, 16
 ; CHECK-V-NEXT:    vs2r.v v10, (a0) # Unknown-size Folded Spill
 ; CHECK-V-NEXT:    fmv.w.x fa0, s0
 ; CHECK-V-NEXT:    call __extendhfsf2@plt
 ; CHECK-V-NEXT:    fcvt.l.s a0, fa0, rtz
 ; CHECK-V-NEXT:    vsetivli zero, 4, e64, m2, ta, ma
 ; CHECK-V-NEXT:    vmv.s.x v8, a0
-; CHECK-V-NEXT:    csrr a0, vlenb
-; CHECK-V-NEXT:    slli a0, a0, 1
-; CHECK-V-NEXT:    add a0, sp, a0
-; CHECK-V-NEXT:    addi a0, a0, 16
+; CHECK-V-NEXT:    addi a0, sp, 16
 ; CHECK-V-NEXT:    vl2r.v v10, (a0) # Unknown-size Folded Reload
 ; CHECK-V-NEXT:    vslideup.vi v10, v8, 3
 ; CHECK-V-NEXT:    lui a0, 524288
@@ -632,54 +627,49 @@ define <4 x i32> @utesth_f16i32(<4 x half> %x) {
 ; CHECK-V-NEXT:    sub sp, sp, a1
 ; CHECK-V-NEXT:    .cfi_escape 0x0f, 0x0d, 0x72, 0x00, 0x11, 0x30, 0x22, 0x11, 0x04, 0x92, 0xa2, 0x38, 0x00, 0x1e, 0x22 # sp + 48 + 4 * vlenb
 ; CHECK-V-NEXT:    lhu s0, 24(a0)
-; CHECK-V-NEXT:    lhu s1, 16(a0)
-; CHECK-V-NEXT:    lhu s2, 0(a0)
-; CHECK-V-NEXT:    lhu a0, 8(a0)
+; CHECK-V-NEXT:    lhu s1, 0(a0)
+; CHECK-V-NEXT:    lhu s2, 8(a0)
+; CHECK-V-NEXT:    lhu a0, 16(a0)
 ; CHECK-V-NEXT:    fmv.w.x fa0, a0
 ; CHECK-V-NEXT:    call __extendhfsf2@plt
 ; CHECK-V-NEXT:    fcvt.lu.s a0, fa0, rtz
 ; CHECK-V-NEXT:    vsetivli zero, 1, e64, m1, ta, ma
 ; CHECK-V-NEXT:    vmv.s....
```
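To make the size calculation in the lowering change concrete, here is a small standalone sketch of the arithmetic involved. It is only an illustration: the helper name, the integer-only LMUL search, and the example values are assumptions for this comment, not the actual getSmallestVTForIndex implementation in RISCVISelLowering.cpp. The two asserts mirror the m1 and m2 results visible in the insert_nxv8i32_v2i32_2 and insert_nxv8i32_v2i32_6 test updates above.

```cpp
#include <cassert>

// Illustrative sketch only (not the LLVM implementation): given the minimum
// VLEN in bits, the element width in bits, and the highest lane index the
// vslideup will write, return the smallest integer LMUL whose register group
// is guaranteed to contain that lane at the minimum VLEN.
static unsigned smallestLMULForLastIndex(unsigned MinVLenBits,
                                         unsigned EltBits,
                                         unsigned LastIdx) {
  for (unsigned LMUL = 1; LMUL <= 8; LMUL *= 2) {
    unsigned GuaranteedElts = (MinVLenBits * LMUL) / EltBits;
    if (LastIdx < GuaranteedElts)
      return LMUL;
  }
  return 8; // Fall back to the full register group.
}

int main() {
  // Inserting <2 x i32> at index 2 into an m4 vector with VLEN >= 128:
  // the last lane written is 3, so the slideup can run at m1.
  assert(smallestLMULForLastIndex(128, 32, 3) == 1);
  // Inserting <2 x i32> at index 6 writes lane 7, which needs m2.
  assert(smallestLMULForLastIndex(128, 32, 7) == 2);
  return 0;
}
```

Once the smaller type is chosen, the lowering extracts the low part of the vector at that type, performs the slideup there, and inserts the result back into the original register group, which is what the EXTRACT_SUBVECTOR/INSERT_SUBVECTOR pair in the RISCVISelLowering.cpp hunk does.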
Force-pushed from 282f29d to dd0dd7a
Force-pushed from 8b637f0 to dd0dd7a
Force-pushed from dd0dd7a to 5621ded
Both fix the crash, but I closed #65916 in favour of #66211 since my understanding of LiveIntervals is limited.
Similar to llvm#66267, we can perform a vslideup_vl on a smaller type if we know the highest lane that will be written to, which can be determined from VL. This is an alternative to llvm#65997 and llvm#66087.
LGTM
Warning: Given #66211 has landed, I am only reviewing the second change in this stack. Make sure you rebase!
Similar to llvm#65598, if we're using a vslideup to insert a fixed length vector into another vector, then we can work out the minimum number of registers it will need to slide up across given the minimum VLEN, and shrink the type operated on to reduce LMUL accordingly. This is somewhat dependent on llvm#65916, since it introduces a subregister copy that triggers a crash with -early-live-intervals in one of the tests.
Force-pushed from 5621ded to 0645568
I think we're starting to see failures from this change. This fixes the failures I've seen, but I think there may be some ordering issues with when we decide to shrink. If we enter this particular
Here's one failure that I think is related to this: iree-org/iree#15038. I also have an internal build error, but I don't have a reduced reproducer yet. Here's a fragment of the debug output from my internal reproducer.
MinVLen is 128. The issue is with the
Continuing on from llvm#65997, if the index of insert_vector_elt is a constant then we can work out the minimum number of registers that will be needed for the slideup and choose a smaller type to operate on. This reduces the LMUL for not just the slideup but also for the scalar insert.
…lt (llvm#66087) Continuing on from llvm#65997, if the index of insert_vector_elt is a constant then we can work out the minimum number of registers that will be needed for the slideup and choose a smaller type to operate on. This reduces the LMUL for not just the slideup but also for the scalar insert.
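As a rough illustration of the insert_vector_elt case described in these commits, the same last-lane arithmetic bounds the LMUL when the index is a constant. This is a hedged sketch that reuses the hypothetical helper from the earlier example, not the real LLVM code.

```cpp
#include <cassert>

// Same illustrative helper as in the earlier sketch (not LLVM code).
static unsigned smallestLMULForLastIndex(unsigned MinVLenBits,
                                         unsigned EltBits,
                                         unsigned LastIdx) {
  for (unsigned LMUL = 1; LMUL <= 8; LMUL *= 2)
    if (LastIdx < (MinVLenBits * LMUL) / EltBits)
      return LMUL;
  return 8;
}

int main() {
  // Inserting an e64 element at constant index 2 with VLEN >= 128 only
  // touches lanes 0..2, so m2 is enough even if the source vector lives
  // in an m8 register group.
  assert(smallestLMULForLastIndex(128, 64, 2) == 2);
  return 0;
}
```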
…sert_subvector (llvm#65997)"" This reverts commit 3a6cc52.
Similar to #65598, if we're using a vslideup to insert a fixed-length vector into another vector, then we can work out the minimum number of registers it will need to slide up across given the minimum VLEN, and shrink the type operated on to reduce LMUL accordingly.
This is somewhat dependent on #66211, since it introduces a subregister copy that triggers a crash with -early-live-intervals in one of the tests.
Stacked upon #66211.