[SLP]Improve/fix subvectors in gather/buildvector nodes handling #104144

alexey-bataev · 2024-08-14T18:35:52Z

The SLP vectorizer has an estimation for gather/buildvector nodes that
contain some scalar loads. It performs a pretty similar (but large in
SLOCs) estimation, which is not always correct. Instead, this patch
implements clustering analysis and actual node allocation, with full
analysis of the vectorized clustered scalars (not only loads, but also
some other instructions), correct cost estimation, and vector insert
instructions. This improves overall vectorization quality and simplifies
the analysis/estimations.

Created using spr 1.3.5
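
For readers following the description, the clustering step can be pictured with a small standalone model. This is only a sketch under stated assumptions, not the patch itself: `Opcode`, `findSubvectorSlices`, and the `Taken` bitmap are hypothetical names, and a plain "same non-zero opcode" test stands in for the real `getSameOpcode`/`buildTree_rec`/`getTreeEntry` checks done in `transformNodes`.

```cpp
#include <utility>
#include <vector>

// Hypothetical stand-in for a scalar: only its opcode matters in this model.
// 0 means "no usable opcode" (for example a constant or an argument).
using Opcode = int;

// Returns (start lane, width) pairs for slices of a gather node that could be
// turned into separately vectorized subvectors. Widths are tried from N/2
// downwards, halving each time, and never go below MinVF.
std::vector<std::pair<unsigned, unsigned>>
findSubvectorSlices(const std::vector<Opcode> &Scalars, unsigned MinVF) {
  std::vector<std::pair<unsigned, unsigned>> Result;
  const unsigned N = static_cast<unsigned>(Scalars.size());
  if (N <= 2 || MinVF == 0)
    return Result;
  // Assumption: models "already part of a tree entry" from the real code.
  std::vector<bool> Taken(N, false);
  unsigned StartIdx = 0;
  unsigned End = N;
  for (unsigned VF = N / 2; VF >= MinVF; VF /= 2) {
    for (unsigned Cnt = StartIdx; Cnt + VF <= End; Cnt += VF) {
      // The slice qualifies only if all lanes share one non-zero opcode.
      bool Same = Scalars[Cnt] != 0;
      for (unsigned I = Cnt; Same && I < Cnt + VF; ++I)
        Same = Scalars[I] == Scalars[Cnt];
      if (!Same || Taken[Cnt] || Taken[Cnt + VF - 1])
        continue;
      for (unsigned I = Cnt; I < Cnt + VF; ++I)
        Taken[I] = true;
      Result.emplace_back(Cnt, VF);
      // Shrink the search window when the slice touches either boundary.
      if (StartIdx == Cnt)
        StartIdx = Cnt + VF;
      if (End == Cnt + VF)
        End = Cnt;
    }
  }
  return Result;
}
```

With lanes `{1,1,1,1,2,3,1,1}` and `MinVF = 2`, this model picks lanes 0..3 at width 4 and lanes 6..7 at width 2, leaving lanes 4..5 to be gathered as scalars. In the actual patch a slice is additionally dropped when building it yields only a single gather node.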

llvmbot · 2024-08-14T18:36:31Z

@llvm/pr-subscribers-llvm-transforms

@llvm/pr-subscribers-backend-systemz

Author: Alexey Bataev (alexey-bataev)

Patch is 183.96 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/104144.diff

27 Files Affected:

(modified) llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp (+160-185)
(modified) llvm/test/Transforms/PhaseOrdering/AArch64/slpordering.ll (+37-37)
(modified) llvm/test/Transforms/SLPVectorizer/AArch64/getelementptr.ll (+6-5)
(modified) llvm/test/Transforms/SLPVectorizer/AArch64/loadorder.ll (+96-96)
(modified) llvm/test/Transforms/SLPVectorizer/AArch64/multiple_reduction.ll (+147-218)
(modified) llvm/test/Transforms/SLPVectorizer/AArch64/scalarization-overhead.ll (+43-19)
(modified) llvm/test/Transforms/SLPVectorizer/AArch64/shuffle-vectors-mask-size.ll (+2-5)
(modified) llvm/test/Transforms/SLPVectorizer/AArch64/tsc-s116.ll (+4-4)
(modified) llvm/test/Transforms/SLPVectorizer/AArch64/vectorizable-selects-uniform-cmps.ll (+19-13)
(modified) llvm/test/Transforms/SLPVectorizer/RISCV/combined-loads-stored.ll (+3-4)
(modified) llvm/test/Transforms/SLPVectorizer/RISCV/reductions.ll (+22-26)
(modified) llvm/test/Transforms/SLPVectorizer/SystemZ/pr34619.ll (+5-6)
(modified) llvm/test/Transforms/SLPVectorizer/X86/addsub.ll (+8-10)
(modified) llvm/test/Transforms/SLPVectorizer/X86/extract-many-users-buildvector.ll (+19-24)
(modified) llvm/test/Transforms/SLPVectorizer/X86/extract-scalar-from-undef.ll (+13-14)
(modified) llvm/test/Transforms/SLPVectorizer/X86/gather-node-same-as-vect-but-order.ll (+6-7)
(modified) llvm/test/Transforms/SLPVectorizer/X86/horizontal-minmax.ll (+7-9)
(modified) llvm/test/Transforms/SLPVectorizer/X86/inst_size_bug.ll (+12-6)
(modified) llvm/test/Transforms/SLPVectorizer/X86/landing_pad.ll (+10-9)
(modified) llvm/test/Transforms/SLPVectorizer/X86/phi.ll (+27-26)
(modified) llvm/test/Transforms/SLPVectorizer/X86/reduction-logical.ll (+9-8)
(modified) llvm/test/Transforms/SLPVectorizer/X86/remark-partial-loads-vectorize.ll (+3-13)
(modified) llvm/test/Transforms/SLPVectorizer/X86/scatter-vectorize-reused-pointer.ll (+14-12)
(modified) llvm/test/Transforms/SLPVectorizer/X86/schedule_budget_debug_info.ll (+28-12)
(modified) llvm/test/Transforms/SLPVectorizer/X86/split-load8_2-unord.ll (+17-22)
(modified) llvm/test/Transforms/SLPVectorizer/X86/tiny-tree.ll (+2-3)
(modified) llvm/test/Transforms/SLPVectorizer/X86/vect-gather-same-nodes.ll (+3-3)

diff --git a/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp b/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
index 81841a8f692870..b32017adcf8ca8 100644
--- a/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
+++ b/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
@@ -3104,6 +3104,10 @@ class BoUpSLP {
     /// The index of this treeEntry in VectorizableTree.
     int Idx = -1;
 
+    /// For gather/buildvector/alt opcode (TODO) nodes, which are combined from
+    /// other nodes as a series of insertvector instructions.
+    SmallVector<std::pair<unsigned, unsigned>, 0> CombinedEntriesWithIndices;
+
   private:
     /// The operands of each instruction in each lane Operands[op_index][lane].
     /// Note: This helps avoid the replication of the code that performs the
@@ -3404,7 +3408,9 @@ class BoUpSLP {
         if (!isConstant(V)) {
           auto *I = dyn_cast<CastInst>(V);
           AllConstsOrCasts &= I && I->getType()->isIntegerTy();
-          ValueToGatherNodes.try_emplace(V).first->getSecond().insert(Last);
+          if (UserTreeIdx.EdgeIdx != UINT_MAX || !UserTreeIdx.UserTE ||
+              !UserTreeIdx.UserTE->isGather())
+            ValueToGatherNodes.try_emplace(V).first->getSecond().insert(Last);
         }
       if (AllConstsOrCasts)
         CastMaxMinBWSizes =
@@ -8361,8 +8367,49 @@ getGEPCosts(const TargetTransformInfo &TTI, ArrayRef<Value *> Ptrs,
 
 void BoUpSLP::transformNodes() {
   constexpr TTI::TargetCostKind CostKind = TTI::TCK_RecipThroughput;
-  for (std::unique_ptr<TreeEntry> &TE : VectorizableTree) {
-    TreeEntry &E = *TE;
+  // The tree may grow here, so iterate over nodes, built before.
+  for (unsigned Idx : seq<unsigned>(VectorizableTree.size())) {
+    TreeEntry &E = *VectorizableTree[Idx];
+    if (E.isGather()) {
+      ArrayRef<Value *> VL = E.Scalars;
+      const unsigned Sz = getVectorElementSize(VL.front());
+      unsigned MinVF = getMinVF(2 * Sz);
+      if (VL.size() <= 2 ||
+          (E.getOpcode() &&
+           (E.isAltShuffle() || E.getOpcode() != Instruction::Load)))
+        continue;
+      // Try to find vectorizable sequences and transform them into a series of
+      // insertvector instructions.
+      unsigned StartIdx = 0;
+      unsigned End = VL.size();
+      for (unsigned VF = VL.size() / 2; VF >= MinVF; VF /= 2) {
+        for (unsigned Cnt = StartIdx; Cnt + VF <= End; Cnt += VF) {
+          ArrayRef<Value *> Slice = VL.slice(Cnt, VF);
+          InstructionsState S = getSameOpcode(Slice, *TLI);
+          if (!S.getOpcode() || S.isAltShuffle() ||
+              (S.getOpcode() != Instruction::Load &&
+               any_of(Slice, [&](Value *V) {
+                 return !areAllUsersVectorized(cast<Instruction>(V),
+                                               UserIgnoreList);
+               })))
+            continue;
+          if (!getTreeEntry(Slice.front()) && !getTreeEntry(Slice.back())) {
+            unsigned PrevSize = VectorizableTree.size();
+            buildTree_rec(Slice, 0, EdgeInfo(&E, UINT_MAX));
+            if (PrevSize + 1 == VectorizableTree.size() &&
+                VectorizableTree[PrevSize]->isGather()) {
+              VectorizableTree.pop_back();
+              continue;
+            }
+            E.CombinedEntriesWithIndices.emplace_back(PrevSize, Cnt);
+            if (StartIdx == Cnt)
+              StartIdx = Cnt + VF;
+            if (End == Cnt + VF)
+              End = Cnt;
+          }
+        }
+      }
+    }
     switch (E.getOpcode()) {
     case Instruction::Load: {
       // No need to reorder masked gather loads, just reorder the scalar
@@ -8485,175 +8532,7 @@ class BoUpSLP::ShuffleCostEstimator : public BaseShuffleAnalysis {
     auto *VecTy = getWidenedType(ScalarTy, VL.size());
     InstructionCost GatherCost = 0;
     SmallVector<Value *> Gathers(VL);
-    // Improve gather cost for gather of loads, if we can group some of the
-    // loads into vector loads.
-    InstructionsState S = getSameOpcode(VL, *R.TLI);
-    const unsigned Sz = R.DL->getTypeSizeInBits(ScalarTy);
-    unsigned MinVF = R.getMinVF(2 * Sz);
-    if (VL.size() > 2 &&
-        ((S.getOpcode() == Instruction::Load && !S.isAltShuffle()) ||
-         (InVectors.empty() &&
-          any_of(seq<unsigned>(0, VL.size() / MinVF),
-                 [&](unsigned Idx) {
-                   ArrayRef<Value *> SubVL = VL.slice(Idx * MinVF, MinVF);
-                   InstructionsState S = getSameOpcode(SubVL, *R.TLI);
-                   return S.getOpcode() == Instruction::Load &&
-                          !S.isAltShuffle();
-                 }))) &&
-        !all_of(Gathers, [&](Value *V) { return R.getTreeEntry(V); }) &&
-        !isSplat(Gathers)) {
-      InstructionCost BaseCost = R.getGatherCost(Gathers, !Root, ScalarTy);
-      SetVector<Value *> VectorizedLoads;
-      SmallVector<std::pair<unsigned, LoadsState>> VectorizedStarts;
-      SmallVector<unsigned> ScatterVectorized;
-      unsigned StartIdx = 0;
-      unsigned VF = VL.size() / 2;
-      for (; VF >= MinVF; VF /= 2) {
-        for (unsigned Cnt = StartIdx, End = VL.size(); Cnt + VF <= End;
-             Cnt += VF) {
-          ArrayRef<Value *> Slice = VL.slice(Cnt, VF);
-          if (S.getOpcode() != Instruction::Load || S.isAltShuffle()) {
-            InstructionsState SliceS = getSameOpcode(Slice, *R.TLI);
-            if (SliceS.getOpcode() != Instruction::Load ||
-                SliceS.isAltShuffle())
-              continue;
-          }
-          if (!VectorizedLoads.count(Slice.front()) &&
-              !VectorizedLoads.count(Slice.back()) && allSameBlock(Slice)) {
-            SmallVector<Value *> PointerOps;
-            OrdersType CurrentOrder;
-            LoadsState LS = R.canVectorizeLoads(Slice, Slice.front(),
-                                                CurrentOrder, PointerOps);
-            switch (LS) {
-            case LoadsState::Vectorize:
-            case LoadsState::ScatterVectorize:
-            case LoadsState::StridedVectorize:
-              // Mark the vectorized loads so that we don't vectorize them
-              // again.
-              // TODO: better handling of loads with reorders.
-              if (((LS == LoadsState::Vectorize ||
-                    LS == LoadsState::StridedVectorize) &&
-                   CurrentOrder.empty()) ||
-                  (LS == LoadsState::StridedVectorize &&
-                   isReverseOrder(CurrentOrder)))
-                VectorizedStarts.emplace_back(Cnt, LS);
-              else
-                ScatterVectorized.push_back(Cnt);
-              VectorizedLoads.insert(Slice.begin(), Slice.end());
-              // If we vectorized initial block, no need to try to vectorize
-              // it again.
-              if (Cnt == StartIdx)
-                StartIdx += VF;
-              break;
-            case LoadsState::Gather:
-              break;
-            }
-          }
-        }
-        // Check if the whole array was vectorized already - exit.
-        if (StartIdx >= VL.size())
-          break;
-        // Found vectorizable parts - exit.
-        if (!VectorizedLoads.empty())
-          break;
-      }
-      if (!VectorizedLoads.empty()) {
-        unsigned NumParts = TTI.getNumberOfParts(VecTy);
-        bool NeedInsertSubvectorAnalysis =
-            !NumParts || (VL.size() / VF) > NumParts;
-        // Get the cost for gathered loads.
-        for (unsigned I = 0, End = VL.size(); I < End; I += VF) {
-          if (VectorizedLoads.contains(VL[I]))
-            continue;
-          GatherCost +=
-              getBuildVectorCost(VL.slice(I, std::min(End - I, VF)), Root);
-        }
-        // Exclude potentially vectorized loads from list of gathered
-        // scalars.
-        Gathers.assign(Gathers.size(), PoisonValue::get(VL.front()->getType()));
-        // The cost for vectorized loads.
-        InstructionCost ScalarsCost = 0;
-        for (Value *V : VectorizedLoads) {
-          auto *LI = cast<LoadInst>(V);
-          ScalarsCost +=
-              TTI.getMemoryOpCost(Instruction::Load, LI->getType(),
-                                  LI->getAlign(), LI->getPointerAddressSpace(),
-                                  CostKind, TTI::OperandValueInfo(), LI);
-        }
-        auto *LoadTy = getWidenedType(VL.front()->getType(), VF);
-        for (const std::pair<unsigned, LoadsState> &P : VectorizedStarts) {
-          auto *LI = cast<LoadInst>(VL[P.first]);
-          Align Alignment = LI->getAlign();
-          GatherCost +=
-              P.second == LoadsState::Vectorize
-                  ? TTI.getMemoryOpCost(Instruction::Load, LoadTy, Alignment,
-                                        LI->getPointerAddressSpace(), CostKind,
-                                        TTI::OperandValueInfo(), LI)
-                  : TTI.getStridedMemoryOpCost(
-                        Instruction::Load, LoadTy, LI->getPointerOperand(),
-                        /*VariableMask=*/false, Alignment, CostKind, LI);
-          // Add external uses costs.
-          for (auto [Idx, V] : enumerate(VL.slice(
-                   P.first, std::min<unsigned>(VL.size() - P.first, VF))))
-            if (!R.areAllUsersVectorized(cast<Instruction>(V)))
-              GatherCost += TTI.getVectorInstrCost(Instruction::ExtractElement,
-                                                   LoadTy, CostKind, Idx);
-          // Estimate GEP cost.
-          SmallVector<Value *> PointerOps(VF);
-          for (auto [I, V] : enumerate(VL.slice(P.first, VF)))
-            PointerOps[I] = cast<LoadInst>(V)->getPointerOperand();
-          auto [ScalarGEPCost, VectorGEPCost] =
-              getGEPCosts(TTI, PointerOps, LI->getPointerOperand(),
-                          Instruction::Load, CostKind, LI->getType(), LoadTy);
-          GatherCost += VectorGEPCost - ScalarGEPCost;
-        }
-        for (unsigned P : ScatterVectorized) {
-          auto *LI0 = cast<LoadInst>(VL[P]);
-          ArrayRef<Value *> Slice = VL.slice(P, VF);
-          Align CommonAlignment = computeCommonAlignment<LoadInst>(Slice);
-          GatherCost += TTI.getGatherScatterOpCost(
-              Instruction::Load, LoadTy, LI0->getPointerOperand(),
-              /*VariableMask=*/false, CommonAlignment, CostKind, LI0);
-          // Estimate GEP cost.
-          SmallVector<Value *> PointerOps(VF);
-          for (auto [I, V] : enumerate(Slice))
-            PointerOps[I] = cast<LoadInst>(V)->getPointerOperand();
-          OrdersType Order;
-          if (sortPtrAccesses(PointerOps, LI0->getType(), *R.DL, *R.SE,
-                              Order)) {
-            // TODO: improve checks if GEPs can be vectorized.
-            Value *Ptr0 = PointerOps.front();
-            Type *ScalarTy = Ptr0->getType();
-            auto *VecTy = getWidenedType(ScalarTy, VF);
-            auto [ScalarGEPCost, VectorGEPCost] =
-                getGEPCosts(TTI, PointerOps, Ptr0, Instruction::GetElementPtr,
-                            CostKind, ScalarTy, VecTy);
-            GatherCost += VectorGEPCost - ScalarGEPCost;
-            if (!Order.empty()) {
-              SmallVector<int> Mask;
-              inversePermutation(Order, Mask);
-              GatherCost += ::getShuffleCost(TTI, TTI::SK_PermuteSingleSrc,
-                                             VecTy, Mask, CostKind);
-            }
-          } else {
-            GatherCost += R.getGatherCost(PointerOps, /*ForPoisonSrc=*/true,
-                                          PointerOps.front()->getType());
-          }
-        }
-        if (NeedInsertSubvectorAnalysis) {
-          // Add the cost for the subvectors insert.
-          SmallVector<int> ShuffleMask(VL.size());
-          for (unsigned I = VF, E = VL.size(); I < E; I += VF) {
-            for (unsigned Idx : seq<unsigned>(0, E))
-              ShuffleMask[Idx] = Idx / VF == I ? E + Idx % VF : Idx;
-            GatherCost += ::getShuffleCost(TTI, TTI::SK_InsertSubvector, VecTy,
-                                           ShuffleMask, CostKind, I, LoadTy);
-          }
-        }
-        GatherCost -= ScalarsCost;
-      }
-      GatherCost = std::min(BaseCost, GatherCost);
-    } else if (!Root && isSplat(VL)) {
+    if (!Root && isSplat(VL)) {
       // Found the broadcasting of the single scalar, calculate the cost as
       // the broadcast.
       const auto *It = find_if_not(VL, IsaPred<UndefValue>);
@@ -9401,7 +9280,9 @@ class BoUpSLP::ShuffleCostEstimator : public BaseShuffleAnalysis {
   InstructionCost createFreeze(InstructionCost Cost) { return Cost; }
   /// Finalize emission of the shuffles.
   InstructionCost
-  finalize(ArrayRef<int> ExtMask, unsigned VF = 0,
+  finalize(ArrayRef<int> ExtMask,
+           ArrayRef<std::pair<const TreeEntry *, unsigned>> SubVectors,
+           unsigned VF = 0,
            function_ref<void(Value *&, SmallVectorImpl<int> &)> Action = {}) {
     IsFinalized = true;
     if (Action) {
@@ -9419,6 +9300,29 @@ class BoUpSLP::ShuffleCostEstimator : public BaseShuffleAnalysis {
       Action(V, CommonMask);
       InVectors.front() = V;
     }
+    if (!SubVectors.empty()) {
+      const PointerUnion<Value *, const TreeEntry *> &Vec = InVectors.front();
+      if (InVectors.size() == 2)
+        Cost += createShuffle(Vec, InVectors.back(), CommonMask);
+      else
+        Cost += createShuffle(Vec, nullptr, CommonMask);
+      for (unsigned Idx = 0, Sz = CommonMask.size(); Idx < Sz; ++Idx)
+        if (CommonMask[Idx] != PoisonMaskElem)
+          CommonMask[Idx] = Idx;
+      for (const auto [E, Idx] : SubVectors) {
+        Cost += ::getShuffleCost(
+            TTI, TTI::SK_InsertSubvector,
+            FixedVectorType::get(ScalarTy, CommonMask.size()), std::nullopt,
+            CostKind, Idx,
+            FixedVectorType::get(ScalarTy, E->getVectorFactor()));
+        if (!CommonMask.empty()) {
+          std::iota(std::next(CommonMask.begin(), Idx),
+                    std::next(CommonMask.begin(), Idx + E->getVectorFactor()),
+                    Idx);
+        }
+      }
+    }
+
     ::addMask(CommonMask, ExtMask, /*ExtendingManyInputs=*/true);
     if (CommonMask.empty()) {
       assert(InVectors.size() == 1 && "Expected only one vector with no mask");
@@ -10942,8 +10846,31 @@ InstructionCost BoUpSLP::getTreeCost(ArrayRef<Value *> VectorizedVals) {
       if (CanBeUsedAsScalar) {
         InstructionCost ScalarCost = TTI->getInstructionCost(Inst, CostKind);
         bool KeepScalar = ScalarCost <= ExtraCost;
-        if (KeepScalar && ScalarCost != TTI::TCC_Free &&
-            ExtraCost - ScalarCost <= TTI::TCC_Basic) {
+        // Try to keep original scalar if the user is the phi node from the same
+        // block as the root phis, currently vectorized. It allows to keep
+        // better ordering info of PHIs, being vectorized currently.
+        bool IsProfitablePHIUser =
+            (KeepScalar || (ScalarCost - ExtraCost <= TTI::TCC_Basic &&
+                            VectorizableTree.front()->Scalars.size() > 2)) &&
+            VectorizableTree.front()->getOpcode() == Instruction::PHI &&
+            !Inst->hasNUsesOrMore(UsesLimit) &&
+            none_of(Inst->users(),
+                    [&](User *U) {
+                      auto *PHIUser = dyn_cast<PHINode>(U);
+                      return (!PHIUser ||
+                              PHIUser->getParent() !=
+                                  cast<Instruction>(
+                                      VectorizableTree.front()->getMainOp())
+                                      ->getParent()) &&
+                             !getTreeEntry(U);
+                    }) &&
+            count_if(Entry->Scalars, [&](Value *V) {
+              return ValueToExtUses->contains(V);
+            }) <= 2;
+        if (IsProfitablePHIUser) {
+          KeepScalar = true;
+        } else if (KeepScalar && ScalarCost != TTI::TCC_Free &&
+                   ExtraCost - ScalarCost <= TTI::TCC_Basic) {
           unsigned ScalarUsesCount = count_if(Entry->Scalars, [&](Value *V) {
             return ValueToExtUses->contains(V);
           });
@@ -12490,7 +12417,9 @@ class BoUpSLP::ShuffleInstructionBuilder final : public BaseShuffleAnalysis {
   /// \param Action the action (if any) to be performed before final applying of
   /// the \p ExtMask mask.
   Value *
-  finalize(ArrayRef<int> ExtMask, unsigned VF = 0,
+  finalize(ArrayRef<int> ExtMask,
+           ArrayRef<std::pair<const TreeEntry *, unsigned>> SubVectors,
+           unsigned VF = 0,
            function_ref<void(Value *&, SmallVectorImpl<int> &)> Action = {}) {
     IsFinalized = true;
     SmallVector<int> NewExtMask(ExtMask);
@@ -12524,6 +12453,29 @@ class BoUpSLP::ShuffleInstructionBuilder final : public BaseShuffleAnalysis {
       Action(Vec, CommonMask);
       InVectors.front() = Vec;
     }
+    if (!SubVectors.empty()) {
+      Value *Vec = InVectors.front();
+      if (InVectors.size() == 2) {
+        Vec = createShuffle(Vec, InVectors.back(), CommonMask);
+        InVectors.pop_back();
+      } else {
+        Vec = createShuffle(Vec, nullptr, CommonMask);
+      }
+      for (unsigned Idx = 0, Sz = CommonMask.size(); Idx < Sz; ++Idx)
+        if (CommonMask[Idx] != PoisonMaskElem)
+          CommonMask[Idx] = Idx;
+      for (const auto [E, Idx] : SubVectors) {
+        Vec = Builder.CreateInsertVector(
+            Vec->getType(), Vec, E->VectorizedValue, Builder.getInt64(Idx));
+        if (!CommonMask.empty()) {
+          std::iota(std::next(CommonMask.begin(), Idx),
+                    std::next(CommonMask.begin(), Idx + E->getVectorFactor()),
+                    Idx);
+        }
+      }
+      InVectors.front() = Vec;
+    }
+
     if (!ExtMask.empty()) {
       if (CommonMask.empty()) {
         CommonMask.assign(ExtMask.begin(), ExtMask.end());
@@ -12602,7 +12554,10 @@ Value *BoUpSLP::vectorizeOperand(TreeEntry *E, unsigned NodeIdx,
                              : ScalarTy,
             Builder, *this);
         ShuffleBuilder.add(V, Mask);
-        return ShuffleBuilder.finalize(std::nullopt);
+        SmallVector<std::pair<const TreeEntry *, unsigned>> SubVectors;
+        for (const auto [EIdx, Idx] : E->CombinedEntriesWithIndices)
+          SubVectors.emplace_back(VectorizableTree[EIdx].get(), Idx);
+        return ShuffleBuilder.finalize(std::nullopt, SubVectors);
       };
       Value *V = vectorizeTree(VE, PostponedPHIs);
       if (VF * getNumElements(VL[0]->getType()) !=
@@ -12685,6 +12640,14 @@ ResTy BoUpSLP::processBuildVector(const TreeEntry *E, Type *ScalarTy,
   SmallVector<int> ReuseShuffleIndices(E->ReuseShuffleIndices.begin(),
                                        E->ReuseShuffleIndices.end());
   SmallVector<Value *> GatheredScalars(E->Scalars.begin(), E->Scalars.end());
+  // Clear values, to be replaced by insertvector instructions.
+  for (const auto [EIdx, Idx] : E->CombinedEntriesWithIndices)
+    for_each(MutableArrayRef(GatheredScalars)
+                 .slice(Idx, VectorizableTree[EIdx]->getVectorFactor()),
+             [&](Value *&V) { V = PoisonValue::get(V->getType()); });
+  SmallVector<std::pair<const TreeEntry *, unsigned>> SubVectors;
+  for (const auto [EIdx, Idx] : E->CombinedEntriesWithIndices)
+    SubVectors.emplace_back(VectorizableTree[EIdx].get(), Idx);
   // Build a mask out of the reorder indices and reorder scalars per this
   // mask.
   SmallVector<int> ReorderMask;
@@ -12822,7 +12785,7 @@ ResTy BoUpSLP::processBuildVector(const TreeEntry *E, Type *ScalarTy,
           }
         }
         ShuffleBuilder.add(*FrontTE, Mask);
-        Res = ShuffleBuilder.finalize(E->getCommonMask());
+        Res = ShuffleBuilder.finalize(E->getCommonMask(), SubVectors);
         return Res;
       }
       if (!Resized) {
@@ -13079,10 +13042,10 @@ ResTy BoUpSLP::processBuildVector(const TreeEntry *E, Type *ScalarTy,
                  (IsSingleShuffle && ((IsIdentityShuffle &&
                   IsNonPoisoned) || IsUsedInExpr) && isa<UndefValue>(V));
         }))
-      Res = ShuffleBuilder.finalize(E->ReuseShuffleIndices);
+      Res = ShuffleBuilder.finalize(E->ReuseShuffleIndices, SubVectors);
     else
       Res = ShuffleBuilder.finalize(
-          E->ReuseShuffleIndices, E->Scalars.size(),
+          E->ReuseShuffleIndices, SubVectors, E->Scalars.size(),
           [&](Value *&Vec, SmallVectorImpl<int> &Mask) {
             TryPackScalars(NonConstant...
[truncated]
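
The mask bookkeeping that both `finalize` overloads perform after inserting subvectors can be modelled in isolation. The following is a minimal sketch, assuming each subvector is described by a hypothetical `(start lane, width)` pair; the patch itself passes `(const TreeEntry *, unsigned)` pairs and takes the width from `E->getVectorFactor()`. `updateMaskForSubvectors` is an illustrative name, not a function in SLPVectorizer.cpp.

```cpp
#include <cstddef>
#include <numeric>
#include <utility>
#include <vector>

// Same sentinel value LLVM uses for a poison shuffle-mask lane.
constexpr int PoisonMaskElem = -1;

// Models the CommonMask update: after the pending shuffle is materialized,
// every defined lane refers to itself, and every lane covered by an inserted
// subvector becomes defined with an identity index.
std::vector<int> updateMaskForSubvectors(
    std::vector<int> CommonMask,
    const std::vector<std::pair<unsigned, unsigned>> &SubVectors) {
  // Step 1: the shuffle has been emitted, so defined lanes map to themselves.
  for (std::size_t I = 0; I < CommonMask.size(); ++I)
    if (CommonMask[I] != PoisonMaskElem)
      CommonMask[I] = static_cast<int>(I);
  // Step 2: each inserted subvector fills lanes [Idx, Idx + Width).
  for (auto [Idx, Width] : SubVectors) {
    // Guard added for this sketch only: skip ranges that do not fit.
    if (CommonMask.empty() || Idx + Width > CommonMask.size())
      continue;
    std::iota(CommonMask.begin() + Idx, CommonMask.begin() + Idx + Width,
              static_cast<int>(Idx));
  }
  return CommonMask;
}
```

For a mask `{-1,-1,-1,-1,5,4,-1,-1}` with subvectors at `(0,4)` and `(6,2)`, the model returns the full identity mask `{0,1,2,3,4,5,6,7}`: lanes 4 and 5 come from the emitted shuffle and the rest from the inserted subvectors.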

alexey-bataev · 2024-08-18T15:22:01Z

Ping!

alexey-bataev · 2024-08-20T18:34:28Z

Ping!

RKSimon

LGTM

petrhosek · 2024-08-22T23:12:37Z

This change appears to have broken the UBSan runtime build with the following error:

FAILED: compiler-rt/lib/ubsan_minimal/CMakeFiles/RTUbsan_minimal.aarch64.dir/ubsan_minimal_handlers.cpp.o 
/Volumes/Work/s/w/ir/x/w/llvm_build/./bin/clang++ --target=aarch64-unknown-linux-gnu --sysroot=/Volumes/Work/s/w/ir/x/w/cipd/linux -D_DEBUG -D_GLIBCXX_ASSERTIONS -D__STDC_CONSTANT_MACROS -D__STDC_FORMAT_MACROS -D__STDC_LIMIT_MACROS -I/Volumes/Work/s/w/ir/x/w/llvm-llvm-project/compiler-rt/lib/ubsan_minimal/.. --target=aarch64-unknown-linux-gnu -fPIC -fno-semantic-interposition -fvisibility-inlines-hidden -Werror=date-time -Werror=unguarded-availability-new -Wall -Wextra -Wno-unused-parameter -Wwrite-strings -Wcast-qual -Wmissing-field-initializers -Wimplicit-fallthrough -Wcovered-switch-default -Wno-noexcept-type -Wnon-virtual-dtor -Wdelete-non-virtual-dtor -Wsuggest-override -Wstring-conversion -Wmisleading-indentation -Wctad-maybe-unsupported -fdiagnostics-color -ffunction-sections -fdata-sections -ffile-prefix-map=/Volumes/Work/s/w/ir/x/w/llvm_build/runtimes/runtimes-aarch64-unknown-linux-gnu-bins=../../../llvm-llvm-project -ffile-prefix-map=/Volumes/Work/s/w/ir/x/w/llvm-llvm-project/= -no-canonical-prefixes -Wall -Wno-unused-parameter -O2 -g -DNDEBUG -std=c++17 -fPIC -fno-builtin -fno-exceptions -fomit-frame-pointer -funwind-tables -fno-stack-protector -fno-sanitize=safe-stack -fvisibility=hidden -fno-lto -Wthread-safety -Wthread-safety-reference -Wthread-safety-beta -O3 -gline-tables-only -Wno-gnu -Wno-variadic-macros -Wno-c99-extensions -ftrivial-auto-var-init=pattern -nostdinc++ -DSANITIZER_COMMON_NO_REDEFINE_BUILTINS -fno-rtti -MD -MT compiler-rt/lib/ubsan_minimal/CMakeFiles/RTUbsan_minimal.aarch64.dir/ubsan_minimal_handlers.cpp.o -MF compiler-rt/lib/ubsan_minimal/CMakeFiles/RTUbsan_minimal.aarch64.dir/ubsan_minimal_handlers.cpp.o.d -o compiler-rt/lib/ubsan_minimal/CMakeFiles/RTUbsan_minimal.aarch64.dir/ubsan_minimal_handlers.cpp.o -c /Volumes/Work/s/w/ir/x/w/llvm-llvm-project/compiler-rt/lib/ubsan_minimal/ubsan_minimal_handlers.cpp
Assertion failed: (isValidOperands(V1, V2, Mask) && "Invalid shuffle vector instruction operands!"), function ShuffleVectorInst, file Instructions.cpp, line 1700.
PLEASE submit a bug report to https://github.com/llvm/llvm-project/issues/ and include the crash backtrace, preprocessed source, and associated run script.
Stack dump:
0.	Program arguments: /Volumes/Work/s/w/ir/x/w/llvm_build/./bin/clang++ --target=aarch64-unknown-linux-gnu --sysroot=/Volumes/Work/s/w/ir/x/w/cipd/linux -D_DEBUG -D_GLIBCXX_ASSERTIONS -D__STDC_CONSTANT_MACROS -D__STDC_FORMAT_MACROS -D__STDC_LIMIT_MACROS -I/Volumes/Work/s/w/ir/x/w/llvm-llvm-project/compiler-rt/lib/ubsan_minimal/.. --target=aarch64-unknown-linux-gnu -fPIC -fno-semantic-interposition -fvisibility-inlines-hidden -Werror=date-time -Werror=unguarded-availability-new -Wall -Wextra -Wno-unused-parameter -Wwrite-strings -Wcast-qual -Wmissing-field-initializers -Wimplicit-fallthrough -Wcovered-switch-default -Wno-noexcept-type -Wnon-virtual-dtor -Wdelete-non-virtual-dtor -Wsuggest-override -Wstring-conversion -Wmisleading-indentation -Wctad-maybe-unsupported -fdiagnostics-color -ffunction-sections -fdata-sections -ffile-prefix-map=/Volumes/Work/s/w/ir/x/w/llvm_build/runtimes/runtimes-aarch64-unknown-linux-gnu-bins=../../../llvm-llvm-project -ffile-prefix-map=/Volumes/Work/s/w/ir/x/w/llvm-llvm-project/= -no-canonical-prefixes -Wall -Wno-unused-parameter -O2 -g -DNDEBUG -std=c++17 -fPIC -fno-builtin -fno-exceptions -fomit-frame-pointer -funwind-tables -fno-stack-protector -fno-sanitize=safe-stack -fvisibility=hidden -fno-lto -Wthread-safety -Wthread-safety-reference -Wthread-safety-beta -O3 -gline-tables-only -Wno-gnu -Wno-variadic-macros -Wno-c99-extensions -ftrivial-auto-var-init=pattern -nostdinc++ -DSANITIZER_COMMON_NO_REDEFINE_BUILTINS -fno-rtti -MD -MT compiler-rt/lib/ubsan_minimal/CMakeFiles/RTUbsan_minimal.aarch64.dir/ubsan_minimal_handlers.cpp.o -MF compiler-rt/lib/ubsan_minimal/CMakeFiles/RTUbsan_minimal.aarch64.dir/ubsan_minimal_handlers.cpp.o.d -o compiler-rt/lib/ubsan_minimal/CMakeFiles/RTUbsan_minimal.aarch64.dir/ubsan_minimal_handlers.cpp.o -c /Volumes/Work/s/w/ir/x/w/llvm-llvm-project/compiler-rt/lib/ubsan_minimal/ubsan_minimal_handlers.cpp
1.	<eof> parser at end of file
2.	Optimizer
3.	Running pass "function<eager-inv>(float2int,lower-constant-intrinsics,chr,loop(loop-rotate<header-duplication;no-prepare-for-lto>,loop-deletion),loop-distribute,inject-tli-mappings,loop-vectorize<no-interleave-forced-only;no-vectorize-forced-only;>,infer-alignment,loop-load-elim,instcombine<max-iterations=1;no-use-loop-info;no-verify-fixpoint>,simplifycfg<bonus-inst-threshold=1;forward-switch-cond;switch-range-to-icmp;switch-to-lookup;no-keep-loops;hoist-common-insts;sink-common-insts;speculate-blocks;simplify-cond-branch;no-speculate-unpredictables>,slp-vectorizer,vector-combine,instcombine<max-iterations=1;no-use-loop-info;no-verify-fixpoint>,loop-unroll<O3>,transform-warning,sroa<preserve-cfg>,infer-alignment,instcombine<max-iterations=1;no-use-loop-info;no-verify-fixpoint>,loop-mssa(licm<allowspeculation>),alignment-from-assumptions,loop-sink,instsimplify,div-rem-pairs,tailcallelim,simplifycfg<bonus-inst-threshold=1;no-forward-switch-cond;switch-range-to-icmp;no-switch-to-lookup;keep-loops;no-hoist-common-insts;no-sink-common-insts;speculate-blocks;simplify-cond-branch;speculate-unpredictables>)" on module "/Volumes/Work/s/w/ir/x/w/llvm-llvm-project/compiler-rt/lib/ubsan_minimal/ubsan_minimal_handlers.cpp"
4.	Running pass "instcombine<max-iterations=1;no-use-loop-info;no-verify-fixpoint>" on function "_ZL12decorate_msgPcm"
 #0 0x000000010b9daf68 llvm::sys::PrintStackTrace(llvm::raw_ostream&, int) (/Volumes/Work/s/w/ir/x/w/llvm_build/bin/llvm+0x106234f68)
 #1 0x000000010b9d8c79 llvm::sys::RunSignalHandlers() (/Volumes/Work/s/w/ir/x/w/llvm_build/bin/llvm+0x106232c79)
 #2 0x000000010b9da4ce llvm::sys::CleanupOnSignal(unsigned long) (/Volumes/Work/s/w/ir/x/w/llvm_build/bin/llvm+0x1062344ce)
 #3 0x000000010b92ac5e CrashRecoverySignalHandler(int) (/Volumes/Work/s/w/ir/x/w/llvm_build/bin/llvm+0x106184c5e)
 #4 0x00007ff80f6d95ed (/usr/lib/system/libsystem_platform.dylib+0x7ff8004265ed)
 #5 0x000060000355bbe0 
 #6 0x00007ff80f5d2b45 (/usr/lib/system/libsystem_c.dylib+0x7ff80031fb45)
 #7 0x00007ff80f5d1e5e (/usr/lib/system/libsystem_c.dylib+0x7ff80031ee5e)
 #8 0x000000010b666d9a llvm::ShuffleVectorInst::ShuffleVectorInst(llvm::Value*, llvm::Value*, llvm::ArrayRef<int>, llvm::Twine const&, llvm::InsertPosition) (/Volumes/Work/s/w/ir/x/w/llvm_build/bin/llvm+0x105ec0d9a)
 #9 0x00000001063bf0e7 llvm::IRBuilderBase::CreateShuffleVector(llvm::Value*, llvm::Value*, llvm::ArrayRef<int>, llvm::Twine const&) (/Volumes/Work/s/w/ir/x/w/llvm_build/bin/llvm+0x100c190e7)
#10 0x000000010a96ff2b llvm::InstCombinerImpl::visitCallInst(llvm::CallInst&) (/Volumes/Work/s/w/ir/x/w/llvm_build/bin/llvm+0x1051c9f2b)
#11 0x000000010a9253fe llvm::InstCombinerImpl::run() (/Volumes/Work/s/w/ir/x/w/llvm_build/bin/llvm+0x10517f3fe)
#12 0x000000010a92844f combineInstructionsOverFunction(llvm::Function&, llvm::InstructionWorklist&, llvm::AAResults*, llvm::AssumptionCache&, llvm::TargetLibraryInfo&, llvm::TargetTransformInfo&, llvm::DominatorTree&, llvm::OptimizationRemarkEmitter&, llvm::BlockFrequencyInfo*, llvm::BranchProbabilityInfo*, llvm::ProfileSummaryInfo*, llvm::LoopInfo*, llvm::InstCombineOptions const&) (/Volumes/Work/s/w/ir/x/w/llvm_build/bin/llvm+0x10518244f)
#13 0x000000010a927c08 llvm::InstCombinePass::run(llvm::Function&, llvm::AnalysisManager<llvm::Function>&) (/Volumes/Work/s/w/ir/x/w/llvm_build/bin/llvm+0x105181c08)
#14 0x00000001065d7e02 llvm::detail::PassModel<llvm::Function, llvm::InstCombinePass, llvm::AnalysisManager<llvm::Function>>::run(llvm::Function&, llvm::AnalysisManager<llvm::Function>&) (/Volumes/Work/s/w/ir/x/w/llvm_build/bin/llvm+0x100e31e02)
#15 0x000000010b6cb305 llvm::PassManager<llvm::Function, llvm::AnalysisManager<llvm::Function>>::run(llvm::Function&, llvm::AnalysisManager<llvm::Function>&) (/Volumes/Work/s/w/ir/x/w/llvm_build/bin/llvm+0x105f25305)
#16 0x00000001065d9be2 llvm::detail::PassModel<llvm::Function, llvm::PassManager<llvm::Function, llvm::AnalysisManager<llvm::Function>>, llvm::AnalysisManager<llvm::Function>>::run(llvm::Function&, llvm::AnalysisManager<llvm::Function>&) (/Volumes/Work/s/w/ir/x/w/llvm_build/bin/llvm+0x100e33be2)
#17 0x000000010b6ce001 llvm::ModuleToFunctionPassAdaptor::run(llvm::Module&, llvm::AnalysisManager<llvm::Module>&) (/Volumes/Work/s/w/ir/x/w/llvm_build/bin/llvm+0x105f28001)
#18 0x00000001065d3692 llvm::detail::PassModel<llvm::Module, llvm::ModuleToFunctionPassAdaptor, llvm::AnalysisManager<llvm::Module>>::run(llvm::Module&, llvm::AnalysisManager<llvm::Module>&) (/Volumes/Work/s/w/ir/x/w/llvm_build/bin/llvm+0x100e2d692)
#19 0x000000010b6ca505 llvm::PassManager<llvm::Module, llvm::AnalysisManager<llvm::Module>>::run(llvm::Module&, llvm::AnalysisManager<llvm::Module>&) (/Volumes/Work/s/w/ir/x/w/llvm_build/bin/llvm+0x105f24505)
#20 0x00000001065d180a (anonymous namespace)::EmitAssemblyHelper::RunOptimizationPipeline(clang::BackendAction, std::__2::unique_ptr<llvm::raw_pwrite_stream, std::__2::default_delete<llvm::raw_pwrite_stream>>&, std::__2::unique_ptr<llvm::ToolOutputFile, std::__2::default_delete<llvm::ToolOutputFile>>&, clang::BackendConsumer*) (/Volumes/Work/s/w/ir/x/w/llvm_build/bin/llvm+0x100e2b80a)
#21 0x00000001065c8d83 clang::EmitBackendOutput(clang::DiagnosticsEngine&, clang::HeaderSearchOptions const&, clang::CodeGenOptions const&, clang::TargetOptions const&, clang::LangOptions const&, llvm::StringRef, llvm::Module*, clang::BackendAction, llvm::IntrusiveRefCntPtr<llvm::vfs::FileSystem>, std::__2::unique_ptr<llvm::raw_pwrite_stream, std::__2::default_delete<llvm::raw_pwrite_stream>>, clang::BackendConsumer*) (/Volumes/Work/s/w/ir/x/w/llvm_build/bin/llvm+0x100e22d83)
#22 0x00000001065bf078 clang::BackendConsumer::HandleTranslationUnit(clang::ASTContext&) (/Volumes/Work/s/w/ir/x/w/llvm_build/bin/llvm+0x100e19078)
#23 0x0000000107cf73e9 clang::ParseAST(clang::Sema&, bool, bool) (/Volumes/Work/s/w/ir/x/w/llvm_build/bin/llvm+0x1025513e9)
#24 0x00000001068e04c3 clang::FrontendAction::Execute() (/Volumes/Work/s/w/ir/x/w/llvm_build/bin/llvm+0x10113a4c3)
#25 0x000000010684070d clang::CompilerInstance::ExecuteAction(clang::FrontendAction&) (/Volumes/Work/s/w/ir/x/w/llvm_build/bin/llvm+0x10109a70d)
#26 0x00000001069cae18 clang::ExecuteCompilerInvocation(clang::CompilerInstance*) (/Volumes/Work/s/w/ir/x/w/llvm_build/bin/llvm+0x101224e18)
#27 0x000000010585d749 cc1_main(llvm::ArrayRef<char const*>, char const*, void*) (/Volumes/Work/s/w/ir/x/w/llvm_build/bin/llvm+0x1000b7749)
#28 0x0000000105859f5b ExecuteCC1Tool(llvm::SmallVectorImpl<char const*>&, llvm::ToolContext const&) (/Volumes/Work/s/w/ir/x/w/llvm_build/bin/llvm+0x1000b3f5b)
#29 0x0000000106689f0e void llvm::function_ref<void ()>::callback_fn<clang::driver::CC1Command::Execute(llvm::ArrayRef<std::__2::optional<llvm::StringRef>>, std::__2::basic_string<char, std::__2::char_traits<char>, std::__2::allocator<char>>*, bool*) const::$_0>(long) (/Volumes/Work/s/w/ir/x/w/llvm_build/bin/llvm+0x100ee3f0e)
#30 0x000000010b92a97e llvm::CrashRecoveryContext::RunSafely(llvm::function_ref<void ()>) (/Volumes/Work/s/w/ir/x/w/llvm_build/bin/llvm+0x10618497e)
#31 0x00000001066897ba clang::driver::CC1Command::Execute(llvm::ArrayRef<std::__2::optional<llvm::StringRef>>, std::__2::basic_string<char, std::__2::char_traits<char>, std::__2::allocator<char>>*, bool*) const (/Volumes/Work/s/w/ir/x/w/llvm_build/bin/llvm+0x100ee37ba)
#32 0x00000001066463ca clang::driver::Compilation::ExecuteCommand(clang::driver::Command const&, clang::driver::Command const*&, bool) const (/Volumes/Work/s/w/ir/x/w/llvm_build/bin/llvm+0x100ea03ca)
#33 0x00000001066466ef clang::driver::Compilation::ExecuteJobs(clang::driver::JobList const&, llvm::SmallVectorImpl<std::__2::pair<int, clang::driver::Command const*>>&, bool) const (/Volumes/Work/s/w/ir/x/w/llvm_build/bin/llvm+0x100ea06ef)
#34 0x0000000106667f20 clang::driver::Driver::ExecuteCompilation(clang::driver::Compilation&, llvm::SmallVectorImpl<std::__2::pair<int, clang::driver::Command const*>>&) (/Volumes/Work/s/w/ir/x/w/llvm_build/bin/llvm+0x100ec1f20)
#35 0x00000001058593e3 clang_main(int, char**, llvm::ToolContext const&) (/Volumes/Work/s/w/ir/x/w/llvm_build/bin/llvm+0x1000b33e3)
#36 0x0000000105bc2bc6 findTool(int, char**, char const*) (/Volumes/Work/s/w/ir/x/w/llvm_build/bin/llvm+0x10041cbc6)
#37 0x0000000105bc2280 main (/Volumes/Work/s/w/ir/x/w/llvm_build/bin/llvm+0x10041c280)
#38 0x00007ff80f35141f 
clang++: error: clang frontend command failed with exit code 134 (use -v to see invocation)

Would it be possible to take a look and revert the change if the issue cannot be addressed quickly?

alexey-bataev · 2024-08-22T23:30:34Z

There is something wrong with the instcombiner, will check it tomorrow

…ing" (#105780) with "[Vectorize] Fix warnings" It introduced compiler crashes, see #104144. This reverts commit 69332bb and 351f4a5.

mstorsjo · 2024-08-23T07:08:47Z

I also ran into assert failures due to this, on both aarch64 and i686 in the wild, and a reduced snippet also triggers on x86_64.

Reproducible with the following reduced snippet:

union a {
  short b;
};
int c, d, e;
int *f;
unsigned h(int i, int j) {
  if (i & ~j)
    return i & 1;
  return i;
}
void k(short *i) {
  int l, m, n;
  for (;;) {
    int o;
    e += f[d];
    n += o = c;
    m = h(e, 6);
    short g = m << 8;
    ((union a *)&i[0])->b = g;
    l = h(o, 6);
    g = l << 8;
    ((union a *)&i[1])->b = g;
    ((union a *)&i[2])->b = g = n << 8;
    ((union a *)&i[3])->b = g;
  }
}

Compiled with:

$ clang -target aarch64-linux-gnu -c -O2 repro.c

Also reproduces with i686-linux-gnu and x86_64-linux-gnu.

mstorsjo · 2024-08-23T07:11:01Z

Second reproducer:

int a(char *b, int c) {
  int d, e, f = d = 0;
  for (; d < 3; d++) {
    e = 0;
    for (; e < 8; e++)
      f += -b[e] - b[e + c] >> 31;
    b += c;
  }
  return f;
}

This reproduces with i686 and x86_64, but not on aarch64.

RKSimon · 2024-08-23T09:50:09Z

In this one, SLP is producing the illegal instruction:

%51 = call <8 x i16> @llvm.vector.insert.v8i16.v4i32(<8 x i16> %50, <4 x i32> %41, i64 4)

https://simd.godbolt.org/z/1hY5e9xTY
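
To make the problem with that call concrete, here is a minimal sketch of the checks a fixed-width `llvm.vector.insert` has to pass: matching element types, an index that is a multiple of the subvector length, and no overrun of the destination. `FixedVecTy` and `checkVectorInsert` are hypothetical names for illustration only, not LLVM APIs, and modelling the element type by its bit width alone is a simplifying assumption.

```cpp
#include <cstdint>
#include <string>

// Hypothetical, simplified model of a fixed-width vector type.
// Assumption: the element type is identified only by its bit width.
struct FixedVecTy {
  unsigned ElemBits; // element width in bits, e.g. 16 for i16
  unsigned NumElems; // number of elements, e.g. 8 for <8 x i16>
};

// Returns an empty string if inserting Sub into Vec at element index Idx
// passes the three checks, otherwise a short reason for the rejection.
std::string checkVectorInsert(FixedVecTy Vec, FixedVecTy Sub, uint64_t Idx) {
  if (Sub.NumElems == 0)
    return "empty subvector";
  if (Vec.ElemBits != Sub.ElemBits)
    return "element types differ";
  if (Idx % Sub.NumElems != 0)
    return "index is not a multiple of the subvector length";
  if (Idx + Sub.NumElems > Vec.NumElems)
    return "subvector overruns the destination vector";
  return "";
}
```

With the types from the call above, `checkVectorInsert({16, 8}, {32, 4}, 4)` is rejected because the element types differ (`i16` destination, `i32` subvector).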

The SLP vectorizer has an estimation for gather/buildvector nodes that contain some scalar loads. It performs a very similar (but large in SLOC) estimation, which is not always correct. Instead, this patch implements clustering analysis and actual node allocation, with full analysis of the vectorized clustered scalars (not only loads, but also some other instructions), correct cost estimation, and vector insert instructions. This improves overall vectorization quality and simplifies analysis/estimations. Reviewers: RKSimon Reviewed By: RKSimon Pull Request: #104144

tcwzxx · 2024-08-26T09:56:18Z

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp

@@ -13122,6 +13069,8 @@ ResTy BoUpSLP::processBuildVector(const TreeEntry *E, Type *ScalarTy,
 }

 Value *BoUpSLP::createBuildVector(const TreeEntry *E, Type *ScalarTy) {
+  for (const auto [EIdx, _] : E->CombinedEntriesWithIndices)
+    (void)vectorizeTree(VectorizableTree[EIdx].get(), /*PostponedPHIs=*/false);


I noticed issue #105904. This patch breaks the precondition in vectorizeOperand that a gather node must have only a single user.

dklimkin · 2024-08-28T09:04:36Z

We are hitting an assert error with the last reland:

assert.h assertion failed at llvm/lib/IR/Instructions.cpp:1854 in bool isSingleSourceMaskImpl(ArrayRef<int>, int): I >= 0 && I < (NumOpElts * 2) && "Out-of-bounds shuffle mask element" in https://github.com/openxla/xla/blob/main/xla/service/gpu/fusions/triton/triton_fusion_emitter_parametrized_test.cc#L65

I'll see if we can create a simpler reproducer, but we may need to revert in the meantime.

alexey-bataev · 2024-08-28T13:35:15Z

Need a reproducer

dklimkin · 2024-08-29T14:58:28Z

Unfortunately it's in a very large generated block and we were not able to isolate it to an external repro yet.

One pointer is, dropping lines 12657-12661 from SLPVectorizer.cpp fixes our test. Could you take a look if the ranges here, and the values set look correct?

Update: corrected line number.

alexey-bataev · 2024-08-29T15:07:22Z

Could you provide the full stack trace? Also, large code is not a problem; attach it and I will reduce it myself.

dklimkin · 2024-08-29T15:26:45Z

Stack trace:

PC: @     0x7f902096334c  (unknown)  llvm::slpvectorizer::BoUpSLP::ShuffleInstructionBuilder::finalize()
    @     0x7f905174fe80  (unknown)  (unknown)
    @     0x7f902096334c        288  llvm::slpvectorizer::BoUpSLP::ShuffleInstructionBuilder::finalize()
    @     0x7f9020934fab        288  llvm::slpvectorizer::BoUpSLP::vectorizeOperand()::$_5::operator()()
    @     0x7f9020931478        240  llvm::slpvectorizer::BoUpSLP::vectorizeOperand()
    @     0x7f9020931a8c       2400  llvm::slpvectorizer::BoUpSLP::vectorizeTree()
    @     0x7f90209311ca        240  llvm::slpvectorizer::BoUpSLP::vectorizeOperand()
    @     0x7f9020931bd9       2400  llvm::slpvectorizer::BoUpSLP::vectorizeTree()
    @     0x7f9020939fb7        736  llvm::slpvectorizer::BoUpSLP::vectorizeTree()
    @     0x7f9020939e17        160  llvm::slpvectorizer::BoUpSLP::vectorizeTree()
    @     0x7f902094ce10       1248  llvm::SLPVectorizerPass::tryToVectorizeList()
    @     0x7f902095201d        304  tryToVectorizeSequence<>()
    @     0x7f902097a9ee        240  llvm::SLPVectorizerPass::vectorizeCmpInsts<>()
    @     0x7f9020948469        640  llvm::SLPVectorizerPass::vectorizeChainsInBlock()
    @     0x7f9020945a11       4848  llvm::SLPVectorizerPass::runImpl()
    @     0x7f902094551c        160  llvm::SLPVectorizerPass::run()
    @     0x7f902565b252         32  llvm::detail::PassModel<>::run()
    @     0x7f8febfa8f19        272  llvm::PassManager<>::run()
    @     0x7f904d8e1152         32  llvm::detail::PassModel<>::run()
    @     0x7f8febfad2a9        320  llvm::ModuleToFunctionPassAdaptor::run()
    @     0x7f904d8e0f12         32  llvm::detail::PassModel<>::run()
    @     0x7f8febfa7e69        272  llvm::PassManager<>::run()
    @     0x7f903bbed862       2880  std::__u::__function::__policy_invoker<>::__call_impl<>()
    @     0x7f9077c1475e        208  xla::gpu::TranslateLLVMToLLVMIR()
    @     0x7f9077c16f2d        656  xla::gpu::CompileTritonToLLVM()
    @     0x7f9077c162cf        336  xla::gpu::TritonWrapper()
    @     0x7f9078036bbb        448  xla::gpu::TritonFusion::GenerateTritonKernelAndWrapper()
    @     0x7f907803852d        688  std::__u::__function::__policy_invoker<>::__call_impl<>()
    @     0x7f90186f3e47        384  xla::gpu::KernelReuseCache::GetWithStatus()
    @     0x7f90186f3c89        128  xla::gpu::KernelReuseCache::GetWithStatus()
    @     0x7f90780374a4        480  xla::gpu::TritonFusion::Emit()
    @     0x7f91aa629614        464  xla::gpu::IrEmitterUnnested::EmitFusion()
    @     0x7f91aa61829b         96  xla::gpu::IrEmitterUnnested::EmitHloComputation()
    @     0x7f91b9e378f0       1824  xla::gpu::CompileModuleToLlvmIr()
    @     0x7f91ba68b22d       2320  xla::gpu::GpuCompiler::CompileToBackendResult()
    @     0x7f91ba68d599       3280  xla::gpu::GpuCompiler::RunBackend()
    @     0x7f908e3bc71c        592  xla::gpu::AutotunerCompileUtil::Compile()
    @     0x7f908e7e6296        240  xla::gpu::GemmFusionAutotunerImpl::CompileAll()::$_6::operator()()
    @     0x7f908e7fa18d        288  std::__u::__function::__policy_invoker<>::__call_impl<>()
    @     0x7f8e64a87efa        112  Eigen::ThreadPoolTempl<>::WorkerLoop()
    @     0x7f8e64a879fa         48  absl::internal_any_invocable::RemoteInvoker<>()
    @     0x7f8c7d3b3c93        256  Thread::ThreadBody()
    @     0x7f90517467db        192  start_thread
    @     0x7f903b33d05f  (unknown)  clone

Looks like an invalid mask here: the operands have only 4 elements each, so valid indices are 0 to 7, but the mask contains 124 and 126:

Invalid LLVM IR before optimizations:
Invalid shufflevector operands!
  %55 = shufflevector <4 x i32> %35, <4 x i32> poison, <128 x i32> <i32 0, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 124, i32 1, i32 126, i32 1>

(The test in question is this: https://github.com/openxla/xla/blob/main/xla/service/gpu/fusions/triton/triton_fusion_emitter_parametrized_test.cc#L65 , if you could run the test, you'll probably be able to collect more useful information. This is significantly out of my normal problem area).

I got ~800 MB of IR generated and I can't confirm whether I can share all of it.
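
For reference, the bound that the failing assertion enforces (`I >= 0 && I < (NumOpElts * 2)`) can be checked on a dumped mask with a small standalone helper. This is only a sketch: `findOutOfBoundsMaskElems` is a hypothetical name, not an LLVM function, and it assumes that `-1` marks a poison mask element, as `PoisonMaskElem` does in LLVM.

```cpp
#include <cstddef>
#include <vector>

// Assumption: -1 marks a poison mask element, mirroring LLVM's PoisonMaskElem.
constexpr int PoisonElem = -1;

// Returns the positions of all mask elements that fall outside
// [0, 2 * NumOpElts), skipping poison elements.
std::vector<std::size_t> findOutOfBoundsMaskElems(const std::vector<int> &Mask,
                                                  int NumOpElts) {
  std::vector<std::size_t> Bad;
  for (std::size_t Pos = 0; Pos < Mask.size(); ++Pos) {
    const int I = Mask[Pos];
    if (I == PoisonElem)
      continue; // poison selects no source element, so it has no bound
    if (I < 0 || I >= NumOpElts * 2)
      Bad.push_back(Pos);
  }
  return Bad;
}
```

For the 128-element mask above with `NumOpElts = 4`, this flags positions 124 and 126, which hold the values 124 and 126.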

alexey-bataev · 2024-08-29T15:44:20Z

Unfortunately, that does not help; I need a reproducer.

rupprecht · 2024-08-30T03:05:38Z

Reduced as #106655

alexey-bataev · 2024-08-30T12:29:02Z

Thanks, fixed

SLP vectorizer has an estimation for gather/buildvector nodes, which contain some scalar loads. SLP vectorizer performs pretty similar (but large in SLOCs) estimation, which not always correct. Instead, this patch implements clustering analysis and actual node allocation with the full analysis for the vectorized clustered scalars (not only loads, but also some other instructions) with the correct cost estimation and vector insert instructions. Improves overall vectorization quality and simplifies analysis/estimations. Reviewers: RKSimon Reviewed By: RKSimon Pull Request: llvm#104144

fmayer · 2024-09-13T16:20:22Z

This CL seems to have changed behavior: https://gist.github.com/fmayer/6030063dfacd2abfb0898a5855949034

This IR here generates a program that returns 0 before, and 1 after this change. It seems related to inline asm instructions.

alexey-bataev · 2024-09-13T18:32:00Z

This should be fixed in c13bf6d.

fmayer · 2024-09-13T18:58:21Z

I can confirm this fixed it.

[spr] initial version

92b3a6a

Created using spr 1.3.5

llvmbot added backend:SystemZ vectorizers llvm:transforms labels Aug 14, 2024

alexey-bataev requested a review from RKSimon August 15, 2024 19:37

Rebase, preallocate storage for submasks

a421006

Created using spr 1.3.5

RKSimon reviewed Aug 21, 2024

Rebase, address comments

6c92fa4

Created using spr 1.3.5

RKSimon approved these changes Aug 22, 2024

alexey-bataev merged commit 69332bb into main Aug 22, 2024
8 checks passed

alexey-bataev deleted the users/alexey-bataev/spr/slpimprovefix-subvectors-in-gatherbuildvector-nodes-handling branch August 22, 2024 15:24

vitalybuka mentioned this pull request Aug 23, 2024

Revert "[SLP]Improve/fix subvectors in gather/buildvector nodes handling" #105780

Merged

vitalybuka added a commit that referenced this pull request Aug 23, 2024

Revert "[SLP]Improve/fix subvectors in gather/buildvector nodes handling" (#105780) with "[Vectorize] Fix warnings"

96b3166

It introduced compiler crashes, see #104144. This reverts commit 69332bb and 351f4a5.

tcwzxx reviewed Aug 26, 2024

rupprecht mentioned this pull request Aug 30, 2024

[SLP] Assertion `I >= 0 && I < (NumOpElts * 2) && "Out-of-bounds shuffle mask element"' failed. #106655

Closed
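
The assertion in #106655 requires every shuffle mask element `I` to satisfy `I >= 0 && I < (NumOpElts * 2)`. The `shufflevector` quoted earlier in the thread takes `<4 x i32>` operands, so only indices 0 to 7 are in range, yet its mask contains `i32 124` and `i32 126`. Below is a minimal, standalone C++ sketch of that bounds check. The helper names `firstOutOfBoundsMaskElt` and `makeReportedMask` are hypothetical and not part of LLVM, and accepting `-1` as a poison lane is an assumption of this sketch rather than something the assertion itself allows.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <vector>

// Hypothetical helper (not an LLVM API): returns the position of the first
// mask element that cannot index into the concatenation of two operands with
// NumOpElts elements each, or std::nullopt if the whole mask is in range.
// Assumption: -1 marks a poison lane and is accepted.
std::optional<std::size_t>
firstOutOfBoundsMaskElt(const std::vector<int> &Mask, int NumOpElts) {
  assert(NumOpElts > 0 && "operands must have at least one element");
  for (std::size_t Pos = 0; Pos < Mask.size(); ++Pos) {
    int I = Mask[Pos];
    if (I == -1)
      continue; // poison lane (assumption of this sketch)
    if (I < 0 || I >= NumOpElts * 2)
      return Pos;
  }
  return std::nullopt;
}

// Builds a 128-element mask shaped like the one quoted in the thread:
// element 0 is 0, almost everything else is 1, and lanes 124 and 126 hold
// their own position as the index.
std::vector<int> makeReportedMask() {
  std::vector<int> Mask(128, 1);
  Mask[0] = 0;
  Mask[124] = 124;
  Mask[126] = 126;
  return Mask;
}
```

Calling `firstOutOfBoundsMaskElt(makeReportedMask(), 4)` reports lane 124 as the first offending element, which is the kind of mask the assertion rejects.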

vitalybuka referenced this pull request Sep 13, 2024

[SLP]Return proper value for phi vectorized node

c13bf6d

Should not return the original phi vector instruction, need to return actual vectorized value as a result.

