[SLP]Improve/fix subvectors in gather/buildvector nodes handling #104144

alexey-bataev

The SLP vectorizer has a cost estimation for gather/buildvector nodes that
contain some scalar loads. That estimation is similar to the regular
vectorization analysis (but large in SLOC) and is not always correct.
Instead, this patch implements clustering analysis and actual node
allocation, with full analysis of the vectorized clustered scalars (not
only loads, but also some other instructions), correct cost estimation,
and vector insert instructions. This improves overall vectorization
quality and simplifies the analysis and estimations.
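
As a rough illustration of the clustering step described above, the following standalone C++ sketch models how a gather node's scalar list might be scanned for vectorizable slices, widest first. It is a simplified model, not the patch's code: `findSubvectorSlices`, `SlicePredicate`, and the `CanVectorize` callback are hypothetical names, and the callback stands in for whatever analysis decides that a slice forms a real vector node.

```cpp
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical predicate: true if the slice [Start, Start + VF) could become
// a real (non-gather) vector node. It stands in for the actual analysis.
using SlicePredicate = std::function<bool(std::size_t Start, std::size_t VF)>;

// Returns (start index, width) for every slice chosen as a subvector.
std::vector<std::pair<std::size_t, std::size_t>>
findSubvectorSlices(std::size_t NumScalars, std::size_t MinVF,
                    const SlicePredicate &CanVectorize) {
  std::vector<std::pair<std::size_t, std::size_t>> Slices;
  if (NumScalars <= 2 || MinVF == 0)
    return Slices;
  std::vector<bool> Taken(NumScalars, false);
  std::size_t StartIdx = 0;
  std::size_t End = NumScalars;
  // Try the widest slices first, then halve the width.
  for (std::size_t VF = NumScalars / 2; VF >= MinVF; VF /= 2) {
    for (std::size_t Cnt = StartIdx; Cnt + VF <= End; Cnt += VF) {
      // Skip slices whose first or last lane is already part of a subvector.
      if (Taken[Cnt] || Taken[Cnt + VF - 1])
        continue;
      if (!CanVectorize(Cnt, VF))
        continue;
      Slices.emplace_back(Cnt, VF);
      for (std::size_t I = Cnt; I < Cnt + VF; ++I)
        Taken[I] = true;
      // Shrink the scanned window when its head or tail has been consumed.
      if (StartIdx == Cnt)
        StartIdx = Cnt + VF;
      if (End == Cnt + VF)
        End = Cnt;
    }
  }
  return Slices;
}
```

Each returned pair corresponds to one subvector that would later be placed into the wider gathered vector at the given start index; lanes not covered by any pair remain ordinary gathered scalars.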

Created using spr 1.3.5
@llvmbot

llvmbot commented Aug 14, 2024

@llvm/pr-subscribers-llvm-transforms

@llvm/pr-subscribers-backend-systemz

Author: Alexey Bataev (alexey-bataev)


Patch is 183.96 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/104144.diff

27 Files Affected:

  • (modified) llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp (+160-185)
  • (modified) llvm/test/Transforms/PhaseOrdering/AArch64/slpordering.ll (+37-37)
  • (modified) llvm/test/Transforms/SLPVectorizer/AArch64/getelementptr.ll (+6-5)
  • (modified) llvm/test/Transforms/SLPVectorizer/AArch64/loadorder.ll (+96-96)
  • (modified) llvm/test/Transforms/SLPVectorizer/AArch64/multiple_reduction.ll (+147-218)
  • (modified) llvm/test/Transforms/SLPVectorizer/AArch64/scalarization-overhead.ll (+43-19)
  • (modified) llvm/test/Transforms/SLPVectorizer/AArch64/shuffle-vectors-mask-size.ll (+2-5)
  • (modified) llvm/test/Transforms/SLPVectorizer/AArch64/tsc-s116.ll (+4-4)
  • (modified) llvm/test/Transforms/SLPVectorizer/AArch64/vectorizable-selects-uniform-cmps.ll (+19-13)
  • (modified) llvm/test/Transforms/SLPVectorizer/RISCV/combined-loads-stored.ll (+3-4)
  • (modified) llvm/test/Transforms/SLPVectorizer/RISCV/reductions.ll (+22-26)
  • (modified) llvm/test/Transforms/SLPVectorizer/SystemZ/pr34619.ll (+5-6)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/addsub.ll (+8-10)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/extract-many-users-buildvector.ll (+19-24)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/extract-scalar-from-undef.ll (+13-14)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/gather-node-same-as-vect-but-order.ll (+6-7)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/horizontal-minmax.ll (+7-9)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/inst_size_bug.ll (+12-6)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/landing_pad.ll (+10-9)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/phi.ll (+27-26)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/reduction-logical.ll (+9-8)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/remark-partial-loads-vectorize.ll (+3-13)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/scatter-vectorize-reused-pointer.ll (+14-12)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/schedule_budget_debug_info.ll (+28-12)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/split-load8_2-unord.ll (+17-22)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/tiny-tree.ll (+2-3)
  • (modified) llvm/test/Transforms/SLPVectorizer/X86/vect-gather-same-nodes.ll (+3-3)
diff --git a/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp b/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
index 81841a8f692870..b32017adcf8ca8 100644
--- a/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
+++ b/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
@@ -3104,6 +3104,10 @@ class BoUpSLP {
     /// The index of this treeEntry in VectorizableTree.
     int Idx = -1;
 
+    /// For gather/buildvector/alt opcode (TODO) nodes, which are combined from
+    /// other nodes as a serie of insertvector instructions.
+    SmallVector<std::pair<unsigned, unsigned>, 0> CombinedEntriesWithIndices;
+
   private:
     /// The operands of each instruction in each lane Operands[op_index][lane].
     /// Note: This helps avoid the replication of the code that performs the
@@ -3404,7 +3408,9 @@ class BoUpSLP {
         if (!isConstant(V)) {
           auto *I = dyn_cast<CastInst>(V);
           AllConstsOrCasts &= I && I->getType()->isIntegerTy();
-          ValueToGatherNodes.try_emplace(V).first->getSecond().insert(Last);
+          if (UserTreeIdx.EdgeIdx != UINT_MAX || !UserTreeIdx.UserTE ||
+              !UserTreeIdx.UserTE->isGather())
+            ValueToGatherNodes.try_emplace(V).first->getSecond().insert(Last);
         }
       if (AllConstsOrCasts)
         CastMaxMinBWSizes =
@@ -8361,8 +8367,49 @@ getGEPCosts(const TargetTransformInfo &TTI, ArrayRef<Value *> Ptrs,
 
 void BoUpSLP::transformNodes() {
   constexpr TTI::TargetCostKind CostKind = TTI::TCK_RecipThroughput;
-  for (std::unique_ptr<TreeEntry> &TE : VectorizableTree) {
-    TreeEntry &E = *TE;
+  // The tree may grow here, so iterate over nodes, built before.
+  for (unsigned Idx : seq<unsigned>(VectorizableTree.size())) {
+    TreeEntry &E = *VectorizableTree[Idx];
+    if (E.isGather()) {
+      ArrayRef<Value *> VL = E.Scalars;
+      const unsigned Sz = getVectorElementSize(VL.front());
+      unsigned MinVF = getMinVF(2 * Sz);
+      if (VL.size() <= 2 ||
+          (E.getOpcode() &&
+           (E.isAltShuffle() || E.getOpcode() != Instruction::Load)))
+        continue;
+      // Try to find vectorizable sequences and transform them into a serie of
+      // insertvector instructions.
+      unsigned StartIdx = 0;
+      unsigned End = VL.size();
+      for (unsigned VF = VL.size() / 2; VF >= MinVF; VF /= 2) {
+        for (unsigned Cnt = StartIdx; Cnt + VF <= End; Cnt += VF) {
+          ArrayRef<Value *> Slice = VL.slice(Cnt, VF);
+          InstructionsState S = getSameOpcode(Slice, *TLI);
+          if (!S.getOpcode() || S.isAltShuffle() ||
+              (S.getOpcode() != Instruction::Load &&
+               any_of(Slice, [&](Value *V) {
+                 return !areAllUsersVectorized(cast<Instruction>(V),
+                                               UserIgnoreList);
+               })))
+            continue;
+          if (!getTreeEntry(Slice.front()) && !getTreeEntry(Slice.back())) {
+            unsigned PrevSize = VectorizableTree.size();
+            buildTree_rec(Slice, 0, EdgeInfo(&E, UINT_MAX));
+            if (PrevSize + 1 == VectorizableTree.size() &&
+                VectorizableTree[PrevSize]->isGather()) {
+              VectorizableTree.pop_back();
+              continue;
+            }
+            E.CombinedEntriesWithIndices.emplace_back(PrevSize, Cnt);
+            if (StartIdx == Cnt)
+              StartIdx = Cnt + VF;
+            if (End == Cnt + VF)
+              End = Cnt;
+          }
+        }
+      }
+    }
     switch (E.getOpcode()) {
     case Instruction::Load: {
       // No need to reorder masked gather loads, just reorder the scalar
@@ -8485,175 +8532,7 @@ class BoUpSLP::ShuffleCostEstimator : public BaseShuffleAnalysis {
     auto *VecTy = getWidenedType(ScalarTy, VL.size());
     InstructionCost GatherCost = 0;
     SmallVector<Value *> Gathers(VL);
-    // Improve gather cost for gather of loads, if we can group some of the
-    // loads into vector loads.
-    InstructionsState S = getSameOpcode(VL, *R.TLI);
-    const unsigned Sz = R.DL->getTypeSizeInBits(ScalarTy);
-    unsigned MinVF = R.getMinVF(2 * Sz);
-    if (VL.size() > 2 &&
-        ((S.getOpcode() == Instruction::Load && !S.isAltShuffle()) ||
-         (InVectors.empty() &&
-          any_of(seq<unsigned>(0, VL.size() / MinVF),
-                 [&](unsigned Idx) {
-                   ArrayRef<Value *> SubVL = VL.slice(Idx * MinVF, MinVF);
-                   InstructionsState S = getSameOpcode(SubVL, *R.TLI);
-                   return S.getOpcode() == Instruction::Load &&
-                          !S.isAltShuffle();
-                 }))) &&
-        !all_of(Gathers, [&](Value *V) { return R.getTreeEntry(V); }) &&
-        !isSplat(Gathers)) {
-      InstructionCost BaseCost = R.getGatherCost(Gathers, !Root, ScalarTy);
-      SetVector<Value *> VectorizedLoads;
-      SmallVector<std::pair<unsigned, LoadsState>> VectorizedStarts;
-      SmallVector<unsigned> ScatterVectorized;
-      unsigned StartIdx = 0;
-      unsigned VF = VL.size() / 2;
-      for (; VF >= MinVF; VF /= 2) {
-        for (unsigned Cnt = StartIdx, End = VL.size(); Cnt + VF <= End;
-             Cnt += VF) {
-          ArrayRef<Value *> Slice = VL.slice(Cnt, VF);
-          if (S.getOpcode() != Instruction::Load || S.isAltShuffle()) {
-            InstructionsState SliceS = getSameOpcode(Slice, *R.TLI);
-            if (SliceS.getOpcode() != Instruction::Load ||
-                SliceS.isAltShuffle())
-              continue;
-          }
-          if (!VectorizedLoads.count(Slice.front()) &&
-              !VectorizedLoads.count(Slice.back()) && allSameBlock(Slice)) {
-            SmallVector<Value *> PointerOps;
-            OrdersType CurrentOrder;
-            LoadsState LS = R.canVectorizeLoads(Slice, Slice.front(),
-                                                CurrentOrder, PointerOps);
-            switch (LS) {
-            case LoadsState::Vectorize:
-            case LoadsState::ScatterVectorize:
-            case LoadsState::StridedVectorize:
-              // Mark the vectorized loads so that we don't vectorize them
-              // again.
-              // TODO: better handling of loads with reorders.
-              if (((LS == LoadsState::Vectorize ||
-                    LS == LoadsState::StridedVectorize) &&
-                   CurrentOrder.empty()) ||
-                  (LS == LoadsState::StridedVectorize &&
-                   isReverseOrder(CurrentOrder)))
-                VectorizedStarts.emplace_back(Cnt, LS);
-              else
-                ScatterVectorized.push_back(Cnt);
-              VectorizedLoads.insert(Slice.begin(), Slice.end());
-              // If we vectorized initial block, no need to try to vectorize
-              // it again.
-              if (Cnt == StartIdx)
-                StartIdx += VF;
-              break;
-            case LoadsState::Gather:
-              break;
-            }
-          }
-        }
-        // Check if the whole array was vectorized already - exit.
-        if (StartIdx >= VL.size())
-          break;
-        // Found vectorizable parts - exit.
-        if (!VectorizedLoads.empty())
-          break;
-      }
-      if (!VectorizedLoads.empty()) {
-        unsigned NumParts = TTI.getNumberOfParts(VecTy);
-        bool NeedInsertSubvectorAnalysis =
-            !NumParts || (VL.size() / VF) > NumParts;
-        // Get the cost for gathered loads.
-        for (unsigned I = 0, End = VL.size(); I < End; I += VF) {
-          if (VectorizedLoads.contains(VL[I]))
-            continue;
-          GatherCost +=
-              getBuildVectorCost(VL.slice(I, std::min(End - I, VF)), Root);
-        }
-        // Exclude potentially vectorized loads from list of gathered
-        // scalars.
-        Gathers.assign(Gathers.size(), PoisonValue::get(VL.front()->getType()));
-        // The cost for vectorized loads.
-        InstructionCost ScalarsCost = 0;
-        for (Value *V : VectorizedLoads) {
-          auto *LI = cast<LoadInst>(V);
-          ScalarsCost +=
-              TTI.getMemoryOpCost(Instruction::Load, LI->getType(),
-                                  LI->getAlign(), LI->getPointerAddressSpace(),
-                                  CostKind, TTI::OperandValueInfo(), LI);
-        }
-        auto *LoadTy = getWidenedType(VL.front()->getType(), VF);
-        for (const std::pair<unsigned, LoadsState> &P : VectorizedStarts) {
-          auto *LI = cast<LoadInst>(VL[P.first]);
-          Align Alignment = LI->getAlign();
-          GatherCost +=
-              P.second == LoadsState::Vectorize
-                  ? TTI.getMemoryOpCost(Instruction::Load, LoadTy, Alignment,
-                                        LI->getPointerAddressSpace(), CostKind,
-                                        TTI::OperandValueInfo(), LI)
-                  : TTI.getStridedMemoryOpCost(
-                        Instruction::Load, LoadTy, LI->getPointerOperand(),
-                        /*VariableMask=*/false, Alignment, CostKind, LI);
-          // Add external uses costs.
-          for (auto [Idx, V] : enumerate(VL.slice(
-                   P.first, std::min<unsigned>(VL.size() - P.first, VF))))
-            if (!R.areAllUsersVectorized(cast<Instruction>(V)))
-              GatherCost += TTI.getVectorInstrCost(Instruction::ExtractElement,
-                                                   LoadTy, CostKind, Idx);
-          // Estimate GEP cost.
-          SmallVector<Value *> PointerOps(VF);
-          for (auto [I, V] : enumerate(VL.slice(P.first, VF)))
-            PointerOps[I] = cast<LoadInst>(V)->getPointerOperand();
-          auto [ScalarGEPCost, VectorGEPCost] =
-              getGEPCosts(TTI, PointerOps, LI->getPointerOperand(),
-                          Instruction::Load, CostKind, LI->getType(), LoadTy);
-          GatherCost += VectorGEPCost - ScalarGEPCost;
-        }
-        for (unsigned P : ScatterVectorized) {
-          auto *LI0 = cast<LoadInst>(VL[P]);
-          ArrayRef<Value *> Slice = VL.slice(P, VF);
-          Align CommonAlignment = computeCommonAlignment<LoadInst>(Slice);
-          GatherCost += TTI.getGatherScatterOpCost(
-              Instruction::Load, LoadTy, LI0->getPointerOperand(),
-              /*VariableMask=*/false, CommonAlignment, CostKind, LI0);
-          // Estimate GEP cost.
-          SmallVector<Value *> PointerOps(VF);
-          for (auto [I, V] : enumerate(Slice))
-            PointerOps[I] = cast<LoadInst>(V)->getPointerOperand();
-          OrdersType Order;
-          if (sortPtrAccesses(PointerOps, LI0->getType(), *R.DL, *R.SE,
-                              Order)) {
-            // TODO: improve checks if GEPs can be vectorized.
-            Value *Ptr0 = PointerOps.front();
-            Type *ScalarTy = Ptr0->getType();
-            auto *VecTy = getWidenedType(ScalarTy, VF);
-            auto [ScalarGEPCost, VectorGEPCost] =
-                getGEPCosts(TTI, PointerOps, Ptr0, Instruction::GetElementPtr,
-                            CostKind, ScalarTy, VecTy);
-            GatherCost += VectorGEPCost - ScalarGEPCost;
-            if (!Order.empty()) {
-              SmallVector<int> Mask;
-              inversePermutation(Order, Mask);
-              GatherCost += ::getShuffleCost(TTI, TTI::SK_PermuteSingleSrc,
-                                             VecTy, Mask, CostKind);
-            }
-          } else {
-            GatherCost += R.getGatherCost(PointerOps, /*ForPoisonSrc=*/true,
-                                          PointerOps.front()->getType());
-          }
-        }
-        if (NeedInsertSubvectorAnalysis) {
-          // Add the cost for the subvectors insert.
-          SmallVector<int> ShuffleMask(VL.size());
-          for (unsigned I = VF, E = VL.size(); I < E; I += VF) {
-            for (unsigned Idx : seq<unsigned>(0, E))
-              ShuffleMask[Idx] = Idx / VF == I ? E + Idx % VF : Idx;
-            GatherCost += ::getShuffleCost(TTI, TTI::SK_InsertSubvector, VecTy,
-                                           ShuffleMask, CostKind, I, LoadTy);
-          }
-        }
-        GatherCost -= ScalarsCost;
-      }
-      GatherCost = std::min(BaseCost, GatherCost);
-    } else if (!Root && isSplat(VL)) {
+    if (!Root && isSplat(VL)) {
       // Found the broadcasting of the single scalar, calculate the cost as
       // the broadcast.
       const auto *It = find_if_not(VL, IsaPred<UndefValue>);
@@ -9401,7 +9280,9 @@ class BoUpSLP::ShuffleCostEstimator : public BaseShuffleAnalysis {
   InstructionCost createFreeze(InstructionCost Cost) { return Cost; }
   /// Finalize emission of the shuffles.
   InstructionCost
-  finalize(ArrayRef<int> ExtMask, unsigned VF = 0,
+  finalize(ArrayRef<int> ExtMask,
+           ArrayRef<std::pair<const TreeEntry *, unsigned>> SubVectors,
+           unsigned VF = 0,
            function_ref<void(Value *&, SmallVectorImpl<int> &)> Action = {}) {
     IsFinalized = true;
     if (Action) {
@@ -9419,6 +9300,29 @@ class BoUpSLP::ShuffleCostEstimator : public BaseShuffleAnalysis {
       Action(V, CommonMask);
       InVectors.front() = V;
     }
+    if (!SubVectors.empty()) {
+      const PointerUnion<Value *, const TreeEntry *> &Vec = InVectors.front();
+      if (InVectors.size() == 2)
+        Cost += createShuffle(Vec, InVectors.back(), CommonMask);
+      else
+        Cost += createShuffle(Vec, nullptr, CommonMask);
+      for (unsigned Idx = 0, Sz = CommonMask.size(); Idx < Sz; ++Idx)
+        if (CommonMask[Idx] != PoisonMaskElem)
+          CommonMask[Idx] = Idx;
+      for (const auto [E, Idx] : SubVectors) {
+        Cost += ::getShuffleCost(
+            TTI, TTI::SK_InsertSubvector,
+            FixedVectorType::get(ScalarTy, CommonMask.size()), std::nullopt,
+            CostKind, Idx,
+            FixedVectorType::get(ScalarTy, E->getVectorFactor()));
+        if (!CommonMask.empty()) {
+          std::iota(std::next(CommonMask.begin(), Idx),
+                    std::next(CommonMask.begin(), Idx + E->getVectorFactor()),
+                    Idx);
+        }
+      }
+    }
+
     ::addMask(CommonMask, ExtMask, /*ExtendingManyInputs=*/true);
     if (CommonMask.empty()) {
       assert(InVectors.size() == 1 && "Expected only one vector with no mask");
@@ -10942,8 +10846,31 @@ InstructionCost BoUpSLP::getTreeCost(ArrayRef<Value *> VectorizedVals) {
       if (CanBeUsedAsScalar) {
         InstructionCost ScalarCost = TTI->getInstructionCost(Inst, CostKind);
         bool KeepScalar = ScalarCost <= ExtraCost;
-        if (KeepScalar && ScalarCost != TTI::TCC_Free &&
-            ExtraCost - ScalarCost <= TTI::TCC_Basic) {
+        // Try to keep original scalar if the user is the phi node from the same
+        // block as the root phis, currently vectorized. It allows to keep
+        // better ordering info of PHIs, being vectorized currently.
+        bool IsProfitablePHIUser =
+            (KeepScalar || (ScalarCost - ExtraCost <= TTI::TCC_Basic &&
+                            VectorizableTree.front()->Scalars.size() > 2)) &&
+            VectorizableTree.front()->getOpcode() == Instruction::PHI &&
+            !Inst->hasNUsesOrMore(UsesLimit) &&
+            none_of(Inst->users(),
+                    [&](User *U) {
+                      auto *PHIUser = dyn_cast<PHINode>(U);
+                      return (!PHIUser ||
+                              PHIUser->getParent() !=
+                                  cast<Instruction>(
+                                      VectorizableTree.front()->getMainOp())
+                                      ->getParent()) &&
+                             !getTreeEntry(U);
+                    }) &&
+            count_if(Entry->Scalars, [&](Value *V) {
+              return ValueToExtUses->contains(V);
+            }) <= 2;
+        if (IsProfitablePHIUser) {
+          KeepScalar = true;
+        } else if (KeepScalar && ScalarCost != TTI::TCC_Free &&
+                   ExtraCost - ScalarCost <= TTI::TCC_Basic) {
           unsigned ScalarUsesCount = count_if(Entry->Scalars, [&](Value *V) {
             return ValueToExtUses->contains(V);
           });
@@ -12490,7 +12417,9 @@ class BoUpSLP::ShuffleInstructionBuilder final : public BaseShuffleAnalysis {
   /// \param Action the action (if any) to be performed before final applying of
   /// the \p ExtMask mask.
   Value *
-  finalize(ArrayRef<int> ExtMask, unsigned VF = 0,
+  finalize(ArrayRef<int> ExtMask,
+           ArrayRef<std::pair<const TreeEntry *, unsigned>> SubVectors,
+           unsigned VF = 0,
            function_ref<void(Value *&, SmallVectorImpl<int> &)> Action = {}) {
     IsFinalized = true;
     SmallVector<int> NewExtMask(ExtMask);
@@ -12524,6 +12453,29 @@ class BoUpSLP::ShuffleInstructionBuilder final : public BaseShuffleAnalysis {
       Action(Vec, CommonMask);
       InVectors.front() = Vec;
     }
+    if (!SubVectors.empty()) {
+      Value *Vec = InVectors.front();
+      if (InVectors.size() == 2) {
+        Vec = createShuffle(Vec, InVectors.back(), CommonMask);
+        InVectors.pop_back();
+      } else {
+        Vec = createShuffle(Vec, nullptr, CommonMask);
+      }
+      for (unsigned Idx = 0, Sz = CommonMask.size(); Idx < Sz; ++Idx)
+        if (CommonMask[Idx] != PoisonMaskElem)
+          CommonMask[Idx] = Idx;
+      for (const auto [E, Idx] : SubVectors) {
+        Vec = Builder.CreateInsertVector(
+            Vec->getType(), Vec, E->VectorizedValue, Builder.getInt64(Idx));
+        if (!CommonMask.empty()) {
+          std::iota(std::next(CommonMask.begin(), Idx),
+                    std::next(CommonMask.begin(), Idx + E->getVectorFactor()),
+                    Idx);
+        }
+      }
+      InVectors.front() = Vec;
+    }
+
     if (!ExtMask.empty()) {
       if (CommonMask.empty()) {
         CommonMask.assign(ExtMask.begin(), ExtMask.end());
@@ -12602,7 +12554,10 @@ Value *BoUpSLP::vectorizeOperand(TreeEntry *E, unsigned NodeIdx,
                              : ScalarTy,
             Builder, *this);
         ShuffleBuilder.add(V, Mask);
-        return ShuffleBuilder.finalize(std::nullopt);
+        SmallVector<std::pair<const TreeEntry *, unsigned>> SubVectors;
+        for (const auto [EIdx, Idx] : E->CombinedEntriesWithIndices)
+          SubVectors.emplace_back(VectorizableTree[EIdx].get(), Idx);
+        return ShuffleBuilder.finalize(std::nullopt, SubVectors);
       };
       Value *V = vectorizeTree(VE, PostponedPHIs);
       if (VF * getNumElements(VL[0]->getType()) !=
@@ -12685,6 +12640,14 @@ ResTy BoUpSLP::processBuildVector(const TreeEntry *E, Type *ScalarTy,
   SmallVector<int> ReuseShuffleIndices(E->ReuseShuffleIndices.begin(),
                                        E->ReuseShuffleIndices.end());
   SmallVector<Value *> GatheredScalars(E->Scalars.begin(), E->Scalars.end());
+  // Clear values, to be replaced by insertvector instructions.
+  for (const auto [EIdx, Idx] : E->CombinedEntriesWithIndices)
+    for_each(MutableArrayRef(GatheredScalars)
+                 .slice(Idx, VectorizableTree[EIdx]->getVectorFactor()),
+             [&](Value *&V) { V = PoisonValue::get(V->getType()); });
+  SmallVector<std::pair<const TreeEntry *, unsigned>> SubVectors;
+  for (const auto [EIdx, Idx] : E->CombinedEntriesWithIndices)
+    SubVectors.emplace_back(VectorizableTree[EIdx].get(), Idx);
   // Build a mask out of the reorder indices and reorder scalars per this
   // mask.
   SmallVector<int> ReorderMask;
@@ -12822,7 +12785,7 @@ ResTy BoUpSLP::processBuildVector(const TreeEntry *E, Type *ScalarTy,
           }
         }
         ShuffleBuilder.add(*FrontTE, Mask);
-        Res = ShuffleBuilder.finalize(E->getCommonMask());
+        Res = ShuffleBuilder.finalize(E->getCommonMask(), SubVectors);
         return Res;
       }
       if (!Resized) {
@@ -13079,10 +13042,10 @@ ResTy BoUpSLP::processBuildVector(const TreeEntry *E, Type *ScalarTy,
                  (IsSingleShuffle && ((IsIdentityShuffle &&
                   IsNonPoisoned) || IsUsedInExpr) && isa<UndefValue>(V));
         }))
-      Res = ShuffleBuilder.finalize(E->ReuseShuffleIndices);
+      Res = ShuffleBuilder.finalize(E->ReuseShuffleIndices, SubVectors);
     else
       Res = ShuffleBuilder.finalize(
-          E->ReuseShuffleIndices, E->Scalars.size(),
+          E->ReuseShuffleIndices, SubVectors, E->Scalars.size(),
           [&](Value *&Vec, SmallVectorImpl<int> &Mask) {
             TryPackScalars(NonConstant...
[truncated]
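
The mask bookkeeping that both `finalize` hunks above perform after inserting subvectors can be modeled in isolation. The sketch below is a hedged, standalone approximation: `applySubvectorInserts` is a hypothetical helper, subvectors are represented as (insert index, width) pairs instead of `TreeEntry` pointers, and the bounds check is an addition of this model rather than something taken from the patch.

```cpp
#include <cstddef>
#include <numeric>
#include <utility>
#include <vector>

// Same sentinel value LLVM uses for an undefined shuffle lane.
constexpr int PoisonMaskElem = -1;

// Models the mask update: once the base shuffle has been materialized, every
// defined lane refers to itself, and each inserted subvector then claims the
// lanes [Idx, Idx + Width) as identity lanes too.
std::vector<int> applySubvectorInserts(
    std::vector<int> CommonMask,
    const std::vector<std::pair<std::size_t, std::size_t>> &SubVectors) {
  for (std::size_t I = 0; I < CommonMask.size(); ++I)
    if (CommonMask[I] != PoisonMaskElem)
      CommonMask[I] = static_cast<int>(I);
  for (const auto &[Idx, Width] : SubVectors) {
    // Guard added for this model only: ignore inserts that do not fit.
    if (Idx + Width > CommonMask.size())
      continue;
    auto First = CommonMask.begin() + static_cast<std::ptrdiff_t>(Idx);
    auto Last = First + static_cast<std::ptrdiff_t>(Width);
    std::iota(First, Last, static_cast<int>(Idx));
  }
  return CommonMask;
}
```

For example, with a mask of `{2, 0, -1, -1}` and one subvector inserted at index 2 with width 2, the two defined lanes first become identity and the inserted lanes then fill the poison slots, giving `{0, 1, 2, 3}`.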

@alexey-bataev alexey-bataev requested a review from RKSimon August 15, 2024 19:37
@alexey-bataev

Ping!

@alexey-bataev

Ping!

Created using spr 1.3.5

@RKSimon RKSimon left a comment

LGTM

@alexey-bataev alexey-bataev merged commit 69332bb into main Aug 22, 2024
8 checks passed
@alexey-bataev alexey-bataev deleted the users/alexey-bataev/spr/slpimprovefix-subvectors-in-gatherbuildvector-nodes-handling branch August 22, 2024 15:24
@petrhosek

This change appears to have broken the UBSan runtime build with the following error:

FAILED: compiler-rt/lib/ubsan_minimal/CMakeFiles/RTUbsan_minimal.aarch64.dir/ubsan_minimal_handlers.cpp.o 
/Volumes/Work/s/w/ir/x/w/llvm_build/./bin/clang++ --target=aarch64-unknown-linux-gnu --sysroot=/Volumes/Work/s/w/ir/x/w/cipd/linux -D_DEBUG -D_GLIBCXX_ASSERTIONS -D__STDC_CONSTANT_MACROS -D__STDC_FORMAT_MACROS -D__STDC_LIMIT_MACROS -I/Volumes/Work/s/w/ir/x/w/llvm-llvm-project/compiler-rt/lib/ubsan_minimal/.. --target=aarch64-unknown-linux-gnu -fPIC -fno-semantic-interposition -fvisibility-inlines-hidden -Werror=date-time -Werror=unguarded-availability-new -Wall -Wextra -Wno-unused-parameter -Wwrite-strings -Wcast-qual -Wmissing-field-initializers -Wimplicit-fallthrough -Wcovered-switch-default -Wno-noexcept-type -Wnon-virtual-dtor -Wdelete-non-virtual-dtor -Wsuggest-override -Wstring-conversion -Wmisleading-indentation -Wctad-maybe-unsupported -fdiagnostics-color -ffunction-sections -fdata-sections -ffile-prefix-map=/Volumes/Work/s/w/ir/x/w/llvm_build/runtimes/runtimes-aarch64-unknown-linux-gnu-bins=../../../llvm-llvm-project -ffile-prefix-map=/Volumes/Work/s/w/ir/x/w/llvm-llvm-project/= -no-canonical-prefixes -Wall -Wno-unused-parameter -O2 -g -DNDEBUG -std=c++17 -fPIC -fno-builtin -fno-exceptions -fomit-frame-pointer -funwind-tables -fno-stack-protector -fno-sanitize=safe-stack -fvisibility=hidden -fno-lto -Wthread-safety -Wthread-safety-reference -Wthread-safety-beta -O3 -gline-tables-only -Wno-gnu -Wno-variadic-macros -Wno-c99-extensions -ftrivial-auto-var-init=pattern -nostdinc++ -DSANITIZER_COMMON_NO_REDEFINE_BUILTINS -fno-rtti -MD -MT compiler-rt/lib/ubsan_minimal/CMakeFiles/RTUbsan_minimal.aarch64.dir/ubsan_minimal_handlers.cpp.o -MF compiler-rt/lib/ubsan_minimal/CMakeFiles/RTUbsan_minimal.aarch64.dir/ubsan_minimal_handlers.cpp.o.d -o compiler-rt/lib/ubsan_minimal/CMakeFiles/RTUbsan_minimal.aarch64.dir/ubsan_minimal_handlers.cpp.o -c /Volumes/Work/s/w/ir/x/w/llvm-llvm-project/compiler-rt/lib/ubsan_minimal/ubsan_minimal_handlers.cpp
Assertion failed: (isValidOperands(V1, V2, Mask) && "Invalid shuffle vector instruction operands!"), function ShuffleVectorInst, file Instructions.cpp, line 1700.
PLEASE submit a bug report to https://github.com/llvm/llvm-project/issues/ and include the crash backtrace, preprocessed source, and associated run script.
Stack dump:
0.	Program arguments: /Volumes/Work/s/w/ir/x/w/llvm_build/./bin/clang++ --target=aarch64-unknown-linux-gnu --sysroot=/Volumes/Work/s/w/ir/x/w/cipd/linux -D_DEBUG -D_GLIBCXX_ASSERTIONS -D__STDC_CONSTANT_MACROS -D__STDC_FORMAT_MACROS -D__STDC_LIMIT_MACROS -I/Volumes/Work/s/w/ir/x/w/llvm-llvm-project/compiler-rt/lib/ubsan_minimal/.. --target=aarch64-unknown-linux-gnu -fPIC -fno-semantic-interposition -fvisibility-inlines-hidden -Werror=date-time -Werror=unguarded-availability-new -Wall -Wextra -Wno-unused-parameter -Wwrite-strings -Wcast-qual -Wmissing-field-initializers -Wimplicit-fallthrough -Wcovered-switch-default -Wno-noexcept-type -Wnon-virtual-dtor -Wdelete-non-virtual-dtor -Wsuggest-override -Wstring-conversion -Wmisleading-indentation -Wctad-maybe-unsupported -fdiagnostics-color -ffunction-sections -fdata-sections -ffile-prefix-map=/Volumes/Work/s/w/ir/x/w/llvm_build/runtimes/runtimes-aarch64-unknown-linux-gnu-bins=../../../llvm-llvm-project -ffile-prefix-map=/Volumes/Work/s/w/ir/x/w/llvm-llvm-project/= -no-canonical-prefixes -Wall -Wno-unused-parameter -O2 -g -DNDEBUG -std=c++17 -fPIC -fno-builtin -fno-exceptions -fomit-frame-pointer -funwind-tables -fno-stack-protector -fno-sanitize=safe-stack -fvisibility=hidden -fno-lto -Wthread-safety -Wthread-safety-reference -Wthread-safety-beta -O3 -gline-tables-only -Wno-gnu -Wno-variadic-macros -Wno-c99-extensions -ftrivial-auto-var-init=pattern -nostdinc++ -DSANITIZER_COMMON_NO_REDEFINE_BUILTINS -fno-rtti -MD -MT compiler-rt/lib/ubsan_minimal/CMakeFiles/RTUbsan_minimal.aarch64.dir/ubsan_minimal_handlers.cpp.o -MF compiler-rt/lib/ubsan_minimal/CMakeFiles/RTUbsan_minimal.aarch64.dir/ubsan_minimal_handlers.cpp.o.d -o compiler-rt/lib/ubsan_minimal/CMakeFiles/RTUbsan_minimal.aarch64.dir/ubsan_minimal_handlers.cpp.o -c /Volumes/Work/s/w/ir/x/w/llvm-llvm-project/compiler-rt/lib/ubsan_minimal/ubsan_minimal_handlers.cpp
1.	<eof> parser at end of file
2.	Optimizer
3.	Running pass "function<eager-inv>(float2int,lower-constant-intrinsics,chr,loop(loop-rotate<header-duplication;no-prepare-for-lto>,loop-deletion),loop-distribute,inject-tli-mappings,loop-vectorize<no-interleave-forced-only;no-vectorize-forced-only;>,infer-alignment,loop-load-elim,instcombine<max-iterations=1;no-use-loop-info;no-verify-fixpoint>,simplifycfg<bonus-inst-threshold=1;forward-switch-cond;switch-range-to-icmp;switch-to-lookup;no-keep-loops;hoist-common-insts;sink-common-insts;speculate-blocks;simplify-cond-branch;no-speculate-unpredictables>,slp-vectorizer,vector-combine,instcombine<max-iterations=1;no-use-loop-info;no-verify-fixpoint>,loop-unroll<O3>,transform-warning,sroa<preserve-cfg>,infer-alignment,instcombine<max-iterations=1;no-use-loop-info;no-verify-fixpoint>,loop-mssa(licm<allowspeculation>),alignment-from-assumptions,loop-sink,instsimplify,div-rem-pairs,tailcallelim,simplifycfg<bonus-inst-threshold=1;no-forward-switch-cond;switch-range-to-icmp;no-switch-to-lookup;keep-loops;no-hoist-common-insts;no-sink-common-insts;speculate-blocks;simplify-cond-branch;speculate-unpredictables>)" on module "/Volumes/Work/s/w/ir/x/w/llvm-llvm-project/compiler-rt/lib/ubsan_minimal/ubsan_minimal_handlers.cpp"
4.	Running pass "instcombine<max-iterations=1;no-use-loop-info;no-verify-fixpoint>" on function "_ZL12decorate_msgPcm"
 #0 0x000000010b9daf68 llvm::sys::PrintStackTrace(llvm::raw_ostream&, int) (/Volumes/Work/s/w/ir/x/w/llvm_build/bin/llvm+0x106234f68)
 #1 0x000000010b9d8c79 llvm::sys::RunSignalHandlers() (/Volumes/Work/s/w/ir/x/w/llvm_build/bin/llvm+0x106232c79)
 #2 0x000000010b9da4ce llvm::sys::CleanupOnSignal(unsigned long) (/Volumes/Work/s/w/ir/x/w/llvm_build/bin/llvm+0x1062344ce)
 #3 0x000000010b92ac5e CrashRecoverySignalHandler(int) (/Volumes/Work/s/w/ir/x/w/llvm_build/bin/llvm+0x106184c5e)
 #4 0x00007ff80f6d95ed (/usr/lib/system/libsystem_platform.dylib+0x7ff8004265ed)
 #5 0x000060000355bbe0 
 #6 0x00007ff80f5d2b45 (/usr/lib/system/libsystem_c.dylib+0x7ff80031fb45)
 #7 0x00007ff80f5d1e5e (/usr/lib/system/libsystem_c.dylib+0x7ff80031ee5e)
 #8 0x000000010b666d9a llvm::ShuffleVectorInst::ShuffleVectorInst(llvm::Value*, llvm::Value*, llvm::ArrayRef<int>, llvm::Twine const&, llvm::InsertPosition) (/Volumes/Work/s/w/ir/x/w/llvm_build/bin/llvm+0x105ec0d9a)
 #9 0x00000001063bf0e7 llvm::IRBuilderBase::CreateShuffleVector(llvm::Value*, llvm::Value*, llvm::ArrayRef<int>, llvm::Twine const&) (/Volumes/Work/s/w/ir/x/w/llvm_build/bin/llvm+0x100c190e7)
#10 0x000000010a96ff2b llvm::InstCombinerImpl::visitCallInst(llvm::CallInst&) (/Volumes/Work/s/w/ir/x/w/llvm_build/bin/llvm+0x1051c9f2b)
#11 0x000000010a9253fe llvm::InstCombinerImpl::run() (/Volumes/Work/s/w/ir/x/w/llvm_build/bin/llvm+0x10517f3fe)
#12 0x000000010a92844f combineInstructionsOverFunction(llvm::Function&, llvm::InstructionWorklist&, llvm::AAResults*, llvm::AssumptionCache&, llvm::TargetLibraryInfo&, llvm::TargetTransformInfo&, llvm::DominatorTree&, llvm::OptimizationRemarkEmitter&, llvm::BlockFrequencyInfo*, llvm::BranchProbabilityInfo*, llvm::ProfileSummaryInfo*, llvm::LoopInfo*, llvm::InstCombineOptions const&) (/Volumes/Work/s/w/ir/x/w/llvm_build/bin/llvm+0x10518244f)
#13 0x000000010a927c08 llvm::InstCombinePass::run(llvm::Function&, llvm::AnalysisManager<llvm::Function>&) (/Volumes/Work/s/w/ir/x/w/llvm_build/bin/llvm+0x105181c08)
#14 0x00000001065d7e02 llvm::detail::PassModel<llvm::Function, llvm::InstCombinePass, llvm::AnalysisManager<llvm::Function>>::run(llvm::Function&, llvm::AnalysisManager<llvm::Function>&) (/Volumes/Work/s/w/ir/x/w/llvm_build/bin/llvm+0x100e31e02)
#15 0x000000010b6cb305 llvm::PassManager<llvm::Function, llvm::AnalysisManager<llvm::Function>>::run(llvm::Function&, llvm::AnalysisManager<llvm::Function>&) (/Volumes/Work/s/w/ir/x/w/llvm_build/bin/llvm+0x105f25305)
#16 0x00000001065d9be2 llvm::detail::PassModel<llvm::Function, llvm::PassManager<llvm::Function, llvm::AnalysisManager<llvm::Function>>, llvm::AnalysisManager<llvm::Function>>::run(llvm::Function&, llvm::AnalysisManager<llvm::Function>&) (/Volumes/Work/s/w/ir/x/w/llvm_build/bin/llvm+0x100e33be2)
#17 0x000000010b6ce001 llvm::ModuleToFunctionPassAdaptor::run(llvm::Module&, llvm::AnalysisManager<llvm::Module>&) (/Volumes/Work/s/w/ir/x/w/llvm_build/bin/llvm+0x105f28001)
#18 0x00000001065d3692 llvm::detail::PassModel<llvm::Module, llvm::ModuleToFunctionPassAdaptor, llvm::AnalysisManager<llvm::Module>>::run(llvm::Module&, llvm::AnalysisManager<llvm::Module>&) (/Volumes/Work/s/w/ir/x/w/llvm_build/bin/llvm+0x100e2d692)
#19 0x000000010b6ca505 llvm::PassManager<llvm::Module, llvm::AnalysisManager<llvm::Module>>::run(llvm::Module&, llvm::AnalysisManager<llvm::Module>&) (/Volumes/Work/s/w/ir/x/w/llvm_build/bin/llvm+0x105f24505)
#20 0x00000001065d180a (anonymous namespace)::EmitAssemblyHelper::RunOptimizationPipeline(clang::BackendAction, std::__2::unique_ptr<llvm::raw_pwrite_stream, std::__2::default_delete<llvm::raw_pwrite_stream>>&, std::__2::unique_ptr<llvm::ToolOutputFile, std::__2::default_delete<llvm::ToolOutputFile>>&, clang::BackendConsumer*) (/Volumes/Work/s/w/ir/x/w/llvm_build/bin/llvm+0x100e2b80a)
#21 0x00000001065c8d83 clang::EmitBackendOutput(clang::DiagnosticsEngine&, clang::HeaderSearchOptions const&, clang::CodeGenOptions const&, clang::TargetOptions const&, clang::LangOptions const&, llvm::StringRef, llvm::Module*, clang::BackendAction, llvm::IntrusiveRefCntPtr<llvm::vfs::FileSystem>, std::__2::unique_ptr<llvm::raw_pwrite_stream, std::__2::default_delete<llvm::raw_pwrite_stream>>, clang::BackendConsumer*) (/Volumes/Work/s/w/ir/x/w/llvm_build/bin/llvm+0x100e22d83)
#22 0x00000001065bf078 clang::BackendConsumer::HandleTranslationUnit(clang::ASTContext&) (/Volumes/Work/s/w/ir/x/w/llvm_build/bin/llvm+0x100e19078)
#23 0x0000000107cf73e9 clang::ParseAST(clang::Sema&, bool, bool) (/Volumes/Work/s/w/ir/x/w/llvm_build/bin/llvm+0x1025513e9)
#24 0x00000001068e04c3 clang::FrontendAction::Execute() (/Volumes/Work/s/w/ir/x/w/llvm_build/bin/llvm+0x10113a4c3)
#25 0x000000010684070d clang::CompilerInstance::ExecuteAction(clang::FrontendAction&) (/Volumes/Work/s/w/ir/x/w/llvm_build/bin/llvm+0x10109a70d)
#26 0x00000001069cae18 clang::ExecuteCompilerInvocation(clang::CompilerInstance*) (/Volumes/Work/s/w/ir/x/w/llvm_build/bin/llvm+0x101224e18)
#27 0x000000010585d749 cc1_main(llvm::ArrayRef<char const*>, char const*, void*) (/Volumes/Work/s/w/ir/x/w/llvm_build/bin/llvm+0x1000b7749)
#28 0x0000000105859f5b ExecuteCC1Tool(llvm::SmallVectorImpl<char const*>&, llvm::ToolContext const&) (/Volumes/Work/s/w/ir/x/w/llvm_build/bin/llvm+0x1000b3f5b)
#29 0x0000000106689f0e void llvm::function_ref<void ()>::callback_fn<clang::driver::CC1Command::Execute(llvm::ArrayRef<std::__2::optional<llvm::StringRef>>, std::__2::basic_string<char, std::__2::char_traits<char>, std::__2::allocator<char>>*, bool*) const::$_0>(long) (/Volumes/Work/s/w/ir/x/w/llvm_build/bin/llvm+0x100ee3f0e)
#30 0x000000010b92a97e llvm::CrashRecoveryContext::RunSafely(llvm::function_ref<void ()>) (/Volumes/Work/s/w/ir/x/w/llvm_build/bin/llvm+0x10618497e)
#31 0x00000001066897ba clang::driver::CC1Command::Execute(llvm::ArrayRef<std::__2::optional<llvm::StringRef>>, std::__2::basic_string<char, std::__2::char_traits<char>, std::__2::allocator<char>>*, bool*) const (/Volumes/Work/s/w/ir/x/w/llvm_build/bin/llvm+0x100ee37ba)
#32 0x00000001066463ca clang::driver::Compilation::ExecuteCommand(clang::driver::Command const&, clang::driver::Command const*&, bool) const (/Volumes/Work/s/w/ir/x/w/llvm_build/bin/llvm+0x100ea03ca)
#33 0x00000001066466ef clang::driver::Compilation::ExecuteJobs(clang::driver::JobList const&, llvm::SmallVectorImpl<std::__2::pair<int, clang::driver::Command const*>>&, bool) const (/Volumes/Work/s/w/ir/x/w/llvm_build/bin/llvm+0x100ea06ef)
#34 0x0000000106667f20 clang::driver::Driver::ExecuteCompilation(clang::driver::Compilation&, llvm::SmallVectorImpl<std::__2::pair<int, clang::driver::Command const*>>&) (/Volumes/Work/s/w/ir/x/w/llvm_build/bin/llvm+0x100ec1f20)
#35 0x00000001058593e3 clang_main(int, char**, llvm::ToolContext const&) (/Volumes/Work/s/w/ir/x/w/llvm_build/bin/llvm+0x1000b33e3)
#36 0x0000000105bc2bc6 findTool(int, char**, char const*) (/Volumes/Work/s/w/ir/x/w/llvm_build/bin/llvm+0x10041cbc6)
#37 0x0000000105bc2280 main (/Volumes/Work/s/w/ir/x/w/llvm_build/bin/llvm+0x10041c280)
#38 0x00007ff80f35141f 
clang++: error: clang frontend command failed with exit code 134 (use -v to see invocation)

Would it be possible to take a look and revert the change if the issue cannot be addressed quickly?

@alexey-bataev
Member Author

Something is wrong in the instcombiner; I will check it tomorrow.

vitalybuka added a commit that referenced this pull request Aug 23, 2024
…ing" (#105780)

with "[Vectorize] Fix warnings"

It introduced compiler crashes, see #104144.

This reverts commit 69332bb and
351f4a5.
@mstorsjo
Member

I also ran into assert failures due to this, on both aarch64 and i686 in the wild, and a reduced snippet also triggers on x86_64.

Reproducible with the following reduced snippet:

union a {
  short b
};
int c, d, e;
int *f;
unsigned h(int i, int j) {
  if (i & ~j)
    return i & 1;
  return i;
}
void k(short *i) {
  int l, m, n;
  for (;;) {
    int o;
    e += f[d];
    n += o = c;
    m = h(e, 6);
    short g = m << 8;
    ((union a *)&i[0])->b = g;
    l = h(o, 6);
    g = l << 8;
    ((union a *)&i[1])->b = g;
    ((union a *)&i[2])->b = g = n << 8;
    ((union a *)&i[3])->b = g;
  }
}

Compiled with:

$ clang -target aarch64-linux-gnu -c -O2 repro.c 

Also reproduces with i686-linux-gnu and x86_64-linux-gnu.

@mstorsjo
Member

Second reproducer:

int a(char *b, int c) {
  int d, e, f = d = 0;
  for (; d < 3; d++) {
    e = 0;
    for (; e < 8; e++)
      f += -b[e] - b[e + c] >> 31;
    b += c;
  }
  return f;
}

This reproduces with i686 and x86_64, but not on aarch64.

@RKSimon
Collaborator

RKSimon commented Aug 23, 2024

This reproduces with i686 and x86_64, but not on aarch64.

In this one, SLP is producing the illegal instruction:

%51 = call <8 x i16> @llvm.vector.insert.v8i16.v4i32(<8 x i16> %50, <4 x i32> %41, i64 4)

https://simd.godbolt.org/z/1hY5e9xTY
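
To make the problem with that call concrete: as far as I understand the LangRef rules for llvm.vector.insert, the subvector must have the same element type as the destination vector, the index must be a multiple of the subvector length, and the subvector must fit inside the destination. The sketch below is not LLVM code; FixedVecTy and checkVectorInsert are hypothetical names invented for illustration, and lanes are modelled as integer bit widths only.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for a fixed-width LLVM integer vector type,
// e.g. <8 x i16> is {8, 16}. Not part of any LLVM API.
struct FixedVecTy {
  unsigned NumElts; // number of lanes
  unsigned EltBits; // width of one integer lane in bits
};

// Returns an empty string if inserting SubVec into Vec at lane Idx looks
// well-formed under the assumed rules, otherwise a short description of
// the first rule that is violated.
std::string checkVectorInsert(FixedVecTy Vec, FixedVecTy SubVec, unsigned Idx) {
  assert(Vec.NumElts > 0 && SubVec.NumElts > 0 && "empty vector type");
  if (Vec.EltBits != SubVec.EltBits)
    return "element types differ";
  if (Idx % SubVec.NumElts != 0)
    return "index is not a multiple of the subvector length";
  if (Idx + SubVec.NumElts > Vec.NumElts)
    return "subvector does not fit";
  return "";
}
```

With the operands from the instruction above, checkVectorInsert({8, 16}, {4, 32}, 4) reports that the element types differ, whereas inserting a <4 x i16> at index 4 passes all three checks.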

alexey-bataev added a commit that referenced this pull request Aug 23, 2024
SLP vectorizer has an estimation for gather/buildvector nodes, which
contain some scalar loads. SLP vectorizer performs pretty similar (but
large in SLOCs) estimation, which not always correct. Instead, this
patch implements clustering analysis and actual node allocation with the
full analysis for the vectorized clustered scalars (not only loads, but
also some other instructions) with the correct cost estimation and
vector insert instructions. Improves overall vectorization quality and
simplifies analysis/estimations.

Reviewers: RKSimon

Reviewed By: RKSimon

Pull Request: #104144
cjdb pushed a commit to cjdb/llvm-project that referenced this pull request Aug 23, 2024
SLP vectorizer has an estimation for gather/buildvector nodes, which
contain some scalar loads. SLP vectorizer performs pretty similar (but
large in SLOCs) estimation, which not always correct. Instead, this
patch implements clustering analysis and actual node allocation with the
full analysis for the vectorized clustered scalars (not only loads, but
also some other instructions) with the correct cost estimation and
vector insert instructions. Improves overall vectorization quality and
simplifies analysis/estimations.

Reviewers: RKSimon

Reviewed By: RKSimon

Pull Request: llvm#104144
cjdb pushed a commit to cjdb/llvm-project that referenced this pull request Aug 23, 2024
…ing" (llvm#105780)

with "[Vectorize] Fix warnings"

It introduced compiler crashes, see llvm#104144.

This reverts commit 69332bb and
351f4a5.
cjdb pushed a commit to cjdb/llvm-project that referenced this pull request Aug 23, 2024
SLP vectorizer has an estimation for gather/buildvector nodes, which
contain some scalar loads. SLP vectorizer performs pretty similar (but
large in SLOCs) estimation, which not always correct. Instead, this
patch implements clustering analysis and actual node allocation with the
full analysis for the vectorized clustered scalars (not only loads, but
also some other instructions) with the correct cost estimation and
vector insert instructions. Improves overall vectorization quality and
simplifies analysis/estimations.

Reviewers: RKSimon

Reviewed By: RKSimon

Pull Request: llvm#104144
5chmidti pushed a commit that referenced this pull request Aug 24, 2024
SLP vectorizer has an estimation for gather/buildvector nodes, which
contain some scalar loads. SLP vectorizer performs pretty similar (but
large in SLOCs) estimation, which not always correct. Instead, this
patch implements clustering analysis and actual node allocation with the
full analysis for the vectorized clustered scalars (not only loads, but
also some other instructions) with the correct cost estimation and
vector insert instructions. Improves overall vectorization quality and
simplifies analysis/estimations.

Reviewers: RKSimon

Reviewed By: RKSimon

Pull Request: #104144
@@ -13122,6 +13069,8 @@ ResTy BoUpSLP::processBuildVector(const TreeEntry *E, Type *ScalarTy,
}

Value *BoUpSLP::createBuildVector(const TreeEntry *E, Type *ScalarTy) {
  for (const auto [EIdx, _] : E->CombinedEntriesWithIndices)
    (void)vectorizeTree(VectorizableTree[EIdx].get(), /*PostponedPHIs=*/false);
Member

I noticed issue #105904. This patch breaks a precondition in vectorizeOperand: a gather node must have only a single user.

@dklimkin
Member

We are hitting an assertion failure with the latest reland:

assert.h assertion failed at llvm/lib/IR/Instructions.cpp:1854 in bool isSingleSourceMaskImpl(ArrayRef, int): I >= 0 && I < (NumOpElts * 2) && "Out-of-bounds shuffle mask element" in https://github.com/openxla/xla/blob/main/xla/service/gpu/fusions/triton/triton_fusion_emitter_parametrized_test.cc#L65

I'll see if we can create a simpler reproducer, but we may need to revert in the meantime.

@alexey-bataev
Member Author

Need a reproducer

@dklimkin
Member

dklimkin commented Aug 29, 2024

Unfortunately it's in a very large generated block and we were not able to isolate it to an external repro yet.

One pointer: dropping lines 12657-12661 from SLPVectorizer.cpp fixes our test. Could you check whether the ranges and the values set there look correct?

Update: corrected line number.

@alexey-bataev
Member Author

alexey-bataev commented Aug 29, 2024

Unfortunately it's in a very large generated block and we were not able to isolate it to an external repro yet.

One pointer is, dropping lines 12657-12667 from SLPVectorizer.cpp fixes our test. Could you take a look if the ranges here, and the values set look correct?

Could you provide the full stack trace? Also, a large reproducer is not a problem: attach it and I will reduce it myself.

@dklimkin
Member

Stack trace:

PC: @     0x7f902096334c  (unknown)  llvm::slpvectorizer::BoUpSLP::ShuffleInstructionBuilder::finalize()
    @     0x7f905174fe80  (unknown)  (unknown)
    @     0x7f902096334c        288  llvm::slpvectorizer::BoUpSLP::ShuffleInstructionBuilder::finalize()
    @     0x7f9020934fab        288  llvm::slpvectorizer::BoUpSLP::vectorizeOperand()::$_5::operator()()
    @     0x7f9020931478        240  llvm::slpvectorizer::BoUpSLP::vectorizeOperand()
    @     0x7f9020931a8c       2400  llvm::slpvectorizer::BoUpSLP::vectorizeTree()
    @     0x7f90209311ca        240  llvm::slpvectorizer::BoUpSLP::vectorizeOperand()
    @     0x7f9020931bd9       2400  llvm::slpvectorizer::BoUpSLP::vectorizeTree()
    @     0x7f9020939fb7        736  llvm::slpvectorizer::BoUpSLP::vectorizeTree()
    @     0x7f9020939e17        160  llvm::slpvectorizer::BoUpSLP::vectorizeTree()
    @     0x7f902094ce10       1248  llvm::SLPVectorizerPass::tryToVectorizeList()
    @     0x7f902095201d        304  tryToVectorizeSequence<>()
    @     0x7f902097a9ee        240  llvm::SLPVectorizerPass::vectorizeCmpInsts<>()
    @     0x7f9020948469        640  llvm::SLPVectorizerPass::vectorizeChainsInBlock()
    @     0x7f9020945a11       4848  llvm::SLPVectorizerPass::runImpl()
    @     0x7f902094551c        160  llvm::SLPVectorizerPass::run()
    @     0x7f902565b252         32  llvm::detail::PassModel<>::run()
    @     0x7f8febfa8f19        272  llvm::PassManager<>::run()
    @     0x7f904d8e1152         32  llvm::detail::PassModel<>::run()
    @     0x7f8febfad2a9        320  llvm::ModuleToFunctionPassAdaptor::run()
    @     0x7f904d8e0f12         32  llvm::detail::PassModel<>::run()
    @     0x7f8febfa7e69        272  llvm::PassManager<>::run()
    @     0x7f903bbed862       2880  std::__u::__function::__policy_invoker<>::__call_impl<>()
    @     0x7f9077c1475e        208  xla::gpu::TranslateLLVMToLLVMIR()
    @     0x7f9077c16f2d        656  xla::gpu::CompileTritonToLLVM()
    @     0x7f9077c162cf        336  xla::gpu::TritonWrapper()
    @     0x7f9078036bbb        448  xla::gpu::TritonFusion::GenerateTritonKernelAndWrapper()
    @     0x7f907803852d        688  std::__u::__function::__policy_invoker<>::__call_impl<>()
    @     0x7f90186f3e47        384  xla::gpu::KernelReuseCache::GetWithStatus()
    @     0x7f90186f3c89        128  xla::gpu::KernelReuseCache::GetWithStatus()
    @     0x7f90780374a4        480  xla::gpu::TritonFusion::Emit()
    @     0x7f91aa629614        464  xla::gpu::IrEmitterUnnested::EmitFusion()
    @     0x7f91aa61829b         96  xla::gpu::IrEmitterUnnested::EmitHloComputation()
    @     0x7f91b9e378f0       1824  xla::gpu::CompileModuleToLlvmIr()
    @     0x7f91ba68b22d       2320  xla::gpu::GpuCompiler::CompileToBackendResult()
    @     0x7f91ba68d599       3280  xla::gpu::GpuCompiler::RunBackend()
    @     0x7f908e3bc71c        592  xla::gpu::AutotunerCompileUtil::Compile()
    @     0x7f908e7e6296        240  xla::gpu::GemmFusionAutotunerImpl::CompileAll()::$_6::operator()()
    @     0x7f908e7fa18d        288  std::__u::__function::__policy_invoker<>::__call_impl<>()
    @     0x7f8e64a87efa        112  Eigen::ThreadPoolTempl<>::WorkerLoop()
    @     0x7f8e64a879fa         48  absl::internal_any_invocable::RemoteInvoker<>()
    @     0x7f8c7d3b3c93        256  Thread::ThreadBody()
    @     0x7f90517467db        192  start_thread
    @     0x7f903b33d05f  (unknown)  clone

This looks like an invalid mask, with the values 124 and 126, here:

Invalid LLVM IR before optimizations:
Invalid shufflevector operands!
  %55 = shufflevector <4 x i32> %35, <4 x i32> poison, <128 x i32> <i32 0, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 124, i32 1, i32 126, i32 1>

(The test in question is https://github.com/openxla/xla/blob/main/xla/service/gpu/fusions/triton/triton_fusion_emitter_parametrized_test.cc#L65. If you can run it, you will probably be able to collect more useful information; this is well outside my usual area.)

The run generated ~800 MB of IR, and I can't confirm whether I am allowed to share all of it.
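
The out-of-bounds entries can be found mechanically. A minimal sketch, assuming the rule that the assertion message quoted earlier states (each mask element must satisfy I >= 0 && I < NumOpElts * 2) plus the convention that -1 marks a poison lane; outOfBoundsMaskLanes is a hypothetical helper written for illustration, not an LLVM API:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Returns the positions of mask elements that do not name a lane of either
// shuffle operand. Each operand is assumed to have NumOpElts lanes, so the
// valid indices are 0 .. 2 * NumOpElts - 1. An element equal to -1 is
// treated as a poison lane and is accepted (an assumption of this sketch).
std::vector<std::size_t> outOfBoundsMaskLanes(const std::vector<int> &Mask,
                                              int NumOpElts) {
  assert(NumOpElts > 0 && "operands must have at least one lane");
  std::vector<std::size_t> Bad;
  for (std::size_t Pos = 0; Pos < Mask.size(); ++Pos) {
    int I = Mask[Pos];
    if (I == -1)
      continue; // poison lane, nothing to check
    if (I < 0 || I >= NumOpElts * 2)
      Bad.push_back(Pos);
  }
  return Bad;
}
```

For the <4 x i32> operands above, only indices 0 through 7 are valid, so running this helper over the 128-element mask flags exactly the entries holding 124 and 126.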

@alexey-bataev
Member Author

Unfortunately, it does not help, need a reproducer

@rupprecht
Collaborator

Reduced as #106655

@alexey-bataev
Member Author

Stack trace:

PC: @     0x7f902096334c  (unknown)  llvm::slpvectorizer::BoUpSLP::ShuffleInstructionBuilder::finalize()
    @     0x7f905174fe80  (unknown)  (unknown)
    @     0x7f902096334c        288  llvm::slpvectorizer::BoUpSLP::ShuffleInstructionBuilder::finalize()
    @     0x7f9020934fab        288  llvm::slpvectorizer::BoUpSLP::vectorizeOperand()::$_5::operator()()
    @     0x7f9020931478        240  llvm::slpvectorizer::BoUpSLP::vectorizeOperand()
    @     0x7f9020931a8c       2400  llvm::slpvectorizer::BoUpSLP::vectorizeTree()
    @     0x7f90209311ca        240  llvm::slpvectorizer::BoUpSLP::vectorizeOperand()
    @     0x7f9020931bd9       2400  llvm::slpvectorizer::BoUpSLP::vectorizeTree()
    @     0x7f9020939fb7        736  llvm::slpvectorizer::BoUpSLP::vectorizeTree()
    @     0x7f9020939e17        160  llvm::slpvectorizer::BoUpSLP::vectorizeTree()
    @     0x7f902094ce10       1248  llvm::SLPVectorizerPass::tryToVectorizeList()
    @     0x7f902095201d        304  tryToVectorizeSequence<>()
    @     0x7f902097a9ee        240  llvm::SLPVectorizerPass::vectorizeCmpInsts<>()
    @     0x7f9020948469        640  llvm::SLPVectorizerPass::vectorizeChainsInBlock()
    @     0x7f9020945a11       4848  llvm::SLPVectorizerPass::runImpl()
    @     0x7f902094551c        160  llvm::SLPVectorizerPass::run()
    @     0x7f902565b252         32  llvm::detail::PassModel<>::run()
    @     0x7f8febfa8f19        272  llvm::PassManager<>::run()
    @     0x7f904d8e1152         32  llvm::detail::PassModel<>::run()
    @     0x7f8febfad2a9        320  llvm::ModuleToFunctionPassAdaptor::run()
    @     0x7f904d8e0f12         32  llvm::detail::PassModel<>::run()
    @     0x7f8febfa7e69        272  llvm::PassManager<>::run()
    @     0x7f903bbed862       2880  std::__u::__function::__policy_invoker<>::__call_impl<>()
    @     0x7f9077c1475e        208  xla::gpu::TranslateLLVMToLLVMIR()
    @     0x7f9077c16f2d        656  xla::gpu::CompileTritonToLLVM()
    @     0x7f9077c162cf        336  xla::gpu::TritonWrapper()
    @     0x7f9078036bbb        448  xla::gpu::TritonFusion::GenerateTritonKernelAndWrapper()
    @     0x7f907803852d        688  std::__u::__function::__policy_invoker<>::__call_impl<>()
    @     0x7f90186f3e47        384  xla::gpu::KernelReuseCache::GetWithStatus()
    @     0x7f90186f3c89        128  xla::gpu::KernelReuseCache::GetWithStatus()
    @     0x7f90780374a4        480  xla::gpu::TritonFusion::Emit()
    @     0x7f91aa629614        464  xla::gpu::IrEmitterUnnested::EmitFusion()
    @     0x7f91aa61829b         96  xla::gpu::IrEmitterUnnested::EmitHloComputation()
    @     0x7f91b9e378f0       1824  xla::gpu::CompileModuleToLlvmIr()
    @     0x7f91ba68b22d       2320  xla::gpu::GpuCompiler::CompileToBackendResult()
    @     0x7f91ba68d599       3280  xla::gpu::GpuCompiler::RunBackend()
    @     0x7f908e3bc71c        592  xla::gpu::AutotunerCompileUtil::Compile()
    @     0x7f908e7e6296        240  xla::gpu::GemmFusionAutotunerImpl::CompileAll()::$_6::operator()()
    @     0x7f908e7fa18d        288  std::__u::__function::__policy_invoker<>::__call_impl<>()
    @     0x7f8e64a87efa        112  Eigen::ThreadPoolTempl<>::WorkerLoop()
    @     0x7f8e64a879fa         48  absl::internal_any_invocable::RemoteInvoker<>()
    @     0x7f8c7d3b3c93        256  Thread::ThreadBody()
    @     0x7f90517467db        192  start_thread
    @     0x7f903b33d05f  (unknown)  clone

Looks like invalid mask with values 124 and 126 here:

Invalid LLVM IR before optimizations:
Invalid shufflevector operands!
  %55 = shufflevector <4 x i32> %35, <4 x i32> poison, <128 x i32> <i32 0, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 124, i32 1, i32 126, i32 1>

(The test in question is this: https://github.com/openxla/xla/blob/main/xla/service/gpu/fusions/triton/triton_fusion_emitter_parametrized_test.cc#L65. If you can run the test, you'll probably be able to collect more useful information. This is significantly outside my normal problem area.)
I got ~800 MB of generated IR, and I can't confirm whether I'm allowed to share all of it.

Unfortunately, this does not help; I need a reproducer.

Reduced as #106655

Thanks, fixed

dmpolukhin pushed a commit to dmpolukhin/llvm-project that referenced this pull request Sep 2, 2024
The SLP vectorizer has an estimation for gather/buildvector nodes that
contain some scalar loads. It performs a very similar (but much larger
in SLOC) estimation, which is not always correct. Instead, this
patch implements clustering analysis and actual node allocation, with
full analysis of the vectorized clustered scalars (not only loads, but
also some other instructions), correct cost estimation, and
vector insert instructions. This improves overall vectorization quality
and simplifies the analysis/estimations.

Reviewers: RKSimon

Reviewed By: RKSimon

Pull Request: llvm#104144
dmpolukhin pushed a commit to dmpolukhin/llvm-project that referenced this pull request Sep 2, 2024
…ing" (llvm#105780)

with "[Vectorize] Fix warnings"

It introduced compiler crashes, see llvm#104144.

This reverts commit 69332bb and
351f4a5.
dmpolukhin pushed a commit to dmpolukhin/llvm-project that referenced this pull request Sep 2, 2024
The SLP vectorizer has an estimation for gather/buildvector nodes that
contain some scalar loads. It performs a very similar (but much larger
in SLOC) estimation, which is not always correct. Instead, this
patch implements clustering analysis and actual node allocation, with
full analysis of the vectorized clustered scalars (not only loads, but
also some other instructions), correct cost estimation, and
vector insert instructions. This improves overall vectorization quality
and simplifies the analysis/estimations.

Reviewers: RKSimon

Reviewed By: RKSimon

Pull Request: llvm#104144

fmayer commented Sep 13, 2024

This CL seems to have changed behavior: https://gist.github.com/fmayer/6030063dfacd2abfb0898a5855949034

This IR here generates a program that returns 0 before, and 1 after this change. It seems related to inline asm instructions.

@alexey-bataev
Member Author

> This CL seems to have changed behavior: https://gist.github.com/fmayer/6030063dfacd2abfb0898a5855949034
>
> This IR here generates a program that returns 0 before, and 1 after this change. It seems related to inline asm instructions.

This should be fixed in c13bf6d.


fmayer commented Sep 13, 2024

> > This CL seems to have changed behavior: https://gist.github.com/fmayer/6030063dfacd2abfb0898a5855949034
> > This IR here generates a program that returns 0 before, and 1 after this change. It seems related to inline asm instructions.
>
> This should be fixed in c13bf6d.

I can confirm this fixed it.

vitalybuka referenced this pull request Sep 13, 2024
Do not return the original phi vector instruction; return the
actual vectorized value as the result.