Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[LLD][ELF][AArch64] Add BTI Aware long branch thunks #108989

Merged
merged 11 commits into from
Oct 1, 2024

Conversation

smithp35
Copy link
Collaborator

When Branch Target Identification BTI is enabled all indirect branches must target a BTI instruction. A long branch thunk is a source of indirect branches. To date LLD has been assuming that the object producer is responsible for putting a BTI instruction at all places the linker might generate an indirect branch to. This is true for clang, but not for GCC. GCC will elide the BTI instruction when it can prove that there are no indirect branches from outside the translation unit(s). GNU ld was fixed to generate a landing pad stub (gnu ld speak for thunk) for the destination when a long range stub was needed [1].

This means that using GCC compiled objects with LLD may lead to LLD generating an indirect branch to a location without a BTI. The ABI [2] has also been clarified to say that it is a static linker's responsibility to generate a landing pad when the target does not have a BTI.

This patch implements the same mechansim as GNU ld. When the output ELF file is setting the
GNU_PROPERTY_AARCH64_FEATURE_1_BTI property, then we check the destination to see if it has a BTI instruction. If it does not we generate a landing pad consisting of:
BTI c
B

The B can be elided if the thunk can be placed so that control flow drops through. For example:
BTI c
:
This will be common when -ffunction-sections is used.

The landing pad thunks are effectively alternative entry points for the function. Direct branches are unaffected but any linker generated indirect branch needs to use the alternative. We place these as close as possible to the destination section.

There is some further optimization possible. Consider the case:
.text
fn1
...
fn2
...

If we need landing pad thunks for both fn1 and fn2 we could order them so that the thunk for fn1 immediately precedes fn1. This could save a single branch. However I didn't think that would be worth the additional complexity.

[1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106671
[2] ARM-software/abi-aa#196

When Branch Target Identification BTI is enabled all indirect
branches must target a BTI instruction. A long branch thunk
is a source of indirect branches. To date LLD has been
assuming that the object producer is responsible for putting
a BTI instruction at all places the linker might generate an
indirect branch to. This is true for clang, but not for GCC.
GCC will elide the BTI instruction when it can prove that
there are no indirect branches from outside the translation
unit(s). GNU ld was fixed to generate a landing pad stub
(gnu ld speak for thunk) for the destination when a long
range stub was needed [1].

This means that using GCC compiled objects with LLD may
lead to LLD generating an indirect branch to a location
without a BTI. The ABI [2] has also been clarified to say
that it is a static linker's responsibility to generate
a landing pad when the target does not have a BTI.

This patch implements the same mechansim as GNU ld. When
the output ELF file is setting the
GNU_PROPERTY_AARCH64_FEATURE_1_BTI property, then we check
the destination to see if it has a BTI instruction. If it
does not we generate a landing pad consisting of:
BTI c
B <destination>

The B <destination> can be elided if the thunk can be placed
so that control flow drops through. For example:
BTI c
<destination>:
This will be common when -ffunction-sections is used.

The landing pad thunks are effectively alternative entry
points for the function. Direct branches are unaffected
but any linker generated indirect branch needs to use
the alternative. We place these as close as possible
to the destination section.

There is some further optimization possible. Consider the
case:
.text
fn1
...
fn2
...

If we need landing pad thunks for both fn1 and fn2 we could
order them so that the thunk for fn1 immediately precedes fn1.
This could save a single branch. However I didn't think that
would be worth the additional complexity.

[1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106671
[2] ARM-software/abi-aa#196
@llvmbot
Copy link
Member

llvmbot commented Sep 17, 2024

@llvm/pr-subscribers-lld

@llvm/pr-subscribers-lld-elf

Author: Peter Smith (smithp35)

Changes

When Branch Target Identification BTI is enabled all indirect branches must target a BTI instruction. A long branch thunk is a source of indirect branches. To date LLD has been assuming that the object producer is responsible for putting a BTI instruction at all places the linker might generate an indirect branch to. This is true for clang, but not for GCC. GCC will elide the BTI instruction when it can prove that there are no indirect branches from outside the translation unit(s). GNU ld was fixed to generate a landing pad stub (gnu ld speak for thunk) for the destination when a long range stub was needed [1].

This means that using GCC compiled objects with LLD may lead to LLD generating an indirect branch to a location without a BTI. The ABI [2] has also been clarified to say that it is a static linker's responsibility to generate a landing pad when the target does not have a BTI.

This patch implements the same mechansim as GNU ld. When the output ELF file is setting the
GNU_PROPERTY_AARCH64_FEATURE_1_BTI property, then we check the destination to see if it has a BTI instruction. If it does not we generate a landing pad consisting of:
BTI c
B <destination>

The B <destination> can be elided if the thunk can be placed so that control flow drops through. For example:
BTI c
<destination>:
This will be common when -ffunction-sections is used.

The landing pad thunks are effectively alternative entry points for the function. Direct branches are unaffected but any linker generated indirect branch needs to use the alternative. We place these as close as possible to the destination section.

There is some further optimization possible. Consider the case:
.text
fn1
...
fn2
...

If we need landing pad thunks for both fn1 and fn2 we could order them so that the thunk for fn1 immediately precedes fn1. This could save a single branch. However I didn't think that would be worth the additional complexity.

[1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106671
[2] ARM-software/abi-aa#196


Patch is 35.20 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/108989.diff

7 Files Affected:

  • (modified) lld/ELF/Arch/AArch64.cpp (+32)
  • (modified) lld/ELF/Relocations.cpp (+23)
  • (modified) lld/ELF/Relocations.h (+11)
  • (modified) lld/ELF/Target.h (+1)
  • (modified) lld/ELF/Thunks.cpp (+107-9)
  • (modified) lld/ELF/Thunks.h (+10-1)
  • (added) lld/test/ELF/aarch64-thunk-bti.s (+513)
diff --git a/lld/ELF/Arch/AArch64.cpp b/lld/ELF/Arch/AArch64.cpp
index 36880bf67e9f23..d56f4b4ae5136a 100644
--- a/lld/ELF/Arch/AArch64.cpp
+++ b/lld/ELF/Arch/AArch64.cpp
@@ -28,6 +28,38 @@ uint64_t elf::getAArch64Page(uint64_t expr) {
   return expr & ~static_cast<uint64_t>(0xFFF);
 }
 
+// A BTI landing pad is a valid target for an indirect branch
+// when the Branch Target Identification has been enabled.
+// As linker generated branches are via x16 the
+// BTI landing pads are defined as:
+// BTI C, BTI J, BTI JC, PACIASP, PACIBSP.
+bool elf::isAArch64BTILandingPad(Symbol &s, int64_t a) {
+  // PLT entries accessed indirectly have a BTI c.
+  if (s.isInPlt())
+    return true;
+  Defined *d = dyn_cast_or_null<Defined>(&s);
+  if (d == nullptr || d->section == nullptr ||
+      d->section->kind() != InputSectionBase::Regular)
+    // All places that we cannot disassemble are responsible for making
+    // the target a BTI landing pad.
+    return true;
+  InputSection *isec = cast<InputSection>(d->section);
+  int64_t off = d->value + a;
+  // Likely user error, but protect ourselves against out of bounds
+  // access.
+  if (off < 0 || static_cast<uint64_t>(off) >= isec->getSize())
+    return true;
+  const uint8_t *buf = isec->content().begin();
+  const uint32_t instr = read32le(buf + off);
+  if (instr == 0xd503233f || // PACIASP.
+      instr == 0xd503237f || // PACIBSP.
+      instr == 0xd503245f || // BTI C.
+      instr == 0xd503249f || // BTI J.
+      instr == 0xd50324df)   // BTI JC.
+    return true;
+  return false;
+}
+
 namespace {
 class AArch64 : public TargetInfo {
 public:
diff --git a/lld/ELF/Relocations.cpp b/lld/ELF/Relocations.cpp
index 6c07051a231537..b7f714bc2158b7 100644
--- a/lld/ELF/Relocations.cpp
+++ b/lld/ELF/Relocations.cpp
@@ -2243,6 +2243,14 @@ std::pair<Thunk *, bool> ThunkCreator::getThunk(InputSection *isec,
   return std::make_pair(t, true);
 }
 
+std::pair<Thunk *, bool> ThunkCreator::getSyntheticLandingPad(Defined &d,
+                                                              int64_t a) {
+  Thunk *lp = landingPadsBySectionAndAddend[{{d.section, d.value}, a}];
+  if (lp)
+    return std::make_pair(lp, false);
+  return std::make_pair(addLandingPadThunk(d, a), true);
+}
+
 // Return true if the relocation target is an in range Thunk.
 // Return false if the relocation is not to a Thunk. If the relocation target
 // was originally to a Thunk, but is no longer in range we revert the
@@ -2326,6 +2334,21 @@ bool ThunkCreator::createThunks(uint32_t pass,
                 ts = getISDThunkSec(os, isec, isd, rel, src);
               ts->addThunk(t);
               thunks[t->getThunkTargetSym()] = t;
+
+              // When indirect branches are restricted, such as AArch64 BTI
+              // Thunks may need to target a linker generated landing pad
+              // instead of the target.
+              if (t->needsSyntheticLandingPad()) {
+                Thunk *lpt;
+                auto &dr = cast<Defined>(t->destination);
+                std::tie(lpt, isNew) = getSyntheticLandingPad(dr, t->addend);
+                if (isNew) {
+                  InputSection *targetsec = dyn_cast<InputSection>(dr.section);
+                  ts = getISThunkSec(targetsec);
+                  ts->addThunk(lpt);
+                }
+                t->landingPad = lpt->getThunkTargetSym();
+              }
             }
 
             // Redirect relocation to Thunk, we never go via the PLT to a Thunk
diff --git a/lld/ELF/Relocations.h b/lld/ELF/Relocations.h
index aaa4581490a285..2e53962d800fc1 100644
--- a/lld/ELF/Relocations.h
+++ b/lld/ELF/Relocations.h
@@ -16,6 +16,7 @@
 #include <vector>
 
 namespace lld::elf {
+class Defined;
 class Symbol;
 class InputSection;
 class InputSectionBase;
@@ -173,6 +174,8 @@ class ThunkCreator {
   std::pair<Thunk *, bool> getThunk(InputSection *isec, Relocation &rel,
                                     uint64_t src);
 
+  std::pair<Thunk *, bool> getSyntheticLandingPad(Defined &d, int64_t a);
+
   ThunkSection *addThunkSection(OutputSection *os, InputSectionDescription *,
                                 uint64_t off);
 
@@ -200,6 +203,14 @@ class ThunkCreator {
   // The Mips LA25 Thunk is an example of an inline ThunkSection.
   llvm::DenseMap<InputSection *, ThunkSection *> thunkedSections;
 
+  // Record landing pads, generated for a section + offset destination.
+  // Landling pads are alternative entry points for destinations that need
+  // to be reached via thunks that use indirect branches. A destination
+  // needs at most one landing pad as that can be reused by all callers.
+  llvm::DenseMap<std::pair<std::pair<SectionBase *, uint64_t>, int64_t>,
+                 Thunk *>
+      landingPadsBySectionAndAddend;
+
   // The number of completed passes of createThunks this permits us
   // to do one time initialization on Pass 0 and put a limit on the
   // number of times it can be called to prevent infinite loops.
diff --git a/lld/ELF/Target.h b/lld/ELF/Target.h
index 9894fb32c503c3..3142a3ec9959d9 100644
--- a/lld/ELF/Target.h
+++ b/lld/ELF/Target.h
@@ -230,6 +230,7 @@ void writePrefixedInstruction(uint8_t *loc, uint64_t insn);
 void addPPC64SaveRestore();
 uint64_t getPPC64TocBase();
 uint64_t getAArch64Page(uint64_t expr);
+bool isAArch64BTILandingPad(Symbol &s, int64_t a);
 template <typename ELFT> void writeARMCmseImportLib();
 uint64_t getLoongArchPageDelta(uint64_t dest, uint64_t pc, RelType type);
 void riscvFinalizeRelax(int passes);
diff --git a/lld/ELF/Thunks.cpp b/lld/ELF/Thunks.cpp
index 349e31332f59aa..599e053c13a6e5 100644
--- a/lld/ELF/Thunks.cpp
+++ b/lld/ELF/Thunks.cpp
@@ -51,12 +51,20 @@ namespace {
 // distance from the thunk to the target is less than 128MB. Long thunks can
 // branch to any virtual address and they are implemented in the derived
 // classes. This class tries to create a short thunk if the target is in range,
-// otherwise it creates a long thunk.
+// otherwise it creates a long thunk. When BTI is enabled indirect branches
+// must land on a BTI instruction. If the destination does not have a BTI
+// instruction mayNeedLandingPad is set to true and Thunk::landingPad points
+// to an alternative entry point with a BTI.
 class AArch64Thunk : public Thunk {
 public:
-  AArch64Thunk(Symbol &dest, int64_t addend) : Thunk(dest, addend) {}
+  AArch64Thunk(Symbol &dest, int64_t addend, bool mayNeedLandingPad)
+      : Thunk(dest, addend), mayNeedLandingPad(mayNeedLandingPad) {}
   bool getMayUseShortThunk();
   void writeTo(uint8_t *buf) override;
+  bool needsSyntheticLandingPad() override;
+
+protected:
+  bool mayNeedLandingPad;
 
 private:
   bool mayUseShortThunk = true;
@@ -66,8 +74,8 @@ class AArch64Thunk : public Thunk {
 // AArch64 long range Thunks.
 class AArch64ABSLongThunk final : public AArch64Thunk {
 public:
-  AArch64ABSLongThunk(Symbol &dest, int64_t addend)
-      : AArch64Thunk(dest, addend) {}
+  AArch64ABSLongThunk(Symbol &dest, int64_t addend, bool mayNeedLandingPad)
+      : AArch64Thunk(dest, addend, mayNeedLandingPad) {}
   uint32_t size() override { return getMayUseShortThunk() ? 4 : 16; }
   void addSymbols(ThunkSection &isec) override;
 
@@ -77,7 +85,8 @@ class AArch64ABSLongThunk final : public AArch64Thunk {
 
 class AArch64ADRPThunk final : public AArch64Thunk {
 public:
-  AArch64ADRPThunk(Symbol &dest, int64_t addend) : AArch64Thunk(dest, addend) {}
+  AArch64ADRPThunk(Symbol &dest, int64_t addend, bool mayNeedLandingPad)
+      : AArch64Thunk(dest, addend, mayNeedLandingPad) {}
   uint32_t size() override { return getMayUseShortThunk() ? 4 : 12; }
   void addSymbols(ThunkSection &isec) override;
 
@@ -85,6 +94,27 @@ class AArch64ADRPThunk final : public AArch64Thunk {
   void writeLong(uint8_t *buf) override;
 };
 
+// AArch64 BTI Landing Pad
+// When BTI is enabled indirect branches must land on a BTI
+// compatible instruction. When the destination does not have a
+// BTI compatible instruction a Thunk doing an indirect branch
+// targets a Landing Pad Thunk that direct branches to the target.
+class AArch64BTILandingPadThunk final : public Thunk {
+public:
+  AArch64BTILandingPadThunk(Symbol &dest, int64_t addend)
+      : Thunk(dest, addend) {}
+
+  uint32_t size() override { return getMayUseShortThunk() ? 4 : 8; }
+  void addSymbols(ThunkSection &isec) override;
+  InputSection *getTargetInputSection() const override;
+  void writeTo(uint8_t *buf) override;
+
+private:
+  bool getMayUseShortThunk();
+  void writeLong(uint8_t *buf);
+  bool mayUseShortThunk = true;
+};
+
 // Base class for ARM thunks.
 //
 // An ARM thunk may be either short or long. A short thunk is simply a branch
@@ -532,6 +562,12 @@ void AArch64Thunk::writeTo(uint8_t *buf) {
   ctx.target->relocateNoSym(buf, R_AARCH64_CALL26, s - p);
 }
 
+bool AArch64Thunk::needsSyntheticLandingPad() {
+  // Short Thunks use a direct branch, no synthetic landing pad
+  // required.
+  return mayNeedLandingPad && !getMayUseShortThunk();
+}
+
 // AArch64 long range Thunks.
 void AArch64ABSLongThunk::writeLong(uint8_t *buf) {
   const uint8_t data[] = {
@@ -540,7 +576,11 @@ void AArch64ABSLongThunk::writeLong(uint8_t *buf) {
     0x00, 0x00, 0x00, 0x00, // L0: .xword S
     0x00, 0x00, 0x00, 0x00,
   };
-  uint64_t s = getAArch64ThunkDestVA(destination, addend);
+  // if mayNeedLandingPad is true then destination is an
+  // AArch64BTILandingPadThunk that defines landingPad.
+  assert(!mayNeedLandingPad || landingPad != nullptr);
+  uint64_t s = mayNeedLandingPad ? landingPad->getVA(0)
+                                 : getAArch64ThunkDestVA(destination, addend);
   memcpy(buf, data, sizeof(data));
   ctx.target->relocateNoSym(buf + 8, R_AARCH64_ABS64, s);
 }
@@ -564,7 +604,11 @@ void AArch64ADRPThunk::writeLong(uint8_t *buf) {
       0x10, 0x02, 0x00, 0x91, // add  x16, x16, R_AARCH64_ADD_ABS_LO12_NC(Dest)
       0x00, 0x02, 0x1f, 0xd6, // br   x16
   };
-  uint64_t s = getAArch64ThunkDestVA(destination, addend);
+  // if mayNeedLandingPad is true then destination is an
+  // AArch64BTILandingPadThunk that defines landingPad.
+  assert(!mayNeedLandingPad || landingPad != nullptr);
+  uint64_t s = mayNeedLandingPad ? landingPad->getVA(0)
+                                 : getAArch64ThunkDestVA(destination, addend);
   uint64_t p = getThunkTargetSym()->getVA();
   memcpy(buf, data, sizeof(data));
   ctx.target->relocateNoSym(buf, R_AARCH64_ADR_PREL_PG_HI21,
@@ -578,6 +622,48 @@ void AArch64ADRPThunk::addSymbols(ThunkSection &isec) {
   addSymbol("$x", STT_NOTYPE, 0, isec);
 }
 
+void AArch64BTILandingPadThunk::addSymbols(ThunkSection &isec) {
+  addSymbol(saver().save("__AArch64BTIThunk_" + destination.getName()),
+            STT_FUNC, 0, isec);
+  addSymbol("$x", STT_NOTYPE, 0, isec);
+}
+
+InputSection *AArch64BTILandingPadThunk::getTargetInputSection() const {
+  auto &dr = cast<Defined>(destination);
+  return dyn_cast<InputSection>(dr.section);
+}
+
+void AArch64BTILandingPadThunk::writeTo(uint8_t *buf) {
+  if (!getMayUseShortThunk()) {
+    writeLong(buf);
+    return;
+  }
+  write32(buf, 0xd503245f); // BTI c
+  // Control falls through to target in following section.
+}
+
+bool AArch64BTILandingPadThunk::getMayUseShortThunk() {
+  if (!mayUseShortThunk)
+    return false;
+  // If the target is the following instruction then
+  // we can fall through without the indirect branch.
+  uint64_t s = destination.getVA(addend);
+  uint64_t p = getThunkTargetSym()->getVA();
+  // <= 4 as Thunks start off with the same offset
+  // within the section as the destination, so
+  // s - p == 0 until addresses are assigned.
+  mayUseShortThunk = (s - p <= 4);
+  return mayUseShortThunk;
+}
+
+void AArch64BTILandingPadThunk::writeLong(uint8_t *buf) {
+  uint64_t s = destination.getVA(addend);
+  uint64_t p = getThunkTargetSym()->getVA() + 4;
+  write32(buf, 0xd503245f);     // BTI c
+  write32(buf + 4, 0x14000000); // B S
+  ctx.target->relocateNoSym(buf + 4, R_AARCH64_CALL26, s - p);
+}
+
 // ARM Target Thunks
 static uint64_t getARMThunkDestVA(const Symbol &s) {
   uint64_t v = s.isInPlt() ? s.getPltVA() : s.getVA();
@@ -1264,9 +1350,12 @@ static Thunk *addThunkAArch64(RelType type, Symbol &s, int64_t a) {
   if (type != R_AARCH64_CALL26 && type != R_AARCH64_JUMP26 &&
       type != R_AARCH64_PLT32)
     fatal("unrecognized relocation type");
+  bool mayNeedLandingPad =
+      (config->andFeatures & GNU_PROPERTY_AARCH64_FEATURE_1_BTI) &&
+      !isAArch64BTILandingPad(s, a);
   if (config->picThunk)
-    return make<AArch64ADRPThunk>(s, a);
-  return make<AArch64ABSLongThunk>(s, a);
+    return make<AArch64ADRPThunk>(s, a, mayNeedLandingPad);
+  return make<AArch64ABSLongThunk>(s, a, mayNeedLandingPad);
 }
 
 // Creates a thunk for long branches or Thumb-ARM interworking.
@@ -1480,3 +1569,12 @@ Thunk *elf::addThunk(const InputSection &isec, Relocation &rel) {
     llvm_unreachable("add Thunk only supported for ARM, AVR, Mips and PowerPC");
   }
 }
+
+Thunk *elf::addLandingPadThunk(Symbol &s, int64_t a) {
+  switch (config->emachine) {
+  case EM_AARCH64:
+    return make<AArch64BTILandingPadThunk>(s, a);
+  default:
+    llvm_unreachable("add landing pad only supported for AArch64");
+  }
+}
diff --git a/lld/ELF/Thunks.h b/lld/ELF/Thunks.h
index 12ddf08cadc090..4d797434aa70d0 100644
--- a/lld/ELF/Thunks.h
+++ b/lld/ELF/Thunks.h
@@ -54,10 +54,17 @@ class Thunk {
     return true;
   }
 
+  // Thunks that indirectly branch to targets may need a synthetic landing
+  // pad generated close to the target. For example AArch64 when BTI is
+  // enabled.
+  virtual bool needsSyntheticLandingPad() { return false; }
+
   Defined *getThunkTargetSym() const { return syms[0]; }
 
   Symbol &destination;
   int64_t addend;
+  // Alternative target when indirect branch to destination can't be used.
+  Symbol *landingPad = nullptr;
   llvm::SmallVector<Defined *, 3> syms;
   uint64_t offset = 0;
   // The alignment requirement for this Thunk, defaults to the size of the
@@ -68,7 +75,9 @@ class Thunk {
 // For a Relocation to symbol S create a Thunk to be added to a synthetic
 // ThunkSection.
 Thunk *addThunk(const InputSection &isec, Relocation &rel);
-
+// Create a landing pad Thunk for use when indirect branches from Thunks
+// are restricted.
+Thunk *addLandingPadThunk(Symbol &s, int64_t a);
 void writePPC32PltCallStub(uint8_t *buf, uint64_t gotPltVA,
                            const InputFile *file, int64_t addend);
 void writePPC64LoadAndBranch(uint8_t *buf, int64_t offset);
diff --git a/lld/test/ELF/aarch64-thunk-bti.s b/lld/test/ELF/aarch64-thunk-bti.s
new file mode 100644
index 00000000000000..f23c7c1c789b09
--- /dev/null
+++ b/lld/test/ELF/aarch64-thunk-bti.s
@@ -0,0 +1,513 @@
+// REQUIRES: aarch64
+// RUN: split-file %s %t
+// RUN: llvm-mc -filetype=obj -triple=aarch64 %t/asm -o %t.o
+// RUN: ld.lld --shared --script=%t/lds %t.o -o %t.so --defsym absolute=0xf0000000
+// RUN: llvm-objdump -d --no-show-raw-insn --print-imm-hex --start-address=0x10000000 --stop-address=0x10000028 %t.so | FileCheck --check-prefix=CHECK-SO %s
+// RUN: llvm-objdump -d --no-show-raw-insn --print-imm-hex --start-address=0x17fc0028 --stop-address=0x17fc0050 %t.so | FileCheck --check-prefix=CHECK-SO2 %s
+// RUN: llvm-objdump -d --no-show-raw-insn --print-imm-hex --start-address=0x18000050 --stop-address=0x180000a8 %t.so | FileCheck --check-prefix=CHECK-BTI %s
+// RUN: llvm-objdump -d --no-show-raw-insn --print-imm-hex --start-address=0x180000b0 --stop-address=0x18000100 %t.so | FileCheck --check-prefix=CHECK-SO3 %s
+// RUN: llvm-objdump -d --no-show-raw-insn --print-imm-hex --start-address=0x30000000 --stop-address=0x300000c0 %t.so | FileCheck --check-prefix=CHECK-SO4 %s
+// RUN: rm %t.so
+// RUN: llvm-mc -filetype=obj -triple=aarch64 %t/shared -o %tshared.o
+// RUN: ld.lld --shared -o %tshared.so %tshared.o
+// RUN: ld.lld %tshared.so --script=%t/lds %t.o -o %t.exe --defsym absolute=0xf0000000
+// RUN: llvm-objdump -d --no-show-raw-insn --print-imm-hex --start-address=0x10000000 --stop-address=0x10000028 %t.exe | FileCheck --check-prefix=CHECK-EXE %s
+// RUN: llvm-objdump -d --no-show-raw-insn --print-imm-hex --start-address=0x17fc0028 --stop-address=0x17fc0050 %t.exe | FileCheck --check-prefix=CHECK-EXE2 %s
+// RUN: llvm-objdump -d --no-show-raw-insn --print-imm-hex --start-address=0x18000050 --stop-address=0x180000a8 %t.exe | FileCheck --check-prefix=CHECK-BTI %s
+// RUN: llvm-objdump -d --no-show-raw-insn --print-imm-hex --start-address=0x180000b0 --stop-address=0x180000e8 %t.exe | FileCheck --check-prefix=CHECK-EXE3 %s
+// RUN: llvm-objdump -d --no-show-raw-insn --print-imm-hex --start-address=0x30000000 --stop-address=0x300000ec %t.exe | FileCheck --check-prefix=CHECK-EXE4 %s
+// RUN: rm %t.o %tshared.o %tshared.so
+
+/// Test thunk generation when destination does not have
+/// a BTI compatible landing pad. Linker must generate
+/// landing pad sections for thunks that use indirect
+/// branches.
+
+//--- asm
+.section ".note.gnu.property", "a"
+.p2align 3
+.long 4
+.long 0x10
+.long 0x5
+.asciz "GNU"
+
+/// Enable BTI.
+.long 0xc0000000 // GNU_PROPERTY_AARCH64_FEATURE_1_AND.
+.long 4
+.long 1          // GNU_PROPERTY_AARCH64_FEATURE_1_BTI.
+.long 0
+
+
+/// Short thunks are direct branches so we don't
+/// need landing pads. Expect all thunks to branch
+/// to target.
+.section .text.0, "ax", %progbits
+.balign 0x1000
+.global _start
+.type _start, %function
+_start:
+ bl bti_c_target
+ bl bti_j_target
+ bl bti_jc_target
+ bl paciasp_target
+ bl pacibsp_target
+ bl fn1
+ bl fn2
+ bl fn3
+ bl  fn4
+ bl via_plt
+
+// CHECK-SO-LABEL: <_start>:
+// CHECK-SO-NEXT: 10000000: bl      0x17fc0028 <__AArch64ADRPThunk_>
+// CHECK-SO-NEXT:           bl      0x17fc002c <__AArch64ADRPThunk_>
+// CHECK-SO-NEXT:           bl      0x17fc0030 <__AArch64ADRPThunk_>
+// CHECK-SO-NEXT:           bl      0x17fc0034 <__AArch64ADRPThunk_>
+// CHECK-SO-NEXT:           bl      0x17fc0038 <__AArch64ADRPThunk_>
+// CHECK-SO-NEXT:           bl      0x17fc003c <__AArch64ADRPThunk_>
+// CHECK-SO-NEXT:           bl      0x17fc0040 <__AArch64ADRPThunk_>
+// CHECK-SO-NEXT:           bl      0x17fc0044 <__AArch64ADRPThunk_>
+// CHECK-SO-NEXT:           bl      0x17fc0048 <__AArch64ADRPThunk_>
+// CHECK-SO-NEXT:           bl      0x17fc004c <__AArch64ADRPThunk_via_plt>
+
+// CHECK-SO2:      <__AArch64ADRPThunk_>:
+// CHECK-SO2-NEXT:17fc0028: b       0x18000050 <bti_c_target>
+// CHECK-SO2-EMPTY:
+// CHECK-SO2-NEXT: <__AArch64ADRPThunk_>:
+// CHECK-SO2-NEXT:          b       0x18000058 <bti_j_target>
+// CHECK-SO2-EMPTY:
+// CHECK-SO2-NEXT: <__AArch64ADRPThunk_>:
+// CHECK-SO2-NEXT:          b       0x18000060 <bti_jc_target>
+// CHECK-SO2-EMPTY:
+// CHECK-SO2-NEXT: <__AArch64ADRPThunk_>:
+// CHECK-SO2-NEXT:          b       0x18000068 <paciasp_target>
+// CHECK-SO2-EMPTY:
+// CHECK-SO2-NEXT: <__AArch64ADRPThunk_>:
+// CHECK-SO2-NEXT:          b       0x18000070 <pacibsp_target>
+// CHECK-SO2-EMPTY:
+// CHECK-SO2-NEXT: <__AArch64ADRPThunk_>:
+// CHECK-SO2-NEXT:          b       0x18000088 <fn1>
+// CHECK-SO2-EMPTY:
+// CHECK-SO2-NEXT: <__AArch64ADRPThunk_>:
+// CHECK-SO2-NEXT:          b       0x1800008c <fn2>
+// CHECK-SO2-EMPTY:
+// CHECK-SO2-NEXT: <__AArch64ADRPThunk_>:
+// CHECK-SO2-NEXT:          b       0x18000094 <fn3>
+// CHECK-SO2-EMPTY:
+// CHECK-SO2-NEXT: <__AArch64ADRPThunk_>:
+// CHECK-SO2-NEXT:          b       0x180000a0 <fn4>
+// CHECK-SO2-EMPTY:
+// CHECK-SO2-NEXT: <__AArch64ADRPThunk_via_plt>:
+// CHECK-SO2-NEXT:          b       0x180000d0 <fn4+0x30>
+
+// CHECK-EXE-LABEL: <_start>:
+// CHECK-EXE-NEXT: 10000000: bl      0x17fc0028 <__AArch64AbsLongThunk_>
+// CHECK-EXE-NEXT:           bl      0x17fc002c <__AArch64AbsLongThunk_>
+// CHECK-EXE-NEXT:           bl      0x17fc0030 <__AArch64AbsLongThunk_>
+// CHECK-EXE-NEXT:           bl      0x17fc0034 <__AArch64AbsLongThunk_>
+// CHECK-EXE-NEXT:           bl      0x17fc0038 <__AArch64AbsLongThunk_>
+// CHECK-EXE-NEXT:           bl      0x17fc003c <__AArch64AbsLongThunk_>
+// CHECK-EXE-NEXT:           bl      0x17fc0040 <__AArch64AbsLongThunk_>
+// CHECK-EXE-NEXT:           bl      0x17fc0044 <__AArch64AbsLongThunk_>
+// CHECK-EXE-NEXT:           bl      0x17fc0048 <__AArch64AbsLongThunk_>
+// CHECK-EXE-NEXT:           bl      0x17fc004c <__AArch64AbsLongThunk_via_plt>
+
+// CHECK-EXE2: <__AArch64AbsLongThunk_>:
+// CHECK-EXE2-NEXT: 17fc0028: b       0x18000050 <bti_c_target>
+// CH...
[truncated]

@smithp35
Copy link
Collaborator Author

smithp35 commented Sep 17, 2024

Hmm, the tests are passing on my local linux machine, not sure what has happened with Windows. Will need to try and find a Windows machine and investigate.

[UPDATE] It looks like it is due to the positioning of the .dynstr section, the test's .dynamic section has a DT_NEEDED which has a file path in it, this will affect the offset between the .plt and .got.plt. I'll update the test to be more robust.

The .dynstr is in between the .plt and the .got.plt. If this
contains a DT_NEEDED tag then the pathname of the shared object
will be used. This path is machine dependent. Don't hard code
the offsets in the .plt in this case. We are only disassembling
the .plt to show that there is a BTI at the start of it.
@MaskRay
Copy link
Member

MaskRay commented Sep 17, 2024

Thanks for the patch. I am back from vacation and will check this shortly.

FYI: I plan to replace config->x with ctx.arg.x to move toward the goal to eliminate the global states. It should require minimal changes to this patch.

@MaskRay
Copy link
Member

MaskRay commented Sep 21, 2024

Hmm, the tests are passing on my local linux machine, not sure what has happened with Windows. Will need to try and find a Windows machine and investigate.

[UPDATE] It looks like it is due to the positioning of the .dynstr section, the test's .dynamic section has a DT_NEEDED which has a file path in it, this will affect the offset between the .plt and .got.plt. I'll update the test to be more robust.

If an output DSO is used to link another component, we do ld.lld -shared -soname=t.so ... -o %t.so to ensure that the DT_NEEDED entry needs a fixed-length .dynstr entry.

This is error-prone, which is one of the reasons I prefer rm -rf %t && split-file %s %t && cd %t :)

The initial rm -rf %t ensures whether the test directory %t remains after test execution doesn't matter, allowing us to remove rm %t.o %tshared.o %tshared.so %t.exe

// RUN: split-file %s %t
// RUN: llvm-mc -filetype=obj -triple=aarch64 %t/asm -o %t.o
// RUN: ld.lld --shared --script=%t/lds %t.o -o %t.so --defsym absolute=0xf0000000
// RUN: llvm-objdump -d --no-show-raw-insn --print-imm-hex --start-address=0x10000000 --stop-address=0x10000028 %t.so | FileCheck --check-prefix=CHECK-SO %s
Copy link
Member

@MaskRay MaskRay Sep 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

--print-imm-hex may be unnecessary.


#109553 might be useful to allow removing some llvm-objdump commands, making maintenance more convenient.

llvm/tools/llvm-objdump/llvm-objdump.cpp:countSkippableZeroBytes needs to spend time to skip zeroes, so the performance is not as good as multiple llvm-objdump invocations though.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've rewritten the test using OutputSections to need much fewer zeroes.

high PT_LOAD FLAGS(0x1 | 0x4);
}
SECTIONS {
.text_low 0x10000000 : {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unintended 4-column when PHDRS uses 2-column indentation?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ACK. I've made the indentation consistent.

/// alignment requirement. Check that we don't fall through into
/// alignment padding.
.section .text.9, "ax", %progbits
.balign 16
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the output DSO, fn4 would be placed at 0x180000b0 even without .balign 16.

Perhaps we can remove fn4 and add a suitable alignment to fn1.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the test I've made sure that we don't end up on an alignment boundary for this case.

// CHECK-BTI-NEXT: ret

/// PLT has bti c.
// CHECK-SO3: 180000b0: bti c
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What triggers the creation of this piece of code at 0x180000b0? I don't find any branches to this address.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's PLT[0] which isn't particularly useful so I've taken it out.

// RUN: llvm-objdump -d --no-show-raw-insn --print-imm-hex --start-address=0x30000000 --stop-address=0x300000ec %t.exe | FileCheck --check-prefix=CHECK-EXE4 %s
// RUN: rm %t.o %tshared.o %tshared.so %t.exe

/// Test thunk generation when destination does not have
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These comments seem to wrap around column 56, much smaller than the typical 80 (though many tests essentially wrap at a much larger column as we don't rigidly require 80).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've reformatted all the comments.

// place a pool that can short thunk
// to targets.
.section .text.1, "ax", %progbits
.space 0x2000000
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.space 0x2000000 => 32MiB in the relocatable file.

We can use multiple output sections with the output section address feature to reduce file sizes, make llvm-objdump faster, and potentially reduce llvm-objdump --start-address= invocations.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the suggestion, I've used OutputSections and the output is much smaller now.

.section .long_calls, "ax", %progbits
.global long_calls
.type long_calls, %function
/// Expect thunk to target as targets have
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These commands have a wrap column at about 45.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ACK. Reformatted all comments in file.

// RUN: llvm-mc -filetype=obj -triple=aarch64 %t/shared -o %tshared.o
// RUN: ld.lld --shared -o %tshared.so %tshared.o
// RUN: ld.lld %tshared.so --script=%t/lds %t.o -o %t.exe --defsym absolute=0xf0000000
// RUN: llvm-objdump -d --no-show-raw-insn --print-imm-hex --start-address=0x10000000 --stop-address=0x10000028 %t.exe | FileCheck --check-prefix=CHECK-EXE %s
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The -no-pie executable link perhaps provide coverage for abs thunks. Since many check strings are similar to pic thunks, some check prefixes can optionally be simplified.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've had a go at sharing more parts.


/// functions do not have BTI compatible landing pads.
/// Expect linker generated landing pads.
.section .text.7, "ax", %progbits
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can add a large alignment for fn1. bl fn2 can probably be changed to b .text.7+offset to test thunks with a non-zero addend. (https://reviews.llvm.org/D70637)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a .text+offset case.

I've not added in the alignment here. I reordered fn1 and fn2 so that I could show that the fallthrough didn't happen with alignment, only to discover that the heuristic I was using for short thunks wasn't good enough. I've made a change to the heuristic, which I now need a test for.

return true;
Defined *d = dyn_cast_or_null<Defined>(&s);
if (d == nullptr || d->section == nullptr ||
d->section->kind() != InputSectionBase::Regular)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

!isa_and_nonnull<InputSection>(d->section)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ACK

// PLT entries accessed indirectly have a BTI c.
if (s.isInPlt())
return true;
Defined *d = dyn_cast_or_null<Defined>(&s);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dyn_cast since &s cannot be nullptr.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ACK

// the target a BTI landing pad.
return true;
InputSection *isec = cast<InputSection>(d->section);
int64_t off = d->value + a;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The sum has a type of uint64_t. Using uint64_t can avoid off < 0 below.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ACK

return true;
const uint8_t *buf = isec->content().begin();
const uint32_t instr = read32le(buf + off);
if (instr == 0xd503233f || // PACIASP.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

return instr == ... || ...

or perhaps return is_contained(/*PACIASP*/0xd503233f, ...)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ACK. I've also used the common parts of the HINT encoding to avoid testing all combinations.

@@ -2243,6 +2243,14 @@ std::pair<Thunk *, bool> ThunkCreator::getThunk(InputSection *isec,
return std::make_pair(t, true);
}

std::pair<Thunk *, bool> ThunkCreator::getSyntheticLandingPad(Defined &d,
int64_t a) {
Thunk *lp = landingPadsBySectionAndAddend[{{d.section, d.value}, a}];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Prefer try_emplace over operator[]

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ACK

auto &dr = cast<Defined>(t->destination);
std::tie(lpt, isNew) = getSyntheticLandingPad(dr, t->addend);
if (isNew) {
InputSection *targetsec = dyn_cast<InputSection>(dr.section);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that targetSec could be nullptr in erroneous non-InputSection cases (MergeInputSection) and would cause a null pointer dereference in getISThunkSec.

If we don't write code guarding against such cases, perhaps just use cast<>.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've used cast<>

The code in isAArch64BTILandingPad has a test

if (!isa_and_nonnull<InputSection>(d->section))

so t->needsSyntheticLandingPad() should return false in those cases.

bool AArch64BTILandingPadThunk::getMayUseShortThunk() {
if (!mayUseShortThunk)
return false;
// If the target is the following instruction then
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The wrap column is smaller than 80.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ACK. I've reformatted.

addSymbol("$x", STT_NOTYPE, 0, isec);
}

InputSection *AArch64BTILandingPadThunk::getTargetInputSection() const {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function is not needed.

Relocations.cpp calls ts = getISThunkSec(targetsec) without checking getTargetInputSection.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is called in mergeCmp() but I've removed it anyway as it is not a strict requirement that the landing pads are before the target section, they just need to be within direct branch range.

@smithp35
Copy link
Collaborator Author

smithp35 commented Sep 23, 2024

Thanks for the review. I'll work on an updated patch, hopefully this week.

Please forgive the github fumbling. Last time I tried to update a PR with a merge conflict I tried rebasing and my pull-request ended up with thousands of commits.

I'll post a comment when I've got an update.

* Remove getTargetInputSection.
* Simplify AArch64 BTI detection.
* Make short thunk heuristic more robust when there is more than one landing
  pad thunk.
Use more Output sections to avoid needing close to 128 MiB of zeroes.
Permits fewer llvm-objcopy instances as we don't need --start-address etc.
@smithp35
Copy link
Collaborator Author

Thanks for your patience. I've reworked the test case to use a lot fewer zeroes.

{{d.section, d.value}, a}, nullptr);
if (res.second)
res.first->second = addLandingPadThunk(d, a);
;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

delete stray ;

@@ -2257,6 +2257,16 @@ std::pair<Thunk *, bool> ThunkCreator::getThunk(InputSection *isec,
return std::make_pair(t, true);
}

std::pair<Thunk *, bool> ThunkCreator::getSyntheticLandingPad(Defined &d,
int64_t a) {
auto res = landingPadsBySectionAndAddend.try_emplace(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could use auto [it, isNew] to improve clarity

auto &dr = cast<Defined>(t->destination);
std::tie(lpt, isNew) = getSyntheticLandingPad(dr, t->addend);
if (isNew) {
InputSection *targetsec = cast<InputSection>(dr.section);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(targetsec => targetSec) since the variable is only used once, just inline it.

@@ -200,6 +203,14 @@ class ThunkCreator {
// The Mips LA25 Thunk is an example of an inline ThunkSection.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps update this comment to mention AArch64BTILandingPadThunk

@@ -0,0 +1,482 @@
// REQUIRES: aarch64
// RUN: rm -rf %t && split-file %s %t && cd %t
// RUN: llvm-mc -filetype=obj -triple=aarch64 asm -o %t.o
Copy link
Member

@MaskRay MaskRay Sep 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After rm -rf %t && split-file %s %t && cd %t, %t can be completely avoided.
For example, name an output file a.o and output as out instead %t.exe

asm

Name assembly files .s. For example, llvm-mc ... a.s -o a.o

@MaskRay
Copy link
Member

MaskRay commented Sep 29, 2024

Sorry that Relocations.cpp Thunks.cpp need rebasing due to my Ctx &ctx changes.

Remove spurious semi-colon.
Use structured bindings.
Inline function call to remove variable.
Improve comment about ThunkedSections to include landing pads.
Remove uses of %t in test.
@smithp35
Copy link
Collaborator Author

Sorry that Relocations.cpp Thunks.cpp need rebasing due to my Ctx &ctx changes.

Thanks for the warning. I've applied the final set of review comments and rebased. I'll wait till tomorrow before merging.

Copy link

github-actions bot commented Sep 30, 2024

✅ With the latest revision this PR passed the C/C++ code formatter.

{{d.section, d.value}, a}, nullptr);
if (isNew)
it->second = addLandingPadThunk(ctx, d, a);
return std::make_pair(it->second, isNew);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

{it->second, isNew} is more idiomatic post C++11

@@ -28,6 +28,39 @@ uint64_t elf::getAArch64Page(uint64_t expr) {
return expr & ~static_cast<uint64_t>(0xFFF);
}

// A BTI landing pad is a valid target for an indirect branch
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reflow the comment after adopting 80-column wrap?

@@ -553,7 +588,11 @@ void AArch64ABSLongThunk::writeLong(uint8_t *buf) {
0x00, 0x00, 0x00, 0x00, // L0: .xword S
0x00, 0x00, 0x00, 0x00,
};
uint64_t s = getAArch64ThunkDestVA(destination, addend);
// if mayNeedLandingPad is true then destination is an
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

// If

@@ -577,7 +616,11 @@ void AArch64ADRPThunk::writeLong(uint8_t *buf) {
0x10, 0x02, 0x00, 0x91, // add x16, x16, R_AARCH64_ADD_ABS_LO12_NC(Dest)
0x00, 0x02, 0x1f, 0xd6, // br x16
};
uint64_t s = getAArch64ThunkDestVA(destination, addend);
// if mayNeedLandingPad is true then destination is an
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

// If

// RUN: llvm-objdump -d --no-show-raw-insn out.so | FileCheck %s
// RUN: llvm-objdump -d --no-show-raw-insn out.so | FileCheck %s --check-prefix=CHECK-PADS
// RUN: llvm-mc -filetype=obj -triple=aarch64 shared -o shared.o
// RUN: ld.lld --shared -o shared.so shared.o --soname=shared.so
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

--soname= could be omitted as we are using relative paths

use {} rather than std::make_pair.
reflow comments.
remove --soname from test.
@smithp35
Copy link
Collaborator Author

smithp35 commented Oct 1, 2024

Thanks for the comments. I've applied those and merged.

@smithp35 smithp35 merged commit c4d9cd8 into llvm:main Oct 1, 2024
8 checks passed
Sterling-Augustine pushed a commit to Sterling-Augustine/llvm-project that referenced this pull request Oct 3, 2024
When Branch Target Identification BTI is enabled all indirect branches
must target a BTI instruction. A long branch thunk is a source of
indirect branches. To date LLD has been assuming that the object
producer is responsible for putting a BTI instruction at all places the
linker might generate an indirect branch to. This is true for clang, but
not for GCC. GCC will elide the BTI instruction when it can prove that
there are no indirect branches from outside the translation unit(s). GNU
ld was fixed to generate a landing pad stub (gnu ld speak for thunk) for
the destination when a long range stub was needed [1].

This means that using GCC compiled objects with LLD may lead to LLD
generating an indirect branch to a location without a BTI. The ABI [2]
has also been clarified to say that it is a static linker's
responsibility to generate a landing pad when the target does not have a
BTI.

This patch implements the same mechansim as GNU ld. When the output ELF
file is setting the
GNU_PROPERTY_AARCH64_FEATURE_1_BTI property, then we check the
destination to see if it has a BTI instruction. If it does not we
generate a landing pad consisting of:
BTI c
B <destination>

The B <destination> can be elided if the thunk can be placed so that
control flow drops through. For example:
BTI c
<destination>:
This will be common when -ffunction-sections is used.

The landing pad thunks are effectively alternative entry points for the
function. Direct branches are unaffected but any linker generated
indirect branch needs to use the alternative. We place these as close as
possible to the destination section.

There is some further optimization possible. Consider the case:
.text
fn1
...
fn2
...

If we need landing pad thunks for both fn1 and fn2 we could order them
so that the thunk for fn1 immediately precedes fn1. This could save a
single branch. However I didn't think that would be worth the additional
complexity.

[1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106671
[2] ARM-software/abi-aa#196
@zmodem
Copy link
Collaborator

zmodem commented Nov 15, 2024

We're hitting asserts after this. I've attached an lld reproducer tarball at https://crbug.com/377438309#comment8

$ tar zxf repro.tar.gz
$ ( cd repro && ../build/bin/ld.lld @response.txt )
ld.lld: /work/llvm-project/lld/ELF/Thunks.cpp:621: virtual void (anonymous namespace)::AArch64ADRPThunk::writeLong(uint8_t *): Assertion `!mayNeedLandingPad || landingPad != nullptr' failed.
PLEASE submit a bug report to https://github.com/llvm/llvm-project/issues/ and include the crash backtrace.
 #0 0x000055666b0b8c68 llvm::sys::PrintStackTrace(llvm::raw_ostream&, int) (../bin/ld.lld+0xc8ac68)
 #1 0x000055666b0b681e llvm::sys::RunSignalHandlers() (../bin/ld.lld+0xc8881e)
 #2 0x000055666b0b9638 SignalHandler(int) Signals.cpp:0:0
 #3 0x00007f0f642591a0 (/lib/x86_64-linux-gnu/libc.so.6+0x3d1a0)
 #4 0x00007f0f642a70ec __pthread_kill_implementation ./nptl/pthread_kill.c:44:76
 #5 0x00007f0f64259102 gsignal ./signal/../sysdeps/posix/raise.c:27:6
 #6 0x00007f0f642424f2 abort ./stdlib/abort.c:81:7
 #7 0x00007f0f64242415 _nl_load_domain ./intl/loadmsgcat.c:1177:9
 #8 0x00007f0f64251d32 (/lib/x86_64-linux-gnu/libc.so.6+0x35d32)
 #9 0x000055666b337831 (anonymous namespace)::AArch64ADRPThunk::writeLong(unsigned char*) Thunks.cpp:0:0
#10 0x000055666b2dbf21 lld::elf::ThunkSection::writeTo(unsigned char*) (../bin/ld.lld+0xeadf21)
#11 0x000055666b2a152b void lld::elf::OutputSection::writeTo<llvm::object::ELFType<(llvm::endianness)1, true>>(unsigned char*, llvm::parallel::TaskGroup&)::'lambda'(unsigned long, unsigned long)::operator()(unsigned long, unsigned long) const (../bin/ld.lld+0xe7352b)
#12 0x000055666ef4c785 std::_Function_handler<void (), llvm::parallel::TaskGroup::spawn(std::function<void ()>)::$_0>::_M_invoke(std::_Any_data const&) Parallel.cpp:0:0
#13 0x000055666ef4c263 llvm::parallel::detail::(anonymous namespace)::ThreadPoolExecutor::work(llvm::ThreadPoolStrategy, unsigned int) Parallel.cpp:0:0
#14 0x00007f0f644dee64 (/lib/x86_64-linux-gnu/libstdc++.so.6+0xdee64)
#15 0x00007f0f642a539c start_thread ./nptl/pthread_create.c:444:8
#16 0x00007f0f64326608 clone3 ./misc/../sysdeps/unix/sysv/linux/x86_64/clone3.S:80:0
Aborted

Could you take a look?

@smithp35
Copy link
Collaborator Author

Thanks for the reproducer. I've downloaded it and reproduced. I'll take a look today.

@smithp35
Copy link
Collaborator Author

To give you a quick update on the cause. There is an optimisation that won't create a BTI landing pad for a location if the linker can use a direct branch in the Thunk. https://github.com/llvm/llvm-project/blob/main/lld/ELF/Thunks.cpp#L580

It looks like what is happening is:

  • Pass N thunk is created. It is in short-branch range so a direct branch can be used, no BTI needs creating.
  • Pass M (where M > N). More content changes are made so that the thunk can no longer use a short-branch.
  • writeTo() I need an indirect branch, where's my landing pad, assert failure.

I need to make a smaller test case, and to make the landing pad generation run each pass. Hopefully I can get this done Today, but just in case it can't, how urgent is this for you?

Right now the hwasan checks

0000000000000000 <__hwasan_check_x0_0>:
       0: 9344dc10      sbfx    x16, x0, #4, #52
       4: 38706930      ldrb    w16, [x9, x16]
       8: eb40e21f      cmp     x16, x0, lsr #56
       c: 54000041      b.ne    0x14 <__hwasan_check_x0_0+0x14>
      10: d65f03c0      ret
      14: a9b007e0      stp     x0, x1, [sp, #-0x100]!
      18: a90efbfd      stp     x29, x30, [sp, #0xe8]
      1c: d2800001      mov     x1, #0x0                // =0
      20: 90000010      adrp    x16, 0x0 <__hwasan_check_x0_0>
      24: f9400210      ldr     x16, [x16]
      28: d61f0200      br      x16

Do not have a BTI or compatible instruction, so these would cause a runtime failure if BTI was enabled and the linker did generate an indirect branch (and code-path went through it).

There's a quick fix to disable the optimisation, but that would mean updating a rather large test, so I'd prefer to take the time to fix with the optimisation.

smithp35 added a commit to smithp35/llvm-project that referenced this pull request Nov 15, 2024
Move Landing Pad Creation to a new function that checks each thunk
every pass to see if it needs a landing pad. This permits a thunk
to be created without needing a landing pad, but later needing one
due to drifting out of direct branch range and requiring an indirect
branch.

We record all the Thunks created so far in a new vector rather than
trying to iterate over the DenseMap as we need a deterministic order
of adding LandingPadThunks due to the short branch fall through. We
cannot use normalizeExistingThunk() either as that only iterates
through live thunks.

Fixes: https://crbug.com/377438309
Original PR: llvm#108989

Sending without a new test case to fix existing test. A new regression
test will come in a separate PR as coming up with a small enough
reproducer for this case is non-trivial.
@smithp35
Copy link
Collaborator Author

I've created #116402 to fix the repro in https://crbug.com/377438309#comment8 I've not got a regression test yet as that will take some time to get one small enough. I hope to add one next week.

smithp35 added a commit that referenced this pull request Nov 15, 2024
Move Landing Pad Creation to a new function that checks each thunk every
pass to see if it needs a landing pad. This permits a thunk to be
created without needing a landing pad, but later needing one due to
drifting out of direct branch range and requiring an indirect branch.

We record all the Thunks created so far in a new vector rather than
trying to iterate over the DenseMap as we need a deterministic order of
adding LandingPadThunks due to the short branch fall through. We cannot
use normalizeExistingThunk() either as that only iterates through live
thunks.

Fixes: https://crbug.com/377438309
Original PR: #108989

Sending without a new test case to fix existing test. A new regression
test will come in a separate PR as coming up with a small enough
reproducer for this case is non-trivial.
@zmodem
Copy link
Collaborator

zmodem commented Nov 15, 2024

Thanks for the quick fix!

smithp35 added a commit that referenced this pull request Nov 21, 2024
When we create a thunk we don't know whether it will be short or long.
Move the emission of the long thunk mapping symbol to when we transition
to a long thunk. This improves disassembly and binary analysis as tools
like BOLT identify thunks by disassembly.

This removes a FIXME added in #108989 aarch64-thunk-bti-multipass.s
which had a corrupt disassembly due to missing mapping symbols.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants