ELF: CFI jump table relaxation. #147424
Conversation
Created using spr 1.3.6-beta.1
@llvm/pr-subscribers-lld-elf @llvm/pr-subscribers-lld

Author: Peter Collingbourne (pcc)

Changes: see the full patch description below.

Full diff: https://github.com/llvm/llvm-project/pull/147424.diff

3 Files Affected:
- lld/ELF/Arch/X86_64.cpp
- lld/ELF/Relocations.cpp
- lld/ELF/Target.h
diff --git a/lld/ELF/Arch/X86_64.cpp b/lld/ELF/Arch/X86_64.cpp
index 488f4803b2cb4..04ca79befdc4a 100644
--- a/lld/ELF/Arch/X86_64.cpp
+++ b/lld/ELF/Arch/X86_64.cpp
@@ -318,6 +318,9 @@ bool X86_64::deleteFallThruJmpInsn(InputSection &is, InputFile *file,
}
bool X86_64::relaxOnce(int pass) const {
+ if (pass == 0)
+ relaxJumpTables(ctx);
+
uint64_t minVA = UINT64_MAX, maxVA = 0;
for (OutputSection *osec : ctx.outputSections) {
if (!(osec->flags & SHF_ALLOC))
@@ -1231,6 +1234,98 @@ void X86_64::applyBranchToBranchOpt() const {
redirectControlTransferRelocations);
}
+void elf::relaxJumpTables(Ctx &ctx) {
+ // Relax CFI jump tables.
+ // - Split jump table into pieces and place target functions inside the jump
+ // table if small enough.
+ // - Move jump table before last called function and delete last branch
+ // instruction.
+ std::map<InputSection *, std::vector<InputSection *>> sectionReplacements;
+ SmallVector<InputSection *, 0> storage;
+ for (OutputSection *osec : ctx.outputSections) {
+ if (!(osec->flags & SHF_EXECINSTR))
+ continue;
+ for (InputSection *sec : getInputSections(*osec, storage)) {
+ if (!sec->name.starts_with(".text..L.cfi.jumptable"))
+ continue;
+ std::vector<InputSection *> replacements;
+ replacements.push_back(sec);
+ auto addSectionSlice = [&](size_t begin, size_t end, Relocation *rbegin,
+ Relocation *rend) {
+ if (begin == end)
+ return;
+ auto *slice = make<InputSection>(
+ sec->file, sec->name, sec->type, sec->flags, 1, sec->entsize,
+ sec->contentMaybeDecompress().slice(begin, end - begin));
+ for (const Relocation &r : ArrayRef<Relocation>(rbegin, rend)) {
+ slice->relocations.push_back(
+ Relocation{r.expr, r.type, r.offset - begin, r.addend, r.sym});
+ }
+ replacements.push_back(slice);
+ };
+ auto getMovableSection = [&](Relocation &r) -> InputSection * {
+ auto *sym = dyn_cast_or_null<Defined>(r.sym);
+ if (!sym || sym->isPreemptible || sym->isGnuIFunc() || sym->value != 0)
+ return nullptr;
+ auto *sec = dyn_cast_or_null<InputSection>(sym->section);
+ if (!sec || sectionReplacements.count(sec))
+ return nullptr;
+ return sec;
+ };
+ size_t begin = 0;
+ Relocation *rbegin = sec->relocs().begin();
+ for (auto &r : sec->relocs().slice(0, sec->relocs().size() - 1)) {
+ auto entrySize = (&r + 1)->offset - r.offset;
+ InputSection *target = getMovableSection(r);
+ if (!target || target->size > entrySize)
+ continue;
+ target->addralign = 1;
+ addSectionSlice(begin, r.offset - 1, rbegin, &r);
+ replacements.push_back(target);
+ sectionReplacements[target] = {};
+ begin = r.offset - 1 + target->size;
+ rbegin = &r + 1;
+ }
+ InputSection *lastSec = getMovableSection(sec->relocs().back());
+ if (lastSec) {
+ lastSec->addralign = 1;
+ addSectionSlice(begin, sec->relocs().back().offset - 1, rbegin,
+ &sec->relocs().back());
+ replacements.push_back(lastSec);
+ sectionReplacements[sec] = {};
+ sectionReplacements[lastSec] = replacements;
+ for (auto *s : replacements)
+ s->parent = lastSec->parent;
+ } else {
+ addSectionSlice(begin, sec->size, rbegin, sec->relocs().end());
+ sectionReplacements[sec] = replacements;
+ for (auto *s : replacements)
+ s->parent = sec->parent;
+ }
+ sec->relocations.clear();
+ sec->size = 0;
+ }
+ }
+ for (OutputSection *osec : ctx.outputSections) {
+ if (!(osec->flags & SHF_EXECINSTR))
+ continue;
+ for (SectionCommand *cmd : osec->commands) {
+ auto *isd = dyn_cast<InputSectionDescription>(cmd);
+ if (!isd)
+ continue;
+ SmallVector<InputSection *> newSections;
+ for (auto *sec : isd->sections) {
+ auto i = sectionReplacements.find(sec);
+ if (i == sectionReplacements.end())
+ newSections.push_back(sec);
+ else
+ newSections.append(i->second.begin(), i->second.end());
+ }
+ isd->sections = std::move(newSections);
+ }
+ }
+}
+
// If Intel Indirect Branch Tracking is enabled, we have to emit special PLT
// entries containing endbr64 instructions. A PLT entry will be split into two
// parts, one in .plt.sec (writePlt), and the other in .plt (writeIBTPlt).
diff --git a/lld/ELF/Relocations.cpp b/lld/ELF/Relocations.cpp
index cebd564036b2c..f7e3d54878395 100644
--- a/lld/ELF/Relocations.cpp
+++ b/lld/ELF/Relocations.cpp
@@ -1674,7 +1674,7 @@ void RelocationScanner::scan(Relocs<RelTy> rels) {
// R_RISCV_PCREL_HI20, R_PPC64_ADDR64 and the branch-to-branch optimization.
if (ctx.arg.emachine == EM_RISCV ||
(ctx.arg.emachine == EM_PPC64 && sec->name == ".toc") ||
- ctx.arg.branchToBranch)
+ ctx.arg.branchToBranch || sec->name.starts_with(".text..L.cfi.jumptable"))
llvm::stable_sort(sec->relocs(),
[](const Relocation &lhs, const Relocation &rhs) {
return lhs.offset < rhs.offset;
diff --git a/lld/ELF/Target.h b/lld/ELF/Target.h
index 6dd20b2f0cbaa..e6eb33fa5338c 100644
--- a/lld/ELF/Target.h
+++ b/lld/ELF/Target.h
@@ -195,6 +195,7 @@ void setSPARCV9TargetInfo(Ctx &);
void setSystemZTargetInfo(Ctx &);
void setX86TargetInfo(Ctx &);
void setX86_64TargetInfo(Ctx &);
+void relaxJumpTables(Ctx &);
struct ErrorPlace {
InputSectionBase *isec;
There is no accompanying test, making its transformations unclear. Relying on a "magic" section name to trigger transformations feels unreliable and imprecise. Introducing a new section type and relocation type might justify the feature.
Apologies if the description wasn't enough. My intent was to write a test once we agree on the protocol between the compiler and the linker. I'll try to illustrate my goal with some assembly:
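Something along these lines, assuming Clang's usual x86_64 CFI jump table layout (8-byte entries: a 5-byte jmp plus int3 padding) and two small functions f and g; the names and bodies are made up for illustration:

        # Hypothetical input: each jump table entry is a direct jmp to the
        # real function body, padded to the 8-byte entry size.
        .section .text..L.cfi.jumptable,"ax",@progbits
        .p2align 3
f.cfi:                                  # CFI-canonical address of f
        jmp     f
        int3
        int3
        int3
g.cfi:                                  # CFI-canonical address of g
        jmp     g
        int3
        int3
        int3

        .text
f:                                      # small enough to fit in one entry
        xorl    %eax, %eax
        retq
g:                                      # last function referenced by the table
        movl    $1, %eax
        retq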
I want the linker to do this:
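Under the same illustrative assumptions, roughly this: f's body replaces its former entry, the remainder of the jump table is placed immediately before g, and the trailing jmp is deleted so that g.cfi falls through into g:

        # Hypothetical output of the relaxation (leftover entry bytes shown
        # as int3 padding).
        .section .text..L.cfi.jumptable,"ax",@progbits
f.cfi:                                  # f's body now occupies its old entry
f:
        xorl    %eax, %eax
        retq
        int3                            # remainder of the old 8-byte entry
        int3
        int3
        int3
        int3
g.cfi:                                  # trailing jmp deleted; g is placed here,
g:                                      # so a call through g.cfi lands on g
        movl    $1, %eax
        retq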
Agreed. I used a magic section name for expediency so that I'd have something to share in this PR. So do you think that both a new section type and a new relocation type are needed? I was thinking that either/or would be enough.
Indirection via the jump table increases the icache and TLB miss rate
associated with indirect calls, and according to internal benchmarking
was identified as one of the main runtime costs of CFI, contributing
around 30% of the total overhead. #145579 addressed the problem for
direct calls to jump table entries, but the indirect call overhead is
still present. This patch implements jump table relaxation, which is a
technique for opportunistically reducing the indirect call overhead.
The basic idea is to eliminate the indirection by moving function
bodies into the jump table wherever possible. This is possible in two
circumstances:
In both cases, we may move the function body into the jump table by
splitting the jump table in two, with enough space in the middle for the
function body, and placing the function there.
We leave the last function in the jump table at its original location
and place the rest of the jump table immediately before it. The goal of this is to
decrease the TLB miss rate, on the assumption that it is more likely
for functions with the same type (and their callees) to be in the same
page as each other than for them to be in the same page as the original
location of the jump table (typically clustered together near the end
of the binary).
Jump table relaxation was found to reduce the overhead of CFI in a large
realistic internal Google benchmark by between 0.2 and 0.5 percentage
points, or 10-25%, depending on the microarchitecture.
TODO:
The jump table relaxation optimization as implemented is not sound in
general. At minimum, it must assume that it is only possible to branch
to entry points [1]. Therefore, we need a way to mark jump table sections
so that the linker knows that it is safe to do the optimization. In this
prototype implementation, I use the section name to identify jump tables
using the pattern that LLVM happens to use, but this is not particularly
sound. Possibilities that we may consider include:
- A new section type: sections of this type are considered to be jump table
  sections, and the sh_entsize field of the section notifies the linker of
  the jump table entry size.
Of course, once we've decided on the appropriate way to identify the
jump table sections, tests will need to be written.
This implementation is for X86_64 only. I considered whether it would be
possible to implement it on AArch64. One difficulty is that the range
extension thunk pass is allowed to place a thunk in the middle of the
split-up jump table, which would break the jump table check arithmetic,
so we would need to teach the thunk infrastructure to avoid doing this.
This would likely be done in a followup patch.
[1] In this prototype implementation, I made these additional assumptions
for expediency:
- Functions do not require any particular alignment when they are
  relaxed to fit in the jump table.
- … section (i.e. the jump table section).
- … in the jump table section.
I think it should be possible to find a way to avoid making these
assumptions in a later version of this change, for example the first one
can be avoided by modifying Clang to optionally use smaller alignments
for smaller functions.