
[x86] Use function-relative label differences in the large code model, and consider it in other code models #62894

Closed
rnk opened this issue May 23, 2023 · 6 comments


rnk (Collaborator) commented May 23, 2023

This is the code pattern LLVM currently uses for jump tables:

int f(int x, int y) {
  switch (x) {
  case 0: y *= 2;
  case 1: y++;
  case 2: y *= 3;
  case 3: y++;
  case 4: y += 4;
  }
  return y;
}

->

        movl    %edi, %ecx
        leaq    .LJTI0_0(%rip), %rdx
        movslq  (%rdx,%rcx,4), %rcx
        addq    %rdx, %rcx
        jmpq    *%rcx
...
        .section        .rodata,"a",@progbits
        .p2align        2, 0x0
.LJTI0_0:
        .long   .LBB0_2-.LJTI0_0
        .long   .LBB0_3-.LJTI0_0
        .long   .LBB0_4-.LJTI0_0
        .long   .LBB0_5-.LJTI0_0
        .long   .LBB0_6-.LJTI0_0

As shown, the address of the jump table is materialized (LEA), the 32-bit entry is loaded and sign-extended, added to the materialized table address, and we jump through the result.

These label differences may overflow in the large code model when .rodata is far from .text. The pattern also requires a static relocation for each entry, because each delta crosses sections. See the objdump output:

RELOCATION RECORDS FOR [.rodata]:
OFFSET           TYPE              VALUE
0000000000000000 R_X86_64_PC32     .text+0x0000000000000019
0000000000000004 R_X86_64_PC32     .text+0x000000000000001f
0000000000000008 R_X86_64_PC32     .text+0x0000000000000025
000000000000000c R_X86_64_PC32     .text+0x000000000000002c
0000000000000010 R_X86_64_PC32     .text+0x0000000000000032

We could, however, change the code pattern to avoid these relocations by folding a displacement into the movslq instruction above. Consider this assembly:

        movl    %edi, %ecx
        leaq    .Lfunc_begin0(%rip), %rdx
        movslq  .LJTI0_0-.Lfunc_begin0(%rdx,%rcx,4), %rcx
        addq    %rdx, %rcx
        jmpq    *%rcx
...
.LJTI0_0:
        .long   .LBB0_2-.Lfunc_begin0
        .long   .LBB0_3-.Lfunc_begin0
        .long   .LBB0_4-.Lfunc_begin0
        .long   .LBB0_5-.Lfunc_begin0
        .long   .LBB0_6-.Lfunc_begin0

This doesn't add any instructions, but it does use a more complex addressing mode, which increases code size by 4 bytes and could degrade runtime performance slightly. Normally we never care about static object file size, but I imagine that computing smaller, non-negative offsets results in more zero bytes in the jump table, which probably compress better in the end.
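For reference, the 4 bytes come from the disp32 field in the ModRM/SIB encoding. A sketch of the two encodings (byte values shown for illustration; the disp32 value itself depends on the layout):

        movslq  (%rdx,%rcx,4), %rcx                        # 48 63 0c 8a           (4 bytes)
        movslq  .LJTI0_0-.Lfunc_begin0(%rdx,%rcx,4), %rcx  # 48 63 8c 8a xx xx xx xx (8 bytes)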

We can't use the displacement encoding trick in the large code model. Instead, we have to materialize a 64-bit offset and add it, but that's the large code model for you.
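One way to read that, as a sketch: materialize the table's absolute address with movabsq. Note this only makes the table address itself overflow-proof; the 32-bit entries remain the real problem:

        movl    %edi, %ecx
        movabsq $.LJTI0_0, %rdx          # 64-bit absolute table address
                                         # (R_X86_64_64; no +/-2GiB assumption)
        movslq  (%rdx,%rcx,4), %rcx      # load a 32-bit table-relative entry
        addq    %rdx, %rcx
        jmpq    *%rcx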

@aeubanks @jyknight @tkoeppe


@efriedma-quic (Collaborator):

Note that AArch64 uses distances between basic blocks to implement compressed jump tables. (Not sure how difficult it would be to implement compression on x86, given that we don't currently need to track instruction sizes there...)
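As a purely hypothetical sketch, the same idea transplanted to x86 could use byte-sized, function-relative entries. All names here are assumed, and it only works if every target lies within 255 bytes of the function entry, which is exactly the size-tracking problem just mentioned:

        leaq    .Lfunc_begin0(%rip), %rdx
        movzbl  .LJTI0_0-.Lfunc_begin0(%rdx,%rcx), %ecx  # 8-bit entry, index scale 1
        addq    %rdx, %rcx
        jmpq    *%rcx
...
.LJTI0_0:
        .byte   .LBB0_2-.Lfunc_begin0    # same-section label differences:
        .byte   .LBB0_3-.Lfunc_begin0    # resolved at assembly time, no relocations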

A more exotic alternative is to embed the jump table into the text section:

        leaq    .Lanchor(%rip), %rdx
        leaq    (%rdx,%rcx,8), %rcx
        jmpq    *%rcx
.p2align 3
.Lanchor:
        jmp     .LBB0_2
.p2align 3
        jmp     .LBB0_3
.p2align 3
        jmp     .LBB0_4
.p2align 3
        jmp     .LBB0_5

rnk (Collaborator, Author) commented May 26, 2023

We also have the option to put the data directly into the text section, which I think we did at some point in history. The only downside is that it creates more ROP gadgets and doesn't disassemble nicely. Darwin does this, although I'm not familiar with the .data_region jt32 directive; maybe the linker uses it to move the table to rodata.
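For reference, the Darwin-style emission looks roughly like this sketch (shown with the ELF-style labels used above; the actual Mach-O labels and sections differ):

        .data_region jt32                # mark the following bytes as a 32-bit jump table
.LJTI0_0:
        .long   .LBB0_2-.LJTI0_0
        .long   .LBB0_3-.LJTI0_0
        .end_data_region                 # resume normal code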

I think the executable jump table you've suggested ends up having worse code size overall, so it's a fun alternative, but probably not something we'd ship. :)

If we want to compress x86 jump tables, we could either assume an upper bound on function length (65536 bytes is the obvious one), or we could invent fancy assembler directive patterns (.if L1 - L2 < 65536 ... .else).
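A hedged sketch of what that directive pattern could look like, selecting one entry width for the whole table based on function size (all names illustrative; GAS may reject this if the label difference isn't resolvable when .if is evaluated, which is part of why this is "fancy"):

.LJTI0_0:
        .if     .Lfunc_end0-.Lfunc_begin0 < 65536
        .short  .LBB0_2-.Lfunc_begin0    # 16-bit entries when the function is small
        .short  .LBB0_3-.Lfunc_begin0
        .else
        .long   .LBB0_2-.Lfunc_begin0    # otherwise fall back to 32-bit entries
        .long   .LBB0_3-.Lfunc_begin0
        .endif

The load side would need a matching conditional to pick movzwl versus movslq.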

@efriedma-quic (Collaborator):

> although I'm not familiar with the .data_region jt32 directive; maybe the linker uses it to move the table to rodata.

I think that just marks the fact that there's data embedded in a code section, so disassemblers can be aware.

> I think the executable jump table you've suggested ends up having worse code size overall, so it's a fun alternative, but probably not something we'd ship. :)

You can eliminate the padding at the cost of an extra "add"... that's 5 bytes per entry, so not that terrible. (Instead of .p2align, you write something like .fill .Lbefore_branch+5-.Lafter_branch, 1, 0x90 to ensure each entry is exactly 5 bytes.)
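Concretely, that might look like the following sketch (labels invented for illustration; a near jmp is at most 5 bytes, and the .fill pads any shorter encoding with NOPs):

        leaq    .Lanchor(%rip), %rdx
        leaq    (%rcx,%rcx,4), %rcx      # the extra add: index*5, since entries are 5 bytes
        addq    %rdx, %rcx
        jmpq    *%rcx
.Lanchor:
.Lb0:   jmp     .LBB0_2
.La0:   .fill   .Lb0+5-.La0, 1, 0x90     # pad entry 0 to exactly 5 bytes
.Lb1:   jmp     .LBB0_3
.La1:   .fill   .Lb1+5-.La1, 1, 0x90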

@aeubanks (Contributor):

https://reviews.llvm.org/D159297 uses a different approach: just make jump table entries 64 bits under the large code model. That's more resistant to blocks within a function being moved around (e.g., by BOLT-like optimizations).
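Sketching what 64-bit entries look like, following the names in the examples above:

        movl    %edi, %ecx
        movabsq $.LJTI0_0, %rdx          # absolute table address
        movq    (%rdx,%rcx,8), %rcx      # each entry is a full 64-bit pointer
        jmpq    *%rcx
...
.LJTI0_0:
        .quad   .LBB0_2                  # one R_X86_64_64 per entry; immune to
        .quad   .LBB0_3                  # 32-bit overflow and block reordering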

rnk (Collaborator, Author) commented Aug 31, 2023

I think the main benefit of this would be reducing static relocation size, which is a very low priority for the project, so we're not going to implement this idea.

rnk closed this as completed Aug 31, 2023
aeubanks added a commit that referenced this issue Aug 31, 2023
With the large code model, the label difference may not fit into 32 bits.
Even if we assume that any individual function is no larger than 2^32
and use a difference from the function entry to the target destination,
things like BOLT can rearrange blocks (even if BOLT doesn't necessarily
work with the large code model right now).

Set directives avoid static relocations in some 32-bit entry cases, but this patch doesn't bother with set directives for 64-bit jump table entries (we can do that later if somebody really cares about it).

check-llvm passes with a bootstrapped clang built with the large code model.

Fixes #62894

Reviewed By: rnk

Differential Revision: https://reviews.llvm.org/D159297
smeenai pushed a commit to smeenai/llvm-project that referenced this issue Sep 1, 2023