
[x86] Use function-relative label differences in the large code model, and consider it in other code models #62894

Closed
rnk opened this issue May 23, 2023 · 6 comments


rnk (Collaborator) commented May 23, 2023

This is the code pattern LLVM currently uses for jump tables:

int f(int x, int y) {
  switch (x) {
  case 0: y *= 2;
  case 1: y++;
  case 2: y *= 3;
  case 3: y++;
  case 4: y += 4;
  }
  return y;
}

->

        movl    %edi, %ecx
        leaq    .LJTI0_0(%rip), %rdx
        movslq  (%rdx,%rcx,4), %rcx
        addq    %rdx, %rcx
        jmpq    *%rcx
...
        .section        .rodata,"a",@progbits
        .p2align        2, 0x0
.LJTI0_0:
        .long   .LBB0_2-.LJTI0_0
        .long   .LBB0_3-.LJTI0_0
        .long   .LBB0_4-.LJTI0_0
        .long   .LBB0_5-.LJTI0_0
        .long   .LBB0_6-.LJTI0_0

As shown, the address of the jump table is materialized (LEA), the 32-bit entry is loaded and sign-extended, added to the materialized table address, and we jump through the result.

These label differences may overflow in the large code model when .rodata is far from .text. The pattern also requires a static relocation for each entry, because each delta crosses sections. See the objdump output:

RELOCATION RECORDS FOR [.rodata]:
OFFSET           TYPE              VALUE
0000000000000000 R_X86_64_PC32     .text+0x0000000000000019
0000000000000004 R_X86_64_PC32     .text+0x000000000000001f
0000000000000008 R_X86_64_PC32     .text+0x0000000000000025
000000000000000c R_X86_64_PC32     .text+0x000000000000002c
0000000000000010 R_X86_64_PC32     .text+0x0000000000000032

We could, however, change the code pattern to avoid these relocations by folding a displacement into the movslq instruction above. Consider this assembly:

        movl    %edi, %ecx
        leaq    .Lfunc_begin0(%rip), %rdx
        movslq  .LJTI0_0-.Lfunc_begin0(%rdx,%rcx,4), %rcx
        addq    %rdx, %rcx
        jmpq    *%rcx
...
.LJTI0_0:
        .long   .LBB0_2-.Lfunc_begin0
        .long   .LBB0_3-.Lfunc_begin0
        .long   .LBB0_4-.Lfunc_begin0
        .long   .LBB0_5-.Lfunc_begin0
        .long   .LBB0_6-.Lfunc_begin0

This doesn't add any instructions, but it does use a more complex addressing mode, which increases code size by 4 bytes and could degrade runtime performance slightly. Normally we never care about static object file size, but I imagine that computing smaller, non-negative offsets results in more zero bytes in the jump table, which probably compress better in the end.
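For reference, the 4 bytes come from the disp32 field in the ModRM/SIB encoding. A sketch of the two encodings (byte values shown for illustration; the disp32 value itself depends on the layout):

        movslq  (%rdx,%rcx,4), %rcx                        # 48 63 0c 8a           (4 bytes)
        movslq  .LJTI0_0-.Lfunc_begin0(%rdx,%rcx,4), %rcx  # 48 63 8c 8a xx xx xx xx (8 bytes)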

We can't use the displacement encoding trick in the large code model. Instead, we have to materialize a 64-bit offset and add it, but that's the large code model for you.
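One way to read that, as a sketch: materialize the table's absolute address with movabsq. Note this only makes the table address itself overflow-proof; the 32-bit entries remain the real problem:

        movl    %edi, %ecx
        movabsq $.LJTI0_0, %rdx          # 64-bit absolute table address
                                         # (R_X86_64_64; no +/-2GiB assumption)
        movslq  (%rdx,%rcx,4), %rcx      # load a 32-bit table-relative entry
        addq    %rdx, %rcx
        jmpq    *%rcx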

@aeubanks @jyknight @tkoeppe


@efriedma-quic (Collaborator):

Note that AArch64 uses distances between basic blocks to implement compressed jump tables. (Not sure how difficult it would be to implement compression on x86, given that we don't currently need to track instruction sizes there...)
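As a purely hypothetical sketch, the same idea transplanted to x86 could use byte-sized, function-relative entries. All names here are assumed, and it only works if every target lies within 255 bytes of the function entry, which is exactly the size-tracking problem just mentioned:

        leaq    .Lfunc_begin0(%rip), %rdx
        movzbl  .LJTI0_0-.Lfunc_begin0(%rdx,%rcx), %ecx  # 8-bit entry, index scale 1
        addq    %rdx, %rcx
        jmpq    *%rcx
...
.LJTI0_0:
        .byte   .LBB0_2-.Lfunc_begin0    # same-section label differences:
        .byte   .LBB0_3-.Lfunc_begin0    # resolved at assembly time, no relocations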

A more exotic alternative is to embed the jump table into the text section:

        leaq    .Lanchor(%rip), %rdx
        leaq    (%rdx,%rcx,8), %rcx
        jmpq    *%rcx
.p2align 3
.Lanchor:
        jmp     .LBB0_2
.p2align 3
        jmp     .LBB0_3
.p2align 3
        jmp     .LBB0_4
.p2align 3
        jmp     .LBB0_5

rnk (Collaborator, Author) commented May 26, 2023

We also have the option to put the data directly into the text section, which I think we did at some point in history. The only downside is that it creates more ROP gadgets and doesn't disassemble nicely. Darwin does this, although I'm not familiar with the .data_region jt32 directive; maybe the linker uses it to move the table to rodata.
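For reference, the Darwin-style emission looks roughly like this sketch (shown with the ELF-style labels used above; the actual Mach-O labels and sections differ):

        .data_region jt32                # mark the following bytes as a 32-bit jump table
.LJTI0_0:
        .long   .LBB0_2-.LJTI0_0
        .long   .LBB0_3-.LJTI0_0
        .end_data_region                 # resume normal code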

I think the executable jump table you've suggested ends up having worse code size overall, so it's a fun alternative, but probably not something we'd ship. :)

If we want to compress x86 jump tables, we could either assume an upper bound on function length (65536 bytes is the obvious one), or we could invent fancy assembler directive patterns (.if L1 - L2 < 65536 ... .else).
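A hedged sketch of what that directive pattern could look like, selecting one entry width for the whole table based on function size (all names illustrative; GAS may reject this if the label difference isn't resolvable when .if is evaluated, which is part of why this is "fancy"):

.LJTI0_0:
        .if     .Lfunc_end0-.Lfunc_begin0 < 65536
        .short  .LBB0_2-.Lfunc_begin0    # 16-bit entries when the function is small
        .short  .LBB0_3-.Lfunc_begin0
        .else
        .long   .LBB0_2-.Lfunc_begin0    # otherwise fall back to 32-bit entries
        .long   .LBB0_3-.Lfunc_begin0
        .endif

The load side would need a matching conditional to pick movzwl versus movslq.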

@efriedma-quic (Collaborator):

> although I'm not familiar with the .data_region jt32 directive; maybe the linker uses it to move the table to rodata.

I think that just marks the fact that there's data embedded in a code section, so disassemblers can be aware.

> I think the executable jump table you've suggested ends up having worse code size overall, so it's a fun alternative, but probably not something we'd ship. :)

You can eliminate the padding at the cost of an extra "add"... that's 5 bytes per entry, so not that terrible. (Instead of .p2align, you write something like .fill .Lbefore_branch+5-.Lafter_branch, 1, 0x90 to ensure each entry is exactly 5 bytes.)
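Concretely, that might look like the following sketch (labels invented for illustration; a near jmp is at most 5 bytes, and the .fill pads any shorter encoding with NOPs):

        leaq    .Lanchor(%rip), %rdx
        leaq    (%rcx,%rcx,4), %rcx      # the extra add: index*5, since entries are 5 bytes
        addq    %rdx, %rcx
        jmpq    *%rcx
.Lanchor:
.Lb0:   jmp     .LBB0_2
.La0:   .fill   .Lb0+5-.La0, 1, 0x90     # pad entry 0 to exactly 5 bytes
.Lb1:   jmp     .LBB0_3
.La1:   .fill   .Lb1+5-.La1, 1, 0x90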

@aeubanks (Contributor):

https://reviews.llvm.org/D159297 uses a different approach: just make jump table entries 64 bits under the large code model. That's more resistant to blocks within a function being moved around (e.g., by BOLT-like optimizations).
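Sketching what 64-bit entries look like, following the names in the examples above:

        movl    %edi, %ecx
        movabsq $.LJTI0_0, %rdx          # absolute table address
        movq    (%rdx,%rcx,8), %rcx      # each entry is a full 64-bit pointer
        jmpq    *%rcx
...
.LJTI0_0:
        .quad   .LBB0_2                  # one R_X86_64_64 per entry; immune to
        .quad   .LBB0_3                  # 32-bit overflow and block reordering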

rnk (Collaborator, Author) commented Aug 31, 2023

I think the main benefit of this would be reducing static relocation size, which is a very low priority for the project, so we're not going to implement this idea.

rnk closed this as completed Aug 31, 2023
aeubanks added a commit that referenced this issue Aug 31, 2023
With the large code model, the label difference may not fit into 32 bits.
Even if we assume that any individual function is no larger than 2^32
and use a difference from the function entry to the target destination,
things like BOLT can rearrange blocks (even if BOLT doesn't necessarily
work with the large code model right now).

Set directives avoid static relocations in some 32-bit entry cases, but this patch doesn't bother with set directives for 64-bit jump table entries (we can do that later if somebody really cares about it).

check-llvm passes with a bootstrapped clang built with the large code model.

Fixes #62894

Reviewed By: rnk

Differential Revision: https://reviews.llvm.org/D159297
smeenai pushed a commit to smeenai/llvm-project that referenced this issue Sep 1, 2023