Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bolt] fix a wrong relocation update issue with weak references #69136

Merged
merged 1 commit into from
Aug 7, 2024

Conversation

linsinan1995
Copy link
Member

It is legal to have an address of zero with weak references, but Bolt will update the relocation against this symbol to its PLT entry(a bolt-synthetized symbol and has a non-zero address), which leads to wrong runtime behaviour.

I recently encountered a problem where a segv occurs after using Bolt(crash even just llvm-bolt app -o app.opt). It turns out to be related to weak references. e.g.

__attribute__((weak)) void undef_weak_fun();

  if (&undef_weak_fun)
    undef_weak_fun();

(ref: https://maskray.me/blog/2021-04-25-weak-symbol)

In this case, an undefined weak symbol undef_weak_fun has an address of zero, and Bolt incorrectly changes the relocation for the corresponding symbol to symbol@PLT, leading to incorrect runtime behaviour.

A real-world use case of weak reference: facebook/zstd@6cee3c2

@linsinan1995 linsinan1995 changed the title [Bolt] do not search for PLT entries if the relocation is against [Bolt] fix a wrong relocation update issue with weak references Oct 16, 2023
@linsinan1995 linsinan1995 force-pushed the bolt-weak-reference branch 2 times, most recently from d999ee5 to cef0e7d Compare October 16, 2023 04:58
Copy link
Member

@yota9 yota9 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fix!

@@ -1974,6 +1974,14 @@ bool RewriteInstance::analyzeRelocation(
if (!Relocation::isSupported(RType))
return false;

auto isWeakReference = [](const SymbolRef &Symbol) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IsWeakUndReference

Copy link
Member

@yota9 yota9 Nov 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ping, please capitalise and add Und

.type _start, %function
_start:
.LFB6:
.cfi_startproc
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to add second case in test, where the symbol is actually emitted

@linsinan1995 linsinan1995 force-pushed the bolt-weak-reference branch 2 times, most recently from f54c244 to 345b857 Compare October 31, 2023 13:58
Copy link
Member

@yota9 yota9 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the patch! Please capitalise and and Und to IsWeakUndReference.

@@ -1974,6 +1974,14 @@ bool RewriteInstance::analyzeRelocation(
if (!Relocation::isSupported(RType))
return false;

auto isWeakReference = [](const SymbolRef &Symbol) {
Copy link
Member

@yota9 yota9 Nov 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ping, please capitalise and add Und


# CHECK: {{.*}} <.rodata>:
# CHECK-NEXT: {{.*}} .word 0x{{[0]+}}[[#ADDR]]
# CHECK-NEXT: {{.*}} .word 0x00000000
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The func_1 check is missed now

@linsinan1995
Copy link
Member Author

linsinan1995 commented Nov 30, 2023

Hi @yota9 , sorry for a late reply.

I made several changes,

  • change assembly code to obj2yaml output, since I notice the linker could impact the result.
  • add both func_1 and func_2 tests for weak undefined and normal cases.

I still use IsWeakReference, since weak reference is a term for such a case. Is it ok?

@yota9
Copy link
Member

yota9 commented Dec 4, 2023

Hi @linsinan1995 . May I ask in what what it might impact it? Does some linkers removes the symbol name?
We're trying to avoid yaml tests. If it completely necessary we need to minimise yaml file (e.g. removing the symbols and sections that are not related to the test).

I still use IsWeakReference, since weak reference is a term for such a case. Is it ok?

Corrent me if I'm wrong but the weak reference doesn't always mean that the symbol is NULL. You're looking to the 2 parameters: weak and undefined, that is why I was asking to add both weak and und to the lambda name.

@linsinan1995
Copy link
Member Author

linsinan1995 commented Dec 5, 2023

Hi @linsinan1995 . May I ask in what what it might impact it? Does some linkers removes the symbol name? We're trying to avoid yaml tests. If it completely necessary we need to minimise yaml file (e.g. removing the symbols and sections that are not related to the test).

Hi,
I was unable to reproduce this issue in lld previously(fail to link), so as for the previous patch, I used the local ld linker for linking. However, this case cannot pass on the Windows aarch64 CI (most likely because the local linker is not ld), which is what I mentioned about the impact of the linker.

I spent some time learning why this happened, and finally found a way to reproduce it with lld ... (we need dynamic relocation and also func_1/2 symbol values live in .rodata)

I still use IsWeakReference, since weak reference is a term for such a case. Is it ok?

Corrent me if I'm wrong but the weak reference doesn't always mean that the symbol is NULL. You're looking to the 2 parameters: weak and undefined, that is why I was asking to add both weak and und to the lambda name.

I think weak reference equals weak undefined. AArch64 ELF spec says: A weak reference — This is denoted by st_shndx=SHN_UNDEF, ELF64_ST_BIND()=STB_WEAK.

But both are fine. in lld/ELF/Symbols.h UndefWeak is used and in lld/MachO/Symbols.h and textapi isWeakRef is used. I have no preference for both names, or maybe I could just use UndefWeak, which is more intuitive.

@yota9
Copy link
Member

yota9 commented Dec 5, 2023

@linsinan1995 Thanks for explanations, I don't have objections on this :) The one thing I want to ask you is to minimise the text asm, it seems to be unnecessary to add all of the instructions , we want to keep tests minimalistic

…eak reference symbol.

Take a common weak reference pattern for example
```
__attribute__((weak)) void undef_weak_fun();

  if (&undef_weak_fun)
    undef_weak_fun();
```

In this case, an undefined weak symbol `undef_weak_fun` has an address
of zero, and Bolt incorrectly changes the relocation for the
corresponding symbol to symbol@PLT, leading to incorrect runtime
behavior.
@linsinan1995
Copy link
Member Author

@linsinan1995 Thanks for explanations, I don't have objections on this :) The one think I want to ask you is to minimise the text asm, it seems to be unnecessary to add all of the instructions , we want to keep tests minimalistic

Done :)

Copy link
Member

@yota9 yota9 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your fix! :)

@rafaelauler
Copy link
Contributor

Thanks!

@rafaelauler
Copy link
Contributor

Regarding the formatting/descriptions on the PR:

Capitalize BOLT. As a suggestion, I would also make the title shorter (like 50/72 rule format). Although we don't explicitly enforce this, a lot of devs use this because it looks nicer on git log. In BOLT, most of diffs are formatted like that.

For example:

[BOLT] Don't resolve weak undefs in data to PLT

In AArch64, RISCV, BOLT resolved weak undefs in data sections to a PLT
entry. Even though code refs to weak syms can go to a PLT entry, data
section should remain zero to support the pattern if (sym) sym() used
with weak syms. BOLT was breaking this. Fix it.

The first line of the commit message should be the title of the PR, while the remaining lines formatted to 72 columns are the the description of the PR.

BTW I was re-reading https://reviews.llvm.org/D118088 on why are we redirecting these relocs to PLT entries in the first place, and I can't really tell why. @yota9 do you remember why? The problem with that is that we could be pessimizing some .data references if the linker concluded they should go straight to the functions and is bypassing the PLT. By doing this we go to these optimized references and convert them back to go through the PLT again. In some instances this might change the intended behavior too if the application was linked with the intent to make a given symbol non-preemptible, but we go there and make it preemptible again by redirecting the reference to a PLT entry. We can't blindly redirect everything to a PLT entry, and this weak-undef case is just one example why doing so is incorrect.

@maksfb
Copy link
Contributor

maksfb commented Dec 6, 2023

If the assembly code was generated from C source, could you please include the source in the comments and what compiler flags were used to generate the code?

@linsinan1995
Copy link
Member Author

linsinan1995 commented Dec 6, 2023

Regarding the formatting/descriptions on the PR:

Capitalize BOLT. As a suggestion, I would also make the title shorter (like 50/72 rule format). Although we don't explicitly enforce this, a lot of devs use this because it looks nicer on git log. In BOLT, most of diffs are formatted like that.

For example:

[BOLT] Don't resolve weak undefs in data to PLT

In AArch64, RISCV, BOLT resolved weak undefs in data sections to a PLT
entry. Even though code refs to weak syms can go to a PLT entry, data
section should remain zero to support the pattern if (sym) sym() used
with weak syms. BOLT was breaking this. Fix it.

The first line of the commit message should be the title of the PR, while the remaining lines formatted to 72 columns are the the description of the PR.

Will do. Thanks a ton for your detailed guide!

BTW I was re-reading https://reviews.llvm.org/D118088 on why are we redirecting these relocs to PLT entries in the first place, and I can't really tell why. @yota9 do you remember why? The problem with that is that we could be pessimizing some .data references if the linker concluded they should go straight to the functions and is bypassing the PLT. By doing this we go to these optimized references and convert them back to go through the PLT again. In some instances this might change the intended behavior too if the application was linked with the intent to make a given symbol non-preemptible, but we go there and make it preemptible again by redirecting the reference to a PLT entry. We can't blindly redirect everything to a PLT entry, and this weak-undef case is just one example why doing so is incorrect.

Thats a really good point.

func@plt symbol is synthetic and they are not in symtab, thus it brings difficulty when analyzing some relocs, such as R_AARCH64_ADR_PREL_PG_HI21. D118088 could help for such cases. I think a better solution for this issue is that BOLT should only replace the address with the plt symbol for those plt related relocations, so non-preemptible cases will not be touched and and so weak undef cases as well. What do you think?

@yota9
Copy link
Member

yota9 commented Dec 6, 2023

BTW I was re-reading https://reviews.llvm.org/D118088 on why are we redirecting these relocs to PLT entries in the first place, and I can't really tell why. @yota9 do you remember why?

Sorry, I fill like I didn't quite understand your question. I think you're talking about these lines:

if (!SymbolAddress && (IsAArch64 || BC->isRISCV())) {
const BinaryData *BD = BC->getPLTBinaryDataByName(SymbolName);
SymbolAddress = BD ? BD->getAddress() : 0;
}

The logic behind this is simple: we found the symbol, but the symbol has no address. Usually it means that the symbol is UND and it would be found it PLT. Yes, I didn't take into account the case like weak reference fixed here.
As for the non-preemtable symbols (visibility protected I think you're talking about) - these symbols won't be UND and we won't search for the PLT entries.

Maybe I misunderstand you, sorry, please give more more information and I would try to answer :)

@rafaelauler
Copy link
Contributor

Thanks for the context, @yota9. I missed the check that guards on the symbol having no address, which makes this code much more restricted than I previously thought. But I wonder which symbols would have no address, and still be strong definitions (non-weak)? I was trying to understand why is this if() necessary, specially after the fix in this patch.

Another thing I was discussing with Maksim yesterday: Maksim pointed out that we should be skipping processing static relocations in places that also have dynamic relocations, and this weak ref case is an example that should have a dynamic relocation. After all, the weak symbol pattern is precisely one in which the runtime can modify it by resolving the weak undef, so BOLT has no business in updating a reference that will be resolved at runtime. But if we look at https://github.com/llvm/llvm-project/blob/main/bolt/lib/Rewrite/RewriteInstance.cpp#L2451 we only do this for non-AArch64, because of https://reviews.llvm.org/D122100.

Now, we have this exception that AArch64 processes static relocs in places that have dynamic relocs in the same place because AArch64 uses R_AARCH64_RELATIVE in .rela.dyn (dynamic) to fixup the load address in constant islands. So we're forced to read static relocs that also have dynamic relocs in the same offset, for this specific case, for AArch64. But that seems to be too big of a hammer to fix this problem. Perhaps we should only process those static relocations if they have the dynamic R_AARCH64_RELATIVE in the same offset, but not other kinds of dynamic relocs [that might mean that the runtime will completely recompute the address at that location].

@yota9
Copy link
Member

yota9 commented Dec 6, 2023

. But I wonder which symbols would have no address, and still be strong definitions (non-weak)

Any PLT-related symbol located in another binary (DEFAULT UND)

Maksim pointed out that we should be skipping processing static relocations in places that also have dynamic relocations

As for the further discussions - we need to consider it carefully. Maybe you're right and we need to add "white" list of dynamic relocations that would also have static relocations. To be honest I can't say what way would be the best here right away, it needs deep consideration and long list of tests I think :) I would try to ponder about this at leisure cause it doesn't look as simple question to me..

# CHECK: {{.*}} <.rodata>:
# CHECK-NEXT: {{.*}} .word 0x00000000
# CHECK-NEXT: {{.*}} .word 0x00000000
# CHECK-NEXT: {{.*}} .word 0x{{[0]+}}[[#ADDR]]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In a discussion, another thing Maksim pointed out is that this actually needs to be zero too, because this part is resolved by dynamic linker at runtime. I've checked and the pre-bolt binary is indeed zeroed here. Currently, BOLT makes this non-zero because we're processing dynamic relocs for aarch64, for the reasons I discussed above.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's true that before it is 0 here. But it is up to the linker to set value here or not. and for example to support RELR we need to set values here. I don't think it is a problem and would rather set the value, than not.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's true that before it is 0 here. But it is up to the linker to set value here or not. and for example to support RELR we need to set values here. I don't think it is a problem and would rather set the value, than not.

Can we rely on the linker's decision in such cases? I.e., if the linker decided to put 0, keep it at that. If it was the symbol value, then we can update it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's true that before it is 0 here. But it is up to the linker to set value here or not. and for example to support RELR we need to set values here. I don't think it is a problem and would rather set the value, than not.

Can we rely on the linker's decision in such cases? I.e., if the linker decided to put 0, keep it at that. If it was the symbol value, then we can update it.

Theoretically we can. But is there a reason behind it? It doesn't affect runtime. And to be honest I prefer to set the values, we can easily see the value during objdump, sometimes it is useful.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yota9 why should BOLT change this? Isn't leaving it to linker the right thing to do and also less risky in terms of correctness?

Copy link
Member

@yota9 yota9 Aug 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@s-dag Currently I don't see any risks. If there would be some troubles I would try to handle such situations in separate patch, but right now I didn't see binaries where we should worry about these changes. Anyway this question is out of scope of this patch.

@yota9
Copy link
Member

yota9 commented Jul 29, 2024

So can we agree on merging this commit @rafaelauler ? I still think that this extra check is not useless and it fixes the problem :)

@yota9
Copy link
Member

yota9 commented Jul 31, 2024

@maksfb @rafaelauler If no ojections I would like to merge this patch by the end of a week and backport it to 19x. Thanks!

@yota9
Copy link
Member

yota9 commented Aug 5, 2024

Hi @linsinan1995. Please merge this patch by wednesday too if no further comments would be provided. Thank you!

@s-dag
Copy link

s-dag commented Aug 6, 2024

Hi,

Wanted to report this seems to be the same issue we ran into while trying to apply BOLT to an Aarch64 binary dynamically linked with musl. So it is a very important use case for embedded software world and we expect more people will run into it as BOLT Aarch64/embedded usage picks up. It can be reproduced with a simple hello_world. Reproduction steps are below:

cat hello.c
#include <stdio.h>

int main() {
     printf("Hello World");
     return 0;
}

1- aarch64-unknown-linux-musl-gcc hello.c -Wl,-dynamic-linker=ld-musl-aarch64.so.1 -Wl,-emit-relocs -o hello_aarch64

2- llvm-bolt hello_aarch64 -o hello_bolted_skip --skip-funcs=.* --use-gnu-stack

hello_bolted_skip segfaults at start. When we compared disassembly, we saw .rodata had differences which boiled down to <__register_frame_info@plt> /<__deregister_frame_info@plt> relocs added by BOLT that are actually undefined weak references and are not defined in musl.

The patch in 2acdf00 does fix this reproducer as well as our own binary. We hope to see it merged soon.

@yota9
Copy link
Member

yota9 commented Aug 7, 2024

I agree, me or @linsinan1995 would probably merge it by the end of the day

@linsinan1995 linsinan1995 merged commit 6c8933e into llvm:main Aug 7, 2024
3 checks passed
@linsinan1995 linsinan1995 added this to the LLVM 19.X Release milestone Aug 7, 2024
@linsinan1995
Copy link
Member Author

/cherry-pick 6c8933e

@linsinan1995
Copy link
Member Author

merged, Thanks!

llvmbot pushed a commit to llvmbot/llvm-project that referenced this pull request Aug 7, 2024
)

Take a common weak reference pattern for example
```
    __attribute__((weak)) void undef_weak_fun();

      if (&undef_weak_fun)
        undef_weak_fun();
```

In this case, an undefined weak symbol `undef_weak_fun` has an address
of zero, and Bolt incorrectly changes the relocation for the
corresponding symbol to symbol@PLT, leading to incorrect runtime
behavior.

(cherry picked from commit 6c8933e)
@llvmbot
Copy link
Member

llvmbot commented Aug 7, 2024

/pull-request #102295

TIFitis pushed a commit that referenced this pull request Aug 8, 2024
Take a common weak reference pattern for example
```
    __attribute__((weak)) void undef_weak_fun();
    
      if (&undef_weak_fun)
        undef_weak_fun();
```
    
In this case, an undefined weak symbol `undef_weak_fun` has an address
of zero, and Bolt incorrectly changes the relocation for the
corresponding symbol to symbol@PLT, leading to incorrect runtime
behavior.
tru pushed a commit to llvmbot/llvm-project that referenced this pull request Aug 10, 2024
)

Take a common weak reference pattern for example
```
    __attribute__((weak)) void undef_weak_fun();

      if (&undef_weak_fun)
        undef_weak_fun();
```

In this case, an undefined weak symbol `undef_weak_fun` has an address
of zero, and Bolt incorrectly changes the relocation for the
corresponding symbol to symbol@PLT, leading to incorrect runtime
behavior.

(cherry picked from commit 6c8933e)
kstoimenov pushed a commit to kstoimenov/llvm-project that referenced this pull request Aug 15, 2024
)

Take a common weak reference pattern for example
```
    __attribute__((weak)) void undef_weak_fun();
    
      if (&undef_weak_fun)
        undef_weak_fun();
```
    
In this case, an undefined weak symbol `undef_weak_fun` has an address
of zero, and Bolt incorrectly changes the relocation for the
corresponding symbol to symbol@PLT, leading to incorrect runtime
behavior.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
Development

Successfully merging this pull request may close these issues.

6 participants