Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LLVM17 performing faulty outlining for cortex-m targets causes program crash #118867

Closed
jamesmunns opened this issue Dec 12, 2023 · 27 comments · Fixed by #119802
Closed

LLVM17 performing faulty outlining for cortex-m targets causes program crash #118867

jamesmunns opened this issue Dec 12, 2023 · 27 comments · Fixed by #119802
Assignees
Labels
A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. C-bug Category: This is a bug. I-unsound Issue: A soundness hole (worst kind of bug), see: https://en.wikipedia.org/wiki/Soundness llvm-fixed-upstream Issue expected to be fixed by the next major LLVM upgrade, or backported fixes O-Arm Target: 32-bit Arm processors (armv6, armv7, thumb...), including 64-bit Arm in AArch32 state P-high High priority regression-from-stable-to-stable Performance or correctness regression from one stable version to another.

Comments

@jamesmunns
Copy link
Member

jamesmunns commented Dec 12, 2023

I tried this code:

https://github.com/peter9477/none-fault

(additional repro notes in the README there). I have verified this causes the code to crash as it jumps into a random RAM location unexpectedly.

This code does a fairly benign call to what boils down to:

format!("{:?}", None::<usize>);

However, things get sorta bad.

; This is deep inside <Option as core::fmt::Debug>::fmt
;
; Here, the vtable lookup has been outlined as OUTLINED_FUNCTION_14
;
                     LAB_000024ca                       
000024ca        00 f0 73 fb     bl         OUTLINED_FUNCTION_14

; calls into OUTLINED_FUNCTION_14
; at the time of call, LR = 24CF, which has clobbered the LR of the parent
;

    00002bb4    d1 e9 05 01     ldrd       r0,r1,[r1,#0x14]
    00002bb8    04 22           movs       r2,#0x4
    00002bba    cb 68           ldr        r3,[r1,#0xc]
    00002bbc    70 47           bx         lr

; The outlined function returns, using LR
;
; We now call the vtable method, however we don't restore LR prior to calling

000024ce        03 49           ldr        r1,[DAT_000024dc]   
000024d0        18 47           bx         r3

; This looks like it is SUPPOSED to be a tail-call function,
; and return to the caller of <Option as core::fmt::Debug>::fmt,
; but instead returns TO <Option as core::fmt::Debug>::fmt

    00000b74    80 b5           push       {r7,lr}
    00000b76    6f 46           mov        r7,sp
    00000b78    00 68           ldr        r0,[r0,#0x0]
    00000b7a    0a 44           add        r2,r1
    00000b7c    00 f0 54 f8     bl         _ZN132_$LT$alloc..vec..Vec$LT$T$C$A$GT$$u20$as 
    00000b80    00 20           movs       r0,#0x0
    00000b82    80 bd           pop        {r7,pc}

; since LR was clobbered, we go BACK to 24CE. However, since
; r3 is clobbered at this point (by memcpy), we jump into program memory
; and hard fault

The disassembled version by ghidra looks like this:

void _ZN66_$LT$core..option..Option$LT$T$GT$$u20$as$u20$core..fmt..Debug$GT$3fmt17h6606eac464c97c06E
               (int *param_1,undefined4 param_2,undefined4 param_3,code *UNRECOVERED_JUMPTABLE)

{
  undefined4 uVar1; // this is r3
  
  if (*param_1 != 0) {
    OUTLINED_FUNCTION_10(0x1be9,param_2,"SomeBusyNone    ",&stack0xfffffff4,0x1be9);
    return;
  }
  uVar1 = OUTLINED_FUNCTION_14();
                    /* WARNING: Could not recover jumptable at 0x000024d0. Too many branches */
                    /* WARNING: Treating indirect jump as call */

  // NOTE(jamesmunns): We call this function, but then return back and call it again!
  (*UNRECOVERED_JUMPTABLE)(uVar1,"None    "); 
  return;
}

As far as I can tell for THIS reproduction, it:

  • DOES happen on nightly-2023-08-09 and all that I've tried later than this
  • DOES NOT happen on nightly-2023-08-08 and before
  • The project updated from LLVM16 to LLVM17 on 2023-08-08, it seems -09 is the first nightly with LLVM17
  • ONLY happens with lto = "fat", thinlto and no lto do not reproduce
    • Edit: I only observed it happening with fatlto, but Peter previously saw it with thinlto
  • ONLY happens with -Oz (edit: the repro uses -Oz for debug builds, switching to -O3 in release does not repro)

I've attached my specific elf file, so you can look at the same memory locations referenced in my issue

none-fault.elf.zip

This does seem tempermental, and tweaking unrelated pieces of the repro code causes it to disappear.

CC @peter9477 @Dirbaio

@jamesmunns jamesmunns added the C-bug Category: This is a bug. label Dec 12, 2023
@rustbot rustbot added the needs-triage This issue may need triage. Remove it if it has been sufficiently triaged. label Dec 12, 2023
@jamesmunns
Copy link
Member Author

Also pushed jamesmunns/none-fault@cc96c01 which is the exact version used to produce the attached elf file above.

@jamesmunns
Copy link
Member Author

Tagging in @fhahn, who suggested that llvm/llvm-project#73553 may address this, if we can update the LLVM17 tip.

I've not done a "rebuild rustc with different LLVM submodule" before, but I'll ask around the embedded-rust folks to see if anyone can give me a hand or try it out.

@peter9477
Copy link

FYI, with my original code base where this first showed up, I can confirm that in fact this happens with lto = "fat" or lto = "thin", but not with lto = "off".

@jamesmunns jamesmunns changed the title LTO performing faulty outlining for cortex-m targets in LLVM17 LLVM17 performing faulty outlining for cortex-m targets causes program crash Dec 12, 2023
@jamesmunns
Copy link
Member Author

Yep, updated the title to point to the updated assumption that this is more likely due to LR clobbering of the outliner, rather than LTO. In my attempts, lto="thin" didn't repro, but that could have been chance more than science.

@saethlin
Copy link
Member

ONLY happens with -Oz (edit: the repro uses -Oz for debug builds, switching to -O3 in release does not repro)

What happens with -Copt-level=s? What happens with -Copt-level=z + -Zshare-generics=no?

@saethlin saethlin added A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. O-Arm Target: 32-bit Arm processors (armv6, armv7, thumb...), including 64-bit Arm in AArch32 state I-unsound Issue: A soundness hole (worst kind of bug), see: https://en.wikipedia.org/wiki/Soundness and removed needs-triage This issue may need triage. Remove it if it has been sufficiently triaged. labels Dec 12, 2023
@rustbot rustbot added the I-prioritize Issue: Indicates that prioritization has been requested for this issue. label Dec 12, 2023
@jamesmunns
Copy link
Member Author

jamesmunns commented Dec 13, 2023

  • does not repro with -Copt-level=s (well, I changed the profile)
  • does repro with RUSTFLAGS='-Zshare-generics=no' cargo build edit: with opt-level='z'

I think @peter9477 might have mentioned the original code having problems in opt='s', though this repro is pretty sensitive to sometimes working with relatively minor changes.

@jamesmunns
Copy link
Member Author

If someone can point me to docs that show how to build a new toolchain, including switching the LLVM version to the branch from llvm/llvm-project#73553 (hoping that it Just Builds), I can likely take a look at it this weekend.

Otherwise, if anyone knows how to do that quickly, and can do that + do a cargo build on https://github.com/peter9477/none-fault and give me the produced target/thumbv7em-none-eabihf/debug/none-fault, I can test that sooner.

@peter9477
Copy link

peter9477 commented Dec 13, 2023

I think @peter9477 might have mentioned the original code having problems in opt='s', though this repro is pretty sensitive to sometimes working with relatively minor changes.

As James mentioned, I believe I had reproduced it yesterday with both "s" and "z", and definitely not with 1, 2, or 3, but it was a very long session and I won't swear to it. The most obvious symptom (a hard fault) can come and go based on link order (it seems), and at one stage during minimization I had it causing a borrow_ref_mut() panic inside a critical section (which should be impossible). It's entirely possible that this is what I remember failing with "s". At the time I assumed they were both manifestations of the same issue (but someone later suggested an unrelated possible cause for that issue).

I just tried with "s" on my full code base and it does not fail.

@Dirbaio
Copy link
Contributor

Dirbaio commented Dec 13, 2023

The bug is caused by the outliner, which is only enabled in opt-level=z I think (couldn't find an official source for this).

Can you check if it adding RUSTFLAGS=-Cllvm-args=--enable-machine-outliner=never also fixes it on opt-level=z?

@jamesmunns
Copy link
Member Author

The bug is caused by the outliner, which is only enabled in opt-level=z I think (couldn't find an official source for this).

Can you check if it adding RUSTFLAGS=-Cllvm-args=--enable-machine-outliner=never also fixes it on opt-level=z?

It does not reproduce with these settings

@peter9477
Copy link

Confirmed. With lto="fat" and opt-level="z" and that RUSTFLAG added (to my .cargo/config.toml) the issue does not appear.

@nikic
Copy link
Contributor

nikic commented Dec 21, 2023

Looks like the new PR for this issue has been merged: llvm/llvm-project#75527

@nikic nikic added the llvm-fixed-upstream Issue expected to be fixed by the next major LLVM upgrade, or backported fixes label Dec 21, 2023
@jamesmunns
Copy link
Member Author

jamesmunns commented Dec 21, 2023

Hey @nikic, is there a way we can get one of the following two options done before rust 1.75? Otherwise we'll have a stable-to-stable regression for all cortex-m targets (thumbv*-none-eabi*):

  • Pull in this new LLVM tip
  • Make RUSTFLAGS=-Cllvm-args=--enable-machine-outliner=never the default temporarily as a hotfix

If anyone can point me to docs on how to:

  • update the rustc repo to use the new LLVM tip
  • build a version of the compiler with that LLVM tip

So I can test whether that fix addresses the issue we saw (I expect it will!), I can do that ASAP.

@nikic
Copy link
Contributor

nikic commented Dec 21, 2023

The LLVM 17 update is already on stable, so I don't think this would cause a new stable regression. Together with the fact that a trivial workaround exists, I am not willing to include this in the 1.75 release.

I do plan to backport this to nightly (and the new beta) though, together with a number of other fixes.

@jamesmunns
Copy link
Member Author

@nikic did LLVM17 make it into 1.74? I thought it was merged after the branch point, but I am less familiar with how the release train works.

I don't think we observed this in stable releases, but the repro did use nightly features, so we may just not have gotten lucky in hitting it. I can try and use some BOOTSTRAP tricks to see if I can force the repro w/ nightly features on the stable 1.74.

If this does exist on 1.74 (and then 1.75), we'll need to put out a warning as the embedded-wg for everyone to use one of the workarounds, as this can be hard to detect and cause totally random crashes and unsoundness.

@jamesmunns
Copy link
Member Author

Ah yes, LLVM17 landed in 1.73, actually: https://releases.rs/docs/1.73.0/

I'll see if I can reproduce with bootstrap on 1.73/1.74.

@jamesmunns
Copy link
Member Author

Confirming, by using the BOOTSTRAP flag to allow nightly features:

  • Rust 1.72.0 stable does NOT reproduce
  • Rust 1.73.0 stable DOES reproduce
  • Rust 1.74.0 stable DOES reproduce.

@jamesmunns
Copy link
Member Author

Not sure if it helps, but could we get a regression-from-stable-to-stable Performance or correctness regression from one stable version to another. label? As @nikic mentioned, the regression was actually 1.72 to 1.73, but also affects 1.74 and the soon-to-be 1.75.

Following up with embedded-wg for a potential advisory notice.

@Dirbaio
Copy link
Contributor

Dirbaio commented Dec 21, 2023

@rustbot label regression-from-stable-to-stable

@rustbot rustbot added the regression-from-stable-to-stable Performance or correctness regression from one stable version to another. label Dec 21, 2023
@jamesmunns

This comment was marked as outdated.

@jamesmunns
Copy link
Member Author

Advisory now here: rust-embedded/cortex-m#503

@benma
Copy link

benma commented Dec 29, 2023

The LLVM 17 update is already on stable, so I don't think this would cause a new stable regression. Together with the fact that a trivial workaround exists, I am not willing to include this in the 1.75 release.

I do plan to backport this to nightly (and the new beta) though, together with a number of other fixes.

Imho this was not the right call - it is a serious bug and users could be shipping faulty firmwares if they don't see this issue in time. If a workaround exists, it should be bundled in the release until a proper fix lands. It would be safer and more efficient for all users. Furthermore, since one cannot be sure if one suffers from this bug if no problems are observed, it is hard to test if applying the workaround actually fixed anything - issues could appear only in production.

At the very least, the official Rust release notes should mention this bug.

benma added a commit to benma/bitbox02-firmware that referenced this issue Dec 29, 2023
Apply workaround for a potential miscompilation bug:
See
- rust-embedded/cortex-m#503
- rust-lang/rust#118867
We have not observed any abnormal behavior even though we fulfil all the criteria to be affected
(opt-level='z', thumbv7em-none-eabi target, Rust toolchain 1.74.0 being >= 1.73.0), but we apply
the workaround just in case.
This increases the binary size of the `make firmware` (Multi) build by 11568 at the time of
adding this workaround. This can be removed again once the issue above is fixed and we have
updated to a Rust toolchain that contains the fix.

This workaround is one of three suggested alternatives. The other two
are:
- Downgrade the Rust toolchain - dismissed as it is harder to do, as
  it involves re-building the Docker image
- Switch from opt-level='z' to opt-level='s' - this increases the
  binary size by 74416 bytes for the Multi, which is much more than the
  workaround in this commit.
benma added a commit to benma/bitbox02-firmware that referenced this issue Dec 29, 2023
Apply workaround for a potential miscompilation bug:
See
- rust-embedded/cortex-m#503
- rust-lang/rust#118867
We have not observed any abnormal behavior even though we fulfil all the criteria to be affected
(opt-level='z', thumbv7em-none-eabi target, Rust toolchain 1.74.0 being >= 1.73.0), but we apply
the workaround just in case.
This increases the binary size of the `make firmware` (Multi) build by
11568 bytes at the time of
adding this workaround. This can be removed again once the issue above is fixed and we have
updated to a Rust toolchain that contains the fix.

This workaround is one of three suggested alternatives. The other two
are:
- Downgrade the Rust toolchain - dismissed as it is harder to do, as
  it involves re-building the Docker image
- Switch from opt-level='z' to opt-level='s' - this increases the
  binary size by 74416 bytes for the Multi, which is much more than the
  workaround in this commit.
@apiraino
Copy link
Contributor

WG-prioritization assigning priority (Zulip discussion).

@rustbot label -I-prioritize +P-high

@rustbot rustbot added P-high High priority and removed I-prioritize Issue: Indicates that prioritization has been requested for this issue. labels Dec 30, 2023
@nikic nikic self-assigned this Jan 9, 2024
@Sympatron
Copy link

@nikic What is the reason you did not want to address this for 1.75? This bug can cause unsound behavior for all programs targeting cortex-m without any unsafe code at any time on stable.
This is extremely serious in my opinion and would have been an easy fix.
You could just disable the outliner by passing --enable-machine-outliner=never to LLVM for these targets until this is resolved by updating LLVM which will hopefully happen in 1.76.

@jamesmunns
Copy link
Member Author

@Sympatron I'm not a moderator, but I might suggest that this issue is not the right place to voice procedural grieviences.

Nikic has made their reasoning clear above (even if you or I disagree with it), and I'd really like to avoid a pile-on here, particularly as nikic is the one doing the legwork of getting this fixed by pulling in the LLVM patches.

As thumb targets are tier two, I don't believe the release team has any responsibility to "stop the world" to correct any known soundness issues, even if they do so on a best effort basis.

@Sympatron
Copy link

What I meant to say is that this issue may warrant a 1.75.1 release, as I believe it is critical enough, despite it being a tier 2 target.
If not, there should at least be a public warning, such as a blog post or something similar.

@bors bors closed this as completed in 062e7c6 Jan 11, 2024
@Emilgardis
Copy link
Contributor

#119802 is in the process of being backported

github-actions bot pushed a commit to rust-lang/miri that referenced this issue Jan 13, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. C-bug Category: This is a bug. I-unsound Issue: A soundness hole (worst kind of bug), see: https://en.wikipedia.org/wiki/Soundness llvm-fixed-upstream Issue expected to be fixed by the next major LLVM upgrade, or backported fixes O-Arm Target: 32-bit Arm processors (armv6, armv7, thumb...), including 64-bit Arm in AArch32 state P-high High priority regression-from-stable-to-stable Performance or correctness regression from one stable version to another.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

10 participants