LLVM17 performing faulty outlining for cortex-m targets causes program crash #118867

jamesmunns · 2023-12-12T12:36:03Z

I tried this code:

https://github.com/peter9477/none-fault

(additional repro notes in the README there). I have verified this causes the code to crash as it jumps into a random RAM location unexpectedly.

This code does a fairly benign call to what boils down to:

format!("{:?}", None::<usize>);

However, things get sorta bad.

; This is deep inside <Option as core::fmt::Debug>::fmt
;
; Here, the vtable lookup has been outlined as OUTLINED_FUNCTION_14
;
                     LAB_000024ca                       
000024ca        00 f0 73 fb     bl         OUTLINED_FUNCTION_14

; calls into OUTLINED_FUNCTION_14
; at the time of call, LR = 24CF, which has clobbered the LR of the parent
;

    00002bb4    d1 e9 05 01     ldrd       r0,r1,[r1,#0x14]
    00002bb8    04 22           movs       r2,#0x4
    00002bba    cb 68           ldr        r3,[r1,#0xc]
    00002bbc    70 47           bx         lr

; The outlined function returns, using LR
;
; We now call the vtable method, however we don't restore LR prior to calling

000024ce        03 49           ldr        r1,[DAT_000024dc]   
000024d0        18 47           bx         r3

; This looks like it is SUPPOSED to be a tail-call function,
; and return to the caller of <Option as core::fmt::Debug>::fmt,
; but instead returns TO <Option as core::fmt::Debug>::fmt

    00000b74    80 b5           push       {r7,lr}
    00000b76    6f 46           mov        r7,sp
    00000b78    00 68           ldr        r0,[r0,#0x0]
    00000b7a    0a 44           add        r2,r1
    00000b7c    00 f0 54 f8     bl         _ZN132_$LT$alloc..vec..Vec$LT$T$C$A$GT$$u20$as 
    00000b80    00 20           movs       r0,#0x0
    00000b82    80 bd           pop        {r7,pc}

; since LR was clobbered, we go BACK to 24CE. However, since
; r3 is clobbered at this point (by memcpy), we jump into program memory
; and hard fault

The disassembled version by ghidra looks like this:

void _ZN66_$LT$core..option..Option$LT$T$GT$$u20$as$u20$core..fmt..Debug$GT$3fmt17h6606eac464c97c06E
               (int *param_1,undefined4 param_2,undefined4 param_3,code *UNRECOVERED_JUMPTABLE)

{
  undefined4 uVar1; // this is r3
  
  if (*param_1 != 0) {
    OUTLINED_FUNCTION_10(0x1be9,param_2,"SomeBusyNone    ",&stack0xfffffff4,0x1be9);
    return;
  }
  uVar1 = OUTLINED_FUNCTION_14();
                    /* WARNING: Could not recover jumptable at 0x000024d0. Too many branches */
                    /* WARNING: Treating indirect jump as call */

  // NOTE(jamesmunns): We call this function, but then return back and call it again!
  (*UNRECOVERED_JUMPTABLE)(uVar1,"None    "); 
  return;
}

As far as I can tell for THIS reproduction, it:

DOES happen on nightly-2023-08-09 and all that I've tried later than this
DOES NOT happen on nightly-2023-08-08 and before
The project updated from LLVM16 to LLVM17 on 2023-08-08, it seems -09 is the first nightly with LLVM17
~~ONLY happens with lto = "fat", thinlto and no lto do not reproduce~~
- Edit: I only observed it happening with fatlto, but Peter previously saw it with thinlto
ONLY happens with -Oz (edit: the repro uses -Oz for debug builds, switching to -O3 in release does not repro)

I've attached my specific elf file, so you can look at the same memory locations referenced in my issue

none-fault.elf.zip

This does seem tempermental, and tweaking unrelated pieces of the repro code causes it to disappear.

CC @peter9477 @Dirbaio

The text was updated successfully, but these errors were encountered:

jamesmunns · 2023-12-12T13:25:38Z

Also pushed jamesmunns/none-fault@cc96c01 which is the exact version used to produce the attached elf file above.

jamesmunns · 2023-12-12T19:23:19Z

Tagging in @fhahn, who suggested that llvm/llvm-project#73553 may address this, if we can update the LLVM17 tip.

I've not done a "rebuild rustc with different LLVM submodule" before, but I'll ask around the embedded-rust folks to see if anyone can give me a hand or try it out.

peter9477 · 2023-12-12T19:27:17Z

FYI, with my original code base where this first showed up, I can confirm that in fact this happens with lto = "fat" or lto = "thin", but not with lto = "off".

jamesmunns · 2023-12-12T19:29:24Z

Yep, updated the title to point to the updated assumption that this is more likely due to LR clobbering of the outliner, rather than LTO. In my attempts, lto="thin" didn't repro, but that could have been chance more than science.

saethlin · 2023-12-12T23:50:24Z

ONLY happens with -Oz (edit: the repro uses -Oz for debug builds, switching to -O3 in release does not repro)

What happens with -Copt-level=s? What happens with -Copt-level=z + -Zshare-generics=no?

jamesmunns · 2023-12-13T00:04:58Z

does not repro with -Copt-level=s (well, I changed the profile)
does repro with RUSTFLAGS='-Zshare-generics=no' cargo build edit: with opt-level='z'

I think @peter9477 might have mentioned the original code having problems in opt='s', though this repro is pretty sensitive to sometimes working with relatively minor changes.

jamesmunns · 2023-12-13T00:19:00Z

If someone can point me to docs that show how to build a new toolchain, including switching the LLVM version to the branch from llvm/llvm-project#73553 (hoping that it Just Builds), I can likely take a look at it this weekend.

Otherwise, if anyone knows how to do that quickly, and can do that + do a cargo build on https://github.com/peter9477/none-fault and give me the produced target/thumbv7em-none-eabihf/debug/none-fault, I can test that sooner.

peter9477 · 2023-12-13T00:36:50Z

I think @peter9477 might have mentioned the original code having problems in opt='s', though this repro is pretty sensitive to sometimes working with relatively minor changes.

As James mentioned, I believe I had reproduced it yesterday with both "s" and "z", and definitely not with 1, 2, or 3, but it was a very long session and I won't swear to it. The most obvious symptom (a hard fault) can come and go based on link order (it seems), and at one stage during minimization I had it causing a borrow_ref_mut() panic inside a critical section (which should be impossible). It's entirely possible that this is what I remember failing with "s". At the time I assumed they were both manifestations of the same issue (but someone later suggested an unrelated possible cause for that issue).

I just tried with "s" on my full code base and it does not fail.

Dirbaio · 2023-12-13T11:59:49Z

The bug is caused by the outliner, which is only enabled in opt-level=z I think (couldn't find an official source for this).

Can you check if it adding RUSTFLAGS=-Cllvm-args=--enable-machine-outliner=never also fixes it on opt-level=z?

jamesmunns · 2023-12-13T12:06:30Z

The bug is caused by the outliner, which is only enabled in opt-level=z I think (couldn't find an official source for this).

Can you check if it adding RUSTFLAGS=-Cllvm-args=--enable-machine-outliner=never also fixes it on opt-level=z?

It does not reproduce with these settings

peter9477 · 2023-12-13T16:47:52Z

Confirmed. With lto="fat" and opt-level="z" and that RUSTFLAG added (to my .cargo/config.toml) the issue does not appear.

nikic · 2023-12-21T11:49:46Z

Looks like the new PR for this issue has been merged: llvm/llvm-project#75527

jamesmunns · 2023-12-21T12:58:52Z

Hey @nikic, is there a way we can get one of the following two options done before rust 1.75? Otherwise we'll have a stable-to-stable regression for all cortex-m targets (thumbv*-none-eabi*):

Pull in this new LLVM tip
Make RUSTFLAGS=-Cllvm-args=--enable-machine-outliner=never the default temporarily as a hotfix

If anyone can point me to docs on how to:

update the rustc repo to use the new LLVM tip
build a version of the compiler with that LLVM tip

So I can test whether that fix addresses the issue we saw (I expect it will!), I can do that ASAP.

nikic · 2023-12-21T13:28:46Z

The LLVM 17 update is already on stable, so I don't think this would cause a new stable regression. Together with the fact that a trivial workaround exists, I am not willing to include this in the 1.75 release.

I do plan to backport this to nightly (and the new beta) though, together with a number of other fixes.

jamesmunns · 2023-12-21T13:39:44Z

@nikic did LLVM17 make it into 1.74? I thought it was merged after the branch point, but I am less familiar with how the release train works.

I don't think we observed this in stable releases, but the repro did use nightly features, so we may just not have gotten lucky in hitting it. I can try and use some BOOTSTRAP tricks to see if I can force the repro w/ nightly features on the stable 1.74.

If this does exist on 1.74 (and then 1.75), we'll need to put out a warning as the embedded-wg for everyone to use one of the workarounds, as this can be hard to detect and cause totally random crashes and unsoundness.

jamesmunns · 2023-12-21T13:41:34Z

Ah yes, LLVM17 landed in 1.73, actually: https://releases.rs/docs/1.73.0/

I'll see if I can reproduce with bootstrap on 1.73/1.74.

jamesmunns · 2023-12-21T13:55:53Z

Confirming, by using the BOOTSTRAP flag to allow nightly features:

Rust 1.72.0 stable does NOT reproduce
Rust 1.73.0 stable DOES reproduce
Rust 1.74.0 stable DOES reproduce.

jamesmunns · 2023-12-21T14:05:23Z

Not sure if it helps, but could we get a regression-from-stable-to-stable Performance or correctness regression from one stable version to another. label? As @nikic mentioned, the regression was actually 1.72 to 1.73, but also affects 1.74 and the soon-to-be 1.75.

Following up with embedded-wg for a potential advisory notice.

Dirbaio · 2023-12-21T14:26:53Z

@rustbot label regression-from-stable-to-stable

jamesmunns · 2023-12-28T15:58:15Z

Advisory now here: rust-embedded/cortex-m#503

benma · 2023-12-29T09:44:14Z

The LLVM 17 update is already on stable, so I don't think this would cause a new stable regression. Together with the fact that a trivial workaround exists, I am not willing to include this in the 1.75 release.

I do plan to backport this to nightly (and the new beta) though, together with a number of other fixes.

Imho this was not the right call - it is a serious bug and users could be shipping faulty firmwares if they don't see this issue in time. If a workaround exists, it should be bundled in the release until a proper fix lands. It would be safer and more efficient for all users. Furthermore, since one cannot be sure if one suffers from this bug if no problems are observed, it is hard to test if applying the workaround actually fixed anything - issues could appear only in production.

At the very least, the official Rust release notes should mention this bug.

Apply workaround for a potential miscompilation bug: See - rust-embedded/cortex-m#503 - rust-lang/rust#118867 We have not observed any abnormal behavior even though we fulfil all the criteria to be affected (opt-level='z', thumbv7em-none-eabi target, Rust toolchain 1.74.0 being >= 1.73.0), but we apply the workaround just in case. This increases the binary size of the `make firmware` (Multi) build by 11568 at the time of adding this workaround. This can be removed again once the issue above is fixed and we have updated to a Rust toolchain that contains the fix. This workaround is one of three suggested alternatives. The other two are: - Downgrade the Rust toolchain - dismissed as it is harder to do, as it involves re-building the Docker image - Switch from opt-level='z' to opt-level='s' - this increases the binary size by 74416 bytes for the Multi, which is much more than the workaround in this commit.

Apply workaround for a potential miscompilation bug: See - rust-embedded/cortex-m#503 - rust-lang/rust#118867 We have not observed any abnormal behavior even though we fulfil all the criteria to be affected (opt-level='z', thumbv7em-none-eabi target, Rust toolchain 1.74.0 being >= 1.73.0), but we apply the workaround just in case. This increases the binary size of the `make firmware` (Multi) build by 11568 bytes at the time of adding this workaround. This can be removed again once the issue above is fixed and we have updated to a Rust toolchain that contains the fix. This workaround is one of three suggested alternatives. The other two are: - Downgrade the Rust toolchain - dismissed as it is harder to do, as it involves re-building the Docker image - Switch from opt-level='z' to opt-level='s' - this increases the binary size by 74416 bytes for the Multi, which is much more than the workaround in this commit.

apiraino · 2023-12-30T09:19:10Z

WG-prioritization assigning priority (Zulip discussion).

@rustbot label -I-prioritize +P-high

Sympatron · 2024-01-11T15:17:53Z

@nikic What is the reason you did not want to address this for 1.75? This bug can cause unsound behavior for all programs targeting cortex-m without any unsafe code at any time on stable.
This is extremely serious in my opinion and would have been an easy fix.
You could just disable the outliner by passing --enable-machine-outliner=never to LLVM for these targets until this is resolved by updating LLVM which will hopefully happen in 1.76.

jamesmunns · 2024-01-11T15:26:10Z

@Sympatron I'm not a moderator, but I might suggest that this issue is not the right place to voice procedural grieviences.

Nikic has made their reasoning clear above (even if you or I disagree with it), and I'd really like to avoid a pile-on here, particularly as nikic is the one doing the legwork of getting this fixed by pulling in the LLVM patches.

As thumb targets are tier two, I don't believe the release team has any responsibility to "stop the world" to correct any known soundness issues, even if they do so on a best effort basis.

Sympatron · 2024-01-11T15:41:40Z

What I meant to say is that this issue may warrant a 1.75.1 release, as I believe it is critical enough, despite it being a tier 2 target.
If not, there should at least be a public warning, such as a blog post or something similar.

Emilgardis · 2024-01-11T18:31:49Z

#119802 is in the process of being backported

Update LLVM submodule Fixes rust-lang/rust#118867. Fixes rust-lang/rust#119160. Fixes rust-lang/rust#119179. r? `@cuviper`

jamesmunns added the C-bug Category: This is a bug. label Dec 12, 2023

rustbot added the needs-triage This issue may need triage. Remove it if it has been sufficiently triaged. label Dec 12, 2023

jamesmunns mentioned this issue Dec 12, 2023

[ARM/Thumb] Mark TAILJMP*/TCRETURN* as using LR. llvm/llvm-project#73492

Closed

jamesmunns changed the title ~~LTO performing faulty outlining for cortex-m targets in LLVM17~~ LLVM17 performing faulty outlining for cortex-m targets causes program crash Dec 12, 2023

rustbot added the I-prioritize Issue: Indicates that prioritization has been requested for this issue. label Dec 12, 2023

nikic added the llvm-fixed-upstream Issue expected to be fixed by the next major LLVM upgrade, or backported fixes label Dec 21, 2023

rustbot added the regression-from-stable-to-stable Performance or correctness regression from one stable version to another. label Dec 21, 2023

This comment was marked as outdated.

Sign in to view

jamesmunns mentioned this issue Dec 21, 2023

SIGSEGV in rustc upon compiling with --release #119179

Closed

benma mentioned this issue Dec 29, 2023

rust: add workaround for potential compiler bug BitBoxSwiss/bitbox02-firmware#1158

Merged

rustbot added P-high High priority and removed I-prioritize Issue: Indicates that prioritization has been requested for this issue. labels Dec 30, 2023

nikic self-assigned this Jan 9, 2024

This was referenced Jan 9, 2024

Cherry-pick a number of fixes rust-lang/llvm-project#163

Merged

Update LLVM submodule #119802

Merged

bors closed this as completed in 062e7c6 Jan 11, 2024

github-actions bot pushed a commit to rust-lang/miri that referenced this issue Jan 13, 2024

Auto merge of #119802 - nikic:update-llvm-19, r=cuviper

0553319

Update LLVM submodule Fixes rust-lang/rust#118867. Fixes rust-lang/rust#119160. Fixes rust-lang/rust#119179. r? `@cuviper`

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

LLVM17 performing faulty outlining for cortex-m targets causes program crash #118867

LLVM17 performing faulty outlining for cortex-m targets causes program crash #118867

jamesmunns commented Dec 12, 2023 •

edited

Loading

jamesmunns commented Dec 12, 2023

jamesmunns commented Dec 12, 2023

peter9477 commented Dec 12, 2023

jamesmunns commented Dec 12, 2023

saethlin commented Dec 12, 2023

jamesmunns commented Dec 13, 2023 •

edited

Loading

jamesmunns commented Dec 13, 2023

peter9477 commented Dec 13, 2023 •

edited

Loading

Dirbaio commented Dec 13, 2023

jamesmunns commented Dec 13, 2023

peter9477 commented Dec 13, 2023

nikic commented Dec 21, 2023

jamesmunns commented Dec 21, 2023 •

edited

Loading

nikic commented Dec 21, 2023 •

edited

Loading

jamesmunns commented Dec 21, 2023

jamesmunns commented Dec 21, 2023

jamesmunns commented Dec 21, 2023

jamesmunns commented Dec 21, 2023

Dirbaio commented Dec 21, 2023

This comment was marked as outdated.

jamesmunns commented Dec 28, 2023

benma commented Dec 29, 2023

apiraino commented Dec 30, 2023

Sympatron commented Jan 11, 2024

jamesmunns commented Jan 11, 2024

Sympatron commented Jan 11, 2024

Emilgardis commented Jan 11, 2024

LLVM17 performing faulty outlining for cortex-m targets causes program crash #118867

LLVM17 performing faulty outlining for cortex-m targets causes program crash #118867

Comments

jamesmunns commented Dec 12, 2023 • edited Loading

jamesmunns commented Dec 12, 2023

jamesmunns commented Dec 12, 2023

peter9477 commented Dec 12, 2023

jamesmunns commented Dec 12, 2023

saethlin commented Dec 12, 2023

jamesmunns commented Dec 13, 2023 • edited Loading

jamesmunns commented Dec 13, 2023

peter9477 commented Dec 13, 2023 • edited Loading

Dirbaio commented Dec 13, 2023

jamesmunns commented Dec 13, 2023

peter9477 commented Dec 13, 2023

nikic commented Dec 21, 2023

jamesmunns commented Dec 21, 2023 • edited Loading

nikic commented Dec 21, 2023 • edited Loading

jamesmunns commented Dec 21, 2023

jamesmunns commented Dec 21, 2023

jamesmunns commented Dec 21, 2023

jamesmunns commented Dec 21, 2023

Dirbaio commented Dec 21, 2023

This comment was marked as outdated.

jamesmunns commented Dec 28, 2023

benma commented Dec 29, 2023

apiraino commented Dec 30, 2023

Sympatron commented Jan 11, 2024

jamesmunns commented Jan 11, 2024

Sympatron commented Jan 11, 2024

Emilgardis commented Jan 11, 2024

jamesmunns commented Dec 12, 2023 •

edited

Loading

jamesmunns commented Dec 13, 2023 •

edited

Loading

peter9477 commented Dec 13, 2023 •

edited

Loading

jamesmunns commented Dec 21, 2023 •

edited

Loading

nikic commented Dec 21, 2023 •

edited

Loading