LLVM17 performing faulty outlining for cortex-m targets causes program crash #118867
Comments
Also pushed jamesmunns/none-fault@cc96c01 which is the exact version used to produce the attached elf file above. |
Tagging in @fhahn, who suggested that llvm/llvm-project#73553 may address this, if we can update the LLVM17 tip. I've not done a "rebuild rustc with different LLVM submodule" before, but I'll ask around the embedded-rust folks to see if anyone can give me a hand or try it out. |
FYI, with my original code base where this first showed up, I can confirm that in fact this happens with lto = "fat" or lto = "thin", but not with lto = "off". |
Yep, updated the title to point to the updated assumption that this is more likely due to LR clobbering of the outliner, rather than LTO. In my attempts, lto="thin" didn't repro, but that could have been chance more than science. |
What happens with |
I think @peter9477 might have mentioned the original code having problems in opt='s', though this repro is pretty sensitive to sometimes working with relatively minor changes. |
If someone can point me to docs that show how to build a new toolchain, including switching the LLVM version to the branch from llvm/llvm-project#73553 (hoping that it Just Builds), I can likely take a look at it this weekend. Otherwise, if anyone knows how to do that quickly, and can do that + do a |
As James mentioned, I believe I had reproduced it yesterday with both "s" and "z", and definitely not with 1, 2, or 3, but it was a very long session and I won't swear to it. The most obvious symptom (a hard fault) can come and go based on link order (it seems), and at one stage during minimization I had it causing a borrow_ref_mut() panic inside a critical section (which should be impossible). It's entirely possible that this is what I remember failing with "s". At the time I assumed they were both manifestations of the same issue (but someone later suggested an unrelated possible cause for that issue). I just tried with "s" on my full code base and it does not fail. |
The bug is caused by the outliner, which is only enabled in Can you check if it adding |
It does not reproduce with these settings |
Confirmed. With lto="fat" and opt-level="z" and that RUSTFLAG added (to my .cargo/config.toml) the issue does not appear. |
Looks like the new PR for this issue has been merged: llvm/llvm-project#75527 |
Hey @nikic, is there a way we can get one of the following two options done before rust 1.75? Otherwise we'll have a stable-to-stable regression for all cortex-m targets (
If anyone can point me to docs on how to:
So I can test whether that fix addresses the issue we saw (I expect it will!), I can do that ASAP. |
The LLVM 17 update is already on stable, so I don't think this would cause a new stable regression. Together with the fact that a trivial workaround exists, I am not willing to include this in the 1.75 release. I do plan to backport this to nightly (and the new beta) though, together with a number of other fixes. |
@nikic did LLVM17 make it into 1.74? I thought it was merged after the branch point, but I am less familiar with how the release train works. I don't think we observed this in stable releases, but the repro did use nightly features, so we may just not have gotten lucky in hitting it. I can try and use some BOOTSTRAP tricks to see if I can force the repro w/ nightly features on the stable 1.74. If this does exist on 1.74 (and then 1.75), we'll need to put out a warning as the embedded-wg for everyone to use one of the workarounds, as this can be hard to detect and cause totally random crashes and unsoundness. |
Ah yes, LLVM17 landed in 1.73, actually: https://releases.rs/docs/1.73.0/ I'll see if I can reproduce with bootstrap on 1.73/1.74. |
Confirming, by using the BOOTSTRAP flag to allow nightly features:
|
Not sure if it helps, but could we get a regression-from-stable-to-stable label?
Following up with embedded-wg for a potential advisory notice. |
@rustbot label regression-from-stable-to-stable |
Advisory now here: rust-embedded/cortex-m#503 |
Imho this was not the right call - it is a serious bug and users could be shipping faulty firmwares if they don't see this issue in time. If a workaround exists, it should be bundled in the release until a proper fix lands. It would be safer and more efficient for all users. Furthermore, since one cannot be sure if one suffers from this bug if no problems are observed, it is hard to test if applying the workaround actually fixed anything - issues could appear only in production. At the very least, the official Rust release notes should mention this bug. |
Apply workaround for a potential miscompilation bug

See:
- rust-embedded/cortex-m#503
- rust-lang/rust#118867

We have not observed any abnormal behavior even though we fulfil all the criteria to be affected (opt-level='z', thumbv7em-none-eabi target, Rust toolchain 1.74.0 being >= 1.73.0), but we apply the workaround just in case.

This increases the binary size of the `make firmware` (Multi) build by 11568 bytes at the time of adding this workaround.

This can be removed again once the issue above is fixed and we have updated to a Rust toolchain that contains the fix.

This workaround is one of three suggested alternatives. The other two are:
- Downgrade the Rust toolchain - dismissed as it is harder to do, as it involves re-building the Docker image
- Switch from opt-level='z' to opt-level='s' - this increases the binary size by 74416 bytes for the Multi, which is much more than the workaround in this commit.
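The criteria quoted in that commit message (toolchain at or above 1.73.0, a Cortex-M thumb target, opt-level 'z') can be written down as a small check. This is only a sketch: `Version`, `parse_version` and `matches_affected_criteria` are hypothetical names, not part of any toolchain or crate, and the target match (`thumbv*-none-eabi*`) is an assumption based on the thread's "all cortex-m targets". No upper version bound is encoded because the thread does not state which release carries the fix.

```rust
/// Hypothetical helper type: a plain `major.minor.patch` toolchain version.
/// Deriving `Ord` compares fields in declaration order, i.e. major, then minor, then patch.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct Version {
    major: u32,
    minor: u32,
    patch: u32,
}

/// Parses strings such as "1.74.0"; returns `None` for anything else (e.g. "nightly").
fn parse_version(s: &str) -> Option<Version> {
    let mut parts = s.trim().split('.');
    let major = parts.next()?.parse().ok()?;
    let minor = parts.next()?.parse().ok()?;
    let patch = parts.next()?.parse().ok()?;
    if parts.next().is_some() {
        return None;
    }
    Some(Version { major, minor, patch })
}

/// Returns `Some(true)` when a build matches the criteria listed in the commit
/// message, `Some(false)` when it does not, and `None` when the version string
/// cannot be parsed. A match means "may be affected", not "is miscompiled".
fn matches_affected_criteria(toolchain: &str, target: &str, opt_level: &str) -> Option<bool> {
    let version = parse_version(toolchain)?;
    // LLVM 17 landed in Rust 1.73, per the thread.
    let first_affected = Version { major: 1, minor: 73, patch: 0 };
    // Assumption: Cortex-M targets are the `thumbv*-none-eabi*` triples.
    let is_cortex_m = target.starts_with("thumbv") && target.contains("-none-eabi");
    Some(version >= first_affected && is_cortex_m && opt_level == "z")
}

fn main() {
    // The configuration described in the commit message above.
    let hit = matches_affected_criteria("1.74.0", "thumbv7em-none-eabi", "z");
    println!("matches affected criteria: {:?}", hit);
}
```

As the commit message notes, matching the criteria does not mean a given binary misbehaves, so a check like this can only flag builds worth applying a workaround to.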
WG-prioritization assigning priority (Zulip discussion). @rustbot label -I-prioritize +P-high |
@nikic What is the reason you did not want to address this for 1.75? This bug can cause unsound behavior for all programs targeting cortex-m without any unsafe code at any time on stable. |
@Sympatron I'm not a moderator, but I might suggest that this issue is not the right place to voice procedural grievances. Nikic has made their reasoning clear above (even if you or I disagree with it), and I'd really like to avoid a pile-on here, particularly as nikic is the one doing the legwork of getting this fixed by pulling in the LLVM patches. As thumb targets are tier two, I don't believe the release team has any responsibility to "stop the world" to correct any known soundness issues, even if they do so on a best effort basis. |
What I meant to say is that this issue may warrant a 1.75.1 release, as I believe it is critical enough, despite it being a tier 2 target. |
#119802 is in the process of being backported |
Update LLVM submodule Fixes rust-lang/rust#118867. Fixes rust-lang/rust#119160. Fixes rust-lang/rust#119179. r? `@cuviper`
I tried this code:
https://github.com/peter9477/none-fault
(additional repro notes in the README there). I have verified this causes the code to crash as it jumps into a random RAM location unexpectedly.
This code does a fairly benign call to what boils down to:
However, things get sorta bad.
The disassembled version by ghidra looks like this:
As far as I can tell for THIS reproduction, it:
- Happens on nightly-2023-08-09 and all that I've tried later than this
- Does not happen on nightly-2023-08-08 and before
- ONLY happens with lto = "fat", thinlto and no lto do not reproduce
- ONLY happens with -Oz (edit: the repro uses -Oz for debug builds, switching to -O3 in release does not repro)
I've attached my specific elf file, so you can look at the same memory locations referenced in my issue: none-fault.elf.zip
This does seem temperamental, and tweaking unrelated pieces of the repro code causes it to disappear.
CC @peter9477 @Dirbaio