Skip to content
This repository has been archived by the owner on Jan 24, 2022. It is now read-only.

Infinite backtrace, odd backtrace behavior and "corrupted stack?" in GDB #139

Closed
japaric opened this issue Oct 26, 2018 · 25 comments
Closed
Labels

Comments

@japaric
Copy link
Member

japaric commented Oct 26, 2018

STR

$ cargo generate --git ~/rust-embedded/cortex-m-quickstart --name app && cd app

$ cargo add panic-abort

$ # modify examples/panic.rs to use panic-abort

$ # change memory.x to match the memory layout of the STM32F3DISCOVERY

$ # uncomment thumbv7em-none-eabi target and gdb runner in .cargo/config 

$ cargo run --example panic
(gdb) backtrace  # "corrupt stack?"
#0  DefaultPreInit ()
    at /home/japaric/.cargo/registry/src/github.aaakk.us.kg-1ecc6299db9ec823/cortex-m-rt-0.6.5/src/lib.rs:556
#1  0x08000412 in Reset ()
    at /home/japaric/.cargo/registry/src/github.aaakk.us.kg-1ecc6299db9ec823/cortex-m-rt-0.6.5/src/lib.rs:496
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

(gdb) continue
Breakpoint 4, main () at examples/panic.rs:29
29          panic!("Oops")

(gdb) backtrace  # where did Reset go?
#0  main () at examples/panic.rs:29

(gdb) continue
Breakpoint 3, rust_begin_unwind (_info=0x10001fb4) at (..)
31          unsafe { intrinsics::abort() }

(gdb) backtrace  # infinite backtrace
#0  rust_begin_unwind (_info=0x10001fb4) at (..)
#1  0x080006b0 in core::panicking::panic_fmt () at libcore/panicking.rs:77
#2  0x080006b0 in core::panicking::panic_fmt () at libcore/panicking.rs:77
#3  0x080006b0 in core::panicking::panic_fmt () at libcore/panicking.rs:77
(..)

Metadata

$ rustc -V
rustc 1.31.0-nightly (f99911a4a 2018-10-23)

$ grep cortex-m-rt Cargo.lock
 "cortex-m-rt 0.6.5 (registry+https://github.com/rust-lang/crates.io-index)"

$ arm-none-eabi-gdb --version
GNU gdb (GDB) 8.2

Has anyone experienced these issues? I've seen them with previous releases of cortex-m-rt so I don't think they are related to the recent changes to HardFault. Also, in the above logs all backtrace invocations are before reaching HardFault.

@adamgreen
Copy link
Contributor

I did see similar issues when I was running through "Chapter 6 - Hello, world!" of the Discovery book. I took a quick look at the time and seem to remember that the panic related functions were modifying important registers like LR without first saving them on the stack. I didn't really dig into it since I am a Rust noob and I figured that was the expected behavior for Rust's panic handling code. If you want, I can investigate this. I don't think my HardFault change has anything to do with this but I can make sure that is the case and I think they are probably a similar type of issue.

@japaric
Copy link
Member Author

japaric commented Oct 26, 2018

You are correct. core::panicking::panic modifies LR without first pushing it onto the stack.

$ cd $(rustc --print sysroot)

$ arm-none-eabi-objdump -Cd lib/rustlib/thumbv7m-none-eabi/lib/libcore-*.rlib
(..)
00000000 <core::panicking::panic>:
   0:   b08c            sub     sp, #48 ; 0x30
   2:   e890 1006       ldmia.w r0, {r1, r2, ip}
   6:   e9d0 3e03       ldrd    r3, lr, [r0, #12]
   a:   6940            ldr     r0, [r0, #20]
   c:   e9cd 1206       strd    r1, r2, [sp, #24]
  10:   a906            add     r1, sp, #24
  12:   f240 0200       movw    r2, #0
  16:   9100            str     r1, [sp, #0]
  18:   2101            movs    r1, #1
  1a:   f2c0 0200       movt    r2, #0
  1e:   9101            str     r1, [sp, #4]
  20:   2100            movs    r1, #0
  22:   e9cd 1102       strd    r1, r1, [sp, #8]
  26:   e9cd 2104       strd    r2, r1, [sp, #16]
  2a:   a908            add     r1, sp, #32
  2c:   e9cd c308       strd    ip, r3, [sp, #32]
  30:   e9cd e00a       strd    lr, r0, [sp, #40]       ; 0x28
  34:   4668            mov     r0, sp
  36:   f7ff fffe       bl      0 <core::panicking::panic>
  3a:   defe            udf     #254    ; 0xfe

It seems that all divergent functions (fn(..) -> !) are given the noreturn attribute in LLVM-IR and this gives LLVM permission to thrash the LR register when optimizing the code. See example below:

#![feature(asm)]
#![no_main]
#![no_std]

extern crate panic_abort;

use cortex_m_rt::entry;

#[entry]
fn main() -> ! {
    foo();
    bar();
    baz()
}

#[inline(never)]
fn foo() {
    unsafe {
        asm!("" :: "r"(0) "r"(1) "r"(2) "r"(3) "r"(4) "r"(5) :: "volatile");
    }
}

#[inline(never)]
fn bar() {
    unsafe {
        asm!("" :: "r"(0) "r"(1) "r"(2) "r"(3) "r"(4) :: "volatile");
    }
}

#[inline(never)]
fn baz() -> ! {
    unsafe {
        asm!("" :: "r"(0) "r"(1) "r"(2) "r"(3) "r"(4) "r"(5) :: "volatile");
    }

    quux()
}

#[inline(never)]
fn quux() -> ! {
    unsafe {
        asm!("" :: "r"(0) "r"(1) "r"(2) "r"(3) "r"(4) "r"(5) "r"(6) :: "volatile");
    }

    loop {}
}
$ cargo rustc --example asm --release -- --emit=llvm-ir

$ cat $(find -name '*.ll')
; asm::foo
; Function Attrs: noinline nounwind
define internal fastcc void @_ZN3asm3foo17h42aa59da8b4484e3E() unnamed_addr #0 {
start:
  tail call void asm sideeffect "", "r,r,r,r,r,r"(i32 0, i32 1, i32 2, i32 3, i32 4, i32 5) #3, !srcloc !513
  ret void
}

; asm::bar
; Function Attrs: noinline nounwind
define internal fastcc void @_ZN3asm3bar17h6e8a9117df8b0663E() unnamed_addr #0 {
start:
  tail call void asm sideeffect "", "r,r,r,r,r"(i32 0, i32 1, i32 2, i32 3, i32 4) #3, !srcloc !514
  ret void
}

; asm::baz
; Function Attrs: noinline noreturn nounwind
define internal fastcc void @_ZN3asm3baz17h32a1e9952bcf27a1E() unnamed_addr #1 {
start:
  tail call void asm sideeffect "", "r,r,r,r,r,r"(i32 0, i32 1, i32 2, i32 3, i32 4, i32 5) #3, !srcloc !515
; call asm::quux
  tail call fastcc void @_ZN3asm4quux17h6aa0bed8e5be8684E()
  unreachable
}

; asm::quux
; Function Attrs: noinline noreturn nounwind
define internal fastcc void @_ZN3asm4quux17h6aa0bed8e5be8684E() unnamed_addr #1 {
start:
  tail call void asm sideeffect "", "r,r,r,r,r,r,r"(i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6) #3, !srcloc !516
  br label %bb1

bb1:                                              ; preds = %bb1, %start
  br label %bb1
}

$ cargo objdump --example asm --release -- -d -no-show-raw-insn -print-imm-hex
Disassembly of section .text:
asm::foo::hab8803af3cab0482:
 8000400:       push    {r7, lr}
 8000402:       mov.w   r12, #0x0
 8000406:       mov.w   lr, #0x1
 800040a:       movs    r2, #0x2
 800040c:       movs    r3, #0x3
 800040e:       movs    r0, #0x4
 8000410:       movs    r1, #0x5
 8000412:       pop     {r7, pc}

asm::bar::h0d958609aa53168d:
 8000414:       mov.w   r12, #0x0
 8000418:       movs    r1, #0x1
 800041a:       movs    r2, #0x2
 800041c:       movs    r3, #0x3
 800041e:       movs    r0, #0x4
 8000420:       bx      lr

asm::baz::h3d14c7f076829055:
 8000422:       mov.w   r12, #0x0
 8000426:       mov.w   lr, #0x1 ; <-
 800042a:       movs    r2, #0x2
 800042c:       movs    r3, #0x3
 800042e:       movs    r0, #0x4
 8000430:       movs    r1, #0x5
 8000432:       bl      #0x2
 8000436:       trap

asm::quux::h485b4bd2341abdb3:
 8000438:       mov.w   r12, #0x0
 800043c:       mov.w   lr, #0x1 ; <-
 8000440:       movs    r2, #0x2
 8000442:       movs    r3, #0x3
 8000444:       movs    r0, #0x4
 8000446:       movs    r1, #0x5
 8000448:       movs    r4, #0x6
 800044a:       b       #-0x4 <asm::quux::h485b4bd2341abdb3+0x12>

I can't think of any way to solve this issue from our side given that libcore comes pre-compiled.

@thejpster
Copy link
Contributor

Jumping in this thread. I saw this in an Embedded Rust workshop I ran at work the other day.

@adamgreen
Copy link
Contributor

It would be nice if the LLVM backend for ARM targets could be told to not trash LR when the noreturn attribute is used so that stack backtraces weren't broken in the debugger. What's more important, the little bit of speed/space savings from not spilling LR to the stack before modification or allowing a developer to actually be able to debug their code when such code is in the middle of their stack backtrace?

@japaric
Copy link
Member Author

japaric commented Dec 10, 2018

AFAICT the difference between noreturn and no-noreturn is not small. The former will not stack any registers until all registers (~10) are used up; the latter will stack registers as soon as the scratch registers (3-5) are used up. I think we should not disable this optimization (in LLVM or in cortex-m-rt), or make it opt-in (makes too big of a diff in perf to have people "forget to enable it").

It would be nice if the LLVM backend for ARM targets could be told to not trash LR when the noreturn
attribute is used

If LLVM has this feature then I think it should be opt-out (i.e. enabled by default in the compiler but can be disabled via a compiler flag) for the Cortex-M targets. I haven't seen any switch like that though; I have only seen one for the frame pointer (-fno-omit-frame-pointer).

@adamgreen
Copy link
Contributor

I think we should not disable this optimization (in LLVM or in cortex-m-rt), or make it opt-in (makes too big of a diff in perf to have people "forget to enable it").

I wonder how big the performance hit really is as these divergent functions aren't going to be called in a hot loop since they never return to be called more than once.

To corrupt the registers so bad that a developer can't get a stack backtrace in the debugger is pretty bad behavior. It is made even worse by the fact that a lot of times these noreturn functions find themselves in the middle of a call stack that a developer would really be interested in looking at (HardFault, panics, etc).

@korken89
Copy link
Contributor

korken89 commented Jan 7, 2019

I will chime in and say that I have seen the same issue, which unfortunately makes postmortem debugging of embedded systems currently impossible and unteachable.

@korken89
Copy link
Contributor

korken89 commented Jan 8, 2019

Update from the WG meeting 2019-01-08

  • The main issue with this is that the lr (link register) is not pushed to the stack, and used as a scratch register for divergent functions
  • There is a question if this is an intentional LLVM optimization
  • We should collect if panic debugging worked, and when it stopped working
  • When we have data, we will make an upstream issue to get more input on how to solve it

@eddyp
Copy link

eddyp commented Jan 10, 2019

I was wondering that in the case of a secure OS which has the ability to load and execute untrusted payloads it would make sense not to corrupt these so the exception handlers can decode the data automatically, log it and kick out or restart the offending payload (depending on the OS policy a retry count could be added).

Also note that not trashing LR is essential since it might be possible that a hard fault occurs in the SVC handler because of an offending payload, so the OS should be able to go down the stack and identify the caller of the SVC.

This would make for a very good use case, supplemental to the debugging support.

Granted, probably for such a use case there is a need for stable inline asm (sorry if this is not the right name, I am unfamiliar with the feature names), and maybe even some support for custom linker symbols so the decoding can happen at runtime, automatically.

As for the resolution, at Rust code level, I agree, a per-function opt-in attribute for divergent functions would be a very elegant solution. An alternative might be that all exception handlers are, by default, not trashing LR, even if divergent, but that would probably mean some "special knowledge" in the compiler which will probably not be accepted in llvm.

@adamgreig
Copy link
Member

Noting that this is presumably the same problem as #158

@korken89
Copy link
Contributor

Yeah, with the HardfaultHandler is a divergent function it will not protect the link register, and any function call (or other use of LR) within the handler will break backtrace..

@jamesmunns
Copy link
Member

jamesmunns commented Jan 25, 2019

Quick note if you are using a tool like gdbgui which automatically tries to query the stack frame on a crash (and ends up stuck recursively trying to query the corrupted stack frame), gdb does support limiting the max backtrace depth, which will keep the debugger from infinitely querying. Something like the following worked for me:

set backtrace limit 32

Edit: This can be set at command line, or in a batch file like .gdbinit or -x debug.gdb.

@eddyp
Copy link

eddyp commented Jan 25, 2019

set backtrace limit 32

Edit: This can be set at command line, or in a batch file like .gdbinit or -x debug.gdb

Would it make sense to have such a setting enabled by default in any bare metal project?

@ssendev
Copy link

ssendev commented Jul 22, 2019

Isn't it possible to do some linker tricks to avoid the noreturn?

fn panic(info: &PanicInfo) -> ! {
  if trick_the_optimizer() {
    unsafe { asm!{"ret"} }
  }
  loop {}
}

And then replace trick_the_optimizer during linking and maybe disable LTO if it is too good. Or maybe it is enough to mark it as extern "C" to inhibit optimizations.

@tpwrules
Copy link

tpwrules commented Aug 5, 2019

Isn't it possible to do some linker tricks to avoid the noreturn?

I tried various methods, including the one you suggested. Unfortunately, this does not seem to help because there are a couple other noreturn functions between fn panic(info: &PanicInfo) -> ! and the panic site, so LR ends up corrupted anyway.

I have found that one can break on core::panicking::panic instead of rust_begin_unwind and still have a consistent stack. You don't get the panic message because it hasn't been constructed yet, but you do get a complete and non-corrupted backtrace, which is much more valuable in my experience. Perhaps the .gdbinit should be changed so that this is the default.

@jonas-schievink
Copy link
Contributor

Is there an upstream issue for this?

@jamesmunns
Copy link
Member

jamesmunns commented Nov 22, 2019

There have been discussions about this with one of the teams (at previous all hands), not sure if there is an issue

@qwerty19106
Copy link

There is a case in which break on core::panicking::panic does not help.
My code controls the PWM output for a power inverter. I run the code with only one breakpoint, as in case of an accidental stop, the device may be destructed.

The rust_begin_unwind contains code to correct stoping of device. After this code I set breakpoint and want to watch registers and memory. But now stack already corrupted!

Removing noreturn from a -> ! function seems to be the right solution for thumbv7m-none-eabi (also thumbv6m-none-eabi and et.al.)

@jonas-schievink
Copy link
Contributor

Submitted an upstream bug report: rust-lang/rust#69231

@jonas-schievink
Copy link
Contributor

Has nobody tried building with -Cforce-frame-pointers=true yet? Is there a minimal example showing the issue?

@MabezDev
Copy link
Member

Has nobody tried building with -Cforce-frame-pointers=true yet? Is there a minimal example showing the issue?

FYI it's -Cforce-frame-pointers=yes/no, but unfortunately it doesn't seem to make a difference to the backtrace.

I wasn't aware of a minimal example, so I put one together based on the quickstart that you can use. https://github.com/MabezDev/cortex-m-quickstart

@jonas-schievink
Copy link
Contributor

Great, thanks!

@MabezDev
Copy link
Member

Reposting here for anyone not following the upstream issue: rust-lang/rust#69231 (comment)

Forcing framepointers on fixes the backtrace issue, unfortunately the precompiled core doesn't force framepointers hence you will have to build your own with xargo or -Z build-std at the moment.

Dylan-DPC-zz pushed a commit to Dylan-DPC-zz/rust that referenced this issue Feb 19, 2020
Don't eliminate frame pointers on thumb targets

This should hopefully fix issues rust-lang#69231 and rust-embedded/cortex-m-rt#139.

~~I couldn't test this locally as the rustc I produced does not create binaries (no idea why).~~ Resolved.
Dylan-DPC-zz pushed a commit to Dylan-DPC-zz/rust that referenced this issue Feb 19, 2020
Don't eliminate frame pointers on thumb targets

This should hopefully fix issues rust-lang#69231 and rust-embedded/cortex-m-rt#139.

~~I couldn't test this locally as the rustc I produced does not create binaries (no idea why).~~ Resolved.
@jonas-schievink
Copy link
Contributor

This should be fixed on the current nightly

@jonas-schievink
Copy link
Contributor

Closing as fixed!

Note that according to rust-lang/rust#69231 (comment) there might be a cheaper way of achieving this, so if someone wants to go after that, feel free! (I won't be spending more time on this as we don't have any easy to use tooling for assessing changes like this)

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

No branches or pull requests