Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Debuginfo-based panic locations #2154

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

main--
Copy link

@main-- main-- commented Sep 15, 2017

Rendered


For a quick mockup of the basic idea, have a look at this.

kennytm added a commit to kennytm/rfcs that referenced this pull request Sep 20, 2017
@retep998
Copy link
Member

I am very strongly in favor of removing those panic messages from the binary because they waste space, cannot be stripped, and it is just a poor man's version of actual debuginfo.

# Guide-level explanation
[guide-level-explanation]: #guide-level-explanation

* The Rust compiler simply continues to generate platform-compatible debug information. It gains a new switch that causes it to generate *external* debug info. What this means is that the binary itself is stripped - all debug info is written to a separate **symbol file** (in a platform-specific format).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does this switch do on platforms that don't support external debug info?
What does this switch do on platforms where the debug info is already external anyway?
Should there be a negated version of this switch to force debug info to be internal?
What would the negated switch do on platforms that don't support internal debug info?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As specified in the referencel-level explanation, this is not a flag but an additional value for the existing debuginfo switch. The idea is that fine-grained control over what kind of debuginfo you want doesn't really make sense when it's external: Just take it or leave it, size shouldn't be a concern. I'm not really happy with this design, so bikeshed to your heart's content.

AFAIK all supported platforms are either DWARF or PDB. As DWARF can do both but PDB is always external, this basically means that debuginfo levels 2 and 3 do the same thing on windows-msvc.

[guide-level-explanation]: #guide-level-explanation

* The Rust compiler simply continues to generate platform-compatible debug information. It gains a new switch that causes it to generate *external* debug info. What this means is that the binary itself is stripped - all debug info is written to a separate **symbol file** (in a platform-specific format).
* Cargo's release profile enables external debug info.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is in fact one of the most common roadblocks people have on windows-msvc. They have a release build and they get a panic backtrace which has symbols from std but not their own crate, because the linker always emits .pdb debug info, but rustc itself is not putting debug info in their crate. The solution is to add [profile.release] debug = true to their Cargo.toml and having this be the default would remove a major source of friction.

* Panic messages now always include a backtrace (instead of the old location string).
* This means that the root cause of panics from `unwrap()` and similar interfaces are now much clearer.
* It's simple and obvious as most modern languages print backtraces for unhandled errors.
* The backtrace is generated from debug information. Without debug symbols, memory addresses are printed instead. The standard `addr2line` utility can be used to manually obtain location information once debug info is available.
Copy link
Member

@retep998 retep998 Sep 20, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What would the equivalent utility be for windows-msvc? This is already an issue in Rust today, so it's not like the RFC is causing a need for such a utility, but it sure would be nice to have a solution at some point.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The cross-platform equivalent of addr2line appears to be llvm-symbolizer (perhaps the RFC should refer to that instead?). It gained windows support in 2015. Its output is very similar to addr2line.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do addr2line and llvm-symbolizer work on position independent executables? I'm not sure how it would be able to based off of just a the binary and an address but I could be missing something.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, perhaps I should have specified this: Naturally, these symbolization tools always work with addresses into the binary, not runtime memory addresses. You simply query the base/load address of each loaded object and subtract (my proof of concept assumes one single statically-linked binaries, for anything more complex you would of course have multiple baseptrs and figure out which one to subtract) that from your addresses before printing them.

You're right, without this step randomized addresses would be meaningless.

@est31
Copy link
Member

est31 commented Sep 20, 2017

Can we actually put column info into debuginfo yet? There seem to be some bugs at least on some platforms connected to columns. I wouldn't want to regress on this. Edit: see also rust-lang/rust#42921

@kennytm
Copy link
Member

kennytm commented Sep 20, 2017

@est31 the DWARF format itself support columns. Not sure about PDB.

@retep998
Copy link
Member

Last I checked, I did not see an API in dbghelp to get the column number. That said, the only time lack of column numbers would actually be a regression is when you have multiple panics on the same line. Not multiple function calls which could panic, but multiple panics directly inline.

@main--
Copy link
Author

main-- commented Sep 20, 2017

The PDB format does support column numbers. However, in rust-lang/rust#42921 (comment) @est31 details why it's disabled right now:

I've dug a little and it seems that Visual Studio expects column info to be in ranges, and otherwise is buggy... That's why it was apparently disabled for clang+msvc: https://reviews.llvm.org/rL279765 . LLVM seems to have no way of sending column ranges, so we can't improve on clang here unless we want to patch LLVM.

LLVM uses the newer DIA debuginfo API which does provide column information. LLVM even forwards this in their own PDB API even though it's not used anywhere.

Patching LLVM to write column end info as well can't be that hard (although the poking I just did suggests that it may be a lot harder than it seems, at least) and I don't see why they would reject that patch.

Updating the panic runtime to use DIA instead of dbghelp may or may not be worth the effort, but all I'm trying to say is that getting column numbers is certainly possible.

@est31
Copy link
Member

est31 commented Sep 20, 2017

the only time lack of column numbers would actually be a regression is when you have multiple panics on the same line

That's when the regression would matter, right. Having multiple panics on the same line is more often than you think tho. E.g. take code like foo[i] = bar[k] + bar[l] with index out of bounds or a = b + c + d + e with overflow.

@retep998
Copy link
Member

@est31 Multiple panics on the same line in that case would only apply when the Add or Index operator is built in. It wouldn't apply to Vec indexing, nor would it apply to addition of matrices or quaternions.

@Zoxc
Copy link

Zoxc commented Sep 21, 2017

I'm in favor of building debug information by default and always generating stack traces instead of requiring setting RUST_BACKTRACE.

I also think we should support multiple options of how much panic information to embed in binaries:

  • No panic messages or location information. This is useful if code size or performance is critical. Panics would reduce to an ud2 instruction on x86. This has a slight problem with libstd conflating panic messages and panic payloads.
  • Just panic messages (suggested by this RFC). This relies on having good backtrace support to know where panics occur. This seems like a reasonable option for releases on platforms with such backtrace support.
  • Panic messages and partial backtraces for blamed functions (this is RFC 2091 with my suggestion). I think this should be the default for cargo --release. It ensures we'd always have a trace to the blamed function, even if debug information is missing or buggy.
  • Panic messages and full backtraces for Rust functions (which I also (suggested in RFC 2091)

@main--
Copy link
Author

main-- commented Sep 21, 2017

@Zoxc According to #2070 (comment) it's possible to optimize messages out if the handler doesn't use them, at least in the no_std world. But since std's handler always prints the message, I don't think this works for regular applications. I brought this up because I really like how very pay-only-for-what-you-use this is. It's a neat way to get the first behavior you described.

Your "partial backtraces for blamed functions" proposal's biggest drawback is that annotating a function with #[blame_caller] means that the function is not important / just a small helper. Thus, what you actually get from that is a backtrace containing only the uninteresting frames (except the last one of course).

even if debug information is missing or buggy

It's a valid concern but I'm starting to think that it seems a lot worse than it really is. My experience with embedded platforms or other problematic environments is very limited, so I would very much like to hear other opinions on this but from what I can see basically everything that isn't MSVC is DWARF. And if your embedded device has any way to report errors at all why can't you just use addr2line from your cross toolchain to decode the backtrace addresses?

@Zoxc
Copy link

Zoxc commented Sep 21, 2017

Your "partial backtraces for blamed functions" proposal's biggest drawback is that annotating a function with #[blame_caller] means that the function is not important / just a small helper. Thus, what you actually get from that is a backtrace containing only the uninteresting frames (except the last one of course).

Well it avoids problems when who to blame is unclear. It's also consistent with how other backtraces work. Also you won't get dropped into a some function without the path to the actual panic being clear, which is a very important point. Functions unknown to you might have #[blame_caller] attached to them without you being aware of said function panicking.

It's a valid concern but I'm starting to think that it seems a lot worse than it really is. My experience with embedded platforms or other problematic environments is very limited, so I would very much like to hear other opinions on this but from what I can see basically everything that isn't MSVC is DWARF. And if your embedded device has any way to report errors at all why can't you just use addr2line from your cross toolchain to decode the backtrace addresses?

Manually typing in addresses in addr2line is a very painful workflow. Even if the device has a serial port, and I have written code to print to that serial port, I still have to copy and paste addresses around instead of being given a proper backtrace. I suggest you try doing this in your normal workflow.

My toolchain may not even support generating debug information. Even if it does, it's probably buggy and I may have to write a linker script which correctly generates said debug information (so I'd need to know how DWARF works). This situation is not made better by the fact you may have to mix code for different modes on x86 (x86-16, x86-32, x86-64).

Forking rustc to generate backtraces for me is a much simpler option.

@est31
Copy link
Member

est31 commented Sep 22, 2017

Patching LLVM to write column end info as well can't be that hard (although the poking I just did suggests that it may be a lot harder than it seems, at least) and I don't see why they would reject that patch.

Meta question: does the RFC process include contributions to LLVM? As in, can the RFC state "change LLVM to include column/line end info"? And @main-- could you include something like that? From a rustc side, this should be manageable. rustc manages span end positions internally, it should be easy to pass them to LLVM.

Edit: pointing out that if we include end column info, we should also include end line info, because if the span ends on a new line, it might highlight a wrong span.

@main--
Copy link
Author

main-- commented Sep 22, 2017

Manually typing in addresses in addr2line is a very painful workflow. Even if the device has a serial port, and I have written code to print to that serial port, I still have to copy and paste addresses around instead of being given a proper backtrace. I suggest you try doing this in your normal workflow.

@Zoxc I imagine your workflow is already quite automated if copy-pasting addresses is a significant inconvenience. Extending that to automatically symbolize stack traces shouldn't be too hard, I hope?

My toolchain may not even support generating debug information.

Why not? Is rustc/LLVM unable to generate DWARF for some targets? If yes, why?

Even if it does, it's probably buggy and I may have to write a linker script which correctly generates said debug information (so I'd need to know how DWARF works).

What do you mean? Of course if you're writing a kernel or something like that you always need a linker script to make sure the binary looks like it should. But that's a no_std scenario anyway which is an unresolved question right now. I guess you would either include the debug sections in your binary and somehow tell the backtracking runtime about it or link them separately and symbolize externally.

@retep998
Copy link
Member

After doing some digging, it turns out that dbghelp on Windows does allow getting the column. Both SymGetLineFromAddr64 and SymGetLineFromInlineContext have a pdwDisplacement parameter which std and the backtrace crate both ignore (https://github.com/rust-lang/rust/blob/master/src/libstd/sys/windows/backtrace/printing/msvc.rs#L72). So we're getting the column and then throwing it away without even bothering to display it.

The other big issue is inlined functions. Because walking the stack inherently does not know about inlined functions, the only way to display inlined functions in a backtrace is to ask the debug info for any inlined functions at that address. As it turns out, it seems dbghelp does have a way to get this information, by calling SymFromInlineContext or SymGetLineFromInlineContext we can specify the inline context to get information about inlined frames.

Of course the real issue now is whether LLVM is capable of putting column info and inlined frames into the debuginfo. It certainly can't hurt to have std and backtrace start printing that information though when it is available, such as when there are C or C++ functions in the backtrace.

# Drawbacks
[drawbacks]: #drawbacks

The single major drawback here is implementation complexity. Walking the stack is usually quite simple but parsing debug info to obtain symbols is not. On the other hand, backtraces already work on most platforms and are even essential to debugging for many users, so most of this is either already solved or worth the effort on its own.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Debuginfo is commonly 3-5x the size of the main binary. Relying on that much stuff to get line information seems like a drawback to me.

For reference, we build a service at work with level 1 debuginfo to have symbolicated stack traces. The binary is 53 MiB, of which 41 MiB is debuginfo.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the other hand debuginfo can easily be left out, whereas line information hardcoded into the binary is always there no matter what you do wasting space.

Copy link
Member

@sfackler sfackler Oct 17, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How much space do those hardcoded lines take up? Has anyone proposed a flag to turn off line!() &co?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This RFC proposes to replace the file!()+line!() panic locations with backtraces so it's always "turned off" for panics.

I'm aware that debuginfo can be huge, I just don't see why that's a problem:

  • This RFC doesn't change anything for debug builds.
  • In release builds:
    • Compile time is a lot less important
    • Size of the binary itself doesn't change (debug info is external)
    • Sure, the symbol file consumes disk space. But when is that a problem? You don't usually ship it to customers/production.
      • In cases where you do want debuginfo in production for some reason (like the example you mentioned), you can still just ask the compiler to give you what you want.

So what does change?

  • By default, you no longer print panic locations in production / to customers. The underlying rationale of both this RFC and RFC: Implicit caller location (third try to the unwrap/expect line info problem) #2091 (that, I think, everyone agrees with at this point) is that today's panic locations are rarely useful ("unwrap failed in option.rs") so I don't think this makes any difference.
  • However (the big win): When you see a (raw, unsymbolicated) stack trace in a log file or a bug report from a customer you can now symbolicate it to get a full stack trace. If you have a core dump, you can just fire up a debugger and you see everything!

# Reference-level explanation
[reference-level-explanation]: #reference-level-explanation

**Symbol files** already exist (and work) on Windows and MacOS. On Linux, it's as simple as writing the main binary to one ELF image and all the debug sections to another. The existing `-C debuginfo` compiler options gains a third level "3 = full debug info in an external symbol file". The cargo release profile sets `debuginfo=3` (instead of `debuginfo=0`).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The location of the debuginfo is orthogonal to its detail. Why couldn't I have out-of-binary level 1 debuginfo? It seems like it would make more sense for this to be a separate flag.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rationale here: #2154 (comment)

I'm not very happy with this either, I don't care about it. The only reason this isn't a new compiler flag is that I was too lazy to write up the whole "new flag, -Z/-C, stabilization, Cargo.toml" boilerplate.


The main purpose of panic location information as well as backtraces and debuggers is **debugging**: Finding and fixing the bug. Debugging code written in a compiled systems language is a hard problem but Rust is not unique here - the C ecosystem was facing exactly the same problem and debug info (in a platform-specific format like DWARF or PDB) exists to solve it. Of course, Rust is already leveraging this to support both backtraces and debuggers. But from this perspective it certainly seems very redundant to compile dedicated panic location strings when this information already exists in a binary's debug sections.

Furthermore, there is no automated solution right now that could remove panic location strings from binaries. This is a problem for developers of closed-source software as it can, in some circumstances, faciliate reverse engineering.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this RFC deal with this issue?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea here is that if the standard panic handler just discards file!() and line!(), I would expect the optimizer to remove them (see #2070 (comment)). This is nice because it avoids having to deal with reworking all of those panic internals (although doing that is probably a cleaner and better solution).

Just like the design of the compiler flag, this is a detail that I wanted to gloss over until later rounds of feedback (it already took me a while to find the time to write up this first draft, no point in delaying it further for things that don't really matter when I'm just looking for feedback on the basic idea).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How is std::panic::set_hook going to work?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Quoting from PanicInfo::location():

This method will currently always return Some, but this may change in future versions.

Payload still exists of course. And location can be resolved from debuginfo or just return None when that fails. Adding a method to PanicInfo that returns the entire stack trace seems like a good idea too.

@arielb1
Copy link
Contributor

arielb1 commented Nov 29, 2017

I'm not so sure I like this unless we also create some very good symbol infrastructure.

Both at Rust and in my past job, I often had to debug crashes at machines I don't have access to. If there are multiple versions of my executable (and perhaps, multiple versions of dynamic libraries) then tracking down the exact symbol file for the version that's installed on the machine is a non-trivial job, and assembling them correctly at my configuration so that I'll have correct line-numbers is a very non-trivial job.

I hope that for static cargo builds this would be easier - you only need to track down the debuginfo for the static executable.

@est31
Copy link
Member

est31 commented Dec 16, 2017

Do we have panic backtrace support on the wasm platforms? It would be pretty sad to regress here.

@pepyakin
Copy link

pepyakin commented Dec 16, 2017

@est31 I guess no. In panic=abort mode, in theory, embedder might print call stack with source locations, provided that there is source maps in binary. I've not seen such support yet but I've not checked for a while.

@est31
Copy link
Member

est31 commented Dec 16, 2017

Thanks @pepyakin. IIRC there is source maps support for emscripten, but not for the unknown target, I've filed a bug for the unknown target.

All this together makes me believe that this RFC should be postponed as there is serious implementation work still needing to be done. This includes:

  • support to emit debug info at all on all platforms (including the emscripten and unknown wasm platforms)
  • emission of column info on all platforms
  • making the debuginfo be accurate even after inlining on all platforms (the RFC mentions that this is a bug in the current implementation)
  • support to actually print out the backtrace (IIRC we only print "unknown" on mac OS?)

Panic location info might be "poor man's debuginfo" but at least it is working consistently across platforms. We can extend them with support to strip them or put them into a separate file. This is much simpler than implementing any of the four points above.

Even with this RFC merged, I support merging of #2091 because it would allow for debuginfo based panic emitters to display the first "semantic" place where you panic. With the current panic backtraces (and debuginfo based panic locations are no different here) you got a long list of functions, beginning with panic machinery.

@main--
Copy link
Author

main-- commented Dec 16, 2017

@est31

All this together makes me believe that this RFC should be postponed as there is serious implementation work still needing to be done.

I don't really understand what you mean. The RFC process always was about discussing ideas before doing the implementation work.

  1. I'm not familiar with how the web targets work (can sourcemaps do this?).
  2. The only issue with column info that I'm aware of only affects pdb and it's a simple choice between A) fix the llvm bug B) regress slightly on visual studio debugging C) set columns to zero (the current workaround)
  3. Sorry, my wording is perhaps a bit ambiguous there. What I meant is that today's libbacktrace-powered RUST_BACKTRACE=1 omits inlined frames because that's what libbacktrace does. The actual debug info is in my experience always correct and using a different implementation (like for instance addr2line in my tyrion PoC) shows a complete backtrace.
  4. The OSX revert you linked is unfortunate but the stated reason is again entirely a libbacktrace problem: The library is insecure in the face of malicous DWARF and since a binary's debuginfo sections are not loaded into the address space there needs to be a way to make sure we're loading the original binary. On Linux this is /proc/self/exe.

Panic location info might be "poor man's debuginfo" but at least it is working consistently across platforms. We can extend them with support to strip them or put them into a separate file. This is much simpler than implementing any of the four points above.

Interestingly, this is the first time I've seen anyone question the core rationale behind both this RFC and #2091. What good is something that works when it just doesn't help?

Even with this RFC merged, I support merging of #2091 because it would allow for debuginfo based panic emitters to display the first "semantic" place where you panic. With the current panic backtraces (and debuginfo based panic locations are no different here) you got a long list of functions, beginning with panic machinery.

I think something like the RUST_BACKTRACE=short from rust-lang/rust#38165 does the job just fine - filters all the noise while keeping everything that matters. I remain convinced that blaming the panic on a single frame/location (what #2091 aims to do) is unreasonable, bugs are rarely that simple in reality. And combining both approaches really doesn't make sense to me: How does going from main(), foo(), bar(), baz(), unwrap() to main(), foo(), bar(), baz() improve anything?

@est31
Copy link
Member

est31 commented Dec 16, 2017

The RFC process always was about discussing ideas before doing the implementation work.

Correct me if I'm wrong, but the "implementation work" for this RFC consists of removing panic location info, updating docs, and nothing else, right? If that is being done, without fixing the bugs, we would have the regressions I described here.

In the past RFCs have been postponed because of implementation work still needing to be done. This is nothing new. example 1, example 2.

The support for debuginfo that we have is too shaky right now, so IMO it should improve before we switch to it. In the end, a change needs to be implementable in the real world.

I'm not familiar with how the web targets work (can sourcemaps do this?).

I think so, all it needs is getting all the wasm targets emit sourcemaps. I'm an user of the unknown wasm target and don't have any line numbers in my backtraces. If the compiler switched towards debuginfo based panic locations, it would mean that I don't have any info at all. #2091 would greatly help me. When a panic occurs, the developer console does show you a "backtrace" consisting of a bunch of wasm-function[42], wasm-function[77], wasm-function[199] like entries. If you turn on debuginfo, aka -g in RUSTFLAGS (or you compile with --debug), you are getting mangled functions names like _ZN3std9panicking20rust_panic_with_hook17h21338e728eaa6c9fE instead.

Interestingly, this is the first time I've seen anyone question the core rationale behind both this RFC and #2091. What good is something that works when it just doesn't help?

Often panic loc info does help. E.g. when you encounter unimplemented!() or panic!() invocations inside your code. Sure, sometimes the file name and line/col numbers are useless, but often they actually point to the correct location. It's not black vs white. I don't question the core rationale behind #2091, it would certainly improve panic loc info.

@main--
Copy link
Author

main-- commented Dec 17, 2017

@est31

Correct me if I'm wrong, but the "implementation work" for this RFC consists of removing panic location info, updating docs, and nothing else, right?

No, the panic runtime also needs changes to always print backtraces by default. And to me that includes fixing the remaining issues, probably by moving away from libbacktrace.

My reasoning here is:

  1. You want a stack trace for all nontrivial bugs (Admittedly, today's panic locations often help but I don't want "you see the bug if you're lucky", I want "you can definitely see the bug").
  2. Once you have that, today's panic locations are completely redundant
  3. One perfectly portable way to get stack traces is the linked list approach outlined in RFC: Implicit caller location (third try to the unwrap/expect line info problem) #2091 (comment). The only drawback is the potential performance impact.
  4. There is an obvious intrinsic motivation to support debugging on any supported platform.
  5. Since debuginfo already has everything you need to symbolicate stack traces, why not just use that instead of inventing new separate machinery? The only obvious drawback is that instead of one portable implementation, there need to be n platform-specific ones. But DWARF and PDB are basically solved problems and other than Web Sourcemaps, there's not much on the horizon.

The support for debuginfo that we have is too shaky right now, so IMO it should improve before we switch to it. In the end, a change needs to be implementable in the real world.

It is indeeed worse than I'd like it to be, but I feel like a big reason for that is that not a lot of things depend on it right now.

I'm an user of the unknown wasm target and don't have any line numbers in my backtraces. If the compiler switched towards debuginfo based panic locations, it would mean that I don't have any info at all.

I understand but that's not what I'm proposing. Of course debuginfo has to work before the switch can happen.

@est31
Copy link
Member

est31 commented Dec 17, 2017

No, the panic runtime also needs changes to always print backtraces by default. And to me that includes fixing the remaining issues, probably by moving away from libbacktrace.

I don't say that we shouldn't fix the issues. We should! And once we've done it, we should have a separate look at this RFC.

DWARF and PDB are basically solved problems
[...]
Of course debuginfo has to work before the switch can happen.

DWARF might work on GNU/Linux, but backtraces are still not functional on OSX nor do we have column numbers on any target.

@main--
Copy link
Author

main-- commented Dec 20, 2017

And once we've done it, we should have a separate look at this RFC.

I don't understand. At this point I'm just asking "Do we want this?". None of what I'm proposing / needs to happen is very hypothetical or written in the stars, all of this can and does work.

When you suggest to "look at this RFC at a later point" what is your idea? The only reason I initially took the time to write up and post this RFC was that #2091 was approaching merge-FCP and since I'm sure that's the wrong approach I felt the need to present a counterproposal. All of this got stalled out by the impl period, I guess. But nevertheless I just don't see how the fact that a big part of the implementation work for this solution consists of fixing issues that should be fixed anyways is a reason to delay discussion about what's best.

DWARF might work on GNU/Linux, but backtraces are still not functional on OSX

Correct my if I'm wrong, but AFAICS backtraces were fully functional (rust-lang/rust#44251) on OSX at one point, but crashed due to that other bug where the compiler was emitting invalid DWARF (rust-lang/rust#45511) which ultimately led to a full revert (rust-lang/rust#45760) since @alexcrichton realized that due to rust-lang/rust#21889, libbacktrace cannot be used securely on OSX.

Note how most problems (this, missing inliened frames, etc) actually tend to circle back to libbacktrace. All of that goes away if we switch to a different implementation, for example gimli addr2line (I contribute to that one).

Don't forget that OSX is still just DWARF. Just like basically the entire Unix world is DWARF. Or any platform that builds with a GNU-based toolchain, which should cover most embedded devices. The world is DWARF. The only exceptions to this are Windows (PDB, but a non-issue since it already brings its own parser) and Web Stackmaps (arguably a very special and young case, as I already said I'm not familiar with that).

nor do we have column numbers on any target.

But ... it's not that hard. Again, all of the underlying formats support columns. By far the hardest part here is just a small patch to LLVM to take care of the Windows issue.

@est31
Copy link
Member

est31 commented Dec 20, 2017

None of what I'm proposing / needs to happen is very hypothetical or written in the stars, all of this can and does work.

It can work, but it doesn't. You said that the "implementation work" of this RFC

Note how most problems (this, missing inliened frames, etc) actually tend to circle back to libbacktrace. All of that goes away if we switch to a different implementation, for example gimli addr2line (I contribute to that one).

Once we've switched to that and we have column info everywhere, I believe we can merge this RFC.

If the RFC listed all the blockers by name I'd be more comfortable.

@whitequark
Copy link
Member

Regarding the no_std work: I've added support for bare-metal to libunwind, and obtaining the raw backtrace after the linker script is appropriately modified and libunwind is linked is just a dozen lines.

@Centril Centril added the T-compiler Relevant to the compiler team, which will review and decide on the RFC. label Feb 23, 2018
@Centril Centril added A-debugging Debugging related proposals & ideas A-panic Panics related proposals & ideas labels Nov 22, 2018
@Mark-Simulacrum
Copy link
Member

We discussed this in a T-compiler meeting, though we did not have significant attendance. We decided that we would like to see this proposal as a MCP, specifically from someone (or a group of people) with the intent to see the implementation and design work through. We do not think that it requires a full RFC, though there may need to be an FCP for stabilization if we end up with an implementation that needs to be opt-in rather than just on by default.

@pnkfelix
Copy link
Member

I'm locking this thread while I figure out what's going on.

@rust-lang rust-lang locked and limited conversation to collaborators Feb 26, 2021
@rust-lang rust-lang unlocked this conversation Sep 29, 2023
@wesleywiser
Copy link
Member

Discussed during a T-compiler steering meeting. The key three proposals the RFC seems to make are:

  1. Add compiler flag for split-debuginfo.
  2. Update cargo's release profile to build split-debuginfo by default.
  3. Update the panic hook to dump a backtrace.

Of these, item 1 shipped in Rust 1.65 and the other two items are under the purview of T-cargo and T-libs. We suggest interested individuals reach out to those teams to determine the best way to begin pursuing the remaining items.

@main--
Copy link
Author

main-- commented Oct 9, 2023

Discussed during a T-compiler steering meeting. The key three proposals the RFC seems to make are:

  1. Add compiler flag for split-debuginfo.
  2. Update cargo's release profile to build split-debuginfo by default.
  3. Update the panic hook to dump a backtrace.

Given that a lot of time has passed, I'd like to clarify the original intent of this RFC. The "heart" of this RFC was really more about shifting the direction where the ecosystem is going, rather than the hard technical specifics. At the time, I saw an opportunity to use the sorry state of panic messages ("panic in option.rs line 335") as inspiration/motivation for the Rust ecosystem as a whole to embrace unwinding and stack traces for debugging.

Today we have things like std::backtrace::Backtrace, which states:

This function will be a noop if the RUST_BACKTRACE or RUST_LIB_BACKTRACE backtrace variables are both not set. If either environment variable is set and enabled then this function will actually capture a backtrace. Capturing a backtrace can be both memory intensive and slow, so these environment variables allow liberally using Backtrace::capture and only incurring a slowdown when the environment variables are set.

And to be completely honest with you here, this just makes me sad. It makes me feel like the ship has sailed on this RFC. The whole point of this proposal was to prove that if done right, backtraces are fast and cheap! Obviously not run-this-in-a-tight-loop-without-noticing-cheap, but more than cheap enough for handling any type of application error. The problem is that to this day, people still conflate the process of capturing a stack trace (i.e. walking the stack to grab a couple of addresses) with the process of symbolicating that stack trace (which first of all requires having the full debuginfo, then parsing it and finding the symbols, potentially demangling those, etc). The second part is the really expensive one. The first part is not.

Your list is arguably correct, but it almost makes it sound like this RFC is concerned with implementation details, when the exact opposite is true! This really aims to solve two distinct, but related problems:

  1. By defaulting to stripped release builds (and providing zero tooling / guidance for debugging release builds), the "officially recommended" approach for dealing with errors/crashes in production essentially amounts to just crossing your fingers and flying blind. (Debuginfo-based panic locations #2154 (comment)) I will happily admit that this is a fine strategy for plenty of usecases, but I firmly believe that it's not an appropriate default.
  2. Reinventing debuginfo in the form of panic location information wastes unstrippable space in the binary for no reason. (Debuginfo-based panic locations #2154 (comment)) And for this cost we get ... objectively worse results (only a single frame with no context, potentially forgot #[track_caller], etc...). It's honestly puzzling.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-debugging Debugging related proposals & ideas A-panic Panics related proposals & ideas T-compiler Relevant to the compiler team, which will review and decide on the RFC.
Projects
None yet
Development

Successfully merging this pull request may close these issues.