-
-
Notifications
You must be signed in to change notification settings - Fork 5.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
truncated backtraces #3469
Comments
Wow, you figured out how to disable inlining forcibly by overwriting my method. @timholy your deviousness knows no bounds! |
More to the point, let me see if I fully understand the issue. Is it that inlining squashes the entire backtrace, as opposed to just the frame for the inlined function? Not sure how that might happen. |
Yes, that's it. Whatever is happening, it's super-weird. Sometimes you get a full backtrace, and sometimes you get a backtrace where the inlined function appears to be the "root" from which everything is called. The mechanism is that on the next iteration of the loop in The really weird part is that you get both behaviors during a run of the function (with no intermediate recompiling). The pointer addresses for the truncated backtraces are within a few bytes of the addresses for the ones that yield full backtraces. I haven't done a careful analysis of whether there's any overlap in the exact addresses; if that would help figure this out, I'd be glad to do it. Seems like it could easily be a libunwind bug, but the fact that this seems restricted to inlined functions made me think it's possible there's more going on. |
Before getting to more substantive issues...
It doesn't reflect any deep understanding of |
OK, so it was pretty obvious that analyzing the instruction pointers would be a good thing to do. I accumulated 80 runs of
For reference, I'll note the disassembly for
For the full-length backtraces, there were 14 distinct instruction pointers:
And the interesting part:
With statistics of this size, this cannot be a coincidence. I'm back to thinking this is a libunwind issue. I've only just started to take baby steps in learning to read the disassembly, and not yet able to make inferences about how the instruction pointers relate to particular points in the disassembly (e.g., I don't know how many bytes to expect in each line). Maybe the pattern is meaningful to someone else? All this is x86_64. |
Nice investigation! It seems consistent with the idea that the state is slightly different at different points in the function, some of which leave libunwind confused. You might be able to use |
Thanks for the tip re
So the first "interesting" one from the long backtrace seems to be the entrypoint. |
Blake has just done some really nice work revealing that these truncated backtraces are not Julia-specific. A Julia-specific problem, however, seems to be segfaults on OSX. timholy/IProfile.jl#14 (comment) So now that begs the question, should this be closed? There's useful data here for debugging a problem, and there's an even more serious issue, but it's not obviously relevant to the original issue. |
I take it that the conclusion is that this is a libunwind problem? Marking "upstream"; their tracker is http://savannah.nongnu.org/bugs/?group=libunwind |
As testing has progressed, it's not entirely clear what the situation is; it may not be purely upstream. |
I'm in the process of filing a bug report to |
LLVM has its own build time options and also options that are set at runtime for the generated code. |
We set some options in |
You could try setting |
We might want to try |
I'm very grateful for the help here, folks. @blakejohnson has put together a Julia-free testcase (C or C+fortran) here: timholy/IProfile.jl#14. The problem is that on OSX it seems to only happen for OpenBLAS calls, and on Linux (gcc) I can't yet get it to happen at all even when I force C inlining. Thanks for the tips re the compile options. I added I had also already checked whether eliminating To me From what I can tell, |
The understanding we've gained so far has also allowed me to finally generate a small test case that isolates a related problem I've seen before:
Now do this:
Note that this backtrace is truncated; there is no information about who called |
For the record, I get the full backtrace on Mac64, but not on Linux64 when using your example code. |
So, it might only be relevant on Linux. |
One more observation: I started thinking that the best short-term solution might be to see what happens if we ignore the return value of However, testing on this "nest" example showed that this strategy doesn't work; I get the message that the backtrace repeats 79998 times. This indicates that So it seems that to fix this we may need to generate the missing unwind information. |
Update: while I've (slowly) been teaching myself a bit about LLVM and related topics in order to try to address this issue, the behavior has changed:
In addition, it's worth showing the latest SProfile results on
The oddity: I've skimmed recent changes to codegen, but there have been a lot of them (thanks, Jeff!) and I don't have any good guesses about what caused these changes. If it would be useful to bisect any of these changed behaviors (particularly the loss of inlining |
Are you sure |
It's a fair question; at the time I wasn't in the habit of looking at the disassembly because I'd never learned how to read it. So I went back and checked out f724a02 from June 20th (the date I filed this) and rebuilt. Here's what I suspect is the crucial part of the disassembly of
So, looks like no. Yet I was able to reproduce the truncated backtraces just fine. So, turning off inlining got rid of them, but in fact this function wasn't inlined. When I updated to current (d3be255), now I'm not seeing the "new" behavior, I'm seeing the original behavior as reported in this issue. In other words, there's still something to fix. I swear I'm not imagining these things, but this is really weird. I wonder if it might depend on what had been run previously within the same REPL session? |
I've made a bit more progress (and can't really invest a lot more in this for today, so reporting in...). By reading the libunwind source, I discovered that one can turn on debugging output. Below I've indicated the build & code changes needed to turn this on, how I ran it, and the results for one truncated backtrace and one full backtrace, both starting from within Bottom line: it looks like none of the Julia-compiled code has DWARF information, so libunwind relies on "frame-chain". Sometimes this works and sometimes it doesn't. In searches I've found precious little about the combination llvm + dwarf + jit online; this link is perhaps the most interesting one.
|
Great detective work, Tim. That explains why our backtraces are so generally flaky. The "old" LLVM JIT infrastructure does not emit DWARF info, whereas I believe that the new MCJIT does support that. So maybe this is one of the many things that would be fixed by making the switch. Big undertaking though. The sad thing about this is that the stack frame, if they were following the x86_64 spec, is a super simple linked list. Even when you ask for this spec to be followed, however, it isn't. |
I did find an older version of that page (dating to LLVM 2.7) that seems to suggest that there was some way to get DWARF information even in earlier versions (I'm guessing this must be something they've been able to do for a while, since my limited understanding is that this should be used anytime you run a debugger). In fact, in 2.7 there seemed to have been a Even presuming LLVM can emit this information, there's also the question of getting libunwind to use it. |
I've read a bit more. I'm now guessing that the "right" way to handle all of these issues will be to switch to LLDB, which in addition to solving this immediate issue might also enable a whole host of other useful things (like a full debugger). JIT support seems like it's on the road map, but not ready yet: http://blog.llvm.org/2013/06/lldb-33-and-beyond.html. Unless someone knows differently, I'm therefore planning on putting this issue on ice for the time being and see how that support evolves. |
Hi all, I was having a similar-looking issue in my project, and using the debugging steps from this thread (thanks!) I managed to isolate it to what I think is a bug in libunwind. I can't tell if it's the same issue that you guys are having, but what I was running into is that libunwind will judge stack frames with no stack variables (ie rbp=rsp) as "invalid" and terminate the unwind. Here's a simple patch that changes the behavior: Again, it works for me and it seems reasonable, but I haven't heard anything back from the libunwind list, and I don't know if it's the problem you're having so YMMV. Also I'm running on MCJIT and, at least out of the box, I don't think it has any automatic libunwind integration. I think it should be fairly easy to add, though, since MCJIT will already emit the relevant DWARF info and it's should theoretically just be a matter of finding the right address and calling _U_dyn_register on it; I haven't gotten around to it, though, since (other than the rbp=rsp issue) I haven't seen any issues with the rbp-based unwinding. |
Today is Christmas. I haven't tested extensively, but cursory examination suggests you've fixed my personal most-hated-bug-ever. Can't thank you enough. To the rest: I'm not exactly sure what we need to do to apply a patch in one of the libraries, but we should do this! |
@timholy Should be as simple as this https://github.com/tkelman/julia/compare/patch-libunwind @kmod Thanks so much for the heads up here! Pyston is a really interesting project that a lot of us are keeping an eye on. |
@tkelman, thanks! Have you submitted this as a pull request? Definitely eager to see this get merged. |
Not yet but I could. Or let @kmod do the honors if he wants, since he wrote the patch. It's really only a 1-character change, but I'd still want to be careful about licenses. Pyston appears to be Apache License v2.0, which is GPLv3 compatible but not GPLv2? Are any of the Julia dependencies GPLv2 only? |
Is this relevant to our libosxunwind as well? |
Bump. IANAL, and I don't want to get Julia in trouble either. I believe the portable Git we distribute with Julia Windows binaries is strictly GPL v2. @kmod can you clarify whether patches in the Pyston repository are under Apache License v2, or if you would be willing to submit that libunwind patch to Julia under the MIT license? Libunwind is also MIT licensed, so presumably they would like to know the same - unless there's some trivial patch rule where this stuff doesn't matter. |
You surely saw this, but just in case: https://github.com/dropbox/pyston/blob/master/LICENSE This patch lives in that repository, and there is no mention of separate licenses applying to separate parts. |
Oh, scrolled up, and you had seen that already. |
Right, and from: https://www.gnu.org/licenses/license-list.html#apache2
|
The patch is a single character deletion. Is that really copyrightable? |
Also, we don't link against git, so the point (there at least) is moot. |
Recent legal precedent has been redefining what is copyrightable, throwing common sense out the window. I don't see why Dropbox would have any reason to take Julia to court, but there are other entities that could in the future have both the motivation and the resources to do so. How many Matlab APIs might Julia, SciPy, etc now be in hypothetical legal trouble for imitating? That question is unrelated to this one-character patch, but hopefully you can see where my unnecessary paranoia is coming from. Rather ask permission here... |
It would be great to get this patch integrated. Would be best to get its |
Hi, I hereby place the patch in the public domain for you (or anyone else) to use as you please. The same disclaimer as in the Pyston (Apache 2.0) license applies, that it's released without warranties either expressed or implied. I'll update the LICENSE to say so; also I submitted the patch for inclusion in libunwind so ideally it would actually end up getting licensed under their license, but I'm not sure the project is still actively being developed so who knows. |
Fantastic, thank you very much Kevin. Will submit a Julia PR momentarily. Regarding the maintenance of libunwind, Arun Sharma did respond recently on their mailing list: http://lists.nongnu.org/archive/html/libunwind-devel/2014-05/msg00001.html |
Thanks a bunch, Kevin! |
Fixed by b551291 |
Amazing to have this fixed! |
I've decided to get a tattoo that says |
I think everyone is waiting for video of your happy dance :) |
Should be beneficial to other LLVM+libunwind projects as well (e.g. Pyston)
Should be beneficial to other LLVM+libunwind projects as well (e.g. Pyston)
In timholy/IProfile.jl#19 I documented that sometimes libunwind returns truncated ("unrooted") backtraces. Now I believe I have traced this to inlining, and hence I wonder if it could be a Julia issue rather than a libunwind issue.
Using the same code as in the Profile.jl issue, I tried this:
The thing that caught my eye is that there are no longer any unrooted backtraces in
unsafe_iceil
.The text was updated successfully, but these errors were encountered: