Segfaults #14

blakejohnson · 2013-05-17T13:38:21Z

I keep running into an issue where running @sprofile on the same function several times causes Julia to segfault. I am not even sure where to start in finding the origin of this issue. I am running Mac OS 10.7.5. I rebuild Julia daily, and this problem has existed at least for a few weeks. I also rebuilt the Profile.jl library with Pkg.runbuildscript("Profile").

If it would help to post the code I am running, I would be happy to do so.

The text was updated successfully, but these errors were encountered:

timholy · 2013-05-17T15:19:55Z

Thanks for the report. I've never seen this, so if you really have no objections it would be useful for me to try your code. It sounds like you've tried the obvious things. Did it used to work without a segfault, or is it simply that you've only been using @sprofile in this fashion recently?

This might be an OSX vs Linux thing, of course. Since I was writing it on Linux I went with its most modern timer API, then later learned that it's not available on BSD. But Linux can also use the same older timer API as OSX. So if I can't reproduce the segfault I'll compile a special version and test the OSX timers. I did that when I first developed the OSX support, but I bet I didn't give it the same kind of thorough workout you're giving it.

blakejohnson · 2013-05-17T15:59:10Z

Sure, the code is here: https://gist.github.com/blakejohnson/5600044

Running this causes my machine to segfault:

require("SimSpeedTest.jl")
require("Profile")
using SProfile
@sprofile fot ct=1:10 run_sim(); end

timholy · 2013-05-17T19:37:54Z

I'll definitely have to try the BSD timers. I was able to run that last line ~6 times without issue.

Hopefully this weekend.

blakejohnson · 2013-05-17T19:46:20Z

I'll also have a look on another computer to see if it is something
peculiar to my work machine.

--Blake

On Fri, May 17, 2013 at 3:37 PM, Tim Holy [email protected] wrote:

I'll definitely have to try the BSD timers. I was able to run that last
line ~6 times without issue.

Hopefully this weekend.

—
Reply to this email directly or view it on GitHubhttps://github.com//issues/14#issuecomment-18081921
.

blakejohnson · 2013-05-17T21:20:18Z

FWIW, here is an example backtrace upon segfault:

#0  0x00007fff8916a2b2 in libunwind::CompactUnwinder_x86_64<libunwind::LocalAddressSpace>::stepWithCompactEncodingRBPFrame ()
#1  0x00007fff8916a1fd in libunwind::CompactUnwinder_x86_64<libunwind::LocalAddressSpace>::stepWithCompactEncoding ()
#2  0x00007fff8916a171 in libunwind::UnwindCursor<libunwind::LocalAddressSpace, libunwind::Registers_x86_64>::step ()
#3  0x0000000100097a66 in rec_backtrace ()
#4  0x00000001018f2d18 in sprofile_bt ()
#5  <signal handler called>
#6  0x00000001048259d1 in zhemv_ ()
#7  0x00000001050357a7 in zhetd2_ ()

blakejohnson · 2013-05-18T00:37:58Z

Ok, I can confirm the issue on another Mac, this one with OS 10.8.2. The segfault backtrace is basically the same.

timholy · 2013-05-18T16:18:34Z

OK, I commented out the Linux timers and rebuilt. I still don't get the segfault. Then I ran it under valgrind as
valgrind --track-origins=yes /home/tim/src/julia/usr/bin/julia-debug-readline.

I got these errors:
==1537== Thread 2:
==1537== Syscall param msync(start) points to unaddressable byte(s)
==1537== at 0x50684CD: ??? (syscall-template.S:82)
==1537== by 0x5E69ADE: msync_validate (in /home/tim/src/julia/usr/lib/libjulia-debug.so)
==1537== by 0x5E69C0B: validate_mem (in /home/tim/src/julia/usr/lib/libjulia-debug.so)
==1537== by 0x5E69D48: access_mem (in /home/tim/src/julia/usr/lib/libjulia-debug.so)
==1537== by 0x5E67A2C: dwarf_get (in /home/tim/src/julia/usr/lib/libjulia-debug.so)
==1537== by 0x5E67CEA: _ULx86_64_access_reg (in /home/tim/src/julia/usr/lib/libjulia-debug.so)
==1537== by 0x5E675AA: _ULx86_64_get_reg (in /home/tim/src/julia/usr/lib/libjulia-debug.so)
==1537== by 0x5E6C9FB: apply_reg_state (in /home/tim/src/julia/usr/lib/libjulia-debug.so)
==1537== by 0x5E6D1AA: _ULx86_64_dwarf_find_save_locs (in /home/tim/src/julia/usr/lib/libjulia-debug.so)
==1537== by 0x5E691BB: _ULx86_64_dwarf_step (in /home/tim/src/julia/usr/lib/libjulia-debug.so)
==1537== by 0x5E67E97: _ULx86_64_step (in /home/tim/src/julia/usr/lib/libjulia-debug.so)
==1537== by 0x53A5987: rec_backtrace (task.c:506)
==1537== Address 0xd99d000 is not stack'd, malloc'd or (recently) free'd
==1537==
==1537== Syscall param msync(start) points to uninitialised byte(s)
==1537== at 0x50684CD: ??? (syscall-template.S:82)
==1537== by 0x5E69ADE: msync_validate (in /home/tim/src/julia/usr/lib/libjulia-debug.so)
==1537== by 0x5E69C0B: validate_mem (in /home/tim/src/julia/usr/lib/libjulia-debug.so)
==1537== by 0x5E69D48: access_mem (in /home/tim/src/julia/usr/lib/libjulia-debug.so)
==1537== by 0x5E6AB8C: dwarf_get (in /home/tim/src/julia/usr/lib/libjulia-debug.so)
==1537== by 0x5E6CC5A: apply_reg_state (in /home/tim/src/julia/usr/lib/libjulia-debug.so)
==1537== by 0x5E6D1AA: _ULx86_64_dwarf_find_save_locs (in /home/tim/src/julia/usr/lib/libjulia-debug.so)
==1537== by 0x5E691BB: _ULx86_64_dwarf_step (in /home/tim/src/julia/usr/lib/libjulia-debug.so)
==1537== by 0x5E67E97: _ULx86_64_step (in /home/tim/src/julia/usr/lib/libjulia-debug.so)
==1537== by 0x53A5987: rec_backtrace (task.c:506)
==1537== by 0xE3FE8B7: sprofile_bt (profile.c:27)
==1537== by 0x646E49F: ??? (in /lib/x86_64-linux-gnu/libc-2.15.so)
==1537== Address 0xd99e000 is on thread 2's stack
==1537== Uninitialised value was created by a stack allocation
==1537== at 0x53A58E1: rec_backtrace (task.c:499)
==1537==

See also:
JuliaLang/julia#782

I don't know what to do here. Looks like it could be a bug in libunwind. CCing @JeffBezanson.

blakejohnson · 2013-05-18T16:40:17Z

I also get the feeling that it is a libunwind bug.

blakejohnson · 2013-05-21T00:50:10Z

On the Mac, valgrind crashes before it finishes loading the Julia REPL. So, I can't show similar output...

timholy · 2013-05-21T01:29:37Z

Hmm, that's not good!

Sounds like one of us should submit a bug report to libunwind. Do you do C?

blakejohnson · 2013-05-21T01:38:25Z

Right. Reducing this to a bug report that doesn't require Julia is the trick. Unfortunately, it also looks like Apple's updates to libunwind have not been accepted upstream. So, do we submit a bug with Apple or the libunwind project?

Ugh.

JeffBezanson · 2013-05-21T08:01:03Z

I believe the reason libunwind is calling msync here is just to check its return code to see if a memory region is valid. So it is expected that it will sometimes pass invalid addresses to msync.

timholy · 2013-05-21T08:34:47Z

Right. Reducing this to a bug report that doesn't require Julia is the trick. Unfortunately, it also looks like Apple's updates to libunwind have not been accepted upstream. So, do we submit a bug with Apple or the libunwind project?

Dunno. I assume we get "ours" straight from libunwind? In that case, I'd
consider starting with libunwind and see where that gets us.

timholy · 2013-05-21T11:52:24Z

@JeffBezanson, thanks for checking. Obviously it might be more informative to run it on the mac, if valgrind weren't crashing.

blakejohnson · 2013-05-25T14:20:55Z

It turns out that we only build libunwind on linux and FreeBSD. So, this must be a bug in Apple's implementation. I think Apple still runs their radar bug tracker. So, I guess I will file there.

timholy · 2013-05-27T20:38:36Z

Thanks for continuing to pursue this! Profile users will owe you their thanks.

timholy · 2013-05-28T07:17:44Z

CC @ViralBShah.

ViralBShah · 2013-05-28T08:01:35Z

We can certainly build a libunwind on OS X, if there is a patched version that works.

blakejohnson · 2013-06-04T13:24:53Z

So, I've been looking more into this over the last few days. It's difficult to debug because libunwind is very low-level. I've only really unearthed two new pieces of information:

The crash point always has OpenBLAS on the stack.
The few other similar reports I can find suggest that the problem could be using inconsistent versions of libstdc++. So, I wondered if gfortran is linking to a different libstdc than clang.

timholy · 2013-06-04T15:13:28Z

That does sound like progress. Viral, you know a lot about Julia's build process, is this theory likely?

ViralBShah · 2013-06-05T03:40:43Z

I do not have any more insight here. Since gfortran is provided by the user, I can imagine that different builds link to different libraries. I personally have now moved towards using brew for all the dependencies to build julia. OpenBLAS and gfortran do not seem to be linking to anything weird though:

 otool -L libopenblas.dylib 
libopenblas.dylib:
    @rpath/libopenblas.dylib (compatibility version 0.0.0, current version 0.0.0)
    /usr/local/Cellar/gfortran/4.8.0/gfortran/lib/libgfortran.3.dylib (compatibility version 4.0.0, current version 4.0.0)
    /usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 169.3.0)
    /usr/local/Cellar/gfortran/4.8.0/gfortran/lib/libquadmath.0.dylib (compatibility version 1.0.0, current version 1.0.0)

OpenBLAS does lots of tricks under the hood, and perhaps @xianyi can tell us if it is possible that it is likely to interfere with profiling.

Also, the last I had checked (probably a year ago, when we got backtrace), the upstream libunwind does not build on the mac, and you have to use Apple's. Perhaps that has changed now.

ViralBShah · 2013-06-05T03:42:01Z

I notice that libprofile is built on gcc with mac, instead of with clang. Could this be an issue?

ViralBShah · 2013-06-05T03:45:49Z

I built with clang and still get the segfault.

blakejohnson · 2013-06-05T12:47:13Z

If you then do otool -L /usr/local/Cellar/gfortran/4.8.0/gfortran/lib/libgfortran.3.dylib you will notice that it links to '/usr/local/Cellar/gfortran/4.8.0/gfortran/lib/libgcc_s.1.dylib'. Whereas libjulia-release.dylib links to /usr/lib/libstdc++.6.dylib. Is this a problem?

I'll check what this looks like on linux.

blakejohnson · 2013-06-05T13:11:45Z

On Ubuntu 12.04:

ldd libopenblas.so
    linux-gate.so.1 =>  (0xb76f8000)
    libm.so.6 => /lib/i386-linux-gnu/libm.so.6 (0xb6a02000)
    libpthread.so.0 => /lib/i386-linux-gnu/libpthread.so.0 (0xb69e7000)
    libgfortran.so.3 => /usr/lib/i386-linux-gnu/libgfortran.so.3 (0xb68ee000)
    libc.so.6 => /lib/i386-linux-gnu/libc.so.6 (0xb6744000)
    /lib/ld-linux.so.2 (0xb76f9000)
    libquadmath.so.0 => /usr/lib/i386-linux-gnu/libquadmath.so.0 (0xb66cc000)
    libgcc_s.so.1 => /lib/i386-linux-gnu/libgcc_s.so.1 (0xb66ae000)

ldd /usr/lib/i386-linux-gnu/libgfortran.so.3
    linux-gate.so.1 =>  (0xb76ed000)
    libquadmath.so.0 => /usr/lib/i386-linux-gnu/libquadmath.so.0 (0xb7561000)
    libm.so.6 => /lib/i386-linux-gnu/libm.so.6 (0xb7535000)
    libgcc_s.so.1 => /lib/i386-linux-gnu/libgcc_s.so.1 (0xb7516000)
    libc.so.6 => /lib/i386-linux-gnu/libc.so.6 (0xb736c000)
    /lib/ld-linux.so.2 (0xb76ee000)

ldd libjulia-release.so 
    linux-gate.so.1 =>  (0xb77bf000)
    libdl.so.2 => /lib/i386-linux-gnu/libdl.so.2 (0xb681a000)
    librt.so.1 => /lib/i386-linux-gnu/librt.so.1 (0xb6811000)
    libpthread.so.0 => /lib/i386-linux-gnu/libpthread.so.0 (0xb67f5000)
    libstdc++.so.6 => /usr/lib/i386-linux-gnu/libstdc++.so.6 (0xb670c000)
    libm.so.6 => /lib/i386-linux-gnu/libm.so.6 (0xb66e0000)
    libgcc_s.so.1 => /lib/i386-linux-gnu/libgcc_s.so.1 (0xb66c2000)
    libc.so.6 => /lib/i386-linux-gnu/libc.so.6 (0xb6518000)
    /lib/ld-linux.so.2 (0xb77c0000)

So, both libgfortran and libjulia link to libgcc_s.so.1. This is at least different from the mac.

ViralBShah · 2013-06-05T17:13:37Z

That doesn't seem to be a problem to me. The gfortran is not provided with XCode and hence it brings its own version of libgcc. libjulia needs libstdc++ which comes from the C toolchain. It seems that the linking and loading is happening correctly.

blakejohnson · 2013-06-05T17:38:22Z

Alright, I guess it is time to look in a different direction. My test script spends most of its time in OpenBLAS, so perhaps it is just a coincidence that that I never seen it crash outside of OpenBLAS. I'll try some of the perf2 demos that are pure Julia and see if I can get the profiler to crash.

blakejohnson · 2013-06-05T17:59:30Z

Well, I cannot get the laplace or Go benchmarks to segfault during profiling. So, perhaps OpenBLAS or ccall are important.

ViralBShah · 2013-06-06T06:04:44Z

@xianyi Any wisdom from you here will be useful. This is a blocker for having a sampling profiler as part of julia base, since it crashes when openblas is executing.

xianyi · 2013-06-06T14:20:59Z

Hi @blakejohnson , @ViralBShah ,

Could you try the latest develop branch? We fixed some bugs in level-2 BLAS.

Is it multi-threaded OpenBLAS? Could you try the single thread? Or, build OpenBLAS with USE_OPENMP=1

Xianyi

ViralBShah · 2013-06-24T13:07:08Z

Potential culprit: JuliaLang/julia#3365

Perhaps the fortran interface inside OpenBLAS needs to be compiled with -ff2c

timholy · 2013-06-24T13:17:10Z

But the segfaults on OSX are not specific to OpenBLAS. On Linux, the "unrooted" backtraces are also independent of OpenBLAS.

timholy · 2013-06-24T14:59:18Z

Blake, when you link against OpenBLAS, are all the backtraces truncated, or only some of them? For example, I'm guessing if the backtrace is triggered while in main it should not be truncated (capturing samples in main is quite rare, so you may need to run for more samples or run multiple times). Also, looking at your truncated backtraces, I notice there's something yet different from #19: the "head" of the truncated backtrace is not even within OpenBLAS, it's within the signal handler. Presumably unw_step is returning <=0 for the first frame beyond that.

Even though things are still rather confusing, I'm wondering if the time has come to report the issue with the libunwind developers. Perhaps they might immediately know what's happening and help direct our investigations. I'm not sure whether the mailing list or the bug reporter is the better choice. My suspicion is that the mailing list would be better, since I think there's still a lot we don't understand about this issue.

timholy · 2013-06-24T15:01:28Z

Blake, if you also think that the time has come to report it, do you want to do it or should I? I'm happy either way.

blakejohnson · 2013-06-24T15:09:29Z

If you wouldn't mind contacting the libunwind developers, that would be great. They may or may not be able to help given the disconnect between their code and Apple'. But, they can probably point us in productive directions.

I want to try to extend my example a little further to include calls to a simple external library. Then, I think this will be ready to file with Apple.

blakejohnson · 2013-06-24T15:38:39Z

I should mention my null result from last night: I obtained a proper gcc (rather than use the llvm-gcc that Apple provides) and tried building my test with that to at least rule out some varieties of incompatible calling conventions between gfortran and LLVM. No change, regardless of optimization level or fno-omit-frame-pointer.

blakejohnson · 2013-06-24T18:04:45Z

Now I am back to thinking that there is an issue with how OpenBLAS is compiled. I've updated my gist with two new targets:
https://gist.github.com/blakejohnson/5831456

The 4th target uses a simple C version of daxpy. I get full backtraces here.

The 5th target compiles NETLIB's daxpy with gfortran and calls that directly. This also gives me full backtraces.

So, OpenBLAS must be doing something different. @xianyi ?

xianyi · 2013-06-26T02:46:29Z

OpenBLAS uses the assembly kernel for daxpy.

blakejohnson · 2013-06-26T13:27:58Z

I see... these assembly methods manually encode the ABI prologue and epilogue. Hmm. Maybe I should choose a method implemented in C.

timholy · 2013-06-26T14:30:40Z

I was hoping that perhaps we were beginning to discover that assembly methods were lacking unwind information and causing your segfault. But gettimeofday seems to be C. So this might explain the truncated backtraces but not the segfaults.

timholy · 2013-06-26T14:43:49Z

@blakejohnson, can you trigger the segfault from pure-Julia, using the unsafe_iceil example I posted in JuliaLang/julia#3469?

blakejohnson · 2013-06-26T15:18:42Z

After spending a couple hours trying to learn about calling conventions in x86_64, it seems like the assembly methods in OpenBLAS always omit the frame pointer. We could try modifying the definitions of PROLOGUE and EPILOGUE in OpenBLAS/common_x86_64.h to include:

pushq %rbp
movq %rsp, %rbp

and

popq %rbp

blakejohnson · 2013-06-26T17:10:13Z

@timholy I cannot seem to trigger a segfault with your unsafe_iceil example.

blakejohnson · 2013-06-26T17:48:12Z

Looking into this a little further, we may be able to just add .cfi_startproc and .cfi_endproc to the PROLOGUE and EPILOGUE definitions. I'm building OpenBLAS with that change now.

blakejohnson · 2013-06-26T18:09:34Z

Well, no joy on my simple modification of PROLOGUE and EPILOGUE.

timholy · 2013-06-26T18:17:46Z

Good try. This is a toughie!

I think I'm going to try the whole _U_dyn_register thing, but it may be a few days before I can get to it (and perhaps more to finish it...).

xianyi · 2013-06-27T15:38:35Z

Hi @blakejohnson ,

When you added the following instruction.

pushq %rbp
movq %rsp, %rbp

Did you use -fno-omit-frame-pointer flag?

Xianyi

blakejohnson · 2013-06-28T11:35:14Z

@xianyi I haven't tried that code snippet yet, just the .cfi_startproc and .cfi_endproc version. I was under the impression, though, that -fno-omit-frame-pointer would only affect code that wasn't already written in assembly. Is that incorrect?

timholy · 2013-07-14T12:35:46Z

@blakejohnson, it's possible that turning on libunwind debugging output might help debug these segfaults. See the procedure in JuliaLang/julia#3469 (comment)

blakejohnson · 2013-07-15T00:28:49Z

Thanks, @timholy. I'll see if Apple's libunwind supports similar debug flags.

asterite · 2015-01-03T15:33:58Z

We are facing similar issues in Crystal. We also believe it's a bug in libunwind and LLVM. We found this:

http://llvm.org/bugs/show_bug.cgi?id=20800

We applied the patch and use those binaries and it seems to be working fine, but now we tripped into this again so there might be more bugs. So I'm almost sure it has nothing to do with BLAS.

The worse thing is that it's hard to reproduce, for a given program it always crashes but we only get that with the compiler's source, and with small changes to the compiler's source it goes away or comes back so it's hard to find a minimal case.

asterite · 2015-01-03T15:34:19Z

BTW, it only happens on OSX. On Linux it always works fine.

timholy · 2015-01-03T16:53:44Z

I don't know whether this is still a problem for us (I can't test, since I don't have a Mac).

FYI, this profiler moved into julia base, and a potentially-relevant thread (very long, but with a happy resolution) is JuliaLang/julia#3469. I suspect it's a separate issue, but I thought you should know about it .

timholy mentioned this issue Jul 20, 2013

segfault running julialang/julia#3775 #22

Closed

ViralBShah mentioned this issue Aug 7, 2013

Segfault in profiler JuliaLang/julia#3971

Closed

ViralBShah mentioned this issue Aug 27, 2013

Fix profiling on OS X JuliaLang/julia#4159

Merged

timholy mentioned this issue Jan 3, 2015

LLVM 3.5 is missing backtraces JuliaLang/julia#7910

Closed

asterite mentioned this issue Jan 8, 2015

Compiler segmentation fault crystal-lang/crystal#327

Closed

timholy mentioned this issue Feb 27, 2015

Profiling Pkg.update() segfaults on OSX JuliaLang/julia#10355

Closed

timholy closed this as completed Nov 14, 2015

Segfaults #14

Segfaults #14

Comments

blakejohnson commented May 17, 2013

timholy commented May 17, 2013

blakejohnson commented May 17, 2013

timholy commented May 17, 2013

blakejohnson commented May 17, 2013

blakejohnson commented May 17, 2013

blakejohnson commented May 18, 2013

timholy commented May 18, 2013

blakejohnson commented May 18, 2013

blakejohnson commented May 21, 2013

timholy commented May 21, 2013

blakejohnson commented May 21, 2013

JeffBezanson commented May 21, 2013

timholy commented May 21, 2013

timholy commented May 21, 2013

blakejohnson commented May 25, 2013

timholy commented May 27, 2013

timholy commented May 28, 2013

ViralBShah commented May 28, 2013

blakejohnson commented Jun 4, 2013

timholy commented Jun 4, 2013

ViralBShah commented Jun 5, 2013

ViralBShah commented Jun 5, 2013

ViralBShah commented Jun 5, 2013

blakejohnson commented Jun 5, 2013

blakejohnson commented Jun 5, 2013

ViralBShah commented Jun 5, 2013

blakejohnson commented Jun 5, 2013

blakejohnson commented Jun 5, 2013

ViralBShah commented Jun 6, 2013

xianyi commented Jun 6, 2013

ViralBShah commented Jun 24, 2013

timholy commented Jun 24, 2013

timholy commented Jun 24, 2013

timholy commented Jun 24, 2013

blakejohnson commented Jun 24, 2013

blakejohnson commented Jun 24, 2013

blakejohnson commented Jun 24, 2013

xianyi commented Jun 26, 2013

blakejohnson commented Jun 26, 2013

timholy commented Jun 26, 2013

timholy commented Jun 26, 2013

blakejohnson commented Jun 26, 2013

blakejohnson commented Jun 26, 2013

blakejohnson commented Jun 26, 2013

blakejohnson commented Jun 26, 2013

timholy commented Jun 26, 2013

xianyi commented Jun 27, 2013

blakejohnson commented Jun 28, 2013

timholy commented Jul 14, 2013

blakejohnson commented Jul 15, 2013

asterite commented Jan 3, 2015

asterite commented Jan 3, 2015

timholy commented Jan 3, 2015