Stack overflow segfault in getrf_parallel using Breeze / Netlib-Java on Debian 8 Jessie #1082

Closed
pmkc opened this issue Feb 4, 2017 · 8 comments


@pmkc

pmkc commented Feb 4, 2017

Hi

I got a segfault using Breeze 0.12 on Debian 8 Jessie.

Using gdb, I traced it through 4 levels of getrf_parallel recursion before it overflowed the stack.

Jessie's OpenBLAS is 0.2.12 with some patches (I think including 5d33121). Most importantly, it sets NUM_THREADS = 64, which makes the getrf_parallel stack overflow from #246 (and probably #912) easily blow out Java's 1 MB stack.

Could you just always heap-allocate job_t?
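To see why NUM_THREADS = 64 hurts so much, here is a rough Scala estimate of the on-stack `job_t job[MAX_CPU_NUMBER]` array. It assumes, without having checked this exact Jessie build, that each `job_t` holds `MAX_CPU_NUMBER × CACHE_LINE_SIZE × DIVIDE_RATE` 8-byte `BLASLONG` words, with `CACHE_LINE_SIZE = 8` and `DIVIDE_RATE = 2`. `GetrfStackEstimate` and its members are hypothetical names. The point is the quadratic growth in `MAX_CPU_NUMBER`, not the exact byte count.

```scala
// Rough size of the `job_t job[MAX_CPU_NUMBER]` array that getrf_parallel keeps on
// the stack. ASSUMED layout (verify against the installed OpenBLAS sources):
// job_t = BLASLONG working[MAX_CPU_NUMBER][CACHE_LINE_SIZE * DIVIDE_RATE].
object GetrfStackEstimate {
  val BlasLongBytes = 8L // 64-bit BLASLONG
  val CacheLineSize = 8L // assumed CACHE_LINE_SIZE
  val DivideRate    = 2L // assumed DIVIDE_RATE

  /** Bytes in one job_t for a build with the given MAX_CPU_NUMBER. */
  def jobTBytes(maxCpuNumber: Long): Long =
    maxCpuNumber * CacheLineSize * DivideRate * BlasLongBytes

  /** Bytes of the whole job_t array; grows with MAX_CPU_NUMBER squared. */
  def frameBytes(maxCpuNumber: Long): Long =
    maxCpuNumber * jobTBytes(maxCpuNumber)
}
```

Under these assumptions `GetrfStackEstimate.frameBytes(64)` is 524288 bytes (512 KiB) per getrf_parallel frame, half of a 1 MB Java thread stack, while `frameBytes(16)` is 32768 bytes, so a lower limit such as 16 shrinks the frame 16-fold.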

Details:
JVM: OpenJDK 8
Scala: 2.11.8
Breeze: 0.12

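Code:

```scala
import breeze.linalg._

// Solve A x = b for a random 100x100 system; per the gdb trace above,
// the crash happens inside OpenBLAS's getrf_parallel
val A: DenseMatrix[Double] = DenseMatrix.rand(100, 100)
val b = DenseVector.fill(100, 1.0)
val res = A \ b
```
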
@brada4
Contributor

brada4 commented Feb 4, 2017

You mentioned 4 products. Can you clarify which were installed from Debian repositories and which you unpacked yourself?
BTW, the latest Jessie OpenBLAS is 0.2.12-1.

Debian builds a pthreads OpenBLAS and patches out the thread-safety warning (oops).
You can set OPENBLAS_NUM_THREADS=1 for your program (unless you are certain that all Scala JNI BLAS calls are serialized in Java, as thread-unsafe libraries require).
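`OPENBLAS_NUM_THREADS` is usually just exported in the shell that starts the program. If the JVM is launched from Scala code instead, a minimal sketch could set it on a `java.lang.ProcessBuilder`, since a running JVM has no API to change its own environment. `SingleThreadedLauncher` and `app.jar` are hypothetical placeholders.

```scala
// Builds (but does not start) a child JVM whose environment pins OpenBLAS to one
// thread. The variable must be present before OpenBLAS initializes in the child.
object SingleThreadedLauncher {
  def builder(command: Seq[String]): ProcessBuilder = {
    val pb = new ProcessBuilder(command: _*)
    pb.environment().put("OPENBLAS_NUM_THREADS", "1") // inherited env plus this override
    pb.inheritIO()                                    // forward the child's stdin/stdout/stderr
    pb
  }
}
```

`SingleThreadedLauncher.builder(Seq("java", "-jar", "app.jar")).start().waitFor()` would then run the child and wait for it to exit.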

One fix is to build your own OpenBLAS 0.2.10 with OpenMP or with no threading (no threading gives better control to Scala, which is smart enough to partition computations across many CPUs).

The other fix is to convince the Debian OpenBLAS maintainer to build a thread-safe, i.e. OpenMP, OpenBLAS, as Ubuntu does.

@martin-frbg
Collaborator

If you are prepared to build your own OpenBLAS, you could reduce the (probably arbitrary) MAX_CPU_NUMBER limit introduced by 5d33121 to something like 16 and see whether this is the only problem in your context. (I am not sure how big the overhead of malloc is across all supported architectures.) (Alternatively, wouldn't running java with -Xss4m or thereabouts take care of this as well, or is that no longer an option in JDK 8?)
If your Java code already runs multithreaded, it may be more efficient to use a single-threaded OpenBLAS or to limit OPENBLAS_NUM_THREADS to some small value.
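Besides `-Xss4m`, a Scala program can request a bigger stack for just the thread that calls into OpenBLAS, via the four-argument `java.lang.Thread(ThreadGroup, Runnable, String, long stackSize)` constructor. A minimal sketch follows. `BigStack.withStack` is a hypothetical helper; the requested size is only a hint that the `Thread` Javadoc allows a JVM to ignore, and a native stack overflow inside OpenBLAS still kills the whole JVM rather than throwing an exception.

```scala
// Runs `body` on a fresh thread that requests `stackBytes` of stack, then
// returns its result (or rethrows its exception) on the calling thread.
object BigStack {
  def withStack[T](stackBytes: Long)(body: => T): T = {
    // Plain vars are enough: Thread.join() makes the worker's writes visible here.
    var result: Option[T] = None
    var failure: Throwable = null

    val task = new Runnable {
      def run(): Unit =
        try { result = Some(body) } catch { case e: Throwable => failure = e }
    }

    // A null ThreadGroup lets the JVM pick the group (normally the caller's).
    val worker = new Thread(null, task, "big-stack-worker", stackBytes)
    worker.start()
    worker.join()

    if (failure != null) throw failure
    result.get
  }
}
```

With Breeze on the classpath, the repro would become something like `val res = BigStack.withStack(4L * 1024 * 1024) { A \ b }`.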

@pmkc
Author

pmkc commented Feb 6, 2017

I installed OpenBLAS and OpenJDK 8 from Jessie (backports?). I also installed OpenBLAS 0.2.19 from Sid and built my own 0.2.12 and 0.2.19 (all reproduced the crash with NUM_THREADS=64 and USE_OPENMP=0). Scala was downloaded from its website, and Breeze and Netlib-Java came from building Apache Spark 2.0.2 from source.

All of the above fixes/workarounds work. My objection is rather that something simple breaks badly out of the box. If the consensus of this thread is that OpenMP is the correct solution, I'd be happy to file an issue with Debian requesting that they build with it in future.

@brada4
Contributor

brada4 commented Feb 6, 2017

It should be fairly easy to check on Ubuntu that your Scala code runs without trouble out of the box.

@jeromerobert
Contributor

@martin-frbg
Collaborator

Thank you for checking. I suspect that even with 0.2.19 or develop, the current default threshold value of 80 (now set centrally in common.h) will exceed the default Java stack size of 1 MB, given that for OS_WINDOWS it is explicitly set to 32, citing the 1 MB stack limit on that platform. As I do not see how OpenBLAS could know at build time that it is going to be used from Java later, my gut feeling is that telling Java to increase its stack size is the way to go here(?).
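The trade-off being discussed is a build-time switch: `job_t` stays on the stack unless `MAX_CPU_NUMBER` exceeds the allocation threshold. The Scala model below is only a sketch of that rule, assuming a strict greater-than comparison and using the 80 (default) and 32 (`OS_WINDOWS`) values quoted above; `JobAllocation` and its members are hypothetical names, not OpenBLAS API.

```scala
// Model of the compile-time rule: heap-allocate job_t only when MAX_CPU_NUMBER
// is above the threshold (assumed strict ">"), otherwise keep it on the stack.
object JobAllocation {
  sealed trait Placement
  case object Stack extends Placement
  case object Heap  extends Placement

  val DefaultThreshold = 80 // default quoted in this thread (common.h)
  val WindowsThreshold = 32 // OS_WINDOWS value, chosen for its 1 MB stack

  def placement(maxCpuNumber: Int, threshold: Int): Placement =
    if (maxCpuNumber > threshold) Heap else Stack
}
```

If `MAX_CPU_NUMBER` equals Jessie's `NUM_THREADS = 64`, `placement(64, DefaultThreshold)` is `Stack` while `placement(64, WindowsThreshold)` is `Heap`, which is why lowering the threshold for all platforms would also cover this case.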

@brada4
Contributor

brada4 commented Feb 8, 2017

There is no way to detect the stack size at runtime, and there is no rlimit in Java either.
E.g. 30-something threads at 10 MB each would be quite a killer for a Raspberry Pi...

@martin-frbg
Collaborator

My point was just that I think running Java code that links to OpenBLAS as "java -Xss4m" is preferable to lowering the default threshold for all non-Windows platforms to 32 just in case someone might want to call it from Java later. (That is, unless someone can state with certainty that there is no longer any reason to prefer stack allocation of job_t nowadays, on all supported platforms.)
And there are probably more than enough ways to bake a raspberry pi(e).
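OpenBLAS itself cannot see the Java stack size, but an application can at least check whether its JVM was started with an explicit `-Xss` before calling into BLAS. Below is a minimal sketch using `java.lang.management.ManagementFactory.getRuntimeMXBean.getInputArguments`. `XssCheck` is a hypothetical name, and it only inspects JVM flags, so the absence of `-Xss` simply means the platform default (1 MB per thread on 64-bit Linux) applies.

```scala
import java.lang.management.ManagementFactory

// Looks for an explicit -Xss flag among the JVM's input arguments. It does not
// measure the real stack, and it ignores -XX:ThreadStackSize for simplicity.
object XssCheck {
  private val XssFlag = """-Xss(\d+)([kKmMgG]?)""".r

  /** Parses "-Xss4m", "-Xss512k", "-Xss1g" or "-Xss1048576" into bytes. */
  def parseXss(arg: String): Option[Long] = arg match {
    case XssFlag(digits, unit) =>
      val multiplier = unit.toLowerCase match {
        case "k" => 1024L
        case "m" => 1024L * 1024
        case "g" => 1024L * 1024 * 1024
        case _   => 1L // no suffix: plain bytes
      }
      Some(digits.toLong * multiplier)
    case _ => None
  }

  /** Bytes requested via -Xss for this JVM, if any (assumes the last flag wins). */
  def configuredStackBytes(): Option[Long] = {
    val args = ManagementFactory.getRuntimeMXBean.getInputArguments
    (0 until args.size).flatMap(i => parseXss(args.get(i))).lastOption
  }
}
```

For example, `XssCheck.configuredStackBytes().forall(_ < 4L * 1024 * 1024)` could be used to log a warning before the first BLAS call.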

4 participants