SegFault when using jpype with numpy.linalg #808

Open
AbdealiLoKo opened this issue Jul 21, 2020 · 11 comments
Labels
bug Unable to deliver desired behavior (crash, fail, untested)
notice Long standing JPype limitation (outside of our control)

Comments

@AbdealiLoKo

AbdealiLoKo commented Jul 21, 2020

I am currently using jnius, and was looking into jpype for some of my machine learning use cases to see how jpype behaves in general.
(Refer: https://gist.github.com/AbdealiJK/1dd5b7677435ba22f9ab3e26016bb3e7)

I found that the issue reported at numpy/numpy#15691 seems to occur with jpype and numpy too.
This is also an issue in pyjnius: kivy/pyjnius#490

Just posting it here - in case it helps

$ docker run --rm -it centos:7 /bin/bash
# yum install -y wget bzip2 which java-1.8.0-openjdk-devel
# wget https://repo.anaconda.com/miniconda/Miniconda3-4.7.12-Linux-x86_64.sh
# bash ./Miniconda3-4.7.12-Linux-x86_64.sh -b
# /root/miniconda3/bin/pip install jpype1==1.0.1 numpy==1.17.4
# /root/miniconda3/bin/python
>>> import jpype
>>> jpype.startJVM()
>>> import numpy as np
>>> tmp = np.linalg.inv(np.random.rand(24, 24))
Segmentation fault
@AbdealiLoKo AbdealiLoKo changed the title SegFault when using jpype with numpy SegFault when using jpype with numpy.linalg Jul 21, 2020
@Thrameos
Contributor

Thanks for the note. We had an issue with CentOS a while back, but it doesn't look like the one you are reporting. I will look it over and see if I can identify the source.

@Thrameos
Contributor

I replicated the issue, but it will be challenging to identify the reason for the crash. The crash occurs in libopenblasp running multithreaded code. The only linkage between these would be if JPype released an object twice, resulting in a bad object which gets picked up by BLAS. But if JPype were releasing objects twice, it would have destabilized other code such as the testbench.

Oddly, running the linalg call before the jpype startJVM allows it to work, so perhaps something in the threading of the JVM is messing up BLAS.

#0  0x00007f2eb6587125 in dgetrf_parallel () from /root/miniconda3/lib/python3.7/site-packages/numpy/core/../.libs/libopenblasp-r0-34a18dc3.3.7.so
#1  0x00007f2eb65872d7 in dgetrf_parallel () from /root/miniconda3/lib/python3.7/site-packages/numpy/core/../.libs/libopenblasp-r0-34a18dc3.3.7.so
#2  0x00007f2eb636dc7b in dgesv_ () from /root/miniconda3/lib/python3.7/site-packages/numpy/core/../.libs/libopenblasp-r0-34a18dc3.3.7.so
#3  0x00007f2eb3dc64ba in call_dgesv (params=0x7fff5c03b2b0) at numpy/linalg/umath_linalg.c.src:1567
#4  DOUBLE_inv (args=0x7f2ee57e2948, dimensions=<optimized out>, steps=<optimized out>, __NPY_UNUSED_TAGGEDfunc=<optimized out>)
    at numpy/linalg/umath_linalg.c.src:1712
#5  0x00007f2ee542ffae in PyUFunc_GeneralizedFunction (op=0x561550d66a10, kwds=<optimized out>, args=<optimized out>, ufunc=0x0)
    at numpy/core/src/umath/ufunc_object.c:3007
#6  PyUFunc_GenericFunction (ufunc=ufunc@entry=0x7f2eb420c650, args=args@entry=0x7f2efaf56e50, kwds=kwds@entry=0x7f2ee57fb0f0, op=op@entry=0x7fff5c03cca0)
    at numpy/core/src/umath/ufunc_object.c:3142
#7  0x00007f2ee54303de in ufunc_generic_call (ufunc=0x7f2eb420c650, args=0x7f2efaf56e50, kwds=0x7f2ee57fb0f0) at numpy/core/src/umath/ufunc_object.c:4724
#8  0x000056154f67b8fb in _PyObject_FastCallKeywords () at /tmp/build/80754af9/python_1565725737370/work/Objects/call.c:199
#9  0x000056154f6dfa8f in call_function (kwnames=0x7f2eb41f2fa0, oparg=<optimized out>, pp_stack=<synthetic pointer>)
    at /tmp/build/80754af9/python_1565725737370/work/Python/ceval.c:4619
#10 _PyEval_EvalFrameDefault () at /tmp/build/80754af9/python_1565725737370/work/Python/ceval.c:3139
#11 0x000056154f62456b in function_code_fastcall (globals=<optimized out>, nargs=1, args=<optimized out>, co=0x7f2eb41f49c0)
    at /tmp/build/80754af9/python_1565725737370/work/Objects/call.c:283
#12 _PyFunction_FastCallDict () at /tmp/build/80754af9/python_1565725737370/work/Objects/call.c:322
#13 0x00007f2ee52170bd in array_implement_array_function (__NPY_UNUSED_TAGGEDdummy=<optimized out>, positional_args=<optimized out>)
    at numpy/core/src/multiarray/arrayfunction_override.c:259
#14 0x000056154f6736e0 in _PyMethodDef_RawFastCallKeywords () at /tmp/build/80754af9/python_1565725737370/work/Objects/call.c:698
#15 0x000056154f673861 in _PyCFunction_FastCallKeywords (func=0x7f2ee57a70a0, args=args@entry=0x7f2eb1e511d8, nargs=nargs@entry=5, kwnames=kwnames@entry=0x0)

@Thrameos Thrameos added the bug Unable to deliver desired behavior (crash, fail, untested) label Jul 22, 2020
@Thrameos
Contributor

@AbdealiJK I noticed that if you call the inverse once before starting the JVM, calling it again after the JVM is started did not crash for me. Would that be an acceptable workaround for now?
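
A minimal sketch of the suggested ordering (illustrative only; the warm-up matrix size is arbitrary):

import numpy as np
import jpype

# Run one BLAS-backed call before the JVM exists, so OpenBLAS initializes
# its threading first.
np.linalg.inv(np.random.rand(24, 24))

jpype.startJVM()
# With the warm-up done, the same call after startJVM() no longer segfaults here.
tmp = np.linalg.inv(np.random.rand(24, 24))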

@AbdealiLoKo
Author

AbdealiLoKo commented Jul 24, 2020

Nice!
Yep, that works for me for now for scripts I'm running in RHEL7 environments 👍

@Thrameos
Contributor

I could add a piece of code like this to JPype for now to be called prior to starting the JVM. It would hurt my already poor load times, but it may be the only option until numpy has a fix.

try:
    # Warm up OpenBLAS's threaded routines before the JVM starts.
    import numpy
    numpy.linalg.inv([[1, 0], [0, 1]])
except ImportError:
    # numpy is optional; skip the warm-up when it is not installed.
    pass

@AbdealiLoKo
Author

I think users facing the issue can just do it in their own scripts.
It seems to be a very specific issue for pip-installed numpy on RHEL7 - and so I'm not sure if jpype should be handling such a specific case.

@Thrameos Thrameos added the notice Long standing JPype limitation (outside of our control) label Jul 24, 2020
@Thrameos
Contributor

Okay, I set this to notice until such time as it is addressed.

@pelson
Contributor

pelson commented Jul 27, 2020

Another workaround: only have a single OpenBLAS thread by setting the OMP_NUM_THREADS=1 environment variable.
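
For illustration, one way to apply this from within Python (the variable must be set before numpy, and therefore OpenBLAS, is loaded; setting it in the shell before launching Python works just as well):

import os

# Limit OpenBLAS to a single thread; must happen before numpy is imported.
os.environ['OMP_NUM_THREADS'] = '1'

import numpy as np
import jpype

jpype.startJVM()
tmp = np.linalg.inv(np.random.rand(24, 24))  # runs single-threaded, avoiding the crash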


I also did a little check to ensure that library shadowing in the linker wasn't the culprit. By that I mean: if two shared libraries have transitive dependencies on a common library, but one of them requires a newer version found by looking in its declared RPATH/RUNPATH, then if the library that doesn't have that requirement is loaded first, the second will not get the newer version when it loads, but will instead end up with the one that has already been loaded.

$ LD_DEBUG=libs /root/miniconda3/bin/python \
    -c "import jpype; jpype.startJVM(); import numpy as np; tmp = np.linalg.inv(np.random.rand(24, 24))" \
    2>&1 | grep "calling init" | cut -d ":" -f 3 > broken.txt


$ LD_DEBUG=libs /root/miniconda3/bin/python \
    -c "import numpy as np; tmp = np.linalg.inv(np.random.rand(24, 24)); import jpype; jpype.startJVM();" \
    2>&1 | grep "calling init" | cut -d ":" -f 3 > fine.txt
diff -u broken.txt fine.txt 
--- broken.txt	2020-07-27 16:08:56.112584675 +0000
+++ fine.txt	2020-07-27 16:09:21.340813671 +0000
@@ -4,21 +4,12 @@
  /lib64/librt.so.1
  /lib64/libutil.so.1
  /lib64/libdl.so.2
- /root/miniconda3/bin/../lib/libgcc_s.so.1
- /root/miniconda3/bin/../lib/libstdc++.so.6
- /root/miniconda3/lib/python3.7/site-packages/_jpype.cpython-37m-x86_64-linux-gnu.so
  /root/miniconda3/lib/python3.7/lib-dynload/_heapq.cpython-37m-x86_64-linux-gnu.so
- /root/miniconda3/lib/python3.7/lib-dynload/math.cpython-37m-x86_64-linux-gnu.so
- /root/miniconda3/lib/python3.7/lib-dynload/_datetime.cpython-37m-x86_64-linux-gnu.so
- /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.252.b09-2.el7_8.x86_64/jre/lib/amd64/server/libjvm.so
- /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.252.b09-2.el7_8.x86_64/jre/lib/amd64/libverify.so
- /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.252.b09-2.el7_8.x86_64/jre/lib/amd64/libjava.so
- /lib64/libnss_files.so.2
- /root/miniconda3/bin/../lib/libz.so.1
- /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.252.b09-2.el7_8.x86_64/jre/lib/amd64/libzip.so
  /root/miniconda3/lib/python3.7/site-packages/numpy/core/../.libs/libgfortran-ed201abd.so.3.0.0
  /root/miniconda3/lib/python3.7/site-packages/numpy/core/../.libs/libopenblasp-r0-34a18dc3.3.7.so
  /root/miniconda3/lib/python3.7/site-packages/numpy/core/_multiarray_umath.cpython-37m-x86_64-linux-gnu.so
+ /root/miniconda3/lib/python3.7/lib-dynload/math.cpython-37m-x86_64-linux-gnu.so
+ /root/miniconda3/lib/python3.7/lib-dynload/_datetime.cpython-37m-x86_64-linux-gnu.so
  /root/miniconda3/lib/python3.7/lib-dynload/_struct.cpython-37m-x86_64-linux-gnu.so
  /root/miniconda3/lib/python3.7/lib-dynload/_pickle.cpython-37m-x86_64-linux-gnu.so
  /root/miniconda3/lib/python3.7/site-packages/numpy/core/_multiarray_tests.cpython-37m-x86_64-linux-gnu.so
@@ -28,6 +19,7 @@
  /root/miniconda3/lib/python3.7/lib-dynload/select.cpython-37m-x86_64-linux-gnu.so
  /root/miniconda3/lib/python3.7/site-packages/numpy/linalg/lapack_lite.cpython-37m-x86_64-linux-gnu.so
  /root/miniconda3/lib/python3.7/site-packages/numpy/linalg/_umath_linalg.cpython-37m-x86_64-linux-gnu.so
+ /root/miniconda3/lib/python3.7/lib-dynload/../../libz.so.1
  /root/miniconda3/lib/python3.7/lib-dynload/zlib.cpython-37m-x86_64-linux-gnu.so
  /root/miniconda3/lib/python3.7/lib-dynload/_bz2.cpython-37m-x86_64-linux-gnu.so
  /root/miniconda3/lib/python3.7/lib-dynload/../../liblzma.so.5
@@ -51,3 +43,11 @@
  /root/miniconda3/lib/python3.7/site-packages/numpy/random/pcg64.cpython-37m-x86_64-linux-gnu.so
  /root/miniconda3/lib/python3.7/site-packages/numpy/random/sfc64.cpython-37m-x86_64-linux-gnu.so
  /root/miniconda3/lib/python3.7/site-packages/numpy/random/generator.cpython-37m-x86_64-linux-gnu.so
+ /root/miniconda3/bin/../lib/libgcc_s.so.1
+ /root/miniconda3/bin/../lib/libstdc++.so.6
+ /root/miniconda3/lib/python3.7/site-packages/_jpype.cpython-37m-x86_64-linux-gnu.so
+ /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.252.b09-2.el7_8.x86_64/jre/lib/amd64/server/libjvm.so
+ /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.252.b09-2.el7_8.x86_64/jre/lib/amd64/libverify.so
+ /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.252.b09-2.el7_8.x86_64/jre/lib/amd64/libjava.so
+ /lib64/libnss_files.so.2
+ /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.252.b09-2.el7_8.x86_64/jre/lib/amd64/libzip.so

The same libraries are used in both cases, so it doesn't appear to be a linker issue, as I had originally thought it might be given that we have libraries from 3 different sources (CentOS, Miniconda, & wheels (numpy & JPype)).

@pelson
Contributor

pelson commented Jan 6, 2021

This one keeps cropping up for a few users, so I thought I'd do a bit more digging. It turns out that what appears to be the same problem was found for Octave in https://savannah.gnu.org/bugs/?55395. I wonder if the stack size is being changed by both the JVM and OpenBLAS, hence the issue.

I also found OpenMathLib/OpenBLAS#246 which looks like it has a similar stack trace.

I think the next steps are to test this with a debug build of OpenBLAS to track down the exact culprit...

@Thrameos
Contributor

Thrameos commented Jan 6, 2021

Thanks for working on this. The bug on OpenBLAS seems really old, so it is unclear why it would be cropping up now, though perhaps it is not as resolved as stated. I thought it was more likely a problem of library shadowing as well. The other possibility is that the JVM is setting some flag in pthreads that is affecting OpenBLAS operation. If it is a stack size issue, then perhaps the JVM changed the stack allocation routine for threads. If that were the case, then the JVM setting for thread stack size should affect the bug; I can't recall what it is called. The other interaction is that if we hit a page limit of the stack and the grows-down routine gets called, then the JVM one rather than the glibc one is likely to get called.

As it stands, my feeling is that this is not really a JPype bug; rather, we are the victim of an OpenBLAS bug that only comes up with a certain compile option (like the number of threads). Is there an active issue in OpenBLAS for this?
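
A minimal sketch of how that hypothesis could be checked (the thread stack size option turns out to be the JVM's -Xss flag, used in the comment below; the 512k value here is only an illustrative small setting):

import jpype
import numpy as np

# Start the JVM with a deliberately small per-thread stack; if the stack-size
# hypothesis holds, this should make the OpenBLAS crash easier to trigger,
# while a larger value (e.g. '-Xss4M') should avoid it.
jpype.startJVM('-Xss512k')
tmp = np.linalg.inv(np.random.rand(24, 24))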

@pelson
Contributor

pelson commented Mar 18, 2021

A bit more circumstantial evidence/clues (in the hope that this can help down the line): The JVM stack size is important.

Setting the stack size to 2M solves the issue for smaller operations, while setting it to 4M meant I was not able to trigger the core dump for any size (up to what my machine's memory could handle):

import jpype
import numpy as np

# Raise the JVM's per-thread stack size to 4 MB before any threads are created.
jpype.startJVM('-Xss4M')
tmp = np.linalg.inv(np.random.rand(2400, 2400))
