SegFault when using jpype with numpy.linalg #808

Open
AbdealiLoKo opened this issue Jul 21, 2020 · 11 comments
Labels
bug Unable to deliver desired behavior (crash, fail, untested)
notice Long standing JPype limitation (outside of our control)

Comments

@AbdealiLoKo

AbdealiLoKo commented Jul 21, 2020

I am currently using jnius, and was looking into jpype for some of my machine learning use cases to see how jpype behaves in general.
(Refer: https://gist.github.com/AbdealiJK/1dd5b7677435ba22f9ab3e26016bb3e7)

I found that the issue reported at numpy/numpy#15691 seems to occur with jpype and numpy too.
This is also an issue in pyjnius: kivy/pyjnius#490

Just posting it here - in case it helps

$ docker run --rm -it centos:7 /bin/bash
# yum install -y wget bzip2 which java-1.8.0-openjdk-devel
# wget https://repo.anaconda.com/miniconda/Miniconda3-4.7.12-Linux-x86_64.sh
# bash ./Miniconda3-4.7.12-Linux-x86_64.sh -b
# /root/miniconda3/bin/pip install jpype1==1.0.1 numpy==1.17.4
# /root/miniconda3/bin/python
>>> import jpype
>>> jpype.startJVM()
>>> import numpy as np
>>> tmp = np.linalg.inv(np.random.rand(24, 24))
Segmentation fault
@AbdealiLoKo AbdealiLoKo changed the title SegFault when using jpype with numpy SegFault when using jpype with numpy.linalg Jul 21, 2020
@Thrameos
Contributor

Thanks for the note. We had an issue with CentOS a while back, but it doesn't look like the one you are reporting. I will look it over and see if I can identify the source.

@Thrameos
Contributor

I replicated the issue, but it will be challenging to identify the reason for the crash. The crash occurs in libopenblasp running multithreaded code. The only linkage between these would be if JPype released an object twice, resulting in a bad object which gets picked up by BLAS. But if JPype were releasing objects twice, it would have destabilized other code such as the testbench.

Oddly, running the linalg call before the jpype startJVM allows it to work, so perhaps something in the threading of the JVM is messing up BLAS.

#0  0x00007f2eb6587125 in dgetrf_parallel () from /root/miniconda3/lib/python3.7/site-packages/numpy/core/../.libs/libopenblasp-r0-34a18dc3.3.7.so
#1  0x00007f2eb65872d7 in dgetrf_parallel () from /root/miniconda3/lib/python3.7/site-packages/numpy/core/../.libs/libopenblasp-r0-34a18dc3.3.7.so
#2  0x00007f2eb636dc7b in dgesv_ () from /root/miniconda3/lib/python3.7/site-packages/numpy/core/../.libs/libopenblasp-r0-34a18dc3.3.7.so
#3  0x00007f2eb3dc64ba in call_dgesv (params=0x7fff5c03b2b0) at numpy/linalg/umath_linalg.c.src:1567
#4  DOUBLE_inv (args=0x7f2ee57e2948, dimensions=<optimized out>, steps=<optimized out>, __NPY_UNUSED_TAGGEDfunc=<optimized out>)
    at numpy/linalg/umath_linalg.c.src:1712
#5  0x00007f2ee542ffae in PyUFunc_GeneralizedFunction (op=0x561550d66a10, kwds=<optimized out>, args=<optimized out>, ufunc=0x0)
    at numpy/core/src/umath/ufunc_object.c:3007
#6  PyUFunc_GenericFunction (ufunc=ufunc@entry=0x7f2eb420c650, args=args@entry=0x7f2efaf56e50, kwds=kwds@entry=0x7f2ee57fb0f0, op=op@entry=0x7fff5c03cca0)
    at numpy/core/src/umath/ufunc_object.c:3142
#7  0x00007f2ee54303de in ufunc_generic_call (ufunc=0x7f2eb420c650, args=0x7f2efaf56e50, kwds=0x7f2ee57fb0f0) at numpy/core/src/umath/ufunc_object.c:4724
#8  0x000056154f67b8fb in _PyObject_FastCallKeywords () at /tmp/build/80754af9/python_1565725737370/work/Objects/call.c:199
#9  0x000056154f6dfa8f in call_function (kwnames=0x7f2eb41f2fa0, oparg=<optimized out>, pp_stack=<synthetic pointer>)
    at /tmp/build/80754af9/python_1565725737370/work/Python/ceval.c:4619
#10 _PyEval_EvalFrameDefault () at /tmp/build/80754af9/python_1565725737370/work/Python/ceval.c:3139
#11 0x000056154f62456b in function_code_fastcall (globals=<optimized out>, nargs=1, args=<optimized out>, co=0x7f2eb41f49c0)
    at /tmp/build/80754af9/python_1565725737370/work/Objects/call.c:283
#12 _PyFunction_FastCallDict () at /tmp/build/80754af9/python_1565725737370/work/Objects/call.c:322
#13 0x00007f2ee52170bd in array_implement_array_function (__NPY_UNUSED_TAGGEDdummy=<optimized out>, positional_args=<optimized out>)
    at numpy/core/src/multiarray/arrayfunction_override.c:259
#14 0x000056154f6736e0 in _PyMethodDef_RawFastCallKeywords () at /tmp/build/80754af9/python_1565725737370/work/Objects/call.c:698
#15 0x000056154f673861 in _PyCFunction_FastCallKeywords (func=0x7f2ee57a70a0, args=args@entry=0x7f2eb1e511d8, nargs=nargs@entry=5, kwnames=kwnames@entry=0x0)

@Thrameos Thrameos added the bug Unable to deliver desired behavior (crash, fail, untested) label Jul 22, 2020
@Thrameos
Contributor

@AbdealiJK I noticed that if you call the inverse once before starting the JVM, calling it again after the JVM is started did not crash for me. Would that be an acceptable workaround for now?
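
A minimal sketch of the suggested ordering (illustrative only; the warm-up matrix size is arbitrary):

import numpy as np
import jpype

# Run one BLAS-backed call before the JVM exists, so OpenBLAS initializes
# its threading first.
np.linalg.inv(np.random.rand(24, 24))

jpype.startJVM()
# With the warm-up done, the same call after startJVM() no longer segfaults here.
tmp = np.linalg.inv(np.random.rand(24, 24))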

@AbdealiLoKo
Author

AbdealiLoKo commented Jul 24, 2020

Nice!
Yep, that works for me for now for scripts I'm running in RHEL7 environments 👍

@Thrameos
Contributor

I could add a piece of code like this to JPype for now to be called prior to starting the JVM. It would hurt my already poor load times, but it may be the only option until numpy has a fix.

try:
    # Warm up OpenBLAS's threaded routines before the JVM starts.
    import numpy
    numpy.linalg.inv([[1, 0], [0, 1]])
except ImportError:
    # numpy is optional; skip the warm-up when it is not installed.
    pass

@AbdealiLoKo
Author

I think users facing the issue can just do it in their own scripts.
It seems to be a very specific issue for pip-installed numpy on RHEL7 - and so I'm not sure if jpype should be handling such a specific case.

@Thrameos Thrameos added the notice Long standing JPype limitation (outside of our control) label Jul 24, 2020
@Thrameos
Contributor

Okay, I set this to notice until such time as it is addressed.

@pelson
Contributor

pelson commented Jul 27, 2020

Another workaround: only have a single OpenBLAS thread by setting the OMP_NUM_THREADS=1 environment variable.
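
For illustration, one way to apply this from within Python (the variable must be set before numpy, and therefore OpenBLAS, is loaded; setting it in the shell before launching Python works just as well):

import os

# Limit OpenBLAS to a single thread; must happen before numpy is imported.
os.environ['OMP_NUM_THREADS'] = '1'

import numpy as np
import jpype

jpype.startJVM()
tmp = np.linalg.inv(np.random.rand(24, 24))  # runs single-threaded, avoiding the crash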


I also did a little check to ensure that library shadowing in the linker wasn't the culprit. By that I mean: if two shared libraries have transitive dependencies on a common library, but one of them requires a newer version found by looking in its declared RPATH/RUNPATH, then if the library that doesn't have that requirement is loaded first, the second will not get the newer version when it loads, but will instead end up with the one that has already been loaded.

$ LD_DEBUG=libs /root/miniconda3/bin/python \
    -c "import jpype; jpype.startJVM(); import numpy as np; tmp = np.linalg.inv(np.random.rand(24, 24))" \
    2>&1 | grep "calling init" | cut -d ":" -f 3 > broken.txt


$ LD_DEBUG=libs /root/miniconda3/bin/python \
    -c "import numpy as np; tmp = np.linalg.inv(np.random.rand(24, 24)); import jpype; jpype.startJVM();" \
    2>&1 | grep "calling init" | cut -d ":" -f 3 > fine.txt
diff -u broken.txt fine.txt 
--- broken.txt	2020-07-27 16:08:56.112584675 +0000
+++ fine.txt	2020-07-27 16:09:21.340813671 +0000
@@ -4,21 +4,12 @@
  /lib64/librt.so.1
  /lib64/libutil.so.1
  /lib64/libdl.so.2
- /root/miniconda3/bin/../lib/libgcc_s.so.1
- /root/miniconda3/bin/../lib/libstdc++.so.6
- /root/miniconda3/lib/python3.7/site-packages/_jpype.cpython-37m-x86_64-linux-gnu.so
  /root/miniconda3/lib/python3.7/lib-dynload/_heapq.cpython-37m-x86_64-linux-gnu.so
- /root/miniconda3/lib/python3.7/lib-dynload/math.cpython-37m-x86_64-linux-gnu.so
- /root/miniconda3/lib/python3.7/lib-dynload/_datetime.cpython-37m-x86_64-linux-gnu.so
- /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.252.b09-2.el7_8.x86_64/jre/lib/amd64/server/libjvm.so
- /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.252.b09-2.el7_8.x86_64/jre/lib/amd64/libverify.so
- /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.252.b09-2.el7_8.x86_64/jre/lib/amd64/libjava.so
- /lib64/libnss_files.so.2
- /root/miniconda3/bin/../lib/libz.so.1
- /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.252.b09-2.el7_8.x86_64/jre/lib/amd64/libzip.so
  /root/miniconda3/lib/python3.7/site-packages/numpy/core/../.libs/libgfortran-ed201abd.so.3.0.0
  /root/miniconda3/lib/python3.7/site-packages/numpy/core/../.libs/libopenblasp-r0-34a18dc3.3.7.so
  /root/miniconda3/lib/python3.7/site-packages/numpy/core/_multiarray_umath.cpython-37m-x86_64-linux-gnu.so
+ /root/miniconda3/lib/python3.7/lib-dynload/math.cpython-37m-x86_64-linux-gnu.so
+ /root/miniconda3/lib/python3.7/lib-dynload/_datetime.cpython-37m-x86_64-linux-gnu.so
  /root/miniconda3/lib/python3.7/lib-dynload/_struct.cpython-37m-x86_64-linux-gnu.so
  /root/miniconda3/lib/python3.7/lib-dynload/_pickle.cpython-37m-x86_64-linux-gnu.so
  /root/miniconda3/lib/python3.7/site-packages/numpy/core/_multiarray_tests.cpython-37m-x86_64-linux-gnu.so
@@ -28,6 +19,7 @@
  /root/miniconda3/lib/python3.7/lib-dynload/select.cpython-37m-x86_64-linux-gnu.so
  /root/miniconda3/lib/python3.7/site-packages/numpy/linalg/lapack_lite.cpython-37m-x86_64-linux-gnu.so
  /root/miniconda3/lib/python3.7/site-packages/numpy/linalg/_umath_linalg.cpython-37m-x86_64-linux-gnu.so
+ /root/miniconda3/lib/python3.7/lib-dynload/../../libz.so.1
  /root/miniconda3/lib/python3.7/lib-dynload/zlib.cpython-37m-x86_64-linux-gnu.so
  /root/miniconda3/lib/python3.7/lib-dynload/_bz2.cpython-37m-x86_64-linux-gnu.so
  /root/miniconda3/lib/python3.7/lib-dynload/../../liblzma.so.5
@@ -51,3 +43,11 @@
  /root/miniconda3/lib/python3.7/site-packages/numpy/random/pcg64.cpython-37m-x86_64-linux-gnu.so
  /root/miniconda3/lib/python3.7/site-packages/numpy/random/sfc64.cpython-37m-x86_64-linux-gnu.so
  /root/miniconda3/lib/python3.7/site-packages/numpy/random/generator.cpython-37m-x86_64-linux-gnu.so
+ /root/miniconda3/bin/../lib/libgcc_s.so.1
+ /root/miniconda3/bin/../lib/libstdc++.so.6
+ /root/miniconda3/lib/python3.7/site-packages/_jpype.cpython-37m-x86_64-linux-gnu.so
+ /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.252.b09-2.el7_8.x86_64/jre/lib/amd64/server/libjvm.so
+ /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.252.b09-2.el7_8.x86_64/jre/lib/amd64/libverify.so
+ /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.252.b09-2.el7_8.x86_64/jre/lib/amd64/libjava.so
+ /lib64/libnss_files.so.2
+ /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.252.b09-2.el7_8.x86_64/jre/lib/amd64/libzip.so

The same libraries are used in both cases, so it doesn't appear to be a linker issue, as I had originally thought it might be given that we have libraries from 3 different sources (CentOS, Miniconda, & wheels (numpy & JPype)).

@pelson
Contributor

pelson commented Jan 6, 2021

This one keeps cropping up for a few users, so I thought I'd do a bit more digging. It turns out that what appears to be the same problem was found for Octave in https://savannah.gnu.org/bugs/?55395. I wonder if the stack size is being changed by both the JVM and OpenBLAS, hence the issue.

I also found OpenMathLib/OpenBLAS#246 which looks like it has a similar stack trace.

I think the next steps are to test this with a debug build of OpenBLAS to track down the exact culprit...

@Thrameos
Contributor

Thrameos commented Jan 6, 2021

Thanks for working on this. The bug on OpenBLAS seems really old, so it is unclear why it would be cropping up now, though perhaps it is not as resolved as stated. I thought it was more likely a problem of library shadowing as well. The other possibility is that the JVM is setting some flag in pthreads that is affecting OpenBLAS operation. If it is a stack size issue, then perhaps the JVM changed the stack allocation routine for threads. If that were the case, then the JVM setting for thread stack size should affect the bug; I can't recall what it is called. The other interaction is that if we hit a page limit of the stack and the grows-down routine gets called, then the JVM one rather than the glibc one is likely to get called.

As it stands, my feeling is that this is not really a JPype bug; rather, we are the victim of an OpenBLAS bug that only comes up with a certain compile option (like the number of threads). Is there an active issue in OpenBLAS for this?
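
A minimal sketch of how that hypothesis could be checked (the thread stack size option turns out to be the JVM's -Xss flag, used in the comment below; the 512k value here is only an illustrative small setting):

import jpype
import numpy as np

# Start the JVM with a deliberately small per-thread stack; if the stack-size
# hypothesis holds, this should make the OpenBLAS crash easier to trigger,
# while a larger value (e.g. '-Xss4M') should avoid it.
jpype.startJVM('-Xss512k')
tmp = np.linalg.inv(np.random.rand(24, 24))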

@pelson
Contributor

pelson commented Mar 18, 2021

A bit more circumstantial evidence/clues (in the hope that this can help down the line): The JVM stack size is important.

Setting the stack size to 2M solves the issue for smaller operations, while setting it to 4M meant I was not able to trigger the core dump for any size (up to what my machine's memory could handle):

import jpype
import numpy as np

# Raise the JVM's per-thread stack size to 4 MB before any threads are created.
jpype.startJVM('-Xss4M')
tmp = np.linalg.inv(np.random.rand(2400, 2400))
