SegFault when using jpype with numpy.linalg #808
Comments
Thanks for the note. We had an issue with CentOS a while back, but it doesn't look like the one you are reporting. I will look it over and see if I can identify the source.
I replicated the issue, but it will be challenging to identify the reason for the crash. The crash occurs in libopenblasp running multithreaded code. The only linkage between these would be if JPype released an object twice, resulting in a bad object which gets picked up by BLAS. But if JPype were releasing objects twice, it would have destabilized other code such as the testbench. Oddly, running the linalg call before jpype startJVM allows it to work, so perhaps something in the threading of the JVM is messing up BLAS.
@AbdealiJK I noticed that if you call the inverse once before starting the JVM, you can call it again after the JVM is started and it did not crash for me. Would that be an acceptable workaround for now?
Nice!
I could add a piece of code like the sketch below to JPype for now, to be called prior to starting the JVM. It would hurt my already poor load times, but it may be the only option until numpy has a fix.
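A minimal sketch of that warm-up idea, assuming the essential step is simply triggering an OpenBLAS-backed routine (here numpy.linalg.inv on an arbitrary matrix) before the JVM starts; the matrix size is an illustrative choice, not something taken from the thread:

```python
# Hypothetical warm-up sketch: exercise OpenBLAS once before the JVM starts.
# The 512x512 size is arbitrary; it just needs to hit the BLAS-backed path.
import numpy as np
import jpype

_ = np.linalg.inv(np.random.rand(512, 512))  # warm up OpenBLAS first

jpype.startJVM()  # then start the JVM as usual
```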
I think users facing the issue can just do it in their own scripts.
Okay, I set this as a notice until such time as it is addressed.
Another workaround: only have a single OpenBLAS thread by setting an environment variable.

I also did a little check to ensure that library shadowing in the linker wasn't the culprit. By that I mean: if two shared libraries have transitive dependencies on a common library, but one of them requires a newer version found by looking in the declared RPATH/RUNPATH, then if the library without that requirement is loaded first, the second one will not get the newer version when it is triggered, but will instead end up with the one that has already been loaded. There are no different libraries being used, so it doesn't appear to be a linker issue as I had originally thought it might be, given that we have libraries from three different sources (CentOS, Miniconda, and wheels for numpy and JPype).
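For reference, a sketch of that single-thread workaround; it assumes the environment variable meant here is OPENBLAS_NUM_THREADS, which OpenBLAS reads when the library is first loaded, so it has to be set before numpy is imported:

```python
# Sketch of the single-thread workaround (assumes OPENBLAS_NUM_THREADS is the
# variable intended above). It must be set before numpy/OpenBLAS is loaded.
import os
os.environ["OPENBLAS_NUM_THREADS"] = "1"

import numpy as np
import jpype

jpype.startJVM()
# Single-threaded BLAS avoids the crash per the workaround described above.
print(np.linalg.inv(np.random.rand(1000, 1000)).shape)
```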
This one keeps cropping up for a few users, so I thought I'd do a bit more digging. It turns out that what appears to be the same problem was found for Octave in https://savannah.gnu.org/bugs/?55395. I wonder if the stack size is being changed by both the JVM and OpenBLAS, and hence the issue. I also found OpenMathLib/OpenBLAS#246, which looks like it has a similar stack trace. I think the next steps are to test this with a debug build of OpenBLAS to track down the exact culprit...
Thanks for working on this. The bug in OpenBLAS seems really old, so it is unclear why it would be cropping up now; perhaps it is not as resolved as stated. I thought it was more likely a problem of library shadowing as well. The other possibility is that the JVM is setting some flag in pthreads that is affecting OpenBLAS operation. If it is a stack size issue, then perhaps the JVM changed the stack allocation routine for threads. If that were the case, then the JVM setting for thread stack size should affect the bug; I can't recall what it is called. The other possible interaction is that if we hit the page limit of the stack and the grow-down routine gets called, then the JVM's routine rather than glibc's is likely to be invoked. As it stands, my feeling is that this is not really a JPype bug; rather, we are the victim of an OpenBLAS bug that only comes up with a certain compiler option (like the number of threads). Is there an active issue in OpenBLAS for this?
A bit more circumstantial evidence/clues (in the hope that this can help down the line): the JVM stack size is important. Setting the stack size to 2M solves the issue for smaller operations, while setting it to 4M resulted in me not being able to trigger the core dump for any size (up to what my machine's memory could handle).
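A sketch of how that observation can be applied from the JPype side, assuming the stack size being varied is the JVM thread stack size controlled by the standard -Xss option; the 4M value mirrors the setting that avoided the core dump above, and the matrix size is illustrative:

```python
# Sketch: start the JVM with a larger thread stack via the standard -Xss
# option. 4M mirrors the value reported above as avoiding the core dump.
import jpype
import numpy as np

jpype.startJVM("-Xss4M")

a = np.random.rand(2000, 2000)   # size chosen only for illustration
print(np.linalg.inv(a).shape)
```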
I am currently using jnius, and was looking into jpype for some of my machine learning use cases to see how jpype behaves in general.
(Refer: https://gist.github.com/AbdealiJK/1dd5b7677435ba22f9ab3e26016bb3e7)
I found that the issue reported at numpy/numpy#15691 seems to occur with jpype and numpy too.
This is also an issue in pyjnius: kivy/pyjnius#490
Just posting it here in case it helps.
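The linked gist contains the full script; a minimal sketch of the failure mode discussed in this thread (the matrix size and default JVM arguments are illustrative assumptions) looks roughly like this:

```python
# Rough sketch of the reported failure: start the JVM first, then call a
# threaded OpenBLAS routine through numpy. On affected setups this segfaults
# inside libopenblas; the full reproduction is in the linked gist.
import jpype
import numpy as np

jpype.startJVM()

a = np.random.rand(4000, 4000)
np.linalg.inv(a)  # crash reported around here on affected builds
```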