SuiteSparse segfaults on PPC64le #20123
We could attempt upgrading, which I haven't touched because I'm not sure how the better-late-than-never official upstream support for building shared libraries will interact with our home grown way of doing it. |
The test suite definitely passed on power at some point. |
I definitely remember it passing in late September. |
Interesting. It's the particular matrix that we're feeding in on ppc64le. My guess is that our initialization of
The serialized |
I definitely remember the tests passing too. Perhaps this is due to a change in compilers or something? |
Yes, I think this is indeed a compiler change. I've managed to narrow down that compiling SuiteSparse with I'm compiling with GCC 6.3.0, built from source. |
In general, I am more comfortable using |
In general I don't think it's recommended to build software with |
This sounds similar to a gcc bug that we found when using musl libc on Alpine Linux a while back: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71505. That was causing an internal compiler error, which I used creduce to shrink down from all of cholmod to
int a, b;
double c, d, e;
double *f;
void fn1() {
  for (;;) {
    e = f[a];
    d = f[a] = f[b];
    c = f[b + 1] = c;
    f[a] = f[a + 1] = d;
    f[b] = f[b + 1] = c;
  }
}
and it turned out to be due to different qsort behavior causing some out-of-bounds accesses in the gcc tree vectorizer. Does it work if you compile with clang at |
I haven't tried I think using |
don't waste the CI time, it's going to fail until I backport a set of fixes to get travis and appveyor working again on that branch |
I meant don't open a separate PR against release-0.5; if there aren't conflicts then marking it backport pending is fine. But anyway. Did you have a docker container that was able to do powerpc cross-compilation and run executables in qemu from an x86_64 host, or am I imagining things? |
I recently added in cross-compilation in .travis.yml for openlibm. That's not super helpful though, I suggest using something like the For example, I'm attempting to compile Julia for
Note that you will need to do things like install |
I'm not asking for the sake of building Julia, but for reproducing the gcc bug. |
Not sure what the state of this is but the SuiteSparse test passed for me on Power:
|
If you translate this example code #20123 (comment) into its underlying ccall, does it still segfault? Is there a way to reproduce this via qemu on a non-Power system? |
Linking the ccall-based MWE here as well. Yes, it does still segfault. In my experiments with |
And if you compile
#include <stdint.h>
int64_t umfpack_zl_symbolic(int64_t, int64_t, int64_t*, int64_t*, double*, double*, void**, double*, double*);
int64_t umfpack_zl_numeric(int64_t*, int64_t*, double*, double*, void*, void**, double*, double*);
int64_t colptr[] = {0,17,33,49,66,82,95,111,129,145,158,175,190,202,214,227,246,261,274,290,306};
int64_t rowval[] = {0,1,3,4,5,6,7,8,10,11,12,14,15,16,17,18,19,0,1,3,6,7,8,9,10,12,13,14,15,16,17,18,19,0,2,3,4,5,6,9,10,11,12,14,15,16,17,18,19,0,1,2,3,4,5,6,7,9,10,11,12,13,14,16,18,19,1,2,3,4,5,6,7,9,10,11,12,13,14,15,16,19,0,4,5,6,7,8,9,10,13,14,16,17,18,0,1,2,3,4,5,6,7,10,11,12,14,15,17,18,19,0,1,2,4,5,6,7,8,9,10,12,13,14,15,16,17,18,19,0,1,2,3,4,5,7,8,9,10,11,12,15,16,18,19,0,2,4,7,8,10,11,12,15,16,17,18,19,0,1,2,3,4,6,8,9,10,11,12,13,14,16,17,18,19,0,1,2,3,4,5,6,7,8,10,11,13,15,17,19,0,4,5,7,9,10,12,13,14,15,16,18,2,5,6,7,8,9,10,13,14,15,18,19,0,2,3,5,7,8,9,11,12,13,14,15,16,0,1,2,3,4,5,6,7,8,9,10,11,13,14,15,16,17,18,19,0,1,3,4,5,6,7,8,10,11,12,14,15,16,18,0,1,4,5,6,7,8,10,11,12,14,15,17,0,1,2,3,4,5,6,7,8,9,10,11,13,16,17,19,0,1,2,3,4,6,8,10,12,13,14,15,16,17,18,19};
double realval[] = {0.0,0.337806,0.0,-1.63287,0.0,-0.44613,0.19799,1.14926,0.298043,1.02148,0.0,0.131325,0.0,0.206393,-0.311376,2.58573,0.186837,1.38221,-0.393738,-2.00667,0.20917,0.0,0.792833,-0.97752,0.0,0.0,0.0,0.853601,-0.595753,-1.08044,0.379664,0.0,1.0999,1.21105,0.37784,-0.986784,-0.562703,0.163778,-0.0126376,-0.170773,0.71282,1.40237,1.0462,0.406037,-0.554043,0.0,-1.92096,0.0,0.0,0.939853,-0.213045,-0.511759,0.0,0.0,0.0,-0.227391,0.0,-1.0151,0.0,-0.477085,-0.32659,0.0,0.469942,1.64964,0.0,0.0,1.78145,-0.763605,0.0,0.0,0.0131958,0.0,-0.162672,-2.70516,-0.122919,-2.00008,-0.698962,1.09993,-1.7705,-2.00725,0.0,-0.998107,0.00869002,-0.163939,0.625094,0.0725784,0.0,-0.838007,-0.442824,0.41261,1.26817,0.69029,-1.41511,0.0187943,0.0,-0.468559,1.61116,0.0,-0.727772,0.523392,0.0,0.0,0.0,0.356031,-0.4948,0.202831,0.0,-1.39671,-0.540818,1.17547,1.2347,1.43989,0.0,0.138636,0.0,-1.14957,0.0245661,0.0,0.0,1.23577,1.28436,0.0,-1.62606,-0.780131,0.462515,0.0,-0.491959,0.0,0.0,-1.01862,0.198236,1.73314,1.29703,0.0,0.21179,0.744656,0.220431,-0.702916,0.749352,-0.742024,0.0,1.38957,0.0,0.0,-0.224411,0.0,0.0,-1.02532,0.0,0.0,0.0,0.966394,0.0367359,0.0,-0.796899,0.0611722,0.116889,0.366088,0.0,1.06849,0.0,0.0,-1.16406,1.55251,-0.362438,1.36879,-1.07224,-1.23966,0.0,0.461177,0.795476,3.06289,0.818629,0.857781,0.0,1.23709,0.0,-1.2936,-0.411399,0.279837,0.182068,0.0,0.0,0.0,0.198476,-0.0630913,-0.103114,0.71032,0.0,1.29788,0.0,1.07622,1.75651,0.458845,0.0,0.0475068,0.0,-0.25331,0.521942,0.0,-0.107332,0.0,-0.495542,0.0,-1.99779,0.769662,0.0,1.86222,-0.996834,0.0,0.0,0.0,0.654088,0.0,1.20758,0.0109424,-0.372857,-0.468379,0.40064,0.0,0.344445,0.448019,0.0,0.0,1.02937,-0.370604,-1.27289,0.0,0.0,-0.171546,0.0,-0.720716,-0.633376,0.0,-2.0201,2.09947,0.0,0.0,0.0,0.0,0.550492,-1.92742,0.645865,0.0197031,-0.657945,0.467725,-0.278258,0.0,1.14977,0.0,0.49476,0.0,0.876636,2.16184,0.0,1.29705,0.0,-1.35523,-0.663317,-1.22075,-0.0638587,0.666348,0.196576,-1.62458,-0.846266,0.633132,0.589679,0.0,-0.00264666,1.60007,0.782907,1.22297,-0.722215,-0.923931,0.0,0.0,0.349859,0.0,1.48932,0.0865408,0.0,-1.83066,-1.07821,0.0,-0.58498,0.0,0.471699,-0.664718,-0.150759,-0.284796,-0.386488,-1.25471,1.81056,-0.697832,0.0,0.0,2.0124,0.992998,0.260153,-1.39448,0.0,-0.723773,-0.145429,-0.93486,0.194438,0.0};
double imagval[] = {0.40613,0.0,-0.0846339,-0.13798,-1.02557,1.23125,0.027402,-2.03573,0.0,0.575466,1.31344,0.0,-0.107559,0.0,0.251055,0.0,0.736455,0.0,0.0,0.0,-0.318469,0.510821,-1.16124,-2.0675,-0.935379,-0.984719,-0.0427068,0.0,0.0,-1.04169,0.0,-0.721603,1.76978,0.0,0.0,-1.6339,0.171513,1.07942,-0.341863,-0.451416,0.0,-0.385623,-0.177979,0.0,-0.527409,0.274356,-1.40177,-0.92233,-0.897793,0.0,0.741349,0.12676,-0.14502,0.167712,-1.86289,0.580561,-0.0656835,0.580498,0.0601529,0.0,0.0,0.330769,0.0,0.0,0.244537,-0.876682,0.156848,0.0,1.27292,-2.54138,0.0,-0.235694,2.0349,0.148812,0.0,0.0,-0.871853,1.0887,0.0,-0.186493,0.352587,0.0,0.0,0.0,0.0,1.20885,-0.997256,0.0,-0.774319,-1.01799,-0.250655,0.0,0.581978,0.0,-1.99611,0.804366,0.0,1.83116,0.293086,-0.645082,0.246694,-1.06371,-1.71633,0.0,0.0,0.662097,-0.0167129,-0.0591024,0.0,0.0,0.0,0.0,-0.253807,0.756052,-0.237708,0.0,0.0,-1.41261,0.546773,-0.656529,0.0,-0.054959,1.00811,0.189775,0.0,-1.65752,0.0,-0.278691,-0.512486,0.0,-1.76222,-0.398593,-0.990992,2.09176,-0.524177,0.0,0.182589,0.0,-1.66172,0.914951,-0.801739,0.554185,0.72665,-2.421,0.926461,1.46664,0.956341,1.2396,-1.25235,1.21628,1.03836,-1.04291,-1.02315,-0.180262,-0.211879,0.0,0.0,0.0,-2.02002,0.0,0.301225,1.88696,0.0,1.11314,-0.663909,-0.812546,0.0,-1.01952,-0.261015,0.0,-1.65223,-0.209834,0.0,0.755999,-0.33215,0.368023,0.136506,0.0,-0.259282,0.0,2.31561,-0.560358,0.0308454,1.12144,-0.997661,-0.118676,-1.72015,0.24544,-1.45935,0.0,-0.900858,0.408868,-0.50157,-0.694633,0.728268,0.0,2.23729,0.297442,1.11827,0.605953,0.128544,0.75236,0.0,-0.177461,-0.943643,0.999748,-0.344966,0.0,0.0,-2.09372,-1.05075,-0.00600076,0.0,-1.54667,0.425878,-0.0684746,0.0,0.0,0.0,-1.14376,0.0,-0.15198,-1.11284,-0.779306,0.0,0.0,-1.33519,1.21656,-0.573046,0.0,1.78693,0.0,0.0,1.89433,0.243233,0.0,-0.0222436,-0.0812604,0.209523,-0.779886,0.0,0.878099,0.0,0.0,0.0,0.0,0.0,-0.709713,0.0,1.21249,0.0238692,0.86202,0.0,0.0,-1.10935,0.489313,-0.179003,-0.834192,0.0,-0.338043,0.0,0.115793,0.0,1.12293,0.0,-1.03264,0.330903,0.407553,1.27351,0.0,0.0,0.1353,0.0,0.0,0.41193,-1.08955,0.0,0.0454245,0.0,0.419015,-1.86116,0.0,0.0,-0.271129,-0.281632,0.663675,0.0,-1.00765,0.842544,0.0,-1.02119,0.0,-0.4281,1.29197,-0.610726,-0.969805,0.0862868,0.772603,0.0,-1.20073,0.832971,0.0,1.13026,0.0,-2.05935,-1.3191};
double umf_ctrl[] = {1.0,0.2,0.2,0.1,32.0,0.0,0.7,2.0,1.0,0.0,1.0,1.0,0.0,0.0,10.0,0.001,1.0,0.5,0.0,1.0};
double umf_info[] = {2.32446e-310,2.32446e-310,2.32446e-310,2.32446e-310,2.32446e-310,2.32446e-310,2.32446e-310,2.32446e-310,2.32446e-310,2.32446e-310,2.32446e-310,2.32446e-310,2.32446e-310,2.32446e-310,2.32446e-310,2.32446e-310,2.32446e-310,2.32446e-310,2.32446e-310,2.32446e-310,2.32446e-310,2.32446e-310,2.32446e-310,2.32446e-310,2.32446e-310,2.32446e-310,2.32446e-310,2.32446e-310,2.32446e-310,2.32446e-310,2.32446e-310,2.32446e-310,2.32446e-310,2.32446e-310,2.32446e-310,2.32446e-310,2.32446e-310,2.32446e-310,2.32446e-310,2.32446e-310,2.32446e-310,2.32446e-310,2.32446e-310,2.32446e-310,2.32446e-310,2.32446e-310,2.32446e-310,2.32446e-310,2.32446e-310,2.32446e-310,2.32446e-310,2.32446e-310,2.32446e-310,2.32446e-310,2.32446e-310,2.32446e-310,2.32446e-310,2.32446e-310,2.32446e-310,2.32446e-310,2.32446e-310,2.32446e-310,2.32446e-310,2.32446e-310,2.32446e-310,2.32446e-310,2.32446e-310,2.32446e-310,2.32446e-310,2.32446e-310,2.32446e-310,2.32446e-310,2.32446e-310,2.32446e-310,2.32446e-310,2.32446e-310,2.32446e-310,2.32446e-310,2.32446e-310,2.32446e-310,2.32446e-310,2.32446e-310,2.32446e-310,2.32446e-310,2.32446e-310,2.32446e-310,2.32446e-310,2.32446e-310,2.32446e-310,0.0};
void* tmp;
void* symbolic;
int main() {
umfpack_zl_symbolic(20,20,colptr,rowval,realval,imagval,&tmp,umf_ctrl,umf_info);
symbolic = tmp;
return umfpack_zl_numeric(colptr,rowval,realval,imagval,symbolic,&tmp,umf_ctrl,umf_info);
}
linked against Julia's copy of umfpack, does that also segfault for binaries built before the patch, and run successfully after? Setting |
Yes, it does. Nice job shrinking that down.
I've updated the docker container to build this C library as well (and run it by default) and added your C code to the gist. |
Now can we take the compiler invocation that builds libumfpack and use |
Bump. You still owe me a gcc bug report. |
I spent some time today trying to do this, but it's hard to disentangle the rest of libsuitesparse from this invocation (The calls we're using sub out to things in
I get that you don't like leaving loose ends lying around, and that this PR was pushed in against your wishes, but I've been pretty vocal about my time constraints and how much time/effort I'm willing to put into trying to get this communicated upstream to the |
Does this happen when linked to system blas with 32-bit ints? That would remove the need to deal with openblas when reproducing this. If Viral or anyone else who cares about this architecture wants to make sure the patch doesn't get removed, we'll need to reduce it as much as we can. |
I managed to get all of |
With luck it won't have to use any of the |
Looking at our makefiles, we actually don't use ILP64 on Power by default. Maybe we should change that at some point. If I have suitesparse built statically into the test executable instead of linking it as a shared library, qemu is capable of reproducing the problem from amd64. But it only happens using openblas (either the debian ppc64el package, or out of a Julia binary), not debian's ppc64el reference blas package. I wonder if maybe openblas could have a memory corruption bug in its power kernels, and changing the optimization level of umfpack results in a different memory access pattern or something? |
I took your reduced case and started another run based off of it on the ppc64le machine. |
Agreed, I'll open an issue about that. And here's the PR. |
Also worth testing this against the latest development branch of openblas, given how long it's been since their last release. |
creduce, which has been running all day, has gone down from 8M LOC to 25K LOC. We're getting close. |
Creduce was a little too zealous and reduced down to the simplest possible segfault; it just calls an uninitialized function pointer. |
It's best to compile and run (under a timeout) both working and broken copies in the test script. My test script had a bug where I typoed the segfaulting version so it reduced to an empty main, whoops. |
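The run-both-copies check described above can be sketched in C roughly as follows. This is only an illustration, not the script anyone in this thread actually used: the names `run_with_timeout` and `is_interesting` are hypothetical, it assumes a POSIX system, and it assumes the two binaries (one built at the failing optimization level, one at the working level) have already been compiled, for instance by calling the same helper with a compiler command line.

```c
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

enum run_result { RUN_OK, RUN_FAILED, RUN_SEGV, RUN_TIMEOUT, RUN_ERROR };

/* Run argv[0] (looked up in PATH) with a wall-clock limit and classify
 * how it ended. Hypothetical helper, not part of creduce. */
enum run_result run_with_timeout(char *const argv[], unsigned seconds)
{
    assert(argv != NULL && argv[0] != NULL);

    pid_t pid = fork();
    if (pid < 0)
        return RUN_ERROR;
    if (pid == 0) {
        /* A pending alarm survives exec, so SIGALRM kills a hung child. */
        alarm(seconds);
        execvp(argv[0], argv);
        _exit(127); /* exec itself failed */
    }

    int status;
    while (waitpid(pid, &status, 0) < 0) {
        if (errno != EINTR)
            return RUN_ERROR;
    }
    if (WIFSIGNALED(status)) {
        if (WTERMSIG(status) == SIGALRM)
            return RUN_TIMEOUT;
        if (WTERMSIG(status) == SIGSEGV)
            return RUN_SEGV;
        return RUN_FAILED;
    }
    return WEXITSTATUS(status) == 0 ? RUN_OK : RUN_FAILED;
}

/* A candidate is interesting only if the known-good build still exits
 * cleanly AND the suspect build still segfaults. Checking both sides
 * keeps the reducer from drifting to an empty main or to an unrelated
 * crash. */
int is_interesting(char *const broken[], char *const working[],
                   unsigned seconds)
{
    return run_with_timeout(working, seconds) == RUN_OK
        && run_with_timeout(broken, seconds) == RUN_SEGV;
}
```

A `main` wrapping this would return 0 when `is_interesting` is true and nonzero otherwise, since creduce treats exit status 0 as "still interesting". Running the program directly rather than through a shell matters here: a shell would typically report a crashed child as exit code 139 instead of passing the signal through.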
Yeah good idea with ensuring that the non-optimized version still runs, I'm rerunning it with two versions, and also not reducing |
I am running a new |
It turns out this was due to some misbehaving ppc64le kernels within OpenBLAS. Using the latest OpenBLAS head ( |
Great, thanks for testing that and sorry for being a bit of a jerk about it. If the fix can be reverse-bisected to a Power-specific commit that would apply cleanly to the last release, maybe we should swap patches. If not, we should drop the suitesparse patch when upgrading to the next openblas version. |
We haven't been running the testsuite on PPC64le so far, so when I started doing so on the next iteration of the buildbots, I noticed that we don't actually make it through the sparse/sparse tests:
Looks like it's pointing to this block of code as the culprit, but it doesn't trigger when I just run that piece of code manually. Reporting here in case anyone else wants to take a shot at this.