Pari segfault on Sage startup in Cygwin #11551

Closed
kcrisman opened this issue Jun 28, 2011 · 66 comments

Comments

@kcrisman
Member

In both Windows XP and Windows 7 it is now possible (again) to build Sage on Cygwin. However, Sage has a segmentation fault in Pari upon startup.

This happens while initializing the Pynac i (init_pynac_I in sage/symbolic/pynac.pyx), but the ultimate cause is that the mpfr number 1.00000000000 causes the segfault upon running the ._pari_() method. Suggestions as to why that would be - and a potential fix - are welcome.

CC: @dimpase @mwhansen @jdemeyer @jpflori

Component: porting: Cygwin

Keywords: pari

Reviewer: Karl-Dieter Crisman, Jean-Pierre Flori

Issue created by migration from https://trac.sagemath.org/ticket/11551

@kcrisman kcrisman added this to the sage-5.4 milestone Jun 28, 2011
@kcrisman
Member Author

Attachment: Parisegfault.PNG.gz

Screenshot of the problem

@kcrisman
Member Author

comment:1

I've attached a screenshot of the traceback - the best I can do in Cygwin with my limited experience.

@jdemeyer

comment:2

Can you please attach your sage/rings/real_mpfr.c?

Please do not report errors on non-released Sage versions (like sage-4.7.1.alpha4 in your case). Those versions can (and probably will) change slightly, which makes it harder to reproduce errors.

@dimpase
Member

dimpase commented Jun 28, 2011

comment:3

Replying to @kcrisman:

I've attached a screenshot of the traceback - the best I can do in Cygwin with my limited experience.

hmm, you should be able to just copy the thing with your mouse and paste...

perhaps by running in a better terminal window, such as mintty: http://code.google.com/p/mintty/

@kcrisman
Member Author

comment:4

Replying to @jdemeyer:

Can you please attach your sage/rings/real_mpfr.c?

I'll try - depends on whether my wifi will work. I'm not on that computer currently.


Please do not report errors on non-released Sage versions (like sage-4.7.1.alpha4 in your case). Those versions can (and probably will) change slightly, which makes it harder to reproduce errors.

Well, building on Cygwin is not exactly straightforward, and (at least for me) extremely time-consuming, so I wanted to make sure I had as bleeding-edge of code as possible to catch potential problems. I find it unlikely that patches or spkgs will currently be backed out just because they break Cygwin, though if that is not true, that would make this job much easier and I would be very grateful.

Luckily, Mike Hansen already had this error (almost assuredly the same one) in 4.7.alpha3 - see this sage-devel thread. So I think that is the place to look. He thought it was the new error handling or the Pari upgrade, but the message sounds more like Pari itself </uninformed opinion>.


I'd love to try a better terminal - William also had suggested one at Sage Days 31 - but I've only been really using Cygwin for maybe a week, and so I wouldn't even know how to ask Cygwin to use a different shell. Cut-and-paste does not work, as far as I've been able to tell.

@kcrisman
Member Author

comment:5

Replying to @kcrisman:

Replying to @jdemeyer:

Can you please attach your sage/rings/real_mpfr.c?

I'll try - depends on whether my wifi will work. I'm not on that computer currently.

Okay, that's a 1.5 MB file, so I am just posting a link.

http://sage.math.washington.edu/home/kcrisman/real_mpfr.c

This would be so great if it was possible to track down without too much trouble.

@kcrisman
Member Author

comment:6

As another (possibly unrelated) data point, #6743 has two patches which change the behavior of sage/rings/complex_double.pyx to get Sage to start (well, a year or two ago).

@kcrisman
Member Author

comment:7

I put in print statements at every conceivable place. Here is as far as it gets:

    def _pari_(self):
<snip comments/docs>
        sig_on()
        if mpfr_nan_p(self.value) or mpfr_inf_p(self.value):
            raise ValueError, 'Cannot convert NaN or infinity to Pari float'

        # wordsize for PARI
        cdef unsigned long wordsize = sizeof(long)*8

        cdef int prec
        prec = (<RealField_class>self._parent).__prec

        # We round up the precision to the nearest multiple of wordsize.
        cdef int rounded_prec
        rounded_prec = (self.prec() + wordsize - 1) & ~(wordsize - 1)

        # Yes, assigning to self works fine, even in Pyrex.
        if rounded_prec > prec:
            self = RealField(rounded_prec)(self)

        cdef mpz_t mantissa
        cdef mp_exp_t exponent
        cdef GEN pari_float

        if mpfr_zero_p(self.value):
            pari_float = real_0_bit(-rounded_prec)
        else:
            # Now we can extract the mantissa, and it will be normalized
            # (the most significant bit of the most significant word will be 1).
            mpz_init(mantissa)
            exponent = mpfr_get_z_exp(mantissa, self.value)
 
            # WE GET HERE AND NO FURTHER
           
            # Create a PARI REAL
            pari_float = cgetr(2 + rounded_prec / wordsize)
            mpz_export(&pari_float[2], NULL, 1, wordsize/8, 0, 0, mantissa)
            mpz_clear(mantissa)
            setexpo(pari_float, exponent + rounded_prec - 1)
            setsigne(pari_float, mpfr_sgn(self.value))
        
        cdef PariInstance P
        P = sage.libs.pari.all.pari
        return P.new_gen(pari_float)

Since the declarations read


    # level1.h (incomplete!)
    
    GEN     cgetg_copy(long lx, GEN x)
    GEN     cgetg(long x, long y)
    GEN     cgeti(long x)
    GEN     cgetr(long x)
    long    itos(GEN x)
    GEN     real_0_bit(long bitprec)
    GEN     stoi(long s)

cgetr is indeed from level1.h, which is where the sage -gdb backtrace ends up before raising the interrupt. What would cause it to have problems?

Also attaching screenshot of the traceback.

@kcrisman
Member Author

Screenshot of last bits of backtrace from sage -gdb

@kcrisman
Member Author

comment:8

Attachment: Screen shot 2011-06-29 at 9.19.14 PM.png

A little further "print"-ing revealed that that cgetr is the problem. By the way, 2 + rounded_prec / wordsize = 2 + 64/32 = 4.

What does

cgetr (x=(value optimized out))

mean? Does this mean that the Pari float will always have the same precision no matter what?
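
For reference, the arithmetic behind that 4 can be replayed outside Sage. The following is a minimal Python sketch, not Sage code: the helper names are made up for illustration, a 32-bit word is assumed, a 53-bit input precision is assumed (any precision from 33 to 64 bits rounds up the same way), and math.frexp stands in for mpfr_get_z_exp.

```python
import math

WORDSIZE = 32  # bits per long on the 32-bit Cygwin build discussed here


def round_up_precision(prec, wordsize=WORDSIZE):
    """Round prec up to a multiple of wordsize (wordsize must be a power of 2)."""
    return (prec + wordsize - 1) & ~(wordsize - 1)


def real_length_in_words(rounded_prec, wordsize=WORDSIZE):
    """Words requested from cgetr: 2 codewords plus the mantissa words."""
    return 2 + rounded_prec // wordsize


def mantissa_and_exponent(x, rounded_prec):
    """Return (m, e) with x == m * 2**e and m holding rounded_prec bits.

    Stand-in for mpfr_get_z_exp; exact for nonzero floats when
    rounded_prec >= 53.
    """
    frac, exp = math.frexp(x)          # x == frac * 2**exp, 0.5 <= |frac| < 1
    mantissa = int(math.ldexp(frac, rounded_prec))
    return mantissa, exp - rounded_prec


rounded = round_up_precision(53)               # 64
words = real_length_in_words(rounded)          # 4, as computed above
mantissa, exponent = mantissa_and_exponent(1.0, rounded)
pari_exponent = exponent + rounded - 1         # the value passed to setexpo
```

For 1.0 the mantissa comes out with its top bit set and the exponent handed to setexpo is 0, so none of the inputs looks unusual; that fits the finding that the failure is in the allocation itself.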

@kcrisman
Member Author

comment:9
GEN cgetr(long n) allocates memory on the stack for a t_REAL of length n, and initializes its first codeword. Identical to cgetg(n,t_REAL).

I'm going to try a few other things and then stop for now. But hopefully this helps.

@kcrisman
Member Author

comment:10

Trying even cgetg(4,t_REAL) raises a similar error. Pari seems to not be able to allocate anything - I don't know whether there is anything before this in initialization of Sage that has a problem.


Another data point: sage -gp works fine. Something in libpari might be off. How might I test that without actually starting Sage?

@kcrisman
Member Author

kcrisman commented Jul 1, 2011

comment:11

I have now confirmed this with the released 4.7.1.alpha3 on both XP and Win7. It is very reproducible, always the same place.

@kcrisman
Member Author

kcrisman commented Aug 1, 2011

comment:12

Another update: commenting out everything about initializing the Pynac I doesn't help, because there is another place in initialization this is used:

rings/qqbar.py:5800:    QQbar_I_nf = QuadraticField(-1, 'I', embedding=CC.gen())

which also causes the identical problem.

And _init_qqbar in sage/all.py seems like a fairly big thing to try to work around, even in testing. But commenting this out as well does allow Sage to start!

@jdemeyer

jdemeyer commented Aug 1, 2011

comment:13

Some things to try:

  1. The new PARI spkg from Update PARI to version 2.5.0 #11130.
  2. Compiling PARI with SAGE_DEBUG=yes and posting a backtrace again.

@kcrisman
Member Author

kcrisman commented Aug 2, 2011

comment:14

I tried 2. first. Not very exciting.

Program received signal SIGSEGV, Segmentation fault.
0x343a8ad5 in pari_err () from /home/.../sage-4.7.1.alpha3/devel/sage/sage-main/build/sage/rings/real_mpfr.dll

I couldn't get anything out of it that I hadn't seen before.

Again, knowing how to test whether libpari is working at all would be really helpful. The files in local/lib/ certainly exist, at any rate, and they are the ones created when I ./sage -f'ed it just now.

@kcrisman
Member Author

kcrisman commented Aug 2, 2011

comment:15

I can't get 1. to install on Cygwin. Seems like a linking order error or something, see #11130.

@nexttime
Mannequin

nexttime mannequin commented Aug 2, 2011

comment:16

Replying to @kcrisman:

I tried 2. first. Not very exciting.

Program received signal SIGSEGV, Segmentation fault.
0x343a8ad5 in pari_err () from /home/.../sage-4.7.1.alpha3/devel/sage/sage-main/build/sage/rings/real_mpfr.dll

I couldn't get anything out of it that I hadn't seen before.

First "result" (i.e., I don't know yet why pari_err() is called at all, but see below):

Debugging this bottom-up, according to your nice screen shot the segfault originates from:

static void
err_init(void)
{
  /* make sure pari_err msg starts at the beginning of line */
  if (!pari_last_was_newline()) pari_putc('\n');
  pariOut->flush(); /***** THIS SEGFAULTS *****/
  pariErr->flush();
  pariOut = pariErr;
  term_color(c_ERR);
}

So obviously pariOut (and most probably also pariErr) aren't properly initialized at that point. (Note that line 885 in the vanilla PARI sources is the assignment statement, but we patch src/src/language/init.c such that we get an offset of +2 lines.)

PARI error number 14 is "errpile" (i.e. heap / [PARI] stack error), which is most probably raised for the same reason, namely because the PARI stack apparently isn't [yet] initialized when cgetr() gets called.

For the moment, it's up to someone else to donate his/her 2 ct or more... ;-)

@nexttime
Mannequin

nexttime mannequin commented Aug 2, 2011

comment:17

I have no idea why real_mpfr[.pyx] shouldn't initialize [the] PARI [library] (i.e., the pari_instance variable defined in sage/libs/pari/gen.pyx), but you (Karl-Dieter) could verify it gets initialized by putting some print statement(s) into PariInstance's __init__(), preferably (also) around pari_init_opts(), to make sure the latter really gets called, because of

        if bot:
            return  # pari already initialized.

There are a few things that might be relevant here:

  • Cython doesn't support C enum constants (here e.g. INIT_DFTm), therefore one has to declare them as cdef extern ints, but I don't think that's the problem here.

  • bot is a very bad name for a global variable (of a library!), i.e. some other library / module might use the same for a different purpose, such that the one supposed to be PARI's may actually already have some non-zero value despite PARI not yet being initialized. (The early-return check in PariInstance's __init__() worsens that to some extent, though other problems would certainly arise later in that case.)
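
To make the second point concrete, here is a toy Python model (no real PARI involved; every name in it is hypothetical) of what goes wrong when an init guard trusts a shared global like bot, compared with a guard that uses a flag owned by the wrapper itself.

```python
# Toy model only: `state` stands in for PARI's C globals, and none of the
# function names below exist in Sage or PARI.
state = {"bot": 0, "top": 0, "avma": 0}
wrapper = {"initialized": False}   # private flag owned by the wrapper


def toy_init_stack(size=16000000, base=4096):
    """Pretend to allocate the stack: avma starts at top and grows down."""
    state["bot"] = base
    state["top"] = state["avma"] = base + size


def init_guarded_by_bot():
    """Mimics `if bot: return`: skips init whenever bot is non-zero."""
    if state["bot"]:
        return False
    toy_init_stack()
    return True


def init_guarded_by_flag():
    """Skips init only if this wrapper has already run it."""
    if wrapper["initialized"]:
        return False
    toy_init_stack()
    wrapper["initialized"] = True
    return True


# Some unrelated code clobbers the shared name before PARI is set up.
state["bot"] = 12345
skipped = not init_guarded_by_bot()     # init is silently skipped
avma_after_bot_guard = state["avma"]    # still 0: no usable stack
ran = init_guarded_by_flag()            # the flag-based guard still runs init
```

In this model the bot-based guard leaves avma at zero, so any later allocation would fail, while the flag-based guard is unaffected by the clobbered name. Whether anything like this happens in the actual Cygwin build is exactly what remains to be checked.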

@kcrisman
Member Author

kcrisman commented Aug 3, 2011

comment:18

Replying to @nexttime:

I have no idea why real_mpfr[.pyx] shouldn't initialize [the] PARI [library] (i.e., the pari_instance variable defined in sage/libs/pari/gen.pyx), but you (Karl-Dieter) could verify it gets initialized by putting some print statement(s) into PariInstance's __init__(), preferably (also) around pari_init_opts(),

Thanks, Leif - that seems very reasonable. Unfortunately I sort of destroyed my installations trying to do #11130 and I'm not sure how to fix that. I didn't know what the #0 error was, so I just started at #1, which at least I could interpret - well, I don't know much about Pari internals. But this explanation makes sense; can't allocate something to something that doesn't exist.

I'll try this when I get a chance.

@kcrisman
Member Author

kcrisman commented Aug 3, 2011

comment:19
  • Cython doesn't support C enum constants (here e.g. INIT_DFTm), therefore one has to declare them as cdef extern ints, but I don't think that's the problem here.

  • bot is a very bad name for a global variable (of a library!), i.e. some other library / module might use the same for a different purpose, such that the one supposed to be PARI's may actually already have some non-zero value despite PARI not yet being initialized. (The early-return check in PariInstance's __init__() worsens that to some extent, though other problems would certainly arise later in that case.)

It's not far enough along to try these, but here's something naive.

        cdef GEN pari_float
<snip>
        else:
<snip>
            # Create a PARI REAL
            pari_float = cgetr(2 + rounded_prec / wordsize)
 <snip>
        cdef PariInstance P
        P = sage.libs.pari.all.pari
        return P.new_gen(pari_float)

So it looks like the GEN gets defined before the PariInstance - is that a problem for some reason? Again, this is totally naive, and probably wrong since this works everywhere else.

@nexttime
Mannequin

nexttime mannequin commented Aug 3, 2011

comment:20

Replying to @kcrisman:

Unfortunately I sort of destroyed my installations trying to do #11130 and I'm not sure how to fix that.

I don't know how you managed that ;-) but you should be able to just reinstall the "old" PARI (2.4.3.alpha.p7) at least (assuming you also have a Sage branch without #11130's patches applied, though these only change doctests IIRC).

If you think something may get mixed up with a previous installation, you can also

$ rm -rf $SAGE_ROOT/local/include/pari/
$ rm $SAGE_ROOT/local/lib/libpari*
$ rm $SAGE_ROOT/local/bin/{libpari,gp}*

before reinstalling the PARI package.
(And perhaps also run ./sage -ba-force after you've reinstalled it.)

I didn't know what the #0 error was, so I just started at #1, which at least I could interpret [...]

No idea what the #0 and #1 refer to...

So it looks like the GEN gets defined before the PariInstance - is that a problem for some reason?

No. The weird trailer just explicitly uses the one and only global "PariInstance" pari_instance alias P alias sage.libs.pari.gen.pari alias sage.libs.pari.all.pari (which should get initialized as soon as you import from that module (sage.libs.pari.gen), which is done far above in real_mpfr.pyx), because new_gen() is only available as a member function (or "method") of an instance, for whatever reason.

@nexttime
Mannequin

nexttime mannequin commented Aug 3, 2011

comment:21

Replying to @jdemeyer:

Some things to try:
2. Compiling PARI with SAGE_DEBUG=yes and posting a backtrace again.

We compile PARI with -g by default btw., SAGE_DEBUG only adds -O0.

@kcrisman
Member Author

kcrisman commented Aug 3, 2011

Segfault with #11130 applied

@kcrisman
Member Author

kcrisman commented Aug 3, 2011

comment:22

Attachment: Pari-2.5.0Segfault.png

Latest screenshot shows that the upgrade of Pari in #11130 causes a slightly different segfault backtrace, but still along the same lines of what Leif is suggesting and nearly the same as before.

  if (x > (avma-bot) / sizeof(long)) pari_err(errpile);

is line 86 in level1.h, unless there are patches in Sage, and with the same two-line offset the #0 error in the backtrace is the same as above. What I find interesting is that this time it doesn't mention real_mpfr or cgetr, though I assume that is still where the problem is.
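
The check quoted above can be replayed in a few lines of Python. This is only a model of the arithmetic, with made-up class and method names; it assumes 4-byte longs, and the bot address and 16000000-byte stack size are illustrative values for a 32-bit build, not measurements.

```python
SIZEOF_LONG = 4  # bytes per long on a 32-bit build (assumption)


class ToyPariStack:
    """Models only bot/top/avma and the size check done before allocating."""

    def __init__(self):
        # State before any initialization: all three pointers are zero.
        self.bot = self.top = self.avma = 0

    def init(self, bot, size):
        self.bot = bot
        self.top = self.avma = bot + size   # the stack grows downward from top

    def cgetr(self, n):
        # Same test as: if (x > (avma-bot) / sizeof(long)) pari_err(errpile);
        if n > (self.avma - self.bot) // SIZEOF_LONG:
            raise MemoryError("errpile: not enough room on the PARI stack")
        self.avma -= n * SIZEOF_LONG
        return self.avma


stack = ToyPariStack()
try:
    stack.cgetr(4)
    failed_uninitialized = False
except MemoryError:
    failed_uninitialized = True          # 4 > 0, so the check trips

stack.init(bot=2121924616, size=16000000)
address = stack.cgetr(4)                 # avma drops by 16 bytes
```

With all three pointers at zero, every request fails the check, so an errpile at this line is what one would expect if the code doing the allocation sees an uninitialized stack. After initialization, a 4-word request simply moves avma down by 16 bytes.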

@kcrisman
Member Author

kcrisman commented Aug 3, 2011

comment:23

Okay, inserting appropriate print statements gives

Got to first line of PariInsstance init
Got beyond 'if bot' of PariInstance init instead
Got beyond 'pari_init_opts'
Got here
Got to just before pari_float

which can be interpreted as

  • Pari was initialized
  • the 'if bot' was NOT taken
  • and pari_init_opts was apparently called
  • then we got to the complex number line
  • then we got to the real_mpfr line with the cgetr
  • and we didn't make it past that, as usual.

So apparently bot was not yet set, contrary to your hypothesis, but there is still something weird going on with the stack. The rest of the lines in the __init__ don't look that innocent either; if one of them failed or allocated something null, would it raise an error? (Like the pari_free line or the pariOut lines?)

@kcrisman
Member Author

kcrisman commented Aug 8, 2011

comment:42

As for the printing:

  • after initializing the stack, bot, top, and avma are all 2121924616

I assume top and avma are 2121924616 + 16000000 = 2137924616, or bot is 2105924616; otherwise this would be the first error.

You are correct, I didn't look closely enough. Should have had a check digit :)

  • in integer.pyx, they are all zero.

So now we just need that import hook thing Robert Bradshaw talked about to find out where this could have changed in between.

@kcrisman
Member Author

kcrisman commented Aug 9, 2011

comment:43

Well, this doesn't happen in any of the other files where /pari/decl.pxi is defined, unfortunately (as it would have been easier to find) - those are all imported on startup after integer.pyx, apparently, if at all (for instance, factorint.pyx isn't). I still can't find any other places in libs/pari/gen.pyx which is called during the startup where bot and friends are bad, either.

In fact, avma is exactly what it is supposed to be (213...) again well after all the bad stuff happens! Maybe Pari is 'uninitialized' somehow, then initialized again since bot is once again zero... just not in time to save the Pynac_I initialization.

@nexttime
Mannequin

nexttime mannequin commented Aug 9, 2011

comment:44

Well, some random pointer might corrupt PARI's stack variables as well.

But dumping the values of these variables whenever some module gets imported should help narrow down the place where this or something similar happens.
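
One way to get such a dump is to wrap the import machinery. The sketch below is written for present-day Python 3 (the Sage of this ticket ran on Python 2, where the module is called __builtin__), and trace_imports and probe are hypothetical names. probe would have to be some callable returning the current (bot, top, avma), for example a small Cython helper, which is assumed here and not shown.

```python
import builtins
import contextlib


@contextlib.contextmanager
def trace_imports(probe, log):
    """Record (module name, old value, new value) whenever probe() changes
    across an import."""
    original_import = builtins.__import__
    last = [probe()]

    def traced_import(name, globals=None, locals=None, fromlist=(), level=0):
        module = original_import(name, globals, locals, fromlist, level)
        current = probe()
        if current != last[0]:
            log.append((name, last[0], current))
            last[0] = current
        return module

    builtins.__import__ = traced_import
    try:
        yield log
    finally:
        builtins.__import__ = original_import   # always restore the real hook
```

Usage would be along the lines of `with trace_imports(read_pari_pointers, []) as log: import sage.all`, where read_pari_pointers is the assumed helper. A change is attributed to the first import that finishes after it happened, so this narrows the search down to a module or two and does not pinpoint the line.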

@kcrisman
Member Author

kcrisman commented Aug 9, 2011

comment:45

Just before the 0s appear, the 'deep copy to Python heap' is called and avma goes down to ...600 instead of 616. The others are the same.

Then we have zeros, also in the other things while importing CDF.

But the next time the deep copy in gen.pyx is called, avma is back to normal, down to 588 (later up to 592 when a new_gen is created).

So it seems that it might indeed be happening in one of those places where avma is reset in complex_double.pyx? Presumably not the special function ones.


Random? But why is it so reliable then, and on different computers/versions of Windows? (Unless the importing of randstate did it, but I don't think that's what you meant, and I don't think this has avma or bot or top.)


The real problem is that I can't dump values of these variables while the pyx files are being imported, because editing the pyx file complex_double won't do that until after they are all imported. And I can't dump them from places where they aren't defined - precious few, really.

Really needing that import hook thingie.

@kcrisman
Member Author

kcrisman commented Dec 2, 2011

comment:46

This has reappeared in the discussion to #12104. Leif has a patch to get Cython files to say where they have problems importing somewhere, hopefully will be of use.

@dimpase
Member

dimpase commented Dec 2, 2011

comment:47

Replying to @kcrisman:

This has reappeared in the discussion to #12104. Leif has a patch to get Cython files to say where they have problems importing somewhere, hopefully will be of use.

these global variables in Windows DLLs... IMHO they need special treatment: search for "global" here:
http://cygwin.com/faq/faq.programming.html

perhaps this is a source of all this blues.

@kcrisman
Member Author

comment:49

Still a problem. But:

User 1@GC02635 /home/SageUser/sage-4.7.2
$ ./sage
----------------------------------------------------------------------
| Sage Version 4.7.2, Release Date: 2011-10-29                       |
| Type notebook() for the GUI, and license() for information.        |
----------------------------------------------------------------------
sage: 2+2
4

This is on XP, after commenting out the inits in sage/symbolic/pynac.pyx and sage/all.py. So we just need to track down the problem.

@kcrisman
Member Author

comment:50

@jpflori - any ideas on this one?

@jpflori

jpflori commented Jul 11, 2012

comment:51

I remember doing something with the symbolic i some time ago, potentially while updating pynac.
Not sure this is related, but I'll begin with finding traces of that.

@jpflori

jpflori commented Jul 11, 2012

comment:52

This was here #12950 comment:11 but seems unrelated at first sight.

@kcrisman
Member Author

comment:53

No, this is quite different, I think. You might as well read the long list of updates here first - it was quite an education even doing all this, though I was ultimately woefully unsuccessful.

@kcrisman
Member Author

kcrisman commented Oct 8, 2012

comment:54

This is not showing up on Cygwin on XP with the current status of building, nor apparently on Windows 7 (JP, can you confirm this?). Maybe we should close this, though it's frustrating not to know what the problem was.

@jpflori

jpflori commented Oct 8, 2012

comment:55

No, I did not get that error on my Windows 7 install.
Maybe this is related to #11116?

@kcrisman
Member Author

kcrisman commented Oct 8, 2012

comment:56

I doubt it. Since neither of us is seeing this currently, I'll mark it to close, though if it ever happens again at least this info is here for posterity!

@kcrisman
Member Author

kcrisman commented Oct 8, 2012

Reviewer: Karl-Dieter Crisman, Jean-Pierre Flori

@jpflori

jpflori commented Oct 8, 2012

comment:58

The bug in #11116 only happens on one arch, and potentially with only a few versions of Sage.
Maybe it's the same here.
You were unlucky enough to try out a particular version of Sage on a particular machine where the problems of initialization order led to a segfault.
But an innocent-looking change since then, anywhere in the Sage library, might have made the problem disappear.

@kcrisman
Member Author

kcrisman commented Oct 8, 2012

comment:59

But an innocent-looking change since then, anywhere in the Sage library, might have made the problem disappear.

It's true. At the same time, it wasn't just me - Mike Hansen had this traceback well over a year ago. It is possible it was only ever on XP, who knows.

@dimpase
Member

dimpase commented Oct 8, 2012

comment:60

Replying to @kcrisman:

But an innocent-looking change since then, anywhere in the Sage library, might have made the problem disappear.

It's true. At the same time, it wasn't just me - Mike Hansen had this traceback well over a year ago. It is possible it was only ever on XP, who knows.

AFAIR, I saw this on my 32-bit Win7, too. IMHO, it's a Cygwin improvement that is to credit.

By the way, I have strange problems with my new 64-bit Win7, Sage install does not get past bzip2. Or is bzip2 supposed to come from Cygwin natively? Perhaps the toolchain is broken?

I use the latest Cygwin. Does it still need a manual fix in that libtool or autoconf or what was that?

@jpflori

jpflori commented Oct 9, 2012

comment:61

Replying to @dimpase:

Replying to @kcrisman:

But an innocent-looking change since then, anywhere in the Sage library, might have made the problem disappear.

It's true. At the same time, it wasn't just me - Mike Hansen had this traceback well over a year ago. It is possible it was only ever on XP, who knows.

AFAIR, I saw this on my 32-bit Win7, too. IMHO, it's a Cygwin improvement that is to credit.

By the way, I have strange problems with my new 64-bit Win7, Sage install does not get past bzip2. Or is bzip2 supposed to come from Cygwin natively? Perhaps the toolchain is broken?

Did not encounter that problem, but I had the Cygwin bzip2 package installed before building Sage.

I use the latest Cygwin. Does it still need a manual fix in that libtool or autoconf or what was that?

I don't know... the problem was that updating from the gcc4 default package (4.3.stg) to the 4.5.stg "forgot" to update some paths in configuration files in the postinst script.
The latest 4.5 package seems to be from October 2011, so I doubt the problem has been fixed.

@embray
Contributor

embray commented Apr 18, 2016

comment:63

I've encountered exactly this issue trying to import sage, which I finally got to build after a number of other fixes (not all of which I've posted patches for yet).

Since this ticket has already been closed and is quite long, should I open a new one? Or just reopen this one? It appears to be the exact same issue--I'm getting a segfault at the line

pari_float = cgetr(2 + rounded_prec / wordsize)

in real_mpfr.pyx.

@jdemeyer

comment:64

I suggest opening a new ticket.
