Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

matrix multiplication over ZZ sometimes gives incorrect results #11358

Closed
sagetrac-tomc mannequin opened this issue May 20, 2011 · 28 comments
Closed

matrix multiplication over ZZ sometimes gives incorrect results #11358

sagetrac-tomc mannequin opened this issue May 20, 2011 · 28 comments

Comments

@sagetrac-tomc
Copy link
Mannequin

sagetrac-tomc mannequin commented May 20, 2011

Something is wrong with the multi-modular matrix multiplication code for matrices over ZZ. At random, and infrequently, it gives incorrect results. For example, the following code chooses random 3x2 and 2x10 integer matrices and multiplies them together using the multi-modular algorithm. It then does the same multiplication 100 more times, checks that the answer is always the same, and if not raises an exception:

sage: for n in range(2000):
....:     A = MatrixSpace(ZZ,3,2).random_element()    
....:     B = MatrixSpace(ZZ,2,10).random_element()
....:     try_once = A._multiply_multi_modular(B)
....:     for k in range(100):
....:             try_again = A._multiply_multi_modular(B)
....:         if try_once != try_again:
....:                 print "="*60
....:             print "n = %s, k = %s"%(n,k)
....:             print "A = "
....:             print A
....:             print "B ="
....:             print B
....:             print "first attempt = "
....:             print try_once
....:             print "k-th retry = "
....:             print try_again
....:             raise RuntimeError
....: 

This fails with very high probability (on Sage 4.6.2 under OS X, built from source) with output such as:

============================================================
n = 27, k = 43
A = 
[-1  0]
[ 1  0]
[ 0  1]
B =
[  -2    1   -1    0    0   -1   -1    3   -1   -1]
[   1    1 -116    3    0   -1   -1    0   -1    0]
first attempt = 
[   2   -1    1    0    0    1    1   -3    1    1]
[  -2    1   -1    0    0   -1   -1    3   -1   -1]
[   1    1 -116    3    0   -1   -1    0   -1    0]
k-th retry = 
[   2 1102    1    0    0    1    1 1100    1    1]
[1101    1 1102    0    0 1102 1102    3 1102 1102]
[   1    1  987    3    0 1102 1102    0 1102    0]
---------------------------------------------------------------------------
RuntimeError [...]

Note that the two candidates for the matrix product here are congruent modulo 1103, which is prime. If you rerun the code with verbose logging, using set_verbose(2), then every time it fails the two candidates for the matrix product are congruent modulo the prime being used in the multi-modular algorithm. Thus I suspect that the Chinese Remainder Theorem code in sage/ext/multi_modular.pyx is not handling a corner case properly.

SEE ALSO: #10281, which touches the sage/ext/multi_modular.pyx code in nontrivial ways.

CC: @burcin

Component: linear algebra

Keywords: matrix multiplication, multi-modular, integers, ZZ

Stopgaps: #12710

Author: William Stein

Reviewer: Douglas McNeil

Merged: sage-5.0.beta12

Issue created by migration from https://trac.sagemath.org/ticket/11358

@sagetrac-tomc sagetrac-tomc mannequin added this to the sage-5.0 milestone May 20, 2011
@sagetrac-tomc
Copy link
Mannequin Author

sagetrac-tomc mannequin commented May 20, 2011

comment:1

This also fails on Sage 4.6.2 under Red Hat Enterprise Linux (binary install, x86_64 GNU/Linux)

@sagetrac-tomc
Copy link
Mannequin Author

sagetrac-tomc mannequin commented May 20, 2011

comment:2

Looking at the function mpz_crt_vec_tail() in sage/ext/multi_modular.pyx, it appears that three of the last five lines:

cdef Integer zz
zz = PY_NEW(Integer)
mpz_set(zz.value, self.half_product)

are redundant, because the variable zz is never used. (NB I am unfamiliar with both Cython and GMP, so I may be wrong here.) But why is this code there? Is it supposed to be doing something else?

@rbeezer
Copy link
Mannequin

rbeezer mannequin commented May 20, 2011

comment:3

From #sagemath on IRC:

[07:11] <sagebot> New news from tractimeline: Ticket #11358 (matrix multiplication over ZZ sometimes gives incorrect results) created
[07:32] <logix> hm, i just tried to reproduce that bug (which i can), and it failed both times when B had a full zero column
[07:32] <logix> (which is the case in the bug report too)
[07:33] <logix> perhaps that's the condition triggering the bug?
[07:34] <logix> i mean one of the conditions triggering the bug (there must be something else, as the inner loop runs more than once and it fails only sometimes

@sagetrac-tomc
Copy link
Mannequin Author

sagetrac-tomc mannequin commented May 20, 2011

comment:4

Replying to @rbeezer:

I don't think that this is correct. By running the code repeatedly, I found examples without zero columns.

@rbeezer
Copy link
Mannequin

rbeezer mannequin commented May 20, 2011

comment:5

Replying to @sagetrac-tomc:

I don't think that this is correct. By running the code repeatedly, I found examples without zero columns.

OK, just saw that in IRC and thought I'd capture it.

I'm curious to see what the root cause is here!

Rob

@sagetrac-dsm
Copy link
Mannequin

sagetrac-dsm mannequin commented May 21, 2011

comment:6

I think I've got it.

This seems to happen iff a random prime happens to be chosen twice. For example, if you modify _extend_moduli_to_height_c to always reuse a prime if more than one is needed, this always happens. The same bug seems to exist in _extend_moduli_to_count as well.

The problem seems to be that in _new_random_prime, there's a test which doesn't do what it's trying to do:

    cdef mod_int _new_random_prime(self):
        # choose a new random prime
        cdef Py_ssize_t i
        cdef mod_int p
        while True:
            p = random_prime(self._u_bound, lbound =self._l_bound)
            for i in range(self.n):
                if p == self.moduli[i]:
                    break
            else:
                return p

IIUC, the "for i in range(self.n)" loop is attempting to avoid the problem of repeated primes, but this only works if p is added to self.moduli immediately after it's returned. In the case of _extend_moduli_to_height_c, this isn't true; the addition is delayed until a later extend_with_primes call, so the above check decides that the prime isn't reused, we get repeated moduli, and the multiplication breaks.

Or even more obviously:

sage: from sage.ext.multi_modular import MultiModularBasis_base
sage: mm = MultiModularBasis_base(1)
sage: mm._extend_moduli_to_count(1000)
1000
sage: len(mm), len(set(mm))
(1000, 843)

ISTM either we can remove the check-for-duplicate code from _new_random_prime and do it externally, or we can add an argument and pass it the "accumulated" primes as we build them elsewhere so it can avoid duplicates there too. I prefer the first. Should I work up a patch?

@sagetrac-tomc
Copy link
Mannequin Author

sagetrac-tomc mannequin commented May 21, 2011

comment:8

Replying to @sagetrac-dsm:

Good catch! I agree.

ISTM either we can remove the check-for-duplicate code from _new_random_prime and do it externally, or we can add an argument and pass it the "accumulated" primes as we build them elsewhere so it can avoid duplicates there too. I prefer the first. Should I work up a patch?

I prefer the first too. Please write a patch and then I will review it.

@sagetrac-dsm
Copy link
Mannequin

sagetrac-dsm mannequin commented May 21, 2011

comment:9

Okay, here's my first attempt. Because another function calls _new_random_prime assuming that it correctly avoids duplication I had to change my intended approach a bit, but it should work.

Two other minor changes:

(1) as mentioned above, mpz_crt_vec_tail() has what looks to me like vestigial code, and I've removed it.

(2) the documentation for replace_prime_c claimed that it would replace a prime with a greater prime, but made no efforts to ensure the "greater than" part. I've changed the doc accordingly.

@sagetrac-dsm sagetrac-dsm mannequin added the s: needs review label May 21, 2011
@sagetrac-tomc
Copy link
Mannequin Author

sagetrac-tomc mannequin commented May 21, 2011

comment:10

Thanks. I'll review this.

@sagetrac-tomc
Copy link
Mannequin Author

sagetrac-tomc mannequin commented May 24, 2011

comment:11

I can't get the patch to apply. When I apply the patch and then do

sage -b

I get

Error converting Pyrex file to C:
------------------------------------------------------------
...
        """
        cdef Py_ssize_t i
        cdef mod_int p
        while True:
            p = random_prime(self._u_bound, lbound =self._l_bound)
            if p not in self.moduli[:self.n]: return p
                                  ^
------------------------------------------------------------

/Users/tomc/sage-4.6.2/devel/sage-test-crt-fix/sage/ext/multi_modular.pyx:191:35: Cannot convert 'mod_int *' to Python object

Also, in the new implementation of _new_random_prime() there is the possibility of an infinite loop if self.moduli already contains all the primes between self._lbound and self._ubound. I suggest modifying _new_random_prime() so that it:

  • makes some number of tries (10? 100?) to find a new random prime p in the range self._lbound < p < self._ubound; and

  • if this fails then it increases self_ubound and tries again.

@burcin
Copy link

burcin commented May 24, 2011

comment:12

Replying to @sagetrac-tomc:

Also, in the new implementation of _new_random_prime() there is the possibility of an infinite loop if self.moduli already contains all the primes between self._lbound and self._ubound. I suggest modifying _new_random_prime() so that it:

  • makes some number of tries (10? 100?) to find a new random prime p in the range self._lbound < p < self._ubound; and
  • if this fails then it increases self_ubound and tries again.

I suggest raising an error instead of increasing _ubound. The upper bound is usually set so that the primes fit in a single machine word. A lot of things would break if this changed without notice.

@sagetrac-tomc
Copy link
Mannequin Author

sagetrac-tomc mannequin commented May 24, 2011

comment:13

Replying to @burcin:

I suggest raising an error instead of increasing _ubound. The upper bound is usually set so that the primes fit in a single machine word. A lot of things would break if this changed without notice.

If we do raise an error then it should be caught before it propagates up to the user. For instance, it should not be that if A and B are integer matrices then evaluating AB can sometimes cause an exception because the multi-modular matrix multiplication algorithm ran out of primes: in this case we should catch the exception and evaluate AB using a different algorithm.

Is it really the case that raising self._ubound would cause things to break?  I thought from looking at the multi-modular code that it all used mpz_t types from GMP, and that these would automatically increase their allocated memory as necessary.

The multi-modular code is used in five places: dense matrix multiplication over ZZ; dense matrix multiplication over cyclotomics; dense matrix multiplication over QQ (although only in the rational reconstruction routines _lift_crt_rr() and _lift_crt_rr_with_lcm(), which I think are dead code); echelon form calculations over cyclotomics; and characteristic polynomial calculations over cyclotomics.  If we decide to raise an error if the multi-modular code runs out of primes then we will need to catch this error and handle it appropriately in all five places.

@burcin
Copy link

burcin commented May 24, 2011

comment:14

Replying to @sagetrac-tomc:

Replying to @burcin:

I suggest raising an error instead of increasing _ubound. The upper bound is usually set so that the primes fit in a single machine word. A lot of things would break if this changed without notice.

If we do raise an error then it should be caught before it propagates up to the user. For instance, it should not be that if A and B are integer matrices then evaluating AB can sometimes cause an exception because the multi-modular matrix multiplication algorithm ran out of primes: in this case we should catch the exception and evaluate AB using a different algorithm.

This is better than trying to handle it magically in the MultiModularBasis class.

Is it really the case that raising self._ubound would cause things to break?  I thought from looking at the multi-modular code that it all used mpz_t types from GMP, and that these would automatically increase their allocated memory as necessary.

Multi precision integers are needed for the lifting step. One of the reasons to use a multi-modular algorithm is to take advantage of fast arithmetic implemented in hardware. This requires that the modulus is small enough to fit in a word size. For linear algebra, the dimensions of the matrix also come into play. Ideally you should reduce only once for each row*column multiplication.

The multi-modular code is used in five places: dense matrix multiplication over ZZ; dense matrix multiplication over cyclotomics; dense matrix multiplication over QQ (although only in the rational reconstruction routines _lift_crt_rr() and _lift_crt_rr_with_lcm(), which I think are dead code); echelon form calculations over cyclotomics; and characteristic polynomial calculations over cyclotomics.  If we decide to raise an error if the multi-modular code runs out of primes then we will need to catch this error and handle it appropriately in all five places.

You cannot assume that the only users of the code are in the Sage library. This class is designed to be an abstraction that makes it easy to implement multi modular algorithms.

Perhaps we can verify that this error will never occur in these places through some theoretical argument. It is highly unlikely that we will always keep choosing the same random prime.

@sagetrac-tomc
Copy link
Mannequin Author

sagetrac-tomc mannequin commented May 24, 2011

comment:15

Replying to @burcin:

Multi precision integers are needed for the lifting step. One of the reasons to use a multi-modular algorithm is to take advantage of fast arithmetic implemented in hardware. This requires that the modulus is small enough to fit in a word size.

I agree that things will slow down if we ever raise self._ubound to be larger than a machine word: my question was whether anything would actually break. But in any case, since the MultiModularBasis class is explicitly documented as:

"""
    This class stores a list of machine-sized prime numbers...
"""

we should not allow the code to raise self._ubound. Thus we should raise an error if we repeatedly fail to choose a new random prime. As discussed, we will need to handle this error in several other places in Sage.

Note that the situation we are discussing is very unlikely to ever occur in practice, as the default values of _lbound and _ubound are repectively 210 and 215 and there are many primes in between 210 and 215. But in theory someone could do:

sage: mm = MultiModularBasis(lbound=4,ubound=6,...)

or something equally silly, and then the multi-modular code would quickly run out of primes.

@sagetrac-dsm
Copy link
Mannequin

sagetrac-dsm mannequin commented May 24, 2011

comment:16

Replying to @sagetrac-tomc:

I can't get the patch to apply.

Try it against one of the 4.7 release candidates. I should've mentioned what I was basing it on.

As for the case of running out of primes -- which I considered the caller's fault, and so I simply warned about it in the docstring rather than handle it -- I'm fine with whatever. (Once cython 0.15 gets integrated and we have cythonic generators it may be worth collecting random prime functions into a fast module of their own.)

Given that the moduli have to be below MAX_MODULUS, it wouldn't cost much to check whether there really are any primes left and raise an exception only if we really are done and not merely unlucky. If we're doing that, though, then we should probably raise the same exception if you ask for more primes than there are available in extend_moduli_to_count as well.

@williamstein
Copy link
Contributor

comment:17

ping why is this ticket dead? It seems very, very important, but everybody forgot about it a year ago?!

Replying to @sagetrac-tomc:

Note that the situation we are discussing is very unlikely to ever occur in practice, as the default
values of _lbound and _ubound are repectively 210 and 215 and
there are many primes in between 210 and 215.

This is completely false. There are hardly any primes in that range:

sage: len(prime_range(2^10,2^15))
3340

The linear algebra in Sage is currently very broken in practice, perhaps due to various optimizations and re-implementations that have fundamentally broken things. For example, I'm constantly hitting a problem of "not enough primes" (see #10281), even for matrices over QQ.

@williamstein

This comment has been minimized.

@jbalakrishnan
Copy link

Stopgaps: #12710

@williamstein
Copy link
Contributor

this is a usable version of the test program in the ticket description

@williamstein
Copy link
Contributor

comment:20

Attachment: 11358.sage.gz

I've posted a patch trac_11358-wstein.patch that fixes the bug. The solution is definitely much better than what was proposed 10 years ago. You can run the attached 11358.sage to check that the exact bug cited here doesn't happen. But better is to see the new doctests which get to the heart of the matter.

@sagetrac-dsm
Copy link
Mannequin

sagetrac-dsm mannequin commented Mar 21, 2012

comment:21

The patch does seem to eliminate the original bug for me, but there's still some strangeness for different _extend methods:


sage: set_random_seed(0)
sage: from sage.ext.multi_modular import MultiModularBasis_base
sage: m = MultiModularBasis_base(0);
sage: m._extend_moduli(100)
sage: len(m), len(set(m))
(101, 99)
sage: [k for k in m if list(m).count(k) > 1]
[30859, 3217, 30859, 3217]


sage: set_random_seed(1)
sage: from sage.ext.multi_modular import MultiModularBasis_base
sage: m = MultiModularBasis_base(0);
sage: m._extend_moduli_to_count(100)
100
sage: len(m), len(set(m))
(100, 97)
sage: [k for k in m if list(m).count(k) > 1]
[4001, 5309, 23293, 4001, 23293, 5309]

So I think that two of the three _extend methods still don't give results that I'd expect.

@williamstein
Copy link
Contributor

comment:22

So I think that two of the three _extend methods still don't give results that I'd expect.

I have fixed the patch. There was one case where I didn't update use of _new_random_prime correctly.

@sagetrac-dsm
Copy link
Mannequin

sagetrac-dsm mannequin commented Mar 21, 2012

comment:23

The new patch:

(1) Applies against 5.0.beta8 with minor fuzz.
(2) Seems to eliminate the original bug.
(3) Matches the original diagnosis about where the problem was (use of repeated primes in a multimodular basis), so I get why it should fix the bug.
(4) Has doctests which verify that .._height now behaves.
(5) The new version passes my separate tests confirming that the other _extend methods behave too, while the first version didn't.

Positive review.

@loefflerd
Copy link
Mannequin

loefflerd mannequin commented Mar 26, 2012

Reviewer: Douglas McNeil

@loefflerd
Copy link
Mannequin

loefflerd mannequin commented Mar 26, 2012

Author: William Stein

@jdemeyer
Copy link

jdemeyer commented Apr 2, 2012

Merged: sage-5.0.beta12

@jdemeyer
Copy link

jdemeyer commented Apr 2, 2012

apply ONLY this.

@jdemeyer
Copy link

jdemeyer commented Apr 2, 2012

comment:26

Attachment: trac_11358-wstein.patch.gz

Rebased patch by removing a hunk which just added a newline.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants