Error on Linux x64 dgeqp3_ #263

Closed
staticfloat opened this issue Jul 26, 2013 · 14 comments

@staticfloat
Contributor

On Linux x64, dgeqp3 gives different results with OpenBLAS v0.2.7 than with v0.2.6. In Julia, the test that fails is this:

julia> srand(1234321)

julia> a = rand(10,10)
10x10 Float64 Array:
 0.0944218  0.735234   0.969904  0.963961   0.465516   0.849889  0.0124892  0.244381    0.745372   0.131392  
 0.936611   0.212451   0.185209  0.387011   0.45674    0.42185   0.757422   0.371978    0.727993   0.0173534 
 0.258327   0.845895   0.460513  0.152221   0.0239988  0.290466  0.559113   0.00902172  0.0858923  0.00643378
 0.930924   0.0171136  0.957225  0.538903   0.926619   0.421925  0.653393   0.355525    0.419125   0.97461   
 0.555283   0.903764   0.958056  0.558128   0.522457   0.921068  0.833648   0.672057    0.821811   0.54126   
 0.87151    0.843184   0.484397  0.221613   0.110692   0.566173  0.277034   0.724802    0.793848   0.275229  
 0.041553   0.143325   0.805794  0.404656   0.487905   0.867104  0.228606   0.0406683   0.946614   0.799287  
 0.968779   0.914686   0.372153  0.0988376  0.669727   0.901132  0.727284   0.446438    0.197482   0.261148  
 0.653566   0.715679   0.601021  0.0703304  0.33555    0.688406  0.445822   0.432431    0.324625   0.255576  
 0.458101   0.757121   0.804712  0.427103   0.212379   0.576132  0.929957   0.975792    0.815186   0.817669  

julia> LinAlg.LAPACK.geqp3!(a[1:5,:])
(
5x10 Float64 Array:
 -1.73817    -1.03966    -1.16458   -1.20003    -0.74461    -1.33618    -1.05513   -1.11305    -1.22387    -0.911931
  0.0683915  -1.02321     0.201393  -0.364832   -0.249304    0.011094   -0.810407  -0.462729   -0.0284873  -0.348784
  0.170052    0.0347848  -0.849424  -0.236841   -0.0776151  -0.290039   -0.402718   0.30069    -0.0164622   0.452553
  0.353471    0.281608   -0.451136   0.560977    0.226932    0.342346   -0.159632   0.117607    0.380469   -0.206646
  0.353778    0.0818541   0.175374  -0.0556841   0.271015    0.0536057   0.224983  -0.0150287  -0.244104    0.283875,

[1.5580023539798393,1.839565572847273,1.62037786050218,1.9938177239784762,0.0],[3,1,2,9,8,6,7,5,4,10])

This last step is not the same as when this is run on the same machine with v0.2.6:

julia> LinAlg.LAPACK.geqp3!(a[1:5,:])
(
5x10 Float64 Array:
 -1.73817    -1.03966    -1.16458   -1.20003    -0.911931  -1.33618    -1.05513   -0.74461    -1.22387    -1.11305  
  0.0683915  -1.02321     0.201393  -0.364832   -0.283484  -0.063964   -0.810407  -0.296921   -0.0312812  -0.462729 
  0.170052    0.0347848  -0.849424  -0.236841    0.463068  -0.302125   -0.402718  -0.0852822  -0.016912    0.30069  
  0.353471    0.281608   -0.451136   0.560977   -0.220472   0.358238   -0.159632   0.237013    0.381061    0.117607 
  0.353778    0.0818541   0.175374  -0.0556841   0.292253   0.0439763   0.224983   0.264906   -0.244462   -0.0150287,

[1.5580023539798393,1.839565572847273,1.62037786050218,1.9938177239784762,0.0],[3,1,2,9,10,6,7,8,4,5])
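
A quick way to tell a harmless change in pivot order from a numerically wrong result is to compare column norms. This is a plain-Python sketch, not part of the original report. Since Q is orthogonal, every column of R must have the same 2-norm as the input column it came from. With 5 rows, output columns 5 to 10 lie entirely in R, so they can be checked against the six-digit values printed above. The names `norm_mismatches`, `TAIL_027`, `TAIL_026`, and the tolerance are illustrative assumptions.

```python
import math

# First five rows of `a`, copied from the printed matrix in the report.
A = [
    [0.0944218, 0.735234, 0.969904, 0.963961, 0.465516, 0.849889, 0.0124892, 0.244381, 0.745372, 0.131392],
    [0.936611, 0.212451, 0.185209, 0.387011, 0.45674, 0.42185, 0.757422, 0.371978, 0.727993, 0.0173534],
    [0.258327, 0.845895, 0.460513, 0.152221, 0.0239988, 0.290466, 0.559113, 0.00902172, 0.0858923, 0.00643378],
    [0.930924, 0.0171136, 0.957225, 0.538903, 0.926619, 0.421925, 0.653393, 0.355525, 0.419125, 0.97461],
    [0.555283, 0.903764, 0.958056, 0.558128, 0.522457, 0.921068, 0.833648, 0.672057, 0.821811, 0.54126],
]

# Output columns 5..10 and the matching pivot entries, v0.2.7 build.
TAIL_027 = [
    [-0.74461, -1.33618, -1.05513, -1.11305, -1.22387, -0.911931],
    [-0.249304, 0.011094, -0.810407, -0.462729, -0.0284873, -0.348784],
    [-0.0776151, -0.290039, -0.402718, 0.30069, -0.0164622, 0.452553],
    [0.226932, 0.342346, -0.159632, 0.117607, 0.380469, -0.206646],
    [0.271015, 0.0536057, 0.224983, -0.0150287, -0.244104, 0.283875],
]
PIV_027 = [8, 6, 7, 5, 4, 10]

# Output columns 5..10 and the matching pivot entries, v0.2.6 build.
TAIL_026 = [
    [-0.911931, -1.33618, -1.05513, -0.74461, -1.22387, -1.11305],
    [-0.283484, -0.063964, -0.810407, -0.296921, -0.0312812, -0.462729],
    [0.463068, -0.302125, -0.402718, -0.0852822, -0.016912, 0.30069],
    [-0.220472, 0.358238, -0.159632, 0.237013, 0.381061, 0.117607],
    [0.292253, 0.0439763, 0.224983, 0.264906, -0.244462, -0.0150287],
]
PIV_026 = [10, 6, 7, 8, 4, 5]


def norm_mismatches(tail, pivots, tol=1e-4):
    """Return the original (1-based) column numbers whose 2-norm
    differs between the input and the factored output by more than tol."""
    bad = []
    for k, col in enumerate(pivots):
        before = math.sqrt(sum(row[col - 1] ** 2 for row in A))
        after = math.sqrt(sum(row[k] ** 2 for row in tail))
        if abs(before - after) > tol:
            bad.append(col)
    return bad


print("v0.2.7 mismatched columns:", norm_mismatches(TAIL_027, PIV_027))
print("v0.2.6 mismatched columns:", norm_mismatches(TAIL_026, PIV_026))
```

With the printed values, the v0.2.6 tail passes for every column, while the v0.2.7 tail flags several columns, including original columns 8 and 10. That would suggest the v0.2.7 trailing columns were updated incorrectly, rather than the two builds merely breaking a pivoting tie differently.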

The function LinAlg.LAPACK.geqp3!() calls directly into OpenBLAS's dgeqp3_ symbol, and the only difference between these two runs is that one build was linked against v0.2.7 and the other against v0.2.6. The configurations are identical: in both cases the only thing openblas_get_config() returns is NO_AFFINITY. Here is the relevant information for the machine:

$ uname -a
Linux debug-hs-1801-ruby-1374796630 2.6.32-042stab061.2 #1 SMP Fri Aug 24 09:07:21 MSK 2012 x86_64 x86_64 x86_64 GNU/Linux

$ cat /proc/cpuinfo | gist
uploaded to gist: https://gist.github.com/anonymous/6092537

Unfortunately, this machine is not mine; it belongs to the guys at travis-ci.com, so I will lose access to it tomorrow. I will do my best to provide any further information you need to debug this problem, but I won't be able to log in after tomorrow, and I haven't been able to reproduce it on my own systems yet.

@staticfloat
Contributor Author

I think this might be a piledriver bug. I can't execute the resulting binaries on my machines, because the library is optimized for piledriver kernels:

$ ls -la libopenblas.a
libopenblas.a -> libopenblas_piledriverp-r0.2.7.a

You can download the entire openblas-v0.2.7 folder from the Julia distribution here, to poke around and see if the binary code was compiled incorrectly, or test it on your machines, etc... This was compiled with gcc-4.7 and gfortran-4.7:

$ gcc-4.7 --version
gcc-4.7 (Ubuntu/Linaro 4.7.3-2ubuntu1~12.04) 4.7.3
Copyright (C) 2012 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

$ gfortran-4.7 --version
GNU Fortran (Ubuntu/Linaro 4.7.3-2ubuntu1~12.04) 4.7.3
Copyright (C) 2012 Free Software Foundation, Inc.

GNU Fortran comes with NO WARRANTY, to the extent permitted by law.
You may redistribute copies of GNU Fortran
under the terms of the GNU General Public License.
For more information about these matters, see the file named COPYING

@xianyi
Collaborator

xianyi commented Jul 27, 2013

Piledriver uses @wernsaar's Bulldozer kernels, so I think this is a computational bug in the Bulldozer kernels too.

@xianyi
Collaborator

xianyi commented Jul 27, 2013

I don't have a Bulldozer or Piledriver machine yet, so I cannot debug this issue immediately.

@staticfloat
Contributor Author

I will try to disable the PILEDRIVER codes in 0.2.7 for now. Is there an easy way to generate a DYNAMIC_ARCH executable without using the PILEDRIVER codes?

@ViralBShah
Contributor

Is this in 0.2.7?

@xianyi
Collaborator

xianyi commented Jul 30, 2013

This patch is in the hotfix-v0.2.8 branch. I think the branch will be ready to test tomorrow. If it passes the Julia tests, I will release version 0.2.8.

@ViralBShah
Contributor

Ok - ping me when I should test out the hotfix-v0.2.8 branch.

@ViralBShah
Contributor

Perhaps you may want to try release candidate tags - 0.2.8-rc1, 0.2.8-rc2 and so on - which makes it easy to test with julia. We are happy to try and verify with julia.

Also, you can very easily test with julia by editing julia/deps/Versions.make and changing v0.2.7 to any other tag or just to develop and then doing make. Then do make testall and see if anything is caught. Just delete deps/openblas-* before trying.

@staticfloat
Contributor Author

As soon as something is tagged, I will test on Travis with launchpad as well

staticfloat referenced this issue in JuliaLang/julia Jul 30, 2013
@xianyi
Collaborator

xianyi commented Jul 31, 2013

Hi all,

I just released the v0.2.8-rc1 tag. Could you try it?

Thank you

Xianyi

@ViralBShah
Contributor

Trying it now.

@ViralBShah
Contributor

Worked fine on both OS X and Linux for me.

@staticfloat
Contributor Author

@xianyi 0.2.8-rc1 worked great on all my machines, including Travis.

@xianyi
Collaborator

xianyi commented Aug 1, 2013

I will close this issue for version 0.2.8. I have created #268 to track the bug in the AMD Bulldozer kernels.

@xianyi xianyi closed this as completed Aug 1, 2013