
Single Precision MILC build hangs when using QUDA FF #158

Closed
mathiaswagner opened this issue Oct 9, 2014 · 28 comments
@mathiaswagner
Member

I built su3_rhmc_hisq from MILC with QUDA 0.7 for single precision and tried to run it with the test/su3_rhmc_hisq.1.sample-in input.
It starts, but then hangs.
If I disable the fermion force on the GPU in the Makefile with
WANT_FF_GPU = false
or use double precision for MILC (with test/su3_rhmc_hisq.2.sample-in, obviously), the run completes.

@mathiaswagner mathiaswagner added this to the QUDA 0.7.0 milestone Oct 9, 2014
@maddyscientist
Member

This will be for Justin I guess. Can you post which machine this happened on, and which CUDA toolkit and driver?

@mathiaswagner
Member Author

I observed it on a local machine in Bloomington. I used CUDA 6.0, driver 340.32, and was running on our K40.

@maddyscientist
Member

If it was running on a single K40, a big help with the debugging effort would be to run it in gdb and see where it hangs. If you recompile with HOST_DEBUG enabled, we'll even get the exact line that is hanging.

@mathiaswagner
Member Author

I will do that and let you know what I find.


@mathiaswagner
Member Author

I ran with gdb and here is the backtrace.

(gdb) bt
#0  0x0000003475a044eb in clock_gettime () from /lib64/librt.so.1
#1  0x00002aaaad953cae in ?? () from /usr/lib64/libcuda.so
#2  0x00002aaaad2b7983 in ?? () from /usr/lib64/libcuda.so
#3  0x00002aaaad2a0136 in ?? () from /usr/lib64/libcuda.so
#4  0x00002aaaad290b57 in ?? () from /usr/lib64/libcuda.so
#5  0x00002aaaad20505a in ?? () from /usr/lib64/libcuda.so
#6  0x00002aaaad206faa in ?? () from /usr/lib64/libcuda.so
#7  0x00002aaaad1d7035 in cuMemcpyDtoH_v2 () from /usr/lib64/libcuda.so
#8  0x00002aaaaaab964e in ?? () from /home/mwagner/cuda-6.0/lib64/libcudart.so.6.0
#9  0x00002aaaaaae0e38 in cudaMemcpy () from /home/mwagner/cuda-6.0/lib64/libcudart.so.6.0
#10 0x00000000004a0b93 in computeKSLinkQuda (fatlink=0x2aaacb280b80, longlink=0x0, ulink=0x2aaacb566870, inlink=, path_coeff=0x7fffffffe340, param=0x7fffffffe240, 
    method=QUDA_COMPUTE_FAT_STANDARD) at interface_quda.cpp:3556
#11 0x000000000049633c in qudaLoadUnitarizedLink(int, ._88, const double *, void *, void *, void *) (prec=, fatlink_args=, act_path_coeff=0x7fffffffe340, 
    inlink=0x2aaacb225970, fatlink=0x2aaacb280b80, ulink=0x2aaacb566870) at milc_interface.cpp:229
#12 0x0000000000444770 in load_hisq_aux_links_gpu (info=0x7fffffffe4d0, ap=0xb621d80, aux=0x2aaacb1c5e20, links=0x2aaacb338200) at ../generic_ks/fermion_links_hisq_load_gpu.c:50
#13 0x00000000004444e5 in create_hisq_links_milc (info=0x7fffffffe4d0, fn=0xb621e90, fn_deps=0xb621ef0, aux=0xb621e88, ap=0xb621d80, links=0x2aaacb338200, want_deps=0, want_back=1)
    at ../generic_ks/fermion_links_hisq_load_milc.c:765
#14 0x00000000004425ac in restore_hisq_links_t (info=0x7fffffffe4d0, hl=0xb621e80, links=0x2aaacb338200, options=0xb5bf0e0) at ../generic_ks/fermion_links_hisq_milc.c:91
#15 0x00000000004427bc in restore_milc_hisq_links_t (info=0x7fffffffe4d0, hl=0xb5b24a0, links=0x2aaacb338200, options=0xb5bf0e0) at ../generic_ks/fermion_links_hisq_milc.c:177
#16 0x0000000000442b66 in restore_fermion_links_hisq (fl=0xb5bf0e0, precision=2, phases_in=1, links=0x2aaacb338200) at ../generic_ks/fermion_links_hisq_milc.c:323
#17 0x0000000000431c74 in restore_fermion_links_from_site (fl=0xb5bf0e0, prec=2) at ../generic_ks/fermion_links_from_site.c:36
#18 0x00000000004086b6 in update_h_fermion (eps=0.0199999996, multi_x=0x2aaacb2dbe30) at update_h_rhmc.c:79
#19 0x000000000040984e in update () at update_rhmc.c:356
#20 0x0000000000492299 in main (argc=2, argv=0x7fffffffe7f8) at control.c:76

I will try to get some more insight into this and pin down the issue a bit more.

@mathiaswagner
Member Author

A bit more information from cuda-gdb when I send SIGINT while the program hangs:

(cuda-gdb) bt
#0  0x000000000d90f710 in void quda::getUnitarizedField(float2 const*, float2 const*, float2*, float2*, int*, int) ()
#1  0x000000000d90c7f8 in void quda::getUnitarizedField(float2 const*, float2 const*, float2*, float2*, int*, int)<<<(14,1,1),(96,1,1)>>> ()

@maddyscientist
Member

To rule out a synchronization problem, can you run with the environment variable CUDA_LAUNCH_BLOCKING=1?

@maddyscientist
Member

Also, this appears to be a hang in the kernel itself. Are you running with gdb or cuda-gdb? If gdb, can you try cuda-gdb with DEVICE_DEBUG enabled to see if it gives more info?

@maddyscientist
Member

Another thing to try is disabling the autotuning, to see whether it is related to the hang.

@mathiaswagner
Member Author

I will, though it might take a while to recompile.

Right now I want to check whether the issue also shows up in any of the fermion-force-related tests in QUDA. That might make debugging easier.

@mathiaswagner
Member Author

I now have first hints from cuda-gdb with DEVICE_DEBUG:

Program received signal SIGINT, Interrupt.
0x00000000158ef378 in quda::Matrix::operator() (this=0x3fff4c0, i=1, j=2) at /home/mwagner/quda07/lib/./quda_matrix.h:353
353         __device__ __host__ inline T const & operator()(int i, int j) const{
(cuda-gdb) bt
#0  0x00000000158ef378 in quda::Matrix::operator() (this=0x3fff4c0, i=1, j=2) at /home/mwagner/quda07/lib/./quda_matrix.h:353
#1  0x00000000158b3380 in quda::(anonymous namespace)::getLambdaMax (b=0x3fff4c0, lambda_max=)
    at /home/mwagner/quda07/lib/./svd_quda.h:94
#2  0x0000000015901c18 in quda::(anonymous namespace)::bdSVD (u=0x3fff4c0, v=(cached) 0x3fff550, b=(cached) 0x3fff4c0, max_it=(cached) 500)
    at /home/mwagner/quda07/lib/./svd_quda.h:493
#3  0x00000000158df2d0 in quda::(anonymous namespace)::computeSVD (m=0x3fff4c0, u=(cached) 0x3fff550, v=(cached) 0x3fff4c0, 
    singular_values=) at /home/mwagner/quda07/lib/./svd_quda.h:649
#4  0x00000000158d6b68 in quda::unitarizeLinkMILC (in=0x3fff4c0, result=(cached) 0x3fff550) at unitarize_links_quda.cu:285
#5  0x0000000015917d20 in quda::getUnitarizedField<<<(11,1,1),(128,1,1)>>> (inlink_even=0xf05bc0000, inlink_odd=0xf05bf5400, 
    outlink_even=0xf05c2a800, outlink_odd=0xf05c5fc00, num_failures=0xf05dc0000, threads=1296) at unitarize_links_quda.cu:379

It seems like the single precision run spends a lot of time in the SVD. I am not sure what is going on there, as the double precision run did just fine (it completed within seconds), and I thought the unitarization is always done in double precision ...
So far I have always killed the single precision run after some minutes without any output, but maybe I can just let it run overnight and see what happens ...

@maddyscientist
Member

This location in svd_quda.h where it is reporting is a trivial multiplication. The thread can't really hang there; I suspect it is stuck in a loop. Perhaps the SVD isn't converging or something. I'll keep looking, though hopefully Justin can comment.

@mathiaswagner
Member Author

Yes, it is stuck in the loop. I stopped and continued it in cuda-gdb and it was at several different locations in the svd. I will let the job run for some hours to see whether it gets out of the loop at some point.

@maddyscientist
Member

The do-while condition is

((b(0,1) != 0.0 || b(1,2) != 0.0) && it < max_it)

so if nothing else it should exit the loop once it reaches max_it. However, I see in the code that while it declares

int it;

it never actually increments "it", so if the SVD doesn't converge, the loop gets stuck. That's a definite bug.

The question that remains is why it is not converging.

@mathiaswagner
Member Author

I will add the missing increment. After that we might get some more data to work with.

@maddyscientist
Member

But I see that

#define SVDPREC 1e-11

A tolerance that tight will likely never be reached in single precision, which is probably why it fails to converge. So I think the solution is to relax SVDPREC to something achievable in single precision, like 1e-7.

Perhaps the SVD should always be done in double anyway, but we should fix this hang regardless.

Justin, thoughts?

@mathiaswagner
Member Author

I only have my iPad with me now, but I think there is a line in unitarize_links.cpp (or a similarly named file) that says the unitarization is always done in double.
So I assumed the SVD would always be in double, but I have to check.

@maddyscientist
Member

I think you're correct, line 376 of unitarize_links_quda.cu

    // Unitarization is always done in double precision
    Matrix<double2,3> v, result;

So the question then is just why it isn't converging.

@jpfoley
Member

jpfoley commented Oct 11, 2014

Sorry for not commenting earlier. These email messages aren't sent to me automatically, and Jessica and I were at a wedding rehearsal this afternoon. It's highly likely there's a problem with the SVD. There are so many branches that I don't think they've all been checked properly. I guess the single-precision run is exposing an SVD bug. More generally, I don't think the SVD should be used at all, because even if it works properly it serializes the calculation. There must be a way of doing the unitarization in double precision with a cutoff that lets us avoid the SVD while still ensuring the resulting links are unitary to single-precision accuracy. I'm also wondering why the Cayley-Hamilton based unitarization is failing in this example. Are the input gauge fields particularly coarse?

@jpfoley
Member

jpfoley commented Oct 13, 2014

I'm wondering again what the input links look like when the SVD hangs. For example, I'm not sure what would happen if the input link components were all zero. Are the input matrices valid links, I wonder?

@mathiaswagner
Copy link
Member Author

This is still in my queue for today, but for now my best guess is: if I run the job with the fermion force on the CPU, everything works fine. It might just be that the MILC code does not suffer from the infinite loop that seems to occur in QUDA.
Let me fix that first; then I will look into it again and let you know what I find.

@jpfoley
Member

jpfoley commented Oct 13, 2014

Right, so it could either be a problem with the QUDA implementation of the SVD or it could be that the links are not being passed to QUDA correctly.

@mathiaswagner
Member Author

Well, I fixed the stopping criterion, and now I end up with NaNs. But at least the program now detects an error.

@jpfoley
Member

jpfoley commented Oct 16, 2014

The bug in the MILC run is arising in the outer product calculation, which isn't covered by the QUDA test suite. The segmentation fault in the internal test is therefore a separate issue.

@jpfoley
Member

jpfoley commented Oct 16, 2014

There really should be an internal test for the outer-product calc though.

@mathiaswagner
Member Author

I assumed that. Anyhow, it would probably be good to get the outer product into the QUDA test suite as well.

@maddyscientist
Member

Yes. Make an issue of this so we don't forget. Probably something Justin has to do, since he's the instigator of the outer-product code.


@mathiaswagner
Member Author

Fixed in d1a5967. Thanks Justin!
