
Single Precision MILC build hangs when using QUDA FF #158

Closed
mathiaswagner opened this issue Oct 9, 2014 · 28 comments
@mathiaswagner
Member

I built su3_rhmc_hisq from MILC with QUDA 0.7 for single precision and tried to run it with the test/su3_rhmc_hisq.1.sample-in input.
It starts, but then hangs.
If I disable the fermion force on the GPU in the Makefile with
WANT_FF_GPU = false
or use double precision for MILC (with test/su3_rhmc_hisq.2.sample-in, obviously), the run completes.

@mathiaswagner mathiaswagner added this to the QUDA 0.7.0 milestone Oct 9, 2014
@maddyscientist
Member

This will be for Justin I guess. Can you post which machine this happened on, and which CUDA toolkit and driver?

@mathiaswagner
Member Author

I observed it on a local machine in Bloomington. I used CUDA 6.0, driver 340.32, and was running on our K40.

@maddyscientist
Member

If it was running on a single K40, a big help with the debugging effort would be to run it in gdb and see where it hangs. If you recompile with HOST_DEBUG enabled, we'll even get the exact line that is hanging.

@mathiaswagner
Member Author

I will do that and let you know what I find.


@mathiaswagner
Member Author

I ran with gdb and here is the backtrace.

(gdb) bt
#0  0x0000003475a044eb in clock_gettime () from /lib64/librt.so.1
#1  0x00002aaaad953cae in ?? () from /usr/lib64/libcuda.so
#2  0x00002aaaad2b7983 in ?? () from /usr/lib64/libcuda.so
#3  0x00002aaaad2a0136 in ?? () from /usr/lib64/libcuda.so
#4  0x00002aaaad290b57 in ?? () from /usr/lib64/libcuda.so
#5  0x00002aaaad20505a in ?? () from /usr/lib64/libcuda.so
#6  0x00002aaaad206faa in ?? () from /usr/lib64/libcuda.so
#7  0x00002aaaad1d7035 in cuMemcpyDtoH_v2 () from /usr/lib64/libcuda.so
#8  0x00002aaaaaab964e in ?? () from /home/mwagner/cuda-6.0/lib64/libcudart.so.6.0
#9  0x00002aaaaaae0e38 in cudaMemcpy () from /home/mwagner/cuda-6.0/lib64/libcudart.so.6.0
#10 0x00000000004a0b93 in computeKSLinkQuda (fatlink=0x2aaacb280b80, longlink=0x0, ulink=0x2aaacb566870, inlink=, path_coeff=0x7fffffffe340, param=0x7fffffffe240, 
    method=QUDA_COMPUTE_FAT_STANDARD) at interface_quda.cpp:3556
#11 0x000000000049633c in qudaLoadUnitarizedLink(int, ._88, const double *, void *, void *, void *) (prec=, fatlink_args=, act_path_coeff=0x7fffffffe340, 
    inlink=0x2aaacb225970, fatlink=0x2aaacb280b80, ulink=0x2aaacb566870) at milc_interface.cpp:229
#12 0x0000000000444770 in load_hisq_aux_links_gpu (info=0x7fffffffe4d0, ap=0xb621d80, aux=0x2aaacb1c5e20, links=0x2aaacb338200) at ../generic_ks/fermion_links_hisq_load_gpu.c:50
#13 0x00000000004444e5 in create_hisq_links_milc (info=0x7fffffffe4d0, fn=0xb621e90, fn_deps=0xb621ef0, aux=0xb621e88, ap=0xb621d80, links=0x2aaacb338200, want_deps=0, want_back=1)
    at ../generic_ks/fermion_links_hisq_load_milc.c:765
#14 0x00000000004425ac in restore_hisq_links_t (info=0x7fffffffe4d0, hl=0xb621e80, links=0x2aaacb338200, options=0xb5bf0e0) at ../generic_ks/fermion_links_hisq_milc.c:91
#15 0x00000000004427bc in restore_milc_hisq_links_t (info=0x7fffffffe4d0, hl=0xb5b24a0, links=0x2aaacb338200, options=0xb5bf0e0) at ../generic_ks/fermion_links_hisq_milc.c:177
#16 0x0000000000442b66 in restore_fermion_links_hisq (fl=0xb5bf0e0, precision=2, phases_in=1, links=0x2aaacb338200) at ../generic_ks/fermion_links_hisq_milc.c:323
#17 0x0000000000431c74 in restore_fermion_links_from_site (fl=0xb5bf0e0, prec=2) at ../generic_ks/fermion_links_from_site.c:36
#18 0x00000000004086b6 in update_h_fermion (eps=0.0199999996, multi_x=0x2aaacb2dbe30) at update_h_rhmc.c:79
#19 0x000000000040984e in update () at update_rhmc.c:356
#20 0x0000000000492299 in main (argc=2, argv=0x7fffffffe7f8) at control.c:76

I will try to get some more insight into this and pin down the issue a bit more.

@mathiaswagner
Member Author

A bit more information from cuda-gdb when I send SIGINT while the program hangs:

(cuda-gdb) bt
#0  0x000000000d90f710 in void quda::getUnitarizedField(float2 const*, float2 const*, float2*, float2*, int*, int) ()
#1  0x000000000d90c7f8 in void quda::getUnitarizedField(float2 const*, float2 const*, float2*, float2*, int*, int)<<<(14,1,1),(96,1,1)>>> ()

@maddyscientist
Member

To rule out a synchronization problem, can you run with the environment variable CUDA_LAUNCH_BLOCKING=1?

@maddyscientist
Member

Also, this appears to be a hang in the kernel itself. Are you running with gdb or cuda-gdb? If gdb, can you try cuda-gdb with DEVICE_DEBUG enabled to see if it gives more info?

@maddyscientist
Member

Another thing to try is disabling the autotuning, to see whether it is related to the hang.

@mathiaswagner
Member Author

I will, though it might take a while to recompile.

Right now I want to check whether the issue also shows up in any of the fermion-force-related tests in QUDA. That might make debugging easier.

@mathiaswagner
Member Author

I now have first hints from cuda-gdb with DEVICE_DEBUG:

Program received signal SIGINT, Interrupt.
0x00000000158ef378 in quda::Matrix::operator() (this=0x3fff4c0, i=1, j=2) at /home/mwagner/quda07/lib/./quda_matrix.h:353
353         __device__ __host__ inline T const & operator()(int i, int j) const{
(cuda-gdb) bt
#0  0x00000000158ef378 in quda::Matrix::operator() (this=0x3fff4c0, i=1, j=2) at /home/mwagner/quda07/lib/./quda_matrix.h:353
#1  0x00000000158b3380 in quda::(anonymous namespace)::getLambdaMax (b=0x3fff4c0, lambda_max=)
    at /home/mwagner/quda07/lib/./svd_quda.h:94
#2  0x0000000015901c18 in quda::(anonymous namespace)::bdSVD (u=0x3fff4c0, v=(cached) 0x3fff550, b=(cached) 0x3fff4c0, max_it=(cached) 500)
    at /home/mwagner/quda07/lib/./svd_quda.h:493
#3  0x00000000158df2d0 in quda::(anonymous namespace)::computeSVD (m=0x3fff4c0, u=(cached) 0x3fff550, v=(cached) 0x3fff4c0, 
    singular_values=) at /home/mwagner/quda07/lib/./svd_quda.h:649
#4  0x00000000158d6b68 in quda::unitarizeLinkMILC (in=0x3fff4c0, result=(cached) 0x3fff550) at unitarize_links_quda.cu:285
#5  0x0000000015917d20 in quda::getUnitarizedField<<<(11,1,1),(128,1,1)>>> (inlink_even=0xf05bc0000, inlink_odd=0xf05bf5400, 
    outlink_even=0xf05c2a800, outlink_odd=0xf05c5fc00, num_failures=0xf05dc0000, threads=1296) at unitarize_links_quda.cu:379

It seems like the single precision run spends a lot of time in the SVD. I am not sure what is going on there, as the double precision run did just fine (it completed within seconds), and I thought the unitarization is always done in double precision ...
So far I have always killed the single precision run after some minutes without any output, but maybe I can just let it run overnight and see what happens ...

@maddyscientist
Member

This location in svd_quda.h where it is reporting is a trivial multiplication. The thread can't really hang there; I suspect it is stuck in a loop. Perhaps the SVD isn't converging or something. I'll keep looking, though hopefully Justin can comment.

@mathiaswagner
Member Author

Yes, it is stuck in the loop. I stopped and continued it in cuda-gdb and it was at several different locations in the svd. I will let the job run for some hours to see whether it gets out of the loop at some point.

@maddyscientist
Member

The do-while condition is

((b(0,1) != 0.0 || b(1,2) != 0.0) && it < max_it)

so if nothing else it should exit the loop once it reaches max_it. However, I see in the code that while it declares

int it;

it never actually increments "it", so if the SVD doesn't converge, the loop gets stuck. That's a definite bug.

The question that remains is why it is not converging.

@mathiaswagner
Member Author

I will add the missing increment. After that we might get some more data to work with.

@maddyscientist
Member

But I see that

#define SVDPREC 1e-11

A tolerance that tight will likely never be reached in single precision, which is probably why it fails to converge. So I think the solution is to relax SVDPREC to something achievable in single precision, like 1e-7.

Perhaps the SVD should always be done in double anyway, but we should fix this hang regardless.

Justin, thoughts?

@mathiaswagner
Member Author

I only have my iPad with me now, but I think there is a line in unitarize_links.cpp (or a similarly named file) that says the unitarization is always done in double.
So I assumed the SVD would always be in double, but I have to check.

@maddyscientist
Member

I think you're correct, line 376 of unitarize_links_quda.cu

    // Unitarization is always done in double precision
    Matrix<double2,3> v, result;

So the question then is just why it isn't converging.

@jpfoley
Member

jpfoley commented Oct 11, 2014

Sorry for not commenting earlier. These email messages aren't sent to me automatically, and Jessica and I were at a wedding rehearsal this afternoon. It's highly likely there's a problem with the SVD. There are so many branches that I don't think they've all been checked properly. I guess the single-precision run is exposing an SVD bug. More generally, I don't think the SVD should be used at all, because even if it works properly it serializes the calculation. There must be a way of doing the unitarization in double precision with a cutoff that lets us avoid the SVD while still ensuring the resulting links are unitary to single-precision accuracy. I'm also wondering why the Cayley-Hamilton based unitarization is failing in this example. Are the input gauge fields particularly coarse?

@jpfoley
Member

jpfoley commented Oct 13, 2014

I'm wondering again what the input links look like when the SVD hangs. For example, I'm not sure what would happen if the input link components were all zero. Are the input matrices valid links, I wonder?

@mathiaswagner
Copy link
Member Author

This is still in my queue for today, but for now my best guess is: if I run the job with the fermion force on the CPU, everything works fine. It might just be that the MILC code does not suffer from the infinite loop that seems to occur in QUDA.
Let me fix that first; then I will look into it again and let you know what I find.

@jpfoley
Member

jpfoley commented Oct 13, 2014

Right, so it could either be a problem with the QUDA implementation of the SVD or it could be that the links are not being passed to QUDA correctly.

@mathiaswagner
Member Author

Well, I fixed the stopping criterion, and now I end up with NaNs. But at least the program now detects an error.

@jpfoley
Member

jpfoley commented Oct 16, 2014

The bug in the MILC run is arising in the outer product calculation, which isn't covered by the QUDA test suite. The segmentation fault in the internal test is therefore a separate issue.

@jpfoley
Member

jpfoley commented Oct 16, 2014

There really should be an internal test for the outer-product calc though.

@mathiaswagner
Member Author

I assumed that. Anyhow, it would probably be good to get the outer product into the QUDA test suite as well.

@maddyscientist
Member

Yes. Make an issue of this so we don't forget. Probably something Justin has to do, since he's the instigator of the outer-product code.


@mathiaswagner
Member Author

Fixed in d1a5967. Thanks Justin!
