Single Precision MILC build hangs when using QUDA FF #158
I built su3_rhmc_hisq from MILC with QUDA 0.7 for single precision and tried to run it with the test/su3_rhmc_hisq.1.sample-in input. While it does start, it hangs after a while. If I disable the FF in the Makefile (WANT_FF_GPU = false) or use double precision for MILC (obviously with test/su3_rhmc_hisq.2.sample-in), the run completes.

Comments
This will be for Justin I guess. Can you post which machine this happened on, and which CUDA toolkit and driver?
I observed it on a local machine in Bloomington. I used CUDA 6.0, driver 340.32, and was running on our K40. (Mathias Wagner)
If it was running on a single K40, a big help with the debugging effort would be to run it in gdb and see where it hangs. If you recompile with HOST_DEBUG enabled, we'll even get the exact line that is hanging.
I will do that and let you know what happens.
I ran with gdb and here is the backtrace:

```
(gdb) bt
#0  0x0000003475a044eb in clock_gettime () from /lib64/librt.so.1
#1  0x00002aaaad953cae in ?? () from /usr/lib64/libcuda.so
#2  0x00002aaaad2b7983 in ?? () from /usr/lib64/libcuda.so
#3  0x00002aaaad2a0136 in ?? () from /usr/lib64/libcuda.so
#4  0x00002aaaad290b57 in ?? () from /usr/lib64/libcuda.so
#5  0x00002aaaad20505a in ?? () from /usr/lib64/libcuda.so
#6  0x00002aaaad206faa in ?? () from /usr/lib64/libcuda.so
#7  0x00002aaaad1d7035 in cuMemcpyDtoH_v2 () from /usr/lib64/libcuda.so
#8  0x00002aaaaaab964e in ?? () from /home/mwagner/cuda-6.0/lib64/libcudart.so.6.0
#9  0x00002aaaaaae0e38 in cudaMemcpy () from /home/mwagner/cuda-6.0/lib64/libcudart.so.6.0
#10 0x00000000004a0b93 in computeKSLinkQuda (fatlink=0x2aaacb280b80, longlink=0x0, ulink=0x2aaacb566870, inlink=<optimized out>, path_coeff=0x7fffffffe340, param=0x7fffffffe240, method=QUDA_COMPUTE_FAT_STANDARD) at interface_quda.cpp:3556
#11 0x000000000049633c in qudaLoadUnitarizedLink(int, ._88, const double *, void *, void *, void *) (prec=<optimized out>, fatlink_args=<optimized out>, act_path_coeff=0x7fffffffe340, inlink=0x2aaacb225970, fatlink=0x2aaacb280b80, ulink=0x2aaacb566870) at milc_interface.cpp:229
#12 0x0000000000444770 in load_hisq_aux_links_gpu (info=0x7fffffffe4d0, ap=0xb621d80, aux=0x2aaacb1c5e20, links=0x2aaacb338200) at ../generic_ks/fermion_links_hisq_load_gpu.c:50
#13 0x00000000004444e5 in create_hisq_links_milc (info=0x7fffffffe4d0, fn=0xb621e90, fn_deps=0xb621ef0, aux=0xb621e88, ap=0xb621d80, links=0x2aaacb338200, want_deps=0, want_back=1) at ../generic_ks/fermion_links_hisq_load_milc.c:765
#14 0x00000000004425ac in restore_hisq_links_t (info=0x7fffffffe4d0, hl=0xb621e80, links=0x2aaacb338200, options=0xb5bf0e0) at ../generic_ks/fermion_links_hisq_milc.c:91
#15 0x00000000004427bc in restore_milc_hisq_links_t (info=0x7fffffffe4d0, hl=0xb5b24a0, links=0x2aaacb338200, options=0xb5bf0e0) at ../generic_ks/fermion_links_hisq_milc.c:177
#16 0x0000000000442b66 in restore_fermion_links_hisq (fl=0xb5bf0e0, precision=2, phases_in=1, links=0x2aaacb338200) at ../generic_ks/fermion_links_hisq_milc.c:323
#17 0x0000000000431c74 in restore_fermion_links_from_site (fl=0xb5bf0e0, prec=2) at ../generic_ks/fermion_links_from_site.c:36
#18 0x00000000004086b6 in update_h_fermion (eps=0.0199999996, multi_x=0x2aaacb2dbe30) at update_h_rhmc.c:79
#19 0x000000000040984e in update () at update_rhmc.c:356
#20 0x0000000000492299 in main (argc=2, argv=0x7fffffffe7f8) at control.c:76
```

I will try to get some more insight into this and pin down the issue a bit more.
A bit more information from cuda-gdb when I send SIGINT while the program hangs:

```
(cuda-gdb) bt
#0  0x000000000d90f710 in void quda::getUnitarizedField(float2 const*, float2 const*, float2*, float2*, int*, int) ()
#1  0x000000000d90c7f8 in void quda::getUnitarizedField(float2 const*, float2 const*, float2*, float2*, int*, int)<<<(14,1,1),(96,1,1)>>> ()
```
To rule out a synchronization problem, can you run with the environment variable CUDA_LAUNCH_BLOCKING=1?
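(For context: CUDA_LAUNCH_BLOCKING=1 makes every kernel launch synchronous, e.g. running the binary as CUDA_LAUNCH_BLOCKING=1 ./su3_rhmc_hisq ..., so if the hang were caused by a missing synchronization rather than by the kernel itself, the host-side symptom should change or disappear.)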
Also, this appears to be a hang in the kernel itself. Are you running with gdb or cuda-gdb? If gdb, can you try cuda-gdb with DEVICE_DEBUG enabled to see if it gives more info?
Another thing to try is to disable the tuning, to see if this is related to the hang.
I will. It might however take a while to recompile. Right now I wanted to check whether the issue also shows up in any of the fermion force related tests from QUDA. That might make debugging easier.
I now have first hints from cuda-gdb with DEVICE_DEBUG:

```
Program received signal SIGINT, Interrupt.
0x00000000158ef378 in quda::Matrix::operator() (this=0x3fff4c0, i=1, j=2) at /home/mwagner/quda07/lib/./quda_matrix.h:353
353       __device__ __host__ inline T const & operator()(int i, int j) const{
(cuda-gdb) bt
#0  0x00000000158ef378 in quda::Matrix::operator() (this=0x3fff4c0, i=1, j=2) at /home/mwagner/quda07/lib/./quda_matrix.h:353
#1  0x00000000158b3380 in quda::(anonymous namespace)::getLambdaMax (b=0x3fff4c0, lambda_max=<optimized out>) at /home/mwagner/quda07/lib/./svd_quda.h:94
#2  0x0000000015901c18 in quda::(anonymous namespace)::bdSVD (u=0x3fff4c0, v=(cached) 0x3fff550, b=(cached) 0x3fff4c0, max_it=(cached) 500) at /home/mwagner/quda07/lib/./svd_quda.h:493
#3  0x00000000158df2d0 in quda::(anonymous namespace)::computeSVD (m=0x3fff4c0, u=(cached) 0x3fff550, v=(cached) 0x3fff4c0, singular_values=<optimized out>) at /home/mwagner/quda07/lib/./svd_quda.h:649
#4  0x00000000158d6b68 in quda::unitarizeLinkMILC (in=0x3fff4c0, result=(cached) 0x3fff550) at unitarize_links_quda.cu:285
#5  0x0000000015917d20 in quda::getUnitarizedField<<<(11,1,1),(128,1,1)>>> (inlink_even=0xf05bc0000, inlink_odd=0xf05bf5400, outlink_even=0xf05c2a800, outlink_odd=0xf05c5fc00, num_failures=0xf05dc0000, threads=1296) at unitarize_links_quda.cu:379
```

It seems like the single precision run spends a lot of time in the SVD. I am not sure what is going on there, as the double precision run did just fine (the run completed within seconds), and I thought the unitarization is always done in single precision ...
The location in svd_quda.h where it is reporting is a trivial multiplication. The thread can't really be hung there; I suspect it is stuck in a loop. Perhaps the SVD isn't converging or something. I'll keep looking, though hopefully Justin can comment.
Yes, it is stuck in the loop. I stopped and continued it in cuda-gdb and it was at several different locations in the SVD. I will let the job run for some hours to see whether it gets out of the loop at some point.
The do-while condition includes a check against max_it, so if nothing else it should exit the loop once it reaches max_it. However, I see in the code that while it declares the iteration counter "it", it doesn't actually increment "it" at all, hence if the SVD doesn't converge, it will get stuck. So that's a definite bug. The question that remains is why is it not converging?
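For illustration, here is a minimal sketch of that failure mode; the names and loop body are hypothetical stand-ins, not the actual svd_quda.h code:

```cpp
// Hypothetical sketch of the non-terminating pattern described above.
void bidiagSVDIterate(double off_diag, double tol, int max_it) {
  int it = 0;  // iteration counter is declared ...
  do {
    // ... one sweep of the iterative SVD would go here, shrinking
    // off_diag toward zero when the iteration converges ...
    it++;  // ... but this increment was missing in the reported code,
           // so a non-converging sweep could never reach max_it and
           // the loop could only exit through the tolerance test.
  } while (off_diag > tol && it < max_it);
}
```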
I will add the missing increment. After that we might get some more data to work with.
But I see that SVDPREC is set to a tolerance tighter than single precision can achieve, which will likely never converge in single precision, so that's probably why it doesn't converge. So I think the solution is that we need to reduce SVDPREC to something achievable like 1e-7 for single precision. Perhaps the SVD should always be done in double anyway, but we should fix this hang regardless. Justin, thoughts?
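To put a number on "achievable": single precision carries about 24 significand bits, so relative residuals below roughly 1.2e-7 are simply out of reach. A standalone check (a sketch, not QUDA code):

```cpp
#include <cstdio>
#include <limits>

int main() {
  // Machine epsilon bounds the smallest relative residual an iterative
  // method can reach; a convergence tolerance below it is unattainable.
  std::printf("float  epsilon: %.3e\n",
              static_cast<double>(std::numeric_limits<float>::epsilon()));  // ~1.192e-07
  std::printf("double epsilon: %.3e\n",
              std::numeric_limits<double>::epsilon());                      // ~2.220e-16
  // Hence a single-precision SVD cannot meet a tolerance much below 1e-7,
  // matching the proposed fix of relaxing SVDPREC for single precision.
  return 0;
}
```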
I only have my iPad with me now, but I think there is a line that says to always do the unitarization in double in unitarize_links.cpp (or a similar filename).
I think you're correct: line 376 of unitarize_links_quda.cu does force the unitarization to run in double precision. So the question then is just why isn't it converging?
Sorry for not commenting earlier. These email messages aren't sent to me automatically, and Jessica and I were at a wedding rehearsal this afternoon. It's highly likely there's a problem with the SVD. There are so many branches that I don't think they've all been checked properly. I guess the single-precision run is exposing an SVD bug. More generally, I don't think the SVD should be used at all, because even if it works properly it serializes the calculation. There must be a way of doing the unitarization in double precision with a cutoff that allows us to avoid using the SVD in a way that ensures the resulting links are unitary to single-precision accuracy. I'm also wondering why the Cayley-Hamilton based unitarization is failing in this example. Are the input gauge fields particularly coarse?
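(For reference, assuming the standard HISQ construction: the unitarization projects the fat link W to U = W (W†W)^(-1/2). The Cayley-Hamilton route evaluates the inverse square root from the characteristic polynomial of W†W, which is fast but can misbehave for ill-conditioned links; the SVD is the fallback for exactly those cases, which is why its convergence matters here.)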
I'm wondering again what the input links look like when the SVD hangs. For example, I'm not sure what would happen if the input link components were all zero. Are the input matrices valid links, I wonder?
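One cheap way to test that hypothesis would be a host-side sanity check on the input links before they reach the unitarization. A sketch assuming row-major 3x3 complex matrices; this is a hypothetical helper, not an existing MILC or QUDA routine:

```cpp
#include <cmath>
#include <complex>

using Complex = std::complex<double>;

// Flags obviously broken 3x3 input links (row-major storage): NaN/Inf
// entries or a numerically all-zero matrix, either of which could feed
// garbage into the SVD.
bool linkLooksValid(const Complex U[9]) {
  double norm2 = 0.0;
  for (int i = 0; i < 9; ++i) {
    if (!std::isfinite(U[i].real()) || !std::isfinite(U[i].imag())) return false;
    norm2 += std::norm(U[i]);  // accumulates |U_ij|^2
  }
  return norm2 > 1e-30;  // an all-zero link fails this test
}
```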
This is still in my queue for today, but for now my best guess is: if I run the job with the FF on the CPU, everything works fine. Still, it might just be that MILC does not suffer from the infinite loop that seems to happen with QUDA.
Right, so it could either be a problem with the QUDA implementation of the SVD or it could be that the links are not being passed to QUDA correctly.
Well, I fixed the stopping criterion and now I end up with NaN. But at least the program now detects an error.
The bug in MILC is arising in the outer product calculation, which isn't included in the QUDA test suite. Therefore, the segmentation fault in the internal test is a separate issue.
There really should be an internal test for the outer-product calc though.
I assumed that. Anyhow, it would probably be good to also get the outer product into the QUDA test suite?
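For what such a test might check: the color outer product of two 3-component vectors is trivial to reproduce on the host and compare against the GPU result element by element. A sketch with hypothetical names, not the actual QUDA test harness:

```cpp
#include <complex>

using Complex = std::complex<double>;

// Host reference for the color outer product M = v w^dagger, i.e.
// M_ij = v_i * conj(w_j); a GPU test would compare its output against
// this. Hypothetical helper, not part of the existing QUDA test suite.
void outerProductReference(const Complex v[3], const Complex w[3], Complex M[9]) {
  for (int i = 0; i < 3; ++i)
    for (int j = 0; j < 3; ++j)
      M[i * 3 + j] = v[i] * std::conj(w[j]);
}
```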
Yes. Make an issue of this so we don't forget. Probably something Justin has to do since he's the instigator of the outer-product code.
Fixed in d1a5967. Thanks Justin! |