GPU_COMMS and clover bicg #231
Comments
Also true for other fermion actions, e.g., Wilson fermions. This bug is related to the "odd-odd" type of preconditioning. |
Might it be that we accidentally overwrite the solution? The reliable updates seem to work? On May 11, 2015, at 17:37, Alexei Strelchenko <[email protected]> wrote: Also true for other fermion actions, e.g., Wilson fermions. This bug is related to the "odd-odd" type of preconditioning. |
I revisited my HISQ eigCG runs, and it seems that this bug caused the reported eigCG breakdown. With GPU_COMM=no, HISQ eigCG works normally. |
Just pointing out that I have a live branch (hotfix/gdr) at the moment for fixing some ongoing GPU_COMMS issues. If you have a simple way to reproduce this, I can perhaps include the fix in this branch. |
Including mixed precision? On May 12, 2015, at 18:41, Alexei Strelchenko <[email protected]> wrote: Revisited my HISQ eigCG, and it seems that this bug caused the reported problem with the eigCG breakdown. With GPU_COMM=no, HISQ eigCG works normally. |
Hopefully any outstanding bugs can be fixed in this branch. |
Forgot to mention: yes, that was mixed precision. BTW, I had no problems with full-precision eigCG. To reproduce the bug, you can at least run a bicgstab inversion for Wilson with odd-odd preconditioning; if this is successful, I can check eigCG as well. |
Does that mean that all the eigCG issues you mentioned in the call (apart from features you still might want to add) are due to GPU_COMMS, and we don't need a separate issue? That would be great news. |
I think so, yes. |
Alexei, can you try and reproduce the issue you had using the hotfix/gdr branch? When compiling use --enable-host-debug, as I've added some additional memory checking (see lib/comm_common.cpp) for the communicators to ensure that only valid buffers are passed to MPI/QMP. |
My GPU_COMMS fixes are now complete and I've created a pull request (#238). |
Let me check; our system is currently offline, so it will take some time. |
Update : (mixed precision) eigCG still does not converge (when GPU_COMM is on). |
I assume this can be reproduced simply by running deflation_test? Can you tell me the command-line argument I should run to reproduce this? |
This is not a regular test: I'm using custom MILC code for the HISQ eigCG tests. The solver converges when GPU_COMM=no, while the 'yes' option makes it diverge. |
Can you try to reproduce it then in a QUDA internal test? E.g., if you make a staggered_deflation_test (which should be there anyway), does this also cause the problem? |
Some results taken from internal clover bicgstab tests (random fields, 2 gpus):
Note that RUN #2 is broken. Also from case ii) we see that switching off GPU_COMMS removes the problem : both even_even_asym and odd_odd_asym are consistent. |
Ok, thanks Alexei. That's good information for me to look into this. |
Narrowing the bug search space: additional inputs from my side.
That is, regardless of the GPU_COMMS option, both cases are consistent if solution_type = QUDA_MATPC_SOLUTION. So the bug affects the preparation/reconstruction methods (and shows up when GPU_COMMS = yes). Investigating. |
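For context, the role of the prepare/reconstruct steps can be sketched in generic even-odd notation. This is an illustration under assumed notation only; QUDA's own normalization and its asymmetric variants differ in detail. Writing the full system in block form,

$$
\begin{pmatrix} M_{ee} & M_{eo} \\ M_{oe} & M_{oo} \end{pmatrix}
\begin{pmatrix} x_e \\ x_o \end{pmatrix}
=
\begin{pmatrix} b_e \\ b_o \end{pmatrix},
$$

an odd-odd preconditioned solve proceeds in three steps:

$$
\begin{aligned}
\text{prepare:} \quad & \tilde b_o = b_o - M_{oe} M_{ee}^{-1} b_e, \\
\text{solve:} \quad & \left( M_{oo} - M_{oe} M_{ee}^{-1} M_{eo} \right) x_o = \tilde b_o, \\
\text{reconstruct:} \quad & x_e = M_{ee}^{-1} \left( b_e - M_{eo} x_o \right).
\end{aligned}
$$

With solution_type = QUDA_MATPC_SOLUTION only the middle step is exercised, which is consistent with the observation above that the discrepancy enters through the first and last steps.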
I've also found that
|
I have traced the dslash_test failures to
|
yes, clover prepare/reconstruct calls DiracWilson::DslashXpay , so we have similar problem for clover solvers |
Ok, I've worked out the problem. When creating a full-field, together with its even and odd subsets, the full-field is allocated and even/odd subsets are references to the full field. The even subset points to the first half of the full field, and the odd subset points to the second half. This involved a hack for setting the pointer for the odd subset, whereby after the odd subset is created, the pointers are adjusted, e.g. cudaColorSpinorField::create:
For the
With this patch in place, all tests seem to pass now. I'll create a pull request with this patch. |
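The aliasing described above can be illustrated with a minimal, self-contained sketch. It is not QUDA's code: `FieldView`, `halo`, `refresh_derived`, and `make_subset` are hypothetical names, and the thread does not show which pointers were actually affected. The sketch only shows how a pointer derived from the base before the odd-subset offset is applied can end up pointing into the even half unless it is recomputed afterwards.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a field that aliases part of a parent buffer.
struct FieldView {
    float* data = nullptr;    // start of this view's elements
    std::size_t length = 0;   // number of elements in this view
    float* halo = nullptr;    // derived pointer: last `halo_len` elements of the view

    // Recompute every pointer that is derived from `data`.
    void refresh_derived(std::size_t halo_len) {
        halo = data + (length - halo_len);
    }
};

// Build an even or odd view into `full`. The view is first created against
// the start of the buffer; for the odd subset the base pointer is then
// shifted to the second half. If `refresh_after_offset` is false, `halo`
// keeps the value computed before the shift and is therefore stale.
FieldView make_subset(std::vector<float>& full, bool odd,
                      std::size_t halo_len, bool refresh_after_offset) {
    FieldView v;
    v.length = full.size() / 2;
    v.data = full.data();           // initially the first (even) half
    v.refresh_derived(halo_len);    // derived from the unadjusted base
    if (odd) {
        v.data += v.length;         // pointer adjustment for the odd subset
        if (refresh_after_offset) {
            v.refresh_derived(halo_len);  // keep derived pointers in sync
        }
    }
    return v;
}
```

For a 16-element buffer with `halo_len = 2`, the odd view built without the refresh has `halo` at offset 6 (inside the even half), whereas the refreshed view has it at offset 14 (inside the odd half), so any communication that reads or writes through the stale pointer would touch the wrong parity.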
…be updated after the subsets have been created. This should fix #231.
Can we close this bug now then as well? |
GPU_COMMS and clover might still have other issues but everything mentioned here has been addressed. |
Enabling GPU_COMMS results in large true residual in the bicgstab inverter with clover fermions. Alexei has reproduced this issue with the internal invert_test.
This issue originates from #224.