
GPU_COMMS and clover bicg #231

Closed
mathiaswagner opened this issue May 6, 2015 · 25 comments · Fixed by #242
@mathiaswagner
Member

Enabling GPU_COMMS results in a large true residual in the BiCGstab inverter with clover fermions. Alexei has reproduced this issue with the internal invert_test.

This issue originates from #224.

@alexstrel
Member

Also true for other fermion actions, e.g., Wilson fermions. This bug is related to the "odd-odd" type of preconditioning.

@mathiaswagner
Member Author

Might it be that we accidentally overwrite the solution?

The reliable updates seem to work?


@alexstrel
Member

I revisited my HISQ eigCG, and it seems this bug is behind the reported eigCG breakdown: with GPU_COMMS=no, HISQ eigCG works normally.

@maddyscientist
Member

Just pointing out that I have a live branch (hotfix/gdr) at the moment for fixing some ongoing GPU_COMMS issues. If you have a simple way to reproduce this, I can perhaps include the fix in this branch.

@mathiaswagner
Member Author

Including mixed precision?


@maddyscientist
Member

Hopefully any outstanding bugs can be fixed in this branch.

@alexstrel
Member

Forgot to mention: yes, that was mixed precision. BTW, I had no problems with full-precision eigCG anyway. To reproduce the bug, you can at least run a BiCGstab inversion for Wilson with odd-odd preconditioning; if that is successful, then I can check eigCG as well.

@mathiaswagner
Member Author

Does that mean that all the eigCG issues you mentioned in the call (apart from features you still might want to add) are due to GPU_COMMS, and we don't need a separate issue? That would be great news.

@alexstrel
Member

I think so, yes.

@maddyscientist
Member

Alexei, can you try to reproduce the issue you had using the hotfix/gdr branch? When compiling, use --enable-host-debug, as I've added some additional memory checking (see lib/comm_common.cpp) for the communicators to ensure that only valid buffers are passed to MPI/QMP.
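
For anyone curious, here is a rough, illustrative sketch of the kind of buffer sanity check meant here. It is not the actual code in lib/comm_common.cpp (which will differ), and the helper name check_comms_buffer is made up for this example; the only real API used is cudaPointerGetAttributes.

#include <cstddef>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort if ptr is not a buffer the CUDA runtime knows about, before it is
// handed to MPI/QMP.  On the CUDA versions of this era, cudaPointerGetAttributes
// fails for pointers that were never allocated or registered with CUDA, which
// is the kind of stale/invalid pointer a debug build would want to catch.
static void check_comms_buffer(const void *ptr, size_t bytes)
{
  cudaPointerAttributes attr;
  cudaError_t err = cudaPointerGetAttributes(&attr, ptr);
  if (err != cudaSuccess) {
    fprintf(stderr, "Invalid comms buffer %p (%zu bytes): %s\n",
            ptr, bytes, cudaGetErrorString(err));
    exit(EXIT_FAILURE);
  }
}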

@maddyscientist
Member

My GPU_COMMS fixes are now complete and I've created a pull request (#238).

@alexstrel
Member

Let me check; our system is currently offline, so it will need some time.

@alexstrel
Member

Update: (mixed-precision) eigCG still does not converge when GPU_COMMS is on.

@maddyscientist
Member

I assume this can be reproduced simply by running deflation_test? Can you tell me the command-line arguments I should use to reproduce this?

@alexstrel
Member

This is not a regular test: I'm using custom MILC code for the HISQ eigCG tests. I see that the solver converges with GPU_COMMS=no, while the 'yes' option makes it diverge.

@maddyscientist
Member

Can you then try to reproduce it in a QUDA internal test? E.g., if you make a staggered_deflation_test (which should be there anyway), does this also show the problem?



@alexstrel
Member

Some results taken from internal clover BiCGstab tests (random fields, 2 GPUs):

i) GPU_COMMS=yes, solution_type = QUDA_MAT_SOLUTION

RUN #1
prec    prec_sloppy   multishift  matpc_type  
double   double          0             even_even_asym     
Source: CPU = 1.57319e+06, CUDA copy = 1.57319e+06
Prepared source post mass rescale = 1.65046e+06
BiCGstab: Convergence at 17 iterations, L2 relative residual: iterated = 2.818557e-13, true = 2.818557e-13
Reconstructed: CUDA solution = 5.93605e+06, CPU copy = 5.93605e+06
Residuals: (L2 relative) tol 1e-12, QUDA = 2.81856e-13, host = 2.88696e-13;

RUN #2
prec    prec_sloppy   multishift  matpc_type
double   double          0     odd_odd_asym
Source: CPU = 1.57319e+06, CUDA copy = 1.57319e+06
Prepared source post mass rescale = 1.6509e+06
BiCGstab: Convergence at 24 iterations, L2 relative residual: iterated = 9.573096e-13, true = 4.983650e-02
Reconstructed: CUDA solution = 5.91613e+06, CPU copy = 5.91613e+06
Residuals: (L2 relative) tol 1e-12, QUDA = 0.0498365, host = 0.0869824;

ii) GPU_COMMS=no, solution_type = QUDA_MAT_SOLUTION

RUN #3 
prec    prec_sloppy   multishift  matpc_type
double   double          0     even_even_asym
Source: CPU = 1.57319e+06, CUDA copy = 1.57319e+06
Prepared source post mass rescale = 1.65046e+06
BiCGstab: Convergence at 17 iterations, L2 relative residual: iterated = 2.818557e-13, true = 2.818557e-13
Reconstructed: CUDA solution = 5.93605e+06, CPU copy = 5.93605e+06
Residuals: (L2 relative) tol 1e-12, QUDA = 2.81856e-13, host = 2.88696e-13;

RUN #4 
prec    prec_sloppy   multishift  matpc_type 
double   double          0     odd_odd_asym
Source: CPU = 1.57319e+06, CUDA copy = 1.57319e+06
Prepared source post mass rescale = 1.6509e+06
BiCGstab: Convergence at 17 iterations, L2 relative residual: iterated = 2.963415e-13, true = 2.963415e-13
Reconstructed: CUDA solution = 5.93605e+06, CPU copy = 5.93605e+06
Residuals: (L2 relative) tol 1e-12, QUDA = 2.96341e-13, host = 3.03573e-13;

Note that RUN #2 is broken. Also, from case ii) we see that switching off GPU_COMMS removes the problem: both even_even_asym and odd_odd_asym are consistent.

@maddyscientist
Member

Ok, thanks Alexei. That's good information for me to look into this.

@alexstrel
Member

Narrowing the bug search space: additional inputs from my side.

iii) GPU_COMMS=no, solution_type = QUDA_MATPC_SOLUTION

RUN 5
prec    prec_sloppy   multishift  matpc_type
double   double          0     odd_odd_asym
Prepared source post mass rescale = 786520
BiCGstab: Convergence at 17 iterations, L2 relative residual: iterated = 3.294689e-13, true = 3.294694e-13
Reconstructed: CUDA solution = 1.25821e+06, CPU copy = 1.25821e+06
Residuals: (L2 relative) tol 1e-12, QUDA = 3.29469e-13, host = 3.29469e-13


iv) GPU_COMMS=yes, solution_type = QUDA_MATPC_SOLUTION

RUN 6
prec    prec_sloppy   multishift  matpc_type
double   double          0     odd_odd_asym
Prepared source post mass rescale = 786520
BiCGstab: Convergence at 17 iterations, L2 relative residual: iterated = 3.294689e-13, true = 3.294694e-13
Reconstructed: CUDA solution = 1.25821e+06, CPU copy = 1.25821e+06
Residuals: (L2 relative) tol 1e-12, QUDA = 3.29469e-13, host = 3.29469e-13;

That is, regardless of the GPU_COMMS option, both cases are consistent if solution_type = QUDA_MATPC_SOLUTION. So the bug affects the preparation/reconstruction methods (and shows up when GPU_COMMS = yes). Investigating.
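
To make it concrete (and to see why the prepare/reconstruct steps are the natural suspects), here is a toy sketch of how a full MAT solution is assembled from a MATPC solve. It uses plain scalars in place of the even/odd blocks and is not QUDA's interface or sign convention; the point is only that the steps beyond the MATPC solve itself (prepare and reconstruct) are the ones that apply the hopping term D, which in the real code is what needs the ghost-zone exchange.

#include <cassert>
#include <cmath>

int main()
{
  // Toy "blocks" of M = A - kappa D on even/odd sites, here just scalars.
  const double A_ee = 2.0, A_oo = 3.0, D_eo = 0.7, D_oe = 0.4, kappa = 0.1;
  const double b_e = 1.0, b_o = -2.0;

  // Prepare: fold the even part of the source into the odd part
  // (applies D_oe, i.e. a dslash -> communication in the real code).
  const double b_prep = b_o + kappa * D_oe * (b_e / A_ee);

  // MATPC solve: invert the Schur complement on the odd sites only.
  const double M_hat = A_oo - kappa * kappa * D_oe * (1.0 / A_ee) * D_eo;
  const double x_o = b_prep / M_hat;

  // Reconstruct: recover the even sites (applies D_eo -> communication again).
  const double x_e = (b_e + kappa * D_eo * x_o) / A_ee;

  // The pair (x_e, x_o) solves the full 2x2 system.
  assert(std::fabs(A_ee * x_e - kappa * D_eo * x_o - b_e) < 1e-12);
  assert(std::fabs(A_oo * x_o - kappa * D_oe * x_e - b_o) < 1e-12);
  return 0;
}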

@maddyscientist
Member

I've also found that

  • dslash_test --partition 15 --test 2 and dslash_test --partition 15 --test 4 both fail when GPU_COMMS=yes regardless of the parity
  • This bug also affects Wilson fermions

@maddyscientist
Member

I have traced the dslash_test failures to

  void DiracWilson::M(cudaColorSpinorField &out, const cudaColorSpinorField &in) const
  {
    checkFullSpinor(out, in);
    DslashXpay(out.Odd(), in.Even(), QUDA_ODD_PARITY, in.Odd(), -kappa);  // working
    DslashXpay(out.Even(), in.Odd(), QUDA_EVEN_PARITY, in.Even(), -kappa);  // broken
  }

@alexstrel
Member

Yes, the clover prepare/reconstruct calls DiracWilson::DslashXpay, so we have a similar problem for the clover solvers.

@maddyscientist
Member

Ok, I've worked out the problem.

When creating a full field together with its even and odd subsets, the full field is allocated and the even/odd subsets are references into it: the even subset points to the first half of the full field, and the odd subset points to the second half. This involves a hack for setting the pointer of the odd subset, whereby after the odd subset is created, its pointers are adjusted, e.g. in cudaColorSpinorField::create:

// need this hackery for the moment (need to locate the odd pointer half way into the full field)
(dynamic_cast<cudaColorSpinorField*>(odd))->v = (void*)((char*)v + bytes/2);
if (precision == QUDA_HALF_PRECISION) 
  (dynamic_cast<cudaColorSpinorField*>(odd))->norm = (void*)((char*)norm + norm_bytes/2);

For the GPU_COMMS case, the ghost[] array is used as a shorthand to point to the ghost zones at the end of the (single-parity) v array (and similarly for the ghostNorm[] array). These pointers also need to be updated:

for(int i=0; i<nDim; ++i){
  if(commDimPartitioned(i)){
    (dynamic_cast<cudaColorSpinorField*>(odd))->ghost[i] =
      static_cast<char*>((dynamic_cast<cudaColorSpinorField*>(odd))->ghost[i]) + bytes/2;
    if(precision == QUDA_HALF_PRECISION)
      (dynamic_cast<cudaColorSpinorField*>(odd))->ghostNorm[i] =
        static_cast<char*>((dynamic_cast<cudaColorSpinorField*>(odd))->ghostNorm[i]) + norm_bytes/2;
  }
}

With this patch in place, all tests now seem to pass. I'll create a pull request with it.
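
For readers following along, here is a minimal standalone sketch (plain C++, nothing QUDA- or CUDA-specific) of the aliasing pattern described above. The Subset struct, the field size, and ghost_offset are invented for illustration; only the bytes/2 pointer arithmetic mirrors the actual fix in cudaColorSpinorField::create.

#include <cassert>
#include <cstddef>
#include <vector>

const int nDim = 4;

struct Subset {
  char *v = nullptr;       // start of this parity's data
  char *ghost[nDim] = {};  // shorthand pointers into this parity's ghost zones
};

int main()
{
  const std::size_t bytes = 1 << 20;  // payload of the full field (both parities)
  std::vector<char> full(bytes);

  // The even subset aliases the first half of the full field, the odd subset
  // the second half -- this is the pointer "hackery" quoted above.
  Subset even, odd;
  even.v = full.data();
  odd.v  = full.data() + bytes / 2;

  // GPU_COMMS keeps shorthand ghost pointers at a fixed offset inside each
  // parity's buffer.  If they are set up before the odd subset is relocated,
  // they still point into the even half of the field ...
  const std::size_t ghost_offset = bytes / 4;  // made-up layout
  for (int i = 0; i < nDim; i++) {
    even.ghost[i] = even.v + ghost_offset;
    odd.ghost[i]  = even.v + ghost_offset;     // stale: inherited from the even subset
  }

  // ... so, just like odd->v, they must be shifted by bytes/2 (the patch above).
  for (int i = 0; i < nDim; i++) odd.ghost[i] += bytes / 2;

  for (int i = 0; i < nDim; i++) assert(odd.ghost[i] == odd.v + ghost_offset);
  return 0;
}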

maddyscientist added a commit that referenced this issue May 15, 2015
…be updated after the subsets have been created. This should fix #231.
maddyscientist added this to the QUDA 0.7.1 milestone May 15, 2015
@maddyscientist
Member

Can we close this bug now as well, then?

@mathiaswagner
Member Author

GPU_COMMS and clover might still have other issues, but everything mentioned here has been addressed.
