
GPU_COMMS and clover bicg #231

Closed
mathiaswagner opened this issue May 6, 2015 · 25 comments · Fixed by #242
@mathiaswagner
Member

Enabling GPU_COMMS results in a large true residual in the BiCGstab inverter with clover fermions. Alexei has reproduced this issue with the internal invert_test.

This issue originates from #224.

@alexstrel
Member

Also true for other fermion actions, e.g., Wilson fermions. This bug is related to the "odd-odd" type of preconditioning.

@mathiaswagner
Member Author

Might it be that we accidentally overwrite the solution?

The reliable updates seem to work?


@alexstrel
Member

I revisited my HISQ eigCG, and it seems this bug is behind the reported eigCG breakdown: with GPU_COMMS=no, HISQ eigCG works normally.

@maddyscientist
Member

Just pointing out that I have a live branch (hotfix/gdr) at the moment for fixing some ongoing GPU_COMMS issues. If you have a simple way to reproduce this, I can perhaps include the fix in this branch.

@mathiaswagner
Member Author

Including mixed precision?


@maddyscientist
Member

Hopefully any outstanding bugs can be fixed in this branch.

@alexstrel
Member

Forgot to mention: yes, that was mixed precision. BTW, I had no problems with full-precision eigCG anyway. To reproduce the bug, you can at least run a BiCGstab inversion for Wilson with odd-odd preconditioning; if that is successful, then I can check eigCG as well.

@mathiaswagner
Member Author

Does that mean that all the eigCG issues you mentioned in the call (apart from features you still might want to add) are due to GPU_COMMS, and we don't need a separate issue? That would be great news.

@alexstrel
Member

I think so, yes.

@maddyscientist
Member

Alexei, can you try to reproduce the issue you had using the hotfix/gdr branch? When compiling, use --enable-host-debug, as I've added some additional memory checking (see lib/comm_common.cpp) for the communicators to ensure that only valid buffers are passed to MPI/QMP.
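
For anyone curious, here is a rough, illustrative sketch of the kind of buffer sanity check meant here. It is not the actual code in lib/comm_common.cpp (which will differ), and the helper name check_comms_buffer is made up for this example; the only real API used is cudaPointerGetAttributes.

#include <cstddef>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort if ptr is not a buffer the CUDA runtime knows about, before it is
// handed to MPI/QMP.  On the CUDA versions of this era, cudaPointerGetAttributes
// fails for pointers that were never allocated or registered with CUDA, which
// is the kind of stale/invalid pointer a debug build would want to catch.
static void check_comms_buffer(const void *ptr, size_t bytes)
{
  cudaPointerAttributes attr;
  cudaError_t err = cudaPointerGetAttributes(&attr, ptr);
  if (err != cudaSuccess) {
    fprintf(stderr, "Invalid comms buffer %p (%zu bytes): %s\n",
            ptr, bytes, cudaGetErrorString(err));
    exit(EXIT_FAILURE);
  }
}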

@maddyscientist
Member

My GPU_COMMS fixes are now complete and I've created a pull request (#238).

@alexstrel
Member

Let me check; our system is currently offline, so it will need some time.

@alexstrel
Member

Update: (mixed-precision) eigCG still does not converge when GPU_COMMS is on.

@maddyscientist
Member

I assume this can be reproduced simply by running deflation_test? Can you tell me the command-line arguments I should use to reproduce this?

@alexstrel
Member

This is not a regular test: I'm using custom MILC code for the HISQ eigCG tests. I see that the solver converges with GPU_COMMS=no, while the 'yes' option makes it diverge.

@maddyscientist
Member

Can you then try to reproduce it in a QUDA internal test? E.g., if you make a staggered_deflation_test (which should be there anyway), does this also show the problem?



@alexstrel
Member

Some results taken from internal clover BiCGstab tests (random fields, 2 GPUs):

i) GPU_COMMS=yes, solution_type = QUDA_MAT_SOLUTION

RUN #1
prec    prec_sloppy   multishift  matpc_type  
double   double          0             even_even_asym     
Source: CPU = 1.57319e+06, CUDA copy = 1.57319e+06
Prepared source post mass rescale = 1.65046e+06
BiCGstab: Convergence at 17 iterations, L2 relative residual: iterated = 2.818557e-13, true = 2.818557e-13
Reconstructed: CUDA solution = 5.93605e+06, CPU copy = 5.93605e+06
Residuals: (L2 relative) tol 1e-12, QUDA = 2.81856e-13, host = 2.88696e-13;

RUN #2
prec    prec_sloppy   multishift  matpc_type
double   double          0     odd_odd_asym
Source: CPU = 1.57319e+06, CUDA copy = 1.57319e+06
Prepared source post mass rescale = 1.6509e+06
BiCGstab: Convergence at 24 iterations, L2 relative residual: iterated = 9.573096e-13, true = 4.983650e-02
Reconstructed: CUDA solution = 5.91613e+06, CPU copy = 5.91613e+06
Residuals: (L2 relative) tol 1e-12, QUDA = 0.0498365, host = 0.0869824;

ii) GPU_COMMS=no, solution_type = QUDA_MAT_SOLUTION

RUN #3 
prec    prec_sloppy   multishift  matpc_type
double   double          0     even_even_asym
Source: CPU = 1.57319e+06, CUDA copy = 1.57319e+06
Prepared source post mass rescale = 1.65046e+06
BiCGstab: Convergence at 17 iterations, L2 relative residual: iterated = 2.818557e-13, true = 2.818557e-13
Reconstructed: CUDA solution = 5.93605e+06, CPU copy = 5.93605e+06
Residuals: (L2 relative) tol 1e-12, QUDA = 2.81856e-13, host = 2.88696e-13;

RUN #4 
prec    prec_sloppy   multishift  matpc_type 
double   double          0     odd_odd_asym
Source: CPU = 1.57319e+06, CUDA copy = 1.57319e+06
Prepared source post mass rescale = 1.6509e+06
BiCGstab: Convergence at 17 iterations, L2 relative residual: iterated = 2.963415e-13, true = 2.963415e-13
Reconstructed: CUDA solution = 5.93605e+06, CPU copy = 5.93605e+06
Residuals: (L2 relative) tol 1e-12, QUDA = 2.96341e-13, host = 3.03573e-13;

Note that RUN #2 is broken. Also, from case ii) we see that switching off GPU_COMMS removes the problem: both even_even_asym and odd_odd_asym are consistent.

@maddyscientist
Member

Ok, thanks Alexei. That's good information for me to look into this.

@alexstrel
Member

Narrowing the bug search space: additional inputs from my side.

iii) GPU_COMMS=no, solution_type = QUDA_MATPC_SOLUTION

RUN 5
prec    prec_sloppy   multishift  matpc_type
double   double          0     odd_odd_asym
Prepared source post mass rescale = 786520
BiCGstab: Convergence at 17 iterations, L2 relative residual: iterated = 3.294689e-13, true = 3.294694e-13
Reconstructed: CUDA solution = 1.25821e+06, CPU copy = 1.25821e+06
Residuals: (L2 relative) tol 1e-12, QUDA = 3.29469e-13, host = 3.29469e-13


iv) GPU_COMMS=yes, solution_type = QUDA_MATPC_SOLUTION

RUN 6
prec    prec_sloppy   multishift  matpc_type
double   double          0     odd_odd_asym
Prepared source post mass rescale = 786520
BiCGstab: Convergence at 17 iterations, L2 relative residual: iterated = 3.294689e-13, true = 3.294694e-13
Reconstructed: CUDA solution = 1.25821e+06, CPU copy = 1.25821e+06
Residuals: (L2 relative) tol 1e-12, QUDA = 3.29469e-13, host = 3.29469e-13;

That is, regardless of the GPU_COMMS option, both cases are consistent if solution_type = QUDA_MATPC_SOLUTION. So the bug affects the preparation/reconstruction methods (and shows up when GPU_COMMS = yes). Investigating.
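
To make it concrete (and to see why the prepare/reconstruct steps are the natural suspects), here is a toy sketch of how a full MAT solution is assembled from a MATPC solve. It uses plain scalars in place of the even/odd blocks and is not QUDA's interface or sign convention; the point is only that the steps beyond the MATPC solve itself (prepare and reconstruct) are the ones that apply the hopping term D, which in the real code is what needs the ghost-zone exchange.

#include <cassert>
#include <cmath>

int main()
{
  // Toy "blocks" of M = A - kappa D on even/odd sites, here just scalars.
  const double A_ee = 2.0, A_oo = 3.0, D_eo = 0.7, D_oe = 0.4, kappa = 0.1;
  const double b_e = 1.0, b_o = -2.0;

  // Prepare: fold the even part of the source into the odd part
  // (applies D_oe, i.e. a dslash -> communication in the real code).
  const double b_prep = b_o + kappa * D_oe * (b_e / A_ee);

  // MATPC solve: invert the Schur complement on the odd sites only.
  const double M_hat = A_oo - kappa * kappa * D_oe * (1.0 / A_ee) * D_eo;
  const double x_o = b_prep / M_hat;

  // Reconstruct: recover the even sites (applies D_eo -> communication again).
  const double x_e = (b_e + kappa * D_eo * x_o) / A_ee;

  // The pair (x_e, x_o) solves the full 2x2 system.
  assert(std::fabs(A_ee * x_e - kappa * D_eo * x_o - b_e) < 1e-12);
  assert(std::fabs(A_oo * x_o - kappa * D_oe * x_e - b_o) < 1e-12);
  return 0;
}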

@maddyscientist
Member

I've also found that

  • dslash_test --partition 15 --test 2 and dslash_test --partition 15 --test 4 both fail when GPU_COMMS=yes regardless of the parity
  • This bug also affects Wilson fermions

@maddyscientist
Member

I have traced the dslash_test failures to

  void DiracWilson::M(cudaColorSpinorField &out, const cudaColorSpinorField &in) const
  {
    checkFullSpinor(out, in);
    DslashXpay(out.Odd(), in.Even(), QUDA_ODD_PARITY, in.Odd(), -kappa);  // working
    DslashXpay(out.Even(), in.Odd(), QUDA_EVEN_PARITY, in.Even(), -kappa);  // broken
  }

@alexstrel
Member

Yes, the clover prepare/reconstruct calls DiracWilson::DslashXpay, so we have a similar problem for the clover solvers.

@maddyscientist
Member

Ok, I've worked out the problem.

When creating a full field together with its even and odd subsets, the full field is allocated and the even/odd subsets are references into it: the even subset points to the first half of the full field, and the odd subset points to the second half. This involves a hack for setting the pointer of the odd subset, whereby after the odd subset is created, its pointers are adjusted, e.g. in cudaColorSpinorField::create:

// need this hackery for the moment (need to locate the odd pointer half way into the full field)
(dynamic_cast<cudaColorSpinorField*>(odd))->v = (void*)((char*)v + bytes/2);
if (precision == QUDA_HALF_PRECISION) 
  (dynamic_cast<cudaColorSpinorField*>(odd))->norm = (void*)((char*)norm + norm_bytes/2);

For the GPU_COMMS case, the ghost[] array is used as a shorthand to point to the ghost zones at the end of the (single-parity) v array (and similarly for the ghostNorm[] array). These pointers also need to be updated:

for(int i=0; i<nDim; ++i){
  if(commDimPartitioned(i)){
    (dynamic_cast<cudaColorSpinorField*>(odd))->ghost[i] =
      static_cast<char*>((dynamic_cast<cudaColorSpinorField*>(odd))->ghost[i]) + bytes/2;
    if(precision == QUDA_HALF_PRECISION)
      (dynamic_cast<cudaColorSpinorField*>(odd))->ghostNorm[i] =
        static_cast<char*>((dynamic_cast<cudaColorSpinorField*>(odd))->ghostNorm[i]) + norm_bytes/2;
  }
}

With this patch in place, all tests now seem to pass. I'll create a pull request with it.
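
For readers following along, here is a minimal standalone sketch (plain C++, nothing QUDA- or CUDA-specific) of the aliasing pattern described above. The Subset struct, the field size, and ghost_offset are invented for illustration; only the bytes/2 pointer arithmetic mirrors the actual fix in cudaColorSpinorField::create.

#include <cassert>
#include <cstddef>
#include <vector>

const int nDim = 4;

struct Subset {
  char *v = nullptr;       // start of this parity's data
  char *ghost[nDim] = {};  // shorthand pointers into this parity's ghost zones
};

int main()
{
  const std::size_t bytes = 1 << 20;  // payload of the full field (both parities)
  std::vector<char> full(bytes);

  // The even subset aliases the first half of the full field, the odd subset
  // the second half -- this is the pointer "hackery" quoted above.
  Subset even, odd;
  even.v = full.data();
  odd.v  = full.data() + bytes / 2;

  // GPU_COMMS keeps shorthand ghost pointers at a fixed offset inside each
  // parity's buffer.  If they are set up before the odd subset is relocated,
  // they still point into the even half of the field ...
  const std::size_t ghost_offset = bytes / 4;  // made-up layout
  for (int i = 0; i < nDim; i++) {
    even.ghost[i] = even.v + ghost_offset;
    odd.ghost[i]  = even.v + ghost_offset;     // stale: inherited from the even subset
  }

  // ... so, just like odd->v, they must be shifted by bytes/2 (the patch above).
  for (int i = 0; i < nDim; i++) odd.ghost[i] += bytes / 2;

  for (int i = 0; i < nDim; i++) assert(odd.ghost[i] == odd.v + ghost_offset);
  return 0;
}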

maddyscientist added a commit that referenced this issue May 15, 2015
…be updated after the subsets have been created. This should fix #231.
maddyscientist added this to the QUDA 0.7.1 milestone May 15, 2015
@maddyscientist
Member

Can we close this bug now as well, then?

@mathiaswagner
Member Author

GPU_COMMS and clover might still have other issues, but everything mentioned here has been addressed.
