Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

minimal trial implementation for a twisted clover determinant derivative (pre-draft) #1338

Merged
merged 101 commits into from
Dec 21, 2023

Conversation

kostrzewa
Copy link
Member

This is a temporary copy of computeCloverForceQuda with (almost) the minimum amount of functionality to implement the derivative of a twisted clover determinant monomial as a starting point for debugging

@kostrzewa kostrzewa requested a review from a team as a code owner November 18, 2022 20:13
@kostrzewa kostrzewa changed the title minimal tria'l implementation for a twisted clover determinant derivative (pre-draft) minimal trial implementation for a twisted clover determinant derivative (pre-draft) Nov 18, 2022
@kostrzewa
Copy link
Member Author

@Marcogarofalo @simone-romiti fyi

@kostrzewa
Copy link
Member Author

This is similar to @Marcogarofalo's #1330 but I want to focus on getting this to work with tmLQCD first and then generalize the original function from there. As a result, the present PR will be temporary and only to share progress.

@kostrzewa
Copy link
Member Author

The problem #1330 (comment) is clearly fixed by removing the extra qParam.x[0] /= 2;.

However, the next problem crops up when the Dslash is supposed to be applied:

# QUDA: Dslash(x.Odd(), x.Even(), QUDA_ODD_PARITY)
# QUDA: ERROR: Parity spinor volume 16384 doesn't match clover checkboard volume 16384 (rank 0, host node-03.bender, dirac_twisted_clover.cpp:43 in virtual void quda::DiracTwistedClover::checkParitySpinor(const quda::ColorSpinorField&, const quda::ColorSpinorField&) const())
# QUDA:        last kernel called was (name=N4quda15CopyColorSpinorILi4ELi3ENS_11colorspinor11FloatNOrderIdLi4ELi3ELi2ELb0ELb0EEENS1_21SpaceSpinorColorOrderIdLi4ELi3EEESt5tupleIJRNS_16ColorSpinorFieldERKS7_19QudaFieldLocation_sPdPKdEEEE,volume=8x16x16x8,aux=GPU-offline,vol=16384,parity=1,precision=8,order=9,Ns=4,Nc=3vol=16384,parity=1,precision=8,order=2,Ns=4,Nc=3,NonRelBasis)

and this I find quite strange since 16384 == 16384 😄

@kostrzewa
Copy link
Member Author

The issue above occurs because of

  void DiracTwistedClover::checkParitySpinor(const ColorSpinorField &out, const ColorSpinorField &in) const
  {
    Dirac::checkParitySpinor(out, in);

    if (out.TwistFlavor() == QUDA_TWIST_SINGLET) {
      if (out.Volume() != clover->VolumeCB())
        errorQuda("Parity spinor volume %lu doesn't match clover checkboard volume %lu", out.Volume(),
                  clover->VolumeCB());
    } else {
      //
      if (out.Volume() / 2 != clover->VolumeCB())
        errorQuda("Parity spinor volume %lu doesn't match clover checkboard volume %lu", out.Volume(),                                     
                  clover->VolumeCB());
    }
  }

and qParam.twist_flavor not being set when quarkX and quarkP are created.

I can update the error message above to indicate that this might be a possibility, but to be honest, maybe we generally should check more frequently if twistFlavor is set and compatible?

@kostrzewa
Copy link
Member Author

The next small step. Here I do not really understand where the problem comes from.

# QUDA: computeCloverForce(cudaForce, gaugePrecise, quarkX, quarkP, force_coeff);
# QUDA: ERROR: qudaEventSynchronize_ returned CUDA_ERROR_ILLEGAL_ADDRESS
 (timer.h:107 in peek())
 (rank 0, host node-02.bender, quda_api.cpp:72 in void quda::target::cuda::set_driver_error(CUresult, const char*, const char*, const char*, const char*, bool)())
# QUDA:        last kernel called was (name=N4quda11CloverForceIdLi3EL21QudaReconstructType_s18EEE,volume=16x16x16x8,aux=GPU-offline,vol=32768stride=16384precision=8geometry=4Nc=3,exterior,dir=1)
--------------------------------------------------------------------------

@SaltyChiang
Copy link
Contributor

The next small step. Here I do not really understand where the problem comes from.

# QUDA: computeCloverForce(cudaForce, gaugePrecise, quarkX, quarkP, force_coeff);
# QUDA: ERROR: qudaEventSynchronize_ returned CUDA_ERROR_ILLEGAL_ADDRESS
 (timer.h:107 in peek())
 (rank 0, host node-02.bender, quda_api.cpp:72 in void quda::target::cuda::set_driver_error(CUresult, const char*, const char*, const char*, const char*, bool)())
# QUDA:        last kernel called was (name=N4quda11CloverForceIdLi3EL21QudaReconstructType_s18EEE,volume=16x16x16x8,aux=GPU-offline,vol=32768stride=16384precision=8geometry=4Nc=3,exterior,dir=1)
--------------------------------------------------------------------------

I got a similar error while trying to use computeCloverForceQuda. I checked the doc of CUDA and find that cudaEventSynchronize will not return cudaErrorIllegalAddress, which is pretty strange.

When I try to disable autotuning via setting QUDA_ENABLE_TUNING=0, the same error code (cudaErrorIllegalAddress) happens at another place.

@kostrzewa
Copy link
Member Author

I suspect that the problem is in the spinor comms of computeCloverForce and the error message is erroneous. I just realised I've been running with a RELEASE build of QUDA and will investigate with a DEBUG build first of all :)

@kostrzewa
Copy link
Member Author

Thread 1 "hmc_tm" received signal CUDA_EXCEPTION_14, Warp Illegal Address.
[Switching focus to CUDA kernel 0, grid 4320, block (0,0,0), thread (0,0,0), device 0, sm 0, warp 0, lane 0]
quda::Kernel1D<quda::Exterior, quda::CloverForceArg<double, 3, (QudaReconstructType_s)18, 3>, false><<<(64,1,1),(32,1,1)>>> ()
    at /home/bkostrze/code/quda-tm_force/lib/../include/color_spinor.h:1017 in _ZN4quda18outerProdSpinTraceIdLi3ELi4EEENS_6MatrixINS_7complexIT_EEXT0_EEERKNS_11ColorSpinorIS3_XT0_EXT1_EEES9_ inlined from color_spinor.h:115

it seems that the error does not quite occur where it is reported above, as far as I can tell from a run with cuda-gdb.

@kostrzewa
Copy link
Member Author

Thanks for the tip about disabling tuning @SaltyChiang, this makes it quite obvious where the issue is:

# QUDA: ERROR: qudaDeviceSynchronize_ returned CUDA_ERROR_ILLEGAL_ADDRESS
 (clover_outer_product.cu:100 in exchangeGhost())
 (rank 2, host node-02.bender, quda_api.cpp:72 in void quda::target::cuda::set_driver_error(CUresult, const char*, const char*, const char*, const char*, bool)())
# QUDA:        last kernel called was (name=N4quda11CloverForceIdLi3EL21QudaReconstructType_s18EEE,volume=16x16x16x8,aux=GPU-offline,vol=32768stride=16384precision=8geometry=4Nc=3,exterior,dir=1)

More tests for correctness is needed.
@kostrzewa
Copy link
Member Author

I think I can confirm that the error in #1338 (comment) stems from exchangeGhost of computeCloverForce since it runs without erroring out with a single GPU (this is output from tmLQCD's https://github.com/etmc/tmLQCD/tree/quda_work_tm_force branch running a simple HMC with a twisted clover clover determinant):

# QUDA: Dslash(x.Odd(), x.Even(), QUDA_ODD_PARITY)
# QUDA: gamma5(tmp, x.Even())
# QUDA: M(p.Even(), tmp)
# QUDA: Dslash(p.Odd(), p.Even(), QUDA_ODD_PARITY)
# QUDA: computeCloverForce(cudaForce, gaugePrecise, quarkX, quarkP, force_coeff);
# QUDA: computeCloverSigmaTrace(oprod, *cloverPrecise, k_csw_ov_8);
# QUDA: computeCloverSigmaOprod(oprod, quarkX, quarkP, ferm_epsilon)
# QUDA: cloverDerivative(cudaForce, gaugeEx, *oprodEx, 1.0, QUDA_ODD_PARITY)
# QUDA: cloverDerivative(cudaForce, gaugeEx, *oprodEx, 1.0, QUDA_EVEN_PARITY)
# QUDA: updateMomentum(gpuMom, -1.0, cudaForce, "tmclover")
# TM_QUDA: Time for computeTMCloverForceQuda 3.765968e-02 s level: 3 proc_id: 0 /HMC/cloverdetlight:cloverdet_derivative/compute_cloverdet_derivative_quda/computeTMCloverForceQuda

@maddyscientist
Copy link
Member

I will get more involved with this debug in the next day or so. One thought though: when debugging it’s best to run with CUDA_LAUNCH_BLOCKING=1, this will ensure that all kernels are synchronized with the host and if an illegal memory access, etc., occurs you will be guaranteed to told exactly which kernel is misbehaving by QUDA’s error checking.

@kostrzewa
Copy link
Member Author

kostrzewa commented Nov 22, 2022

One thought though: when debugging it’s best to run with CUDA_LAUNCH_BLOCKING=1, this will ensure that all kernels are synchronized with the host and if an illegal memory access, etc., occurs you will be guaranteed to told exactly which kernel is misbehaving by QUDA’s error checking.

Thanks a lot, I keep forgetting about this...

On 2 GPUs:

# QUDA: ERROR: qudaLaunchKernel returned an illegal memory access was encountered
 (/home/bkostrze/code/quda-tm_force/lib/targets/cuda/quda_api.cpp:152 in qudaLaunchKernel())
 (rank 1, host node-04.bender, quda_api.cpp:58 in void quda::target::cuda::set_runtime_error(cudaError_t, const char*, const char*, const char*, const char*, bool)())
# QUDA:        last kernel called was (name=N4quda11CloverForceIdLi3EL21QudaReconstructType_s18EEE,volume=16x16x16x16,aux=GPU-offline,vol=65536stride=32768precision=8geometry=4Nc=3,exterior,dir=1)

@maddyscientist
Copy link
Member

The issue above occurs because of

  void DiracTwistedClover::checkParitySpinor(const ColorSpinorField &out, const ColorSpinorField &in) const
  {
    Dirac::checkParitySpinor(out, in);

    if (out.TwistFlavor() == QUDA_TWIST_SINGLET) {
      if (out.Volume() != clover->VolumeCB())
        errorQuda("Parity spinor volume %lu doesn't match clover checkboard volume %lu", out.Volume(),
                  clover->VolumeCB());
    } else {
      //
      if (out.Volume() / 2 != clover->VolumeCB())
        errorQuda("Parity spinor volume %lu doesn't match clover checkboard volume %lu", out.Volume(),                                     
                  clover->VolumeCB());
    }
  }

and qParam.twist_flavor not being set when quarkX and quarkP are created.

I can update the error message above to indicate that this might be a possibility, but to be honest, maybe we generally should check more frequently if twistFlavor is set and compatible?

To be honest I never really liked this design where a singlet fermion field is distinct from a non-twisted field. The dimensions are identical, and the field can be used as intended, so I don't see the need for the separate field type. I'm not sure what the original design motivation was for this (perhaps @alexstrel knows, as I think he was the originator, but this may be lost in the mists of time?)

A better approach here may be to dump the distinction between QUDA_TWIST_NO and QUDA_TWIST_SINGLET and allow these fields to be used interchangeably. This would remove some of this pointless checking, and then we only need to worry if we have a doublet field or not. Thoughts?

@maddyscientist
Copy link
Member

One thought though: when debugging it’s best to run with CUDA_LAUNCH_BLOCKING=1, this will ensure that all kernels are synchronized with the host and if an illegal memory access, etc., occurs you will be guaranteed to told exactly which kernel is misbehaving by QUDA’s error checking.

Thanks a lot, I keep forgetting about this...

On 2 GPUs:

# QUDA: ERROR: qudaLaunchKernel returned an illegal memory access was encountered
 (/home/bkostrze/code/quda-tm_force/lib/targets/cuda/quda_api.cpp:152 in qudaLaunchKernel())
 (rank 1, host node-04.bender, quda_api.cpp:58 in void quda::target::cuda::set_runtime_error(cudaError_t, const char*, const char*, const char*, const char*, bool)())
# QUDA:        last kernel called was (name=N4quda11CloverForceIdLi3EL21QudaReconstructType_s18EEE,volume=16x16x16x16,aux=GPU-offline,vol=65536stride=32768precision=8geometry=4Nc=3,exterior,dir=1)

Ok, so this suggests the issue isn't the exchangeGhost and is the CloverForce kernel. Moreover, we can see from the signature that it's the halo kernel (that's what the exterior part means). So that would explain why it's a multi-GPU issue.
Maybe I missed it, but what happens what happens when you run on a single GPU?

@kostrzewa
Copy link
Member Author

Maybe I missed it, but what happens what happens when you run on a single GPU?

It works fine (I haven't checked yet whether it's correct, but at least it doesn't crash). I will be able to work some more on this next week hopefully and make some more concrete progress.

@SaltyChiang
Copy link
Contributor

SaltyChiang commented Nov 22, 2022

I think it is the problem in exchangeGhost.

Please see this commit in #1339.

for (unsigned int i=0; i<x.size(); i++) {
for (int parity=0; parity<2; parity++) {
ColorSpinorField& inA = (parity&1) ? p[i]->Odd() : p[i]->Even();
ColorSpinorField& inB = (parity&1) ? x[i]->Even(): x[i]->Odd();
ColorSpinorField& inC = (parity&1) ? x[i]->Odd() : x[i]->Even();
ColorSpinorField& inD = (parity&1) ? p[i]->Even(): p[i]->Odd();
inB.exchangeGhost(parity ? QUDA_ODD_PARITY : QUDA_EVEN_PARITY, 1, dag);
inD.exchangeGhost(parity ? QUDA_ODD_PARITY : QUDA_EVEN_PARITY, 1, 1 - dag);
instantiate<CloverForce, ReconstructNo12>(U, force, inA, inB, inC, inD, parity, coeff[i]);
}
}

After using exchangeGhost in ColorSpinorField instead of one in clover_outer_product.cu, this function doesn't raise cudaErrorIllegalAddress for 2 GPUs.

exchangeGhost in clover_outer_product.cu seems to make the inB.Ghost() invalid for device function, still don't know why. At least we can go to the next step.

@kostrzewa
Copy link
Member Author

kostrzewa commented Nov 22, 2022

A better approach here may be to dump the distinction between QUDA_TWIST_NO and QUDA_TWIST_SINGLET and allow these fields to be used interchangeably. This would remove some of this pointless checking, and then we only need to worry if we have a doublet field or not. Thoughts?

I agree. The name is also a bit fundamentally confusing in some sense, since I would consider the "flavour" to correspond to either +mu or -mu or equivalently +r or -r in the "physical basis", rather than the distinction between single-flavour and doublet. That is a whole other story, however, and the present solution works well in practice.

On the other hand, having a flag indicating that one is dealing with a twisted mass operator (rather than having to check whether mu is non-zero against best practice, even though also this is done here and there) is not a bad thing. This fact is useful in a couple of places:

  1. clover field and its inverse
  2. implicit use in transfer->Vectors().TwistFlavor() (I think)

Before getting rid of it, perhaps it's worth checking if this would not introduce lots of local and equivalent variables (which will likely be bool twisted = param.mu != 0.0 (or a safer variant thereof). One might conclude, of course, that having these defined locally is better for one reason or another.

@maddyscientist
Copy link
Member

All good points @kostrzewa regarding TwistFlavour(). I will have a think about this and some hacking and see what solution makes sense.

I've taken a look at the clover_outer_product.cu file a bit more to see if I could visually isolate what's going on here. One thing I do see is that the halo exchange code is rather nasty, and is badly need of an update to clean this up. It wouldn't surprise me if this code has bit rotted, and that in turn could lead to the crashes you were seeing, e.g., if the ghost buffers weren't being sized correctly, or some such.@SaltyChiang tried replacing with ColorSpinorField::exchangeGhost in #1339 but that isn't quite the correct thing to do, at least in isolation, since kernel changes would be needed as well.

May I suggest the following (which I think is what you were trending towards anyway):

  • we focus on getting single GPU to work correctly
  • once we have that, it will make debugging the multi-GPU variant much easier and more self contained
  • moreover, in order to remove the dynamic clover aspect as a source of error, we focus on non-dynamic clover

What's the best way for me to run this function myself? I assume you're calling this from tmLQCD?

…suggested in review, and sprinkle in some more OMP for more speedup
@maddyscientist
Copy link
Member

The file tests/host_reference/CloverForce_reference.h and its twisted friends should be renamed clover_force_reference.h, etc, to match the conventions of other files.

This isn't a hill I'll die on, but I think there's reasonable scope to combine clover_force_reference.cpp and its twisted friend in a single file.

@weinbe2 c0fdc7b address this suggestion: merging CloverForce_reference.h and TMCloverForce_reference.cpp into a single file clover_force_reference.cpp.

@Marcogarofalo
Copy link
Member

@maddyscientist I just noticed that here

quda/lib/interface_quda.cpp

Lines 4661 to 4663 in 7bd79ed

gamma5(tmp, x[i].Odd());
dirac->Dslash(x[i].Even(), tmp, QUDA_EVEN_PARITY);
gamma5(x[i].Even(), x[i].Even());

we could remove completely the tmp spinor. Can I push this change or this part is under refurbishment?

@maddyscientist
Copy link
Member

@maddyscientist I just noticed that here

quda/lib/interface_quda.cpp

Lines 4661 to 4663 in 7bd79ed

gamma5(tmp, x[i].Odd());
dirac->Dslash(x[i].Even(), tmp, QUDA_EVEN_PARITY);
gamma5(x[i].Even(), x[i].Even());

we could remove completely the tmp spinor. Can I push this change or this part is under refurbishment?

Sounds like a good idea, thanks.

@maddyscientist
Copy link
Member

@Marcogarofalo @kostrzewa just making sure I understand the conventions here: does tmLQCD solve for MdagM or MMdag when doing HMC forces? Was having another look at parametrizing the differences between tmLQCD and MILC interfaces to simplify the code: apart from the parity differences, the only other main difference is that one seems to be the Hermitian conjugate of the other.

@maddyscientist
Copy link
Member

I was also confused by comments like this in the tmLQCD interface code:

    // QUDA applies the MMdag operator, we need QpQm^{-1) in the end
    // so we want QUDA to use the MdagM operator
    inv_param.dagger = QUDA_DAG_YES;

In general the default with QUDA CG for example is MdagM not MMdag, so wanting to make sure we're on the same page.

@Marcogarofalo
Copy link
Member

My understanding is that in tmLQCD we need $(Q_+Q_-)^{-1}$ where $Q_{\pm}=\gamma_5M_{\pm}$. The sign $\pm$ is the sign of the twisted mass $\mu$. When we use the QUDA CG we are doing $\gamma_5(M_+M_+^\dagger)^{-1}\gamma_5=(Q_+Q_-)^{-1}$ since $M_+^\dagger=\gamma_5M_-\gamma_5$. We call QUDA with QUDA_DAG_YES to do $(M_+M_+^\dagger)^{-1}$, the $\gamma_5$ are applied in tmLQCD before and after the inversion.

Then inside the force at

quda/lib/interface_quda.cpp

Lines 4657 to 4660 in 19ce4dd

dirac->Dagger(QUDA_DAG_YES);
gamma5(p[i].Even(), x[i].Odd());
dirac->Dslash(x[i].Even(), p[i].Even(), QUDA_EVEN_PARITY);
gamma5(x[i].Even(), x[i].Even());

We want $(M_-^{ee})^{-1}M^{eo}$ so in QUDA we are doing $\gamma_5 (M_+^{ee})^{-1}M^{oe} \gamma_5$.
With $M_\pm=M_\pm^{oo}-M^{oe}(M_\pm^{ee})^{-1}M^{eo}$.

Maybe some comments are misleading

@maddyscientist
Copy link
Member

@Marcogarofalo I've now parametrized the differences between MILC and tmLQCD. For tmLQCD, I'm now expecting that QudaInvertParam::matpc_type = QUDA_ODD_ODD_ASYMMETRIC and QudaInvertParam::dagger = QUDA_DAG_YES. Can you have a look at 0b99a1d you agree?

@Marcogarofalo
Copy link
Member

@maddyscientist Yes they are the ones that are passed when in tmLQCD we call QUDA CG, for the MG they are different but I set them after the MG and before calling the force function.

@maddyscientist
Copy link
Member

@weinbe2 this is ready for final sign off. I'll push clang-format shortly.

Copy link
Contributor

@weinbe2 weinbe2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hugely impressive work, thank you @kostrzewa , @Marcogarofalo , + @maddyscientist for additional clean-up and contributions. Merging momentarily!

@weinbe2 weinbe2 merged commit fd50676 into develop Dec 21, 2023
12 checks passed
@weinbe2 weinbe2 deleted the feature/tm_force branch December 21, 2023 04:30
@Marcogarofalo
Copy link
Member

Thank you All for making this possible!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants