
Feature/DD #1447

Open · wants to merge 120 commits into develop
Conversation

@sbacchio (Member) commented Mar 19, 2024

This is a first PR towards enabling domain decomposition (DD) features in QUDA.
The goal of this PR is to enable a red-black decomposition for the Dirac operator.

Remarks

  • For now, we require the blocks to fit exactly within the local lattice; generalization is left to a future PR.
  • We focus only on the application of the Dirac operator. Optimization of BLAS functions is left to a future PR.

Design

Under domain decomposition, the Dirac operator assumes a block structure, e.g.

$$ D = \begin{bmatrix} D_{rr} & D_{rb} \\ D_{br} & D_{bb} \end{bmatrix} $$
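
Written out explicitly, a single application of $D$ therefore splits into four block applications, one per pair of domains:

$$ \begin{bmatrix} y_r \\ y_b \end{bmatrix} = \begin{bmatrix} D_{rr} & D_{rb} \\ D_{br} & D_{bb} \end{bmatrix} \begin{bmatrix} x_r \\ x_b \end{bmatrix} \quad\Longleftrightarrow\quad y_r = D_{rr} x_r + D_{rb} x_b , \qquad y_b = D_{br} x_r + D_{bb} x_b $$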

Thus we need a simple strategy for implementing all possible block operators and their application to a field.
Our strategy is to attach the domain-decomposition specification directly to the vector (spinor) field.
E.g. the application of $D_{rb}$, i.e. $y = D_{rb} x = P_r D P_b x$, can be expressed by marking the input vector as "black", i.e. $x_b = P_b x$, and the output vector as "red", i.e. $y_r = P_r y$. In pseudocode:

x.dd_black_active(); // restrict the input to the black blocks, x_b = P_b x
y.dd_red_active();   // restrict the output to the red blocks, y_r = P_r y
applyD(y, x);        // y_r = D_rb x_b

The application of D is then made DD-aware, so that it acts only on the active input/output points.
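
For illustration, a full application of $D$ could then be assembled from the four block applications using only this interface. The sketch below is not code from this PR: the temporary field and the final accumulation (blas::xpy) are assumptions, and it relies on applyD leaving inactive output points untouched.

// Sketch only: y = D x assembled from the four DD blocks
ColorSpinorField tmp(y);                                     // scratch field with the same layout as y (assumed)

x.dd_red_active();   y.dd_red_active();     applyD(y, x);    // red points of y:     D_rr x_r
x.dd_red_active();   y.dd_black_active();   applyD(y, x);    // black points of y:   D_br x_r
x.dd_black_active(); tmp.dd_red_active();   applyD(tmp, x);  // red points of tmp:   D_rb x_b
x.dd_black_active(); tmp.dd_black_active(); applyD(tmp, x);  // black points of tmp: D_bb x_b

blas::xpy(tmp, y);                                           // y += tmp, combining both contributions per parity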

Summary of changes:

  • DDParam is added as a property of lattice fields (used only for ColorSpinorFields at the moment); a rough sketch of what such metadata might contain follows this list.
  • ... TODO
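
For orientation only, the per-field DD metadata could look roughly like the sketch below; the member names are illustrative guesses, not the actual definition added in this PR.

// Hypothetical sketch of the DD metadata attached to a lattice field
struct DDParam {
  bool dd_enabled = false;          // whether this field participates in DD
  int block_dim[4] = {0, 0, 0, 0};  // DD block size in each direction (x, y, z, t)
  bool red_active = true;           // whether points in "red" blocks are active
  bool black_active = true;         // whether points in "black" blocks are active
};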

TODO list:

  • Add more comments and function documentation
  • Add checks for parameters, e.g. that the block size divides the local lattice size (a possible check is sketched after this list)
  • Add calculation of the first block parity (use global coordinates)
  • Test application of individual pieces, i.e. D_rr, D_rb, D_br, D_bb
  • Test all operators
  • Test that performance without DD is not affected (i.e. compared to current develop)
  • Test MR solver and usage in MG as smoother
  • Properly disable comms when not needed (e.g. block fits local lattice)
  • Improve performance by unrolling threads block-wise
  • Disable usage of DD for all PC Mat
  • ...
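
Regarding the parameter checks mentioned above, a possible divisibility check could look like the following sketch (the names X and dd.block_dim are assumptions, not identifiers from this PR):

// Each block dimension must evenly divide the local lattice extent in that direction
for (int d = 0; d < 4; d++) {
  if (dd.block_dim[d] <= 0 || X[d] % dd.block_dim[d] != 0)
    errorQuda("DD block size %d does not divide local lattice size %d in direction %d", dd.block_dim[d], X[d], d);
}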

Current issues:

  • Tests fail if the block size is odd in the x direction for the PC operator, e.g. dslash_test --xdim 6 --ydim 4 --zdim 4 --tdim 4 --dd-block-size 3 2 2 2 --dd-red-black=true --test 0. The non-PC operator works (e.g. --test 2), as does an odd block size in any other direction. For now, an odd block size in the x direction is not allowed.
  • Tests fail if export QUDA_REORDER_LOCATION=CPU is set (dslash_test failing with QUDA_REORDER_LOCATION=CPU #1466).
  • Tests fail for MatPC with Caught signal 8 (Floating point exception: integer divide by zero); see e.g. dslash_test --xdim 8 --ydim 8 --zdim 8 --tdim 8 --dd-red-black=true --test 1. For now, PC operators are not tested.

@sbacchio sbacchio requested a review from maddyscientist May 8, 2024 07:34
@sbacchio sbacchio marked this pull request as ready for review November 1, 2024 08:59
@sbacchio sbacchio requested review from a team as code owners November 1, 2024 08:59
@maddyscientist (Member)

I'm starting to do some testing on this PR, and I see that not all Dirac operators have the file-level parallelization. Is this something you can do? See the trace below: for a full build of QUDA, these files dominate the compilation time and make it considerably longer. At the same time, it's clear how much faster the split Dirac operator files compile on a multi-core system 😄

[image: compilation-time trace of a full QUDA build]

@pittlerf (Contributor) commented Nov 5, 2024

> I'm starting to do some testing on this PR, and I see that not all Dirac operators have the file-level parallelization. Is this something you can do? See the trace below: for a full build of QUDA, these files dominate the compilation time and make it considerably longer. At the same time, it's clear how much faster the split Dirac operator files compile on a multi-core system 😄

Hi, yes, sorry, I will do the remaining ones.

@maddyscientist (Member)

I have fixed the failing staggered dslash tests (they were caused by the long-link field being erroneously created when using regular, unimproved, staggered fermions).

@sbacchio (Member, Author) commented Nov 6, 2024

Hi Kate, that's an impressive trace :) Ferenc will work on the others soon. About the tests, I still see some staggered tests failing (invert and eigensolve), I guess for similar reasons. Maybe it's better if you have a look at those too :)

@sbacchio changed the title from Feature/DD (WIP) to Feature/DD on Nov 6, 2024
@maddyscientist (Member)

Looking great with the latest pushes. Just dslash_twisted_mass_preconditioned.cu left, I think.

[image: compilation-time trace]

@sbacchio (Member, Author) commented Nov 7, 2024

Great! And I see that all 7 checks now pass :) Thanks @pittlerf!
@maddyscientist, do you also have a trace of the compilation time of the current develop branch? Just to appreciate the overall improvement :)
