add ThreadLocalCache #1380

Merged
merged 31 commits into develop from feature/sycl-merge on Dec 20, 2023

Conversation

jcosborn
Contributor

This adds a dedicated thread-local cache object that can use shared memory for storage. It is distinct from SharedMemoryCache in that there is no sharing among threads, which simplifies the interface and removes the need for a sync operation. Since sharing isn't needed, targets can choose not to use shared memory if that is advantageous. Note that thread_array could be replaced by this in the future, but that is not being done here.

TODO: add HIP version
I'll add HIP once the CUDA version is settled.
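
To give a rough feel for the intended usage, here is a minimal sketch of the idea (names and signatures below are illustrative only, not the actual interface):

```cpp
// Illustrative sketch only, not the actual ThreadLocalCache interface.
// Each thread owns a private slot, so no synchronization between threads is
// ever needed, and a target is free to decide where the storage actually
// lives (shared memory, registers, or the stack).
template <typename T> class ThreadLocalCacheSketch {
  T *slot; // e.g. this thread's slot in a target-provided shared-memory array
public:
  explicit ThreadLocalCacheSketch(T *thread_slot) : slot(thread_slot) {}
  void save(const T &v) { *slot = v; } // store this thread's value
  T load() const { return *slot; }     // read it back; no sync required
};
```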

@maddyscientist
Member

@jcosborn it doesn't appear that the ThreadLocalCache is decomposing the underlying type to ensure memory coalescing when accessing the shared memory on the CUDA target. This will result in shared memory bank conflicts and likely a significant drop in performance. Am I reading this correctly?
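
To illustrate the concern, here is a schematic of the two indexing schemes (not the actual code, just the layouts being compared):

```cpp
// Schematic only: two ways to index per-thread objects of n_words 32-bit
// words in a shared-memory array serving block_size threads.

// Contiguous per-thread chunks: threads in a warp access addresses n_words
// apart, so many of them land in the same bank (bank conflicts).
inline unsigned contiguous_index(unsigned thread, unsigned word, unsigned n_words)
{
  return thread * n_words + word;
}

// Decomposed/strided layout: word w of thread t lives at w * block_size + t,
// so a warp reading the same word index touches consecutive banks.
inline unsigned strided_index(unsigned thread, unsigned word, unsigned block_size)
{
  return word * block_size + thread;
}
```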

@jcosborn
Contributor Author

Yes, it doesn't do coalescing now. I had thought you said it wasn't necessary here, but maybe I misunderstood. I can easily add that back in.

@maddyscientist
Member

I likely misunderstood, IIRC we discussed that while I was in the Houston convention center with a lot of background noise 😄

If you can add it back, that would be great, thanks.

@jcosborn
Contributor Author

Coalescing is back in now. The interface had to change a bit since subscripting can't return a reference anymore.

This is somewhat redundant with SharedMemoryCache now. One option is to factor out a common shared memory base that it, SharedMemoryCache, and thread_array could all inherit from. If that seems desirable, I can work on that before copying over to HIP.
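
For reference, a simplified sketch of why subscripting can't hand out a reference once the element is decomposed (this assumes a default-constructible, trivially copyable T whose size is a multiple of the word size; it is not the actual implementation):

```cpp
// Simplified sketch, not the actual implementation: once an element is
// decomposed into words strided across threads, no contiguous T exists in
// memory to bind a reference to, so element access has to reassemble the
// words and return a value instead.
#include <cstring>

template <typename T> struct strided_element_sketch {
  static constexpr unsigned n_words = sizeof(T) / sizeof(unsigned); // assumes sizeof(T) % sizeof(unsigned) == 0
  const unsigned *smem;     // base of the word array (shared memory on the GPU target)
  unsigned block_size, tid; // threads per block and this thread's index

  T operator[](unsigned i) const // must return by value, not by reference
  {
    unsigned words[n_words];
    for (unsigned w = 0; w < n_words; w++)
      words[w] = smem[(i * n_words + w) * block_size + tid]; // gather the strided words
    T out;
    std::memcpy(&out, words, sizeof(T)); // assumes T is trivially copyable
    return out;
  }
};
```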

@maddyscientist
Member

Hi @jcosborn. I think your suggestion of factoring out the common shared memory base is a good one. It would also be good to add a generic version of ThreadLocalCache that doesn't use shared memory and just uses the stack, and to verify that it works functionally.
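
Something along these lines is what I have in mind for the generic version (only a sketch of the idea, with illustrative names):

```cpp
// Sketch of the idea only: a generic fallback with the same save/load shape
// as the shared-memory version, but backed by an ordinary per-thread variable
// on the stack, which is enough to verify functional correctness on targets
// that don't provide (or don't want to use) shared memory.
template <typename T> class ThreadLocalCacheStackSketch {
  T storage {}; // plain stack/register storage private to the thread
public:
  void save(const T &v) { storage = v; }
  T load() const { return storage; }
};
```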

@jcosborn jcosborn marked this pull request as ready for review September 2, 2023 20:04
@jcosborn jcosborn requested review from a team as code owners September 2, 2023 20:04
Member

Why even have this file or is it just meant as a handle for future possibilities?

Contributor Author

It can be safely deleted, but I left it in case it was needed again or to possibly make merging easier.

@weinbe2
Contributor

weinbe2 commented Sep 7, 2023

@jcosborn maybe I'm missing an implementation detail, but how do you envision handling a case where you need a shared memory cache for three different objects? I.e., for conversation's sake, one for a Link, one for a double, and one for an int?

I see you already have ways to handle this in the case where you have multiple Links (in the HISQ force, for example), but I'm not so sure what you'd do in a mixed case. Maybe something like tuple<Link, double> as the offset for the int, assuming the tuple will give an appropriate size (I forget).

@jcosborn
Contributor Author

jcosborn commented Sep 7, 2023

I think this fixes the overlapping shared memory issue, though I didn't see it fail before on my GeForce or on a V100, and I'm not sure why.

@weinbe2 If you look at the fix
https://github.com/lattice/quda/pull/1380/files#diff-ca4a459b7ab46afbc002b7039de573e2868bd29ef1459c6ca28fdf91fb27757aR139
Stap uses computeStapleRectangleOps, which is array<int,4>, as its offset, and Rect uses the Stap type as its offset (which includes its own offset as part of the type), so Rect will be offset by the size of both types.
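
Schematically, the offsets chain like this (the types below are stand-ins for illustration, not the actual QUDA types):

```cpp
// Stand-in types only; this just illustrates how chained offset types add up.
#include <array>
#include <cstddef>

template <typename T, typename Offset> struct cache_sketch;

// Size contributed by an offset type: for a cache, its full footprint
// (its own payload plus its own offset); otherwise just its sizeof.
template <typename O> struct offset_size { static constexpr std::size_t value = sizeof(O); };
template <> struct offset_size<void> { static constexpr std::size_t value = 0; };
template <typename T, typename O> struct offset_size<cache_sketch<T, O>> {
  static constexpr std::size_t value = cache_sketch<T, O>::size_bytes;
};

template <typename T, typename Offset = void> struct cache_sketch {
  static constexpr std::size_t offset_bytes = offset_size<Offset>::value; // where this cache starts
  static constexpr std::size_t size_bytes = offset_bytes + sizeof(T);     // total footprint it reports
};

// Mirrors the Stap/Rect layering described above:
using Ops = std::array<int, 4>;               // plays the role of computeStapleRectangleOps
struct LinkStandIn { double m[18]; };         // plays the role of a Link
using Stap = cache_sketch<LinkStandIn, Ops>;  // starts after sizeof(Ops) bytes
using Rect = cache_sketch<LinkStandIn, Stap>; // starts after Ops plus one Link
static_assert(Rect::offset_bytes == sizeof(Ops) + sizeof(LinkStandIn), "offsets accumulate");
```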

@maddyscientist
Member

What tests were you running, James, when you didn't see failures? Anyway, I've confirmed this fixes the issue at my end.

@weinbe2
Contributor

weinbe2 commented Sep 7, 2023

> @weinbe2 If you look at the fix https://github.com/lattice/quda/pull/1380/files#diff-ca4a459b7ab46afbc002b7039de573e2868bd29ef1459c6ca28fdf91fb27757aR139 Stap uses computeStapleRectangleOps which is array<int,4> as an offset, and Rect uses the Stap type as an offset (which will include its offset as part of the type), so Rect will be offset by the size of both types.

Understood, thank you, it wasn't immediately clear to me from the implementation that it would "understand" recursive offsets (but I should have looked harder, too).

@jcosborn
Contributor Author

jcosborn commented Sep 8, 2023

@maddyscientist Ah, I was just running the test suite, but I see the tests you mention aren't in it. Should they be added?

@jcosborn
Contributor Author

I think all the comments and issues we discussed are addressed now, and it passes my tests (except for the DW issues I've reported in #1410).

@jcosborn
Contributor Author

This is ready for review now.

@maddyscientist
Member

This looks good to me. @weinbe2 can you sign off from your end?

@@ -19,6 +19,7 @@ namespace quda
// matrix+matrix = 18 floating-point ops
// => Total number of floating point ops per function call
// dims * (2*18 + 4*198) = dims*828
using computeStapleOps = thread_array<int, 4>;
Contributor

Not a blocker, but if this is being defined here, we should use this type down on line 29

Member

I disagree with your logic here @weinbe2: it's not the case that we're using this type in this kernel function; rather, this user-defined type is set to match the type used in the kernel function. Making line 29 use computeStapleOps would serve to obfuscate the code.

@@ -94,6 +95,7 @@ namespace quda
// matrix+matrix = 18 floating-point ops
// => Total number of floating point ops per function call
// dims * (8*18 + 28*198) = dims*5688
using computeStapleRectangleOps = thread_array<int, 4>;
Contributor

Use this on line 107 or delete this line

@weinbe2
Contributor

weinbe2 commented Dec 15, 2023

Looks good! I left a few little comments but it should all be straightforward. I'm not sure if this PR still needs a clang-format.

I'm doing a quick ctest on my end to cover what the CSCS run didn't get through, plus multigrid. Assuming there are no surprises (I'm not particularly worried, but famous last words), this will be good to go.

@weinbe2
Contributor

weinbe2 commented Dec 15, 2023

Did the Ls hotfix get merged in? I'm seeing some ctest failures for DWF, all of which look familiar. If it wasn't merged in yet, please merge develop back in and disregard what I have below.


If it has been merged in, though, it looks like we may have a fresh issue:

         23 - invert_test_mobius_eofa_sym_single (Failed)
---
[  FAILED  ] HeavyQuarkEvenOdd/InvertTest.verify/cg_mat_pc_normop_pc_single_l2_heavy_quark, where GetParam() = (0, 2, 3, 4, 1, 1, (-2147483648, -2147483648, -2147483648), 5)
[  FAILED  ] HeavyQuarkEvenOdd/InvertTest.verify/cg_mat_pc_normop_pc_single_solution_accumulator_pipeline8_l2_heavy_quark, where GetParam() = (0, 2, 3, 4, 1, 8, (-2147483648, -2147483648, -2147483648), 5)
[  FAILED  ] HeavyQuarkEvenOdd/InvertTest.verify/cg_mat_pc_normop_pc_half_l2_heavy_quark, where GetParam() = (0, 2, 3, 2, 1, 1, (-2147483648, -2147483648, -2147483648), 5)
[  FAILED  ] HeavyQuarkEvenOdd/InvertTest.verify/cg_mat_pc_normop_pc_half_solution_accumulator_pipeline8_l2_heavy_quark, where GetParam() = (0, 2, 3, 2, 1, 8, (-2147483648, -2147483648, -2147483648), 5)
         26 - invert_test_domain_wall_double (Failed)
---
[  FAILED  ] 6 tests, listed below:
[  FAILED  ] HeavyQuarkEvenOdd/InvertTest.verify/cg_mat_pc_normop_pc_double_l2_heavy_quark, where GetParam() = (0, 2, 3, 8, 1, 1, (-2147483648, -2147483648, -2147483648), 5)
[  FAILED  ] HeavyQuarkEvenOdd/InvertTest.verify/cg_mat_pc_normop_pc_double_solution_accumulator_pipeline8_l2_heavy_quark, where GetParam() = (0, 2, 3, 8, 1, 8, (-2147483648, -2147483648, -2147483648), 5)
[  FAILED  ] HeavyQuarkEvenOdd/InvertTest.verify/cg_mat_pc_normop_pc_single_l2_heavy_quark, where GetParam() = (0, 2, 3, 4, 1, 1, (-2147483648, -2147483648, -2147483648), 5)
[  FAILED  ] HeavyQuarkEvenOdd/InvertTest.verify/cg_mat_pc_normop_pc_single_solution_accumulator_pipeline8_l2_heavy_quark, where GetParam() = (0, 2, 3, 4, 1, 8, (-2147483648, -2147483648, -2147483648), 5)
[  FAILED  ] HeavyQuarkEvenOdd/InvertTest.verify/cg_mat_pc_normop_pc_half_l2_heavy_quark, where GetParam() = (0, 2, 3, 2, 1, 1, (-2147483648, -2147483648, -2147483648), 5)
[  FAILED  ] HeavyQuarkEvenOdd/InvertTest.verify/cg_mat_pc_normop_pc_half_solution_accumulator_pipeline8_l2_heavy_quark, where GetParam() = (0, 2, 3, 2, 1, 8, (-2147483648, -2147483648, -2147483648), 5)

In all cases it looks like it's very close to converging; a representative example:

Solution = mat_pc, Solve = normop_pc, Solver = cg, Sloppy precision = double
WARNING: CG: Restarting without reliable updates for heavy-quark residual (total #inc 1)
CG: Convergence at 71 iterations, L2 relative residual: iterated = 6.754295e-13, true = 6.754295e-13 (requested = 1.000000e-12), heavy-quark residual = 7.197024e-13 (requested = 1.000000e-12)
Done: 71 iter / 0.010148 secs = 38.7384 Gflops
Residuals: (L2 relative) tol 1.000000e-12, QUDA = 6.754295e-13, host = 1.091589e-12; (heavy-quark) tol 1.000000e-12, QUDA = 7.197024e-13
/scratch/local/quda/tests/invert_test_gtest.hpp:146: Failure
Expected: (rsd[0]) <= (tol), actual: 1.09159e-12 vs 1e-12

@weinbe2
Contributor

weinbe2 commented Dec 15, 2023

MG is good!

@jcosborn
Contributor Author

The Ls hotfix wasn't in yet. I just merged develop, so that should be there now.

Contributor

@weinbe2 weinbe2 left a comment

Everything looks great, I see the CSCS ctest is fully passing now. Approved!

@weinbe2 weinbe2 merged commit 1914dc3 into develop Dec 20, 2023
12 checks passed
@weinbe2 weinbe2 deleted the feature/sycl-merge branch December 20, 2023 20:52