add ThreadLocalCache #1380

Merged
merged 31 commits into develop from feature/sycl-merge on Dec 20, 2023

Conversation

jcosborn
Contributor

This adds a dedicated thread-local cache object that can use shared memory for storage. It is distinct from SharedMemoryCache in that there is no sharing among threads, which simplifies the interface and removes the need for a sync operation. Since sharing isn't needed, targets can choose not to use shared memory if that is advantageous. Note that thread_array could be replaced by this in the future, but that is not being done here.

TODO: add HIP version
I'll add HIP once the CUDA version is settled.
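
To give a rough feel for the intended usage, here is a minimal sketch of the idea (names and signatures below are illustrative only, not the actual interface):

```cpp
// Illustrative sketch only, not the actual ThreadLocalCache interface.
// Each thread owns a private slot, so no synchronization between threads is
// ever needed, and a target is free to decide where the storage actually
// lives (shared memory, registers, or the stack).
template <typename T> class ThreadLocalCacheSketch {
  T *slot; // e.g. this thread's slot in a target-provided shared-memory array
public:
  explicit ThreadLocalCacheSketch(T *thread_slot) : slot(thread_slot) {}
  void save(const T &v) { *slot = v; } // store this thread's value
  T load() const { return *slot; }     // read it back; no sync required
};
```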

@maddyscientist
Member

@jcosborn it doesn't appear that the ThreadLocalCache is decomposing the underlying type to ensure memory coalescing when accessing the shared memory on the CUDA target. This will result in shared memory bank conflicts and likely a significant drop in performance. Am I reading this correctly?
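
To illustrate the concern, here is a schematic of the two indexing schemes (not the actual code, just the layouts being compared):

```cpp
// Schematic only: two ways to index per-thread objects of n_words 32-bit
// words in a shared-memory array serving block_size threads.

// Contiguous per-thread chunks: threads in a warp access addresses n_words
// apart, so many of them land in the same bank (bank conflicts).
inline unsigned contiguous_index(unsigned thread, unsigned word, unsigned n_words)
{
  return thread * n_words + word;
}

// Decomposed/strided layout: word w of thread t lives at w * block_size + t,
// so a warp reading the same word index touches consecutive banks.
inline unsigned strided_index(unsigned thread, unsigned word, unsigned block_size)
{
  return word * block_size + thread;
}
```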

@jcosborn
Contributor Author

Yes, it doesn't do coalescing now. I had thought you said it wasn't necessary here, but maybe I misunderstood. I can easily add that back in.

@maddyscientist
Member

I likely misunderstood, IIRC we discussed that while I was in the Houston convention center with a lot of background noise 😄

If you can add it back, that would be great, thanks.

@jcosborn
Contributor Author

Coalescing is back in now. The interface had to change a bit since subscripting can't return a reference anymore.

This is somewhat redundant with SharedMemoryCache now. One option is to factor out a common shared memory base that it, SharedMemoryCache, and thread_array could all inherit from. If that seems desirable, I can work on that before copying over to HIP.
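
For reference, a simplified sketch of why subscripting can't hand out a reference once the element is decomposed (this assumes a default-constructible, trivially copyable T whose size is a multiple of the word size; it is not the actual implementation):

```cpp
// Simplified sketch, not the actual implementation: once an element is
// decomposed into words strided across threads, no contiguous T exists in
// memory to bind a reference to, so element access has to reassemble the
// words and return a value instead.
#include <cstring>

template <typename T> struct strided_element_sketch {
  static constexpr unsigned n_words = sizeof(T) / sizeof(unsigned); // assumes sizeof(T) % sizeof(unsigned) == 0
  const unsigned *smem;     // base of the word array (shared memory on the GPU target)
  unsigned block_size, tid; // threads per block and this thread's index

  T operator[](unsigned i) const // must return by value, not by reference
  {
    unsigned words[n_words];
    for (unsigned w = 0; w < n_words; w++)
      words[w] = smem[(i * n_words + w) * block_size + tid]; // gather the strided words
    T out;
    std::memcpy(&out, words, sizeof(T)); // assumes T is trivially copyable
    return out;
  }
};
```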

@maddyscientist
Member

Hi @jcosborn. I think your suggestion of factoring out the common shared memory base is a good one. It would also be good to add a generic version of ThreadLocalCache that doesn't use shared memory and just uses the stack, and to verify that it works functionally.
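
Something along these lines is what I have in mind for the generic version (only a sketch of the idea, with illustrative names):

```cpp
// Sketch of the idea only: a generic fallback with the same save/load shape
// as the shared-memory version, but backed by an ordinary per-thread variable
// on the stack, which is enough to verify functional correctness on targets
// that don't provide (or don't want to use) shared memory.
template <typename T> class ThreadLocalCacheStackSketch {
  T storage {}; // plain stack/register storage private to the thread
public:
  void save(const T &v) { storage = v; }
  T load() const { return storage; }
};
```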

@jcosborn jcosborn marked this pull request as ready for review September 2, 2023 20:04
@jcosborn jcosborn requested review from a team as code owners September 2, 2023 20:04
Member

Why even have this file or is it just meant as a handle for future possibilities?

Contributor Author

It can be safely deleted, but I left it in case it was needed again or to possibly make merging easier.

@weinbe2
Contributor

weinbe2 commented Sep 7, 2023

@jcosborn maybe I'm missing an implementation detail, but how do you envision handling a case where you need a shared memory cache for three different objects? I.e., for conversation's sake, one for a Link, one for a double, and one for an int?

I see you already have ways to handle this in the case where you have multiple Links (in the HISQ force, for example), but I'm not so sure what you'd do in a mixed case. Maybe something like tuple<Link, double> as the offset for the int, assuming the tuple will give an appropriate size (I forget).

@jcosborn
Contributor Author

jcosborn commented Sep 7, 2023

I think this fixes the overlapping shared memory issue, though I didn't see it fail before on my GeForce or on a V100, and I'm not sure why.

@weinbe2 If you look at the fix
https://github.com/lattice/quda/pull/1380/files#diff-ca4a459b7ab46afbc002b7039de573e2868bd29ef1459c6ca28fdf91fb27757aR139
Stap uses computeStapleRectangleOps, which is array<int,4>, as its offset, and Rect uses the Stap type as its offset (which includes its own offset as part of the type), so Rect will be offset by the size of both types.
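
Schematically, the offsets chain like this (the types below are stand-ins for illustration, not the actual QUDA types):

```cpp
// Stand-in types only; this just illustrates how chained offset types add up.
#include <array>
#include <cstddef>

template <typename T, typename Offset> struct cache_sketch;

// Size contributed by an offset type: for a cache, its full footprint
// (its own payload plus its own offset); otherwise just its sizeof.
template <typename O> struct offset_size { static constexpr std::size_t value = sizeof(O); };
template <> struct offset_size<void> { static constexpr std::size_t value = 0; };
template <typename T, typename O> struct offset_size<cache_sketch<T, O>> {
  static constexpr std::size_t value = cache_sketch<T, O>::size_bytes;
};

template <typename T, typename Offset = void> struct cache_sketch {
  static constexpr std::size_t offset_bytes = offset_size<Offset>::value; // where this cache starts
  static constexpr std::size_t size_bytes = offset_bytes + sizeof(T);     // total footprint it reports
};

// Mirrors the Stap/Rect layering described above:
using Ops = std::array<int, 4>;               // plays the role of computeStapleRectangleOps
struct LinkStandIn { double m[18]; };         // plays the role of a Link
using Stap = cache_sketch<LinkStandIn, Ops>;  // starts after sizeof(Ops) bytes
using Rect = cache_sketch<LinkStandIn, Stap>; // starts after Ops plus one Link
static_assert(Rect::offset_bytes == sizeof(Ops) + sizeof(LinkStandIn), "offsets accumulate");
```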

@maddyscientist
Member

What tests were you running, James, when you didn't see failures? Anyway, I've confirmed this fixes the issue at my end.

@weinbe2
Contributor

weinbe2 commented Sep 7, 2023

> @weinbe2 If you look at the fix https://github.com/lattice/quda/pull/1380/files#diff-ca4a459b7ab46afbc002b7039de573e2868bd29ef1459c6ca28fdf91fb27757aR139 Stap uses computeStapleRectangleOps which is array<int,4> as an offset, and Rect uses the Stap type as an offset (which will include its offset as part of the type), so Rect will be offset by the size of both types.

Understood, thank you, it wasn't immediately clear to me from the implementation that it would "understand" recursive offsets (but I should have looked harder, too).

@jcosborn
Contributor Author

jcosborn commented Sep 8, 2023

@maddyscientist Ah, I was just running the test suite, but I see the tests you mention aren't in it. Should they be added?

@jcosborn
Contributor Author

I think all the comments and issues we discussed are addressed now, and it passes my tests (except for the DW issues I've reported in #1410).

@jcosborn
Contributor Author

This is ready for review now.

@maddyscientist
Member

This looks good to me. @weinbe2 can you sign off from your end?

@@ -19,6 +19,7 @@ namespace quda
// matrix+matrix = 18 floating-point ops
// => Total number of floating point ops per function call
// dims * (2*18 + 4*198) = dims*828
using computeStapleOps = thread_array<int, 4>;
Contributor

Not a blocker, but if this is being defined here, we should use this type down on line 29

Member

I disagree with your logic here @weinbe2: it's not the case that we're using this type in this kernel function; rather, this user-defined type is set to match the type used in the kernel function. Making line 29 use computeStapleOps would serve to obfuscate the code.

@@ -94,6 +95,7 @@ namespace quda
// matrix+matrix = 18 floating-point ops
// => Total number of floating point ops per function call
// dims * (8*18 + 28*198) = dims*5688
using computeStapleRectangleOps = thread_array<int, 4>;
Contributor

Use this on line 107 or delete this line

@weinbe2
Contributor

weinbe2 commented Dec 15, 2023

Looks good! I left a few little comments but it should all be straightforward. I'm not sure if this PR still needs a clang-format.

I'm doing a quick ctest on my end to cover what the CSCS run didn't get through, plus multigrid. Assuming there are no surprises (I'm not particularly worried, but famous last words), this will be good to go.

@weinbe2
Contributor

weinbe2 commented Dec 15, 2023

Did the Ls hotfix get merged in? I'm seeing some ctest failures for DWF, all of which look familiar. If it wasn't merged in yet, please merge develop back in and disregard what I have below.


If it has been merged in, though, it looks like we may have a fresh issue:

         23 - invert_test_mobius_eofa_sym_single (Failed)
---
[  FAILED  ] HeavyQuarkEvenOdd/InvertTest.verify/cg_mat_pc_normop_pc_single_l2_heavy_quark, where GetParam() = (0, 2, 3, 4, 1, 1, (-2147483648, -2147483648, -2147483648), 5)
[  FAILED  ] HeavyQuarkEvenOdd/InvertTest.verify/cg_mat_pc_normop_pc_single_solution_accumulator_pipeline8_l2_heavy_quark, where GetParam() = (0, 2, 3, 4, 1, 8, (-2147483648, -2147483648, -2147483648), 5)
[  FAILED  ] HeavyQuarkEvenOdd/InvertTest.verify/cg_mat_pc_normop_pc_half_l2_heavy_quark, where GetParam() = (0, 2, 3, 2, 1, 1, (-2147483648, -2147483648, -2147483648), 5)
[  FAILED  ] HeavyQuarkEvenOdd/InvertTest.verify/cg_mat_pc_normop_pc_half_solution_accumulator_pipeline8_l2_heavy_quark, where GetParam() = (0, 2, 3, 2, 1, 8, (-2147483648, -2147483648, -2147483648), 5)
         26 - invert_test_domain_wall_double (Failed)
---
[  FAILED  ] 6 tests, listed below:
[  FAILED  ] HeavyQuarkEvenOdd/InvertTest.verify/cg_mat_pc_normop_pc_double_l2_heavy_quark, where GetParam() = (0, 2, 3, 8, 1, 1, (-2147483648, -2147483648, -2147483648), 5)
[  FAILED  ] HeavyQuarkEvenOdd/InvertTest.verify/cg_mat_pc_normop_pc_double_solution_accumulator_pipeline8_l2_heavy_quark, where GetParam() = (0, 2, 3, 8, 1, 8, (-2147483648, -2147483648, -2147483648), 5)
[  FAILED  ] HeavyQuarkEvenOdd/InvertTest.verify/cg_mat_pc_normop_pc_single_l2_heavy_quark, where GetParam() = (0, 2, 3, 4, 1, 1, (-2147483648, -2147483648, -2147483648), 5)
[  FAILED  ] HeavyQuarkEvenOdd/InvertTest.verify/cg_mat_pc_normop_pc_single_solution_accumulator_pipeline8_l2_heavy_quark, where GetParam() = (0, 2, 3, 4, 1, 8, (-2147483648, -2147483648, -2147483648), 5)
[  FAILED  ] HeavyQuarkEvenOdd/InvertTest.verify/cg_mat_pc_normop_pc_half_l2_heavy_quark, where GetParam() = (0, 2, 3, 2, 1, 1, (-2147483648, -2147483648, -2147483648), 5)
[  FAILED  ] HeavyQuarkEvenOdd/InvertTest.verify/cg_mat_pc_normop_pc_half_solution_accumulator_pipeline8_l2_heavy_quark, where GetParam() = (0, 2, 3, 2, 1, 8, (-2147483648, -2147483648, -2147483648), 5)

In all cases it looks like it's very close to converging; a representative example:

Solution = mat_pc, Solve = normop_pc, Solver = cg, Sloppy precision = double
WARNING: CG: Restarting without reliable updates for heavy-quark residual (total #inc 1)
CG: Convergence at 71 iterations, L2 relative residual: iterated = 6.754295e-13, true = 6.754295e-13 (requested = 1.000000e-12), heavy-quark residual = 7.197024e-13 (requested = 1.000000e-12)
Done: 71 iter / 0.010148 secs = 38.7384 Gflops
Residuals: (L2 relative) tol 1.000000e-12, QUDA = 6.754295e-13, host = 1.091589e-12; (heavy-quark) tol 1.000000e-12, QUDA = 7.197024e-13
/scratch/local/quda/tests/invert_test_gtest.hpp:146: Failure
Expected: (rsd[0]) <= (tol), actual: 1.09159e-12 vs 1e-12

@weinbe2
Contributor

weinbe2 commented Dec 15, 2023

MG is good!

@jcosborn
Contributor Author

The Ls hotfix wasn't in yet. I just merged develop, so that should be there now.

Contributor

@weinbe2 weinbe2 left a comment

Everything looks great, I see the CSCS ctest is fully passing now. Approved!

@weinbe2 weinbe2 merged commit 1914dc3 into develop Dec 20, 2023
12 checks passed
@weinbe2 weinbe2 deleted the feature/sycl-merge branch December 20, 2023 20:52