add ThreadLocalCache #1380
Conversation
@jcosborn it doesn't appear that the …
Yes, it doesn't do coalescing now. I had thought you said it wasn't necessary here, but maybe I misunderstood. I can easily add that back in.
I likely misunderstood, IIRC we discussed that while I was in the Houston convention center with a lot of background noise 😄 If you can add it back, that would be great, thanks.
Coalescing is back in now. The interface had to change a bit, since subscripting can't return a reference anymore. This is somewhat redundant with SharedMemoryCache now. One option is to factor out a common shared memory base that they and thread_array could all inherit from. If that seems desirable, I can work on that before copying over to HIP.
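To make the interface change concrete, here is a minimal, hedged sketch (names and layout are my assumptions, not the QUDA implementation) of why a coalesced layout forces by-value access: each logical element is split into atoms that are interleaved across the thread block, so consecutive threads touch consecutive addresses, and no contiguous `T` exists to return a reference to.

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Illustrative sketch only: an element type made of n_atom int "atoms" is
// stored interleaved across the thread block, so on a GPU consecutive threads
// read consecutive addresses (coalesced access). Because one logical element
// is scattered across strided slots, an accessor cannot return a reference;
// loads must gather and stores must scatter, passing values by copy.
template <int n_atom> struct CoalescedThreadCache {
  using T = std::array<int, n_atom>;
  std::vector<int> &smem;  // stands in for dynamic shared memory
  std::size_t block_size;  // number of threads in the block
  std::size_t tid;         // this "thread's" index

  CoalescedThreadCache(std::vector<int> &s, std::size_t b, std::size_t t)
      : smem(s), block_size(b), tid(t) {}

  // scatter: atom j of element i lives at (i * n_atom + j) * block_size + tid
  void save(std::size_t i, const T &v) {
    for (int j = 0; j < n_atom; j++)
      smem[(i * n_atom + j) * block_size + tid] = v[j];
  }

  // gather: must return by value, since the atoms are not contiguous
  T load(std::size_t i) const {
    T v;
    for (int j = 0; j < n_atom; j++)
      v[j] = smem[(i * n_atom + j) * block_size + tid];
    return v;
  }
};
```

With this layout, atom 0 of thread 0 and atom 0 of thread 1 sit in adjacent slots, which is exactly the coalescing property, and also exactly why `operator[]` returning `T&` is no longer possible.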
Hi @jcosborn. I think your suggestion of factoring out the common shared memory base is a good one. Also good to add a generic version of the …
Why even have this file or is it just meant as a handle for future possibilities?
It can be safely deleted, but I left it in case it was needed again or to possibly make merging easier.
@jcosborn maybe I'm missing an implementation detail, but how do you envision handling a case where you need a shared memory cache for three different objects? I.e., for conversation's sake, one for a … I see you already have ways to handle this in the case where you have multiple …
I think this fixes the overlapping shared mem issue, though I didn't see it fail before on my GeForce, or on a V100, and I'm not sure why. @weinbe2 If you look at the fix …
What tests were you running, James, when you didn't see failures? Anyway, I've confirmed this fixes the issue at my end.
Understood, thank you; it wasn't immediately clear to me from the implementation that it would "understand" recursive offsets (but I should have looked harder, too).
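As an aside, the recursive-offset idea discussed above can be sketched as follows. This is a hedged illustration with hypothetical names (`CacheRegion`, `NoPrecedingCache`), not QUDA's actual API: each shared-memory consumer derives its byte offset from the consumer declared before it, and the recursion bottoms out at a base case with offset 0, so any number of caches occupy disjoint regions instead of overlapping.

```cpp
#include <cstddef>

// Hedged illustration: each region starts where the previous one ends, with
// the "previous" region supplied as a template parameter. The chain of
// offset() calls is the "recursive offsets" idea: region N's offset is the
// sum of the sizes of regions 0..N-1, computed at compile time.
struct NoPrecedingCache {
  static constexpr std::size_t offset() { return 0; }
  static constexpr std::size_t bytes() { return 0; }  // occupies nothing
};

template <typename T, std::size_t N, typename Prev = NoPrecedingCache>
struct CacheRegion {
  // this region begins where the previous one ends (recursively computed)
  static constexpr std::size_t offset() { return Prev::offset() + Prev::bytes(); }
  static constexpr std::size_t bytes() { return N * sizeof(T); }
};
```

Chaining three regions then yields offsets 0, 32, and 64 bytes for, say, 8 floats, 4 doubles, and 2 ints, so a third (or fourth) object never aliases the first two.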
@maddyscientist Ah, I was just running the test suite, but I see the tests you mention aren't in it. Should they be added?
I think all the comments and issues we discussed are addressed now, and it passes my tests (except for the DW issues I've reported in #1410).
This is ready for review now.
This looks good to me. @weinbe2 can you sign off from your end?
```diff
@@ -19,6 +19,7 @@ namespace quda
   // matrix+matrix = 18 floating-point ops
   // => Total number of floating point ops per function call
   // dims * (2*18 + 4*198) = dims*828
+  using computeStapleOps = thread_array<int, 4>;
```
Not a blocker, but if this is being defined here, we should use this type down on line 29
I disagree with your logic here @weinbe2: it's not the case that we're using this type in this kernel function; rather, this user-defined type is set to match the type used in the kernel function. Making line 29 use computeStapleOps would serve to obfuscate the code.
```diff
@@ -94,6 +95,7 @@ namespace quda
   // matrix+matrix = 18 floating-point ops
   // => Total number of floating point ops per function call
   // dims * (8*18 + 28*198) = dims*5688
+  using computeStapleRectangleOps = thread_array<int, 4>;
```
Use this on line 107 or delete this line
Looks good! I left a few little comments but it should all be straightforward. I'm not sure if this PR still needs a … I'm doing a quick …
Did the … If it has been merged in, though, it looks like we may have a fresh issue:
In all cases, it looks like it's very close to converging, representative:
MG is good!
The Ls hotfix wasn't in yet. I just merged develop, so that should be there now.
Everything looks great, I see the CSCS ctest is fully passing now. Approved!
This adds a dedicated thread-local cache object that can use shared memory for storage. It is distinct from SharedMemoryCache in that there is no sharing among threads, which simplifies the interface and removes the need for a sync operation. Since sharing isn't needed, targets can choose not to use shared memory where that is advantageous. Note that thread_array could be replaced by this in the future, but that is not being done here.
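The "no sharing, no sync" point can be illustrated with a minimal sketch (hypothetical names, not the QUDA interface): because no thread ever reads another thread's entries, no barrier is needed anywhere, and a target is free to back the cache with plain per-thread storage (registers/local memory) instead of shared memory without changing the calling code.

```cpp
#include <array>

// Minimal sketch under stated assumptions: a thread-local cache backed by
// ordinary per-thread storage. Since no other thread ever observes this
// data, there is no __syncthreads()-style barrier in save() or load(), and
// a target could swap the backing store for a strided shared-memory view
// (as in the coalesced case) without altering any caller.
template <typename T, int N> struct RegisterBackedCache {
  std::array<T, N> data{};  // per-thread storage; could be shared memory
  void save(int i, const T &v) { data[i] = v; }
  T load(int i) const { return data[i]; }
};

// example use: per-thread scratch with no synchronization anywhere
inline int scratch_sum() {
  RegisterBackedCache<int, 4> cache;
  for (int i = 0; i < 4; i++) cache.save(i, i * i);
  int s = 0;
  for (int i = 0; i < 4; i++) s += cache.load(i);
  return s;  // 0 + 1 + 4 + 9
}
```

The interchangeability of backends is exactly what lets each target pick whichever storage is fastest for it.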
TODO: add HIP version
I'll add HIP once the CUDA version is settled.