[alpaka] Port Clusterizer to Alpaka #172
Conversation
…s in idx start values, and block-local versus grid-local indices.
… Address this issue in a relatively clean way, with addition of helper function. Now have perf identical between legacy CUDA version and Alpaka CUDA.
…y and be able to compare perf.
…at run-time (removing -Wno-vla), but no choice to be able to do same as legacy.
I think the performance issues mentioned yesterday can be treated separately.
I need to investigate more on alpaka::getWorkDiv. If it confirms itself costly, it is straightforward to factor it out of the helper functions.
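For illustration, a minimal sketch of what factoring the work-division query out of the per-element helpers could look like. The helper name and parameters below are assumptions, not the actual AlpakaCore interface, and the exact getWorkDiv spelling depends on the alpaka version in use:

```cpp
#include <algorithm>
#include <cstdint>
#include <alpaka/alpaka.hpp>

// Hypothetical helper: it receives the number of elements per thread as a
// plain argument instead of calling alpaka::workdiv::getWorkDiv itself.
template <typename TAcc, typename Func>
ALPAKA_FN_ACC void for_each_element(const TAcc& acc,
                                    uint32_t elementsPerThread,
                                    uint32_t numElements,
                                    Func func) {
  const uint32_t threadIdx = alpaka::idx::getIdx<alpaka::Grid, alpaka::Threads>(acc)[0u];
  const uint32_t firstElementIdx = threadIdx * elementsPerThread;
  const uint32_t endElementIdx = std::min(firstElementIdx + elementsPerThread, numElements);
  for (uint32_t i = firstElementIdx; i < endElementIdx; ++i) {
    func(i);
  }
}

// Inside a kernel, the work division would then be queried only once and the
// result reused for every helper call, e.g.:
//
//   const uint32_t elementsPerThread =
//       alpaka::workdiv::getWorkDiv<alpaka::Thread, alpaka::Elems>(acc)[0u];
//   for_each_element(acc, elementsPerThread, numElements, [&](uint32_t i) { /* ... */ });
```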
for (uint32_t i = threadIdx; i < std::min(endElementIdx, maxNumberOfElements); ++i) {
if (endElementIdx > maxNumberOfElements) {
  endElementIdx = maxNumberOfElements;
}
I would have kept endElementIdx = std::min(endElementIdx, maxNumberOfElements), but it probably doesn't make much difference.
uint32_t firstElementIdx = firstElementIdxNoStride[0u];
uint32_t endElementIdx = endElementIdxNoStride[0u];
for (uint32_t i = firstElementIdx; i < numElements; ++i) {
  if (!cms::alpakatools::get_next_element_1D_index_stride(
Is the main reason for this function (instead of for_each_element_...) the fact that the loop body uses break?
Yes, the issue here was that two loops are introduced with Alpaka: one for the strided access, and an inner one for the CPU elements. Whether using a helper function and a lambda, or doing the looping directly at the call site, one faces this two-loop situation.
This was an issue when the legacy code had break statements, since obviously only the innermost loop was exited. A trivial workaround, which I did first, was of course to use a boolean flag to also exit the outer loop when the inner one breaks.
But those gymnastics with the boolean had to be repeated at every call site with such a block, hence code duplication.
This helper function provides the same logic as the two loops, but with only one loop and no extra storage, which makes porting the legacy code easier: nothing has to be changed regarding break statements at the call site, and they can be kept 1:1.
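For context, a minimal self-contained sketch of the single-loop pattern described above. The helper below is a simplified stand-in for cms::alpakatools::get_next_element_1D_index_stride; the real signature and argument order may differ:

```cpp
#include <cstdint>

// Simplified stand-in for the real helper: advances `i` within the current
// per-thread element range [firstIdx, endIdx) and, once that range is
// exhausted, jumps both bounds forward by the grid stride.
// Returning false ends the iteration.
inline bool next_element_1D_index_stride(uint32_t& i,
                                         uint32_t& firstIdx,
                                         uint32_t& endIdx,
                                         uint32_t stride,
                                         uint32_t numElements) {
  if (i < endIdx)
    return true;  // still inside the current contiguous range of elements
  // Current range exhausted: move on to the next strided range.
  firstIdx += stride;
  endIdx += stride;
  i = firstIdx;
  return i < numElements;
}

void processAllElements(uint32_t firstIdx, uint32_t endIdx, uint32_t stride, uint32_t numElements) {
  // Single loop over both the strided ranges and the elements within a range:
  // a `break` in the body exits the whole iteration, exactly as in the legacy
  // one-loop CUDA kernel, with no extra boolean flag.
  for (uint32_t i = firstIdx; i < numElements; ++i) {
    if (!next_element_1D_index_stride(i, firstIdx, endIdx, stride, numElements))
      break;
    // ... legacy loop body (possibly containing `break`) kept 1:1 ...
  }
}
```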
alpaka::block::sync::syncBlockThreads(acc);

// renumber
auto&& ws = alpaka::block::shared::st::allocVar<uint16_t[32], __COUNTER__>(acc);
The 32 here is related to the warp size, right? We should probably look later into how to make that dependent on acc.
Yes totally.
namespace gpuClustering {

#ifdef GPU_DEBUG
  ALPAKA_STATIC_ACC_MEM_GLOBAL uint32_t gMaxHit = 0;
Does ALPAKA_STATIC_ACC_MEM_GLOBAL depend on the accelerator type? If yes, I suppose this should get different symbols for each accelerator (can be addressed later since it's "only" for debugging).
Thanks. I think it should be moved into ALPAKA_ACCELERATOR_NAMESPACE, because otherwise the same symbol would be defined e.g. for the serial and TBB backends (leading to an ODR violation) when we (later) try to link all of them together. But as I said before, this is not important for now since it is only needed for debugging; perhaps a comment here would help to "remember" it later?
Yes makes sense, ok will add that
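For reference, a minimal sketch (not the actual change in this PR) of what wrapping the debug symbol in the accelerator-specific namespace could look like, so that each backend gets its own distinct symbol and linking several backends together does not violate the ODR:

```cpp
#ifdef GPU_DEBUG
namespace ALPAKA_ACCELERATOR_NAMESPACE {
  namespace gpuClustering {
    // One gMaxHit per backend (serial, TBB, CUDA, ...), so the symbol name
    // differs between the per-backend translation units.
    ALPAKA_STATIC_ACC_MEM_GLOBAL uint32_t gMaxHit = 0;
  }  // namespace gpuClustering
}  // namespace ALPAKA_ACCELERATOR_NAMESPACE
#endif  // GPU_DEBUG
```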
static_assert(MaxNumModules == 2000u,
              "MaxNumModules not copied to device code to preserve same interface. Hardcoded value "
              "assuming MaxNumModules == 2000.");
auto loc = alpaka::atomic::atomicOp<alpaka::atomic::op::Inc>(acc, moduleStart, 2000u);
Why did MaxNumModules not work directly?
Yes, I was getting error: identifier "gpuClustering::MaxNumModules" is undefined in device code.
But since it is constexpr it should have worked.
I recall now a similar issue at some point with Kokkos; we used a trick along the lines of:
constexpr auto MAX = MaxNumModules;
auto loc = alpaka::atomic::atomicOp<alpaka::atomic::op::Inc>(acc, moduleStart, MAX);
Ha, yes, good idea, I will do that.
But this is frustrating: since it is constexpr it should have worked, no?
I fully agree it is annoying, but I have no clue why the direct constexpr doesn't work.
constexpr auto local = kGlobal;
Yes, basically to avoid having this code duplication in many kernels.
Should it be defined in CUDACore?
The main interest here, versus just calling std::decay_t<T> directly, is that the function is constexpr, right?
It's just that I couldn't get std::decay_t(kGlobal) to work; it seems to require the type explicitly.
Ha ok, yes, I had tried it out before posting my comment, and indeed std::decay_t<uint32_t>(MaxNumModules) seemed to be the only way to go.
Is the trick that by_value() effectively returns a copy of the argument, and the temporary (to which the const reference is bound) is kept alive until the function returns?
Should it be defined in CUDACore?
I would place it in AlpakaCore (in retrospect, CMSUnrollLoop.h would be better placed there as well). Ideally users should not need to include any platform-specific headers.
Is the trick that by_value() effectively returns a copy of the argument, and the temporary (to which the const reference is bound) is kept alive until the function returns?
I guess. And then I hope the compiler elides the whole "making a copy of the argument" business.
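To make the discussion concrete, a minimal sketch of a by_value()-style helper as it is understood here. This is an assumed implementation, not necessarily the one added to AlpakaCore, and in real device code the helper would also need the appropriate host/device annotations:

```cpp
#include <cstdint>
#include <type_traits>
#include <utility>

// Returns a decayed copy of its argument. Passing a namespace-scope constexpr
// constant through this helper hands the callee a temporary value, so the
// constant itself is never ODR-used from device code (which is what triggers
// the "identifier ... is undefined in device code" error).
template <typename T>
constexpr std::decay_t<T> by_value(T&& value) {
  return std::forward<T>(value);
}

namespace gpuClustering {
  constexpr uint32_t MaxNumModules = 2000;
}

// A callee that, like alpaka::atomic::atomicOp is assumed to here, takes its
// operand by const reference.
uint32_t addOne(const uint32_t& value) { return value + 1; }

int main() {
  // The temporary returned by by_value() lives until the end of the full
  // expression, so the const reference inside addOne() stays valid for the
  // duration of the call.
  return addOne(by_value(gpuClustering::MaxNumModules)) == 2001u ? 0 : 1;
}
```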
@ghugo83 a suggestion for the future: please clean up the commit history before a PR is merged (applies to both #167 and #172).
An alternative is to simply squash all commits into a single one when making or merging a PR. Note: the commit history in CMSSW is a horrible mess, and should not be used as a guideline...
Am I correct that this PR does not enable running the pixel clusterizer in the alpaka binary?
Yes, ok, I will do that for the next PRs.
Yes, in this PR it is enabled via the clusterizer test but not through the alpaka binary; I will submit the next developments in a further PR.
The actual diff introducing the clusterizer can be seen at: https://github.com/ghugo83/pixeltrack-standalone/compare/clusterizer_none...ghugo83:clusterizer?expand=1
Port clusterizer to Alpaka (plugin-SiPixelClusterizer), and its test.
Compiles and runs smoothly, results cross-checked.
Performance:
NB: This clusterizer branch is based on top of the branch from earlier this week, which introduced general helper functions and adjustments in prefixScan: https://github.com/cms-patatrack/pixeltrack-standalone/pull/167/files#diff-b9eb60024878e85d22ffd1316bd2c4af1e85dd2ae6de9069e8fa2e01d8935f71