
[alpaka] Port Clusterizer to Alpaka #172

Merged · 24 commits · Apr 1, 2021

Conversation

ghugo83 (Contributor) commented Feb 19, 2021

The actual diff introducing the clusterizer can be seen at: https://github.com/ghugo83/pixeltrack-standalone/compare/clusterizer_none...ghugo83:clusterizer?expand=1

Port the clusterizer (plugin-SiPixelClusterizer) to Alpaka, together with its test.

It compiles and runs smoothly; the results have been cross-checked.

Performance:

NB: This clusterizer branch is based on top of the branch from earlier this week, which introduced general helper functions plus adjustments in prefixScan: https://github.com/cms-patatrack/pixeltrack-standalone/pull/167/files#diff-b9eb60024878e85d22ffd1316bd2c4af1e85dd2ae6de9069e8fa2e01d8935f71

ghugo83 (Contributor, Author) commented Mar 2, 2021

I think the performance issues mentioned yesterday can be treated separately.

ghugo83 (Contributor, Author) commented Mar 2, 2021

I need to investigate alpaka::getWorkDiv further. If it turns out to be costly, it is straightforward to factor it out of the helper functions.
But in that case we will have a deeper issue anyway, since already replacing it with a hardcoded value in one call per block seems to have a significant effect. To be investigated; it will be treated separately.

for (uint32_t i = threadIdx; i < std::min(endElementIdx, maxNumberOfElements); ++i) {
if (endElementIdx > maxNumberOfElements) {
endElementIdx = maxNumberOfElements;
}
Collaborator commented:

I would have kept endElementIdx = std::min(endElementIdx, maxNumberOfElements), but it probably doesn't make much difference.

uint32_t firstElementIdx = firstElementIdxNoStride[0u];
uint32_t endElementIdx = endElementIdxNoStride[0u];
for (uint32_t i = firstElementIdx; i < numElements; ++i) {
if (!cms::alpakatools::get_next_element_1D_index_stride(
Collaborator commented:

Is the main reason for this function (instead of for_each_element_...) the fact that the loop body does break?

ghugo83 (Contributor, Author) commented Mar 31, 2021

Yes, the issue here was that two loops are introduced with Alpaka: one for the strided access, and an inner one for the CPU elements. Whether one uses a helper function and a lambda, or does the looping directly at the call site, one faces this two-loop situation.
This was an issue wherever the legacy code had break statements, since obviously only the innermost loop was exited. A trivial workaround, which I used at first, was of course to use a boolean to exit the innermost loop when needed.
But the gymnastics with the boolean had to be repeated each time there was such a block at the call site, hence code duplication.

This helper function provides the same logic as the two loops, but with only one loop and no extra storage, which makes porting the legacy code easier: nothing has to be changed regarding break statements at the call site, and they can be kept 1:1.

alpaka::block::sync::syncBlockThreads(acc);

// renumber
auto&& ws = alpaka::block::shared::st::allocVar<uint16_t[32], __COUNTER__>(acc);
Collaborator commented:

The 32 here is related to the warp size, right? We should probably look later at how to make that depend on the acc.

ghugo83 (Contributor, Author) commented:

Yes totally.

namespace gpuClustering {

#ifdef GPU_DEBUG
ALPAKA_STATIC_ACC_MEM_GLOBAL uint32_t gMaxHit = 0;
Collaborator commented:

Does ALPAKA_STATIC_ACC_MEM_GLOBAL depend on the accelerator type? If yes, I suppose this should get different symbols for each accelerator (can be addressed later since it's "only" for debugging).


Collaborator commented:

Thanks. I think it should be moved into ALPAKA_ACCELERATOR_NAMESPACE, because otherwise the same symbol would be defined e.g. for the serial and TBB backends (leading to an ODR violation) when we (later) try to link all of them together. But as I said before, this is not important for now, since it is needed only for debugging; perhaps a comment here would help us remember later?
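For illustration, a minimal sketch of the ODR concern and the fix, with made-up namespace names standing in for whatever ALPAKA_ACCELERATOR_NAMESPACE actually expands to:

```cpp
#include <cstdint>

// If the same `uint32_t gMaxHit = 0;` definition appears at namespace scope
// in the translation units of several backends, linking them together
// violates the one-definition rule. Wrapping the definition in a per-backend
// namespace (as ALPAKA_ACCELERATOR_NAMESPACE is meant to do) gives each
// backend its own distinct symbol. The namespace names below are
// illustrative only.
namespace backend_serial_sync {
  uint32_t gMaxHit = 0;
}
namespace backend_tbb_async {
  uint32_t gMaxHit = 0;  // same name, different symbol: no clash at link time
}
```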

ghugo83 (Contributor, Author) commented:

Yes, that makes sense; OK, I will add that.

static_assert(MaxNumModules == 2000u,
"MaxNumModules not copied to device code to preserve same interface. Hardcoded value "
"assuming MaxNumModules == 2000.");
auto loc = alpaka::atomic::atomicOp<alpaka::atomic::op::Inc>(acc, moduleStart, 2000u);
Collaborator commented:

Why did MaxNumModules not work directly?

ghugo83 (Contributor, Author) commented:

Yes, I was getting error: identifier "gpuClustering::MaxNumModules" is undefined in device code.
But since it is constexpr, it should have worked.

Collaborator commented:

I now recall a similar issue at some point with Kokkos; we used a trick along the lines of

constexpr auto MAX = MaxNumModules;
auto loc = alpaka::atomic::atomicOp<alpaka::atomic::op::Inc>(acc, moduleStart, MAX);

ghugo83 (Contributor, Author) commented:

Ha, yes, good idea, I will do that.
But it is frustrating: since it is constexpr, it should have worked, no?

Collaborator commented:

I fully agree it is annoying, but I have no clue why the direct constexpr doesn't work.
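One hedged guess at the mechanism (not confirmed in this thread): if the atomic helper takes its operand by (const) reference, the namespace-scope constexpr variable is ODR-used, so a definition must be available where the call is compiled, and the host-side definition may not be visible to the device compiler. Copying the value into a function-local constexpr means only the local copy is ODR-used. A plain C++ sketch of that distinction:

```cpp
#include <cstdint>

constexpr uint32_t MaxNumModules = 2000u;

// Binding a reference to a variable ODR-uses it: a definition of the referent
// must exist where the reference is bound.
inline uint32_t takes_by_ref(const uint32_t& v) { return v; }

inline uint32_t demo() {
  // The local copy is initialized from MaxNumModules as a compile-time
  // constant (just reading the value is not an ODR-use); only kMax is
  // ODR-used by the call below.
  constexpr auto kMax = MaxNumModules;
  return takes_by_ref(kMax);
}
```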

ghugo83 (Contributor, Author) commented:

constexpr auto local = kGlobal;
Yes, basically to avoid having this code duplication in many kernels.
It should be defined in CUDACore?
The main interest here, versus just calling std::decay_t<T> directly, is that the function is constexpr, right?

Contributor commented:

It's just that I couldn't get std::decay_t(kGlobal) to work; it seems to require the type explicitly.

ghugo83 (Contributor, Author) commented:

Ha, OK. Yes, I had tried it out before posting my comment, and indeed std::decay_t<uint32_t>(MaxNumModules) seemed to be the only way to go.

Collaborator commented:

Is the trick that by_value() effectively returns a copy of the argument, and the temporary (to which the const reference is bound) is kept alive until the function returns?

It should be defined in CUDACore?

I would place it in AlpakaCore (in retrospect CMSUnrollLoop.h would be better placed there as well). Ideally users should not need to include any platform-specific headers.

Contributor commented:

Is the trick that by_value() effectively returns a copy of the argument, and the temporary (to which the const reference is bound) is kept alive until the function returns?

I guess. And then I hope the compiler elides the whole "making a copy of the argument" business.
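For reference, a minimal sketch of what such a by_value() helper might look like (the actual implementation in the PR is not reproduced here):

```cpp
#include <type_traits>
#include <utility>

// Returns a prvalue copy of its argument, decayed to the underlying value
// type. A callee taking `const T&` then binds to a temporary that lives until
// the end of the full expression, and the copy is typically elided.
template <typename T>
constexpr std::decay_t<T> by_value(T&& value) {
  return std::forward<T>(value);
}
```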

makortel merged commit 171f24d into cms-patatrack:master on Apr 1, 2021
fwyzard (Contributor) commented Apr 4, 2021

@ghugo83 a suggestion for the future: please clean up the commit history before a PR is merged (this applies to both #167 and #172):

  • at least resolve commits that are added and then reverted;
  • if possible, avoid merges inside the PRs;
  • messages like "small corrections" are not very helpful; if they are small corrections, it is better to squash them into the commit that introduced the original code;
  • I think Matti prefers to have commit messages prefixed by [alpaka] to identify which program they apply to;
  • in general, group and squash commits that are not supposed to work independently.

An alternative is to simply squash all commits into a single one when making or merging a PR.

Note: the commit history in CMSSW is a horrible mess, and should not be used as a guideline...

fwyzard (Contributor) commented Apr 4, 2021

Am I correct that this PR does not enable running the pixel clusterizer in the alpaka binary?

ghugo83 (Contributor, Author) commented Apr 5, 2021

An alternative is to simply squash all commits into a single one when making or merging a PR.

Yes, OK, I will do that for the next PRs.

Am I correct that this PR does not enable running the pixel clusterizer in the alpaka binary?

Yes: in this PR it is exercised via the clusterizer test but not through the alpaka binary; I will submit the next developments in a further PR.
