[alpaka] Port Clusterizer to Alpaka #172
Conversation
…s in idx start values, and block-local versus grid-local indices.
… Address this issue in a relatively clean way, with addition of helper function. Now have perf identical between legacy CUDA version and Alpaka CUDA.
…y and be able to compare perf.
…at run-time (removing -Wno-vla), but no choice to be able to do same as legacy.
I think the performance issues mentioned yesterday can be treated separately.
I need to investigate more on alpaka::getWorkDiv. If it confirms itself costly, it is straightforward to factor it out of the helper functions.
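For illustration, a minimal sketch of what factoring the work-division query out of the per-element helpers could look like. The helper name and parameters below are assumptions, not the actual AlpakaCore interface, and the exact getWorkDiv spelling depends on the alpaka version in use:

```cpp
#include <algorithm>
#include <cstdint>
#include <alpaka/alpaka.hpp>

// Hypothetical helper: it receives the number of elements per thread as a
// plain argument instead of calling alpaka::workdiv::getWorkDiv itself.
template <typename TAcc, typename Func>
ALPAKA_FN_ACC void for_each_element(const TAcc& acc,
                                    uint32_t elementsPerThread,
                                    uint32_t numElements,
                                    Func func) {
  const uint32_t threadIdx = alpaka::idx::getIdx<alpaka::Grid, alpaka::Threads>(acc)[0u];
  const uint32_t firstElementIdx = threadIdx * elementsPerThread;
  const uint32_t endElementIdx = std::min(firstElementIdx + elementsPerThread, numElements);
  for (uint32_t i = firstElementIdx; i < endElementIdx; ++i) {
    func(i);
  }
}

// Inside a kernel, the work division would then be queried only once and the
// result reused for every helper call, e.g.:
//
//   const uint32_t elementsPerThread =
//       alpaka::workdiv::getWorkDiv<alpaka::Thread, alpaka::Elems>(acc)[0u];
//   for_each_element(acc, elementsPerThread, numElements, [&](uint32_t i) { /* ... */ });
```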
for (uint32_t i = threadIdx; i < std::min(endElementIdx, maxNumberOfElements); ++i) {
if (endElementIdx > maxNumberOfElements) {
  endElementIdx = maxNumberOfElements;
}
I would have kept endElementIdx = std::min(endElementIdx, maxNumberOfElements), but it probably doesn't make much difference.
uint32_t firstElementIdx = firstElementIdxNoStride[0u];
uint32_t endElementIdx = endElementIdxNoStride[0u];
for (uint32_t i = firstElementIdx; i < numElements; ++i) {
  if (!cms::alpakatools::get_next_element_1D_index_stride(
Is the main reason for this function (instead of for_each_element_...) the fact that the loop body uses break?
Yes, the issue here was that two loops are introduced with Alpaka: one for the strided access, and an inner one for the CPU elements. Whether using a helper function and a lambda, or doing the looping directly at the call site, one faces this two-loop situation.
This was an issue when the legacy code had break statements, since obviously only the innermost loop was exited. A trivial workaround, which I did first, was of course to use a boolean flag to also exit the outer loop when the inner one breaks.
But those gymnastics with the boolean had to be repeated at every call site with such a block, hence code duplication.
This helper function provides the same logic as the two loops, but with only one loop and no extra storage, which makes porting the legacy code easier: nothing has to be changed regarding break statements at the call site, and they can be kept 1:1.
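For context, a minimal self-contained sketch of the single-loop pattern described above. The helper below is a simplified stand-in for cms::alpakatools::get_next_element_1D_index_stride; the real signature and argument order may differ:

```cpp
#include <cstdint>

// Simplified stand-in for the real helper: advances `i` within the current
// per-thread element range [firstIdx, endIdx) and, once that range is
// exhausted, jumps both bounds forward by the grid stride.
// Returning false ends the iteration.
inline bool next_element_1D_index_stride(uint32_t& i,
                                         uint32_t& firstIdx,
                                         uint32_t& endIdx,
                                         uint32_t stride,
                                         uint32_t numElements) {
  if (i < endIdx)
    return true;  // still inside the current contiguous range of elements
  // Current range exhausted: move on to the next strided range.
  firstIdx += stride;
  endIdx += stride;
  i = firstIdx;
  return i < numElements;
}

void processAllElements(uint32_t firstIdx, uint32_t endIdx, uint32_t stride, uint32_t numElements) {
  // Single loop over both the strided ranges and the elements within a range:
  // a `break` in the body exits the whole iteration, exactly as in the legacy
  // one-loop CUDA kernel, with no extra boolean flag.
  for (uint32_t i = firstIdx; i < numElements; ++i) {
    if (!next_element_1D_index_stride(i, firstIdx, endIdx, stride, numElements))
      break;
    // ... legacy loop body (possibly containing `break`) kept 1:1 ...
  }
}
```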
alpaka::block::sync::syncBlockThreads(acc);

// renumber
auto&& ws = alpaka::block::shared::st::allocVar<uint16_t[32], __COUNTER__>(acc);
The 32 here is related to the warp size, right? We should probably look later into how to make that dependent on acc.
Yes totally.
namespace gpuClustering {

#ifdef GPU_DEBUG
  ALPAKA_STATIC_ACC_MEM_GLOBAL uint32_t gMaxHit = 0;
Does ALPAKA_STATIC_ACC_MEM_GLOBAL depend on the accelerator type? If yes, I suppose this should get different symbols for each accelerator (can be addressed later since it's "only" for debugging).
Thanks. I think it should be moved into ALPAKA_ACCELERATOR_NAMESPACE, because otherwise the same symbol would be defined e.g. for the serial and TBB backends (leading to an ODR violation) when we (later) try to link all of them together. But as I said before, this is not important for now since it is only needed for debugging; perhaps a comment here would help to "remember" it later?
Yes makes sense, ok will add that
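For reference, a minimal sketch (not the actual change in this PR) of what wrapping the debug symbol in the accelerator-specific namespace could look like, so that each backend gets its own distinct symbol and linking several backends together does not violate the ODR:

```cpp
#ifdef GPU_DEBUG
namespace ALPAKA_ACCELERATOR_NAMESPACE {
  namespace gpuClustering {
    // One gMaxHit per backend (serial, TBB, CUDA, ...), so the symbol name
    // differs between the per-backend translation units.
    ALPAKA_STATIC_ACC_MEM_GLOBAL uint32_t gMaxHit = 0;
  }  // namespace gpuClustering
}  // namespace ALPAKA_ACCELERATOR_NAMESPACE
#endif  // GPU_DEBUG
```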
static_assert(MaxNumModules == 2000u,
              "MaxNumModules not copied to device code to preserve same interface. Hardcoded value "
              "assuming MaxNumModules == 2000.");
auto loc = alpaka::atomic::atomicOp<alpaka::atomic::op::Inc>(acc, moduleStart, 2000u);
Why did MaxNumModules not work directly?
Yes, I was getting error: identifier "gpuClustering::MaxNumModules" is undefined in device code.
But since it is constexpr it should have worked.
I recall now a similar issue at some point with Kokkos; we used a trick along the lines of:
constexpr auto MAX = MaxNumModules;
auto loc = alpaka::atomic::atomicOp<alpaka::atomic::op::Inc>(acc, moduleStart, MAX);
Ha, yes, good idea, I will do that.
But this is frustrating: since it is constexpr it should have worked, no?
I fully agree it is annoying, but I have no clue why the direct constexpr doesn't work.
constexpr auto local = kGlobal;
Yes, basically to avoid having this code duplication in many kernels.
Should it be defined in CUDACore?
The main interest here, versus just calling std::decay_t<T> directly, is that the function is constexpr, right?
It's just that I couldn't get std::decay_t(kGlobal) to work; it seems to require the type explicitly.
Ha ok, yes, I had tried it out before posting my comment, and indeed std::decay_t<uint32_t>(MaxNumModules) seemed to be the only way to go.
Is the trick that by_value() effectively returns a copy of the argument, and the temporary (to which the const reference is bound) is kept alive until the function returns?
Should it be defined in CUDACore?
I would place it in AlpakaCore (in retrospect, CMSUnrollLoop.h would be better placed there as well). Ideally users should not need to include any platform-specific headers.
Is the trick that by_value() effectively returns a copy of the argument, and the temporary (to which the const reference is bound) is kept alive until the function returns?
I guess. And then I hope the compiler elides the whole "making a copy of the argument" business.
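To make the discussion concrete, a minimal sketch of a by_value()-style helper as it is understood here. This is an assumed implementation, not necessarily the one added to AlpakaCore, and in real device code the helper would also need the appropriate host/device annotations:

```cpp
#include <cstdint>
#include <type_traits>
#include <utility>

// Returns a decayed copy of its argument. Passing a namespace-scope constexpr
// constant through this helper hands the callee a temporary value, so the
// constant itself is never ODR-used from device code (which is what triggers
// the "identifier ... is undefined in device code" error).
template <typename T>
constexpr std::decay_t<T> by_value(T&& value) {
  return std::forward<T>(value);
}

namespace gpuClustering {
  constexpr uint32_t MaxNumModules = 2000;
}

// A callee that, like alpaka::atomic::atomicOp is assumed to here, takes its
// operand by const reference.
uint32_t addOne(const uint32_t& value) { return value + 1; }

int main() {
  // The temporary returned by by_value() lives until the end of the full
  // expression, so the const reference inside addOne() stays valid for the
  // duration of the call.
  return addOne(by_value(gpuClustering::MaxNumModules)) == 2001u ? 0 : 1;
}
```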
@ghugo83 a suggestion for the future: please clean up the commit history before a PR is merged (applies to both #167 and #172).
An alternative is to simply squash all commits into a single one when making or merging a PR. Note: the commit history in CMSSW is a horrible mess, and should not be used as a guideline...
Am I correct that this PR does not enable running the pixel clusterizer in the alpaka binary?
Yes, ok, I will do that for the next PRs.
Yes, in this PR it is enabled via the clusterizer test but not through the alpaka binary; I will submit the next developments in a further PR.
The actual diff introducing the clusterizer can be seen at: https://github.com/ghugo83/pixeltrack-standalone/compare/clusterizer_none...ghugo83:clusterizer?expand=1
Port clusterizer to Alpaka (plugin-SiPixelClusterizer), and its test.
Compiles and runs smoothly, results cross-checked.
Performance:
NB: This clusterizer branch is based on top of the branch from earlier this week, which introduced general helper functions and adjustments in prefixScan: https://github.com/cms-patatrack/pixeltrack-standalone/pull/167/files#diff-b9eb60024878e85d22ffd1316bd2c4af1e85dd2ae6de9069e8fa2e01d8935f71