Adjustments in AlpakaCore/prefixScan and its test + Add helper functions to handle workdiv #167
Conversation
…ble to perform comparisons. Also keep the same workload between the SERIAL, TBB and CUDA versions, to be able to do perf comparisons (the present version has serial runs faster than TBB, itself faster than CUDA runs, for this reason).
…e __restrict__ in warpPrefixScan and blockPrefixScan, as they are called with blockPrefixScan(acc, psum, psum, numBlocks, ws), in MultiBlock second step, hence where ci = co = psum.
… work division. + IMPORTANT: memory allocation should not include the sizeof(type), this is already taken into account internally with Alpaka. This results in allocations x4 bigger than needed.
…els. Strong effect on perf here.
Following the merge of #165, I rebased this PR on top of it to simplify the diff here.
I see the CUDA […] @fwyzard @VinInn Should we remove the […]?
@ghugo83 Could you clarify? Is this only about warnings when compiling with gcc? If yes, how about copying the solution from cms-sw/cmssw#32499 here?
Ha yes, sorry, I was at a meeting. I could have just merged the latest master into my local branch and pushed again.
Yes, the `#pragma unroll` is ignored.
…sw#32499 (supported compilers: CUDA nvcc, gcc, clang).
…d access + to directly handle element AND THREAD indices in strided access.
…Use the new helper functions for strided access in all call sites.
…itional helper function extended to 2D case yet.
…atest call sites.
It was time to add helper functions to cleanly handle the workDiv, as suggested by @makortel.
I think this is a good step forward. I was first a bit confused by all the `for_each_element_*` variants, but I don't have any concrete suggestions now. My feeling is that this complexity is partially driven by the existing code (given that we were able to express everything in terms of Kokkos, whose interface in this respect is more "restricted").
```cpp
ALPAKA_FN_ACC void for_each_element_in_thread_1D_index_in_block(const T_Acc& acc,
                                                                const uint32_t maxNumberOfElements,
                                                                const uint32_t elementIdxShift,
                                                                const Func& func) {
```
I was a little bit surprised that `const Func&` works also for mutating lambdas. I guess it works because the lambdas here capture by reference, and the language allows (too easily) a const pointer to a non-const object.
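As a minimal standalone sketch of that point (not code from this PR or from alpaka): the closure object may be accessed through a const reference, yet a capture-by-reference still lets the call write through to the captured data.

```cpp
#include <cstdint>
#include <cstdio>

// A helper taking the functor by const reference, like the signature above.
template <typename Func>
void call_n_times(const Func& func, uint32_t n) {
  for (uint32_t i = 0; i < n; ++i) {
    func(i);  // OK: a non-mutable lambda's operator() is const, and the
              // reference capture still allows writing to 'ws'.
  }
}

int main() {
  uint32_t ws[32];
  call_n_times([&](uint32_t i) { ws[i] = 0; }, 32);  // mutates ws despite const Func&
  std::printf("ws[0] = %u\n", static_cast<unsigned>(ws[0]));
  return 0;
}
```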
Yes, I will change it to pass by value; it's not as if the function object were heavy anyway.
Or a universal reference, `Func&& func`? I suppose that would be most agnostic to the details (in principle the function could receive a lambda/functor with heavy-to-copy member data).
```cpp
cms::alpakatools::for_each_element_in_thread_1D_index_in_block(acc, 32, [&](uint32_t i) {
  ws[i] = 0;  // used by prefix scan...
});
```
The old code did not have a loop over elements. I'm wondering which one is more correct.
Both are fine:
The extra loop in Alpaka is due to the loop over elements (obviously not interesting for the GPU backend, but needs to be there for the sake of generality).
https://github.com/ghugo83/pixeltrack-standalone/blob/prefixscan_fix/src/alpaka/AlpakaCore/alpakaWorkDivHelper.h#L163
Whenever the block-local thread index is >= 32, the helper is still called for that thread and the index is still computed and compared with 32 (as in the legacy code). But the loop over elements is never entered (and no `ws[i] = 0` assignment is done), because in that case firstElementIdx[0u] >= endElementIdx[0u].
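For illustration, a hypothetical, simplified CPU-side sketch of that logic (names and structure are illustrative, not the actual alpakaWorkDivHelper code): each thread gets a contiguous range of element indices clamped to the maximum, and a thread whose range is empty never executes the body.

```cpp
#include <algorithm>
#include <cstdint>

template <typename Func>
void for_each_element_sketch(uint32_t threadIdx,          // block-local thread index
                             uint32_t elementsPerThread,  // elements handled per thread
                             uint32_t maxNumberOfElements,
                             uint32_t elementIdxShift,
                             const Func& func) {
  const uint32_t firstElementIdx = threadIdx * elementsPerThread + elementIdxShift;
  const uint32_t endElementIdx =
      std::min(firstElementIdx + elementsPerThread, maxNumberOfElements + elementIdxShift);
  // E.g. with elementsPerThread == 1 and maxNumberOfElements == 32, a thread
  // with threadIdx >= 32 gets firstElementIdx >= endElementIdx, so the loop
  // body (the ws[i] = 0 above) is never executed for it.
  for (uint32_t i = firstElementIdx; i < endElementIdx; ++i) {
    func(i);
  }
}
```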
Thanks @makortel for the review!
```cpp
ALPAKA_FN_ACC void for_each_element_in_thread_1D_index_in_block(const T_Acc& acc,
                                                                const uint32_t maxNumberOfElements,
                                                                const uint32_t elementIdxShift,
                                                                const Func func) {
```
I think this should be either `const Func&` (as it was) or `Func&&`. Being `const` prevents mutating functors (which I'd think should be fine), and taking it by value implies a copy (which is not needed).
Hmm, the copies are elided anyway?
I don't know; in practice, passing the lambda by value clearly leads to better performance than passing by `const&`, or passing by `&&` and then forwarding.
Passing the lambda by `&&` and then forwarding, and inlining these functions, seems to be a good compromise; I'll go for that.
From a quick look (that I should have done earlier), `std` and SYCL seem to take such functors by non-const value, and Kokkos and Alpaka by const reference. Just to double check: did you observe that passing the lambda by value gives better performance than by reference?
If I've understood correctly, calling `std::forward()` many times on the same object is not strictly speaking safe (it has the same issue as `std::move()`), even if it may work in practice in many cases.
Sorry for going back and forth: how about going back to by-value (but without `const`), or keeping the universal reference and removing the `std::forward`?
Re-looking at this, I do not see a clear difference in perf.

- By value still makes the assumption that the function is light, which should be the case most of the time, but is not that nice.
- Func&& without forwarding loses a bit of its interest, as the parameter (even a temporary) would be bound to an lvalue, no?

Bah, maybe this is too much hair-pulling.

> I think this should be either const Func& (as it was) or Func&&

I think I could revert to the initial implementation with `const&`?
Also, as this does not change the helper functions' interface anyway, maybe we could accept this, move on, and re-evaluate later, when the full track reconstruction has been ported, to have a better metric on the potential impact on perf?
> By value still makes the assumption that the function is light, which should be the case most of the time, but is not that nice.

I agree, although that is the precedent that `std` sets (not that we should always agree with `std`).

> Func&& without forwarding loses a bit of its interest, as the parameter (even a temporary) would be bound to an lvalue, no?

Within the function body the `func` of `Func&& func` would indeed be an lvalue. But the function argument itself would bind to anything, without requiring `func` to be const.

> Bah, maybe this is too much hair-pulling.
> I think this should be either const Func& (as it was) or Func&&
> I think I could revert to the initial implementation with `const&`?
> Also, as this does not change the helper functions' interface anyway, maybe we could accept this, move on, and re-evaluate later, when the full track reconstruction has been ported, to have a better metric on the potential impact on perf?

I agree, we should move on. I'm going to run tests for this and merge then (although the initial implementation was/is `const Func` instead of `const Func&`).
In terms of the interface, I have realized that the shiftIndex = 0 overload can be a bit confusing at the call site, hence I could remove that overload and only keep the one with an explicit shift (also placing the index shift argument before the maxNumberOfElements argument). But I have already done that on the fly in the latest dev branch; it does not necessarily need to be addressed here.
This PR contains a number of fixes in the AlpakaCore version of prefixScan and its test.
The diff should appear clean after #165 is merged. In the meantime, one can also simply look at the diffs of src/alpaka/AlpakaCore/prefixScan.h and src/alpaka/test/alpaka/prefixScan_t.cc.
Minor corrections:

- Remove the unsupported use of `#pragma unroll`.
- `__restrict__` cannot be used in `warpPrefixScan` and `blockPrefixScan`, as they are called with `blockPrefixScan(acc, psum, psum, numBlocks, ws)` in the MultiBlock second step, hence with `ci = co = psum` (see the aliasing sketch at the end of this description).
- Remove unused variables.
- Handle the element dependency in `testPrefixScan` for the serial and TBB cases.
- Take the newly added back-ends into account in the workDiv computation.
- Make the workloads identical for the SERIAL, TBB, and CUDA cases. This allows the performance of the 3 cases to be compared cleanly (after removing all `printf` and `assert` calls and choosing a meaningful workDiv).

Important fixes:
- Memory allocation sizes should not include the `sizeof(type)`. Only the number of elements is needed with Alpaka, as the `* sizeof(type)` is already applied internally. This resulted in 4x more memory being allocated than needed.
- `alpaka::wait::wait(queue);` should not be called in between kernels. This had an important impact on perf (but this is just for prototyping purposes anyway).
- Use dynamic shared memory for `psum` in the secondStepMultiBlock step, as is done in CUDACore, instead of global memory (see the CUDA-style sketch at the end of this description). Additionally, the number of elements in the `psum` array was set to `num_items`, while only `numBlocks` elements are needed; this is also addressed. With Alpaka, the dynamic shared memory size is handled with a `BlockSharedMemDynSizeBytes` struct, templated on the relevant kernel struct.
- Finally, the new interface has the additional advantage of being closer to what is used in CUDACore. This point is included in #165 (Add AlpakaCore/HistoContainer.h + HistoContainer_t, OneHistoContainer_t and OneToManyAssoc_t tests).
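As referenced from the `__restrict__` bullet above, a standalone illustration (not the actual prefix-scan code) of why `__restrict__` had to go: it promises the compiler that the pointers do not alias, so a call with `ci == co`, as in `blockPrefixScan(acc, psum, psum, numBlocks, ws)`, breaks that promise.

```cpp
#include <cstdint>
#include <cstdio>

// With __restrict__, the compiler may assume reads from ci[] are unaffected by
// the writes to co[], and reorder or vectorize accordingly.
void inclusive_scan_no_alias(const uint32_t* __restrict__ ci, uint32_t* __restrict__ co, uint32_t n) {
  uint32_t running = 0;
  for (uint32_t i = 0; i < n; ++i) {
    running += ci[i];
    co[i] = running;
  }
}

int main() {
  uint32_t in[4] = {1, 2, 3, 4};
  uint32_t out[4];
  inclusive_scan_no_alias(in, out, 4);  // fine: distinct buffers
  // inclusive_scan_no_alias(in, in, 4);  // undefined behaviour: ci and co alias
  for (uint32_t i = 0; i < 4; ++i) {
    std::printf("%u ", static_cast<unsigned>(out[i]));
  }
  std::printf("\n");
  return 0;
}
```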
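And, as referenced from the dynamic shared memory bullet, a CUDA-style sketch of the pattern being mirrored (kernel and variable names are illustrative, not the actual CUDACore/AlpakaCore code): the per-block sums live in dynamic shared memory sized to `numBlocks` elements, not in a `num_items`-sized global buffer.

```cuda
#include <cstdint>

__global__ void secondStepSketch(const uint32_t* blockSums, uint32_t* output, uint32_t numBlocks) {
  extern __shared__ uint32_t psum[];  // dynamic shared memory, numBlocks elements
  for (uint32_t i = threadIdx.x; i < numBlocks; i += blockDim.x) {
    psum[i] = blockSums[i];
  }
  __syncthreads();
  // ... scan psum[] and add the per-block offsets to output[] ...
}

// Launch: the third launch parameter is the dynamic shared memory size in bytes
// (with Alpaka, the equivalent size is provided via a BlockSharedMemDynSizeBytes
// specialization for the kernel struct):
//
//   secondStepSketch<<<1, 1024, numBlocks * sizeof(uint32_t)>>>(d_blockSums, d_output, numBlocks);
```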