Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adjustments in AlpakaCore/prefixScan and its test + Add helper functions to handle workdiv #167

Merged
merged 23 commits into from
Mar 30, 2021

Conversation

ghugo83
Copy link
Contributor

@ghugo83 ghugo83 commented Feb 1, 2021

This contains a number of fixes in the prefixScan AlpakaCore version and its test.

The diff should appear clean after #165 is merged. In the meantime, can also simply just click at the diff of src/alpaka/AlpakaCore/prefixScan.h and src/alpaka/test/alpaka/prefixScan_t.cc.

Minor corrections:

  • Remove unsupported use of #pragma unroll.

  • Cannot use __restrict__ in warpPrefixScan and blockPrefixScan, as they are called with blockPrefixScan(acc, psum, psum, numBlocks, ws), in MultiBlock second step, hence with ci = co = psum.

  • Remove unused variables.

  • Handle element-dependency in testPrefixScan for the serial and TBB cases.

  • Take into account the newly added back-ends in the workDiv computation.

  • Make workloads identical for SERIAL, TBB, and CUDA cases.
    This allows to cleanly compare the performances between the 3 cases (after obviously removing all printf and assert, and choosing meaningful workDiv).

Important fixes:

  • Memory allocations sizes should not contain the sizeof(type). Only the number of elements is needed with Alpaka, as there is already a * sizeof(type) done internally. This resulted in x4 too much memory allocated.

  • Should not call alpaka::wait::wait(queue); in between kernels. This had an important impact on perf (but ok this is just for prototyping purposes anyway).

  • Use dynamic shared memory allocation in the secondStepMultiBlock block for psum, as is done in CUDACore, instead of global memory.
    Additionally, the number of elements in the psum array was set to num_items, while only numBlocks elements was needed. This is also addressed (dynamic shared memory size is handled with Alpaka with a BlockSharedMemDynSizeBytes struct, templated on the relevant kernel struct).
    Finally, the new interface has the additional advantage of being closer to what is used in CUDACore.
    This point is included in Add AlpakaCore/HistoContainer.h + HistoContainer_t, OneHistoContainer_t and OneToManyAssoc_t tests #165 .

…ble to perform comparisons. Also keep same workload between SERIAL, TBB and CUDA versions, to be able to do perf comparisons (present version has serial runs faster than tbb, itself faster than CUDA runs, because of this reason.
…e __restrict__ in warpPrefixScan and blockPrefixScan, as they are called with blockPrefixScan(acc, psum, psum, numBlocks, ws), in MultiBlock second step, hence where ci = co = psum.
… work division. + IMPORTANT: memory allocation should not include the sizeof(type), this is already taken into account internally with Alpaka. This results in allocations x4 bigger than needed.
@makortel
Copy link
Collaborator

makortel commented Feb 2, 2021

Following the merge of #165 I rebased this PR on top of that to simplify the diff here.

@makortel
Copy link
Collaborator

makortel commented Feb 2, 2021

  • Cannot use __restrict__ in warpPrefixScan and blockPrefixScan, as they are called with blockPrefixScan(acc, psum, psum, numBlocks, ws), in MultiBlock second step, hence with ci = co = psum.

I see the CUDA prefixScan.h in CMSSW (still) has the __restrict__
https://github.com/cms-sw/cmssw/blob/bf052a322636c2fcc229a048da44f95977bc3845/HeterogeneousCore/CUDAUtilities/interface/prefixScan.h#L13
(abviously the CUDA version here does the same).

@fwyzard @VinInn Should we remove the __restrict__ from warpPrefixScan in CMSSW as well?

@makortel makortel added the alpaka label Feb 2, 2021
@makortel
Copy link
Collaborator

makortel commented Feb 2, 2021

  • Remove unsupported use of #pragma unroll.

@ghugo83 Could you clarify? Is this only about warnings when compiling with gcc? If yes, how about copying the solution from cms-sw/cmssw#32499 here?

@ghugo83
Copy link
Contributor Author

ghugo83 commented Feb 2, 2021

Following the merge of #165 I rebased this PR on top of that to simplify the diff here.

Ha yes sorry, was at a meeting. I could have just merged the latest master in my local branch and pushed again.

@ghugo83
Copy link
Contributor Author

ghugo83 commented Feb 3, 2021

@ghugo83 Could you clarify? Is this only about warnings when compiling with gcc? If yes, how about copying the solution from cms-sw/cmssw#32499 here?

Yes the #pragma unroll is ignored.
Didn't see cms-sw/cmssw#32499, just copied it here.

@ghugo83 ghugo83 changed the title Adjustments in AlpakaCore/prefixScan and its test Adjustments in AlpakaCore/prefixScan and its test + Add helper functions to handle workdiv Feb 15, 2021
@ghugo83
Copy link
Contributor Author

ghugo83 commented Feb 15, 2021

This was time to add helper functions to cleanly handle workDiv, as suggested by @makortel .
Ideally, it would have been better to add the helper functions in a separate PR. But there were points here which actually needed to be fixed by adding new helper functions anyway.

Copy link
Collaborator

@makortel makortel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a good step forward. I was first a bit confused by all the for_each_element_* variants, but I don't have any concrete suggestions now. My feeling is that this complexity is partially driven by the existing code (given that we were able to express everything in terms of Kokkos whose interface in this respect is more "restricted").

ALPAKA_FN_ACC void for_each_element_in_thread_1D_index_in_block(const T_Acc& acc,
const uint32_t maxNumberOfElements,
const uint32_t elementIdxShift,
const Func& func) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was a little bit surprised that const Func& works also for mutating lambdas. I guess it works because the lambdas here capture by reference, and the language allows (too easly) a const pointer to non-const object.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes will change it to just a pass by value, not as if the function object was heavy anyway

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or universal reference Func&& func? I suppose that would be most agnostic to the details (in principle the function could get a lambda/functor with heavy-to-copy member data).

}
cms::alpakatools::for_each_element_in_thread_1D_index_in_block(acc, 32, [&](uint32_t i) {
ws[i] = 0; // used by prefix scan...
});
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The old code did not have a loop over elements. I'm wondering which one is more correct.

Copy link
Contributor Author

@ghugo83 ghugo83 Mar 26, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both are fine:
The extra loop in Alpaka is due to the loop over elements (obviously not interesting for the GPU backend, but needs to be there for the sake of generality).
https://github.com/ghugo83/pixeltrack-standalone/blob/prefixscan_fix/src/alpaka/AlpakaCore/alpakaWorkDivHelper.h#L163
Whenever the thread index (block-local) is >= 32, the thread index is still called and compared with 32 (like in legacy). But the loop over elements is never entered (and no ws[i]=0 allocation is done), because in that case, firstElementIdx[0u] >= endElementIdx[0u]

@ghugo83
Copy link
Contributor Author

ghugo83 commented Mar 26, 2021

Thanks @makortel for the review!
Anything else?

ALPAKA_FN_ACC void for_each_element_in_thread_1D_index_in_block(const T_Acc& acc,
const uint32_t maxNumberOfElements,
const uint32_t elementIdxShift,
const Func func) {
Copy link
Collaborator

@makortel makortel Mar 27, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be either const Func& (as it was) or Func&&. It being const prevents mutating functors (which I'd think to be fine), and taking it by value implies a copy (which is not needed).

Copy link
Contributor Author

@ghugo83 ghugo83 Mar 29, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm the copies are eluded anyway?
Don't know, in practice, passing the lambda by value clearly leads to better performance than passing by const &, or passing by && and then forwarding.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Passing the lambda by && and then forwarding, and inlining these functions, seems to be a good compromise, will go for that.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By quick look (that I should have done earlier), std and SYCL seem to take such functors by "non-const value", and Kokkos and Alpaka by "const reference". Just do double check, did you observe that passing the lambda by value gives better performance than reference?

If I've understood correctly, calling std::forward() many times on the same object is not strictly speaking safe (has the same issue as std::move()), even if it may work in practice in many cases.

Sorry for going back-and-forth, how about going back to the by-value (but without const), or keep the universal reference and remove the std::forward?

Copy link
Contributor Author

@ghugo83 ghugo83 Mar 29, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-looking better at this, I do not see clear diff in perf.

  • By value still makes the assumption that the function is light, which should be the case most of the time, but is not that nice.

  • Func&& without forwarding loses a bit of interest, as the parameter (even a temporary) would be bound to an lvalue, no?

Bah maybe this is too much hair-pulling

I think this should be either const Func& (as it was) or Func&&

I think I could revert to the initial implementation with const& ?

Also, as this does not change the helper functions interface anyway, maybe we could accept this, move on, and re-evaluate later when the full track reconstruction has been ported to have a better metric on the potential impact on perf?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • By value still makes the assumption that the function is light, which should be the case most of the time, but is not that nice.

I agree, although that is the precedent that std sets (not that we should always agree with std).

  • Func&& without forwarding loses a bit of interest, as the parameter (even a temporary) would be bound to an lvalue, no?

Within the function body the func of Func&& func would indeed be an lvalue. But the function argument itself would bind to anything without requiring the the func to be const.

Bah maybe this is too much hair-pulling

I think this should be either const Func& (as it was) or Func&&

I think I could revert to the initial implementation with const& ?

Also, as this does not change the helper functions interface anyway, maybe we could accept this, move on, and re-evaluate later when the full track reconstruction has been ported to have a better metric on the potential impact on perf?

I agree, we should move on. I'm going to run tests for this and merge then. (although the initial implementation was/is const Func instead of const Func&)

@ghugo83
Copy link
Contributor Author

ghugo83 commented Mar 29, 2021

In terms of interface, I have realized that the shiftIndex = 0 overload:

ALPAKA_FN_ACC  void for_each_element_in_thread_1D_index_in_block(const T_Acc& acc,
                                                       const uint32_t maxNumberOfElements,
                                                       const Func& func) {

can be a bit confusing on call site, hence I could remove that overload and only use:

ALPAKA_FN_ACC  void for_each_element_in_thread_1D_index_in_block(const T_Acc& acc,
                                                                    const uint32_t elementIdxShift,
                                                                    const uint32_t maxNumberOfElements,
                                                                    const Func& func) {

(also placing the index shift argument before the maxNumberOfElements argument)

But I have that done on the fly in the latest dev branch, does not compulsory needs to be addressed here

@makortel makortel merged commit c2bcd58 into cms-patatrack:master Mar 30, 2021
@ghugo83
Copy link
Contributor Author

ghugo83 commented Mar 30, 2021

Thanks @makortel
Also this one pending: #172 , with diff with present master properly updated

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants