Adjustments in AlpakaCore/prefixScan and its test + Add helper functions to handle workdiv #167
Conversation
…ble to perform comparisons. Also keep the same workload between the SERIAL, TBB and CUDA versions, to be able to do perf comparisons (the present version has serial runs faster than TBB, itself faster than CUDA runs, for this reason).
…e __restrict__ in warpPrefixScan and blockPrefixScan, as they are called with blockPrefixScan(acc, psum, psum, numBlocks, ws), in MultiBlock second step, hence where ci = co = psum.
… work division. + IMPORTANT: memory allocation should not include the sizeof(type), this is already taken into account internally with Alpaka. This results in allocations x4 bigger than needed.
…els. Strong effect on perf here.
Following the merge of #165, I rebased this PR on top of it to simplify the diff here.
I see the CUDA […] @fwyzard @VinInn Should we remove the […]?
@ghugo83 Could you clarify? Is this only about warnings when compiling with gcc? If yes, how about copying the solution from cms-sw/cmssw#32499 here?
Ha yes, sorry, I was at a meeting. I could have just merged the latest master into my local branch and pushed again.
Yes, the `#pragma unroll` is ignored.
…sw#32499 (supported compilers: CUDA nvcc, gcc, clang).
…d access + to directly handle element AND THREAD indices in strided access.
…Use the new helper functions for strided access in all call sites.
…itional helper function extended to 2D case yet.
…atest call sites.
It was time to add helper functions to cleanly handle the workDiv, as suggested by @makortel.
I think this is a good step forward. I was first a bit confused by all the `for_each_element_*` variants, but I don't have any concrete suggestions now. My feeling is that this complexity is partially driven by the existing code (given that we were able to express everything in terms of Kokkos, whose interface in this respect is more "restricted").
```cpp
ALPAKA_FN_ACC void for_each_element_in_thread_1D_index_in_block(const T_Acc& acc,
                                                                const uint32_t maxNumberOfElements,
                                                                const uint32_t elementIdxShift,
                                                                const Func& func) {
```
I was a little bit surprised that `const Func&` works also for mutating lambdas. I guess it works because the lambdas here capture by reference, and the language allows (too easily) a const pointer to a non-const object.
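As a minimal standalone sketch of that point (not code from this PR or from alpaka): the closure object may be accessed through a const reference, yet a capture-by-reference still lets the call write through to the captured data.

```cpp
#include <cstdint>
#include <cstdio>

// A helper taking the functor by const reference, like the signature above.
template <typename Func>
void call_n_times(const Func& func, uint32_t n) {
  for (uint32_t i = 0; i < n; ++i) {
    func(i);  // OK: a non-mutable lambda's operator() is const, and the
              // reference capture still allows writing to 'ws'.
  }
}

int main() {
  uint32_t ws[32];
  call_n_times([&](uint32_t i) { ws[i] = 0; }, 32);  // mutates ws despite const Func&
  std::printf("ws[0] = %u\n", static_cast<unsigned>(ws[0]));
  return 0;
}
```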
Yes, I will change it to pass by value; it's not as if the function object were heavy anyway.
Or a universal reference, `Func&& func`? I suppose that would be most agnostic to the details (in principle the function could receive a lambda/functor with heavy-to-copy member data).
```cpp
cms::alpakatools::for_each_element_in_thread_1D_index_in_block(acc, 32, [&](uint32_t i) {
  ws[i] = 0;  // used by prefix scan...
});
```
The old code did not have a loop over elements. I'm wondering which one is more correct.
Both are fine:
The extra loop in Alpaka is due to the loop over elements (obviously not interesting for the GPU backend, but needs to be there for the sake of generality).
https://github.com/ghugo83/pixeltrack-standalone/blob/prefixscan_fix/src/alpaka/AlpakaCore/alpakaWorkDivHelper.h#L163
Whenever the block-local thread index is >= 32, the helper is still called for that thread and the index is still computed and compared with 32 (as in the legacy code). But the loop over elements is never entered (and no `ws[i] = 0` assignment is done), because in that case firstElementIdx[0u] >= endElementIdx[0u].
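For illustration, a hypothetical, simplified CPU-side sketch of that logic (names and structure are illustrative, not the actual alpakaWorkDivHelper code): each thread gets a contiguous range of element indices clamped to the maximum, and a thread whose range is empty never executes the body.

```cpp
#include <algorithm>
#include <cstdint>

template <typename Func>
void for_each_element_sketch(uint32_t threadIdx,          // block-local thread index
                             uint32_t elementsPerThread,  // elements handled per thread
                             uint32_t maxNumberOfElements,
                             uint32_t elementIdxShift,
                             const Func& func) {
  const uint32_t firstElementIdx = threadIdx * elementsPerThread + elementIdxShift;
  const uint32_t endElementIdx =
      std::min(firstElementIdx + elementsPerThread, maxNumberOfElements + elementIdxShift);
  // E.g. with elementsPerThread == 1 and maxNumberOfElements == 32, a thread
  // with threadIdx >= 32 gets firstElementIdx >= endElementIdx, so the loop
  // body (the ws[i] = 0 above) is never executed for it.
  for (uint32_t i = firstElementIdx; i < endElementIdx; ++i) {
    func(i);
  }
}
```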
Thanks @makortel for the review!
```cpp
ALPAKA_FN_ACC void for_each_element_in_thread_1D_index_in_block(const T_Acc& acc,
                                                                const uint32_t maxNumberOfElements,
                                                                const uint32_t elementIdxShift,
                                                                const Func func) {
```
I think this should be either `const Func&` (as it was) or `Func&&`. Being `const` prevents mutating functors (which I'd think should be fine), and taking it by value implies a copy (which is not needed).
Hmm, the copies are elided anyway?
I don't know; in practice, passing the lambda by value clearly leads to better performance than passing by `const&`, or passing by `&&` and then forwarding.
Passing the lambda by `&&` and then forwarding, and inlining these functions, seems to be a good compromise; I'll go for that.
From a quick look (that I should have done earlier), `std` and SYCL seem to take such functors by non-const value, and Kokkos and Alpaka by const reference. Just to double check: did you observe that passing the lambda by value gives better performance than by reference?
If I've understood correctly, calling `std::forward()` many times on the same object is not strictly speaking safe (it has the same issue as `std::move()`), even if it may work in practice in many cases.
Sorry for going back and forth: how about going back to by-value (but without `const`), or keeping the universal reference and removing the `std::forward`?
Re-looking at this, I do not see a clear difference in perf.

- By value still makes the assumption that the function is light, which should be the case most of the time, but is not that nice.
- Func&& without forwarding loses a bit of its interest, as the parameter (even a temporary) would be bound to an lvalue, no?

Bah, maybe this is too much hair-pulling.

> I think this should be either const Func& (as it was) or Func&&

I think I could revert to the initial implementation with `const&`?
Also, as this does not change the helper functions' interface anyway, maybe we could accept this, move on, and re-evaluate later, when the full track reconstruction has been ported, to have a better metric on the potential impact on perf?
> By value still makes the assumption that the function is light, which should be the case most of the time, but is not that nice.

I agree, although that is the precedent that `std` sets (not that we should always agree with `std`).

> Func&& without forwarding loses a bit of its interest, as the parameter (even a temporary) would be bound to an lvalue, no?

Within the function body the `func` of `Func&& func` would indeed be an lvalue. But the function argument itself would bind to anything, without requiring `func` to be const.

> Bah, maybe this is too much hair-pulling.
> I think this should be either const Func& (as it was) or Func&&
> I think I could revert to the initial implementation with `const&`?
> Also, as this does not change the helper functions' interface anyway, maybe we could accept this, move on, and re-evaluate later, when the full track reconstruction has been ported, to have a better metric on the potential impact on perf?

I agree, we should move on. I'm going to run tests for this and merge then (although the initial implementation was/is `const Func` instead of `const Func&`).
In terms of the interface, I have realized that the shiftIndex = 0 overload can be a bit confusing at the call site, hence I could remove that overload and only keep the one with an explicit shift (also placing the index shift argument before the maxNumberOfElements argument). But I have already done that on the fly in the latest dev branch; it does not necessarily need to be addressed here.
This PR contains a number of fixes in the AlpakaCore version of prefixScan and its test.
The diff should appear clean after #165 is merged. In the meantime, one can also simply look at the diffs of src/alpaka/AlpakaCore/prefixScan.h and src/alpaka/test/alpaka/prefixScan_t.cc.
Minor corrections:

- Remove the unsupported use of `#pragma unroll`.
- `__restrict__` cannot be used in `warpPrefixScan` and `blockPrefixScan`, as they are called with `blockPrefixScan(acc, psum, psum, numBlocks, ws)` in the MultiBlock second step, hence with `ci = co = psum` (see the aliasing sketch at the end of this description).
- Remove unused variables.
- Handle the element dependency in `testPrefixScan` for the serial and TBB cases.
- Take the newly added back-ends into account in the workDiv computation.
- Make the workloads identical for the SERIAL, TBB, and CUDA cases. This allows the performance of the 3 cases to be compared cleanly (after removing all `printf` and `assert` calls and choosing a meaningful workDiv).

Important fixes:
- Memory allocation sizes should not include the `sizeof(type)`. Only the number of elements is needed with Alpaka, as the `* sizeof(type)` is already applied internally. This resulted in 4x more memory being allocated than needed.
- `alpaka::wait::wait(queue);` should not be called in between kernels. This had an important impact on perf (but this is just for prototyping purposes anyway).
- Use dynamic shared memory for `psum` in the secondStepMultiBlock step, as is done in CUDACore, instead of global memory (see the CUDA-style sketch at the end of this description). Additionally, the number of elements in the `psum` array was set to `num_items`, while only `numBlocks` elements are needed; this is also addressed. With Alpaka, the dynamic shared memory size is handled with a `BlockSharedMemDynSizeBytes` struct, templated on the relevant kernel struct.
- Finally, the new interface has the additional advantage of being closer to what is used in CUDACore. This point is included in #165 (Add AlpakaCore/HistoContainer.h + HistoContainer_t, OneHistoContainer_t and OneToManyAssoc_t tests).
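As referenced from the `__restrict__` bullet above, a standalone illustration (not the actual prefix-scan code) of why `__restrict__` had to go: it promises the compiler that the pointers do not alias, so a call with `ci == co`, as in `blockPrefixScan(acc, psum, psum, numBlocks, ws)`, breaks that promise.

```cpp
#include <cstdint>
#include <cstdio>

// With __restrict__, the compiler may assume reads from ci[] are unaffected by
// the writes to co[], and reorder or vectorize accordingly.
void inclusive_scan_no_alias(const uint32_t* __restrict__ ci, uint32_t* __restrict__ co, uint32_t n) {
  uint32_t running = 0;
  for (uint32_t i = 0; i < n; ++i) {
    running += ci[i];
    co[i] = running;
  }
}

int main() {
  uint32_t in[4] = {1, 2, 3, 4};
  uint32_t out[4];
  inclusive_scan_no_alias(in, out, 4);  // fine: distinct buffers
  // inclusive_scan_no_alias(in, in, 4);  // undefined behaviour: ci and co alias
  for (uint32_t i = 0; i < 4; ++i) {
    std::printf("%u ", static_cast<unsigned>(out[i]));
  }
  std::printf("\n");
  return 0;
}
```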
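And, as referenced from the dynamic shared memory bullet, a CUDA-style sketch of the pattern being mirrored (kernel and variable names are illustrative, not the actual CUDACore/AlpakaCore code): the per-block sums live in dynamic shared memory sized to `numBlocks` elements, not in a `num_items`-sized global buffer.

```cuda
#include <cstdint>

__global__ void secondStepSketch(const uint32_t* blockSums, uint32_t* output, uint32_t numBlocks) {
  extern __shared__ uint32_t psum[];  // dynamic shared memory, numBlocks elements
  for (uint32_t i = threadIdx.x; i < numBlocks; i += blockDim.x) {
    psum[i] = blockSums[i];
  }
  __syncthreads();
  // ... scan psum[] and add the per-block offsets to output[] ...
}

// Launch: the third launch parameter is the dynamic shared memory size in bytes
// (with Alpaka, the equivalent size is provided via a BlockSharedMemDynSizeBytes
// specialization for the kernel struct):
//
//   secondStepSketch<<<1, 1024, numBlocks * sizeof(uint32_t)>>>(d_blockSums, d_output, numBlocks);
```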