-
Notifications
You must be signed in to change notification settings - Fork 35
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add AlpakaCore/HistoContainer.h + HistoContainer_t, OneHistoContainer_t and OneToManyAssoc_t tests #165
Conversation
…t. Still issue in TBB case.
…alls when queue is passed as an argument (even as a reference)
…->psws also needs to be set
…ated within a loop.
…w all assert pass and histo values are identical as with CUDA (for same input matrix v).
… and OneToManyAssoc_t
#include "AlpakaCore/alpakastdAlgorithm.h" | ||
#include "AlpakaCore/prefixScan.h" | ||
|
||
using namespace ALPAKA_ACCELERATOR_NAMESPACE; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should not do using namespace ...
in headers.
Which reminds me that one using namespace alpaka_common;
slipped through in #160, apparently I even forgot to comment about it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ha yes will remove
assert(ih < int(nh)); | ||
h->count(acc, v[i], ih); | ||
} | ||
endElementIdx += gridDimension; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a thought, I'd find it a bit clearer if this increment would be moved in the outer for-loop statement
for (uint32_t threadIdx = firstElementIdxNoStride[0u]; threadIdx < nt; threadIdx += gridDimension, endElementIdx += gridDimension) {
I was very confused first, partially because I was missing this increment of endElementIdx
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes sure. Anyway will probably need to add an additional helper, as the present one is not adapted for the += gridDimension
stride.
|
||
for (uint32_t i = firstElementIdxGlobal[0u]; i < endElementIdxGlobal[0u]; ++i) { | ||
h->off[i] = 0; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The CUDA implementation uses memset. I suppose we could use alpaka::mem::view::set
, but would that work only with buffers from alpaka::mem::buf::alloc()
(i.e. not with bare pointers)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes at first I had just tried a alpaka::mem::view::set
on the raw pointer (equivalent to the poff
in the CUDA version), but that was not working. Also with the additional complication that we do not want to set h
to null, but only h->off.
Since ideally we want to keep the same interface as for CUDA (passing around a pointer to Histo as argument of fillManyFromVector
), this was the only way to access h->off
: on the device, which in a way, makes sense.
|
||
const int num_items = Histo::totbins(); | ||
|
||
auto psum_dBuf = alpaka::mem::buf::alloc<uint32_t, Idx>(device, Vec1::all(num_items)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not using HistoContainer::psws
like in CUDA implementation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Basically the prefixScan have slightly different interfaces in the CUDA and ALPAKA versions. In the Alpaka version, there is no psws
argument for doing a prefixScan.
I had looked at that specific point, and from what I remember, psws
also exists in the Alpaka implementation of prefixScan
, but as an internal variable (ws
or sth).
The issue is that here we are actually interested in storing the ws
value to Histo (even if I had looked up, and at least in the pixeltrack-standalone, myHisto->psws
seems to never be used eventually).
To preserve the same Histo interface, I have, by safety, decided to set it later on (it actually very simply corresponds to the number of blocks). It is what storePrefixScanWorkingSpace
is doing.
I have compared the Histo off
and psws
data members between CUDA and Alpaka versions and they are identical.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess in ideal it would be better to have the exact same interface for a prefixScan with CUDA and Alpaka, but maybe there are reasons I miss.
In any case was planning to do a review of prefixScan for my understanding:
- There are calls to
alpaka::wait::wait(queue);
in between kernels, which I do not understand and maybe are not needed (we should only need those after a copy from device to host, when interested in what has been copied, or before a host function returns; well, whenever the host needs info from device). - I see multiplications of the number of elements by sizeof(type), while with Alpaka only the number of elements should be specified (the multiplication with sizeof(type) is already done internally). Hence I guess the tests are done with 'too many' elements.
- I would like to check the SERIAL and TBB cases and handling of elements to be more confident about it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the Alpaka version, there is no
psws
argument for doing a prefixScan.
Ok, so the situation is a bit more complicated.
The CUDA version takes pc
, that is used for the "synchronization point" to know which of the threads belong to the "last block". Alpaka version does not need that because that synchronization is handled with splitting the code in two kernels.
The Alpaka version takes psum
(well the ...FirstStep
doesn't seem to really need it), which in CUDA is handled as extern __shared__ T psum[]
(does Alpaka support dynamic size array in shared memory?).
In both cases the actual value of pc
/psum
is irrelevant outside of the kernels, so maybe we could re-purpose HistoContainer::ppsws
for the psum
here? I think we are not interested in the actual content of psws
, it seems to be used only as temporary workspace (and e.g. in Kokkos version HistoContainer
does not have psws
member at all).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In any case was planning to do a review of prefixScan for my understanding:
Sounds like it could be a good idea.
- There are calls to
alpaka::wait::wait(queue);
in between kernels, which I do not understand and maybe are not needed (we should only need those after a copy from device to host, when interested in what has been copied, or before a host function returns; well, whenever the host needs info from device).
I agree, we should not have alpaka::wait::wait()
calls within these helper functions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we are not interested in the actual content of psws, it seems to be used only as temporary workspace
Yes I had only made a grep within the pixeltrack-standalone project, and there the Histogram data member called psws was never used.
Just doing a git grep in CMSSW, it appears it is never used there either.
I tend to think it is easier to just systematically keep the same interfaces, as it often eventually save issues dowsntream, but true that here, we can indeed also skip it, and simply just remove psws data member from Histo class.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we can indeed also skip it, and simply just remove psws data member from Histo class.
Note that in https://github.com/cms-patatrack/pixeltrack-standalone/pull/165/files#r565772291 I suggested to keep it and use it for the psum
argument of the multiBlockPrefixScan*()
. Then you could avoid this memory allocation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd still like to avoid this memory allocation (either by keeping Histo::psws
or trying dynamic shared memory allocation in multiBlockPrefixScanSecondStep
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd still like to avoid this memory allocation (either by keeping Histo::psws or trying dynamic shared memory allocation in multiBlockPrefixScanSecondStep.
Yes doing that now.
Basically, since pc
/ psum
are used internally only in PrefixScan
, I do not really see why we should store psum
inside Histo (even in place of psws
). Additionally, psws
was just a pointer to an uint32_t
, whereas psum
should be an array of num_items
elements.
That's why I have just removed psws
from Histo.
I thinkpsum
should be addressed internally in PrefixScan
, by indeed dynamic shared memory allocation, as for CUDA version.
I was thinking doing a prefixScan
related PR, also modifying the prefixScan call sites, but ok this point can also be addressed in this PR. Doing a commit ~now :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right, psws
could not be used for psum
because that is used as an array (whereas pc
is used as an atomic counter).
queue, | ||
alpaka::kernel::createTaskKernel<Acc1>( | ||
workDivWith1Block, multiBlockPrefixScanSecondStep<uint32_t>(), poff, poff, psum_d, num_items, nblocks)); | ||
alpaka::wait::wait(queue); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this wait()
really needed or could it be left for the caller to decide whether to block or not?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes that was a painful segfault to fix. Basically here, workdiv and kernels are defined in host function launchFinalize
scope.
If that function returns before the device work is completed, we run into segfaults / spurious results.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I presume a clean way would be to only have one host function instead of intricated host functions in which device work is defined.
But since here this is just a test where we do not care about perf, and want to keep the code similar to the CUDA versions, I just added those.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds strange. I would have naively expected alpaka::queue::enqueue()
etc to copy the work division. Or could it be the allocated buffer that must stay alive until the work finishes (that I believe)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By quick look through Alpaka code they seem to copy both the work division and the functor within the call alpaka::kernel::createTaskKernel()
call. This would need to be investigated further (because I'd really want these wait()
s to go), but that can be done after this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds strange. I would have naively expected alpaka::queue::enqueue() etc to copy the work division. Or could it be the allocated buffer that must stay alive until the work finishes (that I believe)?
Yes, so looking at the execution task class (here for example with GPU CUDA backend):
http://alpaka-group.github.io/alpaka/TaskKernelGpuUniformCudaHipRt_8hpp_source.html#l00134
- the kernel function object and its arguments, are move / copy constructed and stored as TaskKernel data members.
- the workdiv is also move / copy constructed, and stored as data member of the base class (TaskKernel is derived from WorkDiv). http://alpaka-group.github.io/alpaka/WorkDivMembers_8hpp_source.html#l00052
Now, a raw pointer is passed as argument (poff
), and was obtained from the Alpaka-equivalent to the get() (getPtrNative) from a reference-counting buffer handle.
So, if the handle is reference-counted to 0 before the device work finishes, we end up with a dangling pointer, and we are still screwed.
However, that is not the case, because the owning pointer is h_d
, and is defined out of the scope of this function.
So evth should be fine.
Trying to run without the alpaka::wait::wait(queue)
, I still sometimes get a segfault with ./HistoContainer_t.tbb
. It is a spurious segfault, which makes it hard to know whether sth fixes it or not. It seems to be somehow stemming from a synchronization issue indeed, but the issue is not compulsory (should not) be here.
Hence will remove the alpaka::wait::wait(queue)
.
|
||
alpaka::queue::enqueue(queue, | ||
alpaka::kernel::createTaskKernel<Acc1>(workDiv, fillFromVector(), h, nh, v, offsets)); | ||
alpaka::wait::wait(queue); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this wait()
really needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as before: end of host function scope.
|
||
alpaka::mem::view::copy(queue, v_d, v_buf, N); | ||
|
||
alpaka::mem::view::set(queue, h_d, 0, 1u); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see a corresponding memset in CUDA code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes true, I have added it because while using h_d
several times in a loop, we want to be sure that the next iteration starts with sth clean.
This was also causing segfault.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In practice, I would just declare h_d
inside the loop (and in general obviously just keep variable declarations as close as possible to when they are used).
But here I would avoid changing too much the code in addition to the port to Alpaka, to be able to keep the portability 'portable'.
|
||
alpaka::queue::enqueue( | ||
queue, | ||
alpaka::kernel::createTaskKernel<Acc1>(workDiv, setZeroBins(), alpaka::mem::view::getPtrNative(hist_dbuf))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In CUDA all of these are in a single kernel, separated by __syncthreads()
. Why are they split here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes here I have followed what was also done in the Kokkos version, since in this test did not care about perf.
Can also push a one-kernel version.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, you followed Kokkos version. I don't remember why we split the code there. Probably doesn't matter much (for a unit test).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For memo: In the end, I have made the OneHistoContainer test based on a single kernel version, to be closest as possible to CUDA test version.
alpaka::mem::view::copy(queue, v_dbuf, tr_hbuf, N); | ||
|
||
auto a_dbuf = alpaka::mem::buf::alloc<Assoc, Idx>(device, 1u); | ||
alpaka::mem::view::set(queue, a_dbuf, 0, 1u); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see a corresponding memset in CUDA code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes just ended up being cleaner.
|
||
uint32_t endElementIdx = endElementIdxNoStride[0u]; | ||
for (uint32_t threadIdx = firstElementIdxNoStride[0u]; threadIdx < nt; threadIdx += gridDimension) { | ||
for (uint32_t i = threadIdx; i < std::min(endElementIdx, nt); ++i) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see this pattern
const auto gridDim = ...;
const auto [firstNoStride, endNoStride] = ...;
for (auto threadIdx = firstNoStride, endElementIdx = endNoStride; threadIdx < MAX; threadIdx += gridDimension, endElementIdx += gridDimension) {
for (auto i = threadIdx; i < std::min(endElementIdx, MAX); ++i ) {
<body>
}
}
repeats in almost(?) every kernel in this PR. That's 4+ lines of repetitive and error prone code. I'm wondering if it would be worth to abstract that along
// the name should be more descriptive...
template <typename T_Acc, typename N, typename Func>
void for_each_element(const T_Acc& acc, const N nitems, Func func) {
const auto gridDim = ...;
const auto [firstNoStride, endNoStride] = ...;
for (auto threadIdx = firstNoStride, endElementIdx = endNoStride; threadIdx < nitems; threadIdx += gridDimension, endElementIdx += gridDimension) {
for (auto i = threadIdx; i < std::min(endElementIdx, nitems); ++i ) {
func(i);
}
}
}
that could be called here along
const uint32_t nt = offsets[nh];
for_each_element(acc, nt, [&](uint32_t i) {
auto off = alpaka_std::upper_bound(offsets, offsets + nh + 1, i);
...
}
? I know this starts to look like we would be building our own abstraction layer on top of Alpaka, but to me the boilerplace calls for something.
Written that I'm fine if this is left for a future PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes absolutely, was thinking marking this as a to-do comment for this PR.
Maybe in an additional PR indeed is better.
…All workdiv / function object / arguments info are copied anyway, and the owning pointer to the histogram is defined outside the function scope
…pend ALPAKA_ACCELERATOR_NAMESPACE when needed. Could also place entire callers within ALPAKA_ACCELERATOR_NAMESPACE namespace.
Ok just removed the |
As a memo 2 points will be addressed outside of this PR:
|
@@ -8,10 +8,10 @@ | |||
#include "AlpakaCore/alpakaWorkDivHelper.h" | |||
#include "AlpakaCore/HistoContainer.h" | |||
|
|||
using namespace ALPAKA_ACCELERATOR_NAMESPACE; | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, but actually here (and in other unit test files) the using namespace ...
is ok. It's just in headers where it causes problems (although it does make the namespace of the various types more clear in source files as well).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes but at this stage, found it more consistent to remove it everywhere, rather than having a mix of 'using namespace' and ALPAKA_ACCELERATOR_NAMESPACE::
Good question. In the Kokkos port (which has a similar structure) the aim was to either put functions in the |
I think the template approach would actually become feasible (with look&feel close to Kokkos version) if we'd have only one Accelerator type for each backend (#144). |
Yes usually I would not have used 'using namespace', but here the fact that the namespace is acc-dependent feels a bit weird.
Ok. Or why not templating on each possible Acc type? |
…g to call element_global_index_range for each possible max number of elements.
…elper function in cms::alpakatools, but this is already nicer.
…the number of blocks used in a prefix scan.
…which already makes things clearer. NB: TO DO: add a dedicated helper function.
Ok for the OneHistoContainer test, just added the 1-kernel version instead, as it makes it easier to debug and compare with CUDA tests. |
The In the original idea (and what happens in Kokkos version) the namespace ALPAKA_ACCELERATOR_NAMESPACE {
namespace cms::alpakatools {
}
} (which we could do here too)
I'm not really sure what you mean but I'll guess anyway. The annoyance stems from the function deciding the dimension of an index, and the caller of the function not having to know that. One option would be for the caller to give all the Another option that came to my mind would be to define our own "tag type" for each accelerator, and then a traits class template that could be used to get all the |
ALPAKA_FN_ACC std::pair<Vec<T_Dim>, Vec<T_Dim>> element_global_index_range(const T_Acc& acc, | ||
const Vec<T_Dim>& maxNumberOfElements) { | ||
template <typename T_Acc, typename T_Dim = alpaka::dim::Dim<T_Acc>> | ||
ALPAKA_FN_ACC std::pair<Vec<T_Dim>, Vec<T_Dim>> element_global_index_range_uncut(const T_Acc& acc) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would name this function as element_global_index_range
, and the other something along element_global_index_range_max
(preferring to tell what a function does rather than what it does not do). But a minor point for a prototype.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok no pb
yes was to insist on the risky version (the one which has not truncation with max), but yes the other way round can be more elegant. Will change it.
|
||
const int num_items = Histo::totbins(); | ||
|
||
auto psum_dBuf = alpaka::mem::buf::alloc<uint32_t, Idx>(device, Vec1::all(num_items)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd still like to avoid this memory allocation (either by keeping Histo::psws
or trying dynamic shared memory allocation in multiBlockPrefixScanSecondStep
.
…nstead of global memory. This changes the prefixSCan interface (closer to CUDA version), hence the call sites. Important: To be noted is that in any case, the amount of memory needed was not num_items * sizeof(T), only num_blocks * sizeof(T) is sufficient.
…and element_global_index_range_truncated to compute range truncated by max number of elements of interest.
Yes you are right.
Yes just ended up having |
template <typename Histo> | ||
ALPAKA_FN_HOST ALPAKA_FN_INLINE __attribute__((always_inline)) void launchFinalize( | ||
Histo *__restrict__ h, | ||
const ALPAKA_ACCELERATOR_NAMESPACE::DevAcc1 &device, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After removing the memory allocation, the device
parameter is not needed anymore
const ALPAKA_ACCELERATOR_NAMESPACE::DevAcc1 &device, |
uint32_t const *__restrict__ offsets, | ||
uint32_t totSize, | ||
unsigned int nthreads, | ||
const ALPAKA_ACCELERATOR_NAMESPACE::DevAcc1 &device, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
device
parameter can be removed from here as well.
const ALPAKA_ACCELERATOR_NAMESPACE::DevAcc1 &device, |
alpaka::queue::enqueue(queue, | ||
alpaka::kernel::createTaskKernel<ALPAKA_ACCELERATOR_NAMESPACE::Acc1>( | ||
workDiv, countFromVector(), h, nh, v, offsets)); | ||
launchFinalize(h, device, queue); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Following removal of device
.
launchFinalize(h, device, queue); | |
launchFinalize(h, queue); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ha yes true, this argument is not necessary anymore now, thanks. I just removed it.
Run smoothly and results are consistent with 'pure CUDA' versions.